## Problem 1

**Dataset:** `as_datasets/eyecolorgenderdata.csv`

**Dataset description:** A dataset containing information from college students: gender, age, year in college, eye color, height in inches, miles driven per week, number of brothers, number of sisters, average hours of computer time per week, whether the student exercises regularly, average hours of exercise per week, number of music CDs owned, hours of gaming per week, and hours of TV per week.

**Write and discuss the steps to answering the following research question:** Subsample the full dataset using the last two digits of your CIN, floor-divided by 4. Is a male college student equally likely to have blue, brown, green, or hazel eyes?
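One possible way to approach this question is a chi-square goodness-of-fit test against equal expected counts. The sketch below is only an illustration: the eye-color values are made up, the helper names `subsample_key` and `chi_square_uniform` are hypothetical, and how the resulting key is applied to draw the subsample is left to the assignment's own instructions. The constant 7.815 is the chi-square critical value for 3 degrees of freedom at a 0.05 significance level.

```python
from collections import Counter

# Chi-square critical value for df = 3 (four categories) at alpha = 0.05.
CHI2_CRIT_DF3_ALPHA05 = 7.815


def subsample_key(cin: int) -> int:
    """Return the last two digits of the CIN, floor-divided by 4."""
    return (cin % 100) // 4


def chi_square_uniform(observed_counts):
    """Chi-square goodness-of-fit statistic against equal expected counts."""
    total = sum(observed_counts)
    expected = total / len(observed_counts)
    return sum((o - expected) ** 2 / expected for o in observed_counts)


# Made-up eye colors for the male students in a subsample (illustration only).
male_eye_colors = ["blue"] * 30 + ["brown"] * 40 + ["green"] * 10 + ["hazel"] * 20

counts = Counter(male_eye_colors)
observed = [counts[color] for color in ("blue", "brown", "green", "hazel")]

statistic = chi_square_uniform(observed)
# Reject the "equal chance" hypothesis when the statistic exceeds the critical value.
reject_equal_chance = statistic > CHI2_CRIT_DF3_ALPHA05

print(subsample_key(123456787), statistic, reject_equal_chance)
```

With real data, the `observed` list would come from the eye-color counts of the male students in your own subsample, and the discussion should also check that every expected count is large enough for the test to be reasonable.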

## Problem 2

**Dataset:** `as_datasets/eyecolorgenderdata.csv`

**Dataset description:** A dataset containing information from college students: gender, age, year in college, eye color, height in inches, miles driven per week, number of brothers, number of sisters, average hours of computer time per week, whether the student exercises regularly, average hours of exercise per week, number of music CDs owned, hours of gaming per week, and hours of TV per week.

**Write and discuss the steps to answering the following research question:** Subsample the full dataset using the last two digits of your CIN, floor-divided by 4. Are students who drive more miles per week more likely to exercise fewer hours per week?
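A question of this kind is often framed as a test for a negative correlation between the two variables. The following is a minimal sketch, assuming a Pearson correlation is an appropriate measure; the numbers are made up and the function names are hypothetical. It computes the correlation coefficient and the associated t statistic, $t = r\sqrt{(n-2)/(1-r^2)}$, which would then be compared against a t distribution with $n-2$ degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up values for six students (illustration only).
miles_per_week = [10, 50, 100, 150, 200, 300]
exercise_hours = [6, 5, 5, 3, 2, 1]

r = pearson_r(miles_per_week, exercise_hours)
t = t_statistic(r, len(miles_per_week))

# A negative r (and t) points toward "more driving, less exercise".
print(r, t)
```

If the relationship looks monotonic but not linear, or the data contain strong outliers, a rank-based measure such as Spearman's correlation may be a better choice and is worth discussing.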

## Problem 3

**Dataset:** `as_datasets/eyecolorgenderdata.csv`

**Dataset description:** A dataset containing information from college students: gender, age, year in college, eye color, height in inches, miles driven per week, number of brothers, number of sisters, average hours of computer time per week, whether the student exercises regularly, average hours of exercise per week, number of music CDs owned, hours of gaming per week, and hours of TV per week.

**Write and discuss the steps to answering the following research question:** Subsample the full dataset using the last two digits of your CIN, floor-divided by 4. Which attributes in your subsampled dataset best estimate the average number of hours of exercise per week? Is it plausible that these attributes could estimate the number of hours of exercise per week?

## Problem 4

**Dataset:** `as_datasets/eyecolorgenderdata.csv`

**Dataset description:** A dataset containing information from college students: gender, age, year in college, eye color, height in inches, miles driven per week, number of brothers, number of sisters, average hours of computer time per week, whether the student exercises regularly, average hours of exercise per week, number of music CDs owned, hours of gaming per week, and hours of TV per week.

**Write and discuss the steps to answering the following research problem:** Using the full dataset, build a regression model that estimates the number of exercise hours of a college student. Which features work best? Which machine learning algorithm produces the most accurate results without overfitting? Justify the machine learning algorithm you chose.

## Problem 5

**Dataset:** `ml_datasets/building_energy_efficiency.csv`

(Dataset creators: Angeliki Xifara and Athanasios Tsanas)

**Dataset description:** A dataset containing an energy analysis of 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. Various settings are simulated as functions of the aforementioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, with the aim of predicting two real-valued responses.

The dataset contains eight attributes (or features, denoted by X1…X8) and two responses (or outcomes, denoted by y1 and y2):

- X1 Relative Compactness
- X2 Surface Area
- X3 Wall Area
- X4 Roof Area
- X5 Overall Height
- X6 Orientation
- X7 Glazing Area
- X8 Glazing Area Distribution
- y1 Heating Load
- y2 Cooling Load

**Write and discuss the steps to answering the following research problem:** Build a regression model to predict the heating load. Which features work best? Which machine learning algorithm produces the most accurate results without overfitting? Justify the machine learning algorithm you chose.
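One way to compare algorithms while watching for overfitting is k-fold cross-validation that reports both training and validation scores; a large gap between the two suggests overfitting. The sketch below uses scikit-learn's `cross_validate` with two candidate models chosen purely for illustration. The data are synthetic stand-ins for the eight features and the heating load, and `compare_models` is a hypothetical helper, not a prescribed solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def compare_models(X, y, cv=5):
    """Return mean train and validation R^2 for each candidate model."""
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    }
    results = {}
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=cv, scoring="r2", return_train_score=True
        )
        results[name] = {
            "train_r2": float(scores["train_score"].mean()),
            "val_r2": float(scores["test_score"].mean()),
        }
    return results


# Synthetic stand-in data: 200 rows, 8 features, a noisy linear target.
# With the real file, X would hold X1..X8 and y the heating load (y1).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

results = compare_models(X, y)
for name, r2 in results.items():
    gap = r2["train_r2"] - r2["val_r2"]
    print(f"{name}: train={r2['train_r2']:.3f} val={r2['val_r2']:.3f} gap={gap:.3f}")
```

Passing an integer `cv` splits the rows in file order without shuffling, so if the rows of the real dataset are grouped by building shape it may be worth passing a shuffled `KFold` splitter instead and explaining that choice in the write-up.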