Nathan VanHoudnos

9/26/2014

- Homework #1 comments
- Checkpoint #1 results
- Lecture 2 (covers pp. 22-33)

- Having laptop problems? Go to Laptop ER:
  - 1:00 - 4:00 p.m. **today**
  - at Norris University Center, across from the University bookstore
  - first come, first served
  - info: http://www.it.northwestern.edu/laptoper/

- NU IT is installing R & RStudio in the library
  - This will **not** be ready until after HW 1 is due.
  - If you can't get them installed on your machine, then **borrow** a friend's machine for this assignment. (**Do your own assignment!**)

If the *.Rmd file does not open when you double-click on it, then **open** it directly in RStudio. (Demo)

Follow the directions on the homework.

- Steps 1-4: Make sure that RStudio is working
- Step 7: Saves you time:

Step 7: After every question, click the Knit HTML button to make sure that the output is as expected.

Asking me questions is encouraged. I will respond!

However, **it is disrespectful to waste my time.**

- Do not ask me questions that you can answer yourself with a little bit of effort.
- *“Thanks, I already figured it out.”* implies that you did not need to ask the question!

**Academic integrity**

- The checkpoints are **quizzes**; they are to be your work **alone**.
- I will update the syllabus to clarify.

**Notes**

- 68 of 75 students have signed up on OLI
- 64 of 68 took the Checkpoint
- 1 student took Checkpoint #2 instead

- Average percent correct: **90%**
- If you have questions, see me or Aaron

These two distributions have …

- the same shape
- the same center
- **different spreads**
  - Range
  - Inter-quartile range (IQR)
  - Standard deviation

**Age at Oscar**

```
2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0
```

**Range**

- Max - Min
- 80 - 21 = 59
- *All* data in range

**Inter-Quartile Range**

- 50% of data \( \le \) **median**
- 25% of data \( \le \) **Q1**
- 75% of data \( \le \) **Q3**
- **IQR** = Q3 - Q1
- *50%* of data in IQR
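The range arithmetic can be checked in R. This is a minimal sketch: the vector `ages` is a name introduced here for illustration, with values read off the stemplot above rather than taken from the lecture's data file.

```r
# Ages read off the "Age at Oscar" stemplot (an assumption: the
# lecture's own data frame is not reproduced here).
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# Range = Max - Min
age_range <- max(ages) - min(ages)
age_range            # 59

# range() returns c(min, max); diff() subtracts them
diff(range(ages))    # 59
```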

**Lowest**

```
2 | 1
2 | 56669
3 | 013333444
3 | 5
```

**Highest**

```
3 | 55789
4 | 11123599
5 |
6 | 1
7 | 4
8 | 0
```

- Q1 \( \approx \) median of lowest half
- 32

- Q3 \( \approx \) median of highest half
- 41.5

- IQR \( \approx \) 41.5 - 32
- 9.5 years
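The by-hand calculation above can be sketched in R as follows. The vector `ages` is read off the stemplot and is an illustrative name, not the lecture's data frame. The even split used here assumes an even number of observations; with an odd count, conventions differ on what to do with the middle value.

```r
# Ages read off the "Age at Oscar" stemplot (illustrative vector).
ages <- sort(c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
               35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80))

n <- length(ages)                  # 32 observations, an even count

# Split the sorted data into a lowest half and a highest half
lowest  <- ages[1:(n / 2)]
highest <- ages[(n / 2 + 1):n]

# Q1 and Q3 are approximated by the medians of the two halves
q1 <- median(lowest)               # 32
q3 <- median(highest)              # 41.5

iqr_by_hand <- q3 - q1             # 9.5 years
iqr_by_hand
```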

```
head(oscars.df, n=3) ## First three records
```

```
  year age
1 1970  34
2 1971  34
3 1972  26
```

```
summary(oscars.df$age)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   21.0    32.5    35.0    38.5    41.2    80.0
```

**However**: 41.2 - 32.5 = 8.7 \( \ne \) 9.5

This is **earth-shattering**.

Is R wrong?

or

Am I teaching you lies?

Answer: **Neither.**

- R implements **nine** methods (including “by hand”) to calculate quartiles
- Some methods are better **estimators** than others
- The default of `summary()` is among the best
- The “by hand” method is among the worst
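To see the methods disagree, one can ask `quantile()` for the quartiles under each value of its `type` argument. This is a sketch on an illustrative vector `ages` read off the stemplot. Type 7 is the `quantile()` default, which is what `summary()` reports. On this particular data set, type 2 happens to reproduce the by-hand values, but that agreement may not hold for other sample sizes.

```r
# Ages read off the "Age at Oscar" stemplot (illustrative vector).
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# Q1 and Q3 under each of the nine quantile types: a 2 x 9 matrix
q_by_type <- sapply(1:9, function(k) {
  quantile(ages, probs = c(0.25, 0.75), type = k, names = FALSE)
})
dimnames(q_by_type) <- list(c("Q1", "Q3"), paste0("type", 1:9))
q_by_type

# The IQR implied by each method
iqr_by_type <- q_by_type["Q3", ] - q_by_type["Q1", ]
iqr_by_type

q_by_type[, "type7"]   # 32.5 and 41.25, as in summary()
q_by_type[, "type2"]   # 32 and 41.5, matching the by-hand values here
```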

OLI textbook, p. 24, emphasis mine:

Note that Q1 and Q3 as reported by the various software packages differ from each other and are also slightly different from the ones we found [by hand].

This should not worry you.

**It should pique your intellectual curiosity.**

For this course, I will ask you to use

- the “by hand” method for tests & quizzes, and
- the R `summary` method for homeworks.

Outliers are suspected if an observation is

- below \( \left( \text{Q1} - 1.5 \times \text{IQR}\right) \) or
- above \( \left( \text{Q3} + 1.5 \times \text{IQR}\right) \)

**Age at Oscar**

```
2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0
```

Recall

- Q1 = 32 and Q3 = 41.5

Which data points, if any, are suspected outliers?

- IQR = 41.5 - 32 = 9.5
- 1.5 IQR = 14.25
- Q1 - 1.5 IQR = 17.75
- Q3 + 1.5 IQR = 55.75
- suspect 61, 74, and 80
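The same check can be sketched in R, using the by-hand quartiles from the slide. The vector `ages` and the fence variable names are introduced here for illustration only.

```r
# Ages read off the "Age at Oscar" stemplot (illustrative vector).
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# By-hand quartiles from the lecture
q1 <- 32
q3 <- 41.5
iqr <- q3 - q1                      # 9.5

# Fences for the 1.5 x IQR rule
lower_fence <- q1 - 1.5 * iqr       # 17.75
upper_fence <- q3 + 1.5 * iqr       # 55.75

# Observations outside the fences are suspected outliers
suspects <- ages[ages < lower_fence | ages > upper_fence]
suspects                            # 61 74 80
```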

Roughly three types of outliers:

- Same data-generating process
- Different data-generating process
- Mistakes

**Earthquakes**

- the largest quakes can be the most important!
- removing such a data point might be a terrible idea

**Stock Prices**

- Highly publicized anti-smoking congressional hearings
- Large downward pressure on the stock price
- Can remove outlier **if** interested in typical returns

**Archaeology**

Likely that '.394' was really '.094'

Often **useful** to remove outlier

**Age at Oscar**

```
2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0
```

**5 number summary**

```
Min.   Q1   Median   Q3     Max.
21     32   35       41.5   80
```

**Boxplot**

```
summary(oscars.df$age)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   21.0    32.5    35.0    38.5    41.2    80.0
```

```
boxplot(oscars.df$age)
```
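The numbers behind the boxplot can be inspected without drawing it. The sketch below uses an illustrative vector `ages` read off the stemplot. `fivenum()` returns Tukey's five-number summary, and `boxplot.stats()` returns the values that `boxplot()` draws. For this data set the box edges match the by-hand Q1 and Q3, not the `summary()` quartiles.

```r
# Ages read off the "Age at Oscar" stemplot (illustrative vector).
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(ages)
five        # 21 32 35 41.5 80

# What boxplot() draws: whisker ends, box edges, and median in $stats,
# and the points plotted individually as suspected outliers in $out
bs <- boxplot.stats(ages)
bs$stats    # 21 32 35 41.5 49 (upper whisker stops at the largest non-outlier)
bs$out      # 61 74 80
```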

**Height in inches**

What is the height of

- the tallest person?
  - 93 in. (7' 9")
- the shortest person?
  - 48 in. (4')

I am taller than ¾ of people. How tall am I?

- Q3 = 70 in. (5' 10")

**Age**

Is this distribution symmetric, right-skewed, or left-skewed?

**right-skewed**

Approximately what percentage of people are under 60?

**A little over 75%**

**Age by General Health**

Compare and contrast

**Shape**

- Excellent: right-skewed
- Poor: symmetric

**Center**

- Excellent: typically 40 years
- Poor: typically 60 years

**Spread**

- Both groups have similar IQR
- Nearly 75% of people in excellent health are younger than the youngest 25% of people in poor health.

**Deviations from pattern**

- The oldest people rate themselves in excellent health.
- These outliers are **likely** to occur again and should be kept.

**Weight by gender**

**Desired Weight**