# Stat 202: Lecture 2 (covers pp. 22-33)

Nathan VanHoudnos
9/26/2014

### Agenda

1. Homework #1 comments
2. Checkpoint #1 results
3. Lecture 2 (covers pp. 22-33)

### Homework #1 comments

• Having laptop problems? Go to Laptop ER:
• NU IT is installing R & RStudio in the library
• This will not be ready until after HW 1 is due.
• If you can't get them installed on your machine, then borrow a friend's machine for this assignment. (Do your own assignment!)

### Homework #1 comments

• If the *.Rmd file does not open when you double-click on it, then open it directly in RStudio. (Demo)

• Follow the directions on the homework.

• Steps 1-4: Make sure that RStudio is working
• Step 7: Saves you time:

Step 7: After every question, click the Knit HTML button to make sure that the output is as expected.

### Homework #1 comments

• Asking me questions is encouraged. I will respond!

• However, it is disrespectful to waste my time.

• Do not ask me questions that you can answer yourself with a little bit of effort.
• “Thanks, I already figured it out.” implies that you did not need to ask the question!

### Checkpoint #1 results

• The checkpoints are quizzes; they are to be your work alone.
• I will update the syllabus to clarify this.

Notes

• 68 of 75 students have signed up on OLI
• 64 of 68 took the Checkpoint
• 1 student took Checkpoint #2 instead
• Average percent correct: 90%
• If you have questions, see me or Aaron

### Motivation for today:

These two distributions have …

• the same shape
• the same center

but different spread. Measures of spread:

• Range
• Inter-quartile range (IQR)
• Standard deviation

### Measures of spread

Age at Oscar

2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0

• Range:
• Max - Min
• 80 - 21 = 59
• All data in range
• Inter-Quartile Range
• 50% of data $$\le$$ median
• 25% of data $$\le$$ Q1
• 75% of data $$\le$$ Q3
• IQR = Q3 - Q1
• 50% of data in IQR
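
A minimal sketch of the range calculation in R. The vector name `ages` is hypothetical: its values are read off the stem plot above, so they appear in sorted order rather than by year (the lecture's own data live in `oscars.df$age`).

```r
# Ages read off the stem plot (hypothetical vector; the lecture uses oscars.df$age)
ages <- c(21,
          25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34,
          35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43,
          45, 49, 49,
          61, 74, 80)

# Range = Max - Min
age_range <- max(ages) - min(ages)
age_range    # 59

# range() returns the minimum and maximum together
range(ages)  # 21 80

# Standard deviation, the third measure of spread
age_sd <- sd(ages)
```

Every observation lies between the two values returned by `range()`, which is why the range is so sensitive to a single extreme age such as 80.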

### IQR by hand example

Lowest

2 | 1
2 | 56669
3 | 013333444
3 | 5


Highest

3 | 55789
4 | 11123599
5 |
6 | 1
7 | 4
8 | 0

• Q1 $$\approx$$ median of lowest half
• 32
• Q3 $$\approx$$ median of highest half
• 41.5
• IQR $$\approx$$ 41.5 - 32
• 9.5 years
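
The by-hand steps above can be sketched in R as follows. This is only an illustration under stated assumptions: `ages` is a hypothetical vector holding the values read off the stem plot, and the split into halves assumes an even number of observations (32 here).

```r
# Ages read off the stem plot (hypothetical vector; the lecture uses oscars.df$age)
ages <- c(21,
          25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34,
          35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43,
          45, 49, 49,
          61, 74, 80)

ages <- sort(ages)
n    <- length(ages)   # 32 observations, an even count
half <- n / 2

lowest  <- ages[1:half]         # lowest 16 values
highest <- ages[(half + 1):n]   # highest 16 values

q1 <- median(lowest)    # 32   (average of 31 and 33)
q3 <- median(highest)   # 41.5 (average of 41 and 42)

iqr_by_hand <- q3 - q1  # 9.5 years
```

With an odd number of observations the halves need a convention for the middle value, which this sketch does not cover.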

### IQR by R example

head(oscars.df, n=3) ## First three records

  year age
1 1970  34
2 1971  34
3 1972  26

summary(oscars.df$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   21.0    32.5    35.0    38.5    41.2    80.0

However: 41.2 - 32.5 = 8.7 $$\ne$$ 9.5

### R and "by hand" disagree!

This is earth-shattering.

Is R wrong? Or am I teaching you lies?

Answer: Neither.

• R implements nine methods (including “by hand”) to calculate quartiles
• Some methods are better estimators than others
• The default of summary() is among the best
• The “by hand” method is among the worst

### Q1, Q3, and IQR are a little tricky

OLI textbook, p. 24, emphasis mine:

Note that Q1 and Q3 as reported by the various software packages differ from each other and are also slightly different from the ones we found [by hand]. This should not worry you.

It should pique your intellectual curiosity.

For this course, I will ask you to use

• the “by hand” method for tests & quizzes, and
• the R summary method for homeworks.

### Using IQR: Outliers

Outliers are suspected if an observation is

• below $$\left( \text{Q1} - 1.5 \times \text{IQR}\right)$$ or
• above $$\left( \text{Q3} + 1.5 \times \text{IQR}\right)$$

### Example: 1.5 IQR rule

Age at Oscar

2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0

Recall

• Q1 = 32 and Q3 = 41.5

Which data points, if any, are suspected outliers?

• IQR = 41.5 - 32 = 9.5
• 1.5 IQR = 14.25
• Q1 - 1.5 IQR = 17.75
• Q3 + 1.5 IQR = 55.75
• suspect 61, 74, and 80

### Handling outliers: It depends

Roughly three types of outliers:

1. Same data-generating process
2. Different data-generating process
3. Mistakes

### Outlier from same process

Earthquakes

• the largest quakes can be the most important!
• removing such a data point might be a terrible idea

### Outlier from different process

Stock Prices

• Highly-publicized anti-smoking congressional hearings
• Large negative pressure for stock price
• Can remove outlier if interested in typical returns

### Outlier from mistake

Archaeology

• Likely that '.394' was really '.094'
• Often useful to remove outlier

### Boxplots: 5 number summary

Age at Oscar

2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0

5 number summary

Min.   Q1  Median    Q3  Max.
  21   32      35  41.5    80

Boxplot

### Boxplots in R

summary(oscars.df$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   21.0    32.5    35.0    38.5    41.2    80.0

boxplot(oscars.df$age)
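
As a rough check on the numbers in this lecture, the sketch below compares R's default quartiles with the by-hand values and then applies the 1.5 IQR rule. The vector `ages` is a hypothetical stand-in for `oscars.df$age`, with values read off the stem plot. Using `fivenum()` for the by-hand quartiles is an assumption that holds for this data set; it is not claimed to match the by-hand method in general.

```r
# Ages read off the stem plot (hypothetical vector; the lecture uses oscars.df$age)
ages <- c(21,
          25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34,
          35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43,
          45, 49, 49,
          61, 74, 80)

# Default quartiles (type = 7), the method behind summary()
q_default <- quantile(ages, c(0.25, 0.75), names = FALSE)
q_default   # 32.50 41.25 (summary() displays the second value as 41.2)

# Tukey's five-number summary; for these data it equals the by-hand values
five <- fivenum(ages)
five        # 21.0 32.0 35.0 41.5 80.0

# 1.5 IQR rule with the by-hand quartiles
q1  <- five[2]
q3  <- five[4]
iqr <- q3 - q1                   # 9.5
lower_fence <- q1 - 1.5 * iqr    # 17.75
upper_fence <- q3 + 1.5 * iqr    # 55.75

suspects <- ages[ages < lower_fence | ages > upper_fence]
suspects    # 61 74 80

# boxplot.stats() applies the same rule to decide which points a boxplot
# draws individually
flagged <- boxplot.stats(ages)$out
flagged     # 61 74 80
```

For these data, `quantile(ages, c(0.25, 0.75), type = 2)` also returns 32 and 41.5, which illustrates how the choice among the nine methods changes the quartiles.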


### BRFSS Boxplots

Height in inches

What is the height of

• the tallest person?
• 93 in. (7' 9")
• the shortest person?
• 48 in. (4')

I am taller than ¾ of people. How tall am I?

• Q3 = 70 in. (5' 10")

### BRFSS Boxplots

Age

Is this distribution symmetric, right-skewed, or left-skewed?

• right-skewed

Approximately what percentage of people are under 60?

• A little over 75%

### Side-by-side Boxplots

Age

Age by General Health

### Side-by-side Boxplots

Age by General Health

Compare and contrast

• Shape
• Excellent: right-skewed
• Poor: symmetric
• Center
• Excellent: typically 40 years
• Poor: typically 60 years

### Side-by-side Boxplots

Age by General Health

Compare and contrast

• Both groups have similar IQR
• Nearly 75% of people in excellent health are younger than the youngest 25% of people in poor health.

### Side-by-side Boxplots

Age by General Health

Compare and contrast

• Deviations from pattern
• The oldest people rate themselves in excellent health.
• These outliers are likely to occur again and should be kept.

Weight by gender

Desired Weight