Stat 202: Lecture 2 (covers pp. 22-33)

Nathan VanHoudnos
9/26/2014

Agenda

  1. Homework #1 comments
  2. Checkpoint #1 results
  3. Lecture 2 (covers pp. 22-33)

Homework #1 comments

  • Having laptop problems? Go to Laptop ER:
  • NU IT is installing R & RStudio in the library
    • This will not be ready until after HW 1 is due.
    • If you can't get them installed on your machine, then borrow a friend's machine for this assignment. (Do your own assignment!)

Homework #1 comments

  • If the *.Rmd file does not open when you double-click on it, then open it directly in RStudio. (Demo)

  • Follow the directions on the homework.

    • Steps 1-4: Make sure that RStudio is working
    • Step 7: Saves you time:

Step 7: After every question, click the Knit HTML button to make sure that the output is as expected.

Homework #1 comments

  • Asking me questions is encouraged. I will respond!

  • However, it is disrespectful to waste my time.

    • Do not ask me questions that you can answer yourself with a little bit of effort.
    • “Thanks, I already figured it out.” implies that you did not need to ask the question!

Checkpoint #1 results

Academic integrity

  • The checkpoints are quizzes; they are to be your work alone.
  • I will update the syllabus to clarify this.

Notes

  • 68 of 75 students have signed up on OLI
  • 64 of 68 took the Checkpoint
    • 1 student took Checkpoint #2 instead
  • Average percent correct: 90%
  • If you have questions, see me or Aaron

Motivation for today:

[Plot: two distributions]

These two distributions have …

  • the same shape
  • the same center
  • different spreads
    • Range
    • Inter-quartile range (IQR)
    • Standard deviation

Measures of spread

Age at Oscar

  2 | 1
  2 | 56669
  3 | 013333444
  3 | 555789
  4 | 11123
  4 | 599
  5 | 
  5 | 
  6 | 1
  6 | 
  7 | 4
  7 | 
  8 | 0
  • Range:
    • Max - Min
    • 80 - 21 = 59
    • All data in range
  • Inter-Quartile Range
    • 50% of data \( \le \) median
    • 25% of data \( \le \) Q1
    • 75% of data \( \le \) Q3
    • IQR = Q3 - Q1
    • 50% of data in IQR
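
A quick sketch of the range calculation in R. The `ages` vector here is typed in from the stem-and-leaf plot above, so it is sorted rather than in year order; it is assumed to hold the same 32 values as `oscars.df$age`.

```r
# Ages read off the stem-and-leaf plot (an assumption: same values as oscars.df$age)
ages <- c(21,
          25, 26, 26, 26, 29,
          30, 31, 33, 33, 33, 33, 34, 34, 34,
          35, 35, 35, 37, 38, 39,
          41, 41, 41, 42, 43,
          45, 49, 49,
          61, 74, 80)

# Range as a single number: Max - Min
spread <- max(ages) - min(ages)
spread        # 59

# Careful: R's range() returns the pair (min, max), not the difference
range(ages)   # 21 80
```

Note that `diff(range(ages))` gives the same single number as `max(ages) - min(ages)`.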

IQR by hand example

Lowest

  2 | 1
  2 | 56669
  3 | 013333444
  3 | 5

Highest

  3 | 55789
  4 | 11123599
  5 | 
  6 | 1
  7 | 4
  8 | 0
  • Q1 \( \approx \) median of lowest half
    • 32
  • Q3 \( \approx \) median of highest half
    • 41.5
  • IQR \( \approx \) 41.5 - 32
    • 9.5 years
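
The "by hand" steps above can be sketched in R roughly as follows. This is only an illustration: `ages` is typed in from the stem-and-leaf plot, and the split into halves assumes an even number of observations (with an odd count, conventions differ on what to do with the middle value).

```r
# Ages read off the stem-and-leaf plot (an assumption: same values as oscars.df$age)
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

ages <- sort(ages)
n <- length(ages)                    # 32, an even count

lowest  <- ages[1:(n / 2)]           # lowest half
highest <- ages[(n / 2 + 1):n]       # highest half

q1 <- median(lowest)                 # 32
q3 <- median(highest)                # 41.5
iqr_by_hand <- q3 - q1               # 9.5
iqr_by_hand
```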

IQR by R example

head(oscars.df, n=3) ## First three records
  year age
1 1970  34
2 1971  34
3 1972  26
summary(oscars.df$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   21.0    32.5    35.0    38.5    41.2    80.0 

However: 41.2 - 32.5 = 8.7 \( \ne \) 9.5

R and "by hand" disagree!

This is earth-shattering.

Is R wrong? Or am I teaching you lies?

Answer: Neither.

  • R implements nine methods (including “by hand”) to calculate quartiles

  • Some methods are better estimators than others

  • The default of summary() is among the best

  • The “by hand” method is among the worst
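
To see the nine methods side by side, here is a small sketch using the `type` argument of `quantile()`. The `ages` vector is typed in from the stem-and-leaf plot and is assumed to match `oscars.df$age`. Type 7 is the default used by `quantile()` and `summary()`; for these particular data, type 2 happens to reproduce the "by hand" values, which need not hold for every data set.

```r
# Ages read off the stem-and-leaf plot (an assumption: same values as oscars.df$age)
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# One column per quantile type, rows are Q1 and Q3
quartiles_by_type <- sapply(1:9, function(type) {
  quantile(ages, probs = c(0.25, 0.75), type = type, names = FALSE)
})
rownames(quartiles_by_type) <- c("Q1", "Q3")
colnames(quartiles_by_type) <- paste0("type", 1:9)

quartiles_by_type

quartiles_by_type[, "type7"]   # 32.5 and 41.25, what summary() reports (rounded)
quartiles_by_type[, "type2"]   # 32 and 41.5, matching "by hand" on these data
```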

Q1, Q3, and IQR are a little tricky

OLI textbook, p. 24, emphasis mine:

Note that Q1 and Q3 as reported by the various software packages differ from each other and are also slightly different from the ones we found [by hand]. This should not worry you.

It should pique your intellectual curiosity.

For this course, I will ask you to use

  • the “by hand” method for tests & quizzes, and
  • the R summary method for homeworks.

Using IQR: Outliers

Outliers are suspected if an observation is

  • below \( \left( \text{Q1} - 1.5 \times \text{IQR}\right) \) or
  • above \( \left( \text{Q3} + 1.5 \times \text{IQR}\right) \)

Example: 1.5 IQR rule

Age at Oscar

  2 | 1
  2 | 56669
  3 | 013333444
  3 | 555789
  4 | 11123
  4 | 599
  5 | 
  5 | 
  6 | 1
  6 | 
  7 | 4
  7 | 
  8 | 0

Recall

  • Q1 = 32 and Q3 = 41.5

Which data points, if any, are suspected outliers?

  • IQR = 41.5 - 32 = 9.5
  • 1.5 IQR = 14.25
  • Q1 - 1.5 IQR = 17.75
  • Q3 + 1.5 IQR = 55.75
  • suspect 61, 74, and 80
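
The same arithmetic can be checked in R. This sketch plugs in the "by hand" quartiles from the slide; `ages` is typed in from the stem-and-leaf plot and is assumed to match `oscars.df$age`.

```r
# Ages read off the stem-and-leaf plot (an assumption: same values as oscars.df$age)
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

q1 <- 32      # "by hand" Q1
q3 <- 41.5    # "by hand" Q3
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr   # 17.75
upper_fence <- q3 + 1.5 * iqr   # 55.75

# Observations outside the fences are suspected outliers
suspects <- ages[ages < lower_fence | ages > upper_fence]
suspects                        # 61 74 80
```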

Handling outliers: It depends

Roughly three types of outliers:

  1. Same data-generating process
  2. Different data-generating process
  3. Mistakes

Outlier from same process

Earthquakes

  • the largest quakes can be the most important!
  • removing such a data point might be a terrible idea

Outlier from different process

Stock Prices

  • Highly-publicized anti-smoking congressional hearings
  • Large negative pressure for stock price
  • Can remove outlier if interested in typical returns

Outlier from mistake

Archaeology

  • Likely that '.394' was really '.094'

  • Often useful to remove outlier

Boxplots: 5 number summary

Age at Oscar

  2 | 1
  2 | 56669
  3 | 013333444
  3 | 555789
  4 | 11123
  4 | 599
  5 | 
  5 | 
  6 | 1
  6 | 
  7 | 4
  7 | 
  8 | 0

5 number summary

Min. Q1  Median Q3   Max.
21   32  35     41.5 80
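
As a cross-check, base R's `fivenum()` returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). For these data the hinges coincide with the "by hand" Q1 and Q3; that is a property of this data set, not a guarantee in general. The `ages` vector is typed in from the stem-and-leaf plot and is assumed to match `oscars.df$age`.

```r
# Ages read off the stem-and-leaf plot (an assumption: same values as oscars.df$age)
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

five <- fivenum(ages)
names(five) <- c("Min.", "Q1", "Median", "Q3", "Max.")
five   # 21 32 35 41.5 80
```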

Boxplot

Boxplots in R

summary(oscars.df$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   21.0    32.5    35.0    38.5    41.2    80.0 
boxplot(oscars.df$age)

[Plot: boxplot of oscars.df$age]
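
To see the numbers behind the picture without drawing it, `boxplot.stats()` can be used. As far as I know from the R documentation, `boxplot()` places the box at the hinges (as in `fivenum()`) rather than at the quartiles printed by `summary()`, and flags points more than 1.5 box-lengths beyond the box. The `ages` vector is typed in from the stem-and-leaf plot and is assumed to match `oscars.df$age`.

```r
# Ages read off the stem-and-leaf plot (an assumption: same values as oscars.df$age)
ages <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
          35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

bs <- boxplot.stats(ages)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$stats   # 21 32 35 41.5 49

# Points drawn individually beyond the whiskers (suspected outliers)
bs$out     # 61 74 80
```

The upper whisker stops at 49 because that is the largest age still inside the upper fence of 55.75.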

BRFSS Boxplots

[Plot: boxplot of height in inches]

What is the height of

  • the tallest person?
    • 93 in. (7' 9")
  • the shortest person?
    • 48 in. (4')

I am taller than ¾ of people. How tall am I?

  • Q3 = 70 in. (5' 10")

BRFSS Boxplots

[Plot: boxplot of age]

Is this distribution symmetric, right-skewed, or left-skewed?

  • right-skewed

Approximately what percentage of people are under 60?

  • A little over 75%

Side-by-side Boxplots

[Plot: boxplot of age]

[Plot: side-by-side boxplots of age by general health]

Side-by-side Boxplots

[Plot: side-by-side boxplots of age by general health]

Compare and contrast

  • Shape
    • Excellent: right-skewed
    • Poor: symmetric
  • Center
    • Excellent: typically 40 years
    • Poor: typically 60 years
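
For reference, side-by-side boxplots like these can be drawn with the formula interface of `boxplot()`. The sketch below uses a tiny made-up data frame purely to show the syntax; `survey.df`, its column names, and its values are hypothetical and are NOT the BRFSS data.

```r
# Hypothetical toy data for illustration only (not the BRFSS survey)
survey.df <- data.frame(
  age    = c(25, 31, 38, 40, 47, 62,    # made-up "excellent" group
             45, 55, 60, 63, 68, 80),   # made-up "poor" group
  health = factor(rep(c("excellent", "poor"), each = 6))
)

# One box per level of health, drawn on a common age axis
bp <- boxplot(age ~ health, data = survey.df, ylab = "Age")

# Compare centers numerically: median age within each group
group_medians <- tapply(survey.df$age, survey.df$health, median)
group_medians
```

Putting the groups on a shared axis is what makes shape, center, and spread directly comparable.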

Side-by-side Boxplots

[Plot: side-by-side boxplots of age by general health]

Compare and contrast

  • Spread
    • Both groups have similar IQR
    • Nearly 75% of people in excellent health are younger than all but the youngest 25% of people in poor health.

Side-by-side Boxplots

[Plot: side-by-side boxplots of age by general health]

Compare and contrast

  • Deviations from pattern
    • The oldest people rate themselves in excellent health.
    • These outliers are likely to occur again and should be kept.

Tell me a story...

[Plot: weight by gender]

[Plot: desired weight]
