diff --git a/06_StatisticalInference/01_01_Introduction/index.Rmd b/06_StatisticalInference/01_01_Introduction/index.Rmd index 74e8b2a1e..3b5e98989 100644 --- a/06_StatisticalInference/01_01_Introduction/index.Rmd +++ b/06_StatisticalInference/01_01_Introduction/index.Rmd @@ -1,158 +1,158 @@ ---- -title : Introduction to statistical inference -subtitle : Statistical inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -## Statistical inference defined - -Statistical inference is the process of drawing formal conclusions from -data. - -In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy -statistical data where uncertainty must be accounted for. - ---- - -## Motivating example: who's going to win the election? - -In every major election, pollsters would like to know, ahead of the -actual election, who's going to win. Here, the target of -estimation (the estimand) is clear, the percentage of people in -a particular group (city, state, county, country or other electoral -grouping) who will vote for each candidate. - -We cannot poll everyone. Even if we could, some polled -may change their vote by the time the election occurs. -How do we collect a reasonable subset of data and quantify the -uncertainty in the process to produce a good guess at who will win? - ---- - -## Motivating example: is hormone replacement therapy effective? 
- -A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a statistically based protocol, the study was stopped early due to an excess number of negative events.** - -Here there are two inferential problems. - -1. Is HRT effective? -2. How long should we continue the trial in the presence of contrary -evidence? - -See WHI writing group paper JAMA 2002, Vol 288:321-333 for the paper and Steinkellner et al. Menopause 2012, Vol 19:616-621 for a discussion of the long-term impacts. - ---- - -## Motivating example: ECMO - -In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.** - -For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88 - ---- - -## Summary - -- These examples illustrate many of the difficulties of trying -to use data to create general conclusions about a population. -- Paramount among our concerns are: - - Is the sample representative of the population that we'd like to draw inferences about? - - Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions? - - Is there systematic bias created by missing data or the design or conduct of the study? - - What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization -or random sampling, or implicit as the aggregation of many complex unknown processes. 
- - Are we trying to estimate an underlying mechanistic model of phenomena under study? -- Statistical inference requires navigating the set of assumptions and -tools and subsequently thinking about how to draw conclusions from data. - ---- -## Example goals of inference - -1. Estimate and quantify the uncertainty of an estimate of -a population quantity (the proportion of people who will - vote for a candidate). -2. Determine whether a population quantity - is a benchmark value ("is the treatment effective?"). -3. Infer a mechanistic relationship when quantities are measured with - noise ("What is the slope for Hooke's law?") -4. Determine the impact of a policy ("If we reduce pollution levels, - will asthma rates decline?") - - ---- -## Example tools of the trade - -1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest -2. Random sampling: concerned with obtaining data that is representative -of the population of interest -3. Sampling models: concerned with creating a model for the sampling -process; the most common is so-called "iid". -4. Hypothesis testing: concerned with decision making in the presence of uncertainty -5. Confidence intervals: concerned with quantifying uncertainty in -estimation -6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are -approximated. -7. Study design: the process of designing an experiment to minimize biases and variability. -8. Nonparametric bootstrapping: the process of using the data to, - with minimal probability model assumptions, create inferences. -9. Permutation, randomization and exchangeability testing: the process -of using data permutations to perform inferences. - ---- -## Different thinking about probability leads to different styles of inference - -We won't spend too much time talking about this, but there are several different -styles of inference. 
Two broad categories that get discussed a lot are: - -1. Frequency probability: is the long run proportion of - times an event occurs in independent, identically distributed - repetitions. -2. Frequency inference: uses frequency interpretations of probabilities -to control error rates. Answers questions like "What should I decide -given my data controlling the long run proportion of mistakes I make at -a tolerable level." -3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules. -4. Bayesian inference: the use of Bayesian probability representation -of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what -should I believe now?" - -Data scientists tend to fall within shades of gray of these and various other schools of inference. - ---- -## In this class - -* In this class, we will primarily focus on basic sampling models, -basic probability models and frequency style analyses -to create standard inferences. -* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing -and bootstrapping. -* As probability modeling will be our starting point, we first build -up basic probability. - ---- -## Where to learn more on the topics not covered - -1. Explicit use of random sampling in inferences: look in references -on "finite population statistics". Used heavily in polling and -sample surveys. -2. Explicit use of randomization in inferences: look in references -on "causal inference" especially in clinical trials. -3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many). -4. Missing data: well covered in biostatistics and econometric -references; look for references to "multiple imputation", a popular tool for -addressing missing data. -5. 
Study design: consider looking in the subject matter area that - you are interested in; some examples with rich histories in design: - 1. The epidemiological literature is very focused on using study design to investigate public health. - 2. The classical development of study design in agriculture broadly covers design and design principles. - 3. The industrial quality control literature covers design thoroughly. - +--- +title : Introduction to statistical inference +subtitle : Statistical inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Statistical inference defined + +Statistical inference is the process of drawing formal conclusions from +data. + +In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy +statistical data where uncertainty must be accounted for. + +--- + +## Motivating example: who's going to win the election? + +In every major election, pollsters would like to know, ahead of the +actual election, who's going to win. Here, the target of +estimation (the estimand) is clear, the percentage of people in +a particular group (city, state, county, country or other electoral +grouping) who will vote for each candidate. + +We cannot poll everyone. Even if we could, some polled +may change their vote by the time the election occurs. +How do we collect a reasonable subset of data and quantify the +uncertainty in the process to produce a good guess at who will win? + +--- + +## Motivating example: is hormone replacement therapy effective? 
+ +A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a statistically based protocol, the study was stopped early due to an excess number of negative events.** + +Here there are two inferential problems. + +1. Is HRT effective? +2. How long should we continue the trial in the presence of contrary +evidence? + +See WHI writing group paper JAMA 2002, Vol 288:321-333 for the paper and Steinkellner et al. Menopause 2012, Vol 19:616-621 for a discussion of the long-term impacts. + +--- + +## Motivating example: ECMO + +In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.** + +For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88 + +--- + +## Summary + +- These examples illustrate many of the difficulties of trying +to use data to create general conclusions about a population. +- Paramount among our concerns are: + - Is the sample representative of the population that we'd like to draw inferences about? + - Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions? + - Is there systematic bias created by missing data or the design or conduct of the study? + - What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization +or random sampling, or implicit as the aggregation of many complex unknown processes. 
+ - Are we trying to estimate an underlying mechanistic model of phenomena under study? +- Statistical inference requires navigating the set of assumptions and +tools and subsequently thinking about how to draw conclusions from data. + +--- +## Example goals of inference + +1. Estimate and quantify the uncertainty of an estimate of +a population quantity (the proportion of people who will + vote for a candidate). +2. Determine whether a population quantity + is a benchmark value ("is the treatment effective?"). +3. Infer a mechanistic relationship when quantities are measured with + noise ("What is the slope for Hooke's law?") +4. Determine the impact of a policy ("If we reduce pollution levels, + will asthma rates decline?") + + +--- +## Example tools of the trade + +1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest +2. Random sampling: concerned with obtaining data that is representative +of the population of interest +3. Sampling models: concerned with creating a model for the sampling +process; the most common is so-called "iid". +4. Hypothesis testing: concerned with decision making in the presence of uncertainty +5. Confidence intervals: concerned with quantifying uncertainty in +estimation +6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are +approximated. +7. Study design: the process of designing an experiment to minimize biases and variability. +8. Nonparametric bootstrapping: the process of using the data to, + with minimal probability model assumptions, create inferences. +9. Permutation, randomization and exchangeability testing: the process +of using data permutations to perform inferences. + +--- +## Different thinking about probability leads to different styles of inference + +We won't spend too much time talking about this, but there are several different +styles of inference. 
Two broad categories that get discussed a lot are: + +1. Frequency probability: is the long run proportion of + times an event occurs in independent, identically distributed + repetitions. +2. Frequency inference: uses frequency interpretations of probabilities +to control error rates. Answers questions like "What should I decide +given my data controlling the long run proportion of mistakes I make at +a tolerable level." +3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules. +4. Bayesian inference: the use of Bayesian probability representation +of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what +should I believe now?" + +Data scientists tend to fall within shades of gray of these and various other schools of inference. + +--- +## In this class + +* In this class, we will primarily focus on basic sampling models, +basic probability models and frequency style analyses +to create standard inferences. +* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing +and bootstrapping. +* As probability modeling will be our starting point, we first build +up basic probability. + +--- +## Where to learn more on the topics not covered + +1. Explicit use of random sampling in inferences: look in references +on "finite population statistics". Used heavily in polling and +sample surveys. +2. Explicit use of randomization in inferences: look in references +on "causal inference" especially in clinical trials. +3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many). +4. Missing data: well covered in biostatistics and econometric +references; look for references to "multiple imputation", a popular tool for +addressing missing data. +5. 
Study design: consider looking in the subject matter area that + you are interested in; some examples with rich histories in design: + 1. The epidemiological literature is very focused on using study design to investigate public health. + 2. The classical development of study design in agriculture broadly covers design and design principles. + 3. The industrial quality control literature covers design thoroughly. + diff --git a/06_StatisticalInference/01_01_Introduction/index.html b/06_StatisticalInference/01_01_Introduction/index.html index 391528189..f5b9aad35 100644 --- a/06_StatisticalInference/01_01_Introduction/index.html +++ b/06_StatisticalInference/01_01_Introduction/index.html @@ -1,358 +1,358 @@ - - - - Introduction to statistical inference - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Introduction to statistical inference

-

Statistical inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Statistical inference defined

-
-
-

Statistical inference is the process of drawing formal conclusions from -data.

- -

In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy -statistical data where uncertainty must be accounted for.

- -
- -
- - -
-

Motivating example: who's going to win the election?

-
-
-

In every major election, pollsters would like to know, ahead of the -actual election, who's going to win. Here, the target of -estimation (the estimand) is clear, the percentage of people in -a particular group (city, state, county, country or other electoral -grouping) who will vote for each candidate.

- -

We cannot poll everyone. Even if we could, some polled -may change their vote by the time the election occurs. -How do we collect a reasonable subset of data and quantify the -uncertainty in the process to produce a good guess at who will win?

- -
- -
- - -
-

Motivating example: is hormone replacement therapy effective?

-
-
-

A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. Based on a statistically based protocol, the study was stopped early due to an excess number of negative events.

- -

Here there are two inferential problems.

- -
    -
  1. Is HRT effective?
  2. -
  3. How long should we continue the trial in the presence of contrary -evidence?
  4. -
- -

See WHI writing group paper JAMA 2002, Vol 288:321-333 for the paper and Steinkellner et al. Menopause 2012, Vol 19:616-621 for a discussion of the long-term impacts.

- -
- -
- - -
-

Motivating example: ECMO

-
-
-

In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.

- -

For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88

- -
- -
- - -
-

Summary

-
-
-
    -
  • These examples illustrate many of the difficulties of trying -to use data to create general conclusions about a population.
  • -
  • Paramount among our concerns are: - -
      -
    • Is the sample representative of the population that we'd like to draw inferences about?
    • -
    • Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
    • -
    • Is there systematic bias created by missing data or the design or conduct of the study?
    • -
    • What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization -or random sampling, or implicit as the aggregation of many complex unknown processes.
    • -
    • Are we trying to estimate an underlying mechanistic model of phenomena under study?
    • -
  • -
  • Statistical inference requires navigating the set of assumptions and -tools and subsequently thinking about how to draw conclusions from data.
  • -
- -
- -
- - -
-

Example goals of inference

-
-
-
    -
  1. Estimate and quantify the uncertainty of an estimate of -a population quantity (the proportion of people who will -vote for a candidate).
  2. -
  3. Determine whether a population quantity -is a benchmark value ("is the treatment effective?").
  4. -
  5. Infer a mechanistic relationship when quantities are measured with -noise ("What is the slope for Hooke's law?")
  6. -
  7. Determine the impact of a policy ("If we reduce pollution levels, -will asthma rates decline?")
  8. -
- -
- -
- - -
-

Example tools of the trade

-
-
-
    -
  1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
  2. -
  3. Random sampling: concerned with obtaining data that is representative -of the population of interest
  4. -
  5. Sampling models: concerned with creating a model for the sampling -process; the most common is so-called "iid".
  6. -
  7. Hypothesis testing: concerned with decision making in the presence of uncertainty
  8. -
  9. Confidence intervals: concerned with quantifying uncertainty in -estimation
  10. -
  11. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are -approximated.
  12. -
  13. Study design: the process of designing an experiment to minimize biases and variability.
  14. -
  15. Nonparametric bootstrapping: the process of using the data to, -with minimal probability model assumptions, create inferences.
  16. -
  17. Permutation, randomization and exchangeability testing: the process -of using data permutations to perform inferences.
  18. -
- -
- -
- - -
-

Different thinking about probability leads to different styles of inference

-
-
-

We won't spend too much time talking about this, but there are several different -styles of inference. Two broad categories that get discussed a lot are:

- -
    -
  1. Frequency probability: is the long run proportion of -times an event occurs in independent, identically distributed -repetitions.
  2. -
  3. Frequency inference: uses frequency interpretations of probabilities -to control error rates. Answers questions like "What should I decide -given my data controlling the long run proportion of mistakes I make at -a tolerable level."
  4. -
  5. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.
  6. -
  7. Bayesian inference: the use of Bayesian probability representation -of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what -should I believe now?"
  8. -
- -

Data scientists tend to fall within shades of gray of these and various other schools of inference.

- -
- -
- - -
-

In this class

-
-
-
    -
  • In this class, we will primarily focus on basic sampling models, -basic probability models and frequency style analyses -to create standard inferences.
  • -
  • Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing -and bootstrapping.
  • -
  • As probability modeling will be our starting point, we first build -up basic probability.
  • -
- -
- -
- - -
-

Where to learn more on the topics not covered

-
-
-
    -
  1. Explicit use of random sampling in inferences: look in references -on "finite population statistics". Used heavily in polling and -sample surveys.
  2. -
  3. Explicit use of randomization in inferences: look in references -on "causal inference" especially in clinical trials.
  4. -
  5. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
  6. -
  7. Missing data: well covered in biostatistics and econometric -references; look for references to "multiple imputation", a popular tool for -addressing missing data.
  8. -
  9. Study design: consider looking in the subject matter area that -you are interested in; some examples with rich histories in design: - -
      -
    1. The epidemiological literature is very focused on using study design to investigate public health.
    2. -
    3. The classical development of study design in agriculture broadly covers design and design principles.
    4. -
    5. The industrial quality control literature covers design thoroughly.
    6. -
  10. -
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Introduction to statistical inference + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Introduction to statistical inference

+

Statistical inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Statistical inference defined

+
+
+

Statistical inference is the process of drawing formal conclusions from +data.

+ +

In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy +statistical data where uncertainty must be accounted for.

+ +
+ +
+ + +
+

Motivating example: who's going to win the election?

+
+
+

In every major election, pollsters would like to know, ahead of the +actual election, who's going to win. Here, the target of +estimation (the estimand) is clear, the percentage of people in +a particular group (city, state, county, country or other electoral +grouping) who will vote for each candidate.

+ +

We cannot poll everyone. Even if we could, some polled +may change their vote by the time the election occurs. +How do we collect a reasonable subset of data and quantify the +uncertainty in the process to produce a good guess at who will win?

+ +
+ +
+ + +
+

Motivating example: is hormone replacement therapy effective?

+
+
+

A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. Based on a statistically based protocol, the study was stopped early due to an excess number of negative events.

+ +

Here there are two inferential problems.

+ +
    +
  1. Is HRT effective?
  2. +
  3. How long should we continue the trial in the presence of contrary +evidence?
  4. +
+ +

See WHI writing group paper JAMA 2002, Vol 288:321-333 for the paper and Steinkellner et al. Menopause 2012, Vol 19:616-621 for a discussion of the long-term impacts.

+ +
+ +
+ + +
+

Motivating example: ECMO

+
+
+

In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.

+ +

For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88

+ +
+ +
+ + +
+

Summary

+
+
+
    +
  • These examples illustrate many of the difficulties of trying +to use data to create general conclusions about a population.
  • +
  • Paramount among our concerns are: + +
      +
    • Is the sample representative of the population that we'd like to draw inferences about?
    • +
    • Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
    • +
    • Is there systematic bias created by missing data or the design or conduct of the study?
    • +
    • What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization +or random sampling, or implicit as the aggregation of many complex unknown processes.
    • +
    • Are we trying to estimate an underlying mechanistic model of phenomena under study?
    • +
  • +
  • Statistical inference requires navigating the set of assumptions and +tools and subsequently thinking about how to draw conclusions from data.
  • +
+ +
+ +
+ + +
+

Example goals of inference

+
+
+
    +
  1. Estimate and quantify the uncertainty of an estimate of +a population quantity (the proportion of people who will +vote for a candidate).
  2. +
  3. Determine whether a population quantity +is a benchmark value ("is the treatment effective?").
  4. +
  5. Infer a mechanistic relationship when quantities are measured with +noise ("What is the slope for Hooke's law?")
  6. +
  7. Determine the impact of a policy ("If we reduce pollution levels, +will asthma rates decline?")
  8. +
+ +
+ +
+ + +
+

Example tools of the trade

+
+
+
    +
  1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
  2. +
  3. Random sampling: concerned with obtaining data that is representative +of the population of interest
  4. +
  5. Sampling models: concerned with creating a model for the sampling +process; the most common is so-called "iid".
  6. +
  7. Hypothesis testing: concerned with decision making in the presence of uncertainty
  8. +
  9. Confidence intervals: concerned with quantifying uncertainty in +estimation
  10. +
  11. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are +approximated.
  12. +
  13. Study design: the process of designing an experiment to minimize biases and variability.
  14. +
  15. Nonparametric bootstrapping: the process of using the data to, +with minimal probability model assumptions, create inferences.
  16. +
  17. Permutation, randomization and exchangeability testing: the process +of using data permutations to perform inferences.
  18. +
+ +
+ +
+ + +
+

Different thinking about probability leads to different styles of inference

+
+
+

We won't spend too much time talking about this, but there are several different +styles of inference. Two broad categories that get discussed a lot are:

+ +
    +
  1. Frequency probability: is the long run proportion of +times an event occurs in independent, identically distributed +repetitions.
  2. +
  3. Frequency inference: uses frequency interpretations of probabilities +to control error rates. Answers questions like "What should I decide +given my data controlling the long run proportion of mistakes I make at +a tolerable level."
  4. +
  5. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.
  6. +
  7. Bayesian inference: the use of Bayesian probability representation +of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what +should I believe now?"
  8. +
+ +

Data scientists tend to fall within shades of gray of these and various other schools of inference.

+ +
+ +
+ + +
+

In this class

+
+
+
    +
  • In this class, we will primarily focus on basic sampling models, +basic probability models and frequency style analyses +to create standard inferences.
  • +
  • Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing +and bootstrapping.
  • +
  • As probability modeling will be our starting point, we first build +up basic probability.
  • +
+ +
+ +
+ + +
+

Where to learn more on the topics not covered

+
+
+
    +
  1. Explicit use of random sampling in inferences: look in references +on "finite population statistics". Used heavily in polling and +sample surveys.
  2. +
  3. Explicit use of randomization in inferences: look in references +on "causal inference" especially in clinical trials.
  4. +
  5. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
  6. +
  7. Missing data: well covered in biostatistics and econometric +references; look for references to "multiple imputation", a popular tool for +addressing missing data.
  8. +
  9. Study design: consider looking in the subject matter area that +you are interested in; some examples with rich histories in design: + +
      +
    1. The epidemiological literature is very focused on using study design to investigate public health.
    2. +
    3. The classical development of study design in agriculture broadly covers design and design principles.
    4. +
    5. The industrial quality control literature covers design thoroughly.
    6. +
  10. +
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/01_01_Introduction/index.md b/06_StatisticalInference/01_01_Introduction/index.md index 74e8b2a1e..3b5e98989 100644 --- a/06_StatisticalInference/01_01_Introduction/index.md +++ b/06_StatisticalInference/01_01_Introduction/index.md @@ -1,158 +1,158 @@ ---- -title : Introduction to statistical inference -subtitle : Statistical inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -## Statistical inference defined - -Statistical inference is the process of drawing formal conclusions from -data. - -In our class, we wil define formal statistical inference as settings where one wants to infer facts about a population using noisy -statistical data where uncertainty must be accounted for. - ---- - -## Motivating example: who's going to win the election? - -In every major election, pollsters would like to know, ahead of the -actual election, who's going to win. Here, the target of -estimation (the estimand) is clear, the percentage of people in -a particular group (city, state, county, country or other electoral -grouping) who will vote for each candidate. - -We can not poll everyone. Even if we could, some polled -may change their vote by the time the election occurs. -How do we collect a reasonable subset of data and quantify the -uncertainty in the process to produce a good guess at who will win? - ---- - -## Motivating example: is hormone replacement therapy effective? 
- -A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for post menopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a statistically based protocol, the study was stopped early due an excess number of negative events.** - -Here's there's two inferential problems. - -1. Is HRT effective? -2. How long should we continue the trial in the presence of contrary -evidence? - -See WHI writing group paper JAMA 2002, Vol 288:321 - 333. for the paper and Steinkellner et al. Menopause 2012, Vol 19:616 621 for adiscussion of the long term impacts - ---- - -## Motivating example: ECMO - -In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations lead to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.** - -For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88 - ---- - -## Summary - -- These examples illustrate many of the difficulties of trying -to use data to create general conclusions about a population. -- Paramount among our concerns are: - - Is the sample representative of the population that we'd like to draw inferences about? - - Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions? - - Is there systematic bias created by missing data or the design or conduct of the study? - - What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization -or random sampling, or implicit as the aggregation of many complex uknown processes. 
- - Are we trying to estimate an underlying mechanistic model of phenomena under study? -- Statistical inference requires navigating the set of assumptions and -tools and subsequently thinking about how to draw conclusions from data. - ---- -## Example goals of inference - -1. Estimate and quantify the uncertainty of an estimate of -a population quantity (the proportion of people who will - vote for a candidate). -2. Determine whether a population quantity - is a benchmark value ("is the treatment effective?"). -3. Infer a mechanistic relationship when quantities are measured with - noise ("What is the slope for Hooke's law?") -4. Determine the impact of a policy? ("If we reduce polution levels, - will asthma rates decline?") - - ---- -## Example tools of the trade - -1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest -2. Random sampling: concerned with obtaining data that is representative -of the population of interest -3. Sampling models: concerned with creating a model for the sampling -process, the most common is so called "iid". -4. Hypothesis testing: concerned with decision making in the presence of uncertainty -5. Confidence intervals: concerned with quantifying uncertainty in -estimation -6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are -approximated. -7. Study design: the process of designing an experiment to minimize biases and variability. -8. Nonparametric bootstrapping: the process of using the data to, - with minimal probability model assumptions, create inferences. -9. Permutation, randomization and exchangeability testing: the process -of using data permutations to perform inferences. - ---- -## Different thinking about probability leads to different styles of inference - -We won't spend too much time talking about this, but there are several different -styles of inference. 
Two broad categories that get discussed a lot are: - -1. Frequency probability: is the long run proportion of - times an event occurs in independent, identically distributed - repetitions. -2. Frequency inference: uses frequency interpretations of probabilities -to control error rates. Answers questions like "What should I decide -given my data controlling the long run proportion of mistakes I make at -a tolerable level." -3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules. -4. Bayesian inference: the use of Bayesian probability representation -of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what -should I believe now?" - -Data scientists tend to fall within shades of gray of these and various other schools of inference. - ---- -## In this class - -* In this class, we will primarily focus on basic sampling models, -basic probability models and frequency style analyses -to create standard inferences. -* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing -and bootstrapping. -* As probability modeling will be our starting point, we first build -up basic probability. - ---- -## Where to learn more on the topics not covered - -1. Explicit use of random sampling in inferences: look in references -on "finite population statistics". Used heavily in polling and -sample surveys. -2. Explicit use of randomization in inferences: look in references -on "causal inference" especially in clinical trials. -3. Bayesian probability and Bayesian statistics: look for basic itroductory books (there are many). -4. Missing data: well covered in biostatistics and econometric -references; look for references to "multiple imputation", a popular tool for -addressing missing data. -5. 
Study design: consider looking in the subject matter area that - you are interested in; some examples with rich histories in design: - 1. The epidemiological literature is very focused on using study design to investigate public health. - 2. The classical development of study design in agriculture broadly covers design and design principles. - 3. The industrial quality control literature covers design thoroughly. - +--- +title : Introduction to statistical inference +subtitle : Statistical inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Statistical inference defined + +Statistical inference is the process of drawing formal conclusions from +data. + +In our class, we wil define formal statistical inference as settings where one wants to infer facts about a population using noisy +statistical data where uncertainty must be accounted for. + +--- + +## Motivating example: who's going to win the election? + +In every major election, pollsters would like to know, ahead of the +actual election, who's going to win. Here, the target of +estimation (the estimand) is clear, the percentage of people in +a particular group (city, state, county, country or other electoral +grouping) who will vote for each candidate. + +We can not poll everyone. Even if we could, some polled +may change their vote by the time the election occurs. +How do we collect a reasonable subset of data and quantify the +uncertainty in the process to produce a good guess at who will win? + +--- + +## Motivating example: is hormone replacement therapy effective? 
+ +A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for post menopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a statistically based protocol, the study was stopped early due an excess number of negative events.** + +Here's there's two inferential problems. + +1. Is HRT effective? +2. How long should we continue the trial in the presence of contrary +evidence? + +See WHI writing group paper JAMA 2002, Vol 288:321 - 333. for the paper and Steinkellner et al. Menopause 2012, Vol 19:616 621 for adiscussion of the long term impacts + +--- + +## Motivating example: ECMO + +In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations lead to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.** + +For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88 + +--- + +## Summary + +- These examples illustrate many of the difficulties of trying +to use data to create general conclusions about a population. +- Paramount among our concerns are: + - Is the sample representative of the population that we'd like to draw inferences about? + - Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions? + - Is there systematic bias created by missing data or the design or conduct of the study? + - What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization +or random sampling, or implicit as the aggregation of many complex uknown processes. 
+ - Are we trying to estimate an underlying mechanistic model of phenomena under study? +- Statistical inference requires navigating the set of assumptions and +tools and subsequently thinking about how to draw conclusions from data. + +--- +## Example goals of inference + +1. Estimate and quantify the uncertainty of an estimate of +a population quantity (the proportion of people who will + vote for a candidate). +2. Determine whether a population quantity + is a benchmark value ("is the treatment effective?"). +3. Infer a mechanistic relationship when quantities are measured with + noise ("What is the slope for Hooke's law?") +4. Determine the impact of a policy? ("If we reduce polution levels, + will asthma rates decline?") + + +--- +## Example tools of the trade + +1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest +2. Random sampling: concerned with obtaining data that is representative +of the population of interest +3. Sampling models: concerned with creating a model for the sampling +process, the most common is so called "iid". +4. Hypothesis testing: concerned with decision making in the presence of uncertainty +5. Confidence intervals: concerned with quantifying uncertainty in +estimation +6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are +approximated. +7. Study design: the process of designing an experiment to minimize biases and variability. +8. Nonparametric bootstrapping: the process of using the data to, + with minimal probability model assumptions, create inferences. +9. Permutation, randomization and exchangeability testing: the process +of using data permutations to perform inferences. + +--- +## Different thinking about probability leads to different styles of inference + +We won't spend too much time talking about this, but there are several different +styles of inference. 
Two broad categories that get discussed a lot are: + +1. Frequency probability: is the long run proportion of + times an event occurs in independent, identically distributed + repetitions. +2. Frequency inference: uses frequency interpretations of probabilities +to control error rates. Answers questions like "What should I decide +given my data controlling the long run proportion of mistakes I make at +a tolerable level." +3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules. +4. Bayesian inference: the use of Bayesian probability representation +of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what +should I believe now?" + +Data scientists tend to fall within shades of gray of these and various other schools of inference. + +--- +## In this class + +* In this class, we will primarily focus on basic sampling models, +basic probability models and frequency style analyses +to create standard inferences. +* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing +and bootstrapping. +* As probability modeling will be our starting point, we first build +up basic probability. + +--- +## Where to learn more on the topics not covered + +1. Explicit use of random sampling in inferences: look in references +on "finite population statistics". Used heavily in polling and +sample surveys. +2. Explicit use of randomization in inferences: look in references +on "causal inference" especially in clinical trials. +3. Bayesian probability and Bayesian statistics: look for basic itroductory books (there are many). +4. Missing data: well covered in biostatistics and econometric +references; look for references to "multiple imputation", a popular tool for +addressing missing data. +5. 
Study design: consider looking in the subject matter area that + you are interested in; some examples with rich histories in design: + 1. The epidemiological literature is very focused on using study design to investigate public health. + 2. The classical development of study design in agriculture broadly covers design and design principles. + 3. The industrial quality control literature covers design thoroughly. + diff --git a/06_StatisticalInference/01_01_Introduction/index.pdf b/06_StatisticalInference/01_01_Introduction/index.pdf index 70d9be1bc..b50714770 100644 Binary files a/06_StatisticalInference/01_01_Introduction/index.pdf and b/06_StatisticalInference/01_01_Introduction/index.pdf differ diff --git a/06_StatisticalInference/01_02_Probability/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/01_02_Probability/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..21d71259c Binary files /dev/null and b/06_StatisticalInference/01_02_Probability/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/01_02_Probability/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/01_02_Probability/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..833444e0b Binary files /dev/null and b/06_StatisticalInference/01_02_Probability/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/01_02_Probability/index.Rmd b/06_StatisticalInference/01_02_Probability/index.Rmd index c925cc40e..3691fc14c 100644 --- a/06_StatisticalInference/01_02_Probability/index.Rmd +++ b/06_StatisticalInference/01_02_Probability/index.Rmd @@ -1,277 +1,276 @@ ---- -title : Probability -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: 
../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Notation - -- The **sample space**, $\Omega$, is the collection of possible outcomes of an experiment - - Example: die roll $\Omega = \{1,2,3,4,5,6\}$ -- An **event**, say $E$, is a subset of $\Omega$ - - Example: die roll is even $E = \{2,4,6\}$ -- An **elementary** or **simple** event is a particular result - of an experiment - - Example: die roll is a four, $\omega = 4$ -- $\emptyset$ is called the **null event** or the **empty set** - ---- - -## Interpretation of set operations - -Normal set operations have particular interpretations in this setting - -1. $\omega \in E$ implies that $E$ occurs when $\omega$ occurs -2. $\omega \not\in E$ implies that $E$ does not occur when $\omega$ occurs -3. $E \subset F$ implies that the occurrence of $E$ implies the occurrence of $F$ -4. $E \cap F$ implies the event that both $E$ and $F$ occur -5. $E \cup F$ implies the event that at least one of $E$ or $F$ occur -6. $E \cap F=\emptyset$ means that $E$ and $F$ are **mutually exclusive**, or cannot both occur -7. $E^c$ or $\bar E$ is the event that $E$ does not occur - ---- - -## Probability - -A **probability measure**, $P$, is a function from the collection of possible events so that the following hold - -1. For an event $E\subset \Omega$, $0 \leq P(E) \leq 1$ -2. $P(\Omega) = 1$ -3. If $E_1$ and $E_2$ are mutually exclusive events - $P(E_1 \cup E_2) = P(E_1) + P(E_2)$. - -Part 3 of the definition implies **finite additivity** - -$$ -P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) -$$ -where the $\{A_i\}$ are mutually exclusive. (Note a more general version of -additivity is used in advanced classes.) 
- - ---- - - -## Example consequences - -- $P(\emptyset) = 0$ -- $P(E) = 1 - P(E^c)$ -- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ -- if $A \subset B$ then $P(A) \leq P(B)$ -- $P\left(A \cup B\right) = 1 - P(A^c \cap B^c)$ -- $P(A \cap B^c) = P(A) - P(A \cap B)$ -- $P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)$ -- $P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)$ - ---- - -## Example - -The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts? - ---- - -## Example continued - -Answer: No, the events are not mutually exclusive. To elaborate let: - -$$ -\begin{eqnarray*} - A_1 & = & \{\mbox{Person has sleep apnea}\} \\ - A_2 & = & \{\mbox{Person has RLS}\} - \end{eqnarray*} -$$ - -Then - -$$ -\begin{eqnarray*} - P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ - & = & 0.13 - \mbox{Probability of having both} - \end{eqnarray*} -$$ -Likely, some fraction of the population has both. - ---- - -## Random variables - -- A **random variable** is a numerical outcome of an experiment. -- The random variables that we study will come in two varieties, - **discrete** or **continuous**. -- Discrete random variable are random variables that take on only a -countable number of possibilities. - * $P(X = k)$ -- Continuous random variable can take any value on the real line or some subset of the real line. 
- * $P(X \in A)$ - ---- - -## Examples of variables that can be thought of as random variables - -- The $(0-1)$ outcome of the flip of a coin -- The outcome from the roll of a die -- The BMI of a subject four years after a baseline measurement -- The hypertension status of a subject randomly drawn from a population - ---- - -## PMF - -A probability mass function evaluated at a value corresponds to the -probability that a random variable takes that value. To be a valid -pmf a function, $p$, must satisfy - - 1. $p(x) \geq 0$ for all $x$ - 2. $\sum_{x} p(x) = 1$ - -The sum is taken over all of the possible values for $x$. - ---- - -## Example - -Let $X$ be the result of a coin flip where $X=0$ represents -tails and $X = 1$ represents heads. -$$ -p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 -$$ -Suppose that we do not know whether or not the coin is fair; Let -$\theta$ be the probability of a head expressed as a proportion -(between 0 and 1). -$$ -p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 -$$ - ---- - -## PDF - -A probability density function (pdf), is a function associated with -a continuous random variable - - *Areas under pdfs correspond to probabilities for that random variable* - -To be a valid pdf, a function $f$ must satisfy - -1. $f(x) \geq 0$ for all $x$ - -2. The area under $f(x)$ is one. - ---- -## Example - -Suppose that the proportion of help calls that get addressed in -a random day by a help line is given by -$$ -f(x) = \left\{\begin{array}{ll} - 2 x & \mbox{ for } 1 > x > 0 \\ - 0 & \mbox{ otherwise} -\end{array} \right. -$$ - -Is this a mathematically valid density? - ---- - -```{r, fig.height = 5, fig.width = 5, echo = TRUE, fig.align='center'} -x <- c(-0.5, 0, 1, 1, 1.5); y <- c( 0, 0, 2, 0, 0) -plot(x, y, lwd = 3, frame = FALSE, type = "l") -``` - ---- - -## Example continued - -What is the probability that 75% or fewer of calls get addressed? 
- -```{r, fig.height = 5, fig.width = 5, echo = FALSE, fig.align='center'} -plot(x, y, lwd = 3, frame = FALSE, type = "l") -polygon(c(0, .75, .75, 0), c(0, 0, 1.5, 0), lwd = 3, col = "lightblue") -``` - ---- -```{r} -1.5 * .75 / 2 -pbeta(.75, 2, 1) -``` ---- - -## CDF and survival function - -- The **cumulative distribution function** (CDF) of a random variable $X$ is defined as the function -$$ -F(x) = P(X \leq x) -$$ -- This definition applies regardless of whether $X$ is discrete or continuous. -- The **survival function** of a random variable $X$ is defined as -$$ -S(x) = P(X > x) -$$ -- Notice that $S(x) = 1 - F(x)$ -- For continuous random variables, the PDF is the derivative of the CDF - ---- - -## Example - -What are the survival function and CDF from the density considered before? - -For $1 \geq x \geq 0$ -$$ -F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 -$$ - -$$ -S(x) = 1 - x^2 -$$ - -```{r} -pbeta(c(0.4, 0.5, 0.6), 2, 1) -``` - ---- - -## Quantiles - -- The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that -$$ -F(x_\alpha) = \alpha -$$ -- A **percentile** is simply a quantile with $\alpha$ expressed as a percent -- The **median** is the $50^{th}$ percentile - ---- -## Example -- We want to solve $0.5 = F(x) = x^2$ -- Resulting in the solution -```{r, echo = TRUE} -sqrt(0.5) -``` -- Therefore, about `r sqrt(0.5)` of calls being answered on a random day is the median. -- R can approximate quantiles for you for common distributions - -```{r} -qbeta(0.5, 2, 1) -``` - ---- - -## Summary - -- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" -- We're referring to are **population quantities**. Therefore, the median being - discussed is the **population median**. -- A probability model connects the data to the population using assumptions. 
-- Therefore the median we're discussing is the **estimand**, the sample median will be the **estimator** - +--- +title : Probability +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Notation + +- The **sample space**, $\Omega$, is the collection of possible outcomes of an experiment + - Example: die roll $\Omega = \{1,2,3,4,5,6\}$ +- An **event**, say $E$, is a subset of $\Omega$ + - Example: die roll is even $E = \{2,4,6\}$ +- An **elementary** or **simple** event is a particular result + of an experiment + - Example: die roll is a four, $\omega = 4$ +- $\emptyset$ is called the **null event** or the **empty set** + +--- + +## Interpretation of set operations + +Normal set operations have particular interpretations in this setting + +1. $\omega \in E$ implies that $E$ occurs when $\omega$ occurs +2. $\omega \not\in E$ implies that $E$ does not occur when $\omega$ occurs +3. $E \subset F$ implies that the occurrence of $E$ implies the occurrence of $F$ +4. $E \cap F$ implies the event that both $E$ and $F$ occur +5. $E \cup F$ implies the event that at least one of $E$ or $F$ occur +6. $E \cap F=\emptyset$ means that $E$ and $F$ are **mutually exclusive**, or cannot both occur +7. $E^c$ or $\bar E$ is the event that $E$ does not occur + +--- + +## Probability + +A **probability measure**, $P$, is a function from the collection of possible events so that the following hold + +1. For an event $E\subset \Omega$, $0 \leq P(E) \leq 1$ +2. $P(\Omega) = 1$ +3. If $E_1$ and $E_2$ are mutually exclusive events + $P(E_1 \cup E_2) = P(E_1) + P(E_2)$. 
+ +Part 3 of the definition implies **finite additivity** + +$$ +P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) +$$ +where the $\{A_i\}$ are mutually exclusive. (Note a more general version of +additivity is used in advanced classes.) + + +--- + + +## Example consequences + +- $P(\emptyset) = 0$ +- $P(E) = 1 - P(E^c)$ +- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ +- if $A \subset B$ then $P(A) \leq P(B)$ +- $P\left(A \cup B\right) = 1 - P(A^c \cap B^c)$ +- $P(A \cap B^c) = P(A) - P(A \cap B)$ +- $P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)$ +- $P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)$ + +--- + +## Example + +The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts? + +--- + +## Example continued + +Answer: No, the events are not mutually exclusive. To elaborate let: + +$$ +\begin{eqnarray*} + A_1 & = & \{\mbox{Person has sleep apnea}\} \\ + A_2 & = & \{\mbox{Person has RLS}\} + \end{eqnarray*} +$$ + +Then + +$$ +\begin{eqnarray*} + P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ + & = & 0.13 - \mbox{Probability of having both} + \end{eqnarray*} +$$ +Likely, some fraction of the population has both. + +--- + +## Random variables + +- A **random variable** is a numerical outcome of an experiment. +- The random variables that we study will come in two varieties, + **discrete** or **continuous**. +- Discrete random variable are random variables that take on only a +countable number of possibilities. + * $P(X = k)$ +- Continuous random variable can take any value on the real line or some subset of the real line. 
+ * $P(X \in A)$ + +--- + +## Examples of variables that can be thought of as random variables + +- The $(0-1)$ outcome of the flip of a coin +- The outcome from the roll of a die +- The BMI of a subject four years after a baseline measurement +- The hypertension status of a subject randomly drawn from a population + +--- + +## PMF + +A probability mass function evaluated at a value corresponds to the +probability that a random variable takes that value. To be a valid +pmf a function, $p$, must satisfy + + 1. $p(x) \geq 0$ for all $x$ + 2. $\sum_{x} p(x) = 1$ + +The sum is taken over all of the possible values for $x$. + +--- + +## Example + +Let $X$ be the result of a coin flip where $X=0$ represents +tails and $X = 1$ represents heads. +$$ +p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ +Suppose that we do not know whether or not the coin is fair; Let +$\theta$ be the probability of a head expressed as a proportion +(between 0 and 1). +$$ +p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ + +--- + +## PDF + +A probability density function (pdf), is a function associated with +a continuous random variable + + *Areas under pdfs correspond to probabilities for that random variable* + +To be a valid pdf, a function $f$ must satisfy + +1. $f(x) \geq 0$ for all $x$ + +2. The area under $f(x)$ is one. + +--- +## Example + +Suppose that the proportion of help calls that get addressed in +a random day by a help line is given by +$$ +f(x) = \left\{\begin{array}{ll} + 2 x & \mbox{ for } 1 > x > 0 \\ + 0 & \mbox{ otherwise} +\end{array} \right. +$$ + +Is this a mathematically valid density? + +--- + +```{r, fig.height = 5, fig.width = 5, echo = TRUE, fig.align='center'} +x <- c(-0.5, 0, 1, 1, 1.5); y <- c( 0, 0, 2, 0, 0) +plot(x, y, lwd = 3, frame = FALSE, type = "l") +``` + +--- + +## Example continued + +What is the probability that 75% or fewer of calls get addressed? 
+ +```{r, fig.height = 5, fig.width = 5, echo = FALSE, fig.align='center'} +plot(x, y, lwd = 3, frame = FALSE, type = "l") +polygon(c(0, .75, .75, 0), c(0, 0, 1.5, 0), lwd = 3, col = "lightblue") +``` + +--- +```{r} +1.5 * .75 / 2 +pbeta(.75, 2, 1) +``` +--- + +## CDF and survival function + +- The **cumulative distribution function** (CDF) of a random variable $X$ is defined as the function +$$ +F(x) = P(X \leq x) +$$ +- This definition applies regardless of whether $X$ is discrete or continuous. +- The **survival function** of a random variable $X$ is defined as +$$ +S(x) = P(X > x) +$$ +- Notice that $S(x) = 1 - F(x)$ +- For continuous random variables, the PDF is the derivative of the CDF + +--- + +## Example + +What are the survival function and CDF from the density considered before? + +For $1 \geq x \geq 0$ +$$ +F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 +$$ + +$$ +S(x) = 1 - x^2 +$$ + +```{r} +pbeta(c(0.4, 0.5, 0.6), 2, 1) +``` + +--- + +## Quantiles + +- The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that +$$ +F(x_\alpha) = \alpha +$$ +- A **percentile** is simply a quantile with $\alpha$ expressed as a percent +- The **median** is the $50^{th}$ percentile + +--- +## Example +- We want to solve $0.5 = F(x) = x^2$ +- Resulting in the solution +```{r, echo = TRUE} +sqrt(0.5) +``` +- Therefore, about `r sqrt(0.5)` of calls being answered on a random day is the median. +- R can approximate quantiles for you for common distributions + +```{r} +qbeta(0.5, 2, 1) +``` + +--- + +## Summary + +- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" +- We're referring to are **population quantities**. Therefore, the median being + discussed is the **population median**. +- A probability model connects the data to the population using assumptions. 
+- Therefore the median we're discussing is the **estimand**, the sample median will be the **estimator** diff --git a/06_StatisticalInference/01_02_Probability/index.html b/06_StatisticalInference/01_02_Probability/index.html index 8e224deef..f9de0d1e0 100644 --- a/06_StatisticalInference/01_02_Probability/index.html +++ b/06_StatisticalInference/01_02_Probability/index.html @@ -1,617 +1,617 @@ - - - - Probability - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Probability

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Notation

-
-
-
    -
  • The sample space, \(\Omega\), is the collection of possible outcomes of an experiment - -
      -
    • Example: die roll \(\Omega = \{1,2,3,4,5,6\}\)
    • -
  • -
  • An event, say \(E\), is a subset of \(\Omega\) - -
      -
    • Example: die roll is even \(E = \{2,4,6\}\)
    • -
  • -
  • An elementary or simple event is a particular result -of an experiment - -
      -
    • Example: die roll is a four, \(\omega = 4\)
    • -
  • -
  • \(\emptyset\) is called the null event or the empty set
  • -
- -
- -
- - -
-

Interpretation of set operations

-
-
-

Normal set operations have particular interpretations in this setting

- -
    -
  1. \(\omega \in E\) implies that \(E\) occurs when \(\omega\) occurs
  2. -
  3. \(\omega \not\in E\) implies that \(E\) does not occur when \(\omega\) occurs
  4. -
  5. \(E \subset F\) implies that the occurrence of \(E\) implies the occurrence of \(F\)
  6. -
  7. \(E \cap F\) is the event that both \(E\) and \(F\) occur
  8. -
  9. \(E \cup F\) is the event that at least one of \(E\) or \(F\) occurs
  10. -
  11. \(E \cap F=\emptyset\) means that \(E\) and \(F\) are mutually exclusive, or cannot both occur
  12. -
  13. \(E^c\) or \(\bar E\) is the event that \(E\) does not occur
  14. -
- -
- -
- - -
-

Probability

-
-
-

A probability measure, \(P\), is a function defined on the collection of possible events such that the following hold:

- -
    -
  1. For an event \(E\subset \Omega\), \(0 \leq P(E) \leq 1\)
  2. -
  3. \(P(\Omega) = 1\)
  4. -
  5. If \(E_1\) and \(E_2\) are mutually exclusive events, then -\(P(E_1 \cup E_2) = P(E_1) + P(E_2)\).
  6. -
- -

Part 3 of the definition implies finite additivity

- -

\[ -P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) -\] -where the \(\{A_i\}\) are mutually exclusive. (Note a more general version of -additivity is used in advanced classes.)

- -
- -
- - -
-

Example consequences

-
-
-
    -
  • \(P(\emptyset) = 0\)
  • -
  • \(P(E) = 1 - P(E^c)\)
  • -
  • \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
  • -
  • if \(A \subset B\) then \(P(A) \leq P(B)\)
  • -
  • \(P\left(A \cup B\right) = 1 - P(A^c \cap B^c)\)
  • -
  • \(P(A \cap B^c) = P(A) - P(A \cap B)\)
  • -
  • \(P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)\)
  • -
  • \(P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)\)
  • -
- -
- -
- - -
-

Example

-
-
-

The National Sleep Foundation (www.sleepfoundation.org) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problem of these sorts?

- -
- -
- - -
-

Example continued

-
-
-

Answer: No, the events are not mutually exclusive. To elaborate, let:

- -

\[ -\begin{eqnarray*} - A_1 & = & \{\mbox{Person has sleep apnea}\} \\ - A_2 & = & \{\mbox{Person has RLS}\} - \end{eqnarray*} -\]

- -

Then

- -

\[ -\begin{eqnarray*} - P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ - & = & 0.13 - \mbox{Probability of having both} - \end{eqnarray*} -\] -Likely, some fraction of the population has both.

- -
- -
- - -
-

Random variables

-
-
-
    -
  • A random variable is a numerical outcome of an experiment.
  • -
  • The random variables that we study will come in two varieties, -discrete or continuous.
  • -
  • Discrete random variables are random variables that take on only a -countable number of possibilities. - -
      -
    • \(P(X = k)\)
    • -
  • -
  • Continuous random variables can take any value on the real line or some subset of the real line. - -
      -
    • \(P(X \in A)\)
    • -
  • -
- -
- -
- - -
-

Examples of variables that can be thought of as random variables

-
-
-
    -
  • The \((0-1)\) outcome of the flip of a coin
  • -
  • The outcome from the roll of a die
  • -
  • The BMI of a subject four years after a baseline measurement
  • -
  • The hypertension status of a subject randomly drawn from a population
  • -
- -
- -
- - -
-

PMF

-
-
-

A probability mass function (pmf) evaluated at a value corresponds to the -probability that a random variable takes that value. To be a valid -pmf, a function \(p\) must satisfy

- -
    -
  1. \(p(x) \geq 0\) for all \(x\)
  2. -
  3. \(\sum_{x} p(x) = 1\)
  4. -
- -

The sum is taken over all of the possible values for \(x\).

- -
- -
- - -
-

Example

-
-
-

Let \(X\) be the result of a coin flip where \(X=0\) represents -tails and \(X = 1\) represents heads. For a fair coin, the PMF is -\[ -p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 -\] -Suppose that we do not know whether or not the coin is fair. Let -\(\theta\) be the probability of a head expressed as a proportion -(between 0 and 1). Then -\[ -p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 -\]

- -
- -
- - -
-

PDF

-
-
-

A probability density function (pdf) is a function associated with -a continuous random variable.

- -

Areas under pdfs correspond to probabilities for that random variable

- -

To be a valid pdf, a function \(f\) must satisfy

- -
    -
  1. \(f(x) \geq 0\) for all \(x\)

  2. -
  3. The area under \(f(x)\) is one.

  4. -
- -
- -
- - -
-

Example

-
-
-

Suppose that the proportion of help calls that get addressed on -a random day by a help line has the density -\[ -f(x) = \left\{\begin{array}{ll} - 2 x & \mbox{ for } 1 > x > 0 \\ - 0 & \mbox{ otherwise} -\end{array} \right. -\]

- -

Is this a mathematically valid density?

- -
- -
- - -
-
x <- c(-0.5, 0, 1, 1, 1.5)
-y <- c(0, 0, 2, 0, 0)
-plot(x, y, lwd = 3, frame = FALSE, type = "l")
-
- -

plot of chunk unnamed-chunk-1

- -
- -
- - -
-

Example continued

-
-
-

What is the probability that 75% or fewer of calls get addressed?

- -

plot of chunk unnamed-chunk-2

- -
- -
- - -
-
1.5 * 0.75/2
-
- -
## [1] 0.5625
-
- -
pbeta(0.75, 2, 1)
-
- -
## [1] 0.5625
-
- -
- -
- - -
-

CDF and survival function

-
-
-
    -
  • The cumulative distribution function (CDF) of a random variable \(X\) is defined as the function -\[ -F(x) = P(X \leq x) -\]
  • -
  • This definition applies regardless of whether \(X\) is discrete or continuous.
  • -
  • The survival function of a random variable \(X\) is defined as -\[ -S(x) = P(X > x) -\]
  • -
  • Notice that \(S(x) = 1 - F(x)\)
  • -
  • For continuous random variables, the PDF is the derivative of the CDF
  • -
- -
- -
- - -
-

Example

-
-
-

What are the survival function and CDF from the density considered before?

- -

For \(1 \geq x \geq 0\) -\[ -F(x) = P(X \leq x) = \frac{1}{2} \mbox{Base} \times \mbox{Height} = \frac{1}{2} (x) \times (2 x) = x^2 -\]

- -

\[ -S(x) = 1 - x^2 -\]

- -
pbeta(c(0.4, 0.5, 0.6), 2, 1)
-
- -
## [1] 0.16 0.25 0.36
-
- -
- -
- - -
-

Quantiles

-
-
-
    -
  • The \(\alpha^{th}\) quantile of a distribution with distribution function \(F\) is the point \(x_\alpha\) so that -\[ -F(x_\alpha) = \alpha -\]
  • -
  • A percentile is simply a quantile with \(\alpha\) expressed as a percent
  • -
  • The median is the \(50^{th}\) percentile
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • We want to solve \(0.5 = F(x) = x^2\)
  • -
  • Resulting in the solution
  • -
- -
sqrt(0.5)
-
- -
## [1] 0.7071
-
- -
    -
  • Therefore, the median proportion of calls answered on a random day is about 0.7071.
  • -
  • R can approximate quantiles for you for common distributions
  • -
- -
qbeta(0.5, 2, 1)
-
- -
## [1] 0.7071
-
- -
- -
- - -
-

Summary

-
-
-
    -
  • You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?"
  • -
  • What we're referring to are population quantities. Therefore, the median being -discussed is the population median.
  • -
  • A probability model connects the data to the population using assumptions.
  • -
  • Therefore, the median we're discussing is the estimand; the sample median will be the estimator.
  • -
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Probability + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Probability

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Notation

+
+
+
    +
  • The sample space, \(\Omega\), is the collection of possible outcomes of an experiment + +
      +
    • Example: die roll \(\Omega = \{1,2,3,4,5,6\}\)
    • +
  • +
  • An event, say \(E\), is a subset of \(\Omega\) + +
      +
    • Example: die roll is even \(E = \{2,4,6\}\)
    • +
  • +
  • An elementary or simple event is a particular result +of an experiment + +
      +
    • Example: die roll is a four, \(\omega = 4\)
    • +
  • +
  • \(\emptyset\) is called the null event or the empty set
  • +
+ +
+ +
+ + +
+

Interpretation of set operations

+
+
+

Normal set operations have particular interpretations in this setting

+ +
    +
  1. \(\omega \in E\) implies that \(E\) occurs when \(\omega\) occurs
  2. +
  3. \(\omega \not\in E\) implies that \(E\) does not occur when \(\omega\) occurs
  4. +
  5. \(E \subset F\) implies that the occurrence of \(E\) implies the occurrence of \(F\)
  6. +
  7. \(E \cap F\) is the event that both \(E\) and \(F\) occur
  8. +
  9. \(E \cup F\) is the event that at least one of \(E\) or \(F\) occurs
  10. +
  11. \(E \cap F=\emptyset\) means that \(E\) and \(F\) are mutually exclusive, or cannot both occur
  12. +
  13. \(E^c\) or \(\bar E\) is the event that \(E\) does not occur
  14. +
+ +
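The set interpretations above can be tried directly in R. This is a minimal sketch for a single die roll; the names `omega`, `E`, and `G` are illustrative (here `G` plays the role of \(F\), since `F` is shorthand for `FALSE` in R).

```r
# Sample space and two events for one roll of a die (illustrative names).
omega <- 1:6
E <- c(2, 4, 6)      # roll is even
G <- c(4, 5, 6)      # roll is at least four

4 %in% E             # the outcome 4 belongs to E, so E occurs
intersect(E, G)      # outcomes where both E and G occur
union(E, G)          # outcomes where at least one of E or G occurs
setdiff(omega, E)    # complement of E: E does not occur

# E and "roll is odd" share no outcomes, so they are mutually exclusive
length(intersect(E, c(1, 3, 5))) == 0
```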
+ +
+ + +
+

Probability

+
+
+

A probability measure, \(P\), is a function defined on the collection of possible events such that the following hold:

+ +
    +
  1. For an event \(E\subset \Omega\), \(0 \leq P(E) \leq 1\)
  2. +
  3. \(P(\Omega) = 1\)
  4. +
  5. If \(E_1\) and \(E_2\) are mutually exclusive events, then +\(P(E_1 \cup E_2) = P(E_1) + P(E_2)\).
  6. +
+ +

Part 3 of the definition implies finite additivity

+ +

\[ +P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) +\] +where the \(\{A_i\}\) are mutually exclusive. (Note a more general version of +additivity is used in advanced classes.)

+ +
+ +
+ + +
+

Example consequences

+
+
+
    +
  • \(P(\emptyset) = 0\)
  • +
  • \(P(E) = 1 - P(E^c)\)
  • +
  • \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
  • +
  • if \(A \subset B\) then \(P(A) \leq P(B)\)
  • +
  • \(P\left(A \cup B\right) = 1 - P(A^c \cap B^c)\)
  • +
  • \(P(A \cap B^c) = P(A) - P(A \cap B)\)
  • +
  • \(P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)\)
  • +
  • \(P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)\)
  • +
+ +
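A few of these consequences can be checked numerically. The sketch below assumes equally likely outcomes on a die, so that \(P(A)\) is the number of outcomes in \(A\) divided by six; the helper `P` and the events `A` and `B` are hypothetical choices for illustration.

```r
# Equally likely outcomes on a die: P(A) = |A| / |Omega| (an assumed model).
omega <- 1:6
P <- function(A) length(A) / length(omega)

A <- c(1, 2, 3)
B <- c(2, 4, 6)

# Inclusion-exclusion: P(A u B) = P(A) + P(B) - P(A n B)
all.equal(P(union(A, B)), P(A) + P(B) - P(intersect(A, B)))

# Complement rule: P(A) = 1 - P(A^c)
all.equal(P(A), 1 - P(setdiff(omega, A)))

# De Morgan form: P(A u B) = 1 - P(A^c n B^c)
all.equal(P(union(A, B)),
          1 - P(intersect(setdiff(omega, A), setdiff(omega, B))))

# Union bounds: max(P(A), P(B)) <= P(A u B) <= P(A) + P(B)
P(union(A, B)) >= max(P(A), P(B))
P(union(A, B)) <= P(A) + P(B)
```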
+ +
+ + +
+

Example

+
+
+

The National Sleep Foundation (www.sleepfoundation.org) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problem of these sorts?

+ +
+ +
+ + +
+

Example continued

+
+
+

Answer: No, the events are not mutually exclusive. To elaborate, let:

+ +

\[ +\begin{eqnarray*} + A_1 & = & \{\mbox{Person has sleep apnea}\} \\ + A_2 & = & \{\mbox{Person has RLS}\} + \end{eqnarray*} +\]

+ +

Then

+ +

\[ +\begin{eqnarray*} + P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ + & = & 0.13 - \mbox{Probability of having both} + \end{eqnarray*} +\] +Likely, some fraction of the population has both.
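Although the overlap is unknown, the union can still be bounded. The sketch below uses the two reported figures, treats them as referring to the same population (as the example does), and, purely as a hypothetical, shows what the answer would be if the two conditions were independent. Independence is an assumption made here for illustration, not a claim from the source.

```r
p_apnea <- 0.03   # P(A_1), reported figure
p_rls   <- 0.10   # P(A_2), reported figure

# Whatever the overlap is:
# max(P(A_1), P(A_2)) <= P(A_1 u A_2) <= P(A_1) + P(A_2)
lower <- max(p_apnea, p_rls)
upper <- p_apnea + p_rls
c(lower = lower, upper = upper)

# Hypothetical case: under independence the overlap is the product.
p_both_indep <- p_apnea * p_rls
p_union_indep <- p_apnea + p_rls - p_both_indep
p_union_indep
```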

+ +
+ +
+ + +
+

Random variables

+
+
+
    +
  • A random variable is a numerical outcome of an experiment.
  • +
  • The random variables that we study will come in two varieties, +discrete or continuous.
  • +
  • Discrete random variables are random variables that take on only a +countable number of possibilities. + +
      +
    • \(P(X = k)\)
    • +
  • +
  • Continuous random variables can take any value on the real line or some subset of the real line. + +
      +
    • \(P(X \in A)\)
    • +
  • +
+ +
+ +
+ + +
+

Examples of variables that can be thought of as random variables

+
+
+
    +
  • The \((0-1)\) outcome of the flip of a coin
  • +
  • The outcome from the roll of a die
  • +
  • The BMI of a subject four years after a baseline measurement
  • +
  • The hypertension status of a subject randomly drawn from a population
  • +
+ +
+ +
+ + +
+

PMF

+
+
+

A probability mass function (pmf) evaluated at a value corresponds to the +probability that a random variable takes that value. To be a valid +pmf, a function \(p\) must satisfy

+ +
    +
  1. \(p(x) \geq 0\) for all \(x\)
  2. +
  3. \(\sum_{x} p(x) = 1\)
  4. +
+ +

The sum is taken over all of the possible values for \(x\).

+ +
+ +
+ + +
+

Example

+
+
+

Let \(X\) be the result of a coin flip where \(X=0\) represents +tails and \(X = 1\) represents heads. For a fair coin, the PMF is +\[ +p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 +\] +Suppose that we do not know whether or not the coin is fair. Let +\(\theta\) be the probability of a head expressed as a proportion +(between 0 and 1). Then +\[ +p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 +\]
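The coin-flip PMF can be written as a small R function and checked against the two PMF conditions. This is a sketch: `bern_pmf` is a hypothetical helper name, and `theta = 0.3` is an arbitrary illustrative value.

```r
# PMF from the example: p(x) = theta^x (1 - theta)^(1 - x) for x in {0, 1}
bern_pmf <- function(x, theta) theta^x * (1 - theta)^(1 - x)

theta <- 0.3          # illustrative value; any number in [0, 1] works
x <- c(0, 1)
p <- bern_pmf(x, theta)

all(p >= 0)                                        # condition 1: non-negative
all.equal(sum(p), 1)                               # condition 2: sums to one
all.equal(p, dbinom(x, size = 1, prob = theta))    # agrees with R's binomial, size 1
```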

+ +
+ +
+ + +
+

PDF

+
+
+

A probability density function (pdf) is a function associated with +a continuous random variable.

+ +

Areas under pdfs correspond to probabilities for that random variable

+ +

To be a valid pdf, a function \(f\) must satisfy

+ +
    +
  1. \(f(x) \geq 0\) for all \(x\)

  2. +
  3. The area under \(f(x)\) is one.

  4. +
+ +
+ +
+ + +
+

Example

+
+
+

Suppose that the proportion of help calls that get addressed on +a random day by a help line has the density +\[ +f(x) = \left\{\begin{array}{ll} + 2 x & \mbox{ for } 1 > x > 0 \\ + 0 & \mbox{ otherwise} +\end{array} \right. +\]

+ +

Is this a mathematically valid density?
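One way to answer the question is to check the two pdf conditions numerically. The sketch below encodes the density as a hypothetical R function `f` and uses `integrate` for the area; the grid check is only a spot check of non-negativity, not a proof.

```r
# Density from the example: 2x on (0, 1), zero elsewhere
f <- function(x) ifelse(x > 0 & x < 1, 2 * x, 0)

# Condition 1: non-negative (spot check on a grid of points)
grid <- seq(-0.5, 1.5, by = 0.01)
all(f(grid) >= 0)

# Condition 2: total area is one
area <- integrate(f, lower = 0, upper = 1)$value
area

# Same area from geometry: a triangle with base 1 and height 2
0.5 * 1 * 2
```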

+ +
+ +
+ + +
+
x <- c(-0.5, 0, 1, 1, 1.5)
+y <- c(0, 0, 2, 0, 0)
+plot(x, y, lwd = 3, frame = FALSE, type = "l")
+
+ +

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

Example continued

+
+
+

What is the probability that 75% or fewer of calls get addressed?

+ +

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+
1.5 * 0.75/2
+
+ +
## [1] 0.5625
+
+ +
pbeta(0.75, 2, 1)
+
+ +
## [1] 0.5625
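The call to `pbeta` works because the Beta(2, 1) density is exactly \(2x\) on \((0, 1)\). A short sketch confirming this, and recomputing the area by numerical integration (the evaluation points in `x` are arbitrary):

```r
# The Beta(2, 1) density equals 2x on (0, 1)
x <- c(0.1, 0.5, 0.75, 0.9)
all.equal(dbeta(x, shape1 = 2, shape2 = 1), 2 * x)

# Area under 2x from 0 to 0.75, computed numerically
area_075 <- integrate(function(t) 2 * t, lower = 0, upper = 0.75)$value
area_075
```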
+
+ +
+ +
+ + +
+

CDF and survival function

+
+
+
    +
  • The cumulative distribution function (CDF) of a random variable \(X\) is defined as the function +\[ +F(x) = P(X \leq x) +\]
  • +
  • This definition applies regardless of whether \(X\) is discrete or continuous.
  • +
  • The survival function of a random variable \(X\) is defined as +\[ +S(x) = P(X > x) +\]
  • +
  • Notice that \(S(x) = 1 - F(x)\)
  • +
  • For continuous random variables, the PDF is the derivative of the CDF
  • +
+ +
+ +
+ + +
+

Example

+
+
+

What are the survival function and CDF from the density considered before?

+ +

For \(1 \geq x \geq 0\) +\[ +F(x) = P(X \leq x) = \frac{1}{2} \mbox{Base} \times \mbox{Height} = \frac{1}{2} (x) \times (2 x) = x^2 +\]

+ +

\[ +S(x) = 1 - x^2 +\]

+ +
pbeta(c(0.4, 0.5, 0.6), 2, 1)
+
+ +
## [1] 0.16 0.25 0.36
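The survival function and the derivative relationship can be checked the same way. In this sketch, `lower.tail = FALSE` asks `pbeta` for \(P(X > x)\), and a central difference with a small step `h` (an arbitrary choice) approximates the derivative of the CDF.

```r
x <- c(0.4, 0.5, 0.6)

# Survival function S(x) = 1 - F(x) = 1 - x^2
surv_formula <- 1 - x^2
surv_pbeta <- pbeta(x, 2, 1, lower.tail = FALSE)
rbind(surv_formula, surv_pbeta)

# The PDF is the derivative of the CDF: a central difference of F
# should be close to f(x) = 2x
h <- 1e-6
cdf_slope <- (pbeta(x + h, 2, 1) - pbeta(x - h, 2, 1)) / (2 * h)
rbind(cdf_slope, pdf = 2 * x)
```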
+
+ +
+ +
+ + +
+

Quantiles

+
+
+
    +
  • The \(\alpha^{th}\) quantile of a distribution with distribution function \(F\) is the point \(x_\alpha\) so that +\[ +F(x_\alpha) = \alpha +\]
  • +
  • A percentile is simply a quantile with \(\alpha\) expressed as a percent
  • +
  • The median is the \(50^{th}\) percentile
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • We want to solve \(0.5 = F(x) = x^2\)
  • +
  • Resulting in the solution
  • +
+ +
sqrt(0.5)
+
+ +
## [1] 0.7071
+
+ +
    +
  • Therefore, the median proportion of calls answered on a random day is about 0.7071.
  • +
  • R can approximate quantiles for you for common distributions
  • +
+ +
qbeta(0.5, 2, 1)
+
+ +
## [1] 0.7071
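The same reasoning gives any quantile of this distribution: solving \(x_\alpha^2 = \alpha\) yields \(x_\alpha = \sqrt{\alpha}\). The sketch below compares that closed form with `qbeta`, and also shows a numerical route with `uniroot` for cases where the CDF has no convenient inverse. The helper name `cdf` and the choice of `alpha` values are illustrative.

```r
alpha <- c(0.25, 0.5, 0.75)

# Closed form: F(x) = x^2 = alpha  =>  x_alpha = sqrt(alpha)
q_formula <- sqrt(alpha)
q_beta <- qbeta(alpha, 2, 1)
rbind(q_formula, q_beta)

# Numerical route: find the root of F(x) - 0.5 on [0, 1]
cdf <- function(x) x^2
med <- uniroot(function(x) cdf(x) - 0.5, interval = c(0, 1), tol = 1e-10)$root
med
```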
+
+ +
+ +
+ + +
+

Summary

+
+
+
    +
  • You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?"
  • +
  • What we're referring to are population quantities. Therefore, the median being +discussed is the population median.
  • +
  • A probability model connects the data to the population using assumptions.
  • +
  • Therefore, the median we're discussing is the estimand; the sample median will be the estimator.
  • +
+ +
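The distinction between estimand and estimator can be illustrated by simulation. This sketch assumes the Beta(2, 1) model from the help-line example is the true population; the simulated `calls` are hypothetical data, and the seed and sample size are arbitrary.

```r
set.seed(1)                      # arbitrary seed for reproducibility

# Estimand: the population median under the assumed model, sqrt(0.5)
pop_median <- qbeta(0.5, 2, 1)

# Estimator: the sample median of simulated daily proportions
calls <- rbeta(10000, 2, 1)
sample_median <- median(calls)

c(population = pop_median, sample = sample_median)
```

With a large sample the two numbers should be close, but only the first is a property of the probability model; the second changes with every new data set.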
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/01_02_Probability/index.md b/06_StatisticalInference/01_02_Probability/index.md index 61a470797..6b6231410 100644 --- a/06_StatisticalInference/01_02_Probability/index.md +++ b/06_StatisticalInference/01_02_Probability/index.md @@ -1,311 +1,310 @@ ---- -title : Probability -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Notation - -- The **sample space**, $\Omega$, is the collection of possible outcomes of an experiment - - Example: die roll $\Omega = \{1,2,3,4,5,6\}$ -- An **event**, say $E$, is a subset of $\Omega$ - - Example: die roll is even $E = \{2,4,6\}$ -- An **elementary** or **simple** event is a particular result - of an experiment - - Example: die roll is a four, $\omega = 4$ -- $\emptyset$ is called the **null event** or the **empty set** - ---- - -## Interpretation of set operations - -Normal set operations have particular interpretations in this setting - -1. $\omega \in E$ implies that $E$ occurs when $\omega$ occurs -2. $\omega \not\in E$ implies that $E$ does not occur when $\omega$ occurs -3. $E \subset F$ implies that the occurrence of $E$ implies the occurrence of $F$ -4. $E \cap F$ implies the event that both $E$ and $F$ occur -5. $E \cup F$ implies the event that at least one of $E$ or $F$ occur -6. $E \cap F=\emptyset$ means that $E$ and $F$ are **mutually exclusive**, or cannot both occur -7. 
$E^c$ or $\bar E$ is the event that $E$ does not occur - ---- - -## Probability - -A **probability measure**, $P$, is a function from the collection of possible events so that the following hold - -1. For an event $E\subset \Omega$, $0 \leq P(E) \leq 1$ -2. $P(\Omega) = 1$ -3. If $E_1$ and $E_2$ are mutually exclusive events - $P(E_1 \cup E_2) = P(E_1) + P(E_2)$. - -Part 3 of the definition implies **finite additivity** - -$$ -P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) -$$ -where the $\{A_i\}$ are mutually exclusive. (Note a more general version of -additivity is used in advanced classes.) - - ---- - - -## Example consequences - -- $P(\emptyset) = 0$ -- $P(E) = 1 - P(E^c)$ -- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ -- if $A \subset B$ then $P(A) \leq P(B)$ -- $P\left(A \cup B\right) = 1 - P(A^c \cap B^c)$ -- $P(A \cap B^c) = P(A) - P(A \cap B)$ -- $P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)$ -- $P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)$ - ---- - -## Example - -The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts? - ---- - -## Example continued - -Answer: No, the events are not mutually exclusive. To elaborate let: - -$$ -\begin{eqnarray*} - A_1 & = & \{\mbox{Person has sleep apnea}\} \\ - A_2 & = & \{\mbox{Person has RLS}\} - \end{eqnarray*} -$$ - -Then - -$$ -\begin{eqnarray*} - P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ - & = & 0.13 - \mbox{Probability of having both} - \end{eqnarray*} -$$ -Likely, some fraction of the population has both. - ---- - -## Random variables - -- A **random variable** is a numerical outcome of an experiment. -- The random variables that we study will come in two varieties, - **discrete** or **continuous**. 
-- Discrete random variable are random variables that take on only a -countable number of possibilities. - * $P(X = k)$ -- Continuous random variable can take any value on the real line or some subset of the real line. - * $P(X \in A)$ - ---- - -## Examples of variables that can be thought of as random variables - -- The $(0-1)$ outcome of the flip of a coin -- The outcome from the roll of a die -- The BMI of a subject four years after a baseline measurement -- The hypertension status of a subject randomly drawn from a population - ---- - -## PMF - -A probability mass function evaluated at a value corresponds to the -probability that a random variable takes that value. To be a valid -pmf a function, $p$, must satisfy - - 1. $p(x) \geq 0$ for all $x$ - 2. $\sum_{x} p(x) = 1$ - -The sum is taken over all of the possible values for $x$. - ---- - -## Example - -Let $X$ be the result of a coin flip where $X=0$ represents -tails and $X = 1$ represents heads. -$$ -p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 -$$ -Suppose that we do not know whether or not the coin is fair; Let -$\theta$ be the probability of a head expressed as a proportion -(between 0 and 1). -$$ -p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 -$$ - ---- - -## PDF - -A probability density function (pdf), is a function associated with -a continuous random variable - - *Areas under pdfs correspond to probabilities for that random variable* - -To be a valid pdf, a function $f$ must satisfy - -1. $f(x) \geq 0$ for all $x$ - -2. The area under $f(x)$ is one. - ---- -## Example - -Suppose that the proportion of help calls that get addressed in -a random day by a help line is given by -$$ -f(x) = \left\{\begin{array}{ll} - 2 x & \mbox{ for } 1 > x > 0 \\ - 0 & \mbox{ otherwise} -\end{array} \right. -$$ - -Is this a mathematically valid density? 
- ---- - - -```r -x <- c(-0.5, 0, 1, 1, 1.5) -y <- c(0, 0, 2, 0, 0) -plot(x, y, lwd = 3, frame = FALSE, type = "l") -``` - -![plot of chunk unnamed-chunk-1](assets/fig/unnamed-chunk-1.png) - - ---- - -## Example continued - -What is the probability that 75% or fewer of calls get addressed? - -![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) - - ---- - -```r -1.5 * 0.75/2 -``` - -``` -## [1] 0.5625 -``` - -```r -pbeta(0.75, 2, 1) -``` - -``` -## [1] 0.5625 -``` - ---- - -## CDF and survival function - -- The **cumulative distribution function** (CDF) of a random variable $X$ is defined as the function -$$ -F(x) = P(X \leq x) -$$ -- This definition applies regardless of whether $X$ is discrete or continuous. -- The **survival function** of a random variable $X$ is defined as -$$ -S(x) = P(X > x) -$$ -- Notice that $S(x) = 1 - F(x)$ -- For continuous random variables, the PDF is the derivative of the CDF - ---- - -## Example - -What are the survival function and CDF from the density considered before? - -For $1 \geq x \geq 0$ -$$ -F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 -$$ - -$$ -S(x) = 1 - x^2 -$$ - - -```r -pbeta(c(0.4, 0.5, 0.6), 2, 1) -``` - -``` -## [1] 0.16 0.25 0.36 -``` - - ---- - -## Quantiles - -- The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that -$$ -F(x_\alpha) = \alpha -$$ -- A **percentile** is simply a quantile with $\alpha$ expressed as a percent -- The **median** is the $50^{th}$ percentile - ---- -## Example -- We want to solve $0.5 = F(x) = x^2$ -- Resulting in the solution - -```r -sqrt(0.5) -``` - -``` -## [1] 0.7071 -``` - -- Therefore, about 0.7071 of calls being answered on a random day is the median. 
-- R can approximate quantiles for you for common distributions - - -```r -qbeta(0.5, 2, 1) -``` - -``` -## [1] 0.7071 -``` - - ---- - -## Summary - -- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" -- We're referring to are **population quantities**. Therefore, the median being - discussed is the **population median**. -- A probability model connects the data to the population using assumptions. -- Therefore the median we're discussing is the **estimand**, the sample median will be the **estimator** - +--- +title : Probability +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Notation + +- The **sample space**, $\Omega$, is the collection of possible outcomes of an experiment + - Example: die roll $\Omega = \{1,2,3,4,5,6\}$ +- An **event**, say $E$, is a subset of $\Omega$ + - Example: die roll is even $E = \{2,4,6\}$ +- An **elementary** or **simple** event is a particular result + of an experiment + - Example: die roll is a four, $\omega = 4$ +- $\emptyset$ is called the **null event** or the **empty set** + +--- + +## Interpretation of set operations + +Normal set operations have particular interpretations in this setting + +1. $\omega \in E$ implies that $E$ occurs when $\omega$ occurs +2. $\omega \not\in E$ implies that $E$ does not occur when $\omega$ occurs +3. $E \subset F$ implies that the occurrence of $E$ implies the occurrence of $F$ +4. $E \cap F$ implies the event that both $E$ and $F$ occur +5. 
$E \cup F$ implies the event that at least one of $E$ or $F$ occur +6. $E \cap F=\emptyset$ means that $E$ and $F$ are **mutually exclusive**, or cannot both occur +7. $E^c$ or $\bar E$ is the event that $E$ does not occur + +--- + +## Probability + +A **probability measure**, $P$, is a function from the collection of possible events so that the following hold + +1. For an event $E\subset \Omega$, $0 \leq P(E) \leq 1$ +2. $P(\Omega) = 1$ +3. If $E_1$ and $E_2$ are mutually exclusive events + $P(E_1 \cup E_2) = P(E_1) + P(E_2)$. + +Part 3 of the definition implies **finite additivity** + +$$ +P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) +$$ +where the $\{A_i\}$ are mutually exclusive. (Note a more general version of +additivity is used in advanced classes.) + + +--- + + +## Example consequences + +- $P(\emptyset) = 0$ +- $P(E) = 1 - P(E^c)$ +- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ +- if $A \subset B$ then $P(A) \leq P(B)$ +- $P\left(A \cup B\right) = 1 - P(A^c \cap B^c)$ +- $P(A \cap B^c) = P(A) - P(A \cap B)$ +- $P(\cup_{i=1}^n E_i) \leq \sum_{i=1}^n P(E_i)$ +- $P(\cup_{i=1}^n E_i) \geq \max_i P(E_i)$ + +--- + +## Example + +The National Sleep Foundation ([www.sleepfoundation.org](http://www.sleepfoundation.org/)) reports that around 3% of the American population has sleep apnea. They also report that around 10% of the North American and European population has restless leg syndrome. Does this imply that 13% of people will have at least one sleep problems of these sorts? + +--- + +## Example continued + +Answer: No, the events are not mutually exclusive. To elaborate let: + +$$ +\begin{eqnarray*} + A_1 & = & \{\mbox{Person has sleep apnea}\} \\ + A_2 & = & \{\mbox{Person has RLS}\} + \end{eqnarray*} +$$ + +Then + +$$ +\begin{eqnarray*} + P(A_1 \cup A_2 ) & = & P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ + & = & 0.13 - \mbox{Probability of having both} + \end{eqnarray*} +$$ +Likely, some fraction of the population has both. 
+ +--- + +## Random variables + +- A **random variable** is a numerical outcome of an experiment. +- The random variables that we study will come in two varieties, + **discrete** or **continuous**. +- Discrete random variable are random variables that take on only a +countable number of possibilities. + * $P(X = k)$ +- Continuous random variable can take any value on the real line or some subset of the real line. + * $P(X \in A)$ + +--- + +## Examples of variables that can be thought of as random variables + +- The $(0-1)$ outcome of the flip of a coin +- The outcome from the roll of a die +- The BMI of a subject four years after a baseline measurement +- The hypertension status of a subject randomly drawn from a population + +--- + +## PMF + +A probability mass function evaluated at a value corresponds to the +probability that a random variable takes that value. To be a valid +pmf a function, $p$, must satisfy + + 1. $p(x) \geq 0$ for all $x$ + 2. $\sum_{x} p(x) = 1$ + +The sum is taken over all of the possible values for $x$. + +--- + +## Example + +Let $X$ be the result of a coin flip where $X=0$ represents +tails and $X = 1$ represents heads. +$$ +p(x) = (1/2)^{x} (1/2)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ +Suppose that we do not know whether or not the coin is fair; Let +$\theta$ be the probability of a head expressed as a proportion +(between 0 and 1). +$$ +p(x) = \theta^{x} (1 - \theta)^{1-x} ~~\mbox{ for }~~x = 0,1 +$$ + +--- + +## PDF + +A probability density function (pdf), is a function associated with +a continuous random variable + + *Areas under pdfs correspond to probabilities for that random variable* + +To be a valid pdf, a function $f$ must satisfy + +1. $f(x) \geq 0$ for all $x$ + +2. The area under $f(x)$ is one. 
+ +--- +## Example + +Suppose that the proportion of help calls that get addressed in +a random day by a help line is given by +$$ +f(x) = \left\{\begin{array}{ll} + 2 x & \mbox{ for } 1 > x > 0 \\ + 0 & \mbox{ otherwise} +\end{array} \right. +$$ + +Is this a mathematically valid density? + +--- + + +```r +x <- c(-0.5, 0, 1, 1, 1.5) +y <- c(0, 0, 2, 0, 0) +plot(x, y, lwd = 3, frame = FALSE, type = "l") +``` + +plot of chunk unnamed-chunk-1 + + +--- + +## Example continued + +What is the probability that 75% or fewer of calls get addressed? + +plot of chunk unnamed-chunk-2 + + +--- + +```r +1.5 * 0.75/2 +``` + +``` +## [1] 0.5625 +``` + +```r +pbeta(0.75, 2, 1) +``` + +``` +## [1] 0.5625 +``` + +--- + +## CDF and survival function + +- The **cumulative distribution function** (CDF) of a random variable $X$ is defined as the function +$$ +F(x) = P(X \leq x) +$$ +- This definition applies regardless of whether $X$ is discrete or continuous. +- The **survival function** of a random variable $X$ is defined as +$$ +S(x) = P(X > x) +$$ +- Notice that $S(x) = 1 - F(x)$ +- For continuous random variables, the PDF is the derivative of the CDF + +--- + +## Example + +What are the survival function and CDF from the density considered before? + +For $1 \geq x \geq 0$ +$$ +F(x) = P(X \leq x) = \frac{1}{2} Base \times Height = \frac{1}{2} (x) \times (2 x) = x^2 +$$ + +$$ +S(x) = 1 - x^2 +$$ + + +```r +pbeta(c(0.4, 0.5, 0.6), 2, 1) +``` + +``` +## [1] 0.16 0.25 0.36 +``` + + +--- + +## Quantiles + +- The $\alpha^{th}$ **quantile** of a distribution with distribution function $F$ is the point $x_\alpha$ so that +$$ +F(x_\alpha) = \alpha +$$ +- A **percentile** is simply a quantile with $\alpha$ expressed as a percent +- The **median** is the $50^{th}$ percentile + +--- +## Example +- We want to solve $0.5 = F(x) = x^2$ +- Resulting in the solution + +```r +sqrt(0.5) +``` + +``` +## [1] 0.7071 +``` + +- Therefore, about 0.7071 of calls being answered on a random day is the median. 
+- R can approximate quantiles for you for common distributions + + +```r +qbeta(0.5, 2, 1) +``` + +``` +## [1] 0.7071 +``` + + +--- + +## Summary + +- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?" +- We're referring to are **population quantities**. Therefore, the median being + discussed is the **population median**. +- A probability model connects the data to the population using assumptions. +- Therefore the median we're discussing is the **estimand**, the sample median will be the **estimator** diff --git a/06_StatisticalInference/01_02_Probability/index.pdf b/06_StatisticalInference/01_02_Probability/index.pdf index b431ce394..fddebae9e 100644 Binary files a/06_StatisticalInference/01_02_Probability/index.pdf and b/06_StatisticalInference/01_02_Probability/index.pdf differ diff --git a/06_StatisticalInference/01_03_Expectations/assets/fig/lsm.png b/06_StatisticalInference/01_03_Expectations/assets/fig/lsm.png new file mode 100644 index 000000000..d37f19c84 Binary files /dev/null and b/06_StatisticalInference/01_03_Expectations/assets/fig/lsm.png differ diff --git a/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..d55969896 Binary files /dev/null and b/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..94882ce30 Binary files /dev/null and b/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-3.png new file mode 100644 index 
000000000..15a643336 Binary files /dev/null and b/06_StatisticalInference/01_03_Expectations/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/01_03_Expectations/index.html b/06_StatisticalInference/01_03_Expectations/index.html index 04a508ac3..72dd4b7b5 100644 --- a/06_StatisticalInference/01_03_Expectations/index.html +++ b/06_StatisticalInference/01_03_Expectations/index.html @@ -1,549 +1,552 @@ - - - - Expected values - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Expected values

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Expected values

-
-
-
    -
  • The expected value or mean of a random variable is the center of its distribution
  • -
  • For discrete random variable \(X\) with PMF \(p(x)\), it is defined as follows -\[ -E[X] = \sum_x xp(x). -\] -where the sum is taken over the possible values of \(x\)
  • -
  • \(E[X]\) represents the center of mass of a collection of locations and weights, \(\{x, p(x)\}\)
  • -
- -
- -
- - -
-

Example

-
-
-

Find the center of mass of the bars

- -

plot of chunk unnamed-chunk-1

- -
- -
- - -
-

Using manipulate

-
-
-
library(manipulate)
-myHist <- function(mu){
-  hist(galton$child,col="blue",breaks=100)
-  lines(c(mu, mu), c(0, 150),col="red",lwd=5)
-  mse <- mean((galton$child - mu)^2)
-  text(63, 150, paste("mu = ", mu))
-  text(63, 140, paste("Imbalance = ", round(mse, 2)))
-}
-manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
-
- -
- -
- - -
-

The center of mass is the empirical mean

-
-
-
hist(galton$child, col = "blue", breaks = 100)
-meanChild <- mean(galton$child)
-lines(rep(meanChild, 100), seq(0, 150, length = 100), col = "red", lwd = 5)
-
- -

plot of chunk lsm

- -
- -
- - -
-

Example

-
-
-
    -
  • Suppose a coin is flipped and \(X\) is declared \(0\) or \(1\) corresponding to a head or a tail, respectively
  • -
  • What is the expected value of \(X\)? -\[ -E[X] = .5 \times 0 + .5 \times 1 = .5 -\]
  • -
  • Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be \(.5\)
  • -
- -

plot of chunk unnamed-chunk-2

- -
- -
- - -
-

Example

-
-
-
    -
  • Suppose that a die is rolled and \(X\) is the number face up
  • -
  • What is the expected value of \(X\)? -\[ -E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + -3 \times \frac{1}{6} + 4 \times \frac{1}{6} + -5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 -\]
  • -
  • Again, the geometric argument makes this answer obvious without calculation.
  • -
- -
- -
- - -
-

Continuous random variables

-
-
-
    -
  • For a continuous random variable, \(X\), with density, \(f\), the expected -value is defined as follows -\[ -E[X] = \mbox{the area under the function}~~~ t f(t) -\]
  • -
  • This definition borrows from the definition of center of mass for a continuous body
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Consider a density where \(f(x) = 1\) for \(x\) between zero and one
  • -
  • (Is this a valid density?)
  • -
  • Suppose that \(X\) follows this density; what is its expected value?
    -plot of chunk unnamed-chunk-3
  • -
- -
- -
- - -
-

Rules about expected values

-
-
-
    -
  • The expected value is a linear operator
  • -
  • If \(a\) and \(b\) are not random and \(X\) and \(Y\) are two random variables then - -
      -
    • \(E[aX + b] = a E[X] + b\)
    • -
    • \(E[X + Y] = E[X] + E[Y]\)
    • -
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • You flip a coin, \(X\) and simulate a uniform random number \(Y\), what is the expected value of their sum? -\[ -E[X + Y] = E[X] + E[Y] = .5 + .5 = 1 -\]
  • -
  • Another example, you roll a die twice. What is the expected value of the average?
  • -
  • Let \(X_1\) and \(X_2\) be the results of the two rolls -\[ -E[(X_1 + X_2) / 2] = \frac{1}{2}(E[X_1] + E[X_2]) -= \frac{1}{2}(3.5 + 3.5) = 3.5 -\]
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  1. Let \(X_i\) for \(i=1,\ldots,n\) be a collection of random variables, each from a distribution with mean \(\mu\)
  2. -
  3. Calculate the expected value of the sample average of the \(X_i\) -\[ -\begin{eqnarray*} -E\left[ \frac{1}{n}\sum_{i=1}^n X_i\right] -& = & \frac{1}{n} E\left[\sum_{i=1}^n X_i\right] \\ -& = & \frac{1}{n} \sum_{i=1}^n E\left[X_i\right] \\ -& = & \frac{1}{n} \sum_{i=1}^n \mu = \mu. -\end{eqnarray*} -\]
  4. -
- -
- -
- - -
-

Remark

-
-
-
    -
  • Therefore, the expected value of the sample mean is the population mean that it's trying to estimate
  • -
  • When the expected value of an estimator is what it's trying to estimate, we say that the estimator is unbiased
  • -
- -
- -
- - -
-

The variance

-
-
-
    -
  • The variance of a random variable is a measure of spread
  • -
  • If \(X\) is a random variable with mean \(\mu\), the variance of \(X\) is defined as
  • -
- -

\[ -Var(X) = E[(X - \mu)^2] -\]

- -

the expected (squared) distance from the mean

- -
    -
  • Densities with a higher variance are more spread out than densities with a lower variance
  • -
- -
- -
- - -
-
    -
  • Convenient computational form -\[ -Var(X) = E[X^2] - E[X]^2 -\]
  • -
  • If \(a\) is constant then \(Var(aX) = a^2 Var(X)\)
  • -
  • The square root of the variance is called the standard deviation
  • -
  • The standard deviation has the same units as \(X\)
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • What's the variance of the result of a toss of a die?

    - -
      -
    • \(E[X] = 3.5\)
    • -
    • \(E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17\)
    • -
  • -
  • \(Var(X) = E[X^2] - E[X]^2 \approx 2.92\)

  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • What's the variance of the result of the toss of a coin with probability of heads (1) of \(p\)?

    - -
      -
    • \(E[X] = 0 \times (1 - p) + 1 \times p = p\)
    • -
    • \(E[X^2] = E[X] = p\)
    • -
  • -
  • \(Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)\)

  • -
- -
- -
- - -
-

Interpreting variances

-
-
-
    -
  • Chebyshev's inequality is useful for interpreting variances
  • -
  • This inequality states that -\[ -P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} -\]
  • -
  • For example, the probability that a random variable lies beyond \(k\) standard deviations from its mean is at most \(1/k^2\) -\[ -\begin{eqnarray*} -2\sigma & \rightarrow & 25\% \\ -3\sigma & \rightarrow & 11\% \\ -4\sigma & \rightarrow & 6\% -\end{eqnarray*} -\]
  • -
  • Note this is only a bound; the actual probability might be quite a bit smaller
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • IQs are often said to be distributed with a mean of \(100\) and a sd of \(15\)
  • -
  • What is the probability of a randomly drawn person having an IQ higher than \(160\) or below \(40\)?
  • -
  • Thus we want to know the probability of a person being more than \(4\) standard deviations from the mean
  • -
  • Thus Chebyshev's inequality suggests that this will be no larger than 6\%
  • -
  • IQ distributions are often cited as being bell shaped, in which case this bound is very conservative
  • -
  • The probability of a random draw from a bell curve being \(4\) standard deviations from the mean is on the order of \(10^{-5}\) (one thousandth of one percent)
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • A former buzz phrase in industrial quality control is Motorola's "Six Sigma" whereby businesses are suggested to control extreme events or rare defective parts
  • -
  • Chebyshev's inequality states that the probability of a "Six Sigma" event is less than \(1/6^2 \approx 3\%\)
  • -
  • If a bell curve is assumed, the probability of a "six sigma" event is on the order of \(10^{-9}\) (one ten millionth of a percent)
  • -
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Expected values + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Expected values

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Expected values

+
+
+
    +
  • The expected value or mean of a random variable is the center of its distribution
  • +
  • For discrete random variable \(X\) with PMF \(p(x)\), it is defined as follows +\[ +E[X] = \sum_x xp(x). +\] +where the sum is taken over the possible values of \(x\)
  • +
  • \(E[X]\) represents the center of mass of a collection of locations and weights, \(\{x, p(x)\}\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+

Find the center of mass of the bars

+ +
## Loading required package: MASS
+
+ +

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

Using manipulate

+
+
+
library(manipulate)
+myHist <- function(mu){
+  hist(galton$child,col="blue",breaks=100)
+  lines(c(mu, mu), c(0, 150),col="red",lwd=5)
+  mse <- mean((galton$child - mu)^2)
+  text(63, 150, paste("mu = ", mu))
+  text(63, 140, paste("Imbalance = ", round(mse, 2)))
+}
+manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
+
+ +
+ +
+ + +
+

The center of mass is the empirical mean

+
+
+
hist(galton$child, col = "blue", breaks = 100)
+meanChild <- mean(galton$child)
+lines(rep(meanChild, 100), seq(0, 150, length = 100), col = "red", lwd = 5)
+
+ +

plot of chunk lsm

+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose a coin is flipped and \(X\) is declared \(0\) or \(1\) corresponding to a head or a tail, respectively
  • +
  • What is the expected value of \(X\)? +\[ +E[X] = .5 \times 0 + .5 \times 1 = .5 +\]
  • +
  • Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be \(.5\)
  • +
+ +

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose that a die is rolled and \(X\) is the number face up
  • +
  • What is the expected value of \(X\)? +\[ +E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + +3 \times \frac{1}{6} + 4 \times \frac{1}{6} + +5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 +\]
  • +
  • Again, the geometric argument makes this answer obvious without calculation.
  • +
+ +
+ +
+ + +
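The two discrete examples above can be checked numerically. This is a minimal sketch in base R; the names `ex_coin`, `x`, `p` and `ex_die` are illustrative and do not appear in the original slides.

```r
# Coin: X is 0 or 1, each with probability 0.5
ex_coin <- sum(c(0, 1) * c(0.5, 0.5))

# Fair die: faces 1..6, each with probability 1/6 (the PMF)
x <- 1:6
p <- rep(1 / 6, length(x))

# E[X] = sum over x of x * p(x)
ex_die <- sum(x * p)

print(c(coin = ex_coin, die = ex_die))
```

Both results agree with the center-of-mass argument: the balance point sits midway between equally weighted locations.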
+

Continuous random variables

+
+
+
    +
  • For a continuous random variable, \(X\), with density, \(f\), the expected +value is defined as follows +\[ +E[X] = \mbox{the area under the function}~~~ t f(t) +\]
  • +
  • This definition borrows from the definition of center of mass for a continuous body
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Consider a density where \(f(x) = 1\) for \(x\) between zero and one
  • +
  • (Is this a valid density?)
  • +
  • Suppose that \(X\) follows this density; what is its expected value?
    +plot of chunk unnamed-chunk-3
  • +
+ +
+ +
+ + +
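Both questions in the uniform example can be explored with numerical integration. This is only a sketch: `f` is a helper we define here for the density, and `integrate()` returns a numerical approximation rather than an exact value.

```r
# Density that equals 1 on [0, 1] and 0 elsewhere
f <- function(t) ifelse(t >= 0 & t <= 1, 1, 0)

# A valid density is non-negative and has total area 1
total_area <- integrate(f, lower = 0, upper = 1)$value

# E[X] is the area under the function t * f(t)
ex_unif <- integrate(function(t) t * f(t), lower = 0, upper = 1)$value

print(c(area = total_area, mean = ex_unif))
```

The area check answers the validity question, and the second integral is the "area under \(t f(t)\)" definition applied directly.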
+

Rules about expected values

+
+
+
    +
  • The expected value is a linear operator
  • +
  • If \(a\) and \(b\) are not random and \(X\) and \(Y\) are two random variables then + +
      +
    • \(E[aX + b] = a E[X] + b\)
    • +
    • \(E[X + Y] = E[X] + E[Y]\)
    • +
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • You flip a coin, \(X\) and simulate a uniform random number \(Y\), what is the expected value of their sum? +\[ +E[X + Y] = E[X] + E[Y] = .5 + .5 = 1 +\]
  • +
  • Another example, you roll a die twice. What is the expected value of the average?
  • +
  • Let \(X_1\) and \(X_2\) be the results of the two rolls +\[ +E[(X_1 + X_2) / 2] = \frac{1}{2}(E[X_1] + E[X_2]) += \frac{1}{2}(3.5 + 3.5) = 3.5 +\]
  • +
+ +
+ +
+ + +
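The two-dice calculation can be verified by brute force. The sketch below enumerates all 36 outcomes and compares the result with the linearity rule; the variable names are our own.

```r
# All 36 equally likely outcomes of two rolls of a fair die
rolls <- expand.grid(x1 = 1:6, x2 = 1:6)

# Average of the two rolls for every outcome
avg <- (rolls$x1 + rolls$x2) / 2

# Each outcome has probability 1/36, so E[average] is the plain mean
ex_avg <- mean(avg)

# Linearity gives the same answer from the single-roll mean
ex_one <- mean(1:6)
ex_linear <- (ex_one + ex_one) / 2

print(c(enumeration = ex_avg, linearity = ex_linear))
```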
+

Example

+
+
+
    +
  1. Let \(X_i\) for \(i=1,\ldots,n\) be a collection of random variables, each from a distribution with mean \(\mu\)
  2. +
  3. Calculate the expected value of the sample average of the \(X_i\) +\[ +\begin{eqnarray*} +E\left[ \frac{1}{n}\sum_{i=1}^n X_i\right] +& = & \frac{1}{n} E\left[\sum_{i=1}^n X_i\right] \\ +& = & \frac{1}{n} \sum_{i=1}^n E\left[X_i\right] \\ +& = & \frac{1}{n} \sum_{i=1}^n \mu = \mu. +\end{eqnarray*} +\]
  4. +
+ +
+ +
+ + +
+

Remark

+
+
+
    +
  • Therefore, the expected value of the sample mean is the population mean that it's trying to estimate
  • +
  • When the expected value of an estimator is what it's trying to estimate, we say that the estimator is unbiased
  • +
+ +
+ +
+ + +
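Unbiasedness can also be illustrated by simulation. The sketch below assumes fair die rolls as the population (mean 3.5); the seed, the sample size `n` and the number of simulations `n_sim` are arbitrary choices made for illustration.

```r
set.seed(1)     # arbitrary seed, only for reproducibility

n <- 10         # size of each sample
n_sim <- 10000  # number of simulated samples

# Each row holds one sample of n rolls of a fair die
samples <- matrix(sample(1:6, n * n_sim, replace = TRUE), nrow = n_sim)

# One sample mean per row
sample_means <- rowMeans(samples)

# The average of the sample means should be close to the population mean 3.5
mean_of_means <- mean(sample_means)
print(mean_of_means)
```

Individual sample means vary around 3.5, but their long-run average settles near it, which is what unbiasedness describes.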
+

The variance

+
+
+
    +
  • The variance of a random variable is a measure of spread
  • +
  • If \(X\) is a random variable with mean \(\mu\), the variance of \(X\) is defined as
  • +
+ +

\[ +Var(X) = E[(X - \mu)^2] +\]

+ +

the expected (squared) distance from the mean

+ +
    +
  • Densities with a higher variance are more spread out than densities with a lower variance
  • +
+ +
+ +
+ + +
+
    +
  • Convenient computational form +\[ +Var(X) = E[X^2] - E[X]^2 +\]
  • +
  • If \(a\) is constant then \(Var(aX) = a^2 Var(X)\)
  • +
  • The square root of the variance is called the standard deviation
  • +
  • The standard deviation has the same units as \(X\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • What's the variance of the result of a toss of a die?

    + +
      +
    • \(E[X] = 3.5\)
    • +
    • \(E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17\)
    • +
  • +
  • \(Var(X) = E[X^2] - E[X]^2 \approx 2.92\)

  • +
+ +
+ +
+ + +
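The die calculation can be reproduced in a few lines. This sketch computes the variance both from the definition and from the computational form; the variable names are illustrative.

```r
# Fair die: faces and PMF
x <- 1:6
p <- rep(1 / 6, length(x))

mu <- sum(x * p)        # E[X]
ex2 <- sum(x^2 * p)     # E[X^2]

# Computational form: Var(X) = E[X^2] - E[X]^2
var_short <- ex2 - mu^2

# Definition: Var(X) = E[(X - mu)^2]
var_def <- sum((x - mu)^2 * p)

print(round(c(EX2 = ex2, shortcut = var_short, definition = var_def), 2))
```

The two forms agree, and the square root of either one is the standard deviation in the same units as the face values.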
+

Example

+
+
+
    +
  • What's the variance of the result of the toss of a coin with probability of heads (1) of \(p\)?

    + +
      +
    • \(E[X] = 0 \times (1 - p) + 1 \times p = p\)
    • +
    • \(E[X^2] = E[X] = p\)
    • +
  • +
  • \(Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)\)

  • +
+ +
+ +
+ + +
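The coin result \(p(1 - p)\) can be checked for several values of \(p\). The helper `bernoulli_var` below is hypothetical, written only for this illustration.

```r
# Variance of a 0/1 coin with P(X = 1) = p, from the computational form
bernoulli_var <- function(p) {
  x <- c(0, 1)
  pmf <- c(1 - p, p)
  mu <- sum(x * pmf)       # E[X] = p
  sum(x^2 * pmf) - mu^2    # E[X^2] - E[X]^2
}

p_grid <- c(0.1, 0.5, 0.9)
vars <- sapply(p_grid, bernoulli_var)

print(data.frame(p = p_grid, variance = vars, formula = p_grid * (1 - p_grid)))
```

The variance is largest at \(p = 0.5\) and symmetric around it, as the formula suggests.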
+

Interpreting variances

+
+
+
    +
  • Chebyshev's inequality is useful for interpreting variances
  • +
  • This inequality states that +\[ +P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} +\]
  • +
  • For example, the probability that a random variable lies beyond \(k\) standard deviations from its mean is at most \(1/k^2\) +\[ +\begin{eqnarray*} +2\sigma & \rightarrow & 25\% \\ +3\sigma & \rightarrow & 11\% \\ +4\sigma & \rightarrow & 6\% +\end{eqnarray*} +\]
  • +
  • Note this is only a bound; the actual probability might be quite a bit smaller
  • +
+ +
+ +
+ + +
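To see that the bound holds for distributions that are not bell shaped, the sketch below compares it with exact tail probabilities for an exponential distribution with rate 1 (mean 1, standard deviation 1). The choice of distribution is our own assumption for illustration and is not part of the slides.

```r
k <- 2:4

# Chebyshev bound on P(|X - mu| >= k * sigma)
chebyshev_bound <- 1 / k^2

# Exponential(rate = 1): mu = 1 and sigma = 1. For k >= 1 the lower tail
# X <= 1 - k has probability 0, so only the upper tail X >= 1 + k remains.
exact_tail <- pexp(1 + k, rate = 1, lower.tail = FALSE)

print(data.frame(k = k, bound = chebyshev_bound, exact = exact_tail))
```

The exact probabilities stay below the bound in every row, consistent with the note that the bound can be far from tight.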
+

Example

+
+
+
    +
  • IQs are often said to be distributed with a mean of \(100\) and a sd of \(15\)
  • +
  • What is the probability of a randomly drawn person having an IQ higher than \(160\) or below \(40\)?
  • +
  • Thus we want to know the probability of a person being more than \(4\) standard deviations from the mean
  • +
  • Thus Chebyshev's inequality suggests that this will be no larger than 6\%
  • +
  • IQ distributions are often cited as being bell shaped, in which case this bound is very conservative
  • +
  • The probability of a random draw from a bell curve being \(4\) standard deviations from the mean is on the order of \(10^{-5}\) (one thousandth of one percent)
  • +
+ +
+ +
+ + +
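The IQ comparison can be reproduced directly. This sketch treats IQ as exactly normal with mean 100 and standard deviation 15, which is a modeling assumption rather than an established fact.

```r
mu <- 100
sigma <- 15

# 160 and 40 are both k standard deviations from the mean
k <- (160 - mu) / sigma
chebyshev_bound <- 1 / k^2

# Probability of an IQ below 40 or above 160 under the normal assumption
p_normal <- pnorm(40, mean = mu, sd = sigma) +
  pnorm(160, mean = mu, sd = sigma, lower.tail = FALSE)

print(c(chebyshev = chebyshev_bound, normal = p_normal))
```

The normal-model probability is several orders of magnitude below the Chebyshev bound, which is why the bound is called conservative here.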
+

Example

+
+
+
    +
  • A former buzz phrase in industrial quality control is Motorola's "Six Sigma" whereby businesses are suggested to control extreme events or rare defective parts
  • +
  • Chebyshev's inequality states that the probability of a "Six Sigma" event is less than \(1/6^2 \approx 3\%\)
  • +
  • If a bell curve is assumed, the probability of a "six sigma" event is on the order of \(10^{-9}\) (one ten millionth of a percent)
  • +
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/01_03_Expectations/index.md b/06_StatisticalInference/01_03_Expectations/index.md index 8ac21861b..e86b30289 100644 --- a/06_StatisticalInference/01_03_Expectations/index.md +++ b/06_StatisticalInference/01_03_Expectations/index.md @@ -1,234 +1,239 @@ ---- -title : Expected values -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -## Expected values - -- The **expected value** or **mean** of a random variable is the center of its distribution -- For discrete random variable $X$ with PMF $p(x)$, it is defined as follows - $$ - E[X] = \sum_x xp(x). 
- $$ - where the sum is taken over the possible values of $x$ -- $E[X]$ represents the center of mass of a collection of locations and weights, $\{x, p(x)\}$ - ---- - -## Example -### Find the center of mass of the bars -![plot of chunk unnamed-chunk-1](assets/fig/unnamed-chunk-1.png) - - ---- -## Using manipulate -``` -library(manipulate) -myHist <- function(mu){ - hist(galton$child,col="blue",breaks=100) - lines(c(mu, mu), c(0, 150),col="red",lwd=5) - mse <- mean((galton$child - mu)^2) - text(63, 150, paste("mu = ", mu)) - text(63, 140, paste("Imbalance = ", round(mse, 2))) -} -manipulate(myHist(mu), mu = slider(62, 74, step = 0.5)) -``` - ---- -## The center of mass is the empirical mean - -```r -hist(galton$child, col = "blue", breaks = 100) -meanChild <- mean(galton$child) -lines(rep(meanChild, 100), seq(0, 150, length = 100), col = "red", lwd = 5) -``` - -![plot of chunk lsm](assets/fig/lsm.png) - - ---- -## Example - -- Suppose a coin is flipped and $X$ is declared $0$ or $1$ corresponding to a head or a tail, respectively -- What is the expected value of $X$? - $$ - E[X] = .5 \times 0 + .5 \times 1 = .5 - $$ -- Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be $.5$ - -![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) - ---- - -## Example - -- Suppose that a die is rolled and $X$ is the number face up -- What is the expected value of $X$? - $$ - E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + - 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + - 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 - $$ -- Again, the geometric argument makes this answer obvious without calculation. 
- ---- - -## Continuous random variables - -- For a continuous random variable, $X$, with density, $f$, the expected - value is defined as follows - $$ - E[X] = \mbox{the area under the function}~~~ t f(t) - $$ -- This definition borrows from the definition of center of mass for a continuous body - ---- - -## Example - -- Consider a density where $f(x) = 1$ for $x$ between zero and one -- (Is this a valid density?) -- Suppose that $X$ follows this density; what is its expected value? -![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) - - ---- - -## Rules about expected values - -- The expected value is a linear operator -- If $a$ and $b$ are not random and $X$ and $Y$ are two random variables then - - $E[aX + b] = a E[X] + b$ - - $E[X + Y] = E[X] + E[Y]$ - ---- - -## Example - -- You flip a coin, $X$ and simulate a uniform random number $Y$, what is the expected value of their sum? - $$ - E[X + Y] = E[X] + E[Y] = .5 + .5 = 1 - $$ -- Another example, you roll a die twice. What is the expected value of the average? -- Let $X_1$ and $X_2$ be the results of the two rolls - $$ - E[(X_1 + X_2) / 2] = \frac{1}{2}(E[X_1] + E[X_2]) - = \frac{1}{2}(3.5 + 3.5) = 3.5 - $$ - ---- - -## Example - -1. Let $X_i$ for $i=1,\ldots,n$ be a collection of random variables, each from a distribution with mean $\mu$ -2. Calculate the expected value of the sample average of the $X_i$ -$$ - \begin{eqnarray*} - E\left[ \frac{1}{n}\sum_{i=1}^n X_i\right] - & = & \frac{1}{n} E\left[\sum_{i=1}^n X_i\right] \\ - & = & \frac{1}{n} \sum_{i=1}^n E\left[X_i\right] \\ - & = & \frac{1}{n} \sum_{i=1}^n \mu = \mu. 
- \end{eqnarray*} -$$ - ---- - -## Remark - -- Therefore, the expected value of the **sample mean** is the population mean that it's trying to estimate -- When the expected value of an estimator is what it's trying to estimate, we say that the estimator is **unbiased** - ---- - -## The variance - -- The variance of a random variable is a measure of *spread* -- If $X$ is a random variable with mean $\mu$, the variance of $X$ is defined as - -$$ -Var(X) = E[(X - \mu)^2] -$$ - -the expected (squared) distance from the mean -- Densities with a higher variance are more spread out than densities with a lower variance - ---- - -- Convenient computational form -$$ -Var(X) = E[X^2] - E[X]^2 -$$ -- If $a$ is constant then $Var(aX) = a^2 Var(X)$ -- The square root of the variance is called the **standard deviation** -- The standard deviation has the same units as $X$ - ---- - -## Example - -- What's the variance of the result of a toss of a die? - - - $E[X] = 3.5$ - - $E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17$ - -- $Var(X) = E[X^2] - E[X]^2 \approx 2.92$ - ---- - -## Example - -- What's the variance of the result of the toss of a coin with probability of heads (1) of $p$? 
- - - $E[X] = 0 \times (1 - p) + 1 \times p = p$ - - $E[X^2] = E[X] = p$ - -- $Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)$ - ---- - -## Interpreting variances - -- Chebyshev's inequality is useful for interpreting variances -- This inequality states that -$$ -P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} -$$ -- For example, the probability that a random variable lies beyond $k$ standard deviations from its mean is at most $1/k^2$ -$$ -\begin{eqnarray*} - 2\sigma & \rightarrow & 25\% \\ - 3\sigma & \rightarrow & 11\% \\ - 4\sigma & \rightarrow & 6\% -\end{eqnarray*} -$$ -- Note this is only a bound; the actual probability might be quite a bit smaller - ---- - -## Example - -- IQs are often said to be distributed with a mean of $100$ and a sd of $15$ -- What is the probability of a randomly drawn person having an IQ higher than $160$ or below $40$? -- Thus we want to know the probability of a person being more than $4$ standard deviations from the mean -- Thus Chebyshev's inequality suggests that this will be no larger than 6\% -- IQ distributions are often cited as being bell shaped, in which case this bound is very conservative -- The probability of a random draw from a bell curve being $4$ standard deviations from the mean is on the order of $10^{-5}$ (one thousandth of one percent) - ---- - -## Example - -- A former buzz phrase in industrial quality control is Motorola's "Six Sigma" whereby businesses are suggested to control extreme events or rare defective parts -- Chebyshev's inequality states that the probability of a "Six Sigma" event is less than $1/6^2 \approx 3\%$ -- If a bell curve is assumed, the probability of a "six sigma" event is on the order of $10^{-9}$ (one ten millionth of a percent) - +--- +title : Expected values +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} 
+highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Expected values + +- The **expected value** or **mean** of a random variable is the center of its distribution +- For discrete random variable $X$ with PMF $p(x)$, it is defined as follows + $$ + E[X] = \sum_x xp(x). + $$ + where the sum is taken over the possible values of $x$ +- $E[X]$ represents the center of mass of a collection of locations and weights, $\{x, p(x)\}$ + +--- + +## Example +### Find the center of mass of the bars + +``` +## Loading required package: MASS +``` + +plot of chunk unnamed-chunk-1 + + +--- +## Using manipulate +``` +library(manipulate) +myHist <- function(mu){ + hist(galton$child,col="blue",breaks=100) + lines(c(mu, mu), c(0, 150),col="red",lwd=5) + mse <- mean((galton$child - mu)^2) + text(63, 150, paste("mu = ", mu)) + text(63, 140, paste("Imbalance = ", round(mse, 2))) +} +manipulate(myHist(mu), mu = slider(62, 74, step = 0.5)) +``` + +--- +## The center of mass is the empirical mean + +```r +hist(galton$child, col = "blue", breaks = 100) +meanChild <- mean(galton$child) +lines(rep(meanChild, 100), seq(0, 150, length = 100), col = "red", lwd = 5) +``` + +plot of chunk lsm + + +--- +## Example + +- Suppose a coin is flipped and $X$ is declared $0$ or $1$ corresponding to a head or a tail, respectively +- What is the expected value of $X$? + $$ + E[X] = .5 \times 0 + .5 \times 1 = .5 + $$ +- Note, if thought about geometrically, this answer is obvious; if two equal weights are spaced at 0 and 1, the center of mass will be $.5$ + +plot of chunk unnamed-chunk-2 + +--- + +## Example + +- Suppose that a die is rolled and $X$ is the number face up +- What is the expected value of $X$? 
+ $$ + E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 + $$ +- Again, the geometric argument makes this answer obvious without calculation. + +--- + +## Continuous random variables + +- For a continuous random variable, $X$, with density, $f$, the expected + value is defined as follows + $$ + E[X] = \mbox{the area under the function}~~~ t f(t) + $$ +- This definition borrows from the definition of center of mass for a continuous body + +--- + +## Example + +- Consider a density where $f(x) = 1$ for $x$ between zero and one +- (Is this a valid density?) +- Suppose that $X$ follows this density; what is its expected value? +![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) + + +--- + +## Rules about expected values + +- The expected value is a linear operator +- If $a$ and $b$ are not random and $X$ and $Y$ are two random variables then + - $E[aX + b] = a E[X] + b$ + - $E[X + Y] = E[X] + E[Y]$ + +--- + +## Example + +- You flip a coin, $X$ and simulate a uniform random number $Y$, what is the expected value of their sum? + $$ + E[X + Y] = E[X] + E[Y] = .5 + .5 = 1 + $$ +- Another example, you roll a die twice. What is the expected value of the average? +- Let $X_1$ and $X_2$ be the results of the two rolls + $$ + E[(X_1 + X_2) / 2] = \frac{1}{2}(E[X_1] + E[X_2]) + = \frac{1}{2}(3.5 + 3.5) = 3.5 + $$ + +--- + +## Example + +1. Let $X_i$ for $i=1,\ldots,n$ be a collection of random variables, each from a distribution with mean $\mu$ +2. Calculate the expected value of the sample average of the $X_i$ +$$ + \begin{eqnarray*} + E\left[ \frac{1}{n}\sum_{i=1}^n X_i\right] + & = & \frac{1}{n} E\left[\sum_{i=1}^n X_i\right] \\ + & = & \frac{1}{n} \sum_{i=1}^n E\left[X_i\right] \\ + & = & \frac{1}{n} \sum_{i=1}^n \mu = \mu. 
+ \end{eqnarray*} +$$ + +--- + +## Remark + +- Therefore, the expected value of the **sample mean** is the population mean that it's trying to estimate +- When the expected value of an estimator is what it's trying to estimate, we say that the estimator is **unbiased** + +--- + +## The variance + +- The variance of a random variable is a measure of *spread* +- If $X$ is a random variable with mean $\mu$, the variance of $X$ is defined as + +$$ +Var(X) = E[(X - \mu)^2] +$$ + +the expected (squared) distance from the mean +- Densities with a higher variance are more spread out than densities with a lower variance + +--- + +- Convenient computational form +$$ +Var(X) = E[X^2] - E[X]^2 +$$ +- If $a$ is constant then $Var(aX) = a^2 Var(X)$ +- The square root of the variance is called the **standard deviation** +- The standard deviation has the same units as $X$ + +--- + +## Example + +- What's the variance of the result of a toss of a die? + + - $E[X] = 3.5$ + - $E[X^2] = 1 ^ 2 \times \frac{1}{6} + 2 ^ 2 \times \frac{1}{6} + 3 ^ 2 \times \frac{1}{6} + 4 ^ 2 \times \frac{1}{6} + 5 ^ 2 \times \frac{1}{6} + 6 ^ 2 \times \frac{1}{6} = 15.17$ + +- $Var(X) = E[X^2] - E[X]^2 \approx 2.92$ + +--- + +## Example + +- What's the variance of the result of the toss of a coin with probability of heads (1) of $p$? 
+ + - $E[X] = 0 \times (1 - p) + 1 \times p = p$ + - $E[X^2] = E[X] = p$ + +- $Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)$ + +--- + +## Interpreting variances + +- Chebyshev's inequality is useful for interpreting variances +- This inequality states that +$$ +P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} +$$ +- For example, the probability that a random variable lies beyond $k$ standard deviations from its mean is at most $1/k^2$ +$$ +\begin{eqnarray*} + 2\sigma & \rightarrow & 25\% \\ + 3\sigma & \rightarrow & 11\% \\ + 4\sigma & \rightarrow & 6\% +\end{eqnarray*} +$$ +- Note this is only a bound; the actual probability might be quite a bit smaller + +--- + +## Example + +- IQs are often said to be distributed with a mean of $100$ and a sd of $15$ +- What is the probability of a randomly drawn person having an IQ higher than $160$ or below $40$? +- Thus we want to know the probability of a person being more than $4$ standard deviations from the mean +- Thus Chebyshev's inequality suggests that this will be no larger than 6\% +- IQ distributions are often cited as being bell shaped, in which case this bound is very conservative +- The probability of a random draw from a bell curve being $4$ standard deviations from the mean is on the order of $10^{-5}$ (one thousandth of one percent) + +--- + +## Example + +- A former buzz phrase in industrial quality control is Motorola's "Six Sigma" whereby businesses are suggested to control extreme events or rare defective parts +- Chebyshev's inequality states that the probability of a "Six Sigma" event is less than $1/6^2 \approx 3\%$ +- If a bell curve is assumed, the probability of a "six sigma" event is on the order of $10^{-9}$ (one ten millionth of a percent) + diff --git a/06_StatisticalInference/01_03_Expectations/index.pdf b/06_StatisticalInference/01_03_Expectations/index.pdf index c9c43b5a3..aa71b7bd4 100644 Binary files a/06_StatisticalInference/01_03_Expectations/index.pdf and 
b/06_StatisticalInference/01_03_Expectations/index.pdf differ diff --git a/06_StatisticalInference/01_04_Independence/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/01_04_Independence/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..eded2a301 Binary files /dev/null and b/06_StatisticalInference/01_04_Independence/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/01_04_Independence/index.html b/06_StatisticalInference/01_04_Independence/index.html index 57c94e8b1..3ce8e98d5 100644 --- a/06_StatisticalInference/01_04_Independence/index.html +++ b/06_StatisticalInference/01_04_Independence/index.html @@ -1,496 +1,496 @@ - - - - Independence - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Independence

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Independent events

-
-
-
    -
  • Two events \(A\) and \(B\) are independent if \[P(A \cap B) = P(A)P(B)\]
  • -
  • Two random variables, \(X\) and \(Y\) are independent if for any two sets \(A\) and \(B\) \[P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)\]
  • -
  • If \(A\) is independent of \(B\) then

    - -
      -
    • \(A^c\) is independent of \(B\)
    • -
    • \(A\) is independent of \(B^c\)
    • -
    • \(A^c\) is independent of \(B^c\)
    • -
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • What is the probability of getting two consecutive heads?
  • -
  • \(A = \{\mbox{Head on flip 1}\}\) ~ \(P(A) = .5\)
  • -
  • \(B = \{\mbox{Head on flip 2}\}\) ~ \(P(B) = .5\)
  • -
  • \(A \cap B = \{\mbox{Head on flips 1 and 2}\}\)
  • -
  • \(P(A \cap B) = P(A)P(B) = .5 \times .5 = .25\)
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial
  • -
  • Based on an estimated prevalence of sudden infant death syndrome of \(1\) out of \(8,543\), Dr Meadow testified that the probability of a mother having two children with SIDS was \(\left(\frac{1}{8,543}\right)^2\)
  • -
  • The mother on trial was convicted of murder
  • -
- -
- -
- - -
-

Example: continued

-
-
-
    -
  • For the purposes of this class, the principal mistake was to assume that the probabilities of having SIDS within a family are independent
  • -
  • That is, \(P(A_1 \cap A_2)\) is not necessarily equal to \(P(A_1)P(A_2)\)
  • -
  • Biological processes that are believed to have a genetic or familial environmental component, of course, tend to be dependent within families
  • -
  • (There are many other statistical points of discussion for this case.)
  • -
- -
- -
- - -
-

Useful fact

-
-
-

We will use the following fact extensively in this class:

- -

If a collection of random variables \(X_1, X_2, \ldots, X_n\) are independent, then their joint distribution is the product of their individual densities or mass functions

- -

That is, if \(f_i\) is the density for random variable \(X_i\) we have that -\[ -f(x_1,\ldots, x_n) = \prod_{i=1}^n f_i(x_i) -\]

- -
- -
- - -
-

IID random variables

-
-
-
    -
  • Random variables are said to be iid if they are independent and identically distributed
  • -
  • iid random variables are the default model for random samples
  • -
  • Many of the important theories of statistics are founded on assuming that variables are iid
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Suppose that we flip a biased coin with success probability \(p\) a total of \(n\) times; what is the joint density of the collection of outcomes?
  • -
  • These random variables are iid with densities \(p^{x_i} (1 - p)^{1-x_i}\)
  • -
  • Therefore -\[ -f(x_1,\ldots,x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1-x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} -\]
  • -
- -
- -
- - -
-

Correlation

-
-
-
    -
  • The covariance between two random variables \(X\) and \(Y\) is defined as -\[ -Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E[X Y] - E[X]E[Y] -\]
  • -
  • The following are useful facts about covariance - -
      -
    1. \(Cov(X, Y) = Cov(Y, X)\)
    2. -
    3. \(Cov(X, Y)\) can be negative or positive
    4. -
    5. \(|Cov(X, Y)| \leq \sqrt{Var(X) Var(Y)}\)
    6. -
  • -
- -
- -
- - -
-

Correlation

-
-
-
    -
  • The correlation between \(X\) and \(Y\) is -\[ -Cor(X, Y) = Cov(X, Y) / \sqrt{Var(X) Var(Y)} -\]
  • -
- -
    -
  1. \(-1 \leq Cor(X, Y) \leq 1\)
  2. -
  3. \(Cor(X, Y) = \pm 1\) if and only if \(X = a + bY\) for some constants \(a\) and \(b\)
  4. -
  5. \(Cor(X, Y)\) is unitless
  6. -
  7. \(X\) and \(Y\) are uncorrelated if \(Cor(X, Y) = 0\)
  8. -
  9. \(X\) and \(Y\) are more positively correlated, the closer \(Cor(X,Y)\) is to \(1\)
  10. -
  11. \(X\) and \(Y\) are more negatively correlated, the closer \(Cor(X,Y)\) is to \(-1\)
  12. -
- -
- -
- - -
-

Some useful results

-
-
-
    -
  • Let \(\{X_i\}_{i=1}^n\) be a collection of random variables

    - -
      -
    • When the \(\{X_i\}\) are uncorrelated \[Var\left(\sum_{i=1}^n a_i X_i + b\right) = \sum_{i=1}^n a_i^2 Var(X_i)\]
    • -
  • -
  • A commonly used subcase from these properties is that if a collection of random variables \(\{X_i\}\) are uncorrelated, then the variance of the sum is the sum of the variances -\[ -Var\left(\sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i) -\]

  • -
  • Therefore, it is sums of variances that tend to be useful, not sums of standard deviations; that is, the standard deviation of the sum of a bunch of independent random variables is the square root of the sum of the variances, not the sum of the standard deviations

  • -
- -
- -
- - -
-

The sample mean

-
-
-

Suppose \(X_i\) are iid with variance \(\sigma^2\)

- -

\[ -\begin{eqnarray*} - Var(\bar X) & = & Var \left( \frac{1}{n}\sum_{i=1}^n X_i \right)\\ \\ - & = & \frac{1}{n^2} Var\left(\sum_{i=1}^n X_i \right)\\ \\ - & = & \frac{1}{n^2} \sum_{i=1}^n Var(X_i) \\ \\ - & = & \frac{1}{n^2} \times n\sigma^2 \\ \\ - & = & \frac{\sigma^2}{n} - \end{eqnarray*} -\]

- -
- -
- - -
-

Some comments

-
-
-
    -
  • When \(X_i\) are independent with a common variance \(\sigma^2\), \(Var(\bar X) = \frac{\sigma^2}{n}\)
  • -
  • \(\sigma/\sqrt{n}\) is called the standard error of the sample mean
  • -
  • The standard error of the sample mean is the standard deviation of the distribution of the sample mean
  • -
  • \(\sigma\) is the standard deviation of the distribution of a single observation
  • -
  • An easy way to remember: the sample mean has to be less variable than a single observation, so its standard deviation is divided by \(\sqrt{n}\)
  • -
- -
- -
- - -
-

The sample variance

-
-
-
    -
  • The sample variance is defined as -\[ -S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} -\]
  • -
  • The sample variance is an estimator of \(\sigma^2\)
  • -
  • The numerator has a version that's quicker for calculation -\[ -\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - n \bar X^2 -\]
  • -
  • The sample variance is (nearly) the mean of the squared deviations from the mean
  • -
- -
- -
- - -
-

The sample variance is unbiased

-
-
-

\[ - \begin{eqnarray*} - E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] & = & \sum_{i=1}^n E\left[X_i^2\right] - n E\left[\bar X^2\right] \\ \\ - & = & \sum_{i=1}^n \left\{Var(X_i) + \mu^2\right\} - n \left\{Var(\bar X) + \mu^2\right\} \\ \\ - & = & \sum_{i=1}^n \left\{\sigma^2 + \mu^2\right\} - n \left\{\sigma^2 / n + \mu^2\right\} \\ \\ - & = & n \sigma^2 + n \mu ^ 2 - \sigma^2 - n \mu^2 \\ \\ - & = & (n - 1) \sigma^2 - \end{eqnarray*} -\]

- -
- -
- - -
-

Hoping to avoid some confusion

-
-
-
    -
  • Suppose \(X_i\) are iid with mean \(\mu\) and variance \(\sigma^2\)
  • -
  • \(S^2\) estimates \(\sigma^2\)
  • -
  • The calculation of \(S^2\) involves dividing by \(n-1\)
  • -
  • \(S / \sqrt{n}\) estimates \(\sigma / \sqrt{n}\), the standard error of the mean
  • -
  • \(S / \sqrt{n}\) is called the sample standard error (of the mean)
  • -
- -
- -
- - -
-

Example

-
-
-
library(UsingR)  # assumption: father.son is loaded from the UsingR package
-data(father.son)
-x <- father.son$sheight
-n <- length(x)
-
- -
- -
- - -
-

plot of chunk unnamed-chunk-2

- -
round(c(sum((x - mean(x))^2)/(n - 1), var(x), var(x)/n, sd(x), sd(x)/sqrt(n)), 
-    2)
-
- -
## [1] 7.92 7.92 0.01 2.81 0.09
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Independence + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Independence

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Independent events

+
+
+
    +
  • Two events \(A\) and \(B\) are independent if \[P(A \cap B) = P(A)P(B)\]
  • +
  • Two random variables, \(X\) and \(Y\) are independent if for any two sets \(A\) and \(B\) \[P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)\]
  • +
  • If \(A\) is independent of \(B\) then

    + +
      +
    • \(A^c\) is independent of \(B\)
    • +
    • \(A\) is independent of \(B^c\)
    • +
    • \(A^c\) is independent of \(B^c\)
    • +
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • What is the probability of getting two consecutive heads?
  • +
  • \(A = \{\mbox{Head on flip 1}\}\) ~ \(P(A) = .5\)
  • +
  • \(B = \{\mbox{Head on flip 2}\}\) ~ \(P(B) = .5\)
  • +
  • \(A \cap B = \{\mbox{Head on flips 1 and 2}\}\)
  • +
  • \(P(A \cap B) = P(A)P(B) = .5 \times .5 = .25\)
  • +
+ +
+ +
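As a quick check of the two-heads calculation, the sketch below enumerates the four equally likely outcomes of two fair flips and compares \(P(A \cap B)\) with \(P(A)P(B)\). The object names (`outcomes`, `p_A`, `p_B`, `p_AB`) are illustrative and not part of the original slides.

```r
# Enumerate the four equally likely outcomes of two fair coin flips.
outcomes <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"),
                        stringsAsFactors = FALSE)

# Each row has probability 1/4, so an event's probability is the share of rows in it.
p_A  <- mean(outcomes$flip1 == "H")                          # head on flip 1
p_B  <- mean(outcomes$flip2 == "H")                          # head on flip 2
p_AB <- mean(outcomes$flip1 == "H" & outcomes$flip2 == "H")  # heads on both flips

# Independence means the last two entries agree.
c(p_A = p_A, p_B = p_B, p_AB = p_AB, product = p_A * p_B)
```

The enumeration only works because all four outcomes are assumed equally likely; for a biased coin the rows would need weights.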
+ + +
+

Example

+
+
+
    +
  • Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial
  • +
  • Based on an estimated prevalence of sudden infant death syndrome of \(1\) out of \(8,543\), Dr Meadow testified that the probability of a mother having two children with SIDS was \(\left(\frac{1}{8,543}\right)^2\)
  • +
  • The mother on trial was convicted of murder
  • +
+ +
+ +
+ + +
+

Example: continued

+
+
+
    +
  • For the purposes of this class, the principal mistake was to assume that the probabilities of having SIDS within a family are independent
  • +
  • That is, \(P(A_1 \cap A_2)\) is not necessarily equal to \(P(A_1)P(A_2)\)
  • +
  • Biological processes that are believed to have a genetic or familial environmental component, of course, tend to be dependent within families
  • +
  • (There are many other statistical points of discussion for this case.)
  • +
+ +
+ +
+ + +
+

Useful fact

+
+
+

We will use the following fact extensively in this class:

+ +

If a collection of random variables \(X_1, X_2, \ldots, X_n\) are independent, then their joint distribution is the product of their individual densities or mass functions

+ +

That is, if \(f_i\) is the density for random variable \(X_i\) we have that +\[ +f(x_1,\ldots, x_n) = \prod_{i=1}^n f_i(x_i) +\]

+ +
+ +
+ + +
+

IID random variables

+
+
+
    +
  • Random variables are said to be iid if they are independent and identically distributed
  • +
  • iid random variables are the default model for random samples
  • +
  • Many of the important theories of statistics are founded on assuming that variables are iid
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose that we flip a biased coin with success probability \(p\) a total of \(n\) times; what is the joint density of the collection of outcomes?
  • +
  • These random variables are iid with densities \(p^{x_i} (1 - p)^{1-x_i}\)
  • +
  • Therefore +\[ +f(x_1,\ldots,x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1-x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} +\]
  • +
+ +
+ +
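The collapsed form of the joint mass function can be checked numerically. This is a minimal sketch: `joint_mass` is a hypothetical helper, and the outcomes `x` and the probability `p` are made-up illustration values.

```r
# Joint mass function of n iid Bernoulli(p) flips in the collapsed form
# p^(sum x) * (1 - p)^(n - sum x).
joint_mass <- function(x, p) {
  p^sum(x) * (1 - p)^(length(x) - sum(x))
}

x <- c(1, 0, 1, 1, 0)  # hypothetical outcomes: 1 = success, 0 = failure
p <- 0.3               # hypothetical success probability

closed_form  <- joint_mass(x, p)
product_form <- prod(dbinom(x, size = 1, prob = p))  # product of the individual mass functions

# TRUE when the two forms agree up to floating point error.
all.equal(closed_form, product_form)
```

Note that the joint mass depends on the data only through \(\sum x_i\), the number of successes.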
+ + +
+

Correlation

+
+
+
    +
  • The covariance between two random variables \(X\) and \(Y\) is defined as +\[ +Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E[X Y] - E[X]E[Y] +\]
  • +
  • The following are useful facts about covariance + +
      +
    1. \(Cov(X, Y) = Cov(Y, X)\)
    2. +
    3. \(Cov(X, Y)\) can be negative or positive
    4. +
    5. \(|Cov(X, Y)| \leq \sqrt{Var(X) Var(Y)}\)
    6. +
  • +
+ +
+ +
+ + +
+

Correlation

+
+
+
    +
  • The correlation between \(X\) and \(Y\) is +\[ +Cor(X, Y) = Cov(X, Y) / \sqrt{Var(X) Var(Y)} +\]
  • +
+ +
    +
  1. \(-1 \leq Cor(X, Y) \leq 1\)
  2. +
  3. \(Cor(X, Y) = \pm 1\) if and only if \(X = a + bY\) for some constants \(a\) and \(b\)
  4. +
  5. \(Cor(X, Y)\) is unitless
  6. +
  7. \(X\) and \(Y\) are uncorrelated if \(Cor(X, Y) = 0\)
  8. +
  9. \(X\) and \(Y\) are more positively correlated, the closer \(Cor(X,Y)\) is to \(1\)
  10. +
  11. \(X\) and \(Y\) are more negatively correlated, the closer \(Cor(X,Y)\) is to \(-1\)
  12. +
+ +
+ +
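The sample analogues of these facts can be checked with base R's `cov()`, `var()` and `cor()`. The sketch below uses made-up height and weight values (not course data) to illustrate the covariance bound and the fact that correlation, unlike covariance, is unitless.

```r
x <- c(61, 64, 67, 70, 73, 76)         # hypothetical heights in inches
y <- c(120, 135, 150, 148, 170, 185)   # hypothetical weights in pounds

cov_xy      <- cov(x, y)
cor_by_hand <- cov_xy / sqrt(var(x) * var(y))  # definition of correlation

# The covariance is bounded by the product of the standard deviations.
bound_holds <- abs(cov_xy) <= sqrt(var(x) * var(y))

# Converting inches to centimeters rescales the covariance but not the correlation.
cov_in_cm <- cov(2.54 * x, y)
cor_in_cm <- cor(2.54 * x, y)

c(by_hand = cor_by_hand, built_in = cor(x, y), in_cm = cor_in_cm)
```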
+ + +
+

Some useful results

+
+
+
    +
  • Let \(\{X_i\}_{i=1}^n\) be a collection of random variables

    + +
      +
    • When the \(\{X_i\}\) are uncorrelated \[Var\left(\sum_{i=1}^n a_i X_i + b\right) = \sum_{i=1}^n a_i^2 Var(X_i)\]
    • +
  • +
  • A commonly used subcase from these properties is that if a collection of random variables \(\{X_i\}\) are uncorrelated, then the variance of the sum is the sum of the variances +\[ +Var\left(\sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i) +\]

  • +
  • Therefore, it is sums of variances that tend to be useful, not sums of standard deviations; that is, the standard deviation of the sum of a bunch of independent random variables is the square root of the sum of the variances, not the sum of the standard deviations

  • +
+ +
+ +
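A small simulation can make the point about standard deviations concrete. This is only a sketch under assumed settings: two independent normal samples with standard deviations 3 and 4 (arbitrary choices), so the sum should have standard deviation near \(\sqrt{3^2 + 4^2} = 5\) rather than \(3 + 4 = 7\).

```r
set.seed(1)     # arbitrary seed so the run is reproducible
n_draws <- 1e5  # number of simulated pairs

x1 <- rnorm(n_draws, mean = 0, sd = 3)
x2 <- rnorm(n_draws, mean = 0, sd = 4)  # drawn independently of x1

sd_of_sum    <- sd(x1 + x2)      # empirical standard deviation of the sum
root_sum_var <- sqrt(3^2 + 4^2)  # square root of the sum of the variances
sum_of_sds   <- 3 + 4            # the tempting but wrong answer

c(simulated = sd_of_sum, root_sum_var = root_sum_var, sum_of_sds = sum_of_sds)
```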
+ + +
+

The sample mean

+
+
+

Suppose \(X_i\) are iid with variance \(\sigma^2\)

+ +

\[ +\begin{eqnarray*} + Var(\bar X) & = & Var \left( \frac{1}{n}\sum_{i=1}^n X_i \right)\\ \\ + & = & \frac{1}{n^2} Var\left(\sum_{i=1}^n X_i \right)\\ \\ + & = & \frac{1}{n^2} \sum_{i=1}^n Var(X_i) \\ \\ + & = & \frac{1}{n^2} \times n\sigma^2 \\ \\ + & = & \frac{\sigma^2}{n} + \end{eqnarray*} +\]

+ +
+ +
+ + +
+

Some comments

+
+
+
    +
  • When \(X_i\) are independent with a common variance \(\sigma^2\), \(Var(\bar X) = \frac{\sigma^2}{n}\)
  • +
  • \(\sigma/\sqrt{n}\) is called the standard error of the sample mean
  • +
  • The standard error of the sample mean is the standard deviation of the distribution of the sample mean
  • +
  • \(\sigma\) is the standard deviation of the distribution of a single observation
  • +
  • An easy way to remember: the sample mean has to be less variable than a single observation, so its standard deviation is divided by \(\sqrt{n}\)
  • +
+ +
+ +
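The standard error can also be seen by simulation: draw many samples, take the mean of each, and look at the standard deviation of those means. The settings below (\(\sigma = 10\), \(n = 25\), 10,000 replications, normal data) are assumptions chosen for illustration.

```r
set.seed(2)   # arbitrary seed so the run is reproducible
sigma <- 10   # standard deviation of a single observation
n     <- 25   # sample size behind each mean

# Each replication draws n observations and records their sample mean.
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

simulated_se   <- sd(xbar)         # standard deviation of the sample means
theoretical_se <- sigma / sqrt(n)  # sigma / sqrt(n) = 2

c(simulated = simulated_se, theoretical = theoretical_se)
```

The simulated value should land close to the theoretical one, and well below \(\sigma\) itself.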
+ + +
+

The sample variance

+
+
+
    +
  • The sample variance is defined as +\[ +S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} +\]
  • +
  • The sample variance is an estimator of \(\sigma^2\)
  • +
  • The numerator has a version that's quicker for calculation +\[ +\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - n \bar X^2 +\]
  • +
  • The sample variance is (nearly) the mean of the squared deviations from the mean
  • +
+ +
+ +
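The shortcut for the numerator can be verified on a small data set. The vector `x` below is made up for illustration; `var()` is base R's sample variance, which divides by \(n - 1\).

```r
x <- c(2, 4, 4, 5, 7, 9)  # hypothetical observations
n <- length(x)

direct   <- sum((x - mean(x))^2)       # sum of squared deviations from the mean
shortcut <- sum(x^2) - n * mean(x)^2   # quicker algebraic form of the same quantity

s2 <- direct / (n - 1)                 # sample variance by hand

c(direct = direct, shortcut = shortcut, by_hand = s2, built_in = var(x))
```

In floating point arithmetic the shortcut can lose accuracy when the mean is large relative to the spread, so the direct form is usually the safer one in code.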
+ + +
+

The sample variance is unbiased

+
+
+

\[ + \begin{eqnarray*} + E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] & = & \sum_{i=1}^n E\left[X_i^2\right] - n E\left[\bar X^2\right] \\ \\ + & = & \sum_{i=1}^n \left\{Var(X_i) + \mu^2\right\} - n \left\{Var(\bar X) + \mu^2\right\} \\ \\ + & = & \sum_{i=1}^n \left\{\sigma^2 + \mu^2\right\} - n \left\{\sigma^2 / n + \mu^2\right\} \\ \\ + & = & n \sigma^2 + n \mu ^ 2 - \sigma^2 - n \mu^2 \\ \\ + & = & (n - 1) \sigma^2 + \end{eqnarray*} +\]

+ +
+ +
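The derivation shows that the sum of squared deviations has expectation \((n - 1)\sigma^2\), so dividing by \(n - 1\) gives an unbiased estimator. A rough simulation sketch follows, with assumed values \(\sigma^2 = 4\), \(n = 5\) and normal data; none of these settings come from the slides.

```r
set.seed(3)      # arbitrary seed so the run is reproducible
sigma2 <- 4      # true variance
n      <- 5      # small sample size, where the bias is easiest to see
n_sims <- 20000  # number of simulated samples

# Sum of squared deviations from the sample mean, one value per simulated sample.
ss <- replicate(n_sims, {
  x <- rnorm(n, mean = 10, sd = sqrt(sigma2))
  sum((x - mean(x))^2)
})

mean_div_n_minus_1 <- mean(ss / (n - 1))  # should be near sigma2 = 4
mean_div_n         <- mean(ss / n)        # should be near (n - 1) / n * sigma2 = 3.2

c(divide_by_n_minus_1 = mean_div_n_minus_1, divide_by_n = mean_div_n)
```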
+ + +
+

Hoping to avoid some confusion

+
+
+
    +
  • Suppose \(X_i\) are iid with mean \(\mu\) and variance \(\sigma^2\)
  • +
  • \(S^2\) estimates \(\sigma^2\)
  • +
  • The calculation of \(S^2\) involves dividing by \(n-1\)
  • +
  • \(S / \sqrt{n}\) estimates \(\sigma / \sqrt{n}\), the standard error of the mean
  • +
  • \(S / \sqrt{n}\) is called the sample standard error (of the mean)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
library(UsingR)  # assumption: father.son is loaded from the UsingR package
+data(father.son)
+x <- father.son$sheight
+n <- length(x)
+
+ +
+ +
+ + +
+

plot of chunk unnamed-chunk-2

+ +
round(c(sum((x - mean(x))^2)/(n - 1), var(x), var(x)/n, sd(x), sd(x)/sqrt(n)), 
+    2)
+
+ +
## [1] 7.92 7.92 0.01 2.81 0.09
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/01_04_Independence/index.md b/06_StatisticalInference/01_04_Independence/index.md index 12b7e2264..6ba1c02e5 100644 --- a/06_StatisticalInference/01_04_Independence/index.md +++ b/06_StatisticalInference/01_04_Independence/index.md @@ -1,216 +1,216 @@ ---- -title : Independence -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Independent events - -- Two events $A$ and $B$ are **independent** if $$P(A \cap B) = P(A)P(B)$$ -- Two random variables, $X$ and $Y$ are independent if for any two sets $A$ and $B$ $$P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)$$ -- If $A$ is independent of $B$ then - - - $A^c$ is independent of $B$ - - $A$ is independent of $B^c$ - - $A^c$ is independent of $B^c$ - - ---- - -## Example - -- What is the probability of getting two consecutive heads? 
-- $A = \{\mbox{Head on flip 1}\}$ ~ $P(A) = .5$ -- $B = \{\mbox{Head on flip 2}\}$ ~ $P(B) = .5$ -- $A \cap B = \{\mbox{Head on flips 1 and 2}\}$ -- $P(A \cap B) = P(A)P(B) = .5 \times .5 = .25$ - ---- - -## Example - -- Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial -- Based on an estimated prevalence of sudden infant death syndrome of $1$ out of $8,543$, Dr Meadow testified that that the probability of a mother having two children with SIDS was $\left(\frac{1}{8,543}\right)^2$ -- The mother on trial was convicted of murder - ---- - -## Example: continued - -- For the purposes of this class, the principal mistake was to *assume* that the probabilities of having SIDs within a family are independent -- That is, $P(A_1 \cap A_2)$ is not necessarily equal to $P(A_1)P(A_2)$ -- Biological processes that have a believed genetic or familiar environmental component, of course, tend to be dependent within families -- (There are many other statistical points of discussion for this case.) - ---- - -## Useful fact - -We will use the following fact extensively in this class: - -*If a collection of random variables $X_1, X_2, \ldots, X_n$ are independent, then their joint distribution is the product of their individual densities or mass functions* - -*That is, if $f_i$ is the density for random variable $X_i$ we have that* -$$ -f(x_1,\ldots, x_n) = \prod_{i=1}^n f_i(x_i) -$$ - ---- - -## IID random variables - -- Random variables are said to be iid if they are independent and identically distributed -- iid random variables are the default model for random samples -- Many of the important theories of statistics are founded on assuming that variables are iid - - ---- - -## Example - -- Suppose that we flip a biased coin with success probability $p$ $n$ times, what is the join density of the collection of outcomes? 
-- These random variables are iid with densities $p^{x_i} (1 - p)^{1-x_i}$ -- Therefore - $$ - f(x_1,\ldots,x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1-x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} - $$ - ---- - -## Correlation - -- The **covariance** between two random variables $X$ and $Y$ is defined as -$$ -Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E[X Y] - E[X]E[Y] -$$ -- The following are useful facts about covariance - 1. $Cov(X, Y) = Cov(Y, X)$ - 2. $Cov(X, Y)$ can be negative or positive - 3. $|Cov(X, Y)| \leq \sqrt{Var(X) Var(y)}$ - ---- - -## Correlation - -- The **correlation** between $X$ and $Y$ is -$$ -Cor(X, Y) = Cov(X, Y) / \sqrt{Var(X) Var(y)} -$$ - - 1. $-1 \leq Cor(X, Y) \leq 1$ - 2. $Cor(X, Y) = \pm 1$ if and only if $X = a + bY$ for some constants $a$ and $b$ - 3. $Cor(X, Y)$ is unitless - 4. $X$ and $Y$ are **uncorrelated** if $Cor(X, Y) = 0$ - 5. $X$ and $Y$ are more positively correlated, the closer $Cor(X,Y)$ is to $1$ - 6. $X$ and $Y$ are more negatively correlated, the closer $Cor(X,Y)$ is to $-1$ - ---- - -## Some useful results - -- Let $\{X_i\}_{i=1}^n$ be a collection of random variables - - When the $\{X_i\}$ are uncorrelated $$Var\left(\sum_{i=1}^n a_i X_i + b\right) = \sum_{i=1}^n a_i^2 Var(X_i)$$ - -- A commonly used subcase from these properties is that *if a collection of random variables $\{X_i\}$ are uncorrelated*, then the variance of the sum is the sum of the variances -$$ -Var\left(\sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i) -$$ -- Therefore, it is sums of variances that tend to be useful, not sums of standard deviations; that is, the standard deviation of the sum of bunch of independent random variables is the square root of the sum of the variances, not the sum of the standard deviations - ---- - -## The sample mean - -Suppose $X_i$ are iid with variance $\sigma^2$ - -$$ -\begin{eqnarray*} - Var(\bar X) & = & Var \left( \frac{1}{n}\sum_{i=1}^n X_i \right)\\ \\ - & = & \frac{1}{n^2} Var\left(\sum_{i=1}^n X_i \right)\\ \\ - & = 
& \frac{1}{n^2} \sum_{i=1}^n Var(X_i) \\ \\ - & = & \frac{1}{n^2} \times n\sigma^2 \\ \\ - & = & \frac{\sigma^2}{n} - \end{eqnarray*} -$$ - ---- - -## Some comments - -- When $X_i$ are independent with a common variance $Var(\bar X) = \frac{\sigma^2}{n}$ -- $\sigma/\sqrt{n}$ is called *the standard error* of the sample mean -- The standard error of the sample mean is the standard deviation of the distribution of the sample mean -- $\sigma$ is the standard deviation of the distribution of a single observation -- Easy way to remember, the sample mean has to be less variable than a single observation, therefore its standard deviation is divided by a $\sqrt{n}$ - ---- - -## The sample variance -- The **sample variance** is defined as -$$ -S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} -$$ -- The sample variance is an estimator of $\sigma^2$ -- The numerator has a version that's quicker for calculation -$$ -\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - n \bar X^2 -$$ -- The sample variance is (nearly) the mean of the squared deviations from the mean - ---- - -## The sample variance is unbiased - -$$ - \begin{eqnarray*} - E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] & = & \sum_{i=1}^n E\left[X_i^2\right] - n E\left[\bar X^2\right] \\ \\ - & = & \sum_{i=1}^n \left\{Var(X_i) + \mu^2\right\} - n \left\{Var(\bar X) + \mu^2\right\} \\ \\ - & = & \sum_{i=1}^n \left\{\sigma^2 + \mu^2\right\} - n \left\{\sigma^2 / n + \mu^2\right\} \\ \\ - & = & n \sigma^2 + n \mu ^ 2 - \sigma^2 - n \mu^2 \\ \\ - & = & (n - 1) \sigma^2 - \end{eqnarray*} -$$ - ---- - -## Hoping to avoid some confusion - -- Suppose $X_i$ are iid with mean $\mu$ and variance $\sigma^2$ -- $S^2$ estimates $\sigma^2$ -- The calculation of $S^2$ involves dividing by $n-1$ -- $S / \sqrt{n}$ estimates $\sigma / \sqrt{n}$ the standard error of the mean -- $S / \sqrt{n}$ is called the sample standard error (of the mean) - ---- -## Example - -```r -data(father.son) -x <- father.son$sheight -n <- length(x) -``` - - 
---- -![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) - - -```r -round(c(sum((x - mean(x))^2)/(n - 1), var(x), var(x)/n, sd(x), sd(x)/sqrt(n)), - 2) -``` - -``` -## [1] 7.92 7.92 0.01 2.81 0.09 -``` - +--- +title : Independence +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Independent events + +- Two events $A$ and $B$ are **independent** if $$P(A \cap B) = P(A)P(B)$$ +- Two random variables, $X$ and $Y$ are independent if for any two sets $A$ and $B$ $$P([X \in A] \cap [Y \in B]) = P(X\in A)P(Y\in B)$$ +- If $A$ is independent of $B$ then + + - $A^c$ is independent of $B$ + - $A$ is independent of $B^c$ + - $A^c$ is independent of $B^c$ + + +--- + +## Example + +- What is the probability of getting two consecutive heads? 
+- $A = \{\mbox{Head on flip 1}\}$ ~ $P(A) = .5$ +- $B = \{\mbox{Head on flip 2}\}$ ~ $P(B) = .5$ +- $A \cap B = \{\mbox{Head on flips 1 and 2}\}$ +- $P(A \cap B) = P(A)P(B) = .5 \times .5 = .25$ + +--- + +## Example + +- Volume 309 of Science reports on a physician who was on trial for expert testimony in a criminal trial +- Based on an estimated prevalence of sudden infant death syndrome of $1$ out of $8,543$, Dr Meadow testified that that the probability of a mother having two children with SIDS was $\left(\frac{1}{8,543}\right)^2$ +- The mother on trial was convicted of murder + +--- + +## Example: continued + +- For the purposes of this class, the principal mistake was to *assume* that the probabilities of having SIDs within a family are independent +- That is, $P(A_1 \cap A_2)$ is not necessarily equal to $P(A_1)P(A_2)$ +- Biological processes that have a believed genetic or familiar environmental component, of course, tend to be dependent within families +- (There are many other statistical points of discussion for this case.) + +--- + +## Useful fact + +We will use the following fact extensively in this class: + +*If a collection of random variables $X_1, X_2, \ldots, X_n$ are independent, then their joint distribution is the product of their individual densities or mass functions* + +*That is, if $f_i$ is the density for random variable $X_i$ we have that* +$$ +f(x_1,\ldots, x_n) = \prod_{i=1}^n f_i(x_i) +$$ + +--- + +## IID random variables + +- Random variables are said to be iid if they are independent and identically distributed +- iid random variables are the default model for random samples +- Many of the important theories of statistics are founded on assuming that variables are iid + + +--- + +## Example + +- Suppose that we flip a biased coin with success probability $p$ $n$ times, what is the join density of the collection of outcomes? 
+- These random variables are iid with densities $p^{x_i} (1 - p)^{1-x_i}$ +- Therefore + $$ + f(x_1,\ldots,x_n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1-x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} + $$ + +--- + +## Correlation + +- The **covariance** between two random variables $X$ and $Y$ is defined as +$$ +Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E[X Y] - E[X]E[Y] +$$ +- The following are useful facts about covariance + 1. $Cov(X, Y) = Cov(Y, X)$ + 2. $Cov(X, Y)$ can be negative or positive + 3. $|Cov(X, Y)| \leq \sqrt{Var(X) Var(y)}$ + +--- + +## Correlation + +- The **correlation** between $X$ and $Y$ is +$$ +Cor(X, Y) = Cov(X, Y) / \sqrt{Var(X) Var(y)} +$$ + + 1. $-1 \leq Cor(X, Y) \leq 1$ + 2. $Cor(X, Y) = \pm 1$ if and only if $X = a + bY$ for some constants $a$ and $b$ + 3. $Cor(X, Y)$ is unitless + 4. $X$ and $Y$ are **uncorrelated** if $Cor(X, Y) = 0$ + 5. $X$ and $Y$ are more positively correlated, the closer $Cor(X,Y)$ is to $1$ + 6. $X$ and $Y$ are more negatively correlated, the closer $Cor(X,Y)$ is to $-1$ + +--- + +## Some useful results + +- Let $\{X_i\}_{i=1}^n$ be a collection of random variables + - When the $\{X_i\}$ are uncorrelated $$Var\left(\sum_{i=1}^n a_i X_i + b\right) = \sum_{i=1}^n a_i^2 Var(X_i)$$ + +- A commonly used subcase from these properties is that *if a collection of random variables $\{X_i\}$ are uncorrelated*, then the variance of the sum is the sum of the variances +$$ +Var\left(\sum_{i=1}^n X_i \right) = \sum_{i=1}^n Var(X_i) +$$ +- Therefore, it is sums of variances that tend to be useful, not sums of standard deviations; that is, the standard deviation of the sum of bunch of independent random variables is the square root of the sum of the variances, not the sum of the standard deviations + +--- + +## The sample mean + +Suppose $X_i$ are iid with variance $\sigma^2$ + +$$ +\begin{eqnarray*} + Var(\bar X) & = & Var \left( \frac{1}{n}\sum_{i=1}^n X_i \right)\\ \\ + & = & \frac{1}{n^2} Var\left(\sum_{i=1}^n X_i \right)\\ \\ + & = 
& \frac{1}{n^2} \sum_{i=1}^n Var(X_i) \\ \\ + & = & \frac{1}{n^2} \times n\sigma^2 \\ \\ + & = & \frac{\sigma^2}{n} + \end{eqnarray*} +$$ + +--- + +## Some comments + +- When $X_i$ are independent with a common variance $Var(\bar X) = \frac{\sigma^2}{n}$ +- $\sigma/\sqrt{n}$ is called *the standard error* of the sample mean +- The standard error of the sample mean is the standard deviation of the distribution of the sample mean +- $\sigma$ is the standard deviation of the distribution of a single observation +- Easy way to remember, the sample mean has to be less variable than a single observation, therefore its standard deviation is divided by a $\sqrt{n}$ + +--- + +## The sample variance +- The **sample variance** is defined as +$$ +S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} +$$ +- The sample variance is an estimator of $\sigma^2$ +- The numerator has a version that's quicker for calculation +$$ +\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - n \bar X^2 +$$ +- The sample variance is (nearly) the mean of the squared deviations from the mean + +--- + +## The sample variance is unbiased + +$$ + \begin{eqnarray*} + E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] & = & \sum_{i=1}^n E\left[X_i^2\right] - n E\left[\bar X^2\right] \\ \\ + & = & \sum_{i=1}^n \left\{Var(X_i) + \mu^2\right\} - n \left\{Var(\bar X) + \mu^2\right\} \\ \\ + & = & \sum_{i=1}^n \left\{\sigma^2 + \mu^2\right\} - n \left\{\sigma^2 / n + \mu^2\right\} \\ \\ + & = & n \sigma^2 + n \mu ^ 2 - \sigma^2 - n \mu^2 \\ \\ + & = & (n - 1) \sigma^2 + \end{eqnarray*} +$$ + +--- + +## Hoping to avoid some confusion + +- Suppose $X_i$ are iid with mean $\mu$ and variance $\sigma^2$ +- $S^2$ estimates $\sigma^2$ +- The calculation of $S^2$ involves dividing by $n-1$ +- $S / \sqrt{n}$ estimates $\sigma / \sqrt{n}$ the standard error of the mean +- $S / \sqrt{n}$ is called the sample standard error (of the mean) + +--- +## Example + +```r +data(father.son) +x <- father.son$sheight +n <- length(x) +``` + + 
+--- +![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) + + +```r +round(c(sum((x - mean(x))^2)/(n - 1), var(x), var(x)/n, sd(x), sd(x)/sqrt(n)), + 2) +``` + +``` +## [1] 7.92 7.92 0.01 2.81 0.09 +``` + diff --git a/06_StatisticalInference/01_04_Independence/index.pdf b/06_StatisticalInference/01_04_Independence/index.pdf index ba92e4d8e..fd2201506 100644 Binary files a/06_StatisticalInference/01_04_Independence/index.pdf and b/06_StatisticalInference/01_04_Independence/index.pdf differ diff --git a/06_StatisticalInference/01_05_ConditionalProbability/index.html b/06_StatisticalInference/01_05_ConditionalProbability/index.html index dba5dc912..0267feacb 100644 --- a/06_StatisticalInference/01_05_ConditionalProbability/index.html +++ b/06_StatisticalInference/01_05_ConditionalProbability/index.html @@ -1,411 +1,411 @@ - - - - Conditional Probability - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Conditional Probability

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Conditional probability, motivation

-
-
-
    -
  • The probability of getting a one when rolling a (standard) die -is usually assumed to be one sixth
  • -
  • Suppose you were given the extra information that the die roll -was an odd number (hence 1, 3 or 5)
  • -
  • conditional on this new information, the probability of a -one is now one third
  • -
- -
- -
- - -
-

Conditional probability, definition

-
-
-
    -
  • Let \(B\) be an event so that \(P(B) > 0\)
  • -
  • Then the conditional probability of an event \(A\) given that \(B\) has occurred is -\[ -P(A ~|~ B) = \frac{P(A \cap B)}{P(B)} -\]
  • -
  • Notice that if \(A\) and \(B\) are independent, then -\[ -P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A) -\]
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Consider our die roll example
  • -
  • \(B = \{1, 3, 5\}\)
  • -
  • \(A = \{1\}\) -\[ -\begin{eqnarray*} -P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\ -& = & \frac{P(A \cap B)}{P(B)} \\ \\ -& = & \frac{P(A)}{P(B)} \\ \\ -& = & \frac{1/6}{3/6} = \frac{1}{3} -\end{eqnarray*} -\]
  • -
- -
- -
- - -
-

Bayes' rule

-
-
-

\[ -P(B ~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~ B^c)P(B^c)}. -\]

- -
- -
- - -
-

Diagnostic tests

-
-
-
    -
  • Let \(+\) and \(-\) be the events that the result of a diagnostic test is positive or negative respectively
  • -
  • Let \(D\) and \(D^c\) be the event that the subject of the test has or does not have the disease respectively
  • -
  • The sensitivity is the probability that the test is positive given that the subject actually has the disease, \(P(+ ~|~ D)\)
  • -
  • The specificity is the probability that the test is negative given that the subject does not have the disease, \(P(- ~|~ D^c)\)
  • -
- -
- -
- - -
-

More definitions

-
-
-
    -
  • The positive predictive value is the probability that the subject has the disease given that the test is positive, \(P(D ~|~ +)\)
  • -
  • The negative predictive value is the probability that the subject does not have the disease given that the test is negative, \(P(D^c ~|~ -)\)
  • -
  • The prevalence of the disease is the marginal probability of disease, \(P(D)\)
  • -
- -
- -
- - -
-

More definitions

-
-
-
    -
  • The diagnostic likelihood ratio of a positive test, labeled \(DLR_+\), is \(P(+ ~|~ D) / P(+ ~|~ D^c)\), which is the \[sensitivity / (1 - specificity)\]
  • -
  • The diagnostic likelihood ratio of a negative test, labeled \(DLR_-\), is \(P(- ~|~ D) / P(- ~|~ D^c)\), which is the \[(1 - sensitivity) / specificity\]
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
  • -
  • Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the probability that this subject has HIV?
  • -
  • Mathematically, we want \(P(D ~|~ +)\) given the sensitivity, \(P(+ ~|~ D) = .997\), the specificity, \(P(- ~|~ D^c) =.985\), and the prevalence \(P(D) = .001\)
  • -
- -
- -
- - -
-

Using Bayes' formula

-
-
-

\[ -\begin{eqnarray*} - P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\ - & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\ - & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\ - & = & .062 -\end{eqnarray*} -\]

- -
    -
  • In this population a positive test result only suggests a 6% probability that the subject has the disease
  • -
  • (The positive predictive value is 6% for this test)
  • -
- -
- -
- - -
-

More on this example

-
-
-
    -
  • The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
  • -
  • Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV-infected partner; the relevant prevalence, and hence the positive predictive value, would then be higher
  • -
  • Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
  • -
- -
- -
- - -
-

Likelihood ratios

-
-
-
    -
  • Using Bayes' rule, we have -\[ -P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)} -\] -and -\[ -P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}. -\]
  • -
- -
- -
- - -
-

Likelihood ratios

-
-
-
    -
  • Therefore, dividing the first expression by the second (the common denominator cancels), -\[ -\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)} -\] -i.e., -\[ -\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D -\]
  • -
  • Similarly, \(DLR_-\) relates the decrease in the odds of the -disease after a negative test result to the odds of disease prior to -the test.
  • -
- -
- -
- - -
-

HIV example revisited

-
-
-
    -
  • Suppose a subject has a positive HIV test
  • -
  • \(DLR_+ = .997 / (1 - .985) \approx 66\)
  • -
  • The result of the positive test is that the odds of disease are now 66 times the pretest odds
  • -
  • Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
  • -
- -
- -
- - -
-

HIV example revisited

-
-
-
    -
  • Suppose that a subject has a negative test result
  • -
  • \(DLR_- = (1 - .997) / .985 \approx .003\)
  • -
  • Therefore, the post-test odds of disease are now \(.3\%\) of the pretest odds given the negative test.
  • -
  • Or, the hypothesis of disease is \(.003\) times as supported by the data as the hypothesis of absence of disease, given the negative test result
  • -
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Conditional Probability + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Conditional Probability

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Conditional probability, motivation

+
+
+
    +
  • The probability of getting a one when rolling a (standard) die +is usually assumed to be one sixth
  • +
  • Suppose you were given the extra information that the die roll +was an odd number (hence 1, 3 or 5)
  • +
  • Conditional on this new information, the probability of a +one is now one third
  • +
+ +
+ +
+ + +
+

Conditional probability, definition

+
+
+
    +
  • Let \(B\) be an event so that \(P(B) > 0\)
  • +
  • Then the conditional probability of an event \(A\) given that \(B\) has occurred is +\[ +P(A ~|~ B) = \frac{P(A \cap B)}{P(B)} +\]
  • +
  • Notice that if \(A\) and \(B\) are independent, then +\[ +P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A) +\]
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Consider our die roll example
  • +
  • \(B = \{1, 3, 5\}\)
  • +
  • \(A = \{1\}\) +\[ +\begin{eqnarray*} +P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\ +& = & \frac{P(A \cap B)}{P(B)} \\ \\ +& = & \frac{P(A)}{P(B)} \\ \\ +& = & \frac{1/6}{3/6} = \frac{1}{3} +\end{eqnarray*} +\]
  • +
+ +
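As a quick check, the die calculation can be reproduced by enumerating the sample space in R. This is a minimal sketch assuming a fair die; the variable names (`omega`, `p_A_given_B`, and so on) are illustrative and do not come from the slides.

```r
# Sample space of a standard six-sided die
omega <- 1:6
p <- rep(1 / 6, 6)   # assumption: every face is equally likely

A <- c(1)            # event "the roll is a one"
B <- c(1, 3, 5)      # event "the roll is odd"

# P(B) and P(A and B), summing the probabilities of the outcomes in each event
p_B <- sum(p[omega %in% B])
p_A_and_B <- sum(p[omega %in% intersect(A, B)])

# Definition of conditional probability: P(A | B) = P(A and B) / P(B)
p_A_given_B <- p_A_and_B / p_B
p_A_given_B
```

Because \(A\) is a subset of \(B\) here, `intersect(A, B)` is just `A`, which mirrors the step \(P(A \cap B) = P(A)\) in the derivation.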
+ +
+ + +
+

Bayes' rule

+
+
+

\[ +P(B ~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~ B^c)P(B^c)}. +\]

+ +
+ +
+ + +
+

Diagnostic tests

+
+
+
    +
  • Let \(+\) and \(-\) be the events that the result of a diagnostic test is positive or negative respectively
  • +
  • Let \(D\) and \(D^c\) be the events that the subject of the test has or does not have the disease respectively
  • +
  • The sensitivity is the probability that the test is positive given that the subject actually has the disease, \(P(+ ~|~ D)\)
  • +
  • The specificity is the probability that the test is negative given that the subject does not have the disease, \(P(- ~|~ D^c)\)
  • +
+ +
+ +
+ + +
+

More definitions

+
+
+
    +
  • The positive predictive value is the probability that the subject has the disease given that the test is positive, \(P(D ~|~ +)\)
  • +
  • The negative predictive value is the probability that the subject does not have the disease given that the test is negative, \(P(D^c ~|~ -)\)
  • +
  • The prevalence of the disease is the marginal probability of disease, \(P(D)\)
  • +
+ +
+ +
+ + +
+

More definitions

+
+
+
    +
  • The diagnostic likelihood ratio of a positive test, labeled \(DLR_+\), is \(P(+ ~|~ D) / P(+ ~|~ D^c)\), which equals \[\mbox{sensitivity} / (1 - \mbox{specificity})\]
  • +
  • The diagnostic likelihood ratio of a negative test, labeled \(DLR_-\), is \(P(- ~|~ D) / P(- ~|~ D^c)\), which equals \[(1 - \mbox{sensitivity}) / \mbox{specificity}\]
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • A study comparing the efficacy of HIV tests reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%
  • +
  • Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the probability that this subject has HIV?
  • +
  • Mathematically, we want \(P(D ~|~ +)\) given the sensitivity, \(P(+ ~|~ D) = .997\), the specificity, \(P(- ~|~ D^c) =.985\), and the prevalence \(P(D) = .001\)
  • +
+ +
+ +
+ + +
+

Using Bayes' formula

+
+
+

\[ +\begin{eqnarray*} + P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\ + & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\ + & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\ + & = & .062 +\end{eqnarray*} +\]

+ +
    +
  • In this population a positive test result only suggests a 6% probability that the subject has the disease
  • +
  • (The positive predictive value is 6% for this test)
  • +
+ +
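The arithmetic in Bayes' formula above can be checked directly in R. This is a small sketch using only the three numbers quoted in the example; the variable names (`sens`, `spec`, `prev`, `ppv`) are illustrative choices, not part of the original slides.

```r
sens <- 0.997   # sensitivity, P(+ | D)
spec <- 0.985   # specificity, P(- | D^c)
prev <- 0.001   # prevalence,  P(D)

# Bayes' rule: P(D | +) = P(+ | D) P(D) / {P(+ | D) P(D) + P(+ | D^c) P(D^c)}
# where P(+ | D^c) = 1 - specificity and P(D^c) = 1 - prevalence
ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
round(ppv, 3)
```

The denominator is the marginal probability of a positive test, \(P(+)\), which is dominated by false positives when the prevalence is this low.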
+ +
+ + +
+

More on this example

+
+
+
    +
  • The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity
  • +
  • Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV-infected partner; the relevant prevalence, and hence the positive predictive value, would then be higher
  • +
  • Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes
  • +
+ +
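To see how the interpretation of a positive result shifts with the population, the sketch below recomputes the positive predictive value over a few prevalences. The helper `ppv()` and every prevalence other than .001 are hypothetical, chosen only for illustration; the sensitivity and specificity are the ones quoted in the example.

```r
# Positive predictive value as a function of prevalence (hypothetical helper)
ppv <- function(prev, sens = 0.997, spec = 0.985) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

# Illustrative prevalences; only 0.001 comes from the example
prevalences <- c(0.001, 0.01, 0.1, 0.5)

data.frame(prevalence = prevalences,
           ppv = round(ppv(prevalences), 3))
```

The test characteristics are held fixed throughout; only the assumed prevalence changes, and the positive predictive value rises with it.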
+ +
+ + +
+

Likelihood ratios

+
+
+
    +
  • Using Bayes' rule, we have +\[ +P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)} +\] +and +\[ +P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}. +\]
  • +
+ +
+ +
+ + +
+

Likelihood ratios

+
+
+
    +
  • Therefore, dividing the first expression by the second (the common denominator cancels), +\[ +\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)} +\] +i.e., +\[ +\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D +\]
  • +
  • Similarly, \(DLR_-\) relates the decrease in the odds of the +disease after a negative test result to the odds of disease prior to +the test.
  • +
+ +
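The odds identity above can be verified numerically. The sketch below uses the sensitivity, specificity and prevalence from the HIV example and checks that converting the post-test odds back to a probability recovers the value given by Bayes' rule; all variable names are illustrative.

```r
sens <- 0.997   # P(+ | D)
spec <- 0.985   # P(- | D^c)
prev <- 0.001   # P(D)

dlr_pos   <- sens / (1 - spec)     # DLR_+
pre_odds  <- prev / (1 - prev)     # pre-test odds of D
post_odds <- dlr_pos * pre_odds    # post-test odds of D

# Convert odds back to a probability: p = odds / (1 + odds)
post_prob <- post_odds / (1 + post_odds)

# Direct application of Bayes' rule for comparison
ppv_bayes <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))

all.equal(post_prob, ppv_bayes)
```

Working on the odds scale separates the evidence carried by the test (\(DLR_+\)) from the prior information about the subject (the pre-test odds).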
+ +
+ + +
+

HIV example revisited

+
+
+
    +
  • Suppose a subject has a positive HIV test
  • +
  • \(DLR_+ = .997 / (1 - .985) \approx 66\)
  • +
  • The result of the positive test is that the odds of disease are now 66 times the pretest odds
  • +
  • Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease
  • +
+ +
+ +
+ + +
+

HIV example revisited

+
+
+
    +
  • Suppose that a subject has a negative test result
  • +
  • \(DLR_- = (1 - .997) / .985 \approx .003\)
  • +
  • Therefore, the post-test odds of disease are now \(.3\%\) of the pretest odds given the negative test.
  • +
  • Or, the hypothesis of disease is \(.003\) times as supported by the data as the hypothesis of absence of disease, given the negative test result
  • +
+ +
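The negative-test case can be sketched the same way. The code below assumes the .1% prevalence from the earlier example (the slide itself does not restate it) and uses illustrative variable names; it computes \(DLR_-\), the post-test odds, and the implied negative predictive value.

```r
sens <- 0.997   # P(+ | D)
spec <- 0.985   # P(- | D^c)
prev <- 0.001   # P(D), assumed to be the prevalence from the earlier example

dlr_neg  <- (1 - sens) / spec      # DLR_-
pre_odds <- prev / (1 - prev)      # pre-test odds of D

# Post-test odds of disease after a negative result
post_odds_neg <- dlr_neg * pre_odds

# P(D | -) from the odds, and the negative predictive value P(D^c | -)
p_disease_given_neg <- post_odds_neg / (1 + post_odds_neg)
npv <- 1 - p_disease_given_neg

c(dlr_neg = dlr_neg, npv = npv)
```

A \(DLR_-\) this close to zero means a negative result shrinks the odds of disease sharply, whatever the pre-test odds were.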
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/01_05_ConditionalProbability/index.md b/06_StatisticalInference/01_05_ConditionalProbability/index.md index db2151e51..93dad87ce 100644 --- a/06_StatisticalInference/01_05_ConditionalProbability/index.md +++ b/06_StatisticalInference/01_05_ConditionalProbability/index.md @@ -1,169 +1,169 @@ ---- -title : Conditional Probability -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - -## Conditional probability, motivation - -- The probability of getting a one when rolling a (standard) die - is usually assumed to be one sixth -- Suppose you were given the extra information that the die roll - was an odd number (hence 1, 3 or 5) -- *conditional on this new information*, the probability of a - one is now one third - ---- - -## Conditional probability, definition - -- Let $B$ be an event so that $P(B) > 0$ -- Then the conditional probability of an event $A$ given that $B$ has occurred is - $$ - P(A ~|~ B) = \frac{P(A \cap B)}{P(B)} - $$ -- Notice that if $A$ and $B$ are independent, then - $$ - P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A) - $$ - ---- - -## Example - -- Consider our die roll example -- $B = \{1, 3, 5\}$ -- $A = \{1\}$ -$$ - \begin{eqnarray*} -P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\ - & = & \frac{P(A \cap B)}{P(B)} \\ \\ - & = & \frac{P(A)}{P(B)} \\ \\ - & = & \frac{1/6}{3/6} = \frac{1}{3} - \end{eqnarray*} -$$ - - - ---- - -## Bayes' rule - -$$ -P(B ~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~ B^c)P(B^c)}. 
-$$ - - ---- - -## Diagnostic tests - -- Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative respectively -- Let $D$ and $D^c$ be the event that the subject of the test has or does not have the disease respectively -- The **sensitivity** is the probability that the test is positive given that the subject actually has the disease, $P(+ ~|~ D)$ -- The **specificity** is the probability that the test is negative given that the subject does not have the disease, $P(- ~|~ D^c)$ - ---- - -## More definitions - -- The **positive predictive value** is the probability that the subject has the disease given that the test is positive, $P(D ~|~ +)$ -- The **negative predictive value** is the probability that the subject does not have the disease given that the test is negative, $P(D^c ~|~ -)$ -- The **prevalence of the disease** is the marginal probability of disease, $P(D)$ - ---- - -## More definitions - -- The **diagnostic likelihood ratio of a positive test**, labeled $DLR_+$, is $P(+ ~|~ D) / P(+ ~|~ D^c)$, which is the $$sensitivity / (1 - specificity)$$ -- The **diagnostic likelihood ratio of a negative test**, labeled $DLR_-$, is $P(- ~|~ D) / P(- ~|~ D^c)$, which is the $$(1 - sensitivity) / specificity$$ - ---- - -## Example - -- A study comparing the efficacy of HIV tests, reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5% -- Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the probability that this subject has HIV? 
-- Mathematically, we want $P(D ~|~ +)$ given the sensitivity, $P(+ ~|~ D) = .997$, the specificity, $P(- ~|~ D^c) =.985$, and the prevalence $P(D) = .001$ - ---- - -## Using Bayes' formula - -$$ -\begin{eqnarray*} - P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\ - & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\ - & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\ - & = & .062 -\end{eqnarray*} -$$ - -- In this population a positive test result only suggests a 6% probability that the subject has the disease -- (The positive predictive value is 6% for this test) - ---- - -## More on this example - -- The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity -- Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV infected partner -- Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes - ---- - -## Likelihood ratios - -- Using Bayes rule, we have - $$ - P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)} - $$ - and - $$ - P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}. - $$ - ---- - -## Likelihood ratios - -- Therefore -$$ -\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)} -$$ -ie -$$ -\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D -$$ -- Similarly, $DLR_-$ relates the decrease in the odds of the - disease after a negative test result to the odds of disease prior to - the test. 
- ---- - -## HIV example revisited - -- Suppose a subject has a positive HIV test -- $DLR_+ = .997 / (1 - .985) \approx 66$ -- The result of the positive test is that the odds of disease is now 66 times the pretest odds -- Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease - ---- - -## HIV example revisited - -- Suppose that a subject has a negative test result -- $DLR_- = (1 - .997) / .985 \approx .003$ -- Therefore, the post-test odds of disease is now $.3\%$ of the pretest odds given the negative test. -- Or, the hypothesis of disease is supported $.003$ times that of the hypothesis of absence of disease given the negative test result - +--- +title : Conditional Probability +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Conditional probability, motivation + +- The probability of getting a one when rolling a (standard) die + is usually assumed to be one sixth +- Suppose you were given the extra information that the die roll + was an odd number (hence 1, 3 or 5) +- *conditional on this new information*, the probability of a + one is now one third + +--- + +## Conditional probability, definition + +- Let $B$ be an event so that $P(B) > 0$ +- Then the conditional probability of an event $A$ given that $B$ has occurred is + $$ + P(A ~|~ B) = \frac{P(A \cap B)}{P(B)} + $$ +- Notice that if $A$ and $B$ are independent, then + $$ + P(A ~|~ B) = \frac{P(A) P(B)}{P(B)} = P(A) + $$ + +--- + +## Example + +- Consider our die roll example +- $B = \{1, 3, 5\}$ +- $A = \{1\}$ +$$ + 
\begin{eqnarray*} +P(\mbox{one given that roll is odd}) & = & P(A ~|~ B) \\ \\ + & = & \frac{P(A \cap B)}{P(B)} \\ \\ + & = & \frac{P(A)}{P(B)} \\ \\ + & = & \frac{1/6}{3/6} = \frac{1}{3} + \end{eqnarray*} +$$ + + + +--- + +## Bayes' rule + +$$ +P(B ~|~ A) = \frac{P(A ~|~ B) P(B)}{P(A ~|~ B) P(B) + P(A ~|~ B^c)P(B^c)}. +$$ + + +--- + +## Diagnostic tests + +- Let $+$ and $-$ be the events that the result of a diagnostic test is positive or negative respectively +- Let $D$ and $D^c$ be the event that the subject of the test has or does not have the disease respectively +- The **sensitivity** is the probability that the test is positive given that the subject actually has the disease, $P(+ ~|~ D)$ +- The **specificity** is the probability that the test is negative given that the subject does not have the disease, $P(- ~|~ D^c)$ + +--- + +## More definitions + +- The **positive predictive value** is the probability that the subject has the disease given that the test is positive, $P(D ~|~ +)$ +- The **negative predictive value** is the probability that the subject does not have the disease given that the test is negative, $P(D^c ~|~ -)$ +- The **prevalence of the disease** is the marginal probability of disease, $P(D)$ + +--- + +## More definitions + +- The **diagnostic likelihood ratio of a positive test**, labeled $DLR_+$, is $P(+ ~|~ D) / P(+ ~|~ D^c)$, which is the $$sensitivity / (1 - specificity)$$ +- The **diagnostic likelihood ratio of a negative test**, labeled $DLR_-$, is $P(- ~|~ D) / P(- ~|~ D^c)$, which is the $$(1 - sensitivity) / specificity$$ + +--- + +## Example + +- A study comparing the efficacy of HIV tests, reports on an experiment which concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5% +- Suppose that a subject, from a population with a .1% prevalence of HIV, receives a positive test result. What is the probability that this subject has HIV? 
+- Mathematically, we want $P(D ~|~ +)$ given the sensitivity, $P(+ ~|~ D) = .997$, the specificity, $P(- ~|~ D^c) =.985$, and the prevalence $P(D) = .001$ + +--- + +## Using Bayes' formula + +$$ +\begin{eqnarray*} + P(D ~|~ +) & = &\frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ \\ + & = & \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ \\ + & = & \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ \\ + & = & .062 +\end{eqnarray*} +$$ + +- In this population a positive test result only suggests a 6% probability that the subject has the disease +- (The positive predictive value is 6% for this test) + +--- + +## More on this example + +- The low positive predictive value is due to low prevalence of disease and the somewhat modest specificity +- Suppose it was known that the subject was an intravenous drug user and routinely had intercourse with an HIV infected partner +- Notice that the evidence implied by a positive test result does not change because of the prevalence of disease in the subject's population, only our interpretation of that evidence changes + +--- + +## Likelihood ratios + +- Using Bayes rule, we have + $$ + P(D ~|~ +) = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)} + $$ + and + $$ + P(D^c ~|~ +) = \frac{P(+~|~D^c)P(D^c)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}. + $$ + +--- + +## Likelihood ratios + +- Therefore +$$ +\frac{P(D ~|~ +)}{P(D^c ~|~ +)} = \frac{P(+~|~D)}{P(+~|~D^c)}\times \frac{P(D)}{P(D^c)} +$$ +ie +$$ +\mbox{post-test odds of }D = DLR_+\times\mbox{pre-test odds of }D +$$ +- Similarly, $DLR_-$ relates the decrease in the odds of the + disease after a negative test result to the odds of disease prior to + the test. 
+ +--- + +## HIV example revisited + +- Suppose a subject has a positive HIV test +- $DLR_+ = .997 / (1 - .985) \approx 66$ +- The result of the positive test is that the odds of disease is now 66 times the pretest odds +- Or, equivalently, the hypothesis of disease is 66 times more supported by the data than the hypothesis of no disease + +--- + +## HIV example revisited + +- Suppose that a subject has a negative test result +- $DLR_- = (1 - .997) / .985 \approx .003$ +- Therefore, the post-test odds of disease is now $.3\%$ of the pretest odds given the negative test. +- Or, the hypothesis of disease is supported $.003$ times that of the hypothesis of absence of disease given the negative test result + diff --git a/06_StatisticalInference/01_05_ConditionalProbability/index.pdf b/06_StatisticalInference/01_05_ConditionalProbability/index.pdf index 7cbac08c4..a5a7edead 100644 Binary files a/06_StatisticalInference/01_05_ConditionalProbability/index.pdf and b/06_StatisticalInference/01_05_ConditionalProbability/index.pdf differ diff --git a/06_StatisticalInference/02_01_CommonDistributions/index.Rmd b/06_StatisticalInference/02_01_CommonDistributions/index.Rmd index 7293d2964..586fd0eae 100644 --- a/06_StatisticalInference/02_01_CommonDistributions/index.Rmd +++ b/06_StatisticalInference/02_01_CommonDistributions/index.Rmd @@ -1,346 +1,331 @@ ---- -title : Some Common Distributions -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that 
can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## The Bernoulli distribution - -- The **Bernoulli distribution** arises as the result of a binary outcome -- Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively -- The PMF for a Bernoulli random variable $X$ is $$P(X = x) = p^x (1 - p)^{1 - x}$$ -- The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$ -- If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a "success" and $X=0$ as a "failure" - ---- - -## iid Bernoulli trials - -- If several iid Bernoulli observations, say $x_1,\ldots, x_n$, are observed the -likelihood is -$$ - \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} -$$ -- Notice that the likelihood depends only on the sum of the $x_i$ -- Because $n$ is fixed and assumed known, this implies that the sample proportion $\sum_i x_i / n$ contains all of the relevant information about $p$ -- We can maximize the Bernoulli likelihood over $p$ to obtain that $\hat p = \sum_i x_i / n$ is the maximum likelihood estimator for $p$ - ---- -## Plotting all possible likelihoods for a small n -``` -n <- 5 -pvals <- seq(0, 1, length = 1000) -plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood") -text((0 : n) /n, 1.1, as.character(0 : n)) -sapply(0 : n, function(x) { - phat <- x / n - if (x == 0) lines(pvals, ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3) - else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3) - else lines(pvals, (pvals / phat ) ^ x * ( (1 
- pvals) / (1 - phat) ) ^ (n-x), lwd = 3) - } -) -title(paste("Likelihoods for n = ", n)) -``` - ---- -```{r, fig.height=6, fig.width=6, echo = FALSE, results='hide'} -n <- 5 -pvals <- seq(0, 1, length = 1000) -plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood") -text((0 : n) /n, 1.1, as.character(0 : n)) -sapply(0 : n, function(x) { - phat <- x / n - if (x == 0) lines(pvals, ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3) - else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3) - else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) - } -) -title(paste("Likelihoods for n = ", n)) -``` - ---- - -## Binomial trials - -- The *binomial random variables* are obtained as the sum of iid Bernoulli trials -- In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable -- The binomial mass function is -$$ -P(X = x) = -\left( -\begin{array}{c} - n \\ x -\end{array} -\right) -p^x(1 - p)^{n-x} -$$ -for $x=0,\ldots,n$ - ---- - -## Choose - -- Recall that the notation - $$\left( - \begin{array}{c} - n \\ x - \end{array} - \right) = \frac{n!}{x!(n-x)!} - $$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ - without replacement disregarding the order of the items - -$$\left( - \begin{array}{c} - n \\ 0 - \end{array} - \right) = -\left( - \begin{array}{c} - n \\ n - \end{array} - \right) = 1 - $$ - ---- - -## Example justification of the binomial likelihood - -- Consider the probability of getting $6$ heads out of $10$ coin flips from a coin with success probability $p$ -- The probability of getting $6$ heads and $4$ tails in any specific order is - $$ - p^6(1-p)^4 - $$ -- There are -$$\left( -\begin{array}{c} - 10 \\ 6 -\end{array} -\right) -$$ -possible orders of $6$ heads and $4$ tails - ---- - -## Example - -- Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins -- If each gender has an 
independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? -$$\left( -\begin{array}{c} - 8 \\ 7 -\end{array} -\right) .5^{7}(1-.5)^{1} -+ -\left( -\begin{array}{c} - 8 \\ 8 -\end{array} -\right) .5^{8}(1-.5)^{0} \approx 0.04 -$$ -```{r} -choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 -pbinom(6, size = 8, prob = .5, lower.tail = FALSE) -``` - ---- -```{r, fig.height=5, fig.width=5} -plot(pvals, dbinom(7, 8, pvals) / dbinom(7, 8, 7/8) , - lwd = 3, frame = FALSE, type = "l", xlab = "p", ylab = "likelihood") -``` - ---- - -## The normal distribution - -- A random variable is said to follow a **normal** or **Gaussian** distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is - $$ - (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} - $$ - If $X$ a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$ -- We write $X\sim \mbox{N}(\mu, \sigma^2)$ -- When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called **the standard normal distribution** -- The standard normal density function is labeled $\phi$ -- Standard normal RVs are often labeled $Z$ - ---- -```{r, fig.height=4.5, fig.width=4.5} -zvals <- seq(-3, 3, length = 1000) -plot(zvals, dnorm(zvals), - type = "l", lwd = 3, frame = FALSE, xlab = "z", ylab = "Density") -sapply(-3 : 3, function(k) abline(v = k)) -``` - ---- - -## Facts about the normal density - -- If $X \sim \mbox{N}(\mu,\sigma^2)$ the $Z = \frac{X -\mu}{\sigma}$ is standard normal -- If $Z$ is standard normal $$X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)$$ -- The non-standard normal density is $$\phi\{(x - \mu) / \sigma\}/\sigma$$ - ---- - -## More facts about the normal density - -1. Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively -2. 
$-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively -3. By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively - ---- - -## Question - -- What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? - - Quick answer in R `qnorm(.95, mean = mu, sd = sd)` -- We want the point $x_0$ so that $P(X \leq x_0) = .95$ -$$ - \begin{eqnarray*} - P(X \leq x_0) & = & P\left(\frac{X - \mu}{\sigma} \leq \frac{x_0 - \mu}{\sigma}\right) \\ \\ - & = & P\left(Z \leq \frac{x_0 - \mu}{\sigma}\right) = .95 - \end{eqnarray*} -$$ -- Therefore - $$\frac{x_0 - \mu}{\sigma} = 1.645$$ - or $x_0 = \mu + \sigma 1.645$ -- In general $x_0 = \mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile - ---- - -## Question - -- What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is 2 standard deviations above the mean? -- We want to know -$$ - \begin{eqnarray*} - P(X > \mu + 2\sigma) & = & -P\left(\frac{X -\mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) \\ \\ -& = & P(Z \geq 2 ) \\ \\ -& \approx & 2.5\% - \end{eqnarray*} -$$ - ---- - -## Other properties - -- The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal) -- A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?) -- Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?) -- Sample means of normally distributed random variables are again normally distributed (with what mean and variance?) 
-- The square of a *standard normal* random variable follows what is called **chi-squared** distribution -- The exponent of a normally distributed random variables follows what is called the **log-normal** distribution -- As we will see later, many random variables, properly normalized, *limit* to a normal distribution - ---- - -## Final thoughts on normal likelihoods -- The MLE for $\mu$ is $\bar X$. -- The MLE for $\sigma^2$ is - $$ - \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} - $$ - (Which is the biased version of the sample variance.) -- The MLE of $\sigma$ is simply the square root of this - estimate - ---- -## The Poisson distribution -* Used to model counts -* The Poisson mass function is -$$ -P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} -$$ -for $x=0,1,\ldots$ -* The mean of this distribution is $\lambda$ -* The variance of this distribution is $\lambda$ -* Notice that $x$ ranges from $0$ to $\infty$ - ---- -## Some uses for the Poisson distribution -* Modeling event/time data -* Modeling radioactive decay -* Modeling survival data -* Modeling unbounded count data -* Modeling contingency tables -* Approximating binomials when $n$ is large and $p$ is small - ---- -## Poisson derivation -* $\lambda$ is the mean number of events per unit time -* Let $h$ be very small -* Suppose we assume that - * Prob. of an event in an interval of length $h$ is $\lambda h$ - while the prob. 
of more than one event is negligible - * Whether or not an event occurs in one small interval - does not impact whether or not an event occurs in another - small interval -then, the number of events per unit time is Poisson with mean $\lambda$ - ---- -## Rates and Poisson random variables -* Poisson random variables are used to model rates -* $X \sim Poisson(\lambda t)$ where - * $\lambda = E[X / t]$ is the expected count per unit of time - * $t$ is the total monitoring time - ---- -## Poisson approximation to the binomial -* When $n$ is large and $p$ is small the Poisson distribution - is an accurate approximation to the binomial distribution -* Notation - * $\lambda = n p$ - * $X \sim \mbox{Binomial}(n, p)$, $\lambda = n p$ and - * $n$ gets large - * $p$ gets small - * $\lambda$ stays constant - ---- -## Example -The number of people that show up at a bus stop is Poisson with -a mean of $2.5$ per hour. - -If watching the bus stop for 4 hours, what is the probability that $3$ -or fewer people show up for the whole time? - -```{r} -ppois(3, lambda = 2.5 * 4) -``` - ---- -## Example, Poisson approximation to the binomial - -We flip a coin with success probablity $0.01$ five hundred times. - -What's the probability of 2 or fewer successes? 
- -```{r} -pbinom(2, size = 500, prob = .01) -ppois(2, lambda=500 * .01) -``` - +--- +title : Some Common Distributions +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + + +## The Bernoulli distribution + +- The **Bernoulli distribution** arises as the result of a binary outcome +- Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively +- The PMF for a Bernoulli random variable $X$ is $$P(X = x) = p^x (1 - p)^{1 - x}$$ +- The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$ +- If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a "success" and $X=0$ as a "failure" + +--- + +## iid Bernoulli trials + +- If several iid Bernoulli observations, say $x_1,\ldots, x_n$, are observed the +likelihood is +$$ + \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} +$$ +- Notice that the likelihood depends only on the sum of the $x_i$ +- Because $n$ is fixed and assumed known, this implies that the sample proportion $\sum_i x_i / n$ contains all of the relevant information about $p$ +- We can maximize the Bernoulli likelihood over $p$ to obtain that $\hat p = \sum_i x_i / n$ is the maximum likelihood estimator for $p$ + +--- +## Plotting all possible likelihoods for a small n +``` +n <- 5 +pvals <- seq(0, 1, length = 1000) +plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood") +text((0 : n) /n, 1.1, as.character(0 : n)) +sapply(0 : n, function(x) { + phat <- x / n + if (x == 0) lines(pvals, ( (1 - 
pvals) / (1 - phat) )^(n-x), lwd = 3) + else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3) + else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) + } +) +title(paste("Likelihoods for n = ", n)) +``` + +--- +```{r, fig.height=6, fig.width=6, echo = FALSE, results='hide'} +n <- 5 +pvals <- seq(0, 1, length = 1000) +plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood") +text((0 : n) /n, 1.1, as.character(0 : n)) +sapply(0 : n, function(x) { + phat <- x / n + if (x == 0) lines(pvals, ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3) + else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3) + else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) + } +) +title(paste("Likelihoods for n = ", n)) +``` + +--- + +## Binomial trials + +- The *binomial random variables* are obtained as the sum of iid Bernoulli trials +- In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable +- The binomial mass function is +$$ +P(X = x) = +\left( +\begin{array}{c} + n \\ x +\end{array} +\right) +p^x(1 - p)^{n-x} +$$ +for $x=0,\ldots,n$ + +--- + +## Choose + +- Recall that the notation + $$\left( + \begin{array}{c} + n \\ x + \end{array} + \right) = \frac{n!}{x!(n-x)!} + $$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ + without replacement disregarding the order of the items + +$$\left( + \begin{array}{c} + n \\ 0 + \end{array} + \right) = +\left( + \begin{array}{c} + n \\ n + \end{array} + \right) = 1 + $$ + +--- + +## Example justification of the binomial likelihood + +- Consider the probability of getting $6$ heads out of $10$ coin flips from a coin with success probability $p$ +- The probability of getting $6$ heads and $4$ tails in any specific order is + $$ + p^6(1-p)^4 + $$ +- There are +$$\left( +\begin{array}{c} + 10 \\ 6 +\end{array} +\right) +$$ +possible orders of $6$ heads and $4$ 
tails + +--- + +## Example + +- Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? +$$\left( +\begin{array}{c} + 8 \\ 7 +\end{array} +\right) .5^{7}(1-.5)^{1} ++ +\left( +\begin{array}{c} + 8 \\ 8 +\end{array} +\right) .5^{8}(1-.5)^{0} \approx 0.04 +$$ +```{r} +choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 +pbinom(6, size = 8, prob = .5, lower.tail = FALSE) +``` + +--- +```{r, fig.height=5, fig.width=5} +plot(pvals, dbinom(7, 8, pvals) / dbinom(7, 8, 7/8) , + lwd = 3, frame = FALSE, type = "l", xlab = "p", ylab = "likelihood") +``` + +--- + +## The normal distribution + +- A random variable is said to follow a **normal** or **Gaussian** distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is + $$ + (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} + $$ + If $X$ a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$ +- We write $X\sim \mbox{N}(\mu, \sigma^2)$ +- When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called **the standard normal distribution** +- The standard normal density function is labeled $\phi$ +- Standard normal RVs are often labeled $Z$ + +--- +```{r, fig.height=5, fig.width=5, fig.align='center', results='hide'} +zvals <- seq(-3, 3, length = 1000) +plot(zvals, dnorm(zvals), + type = "l", lwd = 3, frame = FALSE, xlab = "z", ylab = "Density") +sapply(-3 : 3, function(k) abline(v = k)) +``` + +--- + +## Facts about the normal density + +- If $X \sim \mbox{N}(\mu,\sigma^2)$ the $Z = \frac{X -\mu}{\sigma}$ is standard normal +- If $Z$ is standard normal $$X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)$$ +- The non-standard normal density is $$\phi\{(x - \mu) / \sigma\}/\sigma$$ + +--- + +## More facts about the normal density + +1. 
Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively +2. $-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively +3. By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively + +--- + +## Question + +- What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? + - Quick answer in R `qnorm(.95, mean = mu, sd = sd)` +- We want the point $x_0$ so that $P(X \leq x_0) = .95$ +$$ + \begin{eqnarray*} + P(X \leq x_0) & = & P\left(\frac{X - \mu}{\sigma} \leq \frac{x_0 - \mu}{\sigma}\right) \\ \\ + & = & P\left(Z \leq \frac{x_0 - \mu}{\sigma}\right) = .95 + \end{eqnarray*} +$$ +- Therefore + $$\frac{x_0 - \mu}{\sigma} = 1.645$$ + or $x_0 = \mu + \sigma 1.645$ +- In general $x_0 = \mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile + +--- + +## Question + +- What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is 2 standard deviations above the mean? +- We want to know +$$ + \begin{eqnarray*} + P(X > \mu + 2\sigma) & = & +P\left(\frac{X -\mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) \\ \\ +& = & P(Z \geq 2 ) \\ \\ +& \approx & 2.5\% + \end{eqnarray*} +$$ + +--- + +## Other properties + +- The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal) +- A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?) +- Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?) +- Sample means of normally distributed random variables are again normally distributed (with what mean and variance?) 
+- The square of a *standard normal* random variable follows what is called **chi-squared** distribution +- The exponent of a normally distributed random variables follows what is called the **log-normal** distribution +- As we will see later, many random variables, properly normalized, *limit* to a normal distribution + +--- + +## Final thoughts on normal likelihoods +- The MLE for $\mu$ is $\bar X$. +- The MLE for $\sigma^2$ is + $$ + \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} + $$ + (Which is the biased version of the sample variance.) +- The MLE of $\sigma$ is simply the square root of this + estimate + +--- +## The Poisson distribution +* Used to model counts +* The Poisson mass function is +$$ +P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} +$$ +for $x=0,1,\ldots$ +* The mean of this distribution is $\lambda$ +* The variance of this distribution is $\lambda$ +* Notice that $x$ ranges from $0$ to $\infty$ + +--- +## Some uses for the Poisson distribution +* Modeling event/time data +* Modeling radioactive decay +* Modeling survival data +* Modeling unbounded count data +* Modeling contingency tables +* Approximating binomials when $n$ is large and $p$ is small + +--- +## Poisson derivation +* $\lambda$ is the mean number of events per unit time +* Let $h$ be very small +* Suppose we assume that + * Prob. of an event in an interval of length $h$ is $\lambda h$ + while the prob. 
of more than one event is negligible + * Whether or not an event occurs in one small interval + does not impact whether or not an event occurs in another + small interval +then, the number of events per unit time is Poisson with mean $\lambda$ + +--- +## Rates and Poisson random variables +* Poisson random variables are used to model rates +* $X \sim Poisson(\lambda t)$ where + * $\lambda = E[X / t]$ is the expected count per unit of time + * $t$ is the total monitoring time + +--- +## Poisson approximation to the binomial +* When $n$ is large and $p$ is small the Poisson distribution + is an accurate approximation to the binomial distribution +* Notation + * $\lambda = n p$ + * $X \sim \mbox{Binomial}(n, p)$, $\lambda = n p$ and + * $n$ gets large + * $p$ gets small + * $\lambda$ stays constant + +--- +## Example +The number of people that show up at a bus stop is Poisson with +a mean of $2.5$ per hour. + +If watching the bus stop for 4 hours, what is the probability that $3$ +or fewer people show up for the whole time? + +```{r} +ppois(3, lambda = 2.5 * 4) +``` + +--- +## Example, Poisson approximation to the binomial + +We flip a coin with success probablity $0.01$ five hundred times. + +What's the probability of 2 or fewer successes? + +```{r} +pbinom(2, size = 500, prob = .01) +ppois(2, lambda=500 * .01) +``` + diff --git a/06_StatisticalInference/02_01_CommonDistributions/index.html b/06_StatisticalInference/02_01_CommonDistributions/index.html index be8769a08..8b616ccc0 100644 --- a/06_StatisticalInference/02_01_CommonDistributions/index.html +++ b/06_StatisticalInference/02_01_CommonDistributions/index.html @@ -1,750 +1,727 @@ - - - - Some Common Distributions - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Some Common Distributions

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

The Bernoulli distribution

-
-
-
    -
  • The Bernoulli distribution arises as the result of a binary outcome
  • -
  • Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) \(p\) and \(1-p\) respectively
  • -
  • The PMF for a Bernoulli random variable \(X\) is \[P(X = x) = p^x (1 - p)^{1 - x}\]
  • -
  • The mean of a Bernoulli random variable is \(p\) and the variance is \(p(1 - p)\)
  • -
  • If \(X\) is a Bernoulli random variable, it is typical to call \(X=1\) a "success" and \(X=0\) a "failure"
  • -
- -
- -
- - -
-

iid Bernoulli trials

-
-
-
    -
  • If several iid Bernoulli observations, say \(x_1,\ldots, x_n\), are observed the -likelihood is -\[ -\prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} -\]
  • -
  • Notice that the likelihood depends only on the sum of the \(x_i\)
  • -
  • Because \(n\) is fixed and assumed known, this implies that the sample proportion \(\sum_i x_i / n\) contains all of the relevant information about \(p\)
  • -
  • We can maximize the Bernoulli likelihood over \(p\) to obtain that \(\hat p = \sum_i x_i / n\) is the maximum likelihood estimator for \(p\)
  • -
- -
- -
- - -
-

Plotting all possible likelihoods for a small n

-
-
-
n <- 5
-pvals <- seq(0, 1, length = 1000)
-plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood")
-text((0 : n) /n, 1.1, as.character(0 : n))
-sapply(0 : n, function(x) {
-  phat <- x / n
-  if (x == 0) lines(pvals,  ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3)
-  else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3)
-  else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) 
-  }
-)
-title(paste("Likelihoods for n = ", n))
-
- -
- -
- - -
-
plot of chunk unnamed-chunk-1
- -
- -
- - -
-

Binomial trials

-
-
-
    -
  • A binomial random variable is obtained as the sum of iid Bernoulli trials
  • -
  • Specifically, let \(X_1,\ldots,X_n\) be iid Bernoulli\((p)\); then \(X = \sum_{i=1}^n X_i\) is a binomial random variable
  • -
  • The binomial mass function is -\[ -P(X = x) = -\left( -\begin{array}{c} -n \\ x -\end{array} -\right) -p^x(1 - p)^{n-x} -\] -for \(x=0,\ldots,n\)
  • -
- -
- -
- - -
-

Choose

-
-
-
    -
  • Recall that the notation -\[\left( -\begin{array}{c} - n \\ x -\end{array} -\right) = \frac{n!}{x!(n-x)!} -\] (read "\(n\) choose \(x\)") counts the number of ways of selecting \(x\) items out of \(n\) -without replacement disregarding the order of the items
  • -
- -

\[\left( - \begin{array}{c} - n \\ 0 - \end{array} - \right) = -\left( - \begin{array}{c} - n \\ n - \end{array} - \right) = 1 - \]

- -
- -
- - -
-

Example justification of the binomial likelihood

-
-
-
    -
  • Consider the probability of getting \(6\) heads out of \(10\) coin flips from a coin with success probability \(p\)
  • -
  • The probability of getting \(6\) heads and \(4\) tails in any specific order is -\[ -p^6(1-p)^4 -\]
  • -
  • There are -\[\left( -\begin{array}{c} -10 \\ 6 -\end{array} -\right) -\] -possible orders of \(6\) heads and \(4\) tails
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Suppose a friend has \(8\) children (oh my!), \(7\) of whom are girls and none of whom are twins
  • -
  • If each gender has an independent \(50\)% probability for each birth, what's the probability of getting \(7\) or more girls out of \(8\) births? -\[\left( -\begin{array}{c} -8 \\ 7 -\end{array} -\right) .5^{7}(1-.5)^{1} -+ -\left( -\begin{array}{c} -8 \\ 8 -\end{array} -\right) .5^{8}(1-.5)^{0} \approx 0.04 -\]
  • -
- -
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 
-
- -
[1] 0.03516
-
- -
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
-
- -
[1] 0.03516
-
- -
- -
- - -
-
plot(pvals, dbinom(7, 8, pvals) / dbinom(7, 8, 7/8) , 
-     lwd = 3, frame = FALSE, type = "l", xlab = "p", ylab = "likelihood")
-
- -
plot of chunk unnamed-chunk-3
- -
- -
- - -
-

The normal distribution

-
-
-
    -
  • A random variable is said to follow a normal or Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\) if the associated density is -\[ -(2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/(2\sigma^2)} -\] -If \(X\) is a RV with this density then \(E[X] = \mu\) and \(Var(X) = \sigma^2\)
  • -
  • We write \(X\sim \mbox{N}(\mu, \sigma^2)\)
  • -
  • When \(\mu = 0\) and \(\sigma = 1\) the resulting distribution is called the standard normal distribution
  • -
  • The standard normal density function is labeled \(\phi\)
  • -
  • Standard normal RVs are often labeled \(Z\)
  • -
- -
- -
- - -
-
zvals <- seq(-3, 3, length = 1000)
-plot(zvals, dnorm(zvals), 
-     type = "l", lwd = 3, frame = FALSE, xlab = "z", ylab = "Density")
-sapply(-3 : 3, function(k) abline(v = k))
-
- -
plot of chunk unnamed-chunk-4
- -
[[1]]
-NULL
-
-[[2]]
-NULL
-
-[[3]]
-NULL
-
-[[4]]
-NULL
-
-[[5]]
-NULL
-
-[[6]]
-NULL
-
-[[7]]
-NULL
-
- -
- -
- - -
-

Facts about the normal density

-
-
-
    -
  • If \(X \sim \mbox{N}(\mu,\sigma^2)\) then \(Z = \frac{X -\mu}{\sigma}\) is standard normal
  • -
  • If \(Z\) is standard normal \[X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)\]
  • -
  • The non-standard normal density is \[\phi\{(x - \mu) / \sigma\}/\sigma\]
  • -
- -
- -
- - -
-

More facts about the normal density

-
-
-
    -
  1. Approximately \(68\%\), \(95\%\) and \(99.7\%\) of the mass under the normal density lies within \(1\), \(2\) and \(3\) standard deviations of the mean, respectively
  2. -
  3. \(-1.28\), \(-1.645\), \(-1.96\) and \(-2.33\) are the \(10^{th}\), \(5^{th}\), \(2.5^{th}\) and \(1^{st}\) percentiles of the standard normal distribution respectively
  4. -
  5. By symmetry, \(1.28\), \(1.645\), \(1.96\) and \(2.33\) are the \(90^{th}\), \(95^{th}\), \(97.5^{th}\) and \(99^{th}\) percentiles of the standard normal distribution respectively
  6. -
- -
- -
- - -
-

Question

-
-
-
    -
  • What is the \(95^{th}\) percentile of a \(N(\mu, \sigma^2)\) distribution? - -
      -
    • Quick answer in R: qnorm(.95, mean = mu, sd = sigma)
    • -
  • -
  • We want the point \(x_0\) so that \(P(X \leq x_0) = .95\) -\[ -\begin{eqnarray*} -P(X \leq x_0) & = & P\left(\frac{X - \mu}{\sigma} \leq \frac{x_0 - \mu}{\sigma}\right) \\ \\ - & = & P\left(Z \leq \frac{x_0 - \mu}{\sigma}\right) = .95 -\end{eqnarray*} -\]
  • -
  • Therefore -\[\frac{x_0 - \mu}{\sigma} = 1.645\] -or \(x_0 = \mu + \sigma 1.645\)
  • -
  • In general \(x_0 = \mu + \sigma z_0\) where \(z_0\) is the appropriate standard normal quantile
  • -
- -
- -
- - -
-

Question

-
-
-
    -
  • What is the probability that a \(\mbox{N}(\mu,\sigma^2)\) RV is more than 2 standard deviations above the mean?
  • -
  • We want to know -\[ -\begin{eqnarray*} -P(X > \mu + 2\sigma) & = & -P\left(\frac{X -\mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) \\ \\ -& = & P(Z \geq 2 ) \\ \\ -& \approx & 2.5\% -\end{eqnarray*} -\]
  • -
- -
- -
- - -
-

Other properties

-
-
-
    -
  • The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal)
  • -
  • A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?)
  • -
  • Sums of jointly normally distributed random variables are again normally distributed, even if the variables are dependent (what are the mean and variance?)
  • -
  • Sample means of normally distributed random variables are again normally distributed (with what mean and variance?)
  • -
  • The square of a standard normal random variable follows what is called the chi-squared distribution (with one degree of freedom)
  • -
  • The exponential of a normally distributed random variable follows what is called the log-normal distribution
  • -
  • As we will see later, many random variables, properly normalized, limit to a normal distribution
  • -
- -
- -
- - -
-

Final thoughts on normal likelihoods

-
-
-
    -
  • The MLE for \(\mu\) is \(\bar X\).
  • -
  • The MLE for \(\sigma^2\) is -\[ -\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} -\] -(Which is the biased version of the sample variance.)
  • -
  • The MLE of \(\sigma\) is simply the square root of this -estimate
  • -
- -
- -
- - -
-

The Poisson distribution

-
-
-
    -
  • Used to model counts
  • -
  • The Poisson mass function is -\[ -P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} -\] -for \(x=0,1,\ldots\)
  • -
  • The mean of this distribution is \(\lambda\)
  • -
  • The variance of this distribution is \(\lambda\)
  • -
  • Notice that \(x\) ranges from \(0\) to \(\infty\)
  • -
- -
- -
- - -
-

Some uses for the Poisson distribution

-
-
-
    -
  • Modeling event/time data
  • -
  • Modeling radioactive decay
  • -
  • Modeling survival data
  • -
  • Modeling unbounded count data
  • -
  • Modeling contingency tables
  • -
  • Approximating binomials when \(n\) is large and \(p\) is small
  • -
- -
- -
- - -
-

Poisson derivation

-
-
-
    -
  • \(\lambda\) is the mean number of events per unit time
  • -
  • Let \(h\) be very small
  • -
  • Suppose we assume that - -
      -
    • Prob. of an event in an interval of length \(h\) is \(\lambda h\) -while the prob. of more than one event is negligible
    • -
    • Whether or not an event occurs in one small interval -does not impact whether or not an event occurs in another -small interval -then, the number of events per unit time is Poisson with mean \(\lambda\)
    • -
  • -
- -
- -
- - -
-

Rates and Poisson random variables

-
-
-
    -
  • Poisson random variables are used to model rates
  • -
  • \(X \sim Poisson(\lambda t)\) where - -
      -
    • \(\lambda = E[X / t]\) is the expected count per unit of time
    • -
    • \(t\) is the total monitoring time
    • -
  • -
- -
- -
- - -
-

Poisson approximation to the binomial

-
-
-
    -
  • When \(n\) is large and \(p\) is small the Poisson distribution -is an accurate approximation to the binomial distribution
  • -
  • Notation - -
      -
    • \(\lambda = n p\)
    • -
    • \(X \sim \mbox{Binomial}(n, p)\), \(\lambda = n p\) and
    • -
    • \(n\) gets large
    • -
    • \(p\) gets small
    • -
    • \(\lambda\) stays constant
    • -
  • -
- -
- -
- - -
-

Example

-
-
-

The number of people that show up at a bus stop is Poisson with -a mean of \(2.5\) per hour.

- -

If watching the bus stop for 4 hours, what is the probability that \(3\) -or fewer people show up for the whole time?

- -
ppois(3, lambda = 2.5 * 4)
-
- -
[1] 0.01034
-
- -
- -
- - -
-

Example, Poisson approximation to the binomial

-
-
-

We flip a coin with success probability \(0.01\) five hundred times.

- -

What's the probability of 2 or fewer successes?

- -
pbinom(2, size = 500, prob = .01)
-
- -
[1] 0.1234
-
- -
ppois(2, lambda=500 * .01)
-
- -
[1] 0.1247
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Some Common Distributions + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Some Common Distributions

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

The Bernoulli distribution

+
+
+
    +
  • The Bernoulli distribution arises as the result of a binary outcome
  • +
  • Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) \(p\) and \(1-p\) respectively
  • +
  • The PMF for a Bernoulli random variable \(X\) is \[P(X = x) = p^x (1 - p)^{1 - x}\]
  • +
  • The mean of a Bernoulli random variable is \(p\) and the variance is \(p(1 - p)\)
  • +
  • If \(X\) is a Bernoulli random variable, it is typical to call \(X=1\) a "success" and \(X=0\) a "failure"
  • +
+ +
+ +
+ + +
+

iid Bernoulli trials

+
+
+
    +
  • If several iid Bernoulli observations, say \(x_1,\ldots, x_n\), are observed the +likelihood is +\[ +\prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} +\]
  • +
  • Notice that the likelihood depends only on the sum of the \(x_i\)
  • +
  • Because \(n\) is fixed and assumed known, this implies that the sample proportion \(\sum_i x_i / n\) contains all of the relevant information about \(p\)
  • +
  • We can maximize the Bernoulli likelihood over \(p\) to obtain that \(\hat p = \sum_i x_i / n\) is the maximum likelihood estimator for \(p\)
  • +
+ +
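The claim that the sample proportion maximizes the Bernoulli likelihood can be checked numerically. The following is a minimal sketch, assuming a small made-up data vector `x` (not from the slides); it compares a numerical maximizer of the log-likelihood with the closed-form estimate.

```r
# Hypothetical observed Bernoulli data (illustrative values only)
x <- c(1, 0, 1, 1, 0, 1, 0, 1)
n <- length(x)

# Log-likelihood for iid Bernoulli data; it depends on x only through sum(x)
loglik <- function(p) sum(x) * log(p) + (n - sum(x)) * log(1 - p)

# Numerical maximizer over the open interval (0, 1)
p_numeric <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum

# Closed-form MLE: the sample proportion
p_hat <- mean(x)

print(c(numeric = p_numeric, closed_form = p_hat))
```

The two values should agree up to the tolerance of `optimize`, which is why only a loose comparison is appropriate here.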
+ +
+ + +
+

Plotting all possible likelihoods for a small n

+
+
+
n <- 5
+pvals <- seq(0, 1, length = 1000)
+plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood")
+text((0 : n) /n, 1.1, as.character(0 : n))
+sapply(0 : n, function(x) {
+  phat <- x / n
+  if (x == 0) lines(pvals,  ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3)
+  else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3)
+  else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) 
+  }
+)
+title(paste("Likelihoods for n = ", n))
+
+ +
+ +
+ + +
+

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

Binomial trials

+
+
+
    +
  • A binomial random variable is obtained as the sum of iid Bernoulli trials
  • +
  • Specifically, let \(X_1,\ldots,X_n\) be iid Bernoulli\((p)\); then \(X = \sum_{i=1}^n X_i\) is a binomial random variable
  • +
  • The binomial mass function is +\[ +P(X = x) = +\left( +\begin{array}{c} +n \\ x +\end{array} +\right) +p^x(1 - p)^{n-x} +\] +for \(x=0,\ldots,n\)
  • +
+ +
+ +
+ + +
+

Choose

+
+
+
    +
  • Recall that the notation +\[\left( +\begin{array}{c} + n \\ x +\end{array} +\right) = \frac{n!}{x!(n-x)!} +\] (read "\(n\) choose \(x\)") counts the number of ways of selecting \(x\) items out of \(n\) +without replacement disregarding the order of the items
  • +
+ +

\[\left( + \begin{array}{c} + n \\ 0 + \end{array} + \right) = +\left( + \begin{array}{c} + n \\ n + \end{array} + \right) = 1 + \]

+ +
+ +
+ + +
+

Example justification of the binomial likelihood

+
+
+
    +
  • Consider the probability of getting \(6\) heads out of \(10\) coin flips from a coin with success probability \(p\)
  • +
  • The probability of getting \(6\) heads and \(4\) tails in any specific order is +\[ +p^6(1-p)^4 +\]
  • +
  • There are +\[\left( +\begin{array}{c} +10 \\ 6 +\end{array} +\right) +\] +possible orders of \(6\) heads and \(4\) tails
  • +
+ +
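The counting argument above (one specific order has probability \(p^6(1-p)^4\), and there are "10 choose 6" such orders) can be compared against R's built-in binomial mass function. This is a small sketch; the value of `p` is an arbitrary assumption chosen for illustration.

```r
# Hypothetical success probability (any value in (0, 1) would do)
p <- 0.3

# Probability of 6 heads in 10 flips, assembled from the counting argument
by_hand <- choose(10, 6) * p^6 * (1 - p)^4

# The same probability from the built-in binomial mass function
built_in <- dbinom(6, size = 10, prob = p)

print(all.equal(by_hand, built_in))
```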
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose a friend has \(8\) children (oh my!), \(7\) of whom are girls and none of whom are twins
  • +
  • If each gender has an independent \(50\)% probability for each birth, what's the probability of getting \(7\) or more girls out of \(8\) births? +\[\left( +\begin{array}{c} +8 \\ 7 +\end{array} +\right) .5^{7}(1-.5)^{1} ++ +\left( +\begin{array}{c} +8 \\ 8 +\end{array} +\right) .5^{8}(1-.5)^{0} \approx 0.04 +\]
  • +
+ +
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
+
+ +
## [1] 0.03516
+
+ +
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
+
+ +
## [1] 0.03516
+
+ +
+ +
+ + +
+
plot(pvals, dbinom(7, 8, pvals)/dbinom(7, 8, 7/8), lwd = 3, frame = FALSE, type = "l", 
+    xlab = "p", ylab = "likelihood")
+
+ +

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

The normal distribution

+
+
+
    +
  • A random variable is said to follow a normal or Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\) if the associated density is +\[ +(2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/(2\sigma^2)} +\] +If \(X\) is a RV with this density then \(E[X] = \mu\) and \(Var(X) = \sigma^2\)
  • +
  • We write \(X\sim \mbox{N}(\mu, \sigma^2)\)
  • +
  • When \(\mu = 0\) and \(\sigma = 1\) the resulting distribution is called the standard normal distribution
  • +
  • The standard normal density function is labeled \(\phi\)
  • +
  • Standard normal RVs are often labeled \(Z\)
  • +
+ +
+ +
+ + +
+
zvals <- seq(-3, 3, length = 1000)
+plot(zvals, dnorm(zvals), type = "l", lwd = 3, frame = FALSE, xlab = "z", ylab = "Density")
+sapply(-3:3, function(k) abline(v = k))
+
+ +

plot of chunk unnamed-chunk-4

+ +
+ +
+ + +
+

Facts about the normal density

+
+
+
    +
  • If \(X \sim \mbox{N}(\mu,\sigma^2)\) then \(Z = \frac{X -\mu}{\sigma}\) is standard normal
  • +
  • If \(Z\) is standard normal \[X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)\]
  • +
  • The non-standard normal density is \[\phi\{(x - \mu) / \sigma\}/\sigma\]
  • +
+ +
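The relationship between the standard and non-standard normal densities can be verified directly with `dnorm`. The sketch below assumes illustrative values for `mu` and `sigma`; neither comes from the slides.

```r
# Hypothetical parameters for a non-standard normal
mu <- 3
sigma <- 2
xvals <- seq(-5, 11, length.out = 200)

# Density computed directly with the mean and sd arguments
direct <- dnorm(xvals, mean = mu, sd = sigma)

# Density built from the standard normal density phi: phi((x - mu) / sigma) / sigma
via_phi <- dnorm((xvals - mu) / sigma) / sigma

print(all.equal(direct, via_phi))
```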
+ +
+ + +
+

More facts about the normal density

+
+
+
    +
  1. Approximately \(68\%\), \(95\%\) and \(99.7\%\) of the mass under the normal density lies within \(1\), \(2\) and \(3\) standard deviations of the mean, respectively
  2. +
  3. \(-1.28\), \(-1.645\), \(-1.96\) and \(-2.33\) are the \(10^{th}\), \(5^{th}\), \(2.5^{th}\) and \(1^{st}\) percentiles of the standard normal distribution respectively
  4. +
  5. By symmetry, \(1.28\), \(1.645\), \(1.96\) and \(2.33\) are the \(90^{th}\), \(95^{th}\), \(97.5^{th}\) and \(99^{th}\) percentiles of the standard normal distribution respectively
  6. +
+ +
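The rule-of-thumb coverage figures and the quoted percentiles on this slide can be reproduced with `pnorm` and `qnorm`. This is a quick check rather than part of the original slides.

```r
# Probability mass within 1, 2 and 3 standard deviations of the mean
within_k <- pnorm(1:3) - pnorm(-(1:3))
print(round(within_k, 4))

# Lower-tail quantiles: 10th, 5th, 2.5th and 1st percentiles
lower_q <- qnorm(c(0.10, 0.05, 0.025, 0.01))
print(round(lower_q, 3))

# By symmetry, the matching upper-tail quantiles are the negatives
upper_q <- qnorm(c(0.90, 0.95, 0.975, 0.99))
print(all.equal(upper_q, -lower_q))
```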
+ +
+ + +
+

Question

+
+
+
    +
  • What is the \(95^{th}\) percentile of a \(N(\mu, \sigma^2)\) distribution? + +
      +
    • Quick answer in R: qnorm(.95, mean = mu, sd = sigma)
    • +
  • +
  • We want the point \(x_0\) so that \(P(X \leq x_0) = .95\) +\[ +\begin{eqnarray*} +P(X \leq x_0) & = & P\left(\frac{X - \mu}{\sigma} \leq \frac{x_0 - \mu}{\sigma}\right) \\ \\ + & = & P\left(Z \leq \frac{x_0 - \mu}{\sigma}\right) = .95 +\end{eqnarray*} +\]
  • +
  • Therefore +\[\frac{x_0 - \mu}{\sigma} = 1.645\] +or \(x_0 = \mu + \sigma 1.645\)
  • +
  • In general \(x_0 = \mu + \sigma z_0\) where \(z_0\) is the appropriate standard normal quantile
  • +
+ +
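The general formula \(x_0 = \mu + \sigma z_0\) can be compared with the direct call to `qnorm`. The values of `mu` and `sigma` below are assumptions picked only to make the sketch runnable.

```r
# Hypothetical mean and standard deviation
mu <- 100
sigma <- 15

# 95th percentile computed directly
x0_direct <- qnorm(0.95, mean = mu, sd = sigma)

# 95th percentile from the standard normal quantile z0
z0 <- qnorm(0.95)
x0_formula <- mu + sigma * z0

print(c(direct = x0_direct, formula = x0_formula))
```

Both routes give the same number because the transformation from \(Z\) to \(X\) is increasing, so quantiles map to quantiles.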
+ +
+ + +
+

Question

+
+
+
    +
  • What is the probability that a \(\mbox{N}(\mu,\sigma^2)\) RV is more than 2 standard deviations above the mean?
  • +
  • We want to know +\[ +\begin{eqnarray*} +P(X > \mu + 2\sigma) & = & +P\left(\frac{X -\mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) \\ \\ +& = & P(Z \geq 2 ) \\ \\ +& \approx & 2.5\% +\end{eqnarray*} +\]
  • +
+ +
+ +
+ + +
+

Other properties

+
+
+
    +
  • The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal)
  • +
  • A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?)
  • +
  • Sums of jointly normally distributed random variables are again normally distributed, even if the variables are dependent (what are the mean and variance?)
  • +
  • Sample means of normally distributed random variables are again normally distributed (with what mean and variance?)
  • +
  • The square of a standard normal random variable follows what is called the chi-squared distribution (with one degree of freedom)
  • +
  • The exponential of a normally distributed random variable follows what is called the log-normal distribution
  • +
  • As we will see later, many random variables, properly normalized, limit to a normal distribution
  • +
+ +
+ +
+ + +
+

Final thoughts on normal likelihoods

+
+
+
    +
  • The MLE for \(\mu\) is \(\bar X\).
  • +
  • The MLE for \(\sigma^2\) is +\[ +\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} +\] +(Which is the biased version of the sample variance.)
  • +
  • The MLE of \(\sigma\) is simply the square root of this +estimate
  • +
+ +
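To see how the MLE of \(\sigma^2\) relates to the usual sample variance, the sketch below computes both on simulated data. The sample size, seed and parameters are arbitrary assumptions; note that R's `var` divides by \(n - 1\), not \(n\).

```r
# Simulated normal data (hypothetical parameters, fixed seed for repeatability)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
n <- length(x)

# MLE of the mean and of the variance (divides by n)
mu_hat <- mean(x)
sigma2_mle <- sum((x - mu_hat)^2) / n

# Unbiased sample variance (divides by n - 1)
sigma2_unbiased <- var(x)

# The two differ only by the factor (n - 1) / n
print(c(mle = sigma2_mle, rescaled = (n - 1) / n * sigma2_unbiased))

# MLE of sigma is the square root of the variance MLE
sigma_mle <- sqrt(sigma2_mle)
```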
+ +
+ + +
+

The Poisson distribution

+
+
+
    +
  • Used to model counts
  • +
  • The Poisson mass function is +\[ +P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} +\] +for \(x=0,1,\ldots\)
  • +
  • The mean of this distribution is \(\lambda\)
  • +
  • The variance of this distribution is \(\lambda\)
  • +
  • Notice that \(x\) ranges from \(0\) to \(\infty\)
  • +
+ +
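That the Poisson mean and variance both equal \(\lambda\) can be checked by summing over the mass function. This sketch assumes \(\lambda = 2.5\) and truncates the infinite sum at 100, where the remaining mass is negligible.

```r
# Hypothetical rate parameter
lambda <- 2.5

# Truncated support; the tail beyond 100 carries negligible probability
x <- 0:100
pmf <- dpois(x, lambda)

# Mean and variance computed from the mass function
pois_mean <- sum(x * pmf)
pois_var <- sum((x - pois_mean)^2 * pmf)

print(c(mean = pois_mean, variance = pois_var))
```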
+ +
+ + +
+

Some uses for the Poisson distribution

+
+
+
    +
  • Modeling event/time data
  • +
  • Modeling radioactive decay
  • +
  • Modeling survival data
  • +
  • Modeling unbounded count data
  • +
  • Modeling contingency tables
  • +
  • Approximating binomials when \(n\) is large and \(p\) is small
  • +
+ +
+ +
+ + +
+

Poisson derivation

+
+
+
    +
  • \(\lambda\) is the mean number of events per unit time
  • +
  • Let \(h\) be very small
  • +
  • Suppose we assume that + +
      +
    • Prob. of an event in an interval of length \(h\) is \(\lambda h\) +while the prob. of more than one event is negligible
    • +
    • Whether or not an event occurs in one small interval +does not impact whether or not an event occurs in another +small interval +then, the number of events per unit time is Poisson with mean \(\lambda\)
    • +
  • +
+ +
+ +
+ + +
+

Rates and Poisson random variables

+
+
+
    +
  • Poisson random variables are used to model rates
  • +
  • \(X \sim Poisson(\lambda t)\) where + +
      +
    • \(\lambda = E[X / t]\) is the expected count per unit of time
    • +
    • \(t\) is the total monitoring time
    • +
  • +
+ +
+ +
+ + +
+

Poisson approximation to the binomial

+
+
+
    +
  • When \(n\) is large and \(p\) is small the Poisson distribution +is an accurate approximation to the binomial distribution
  • +
  • Notation + +
      +
    • \(\lambda = n p\)
    • +
    • \(X \sim \mbox{Binomial}(n, p)\), \(\lambda = n p\) and
    • +
    • \(n\) gets large
    • +
    • \(p\) gets small
    • +
    • \(\lambda\) stays constant
    • +
  • +
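A short worked limit shows why the approximation holds. This is the standard textbook argument, sketched here for fixed \(x\) with \(p = \lambda / n\); it is not taken from the slides themselves.

\[
\left(
\begin{array}{c}
  n \\ x
\end{array}
\right)
p^x (1 - p)^{n - x}
=
\frac{n (n-1) \cdots (n - x + 1)}{n^x}
\cdot
\frac{\lambda^x}{x!}
\cdot
\left(1 - \frac{\lambda}{n}\right)^{n}
\cdot
\left(1 - \frac{\lambda}{n}\right)^{-x}
\]

As \(n \rightarrow \infty\) with \(\lambda\) held constant, the first and last factors tend to \(1\) and \(\left(1 - \lambda / n\right)^{n} \rightarrow e^{-\lambda}\), so the binomial mass function tends to

\[
\frac{\lambda^x e^{-\lambda}}{x!},
\]

which is the Poisson mass function with mean \(\lambda = n p\).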
+ +
+ +
+ + +
+

Example

+
+
+

The number of people that show up at a bus stop is Poisson with a mean of \(2.5\) per hour.

+ +

If we watch the bus stop for 4 hours, what is the probability that \(3\) or fewer people show up over the whole period?

+ +
ppois(3, lambda = 2.5 * 4)
+
+ +
## [1] 0.01034
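To make explicit what `ppois` computes here, the sketch below rebuilds the same probability by summing the mass function over \(0, 1, 2, 3\) and also reports the complementary probability of more than \(3\) arrivals. The variable names are introduced only for this illustration.

```r
# Cross-check of the bus stop calculation.
lambda_per_hour <- 2.5                        # mean arrivals per hour (from the example)
hours <- 4                                    # total monitoring time
lambda_total <- lambda_per_hour * hours       # Poisson mean over the whole period

p_cdf <- ppois(3, lambda = lambda_total)               # P(X <= 3) via the distribution function
p_sum <- sum(dpois(0:3, lambda = lambda_total))        # same probability via the mass function
p_more <- ppois(3, lambda = lambda_total, lower.tail = FALSE)  # P(X > 3)

c(p_cdf = p_cdf, p_sum = p_sum, p_more = p_more)
```

The first two entries should agree, and the first and third should sum to one.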
+
+ +
+ +
+ + +
+

Example, Poisson approximation to the binomial

+
+
+

We flip a coin with success probability \(0.01\) five hundred times.

+ +

What's the probability of 2 or fewer successes?

+ +
pbinom(2, size = 500, prob = 0.01)
+
+ +
## [1] 0.1234
+
+ +
ppois(2, lambda = 500 * 0.01)
+
+ +
## [1] 0.1247
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/02_01_CommonDistributions/index.md b/06_StatisticalInference/02_01_CommonDistributions/index.md index 186885c92..a7c796780 100644 --- a/06_StatisticalInference/02_01_CommonDistributions/index.md +++ b/06_StatisticalInference/02_01_CommonDistributions/index.md @@ -1,383 +1,358 @@ ---- -title : Some Common Distributions -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## The Bernoulli distribution - -- The **Bernoulli distribution** arises as the result of a binary outcome -- Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively -- The PMF for a Bernoulli random variable $X$ is $$P(X = x) = p^x (1 - p)^{1 - x}$$ -- The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$ -- If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a "success" and $X=0$ as a "failure" - ---- - -## iid Bernoulli trials - -- If several iid Bernoulli observations, say $x_1,\ldots, x_n$, are observed the -likelihood is -$$ - \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} -$$ -- Notice that the likelihood depends only on the sum of the $x_i$ -- Because $n$ is fixed and assumed known, this implies that the sample proportion $\sum_i x_i / n$ contains all of the relevant information about $p$ -- We can maximize the Bernoulli likelihood over $p$ to obtain that $\hat p = \sum_i x_i / n$ is the maximum likelihood estimator for $p$ - ---- -## Plotting all 
possible likelihoods for a small n -``` -n <- 5 -pvals <- seq(0, 1, length = 1000) -plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood") -text((0 : n) /n, 1.1, as.character(0 : n)) -sapply(0 : n, function(x) { - phat <- x / n - if (x == 0) lines(pvals, ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3) - else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3) - else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) - } -) -title(paste("Likelihoods for n = ", n)) -``` - ---- -
plot of chunk unnamed-chunk-1
- - ---- - -## Binomial trials - -- The *binomial random variables* are obtained as the sum of iid Bernoulli trials -- In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable -- The binomial mass function is -$$ -P(X = x) = -\left( -\begin{array}{c} - n \\ x -\end{array} -\right) -p^x(1 - p)^{n-x} -$$ -for $x=0,\ldots,n$ - ---- - -## Choose - -- Recall that the notation - $$\left( - \begin{array}{c} - n \\ x - \end{array} - \right) = \frac{n!}{x!(n-x)!} - $$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ - without replacement disregarding the order of the items - -$$\left( - \begin{array}{c} - n \\ 0 - \end{array} - \right) = -\left( - \begin{array}{c} - n \\ n - \end{array} - \right) = 1 - $$ - ---- - -## Example justification of the binomial likelihood - -- Consider the probability of getting $6$ heads out of $10$ coin flips from a coin with success probability $p$ -- The probability of getting $6$ heads and $4$ tails in any specific order is - $$ - p^6(1-p)^4 - $$ -- There are -$$\left( -\begin{array}{c} - 10 \\ 6 -\end{array} -\right) -$$ -possible orders of $6$ heads and $4$ tails - ---- - -## Example - -- Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins -- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? -$$\left( -\begin{array}{c} - 8 \\ 7 -\end{array} -\right) .5^{7}(1-.5)^{1} -+ -\left( -\begin{array}{c} - 8 \\ 8 -\end{array} -\right) .5^{8}(1-.5)^{0} \approx 0.04 -$$ - -```r -choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 -``` - -``` -[1] 0.03516 -``` - -```r -pbinom(6, size = 8, prob = .5, lower.tail = FALSE) -``` - -``` -[1] 0.03516 -``` - - ---- - -```r -plot(pvals, dbinom(7, 8, pvals) / dbinom(7, 8, 7/8) , - lwd = 3, frame = FALSE, type = "l", xlab = "p", ylab = "likelihood") -``` - -
plot of chunk unnamed-chunk-3
- - ---- - -## The normal distribution - -- A random variable is said to follow a **normal** or **Gaussian** distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is - $$ - (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} - $$ - If $X$ a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$ -- We write $X\sim \mbox{N}(\mu, \sigma^2)$ -- When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called **the standard normal distribution** -- The standard normal density function is labeled $\phi$ -- Standard normal RVs are often labeled $Z$ - ---- - -```r -zvals <- seq(-3, 3, length = 1000) -plot(zvals, dnorm(zvals), - type = "l", lwd = 3, frame = FALSE, xlab = "z", ylab = "Density") -sapply(-3 : 3, function(k) abline(v = k)) -``` - -
plot of chunk unnamed-chunk-4
- -``` -[[1]] -NULL - -[[2]] -NULL - -[[3]] -NULL - -[[4]] -NULL - -[[5]] -NULL - -[[6]] -NULL - -[[7]] -NULL -``` - - ---- - -## Facts about the normal density - -- If $X \sim \mbox{N}(\mu,\sigma^2)$ the $Z = \frac{X -\mu}{\sigma}$ is standard normal -- If $Z$ is standard normal $$X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)$$ -- The non-standard normal density is $$\phi\{(x - \mu) / \sigma\}/\sigma$$ - ---- - -## More facts about the normal density - -1. Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively -2. $-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively -3. By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively - ---- - -## Question - -- What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? - - Quick answer in R `qnorm(.95, mean = mu, sd = sd)` -- We want the point $x_0$ so that $P(X \leq x_0) = .95$ -$$ - \begin{eqnarray*} - P(X \leq x_0) & = & P\left(\frac{X - \mu}{\sigma} \leq \frac{x_0 - \mu}{\sigma}\right) \\ \\ - & = & P\left(Z \leq \frac{x_0 - \mu}{\sigma}\right) = .95 - \end{eqnarray*} -$$ -- Therefore - $$\frac{x_0 - \mu}{\sigma} = 1.645$$ - or $x_0 = \mu + \sigma 1.645$ -- In general $x_0 = \mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile - ---- - -## Question - -- What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is 2 standard deviations above the mean? 
-- We want to know -$$ - \begin{eqnarray*} - P(X > \mu + 2\sigma) & = & -P\left(\frac{X -\mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) \\ \\ -& = & P(Z \geq 2 ) \\ \\ -& \approx & 2.5\% - \end{eqnarray*} -$$ - ---- - -## Other properties - -- The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal) -- A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?) -- Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?) -- Sample means of normally distributed random variables are again normally distributed (with what mean and variance?) -- The square of a *standard normal* random variable follows what is called **chi-squared** distribution -- The exponent of a normally distributed random variables follows what is called the **log-normal** distribution -- As we will see later, many random variables, properly normalized, *limit* to a normal distribution - ---- - -## Final thoughts on normal likelihoods -- The MLE for $\mu$ is $\bar X$. -- The MLE for $\sigma^2$ is - $$ - \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} - $$ - (Which is the biased version of the sample variance.) 
-- The MLE of $\sigma$ is simply the square root of this - estimate - ---- -## The Poisson distribution -* Used to model counts -* The Poisson mass function is -$$ -P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} -$$ -for $x=0,1,\ldots$ -* The mean of this distribution is $\lambda$ -* The variance of this distribution is $\lambda$ -* Notice that $x$ ranges from $0$ to $\infty$ - ---- -## Some uses for the Poisson distribution -* Modeling event/time data -* Modeling radioactive decay -* Modeling survival data -* Modeling unbounded count data -* Modeling contingency tables -* Approximating binomials when $n$ is large and $p$ is small - ---- -## Poisson derivation -* $\lambda$ is the mean number of events per unit time -* Let $h$ be very small -* Suppose we assume that - * Prob. of an event in an interval of length $h$ is $\lambda h$ - while the prob. of more than one event is negligible - * Whether or not an event occurs in one small interval - does not impact whether or not an event occurs in another - small interval -then, the number of events per unit time is Poisson with mean $\lambda$ - ---- -## Rates and Poisson random variables -* Poisson random variables are used to model rates -* $X \sim Poisson(\lambda t)$ where - * $\lambda = E[X / t]$ is the expected count per unit of time - * $t$ is the total monitoring time - ---- -## Poisson approximation to the binomial -* When $n$ is large and $p$ is small the Poisson distribution - is an accurate approximation to the binomial distribution -* Notation - * $\lambda = n p$ - * $X \sim \mbox{Binomial}(n, p)$, $\lambda = n p$ and - * $n$ gets large - * $p$ gets small - * $\lambda$ stays constant - ---- -## Example -The number of people that show up at a bus stop is Poisson with -a mean of $2.5$ per hour. - -If watching the bus stop for 4 hours, what is the probability that $3$ -or fewer people show up for the whole time? 
- - -```r -ppois(3, lambda = 2.5 * 4) -``` - -``` -[1] 0.01034 -``` - - ---- -## Example, Poisson approximation to the binomial - -We flip a coin with success probablity $0.01$ five hundred times. - -What's the probability of 2 or fewer successes? - - -```r -pbinom(2, size = 500, prob = .01) -``` - -``` -[1] 0.1234 -``` - -```r -ppois(2, lambda=500 * .01) -``` - -``` -[1] 0.1247 -``` - - +--- +title : Some Common Distributions +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + + +## The Bernoulli distribution + +- The **Bernoulli distribution** arises as the result of a binary outcome +- Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) $p$ and $1-p$ respectively +- The PMF for a Bernoulli random variable $X$ is $$P(X = x) = p^x (1 - p)^{1 - x}$$ +- The mean of a Bernoulli random variable is $p$ and the variance is $p(1 - p)$ +- If we let $X$ be a Bernoulli random variable, it is typical to call $X=1$ as a "success" and $X=0$ as a "failure" + +--- + +## iid Bernoulli trials + +- If several iid Bernoulli observations, say $x_1,\ldots, x_n$, are observed the +likelihood is +$$ + \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{n - \sum x_i} +$$ +- Notice that the likelihood depends only on the sum of the $x_i$ +- Because $n$ is fixed and assumed known, this implies that the sample proportion $\sum_i x_i / n$ contains all of the relevant information about $p$ +- We can maximize the Bernoulli likelihood over $p$ to obtain that $\hat p = \sum_i x_i / n$ is the maximum likelihood estimator for $p$ + +--- 
+## Plotting all possible likelihoods for a small n +``` +n <- 5 +pvals <- seq(0, 1, length = 1000) +plot(c(0, 1), c(0, 1.2), type = "n", frame = FALSE, xlab = "p", ylab = "likelihood") +text((0 : n) /n, 1.1, as.character(0 : n)) +sapply(0 : n, function(x) { + phat <- x / n + if (x == 0) lines(pvals, ( (1 - pvals) / (1 - phat) )^(n-x), lwd = 3) + else if (x == n) lines(pvals, (pvals / phat) ^ x, lwd = 3) + else lines(pvals, (pvals / phat ) ^ x * ( (1 - pvals) / (1 - phat) ) ^ (n-x), lwd = 3) + } +) +title(paste("Likelihoods for n = ", n)) +``` + +--- +![plot of chunk unnamed-chunk-1](assets/fig/unnamed-chunk-1.png) + + +--- + +## Binomial trials + +- The *binomial random variables* are obtained as the sum of iid Bernoulli trials +- In specific, let $X_1,\ldots,X_n$ be iid Bernoulli$(p)$; then $X = \sum_{i=1}^n X_i$ is a binomial random variable +- The binomial mass function is +$$ +P(X = x) = +\left( +\begin{array}{c} + n \\ x +\end{array} +\right) +p^x(1 - p)^{n-x} +$$ +for $x=0,\ldots,n$ + +--- + +## Choose + +- Recall that the notation + $$\left( + \begin{array}{c} + n \\ x + \end{array} + \right) = \frac{n!}{x!(n-x)!} + $$ (read "$n$ choose $x$") counts the number of ways of selecting $x$ items out of $n$ + without replacement disregarding the order of the items + +$$\left( + \begin{array}{c} + n \\ 0 + \end{array} + \right) = +\left( + \begin{array}{c} + n \\ n + \end{array} + \right) = 1 + $$ + +--- + +## Example justification of the binomial likelihood + +- Consider the probability of getting $6$ heads out of $10$ coin flips from a coin with success probability $p$ +- The probability of getting $6$ heads and $4$ tails in any specific order is + $$ + p^6(1-p)^4 + $$ +- There are +$$\left( +\begin{array}{c} + 10 \\ 6 +\end{array} +\right) +$$ +possible orders of $6$ heads and $4$ tails + +--- + +## Example + +- Suppose a friend has $8$ children (oh my!), $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each 
birth, what's the probability of getting $7$ or more girls out of $8$ births? +$$\left( +\begin{array}{c} + 8 \\ 7 +\end{array} +\right) .5^{7}(1-.5)^{1} ++ +\left( +\begin{array}{c} + 8 \\ 8 +\end{array} +\right) .5^{8}(1-.5)^{0} \approx 0.04 +$$ + +```r +choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8 +``` + +``` +## [1] 0.03516 +``` + +```r +pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE) +``` + +``` +## [1] 0.03516 +``` + + +--- + +```r +plot(pvals, dbinom(7, 8, pvals)/dbinom(7, 8, 7/8), lwd = 3, frame = FALSE, type = "l", + xlab = "p", ylab = "likelihood") +``` + +![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) + + +--- + +## The normal distribution + +- A random variable is said to follow a **normal** or **Gaussian** distribution with mean $\mu$ and variance $\sigma^2$ if the associated density is + $$ + (2\pi \sigma^2)^{-1/2}e^{-(x - \mu)^2/2\sigma^2} + $$ + If $X$ a RV with this density then $E[X] = \mu$ and $Var(X) = \sigma^2$ +- We write $X\sim \mbox{N}(\mu, \sigma^2)$ +- When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called **the standard normal distribution** +- The standard normal density function is labeled $\phi$ +- Standard normal RVs are often labeled $Z$ + +--- + +```r +zvals <- seq(-3, 3, length = 1000) +plot(zvals, dnorm(zvals), type = "l", lwd = 3, frame = FALSE, xlab = "z", ylab = "Density") +sapply(-3:3, function(k) abline(v = k)) +``` + +plot of chunk unnamed-chunk-4 + + +--- + +## Facts about the normal density + +- If $X \sim \mbox{N}(\mu,\sigma^2)$ the $Z = \frac{X -\mu}{\sigma}$ is standard normal +- If $Z$ is standard normal $$X = \mu + \sigma Z \sim \mbox{N}(\mu, \sigma^2)$$ +- The non-standard normal density is $$\phi\{(x - \mu) / \sigma\}/\sigma$$ + +--- + +## More facts about the normal density + +1. Approximately $68\%$, $95\%$ and $99\%$ of the normal density lies within $1$, $2$ and $3$ standard deviations from the mean, respectively +2. 
$-1.28$, $-1.645$, $-1.96$ and $-2.33$ are the $10^{th}$, $5^{th}$, $2.5^{th}$ and $1^{st}$ percentiles of the standard normal distribution respectively +3. By symmetry, $1.28$, $1.645$, $1.96$ and $2.33$ are the $90^{th}$, $95^{th}$, $97.5^{th}$ and $99^{th}$ percentiles of the standard normal distribution respectively + +--- + +## Question + +- What is the $95^{th}$ percentile of a $N(\mu, \sigma^2)$ distribution? + - Quick answer in R `qnorm(.95, mean = mu, sd = sd)` +- We want the point $x_0$ so that $P(X \leq x_0) = .95$ +$$ + \begin{eqnarray*} + P(X \leq x_0) & = & P\left(\frac{X - \mu}{\sigma} \leq \frac{x_0 - \mu}{\sigma}\right) \\ \\ + & = & P\left(Z \leq \frac{x_0 - \mu}{\sigma}\right) = .95 + \end{eqnarray*} +$$ +- Therefore + $$\frac{x_0 - \mu}{\sigma} = 1.645$$ + or $x_0 = \mu + \sigma 1.645$ +- In general $x_0 = \mu + \sigma z_0$ where $z_0$ is the appropriate standard normal quantile + +--- + +## Question + +- What is the probability that a $\mbox{N}(\mu,\sigma^2)$ RV is 2 standard deviations above the mean? +- We want to know +$$ + \begin{eqnarray*} + P(X > \mu + 2\sigma) & = & +P\left(\frac{X -\mu}{\sigma} > \frac{\mu + 2\sigma - \mu}{\sigma}\right) \\ \\ +& = & P(Z \geq 2 ) \\ \\ +& \approx & 2.5\% + \end{eqnarray*} +$$ + +--- + +## Other properties + +- The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal) +- A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?) +- Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?) +- Sample means of normally distributed random variables are again normally distributed (with what mean and variance?) 
+- The square of a *standard normal* random variable follows what is called **chi-squared** distribution +- The exponent of a normally distributed random variables follows what is called the **log-normal** distribution +- As we will see later, many random variables, properly normalized, *limit* to a normal distribution + +--- + +## Final thoughts on normal likelihoods +- The MLE for $\mu$ is $\bar X$. +- The MLE for $\sigma^2$ is + $$ + \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} + $$ + (Which is the biased version of the sample variance.) +- The MLE of $\sigma$ is simply the square root of this + estimate + +--- +## The Poisson distribution +* Used to model counts +* The Poisson mass function is +$$ +P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} +$$ +for $x=0,1,\ldots$ +* The mean of this distribution is $\lambda$ +* The variance of this distribution is $\lambda$ +* Notice that $x$ ranges from $0$ to $\infty$ + +--- +## Some uses for the Poisson distribution +* Modeling event/time data +* Modeling radioactive decay +* Modeling survival data +* Modeling unbounded count data +* Modeling contingency tables +* Approximating binomials when $n$ is large and $p$ is small + +--- +## Poisson derivation +* $\lambda$ is the mean number of events per unit time +* Let $h$ be very small +* Suppose we assume that + * Prob. of an event in an interval of length $h$ is $\lambda h$ + while the prob. 
of more than one event is negligible + * Whether or not an event occurs in one small interval + does not impact whether or not an event occurs in another + small interval +then, the number of events per unit time is Poisson with mean $\lambda$ + +--- +## Rates and Poisson random variables +* Poisson random variables are used to model rates +* $X \sim Poisson(\lambda t)$ where + * $\lambda = E[X / t]$ is the expected count per unit of time + * $t$ is the total monitoring time + +--- +## Poisson approximation to the binomial +* When $n$ is large and $p$ is small the Poisson distribution + is an accurate approximation to the binomial distribution +* Notation + * $\lambda = n p$ + * $X \sim \mbox{Binomial}(n, p)$, $\lambda = n p$ and + * $n$ gets large + * $p$ gets small + * $\lambda$ stays constant + +--- +## Example +The number of people that show up at a bus stop is Poisson with +a mean of $2.5$ per hour. + +If watching the bus stop for 4 hours, what is the probability that $3$ +or fewer people show up for the whole time? + + +```r +ppois(3, lambda = 2.5 * 4) +``` + +``` +## [1] 0.01034 +``` + + +--- +## Example, Poisson approximation to the binomial + +We flip a coin with success probablity $0.01$ five hundred times. + +What's the probability of 2 or fewer successes? 
+ + +```r +pbinom(2, size = 500, prob = 0.01) +``` + +``` +## [1] 0.1234 +``` + +```r +ppois(2, lambda = 500 * 0.01) +``` + +``` +## [1] 0.1247 +``` + + diff --git a/06_StatisticalInference/02_01_CommonDistributions/index.pdf b/06_StatisticalInference/02_01_CommonDistributions/index.pdf index 633936fe6..899e8bd29 100644 Binary files a/06_StatisticalInference/02_01_CommonDistributions/index.pdf and b/06_StatisticalInference/02_01_CommonDistributions/index.pdf differ diff --git a/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..94ea19152 Binary files /dev/null and b/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..2a217f6ad Binary files /dev/null and b/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..484506bca Binary files /dev/null and b/06_StatisticalInference/02_02_Asymptopia/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/02_02_Asymptopia/index.Rmd b/06_StatisticalInference/02_02_Asymptopia/index.Rmd index a199bd6bf..646a93fe6 100644 --- a/06_StatisticalInference/02_02_Asymptopia/index.Rmd +++ b/06_StatisticalInference/02_02_Asymptopia/index.Rmd @@ -1,271 +1,255 @@ ---- -title : A trip to Asymptopia -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # 
{highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` -## Asymptotics -* Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number) -* (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.) -* Asymptotics are incredibly useful for simple statistical inference and approximations -* (Not covered in this class) Asymptotics often lead to nice understanding of procedures -* Asymptotics generally give no assurances about finite sample performance - * The kinds of asymptotics that do are orders of magnitude more difficult to work with -* Asymptotics form the basis for frequency interpretation of probabilities - (the long run proportion of times an event occurs) -* To understand asymptotics, we need a very basic understanding of limits. - - ---- -## Numerical limits - -- Imagine a sequence - - - $a_1 = .9$, - - $a_2 = .99$, - - $a_3 = .999$, ... 
- -- Clearly this sequence converges to $1$ -- Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on - ---- - -## Limits of random variables - -- The problem is harder for random variables -- Consider $\bar X_n$ the sample average of the first $n$ of a collection of $iid$ observations - - - Example $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads) - -- We say that $\bar X_n$ converges in probability to a limit if for any fixed distance the probability of $\bar X_n$ being closer (further away) than that distance from the limit converges to one (zero) - ---- - -## The Law of Large Numbers - -- Establishing that a random sequence converges to a limit is hard -- Fortunately, we have a theorem that does all the work for us, called - the **Law of Large Numbers** -- The law of large numbers states that if $X_1,\ldots X_n$ are iid from a population with mean $\mu$ and variance $\sigma^2$ then $\bar X_n$ converges in probability to $\mu$ -- (There are many variations on the LLN; we are using a particularly lazy version, my favorite kind of version) - ---- -## Law of large numbers in action -```{r, fig.height=4, fig.width=4} -n <- 10000; means <- cumsum(rnorm(n)) / (1 : n) -plot(1 : n, means, type = "l", lwd = 2, - frame = FALSE, ylab = "cumulative means", xlab = "sample size") -abline(h = 0) -``` ---- -## Discussion -- An estimator is **consistent** if it converges to what you want to estimate - - Consistency is neither necessary nor sufficient for one estimator to be better than another - - Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer -- The LLN basically states that the sample mean is consistent -- The sample variance and the sample standard deviation are consistent as well -- Recall also that 
the sample mean and the sample variance are unbiased as well -- (The sample standard deviation is biased, by the way) - ---- - -## The Central Limit Theorem - -- The **Central Limit Theorem** (CLT) is one of the most important theorems in statistics -- For our purposes, the CLT states that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases -- The CLT applies in an endless variety of settings -- Let $X_1,\ldots,X_n$ be a collection of iid random variables with mean $\mu$ and variance $\sigma^2$ -- Let $\bar X_n$ be their sample average -- Then $\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}$ has a distribution like that of a standard normal for large $n$. -- Remember the form -$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = - \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}. -$$ -- Usually, replacing the standard error by its estimated value doesn't change the CLT - ---- - -## Example - -- Simulate a standard normal random variable by rolling $n$ (six sided) -- Let $X_i$ be the outcome for die $i$ -- Then note that $\mu = E[X_i] = 3.5$ -- $Var(X_i) = 2.92$ -- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$ -- Standardized mean -$$ - \frac{\bar X_n - 3.5}{1.71/\sqrt{n}} -$$ - ---- -## Simulation of mean of $n$ dice -```{r, echo = FALSE, fig.width=9, fig.height = 3} -par(mfrow = c(1, 3)) -for (n in c(1, 2, 6)){ - temp <- matrix(sample(1 : 6, n * 10000, replace = TRUE), ncol = n) - temp <- apply(temp, 1, mean) - temp <- (temp - 3.5) / (1.71 / sqrt(n)) - dty <- density(temp) - plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5)) - title(paste("sample mean of", n, "obs")) - lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3) - lines(dty$x, dty$y, lwd = 2) -} -``` ---- - -## Coin CLT - - - Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin -- The sample proportion, say 
$\hat p$, is the average of the coin flips -- $E[X_i] = p$ and $Var(X_i) = p(1-p)$ -- Standard error of the mean is $\sqrt{p(1-p)/n}$ -- Then -$$ - \frac{\hat p - p}{\sqrt{p(1-p)/n}} -$$ -will be approximately normally distributed - ---- - -```{r, echo = FALSE, fig.width=7.5, fig.height = 5} -par(mfrow = c(2, 3)) -for (n in c(1, 10, 20)){ - temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE), ncol = n) - temp <- apply(temp, 1, mean) - temp <- (temp - .5) * 2 * sqrt(n) - dty <- density(temp) - plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5)) - title(paste("sample mean of", n, "obs")) - lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3) - lines(dty$x, dty$y, lwd = 2) -} -for (n in c(1, 10, 20)){ - temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE, prob = c(.9, .1)), ncol = n) - temp <- apply(temp, 1, mean) - temp <- (temp - .1) / sqrt(.1 * .9 / n) - dty <- density(temp) - plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5)) - title(paste("sample mean of", n, "obs")) - lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3) - lines(dty$x, dty$y, lwd = 2) -} -``` - ---- - -## CLT in practice - -- In practice the CLT is mostly useful as an approximation -$$ - P\left( \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq z \right) \approx \Phi(z). 
-$$ -- Recall $1.96$ is a good approximation to the $.975^{th}$ quantile of the standard normal -- Consider -$$ - \begin{eqnarray*} - .95 & \approx & P\left( -1.96 \leq \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq 1.96 \right)\\ \\ - & = & P\left(\bar X_n +1.96 \sigma/\sqrt{n} \geq \mu \geq \bar X_n - 1.96\sigma/\sqrt{n} \right),\\ - \end{eqnarray*} -$$ - ---- - -## Confidence intervals - -- Therefore, according to the CLT, the probability that the random interval $$\bar X_n \pm z_{1-\alpha/2}\sigma / \sqrt{n}$$ contains $\mu$ is approximately 100$(1-\alpha)$%, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution -- This is called a $100(1 - \alpha)$% **confidence interval** for $\mu$ -- We can replace the unknown $\sigma$ with $s$ - ---- -## Give a confidence interval for the average height of sons -in Galton's data -```{r} -library(UsingR);data(father.son); x <- father.son$sheight -(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12 -``` - ---- - -## Sample proportions - -- In the event that each $X_i$ is $0$ or $1$ with common success probability $p$ then $\sigma^2 = p(1 - p)$ -- The interval takes the form -$$ - \hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} -$$ -- Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$ -- Also note that $p(1-p) \leq 1/4$ for $0 \leq p \leq 1$ -- Let $\alpha = .05$ so that $z_{1 -\alpha/2} = 1.96 \approx 2$ then -$$ - 2 \sqrt{\frac{p(1 - p)}{n}} \leq 2 \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}} -$$ -- Therefore $\hat p \pm \frac{1}{\sqrt{n}}$ is a quick CI estimate for $p$ - ---- -## Example -* Your campaign advisor told you that in a random sample of 100 likely voters, - 56 intent to vote for you. - * Can you relax? Do you have this race in the bag? - * Without access to a computer or calculator, how precise is this estimate? 
-* `1/sqrt(100)=.1` so a back of the envelope calculation gives an approximate 95% interval of `(0.46, 0.66)` - * Not enough for you to relax, better go do more campaigning! -* Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3. -```{r} -round(1 / sqrt(10 ^ (1 : 6)), 3) -``` ---- -## Poisson interval -* A nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day? -* $X \sim Poisson(\lambda t)$. -* Estimate $\hat \lambda = X/t$ -* $Var(\hat \lambda) = \lambda / t$ -$$ -\frac{\hat \lambda - \lambda}{\sqrt{\hat \lambda / t}} -= -\frac{X - t \lambda}{\sqrt{X}} -\rightarrow N(0,1) -$$ -* This isn't the best interval. - * There are better asymptotic intervals. - * You can get an exact CI in this case. - ---- -### R code -```{r} -x <- 5; t <- 94.32; lambda <- x / t -round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3) -poisson.test(x, T = 94.32)$conf -``` - ---- -## In the regression class -```{r} -exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log)))) -``` - +--- +title : A trip to Asymptopia +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Asymptotics +* Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number) +* (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.) 
+* Asymptotics are incredibly useful for simple statistical inference and approximations +* (Not covered in this class) Asymptotics often lead to nice understanding of procedures +* Asymptotics generally give no assurances about finite sample performance + * The kinds of asymptotics that do are orders of magnitude more difficult to work with +* Asymptotics form the basis for frequency interpretation of probabilities + (the long run proportion of times an event occurs) +* To understand asymptotics, we need a very basic understanding of limits. + + +--- +## Numerical limits + +- Imagine a sequence + + - $a_1 = .9$, + - $a_2 = .99$, + - $a_3 = .999$, ... + +- Clearly this sequence converges to $1$ +- Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on + +--- + +## Limits of random variables + +- The problem is harder for random variables +- Consider $\bar X_n$ the sample average of the first $n$ of a collection of $iid$ observations + + - Example $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. 
the sample proportion of heads) + +- We say that $\bar X_n$ converges in probability to a limit if for any fixed distance the probability of $\bar X_n$ being closer (further away) than that distance from the limit converges to one (zero) + +--- + +## The Law of Large Numbers + +- Establishing that a random sequence converges to a limit is hard +- Fortunately, we have a theorem that does all the work for us, called + the **Law of Large Numbers** +- The law of large numbers states that if $X_1,\ldots X_n$ are iid from a population with mean $\mu$ and variance $\sigma^2$ then $\bar X_n$ converges in probability to $\mu$ +- (There are many variations on the LLN; we are using a particularly lazy version, my favorite kind of version) + +--- +## Law of large numbers in action +```{r, fig.height=4, fig.width=4} +n <- 10000; means <- cumsum(rnorm(n)) / (1 : n) +plot(1 : n, means, type = "l", lwd = 2, + frame = FALSE, ylab = "cumulative means", xlab = "sample size") +abline(h = 0) +``` +--- +## Discussion +- An estimator is **consistent** if it converges to what you want to estimate + - Consistency is neither necessary nor sufficient for one estimator to be better than another + - Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer +- The LLN basically states that the sample mean is consistent +- The sample variance and the sample standard deviation are consistent as well +- Recall also that the sample mean and the sample variance are unbiased as well +- (The sample standard deviation is biased, by the way) + +--- + +## The Central Limit Theorem + +- The **Central Limit Theorem** (CLT) is one of the most important theorems in statistics +- For our purposes, the CLT states that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases +- The CLT applies in an endless variety of settings +- 
Let $X_1,\ldots,X_n$ be a collection of iid random variables with mean $\mu$ and variance $\sigma^2$ +- Let $\bar X_n$ be their sample average +- Then $\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}$ has a distribution like that of a standard normal for large $n$. +- Remember the form +$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = + \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}. +$$ +- Usually, replacing the standard error by its estimated value doesn't change the CLT + +--- + +## Example + +- Simulate a standard normal random variable by rolling $n$ (six sided) +- Let $X_i$ be the outcome for die $i$ +- Then note that $\mu = E[X_i] = 3.5$ +- $Var(X_i) = 2.92$ +- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$ +- Standardized mean +$$ + \frac{\bar X_n - 3.5}{1.71/\sqrt{n}} +$$ + +--- +## Simulation of mean of $n$ dice +```{r, echo = FALSE, fig.width=9, fig.height = 3} +par(mfrow = c(1, 3)) +for (n in c(1, 2, 6)){ + temp <- matrix(sample(1 : 6, n * 10000, replace = TRUE), ncol = n) + temp <- apply(temp, 1, mean) + temp <- (temp - 3.5) / (1.71 / sqrt(n)) + dty <- density(temp) + plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5)) + title(paste("sample mean of", n, "obs")) + lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3) + lines(dty$x, dty$y, lwd = 2) +} +``` +--- + +## Coin CLT + + - Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin +- The sample proportion, say $\hat p$, is the average of the coin flips +- $E[X_i] = p$ and $Var(X_i) = p(1-p)$ +- Standard error of the mean is $\sqrt{p(1-p)/n}$ +- Then +$$ + \frac{\hat p - p}{\sqrt{p(1-p)/n}} +$$ +will be approximately normally distributed + +--- + +```{r, echo = FALSE, fig.width=7.5, fig.height = 5} +par(mfrow = c(2, 3)) +for (n in c(1, 10, 20)){ + temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE), ncol = n) + temp <- apply(temp, 1, mean) + temp <- (temp - .5) * 2 * sqrt(n) + dty 
<- density(temp) + plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5)) + title(paste("sample mean of", n, "obs")) + lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3) + lines(dty$x, dty$y, lwd = 2) +} +for (n in c(1, 10, 20)){ + temp <- matrix(sample(0 : 1, n * 10000, replace = TRUE, prob = c(.9, .1)), ncol = n) + temp <- apply(temp, 1, mean) + temp <- (temp - .1) / sqrt(.1 * .9 / n) + dty <- density(temp) + plot(dty$x, dty$y, xlab = "", ylab = "density", type = "n", xlim = c(-3, 3), ylim = c(0, .5)) + title(paste("sample mean of", n, "obs")) + lines(seq(-3, 3, length = 100), dnorm(seq(-3, 3, length = 100)), col = grey(.8), lwd = 3) + lines(dty$x, dty$y, lwd = 2) +} +``` + +--- + +## CLT in practice + +- In practice the CLT is mostly useful as an approximation +$$ + P\left( \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq z \right) \approx \Phi(z). +$$ +- Recall $1.96$ is a good approximation to the $.975^{th}$ quantile of the standard normal +- Consider +$$ + \begin{eqnarray*} + .95 & \approx & P\left( -1.96 \leq \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq 1.96 \right)\\ \\ + & = & P\left(\bar X_n +1.96 \sigma/\sqrt{n} \geq \mu \geq \bar X_n - 1.96\sigma/\sqrt{n} \right),\\ + \end{eqnarray*} +$$ + +--- + +## Confidence intervals + +- Therefore, according to the CLT, the probability that the random interval $$\bar X_n \pm z_{1-\alpha/2}\sigma / \sqrt{n}$$ contains $\mu$ is approximately 100$(1-\alpha)$%, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution +- This is called a $100(1 - \alpha)$% **confidence interval** for $\mu$ +- We can replace the unknown $\sigma$ with $s$ + +--- +## Give a confidence interval for the average height of sons +in Galton's data +```{r} +library(UsingR);data(father.son); x <- father.son$sheight +(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12 +``` + +--- + +## Sample proportions + +- In the event that 
each $X_i$ is $0$ or $1$ with common success probability $p$ then $\sigma^2 = p(1 - p)$ +- The interval takes the form +$$ + \hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} +$$ +- Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$ +- Also note that $p(1-p) \leq 1/4$ for $0 \leq p \leq 1$ +- Let $\alpha = .05$ so that $z_{1 -\alpha/2} = 1.96 \approx 2$ then +$$ + 2 \sqrt{\frac{p(1 - p)}{n}} \leq 2 \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}} +$$ +- Therefore $\hat p \pm \frac{1}{\sqrt{n}}$ is a quick CI estimate for $p$ + +--- +## Example +* Your campaign advisor told you that in a random sample of 100 likely voters, + 56 intent to vote for you. + * Can you relax? Do you have this race in the bag? + * Without access to a computer or calculator, how precise is this estimate? +* `1/sqrt(100)=.1` so a back of the envelope calculation gives an approximate 95% interval of `(0.46, 0.66)` + * Not enough for you to relax, better go do more campaigning! +* Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3. +```{r} +round(1 / sqrt(10 ^ (1 : 6)), 3) +``` +--- +## Poisson interval +* A nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day? +* $X \sim Poisson(\lambda t)$. +* Estimate $\hat \lambda = X/t$ +* $Var(\hat \lambda) = \lambda / t$ +$$ +\frac{\hat \lambda - \lambda}{\sqrt{\hat \lambda / t}} += +\frac{X - t \lambda}{\sqrt{X}} +\rightarrow N(0,1) +$$ +* This isn't the best interval. + * There are better asymptotic intervals. + * You can get an exact CI in this case. 
+ +--- +### R code +```{r} +x <- 5; t <- 94.32; lambda <- x / t +round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3) +poisson.test(x, T = 94.32)$conf +``` + +--- +## In the regression class +```{r} +exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log)))) +``` + diff --git a/06_StatisticalInference/02_02_Asymptopia/index.html b/06_StatisticalInference/02_02_Asymptopia/index.html index 0cf77bf3a..025ef3536 100644 --- a/06_StatisticalInference/02_02_Asymptopia/index.html +++ b/06_StatisticalInference/02_02_Asymptopia/index.html @@ -1,580 +1,588 @@ - - - - A trip to Asymptopia - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

A trip to Asymptopia

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Asymptotics

-
-
-
    -
  • Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) tends to infinity (or some other relevant number)
  • -
  • (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.)
  • -
  • Asymptotics are incredibly useful for simple statistical inference and approximations
  • -
  • (Not covered in this class) Asymptotics often lead to nice understanding of procedures
  • -
  • Asymptotics generally give no assurances about finite sample performance - -
      -
    • The kinds of asymptotics that do are orders of magnitude more difficult to work with
    • -
  • -
  • Asymptotics form the basis for frequency interpretation of probabilities -(the long run proportion of times an event occurs)
  • -
  • To understand asymptotics, we need a very basic understanding of limits.
  • -
- -
- -
- - -
-

Numerical limits

-
-
-
    -
  • Imagine a sequence

    - -
      -
    • \(a_1 = .9\),
    • -
    • \(a_2 = .99\),
    • -
    • \(a_3 = .999\), ...
    • -
  • -
  • Clearly this sequence converges to \(1\)

  • -
  • Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on

  • -
- -
- -
- - -
-

Limits of random variables

-
-
-
    -
  • The problem is harder for random variables
  • -
  • Consider \(\bar X_n\) the sample average of the first \(n\) of a collection of \(iid\) observations

    - -
      -
    • Example \(\bar X_n\) could be the average of the result of \(n\) coin flips (i.e. the sample proportion of heads)
    • -
  • -
  • We say that \(\bar X_n\) converges in probability to a limit if for any fixed distance the probability of \(\bar X_n\) being closer (further away) than that distance from the limit converges to one (zero)

  • -
- -
- -
- - -
-

The Law of Large Numbers

-
-
-
    -
  • Establishing that a random sequence converges to a limit is hard
  • -
  • Fortunately, we have a theorem that does all the work for us, called -the Law of Large Numbers
  • -
  • The law of large numbers states that if \(X_1,\ldots, X_n\) are iid from a population with mean \(\mu\) and variance \(\sigma^2\) then \(\bar X_n\) converges in probability to \(\mu\)
  • -
  • (There are many variations on the LLN; we are using a particularly lazy version, my favorite kind of version)
  • -
- -
- -
- - -
-

Law of large numbers in action

-
-
-
n <- 10000; means <- cumsum(rnorm(n)) / (1  : n)
-plot(1 : n, means, type = "l", lwd = 2, 
-     frame = FALSE, ylab = "cumulative means", xlab = "sample size")
-abline(h = 0)
-
- -
plot of chunk unnamed-chunk-1
- -
- -
- - -
-

Discussion

-
-
-
    -
  • An estimator is consistent if it converges to what you want to estimate - -
      -
    • Consistency is neither necessary nor sufficient for one estimator to be better than another
    • -
    • Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer
    • -
  • -
  • The LLN basically states that the sample mean is consistent
  • -
  • The sample variance and the sample standard deviation are consistent as well
  • -
  • Recall also that the sample mean and the sample variance are unbiased
  • -
  • (The sample standard deviation is biased, by the way)
  • -
- -
- -
- - -
-

The Central Limit Theorem

-
-
-
    -
  • The Central Limit Theorem (CLT) is one of the most important theorems in statistics
  • -
  • For our purposes, the CLT states that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases
  • -
  • The CLT applies in an endless variety of settings
  • -
  • Let \(X_1,\ldots,X_n\) be a collection of iid random variables with mean \(\mu\) and variance \(\sigma^2\)
  • -
  • Let \(\bar X_n\) be their sample average
  • -
  • Then \(\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}\) has a distribution like that of a standard normal for large \(n\).
  • -
  • Remember the form -\[\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = -\frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}. -\]
  • -
  • Usually, replacing the standard error by its estimated value doesn't change the CLT
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Simulate a standard normal random variable by rolling \(n\) (six sided) dice
  • -
  • Let \(X_i\) be the outcome for die \(i\)
  • -
  • Then note that \(\mu = E[X_i] = 3.5\)
  • -
  • \(Var(X_i) = 2.92\)
  • -
  • SE \(\sqrt{2.92 / n} = 1.71 / \sqrt{n}\)
  • -
  • Standardized mean -\[ -\frac{\bar X_n - 3.5}{1.71/\sqrt{n}} -\]
  • -
- -
- -
- - -
-

Simulation of mean of \(n\) dice

-
-
-
plot of chunk unnamed-chunk-2
- -
- -
- - -
-

Coin CLT

-
-
-
    -
  • Let \(X_i\) be the \(0\) or \(1\) result of the \(i^{th}\) flip of a possibly unfair coin - -
      -
    • The sample proportion, say \(\hat p\), is the average of the coin flips
    • -
    • \(E[X_i] = p\) and \(Var(X_i) = p(1-p)\)
    • -
    • Standard error of the mean is \(\sqrt{p(1-p)/n}\)
    • -
    • Then -\[ -\frac{\hat p - p}{\sqrt{p(1-p)/n}} -\] -will be approximately normally distributed
    • -
  • -
- -
- -
- - -
-
plot of chunk unnamed-chunk-3
- -
- -
- - -
-

CLT in practice

-
-
-
    -
  • In practice the CLT is mostly useful as an approximation -\[ -P\left( \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq z \right) \approx \Phi(z). -\]
  • -
  • Recall \(1.96\) is a good approximation to the \(.975^{th}\) quantile of the standard normal
  • -
  • Consider -\[ -\begin{eqnarray*} - .95 & \approx & P\left( -1.96 \leq \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq 1.96 \right)\\ \\ - & = & P\left(\bar X_n +1.96 \sigma/\sqrt{n} \geq \mu \geq \bar X_n - 1.96\sigma/\sqrt{n} \right),\\ -\end{eqnarray*} -\]
  • -
- -
- -
- - -
-

Confidence intervals

-
-
-
    -
  • Therefore, according to the CLT, the probability that the random interval \[\bar X_n \pm z_{1-\alpha/2}\sigma / \sqrt{n}\] contains \(\mu\) is approximately 100\((1-\alpha)\)%, where \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution
  • -
  • This is called a \(100(1 - \alpha)\)% confidence interval for \(\mu\)
  • -
  • We can replace the unknown \(\sigma\) with \(s\)
  • -
- -
- -
- - -
-

Give a confidence interval for the average height of sons

-
-
-

in Pearson's father and son data (the father.son data set in UsingR)

- -
library(UsingR);data(father.son); x <- father.son$sheight
-(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12
-
- -
[1] 5.710 5.738
-
- -
- -
- - -
-

Sample proportions

-
-
-
    -
  • In the event that each \(X_i\) is \(0\) or \(1\) with common success probability \(p\) then \(\sigma^2 = p(1 - p)\)
  • -
  • The interval takes the form -\[ -\hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} -\]
  • -
  • Replacing \(p\) by \(\hat p\) in the standard error results in what is called a Wald confidence interval for \(p\)
  • -
  • Also note that \(p(1-p) \leq 1/4\) for \(0 \leq p \leq 1\)
  • -
  • Let \(\alpha = .05\) so that \(z_{1 -\alpha/2} = 1.96 \approx 2\) then -\[ -2 \sqrt{\frac{p(1 - p)}{n}} \leq 2 \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}} -\]
  • -
  • Therefore \(\hat p \pm \frac{1}{\sqrt{n}}\) is a quick CI estimate for \(p\)
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Your campaign advisor told you that in a random sample of 100 likely voters, -56 intend to vote for you. - -
      -
    • Can you relax? Do you have this race in the bag?
    • -
    • Without access to a computer or calculator, how precise is this estimate?
    • -
  • -
  • 1/sqrt(100)=.1 so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66) - -
      -
    • Not enough for you to relax, better go do more campaigning!
    • -
  • -
  • Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.
  • -
- -
round(1 / sqrt(10 ^ (1 : 6)), 3)
-
- -
[1] 0.316 0.100 0.032 0.010 0.003 0.001
-
- -
- -
- - -
-

Poisson interval

-
-
-
    -
  • A nuclear pump failed 5 times in 94.32 days. Give a 95% confidence interval for the failure rate per day.
  • -
  • \(X \sim Poisson(\lambda t)\).
  • -
  • Estimate \(\hat \lambda = X/t\)
  • -
  • \(Var(\hat \lambda) = \lambda / t\) -\[ -\frac{\hat \lambda - \lambda}{\sqrt{\hat \lambda / t}} -= -\frac{X - t \lambda}{\sqrt{X}} -\rightarrow N(0,1) -\]
  • -
  • This isn't the best interval. - -
      -
    • There are better asymptotic intervals.
    • -
    • You can get an exact CI in this case.
    • -
  • -
- -
- -
- - -
-

R code

-
-
-
x <- 5; t <- 94.32; lambda <- x / t
-round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
-
- -
[1] 0.007 0.099
-
- -
poisson.test(x, T = 94.32)$conf
-
- -
[1] 0.01721 0.12371
-attr(,"conf.level")
-[1] 0.95
-
- -
- -
- - -
-

In the regression class

-
-
-
exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log))))
-
- -
  2.5 %  97.5 % 
-0.01901 0.11393 
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + A trip to Asymptopia + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

A trip to Asymptopia

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Asymptotics

+
+
+
    +
  • Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) tends to infinity (or some other relevant number)
  • +
  • (Asymptopia is my name for the land of asymptotics, where everything works out well and there are no messes. The land of infinite data is nice that way.)
  • +
  • Asymptotics are incredibly useful for simple statistical inference and approximations
  • +
  • (Not covered in this class) Asymptotics often lead to nice understanding of procedures
  • +
  • Asymptotics generally give no assurances about finite sample performance + +
      +
    • The kinds of asymptotics that do are orders of magnitude more difficult to work with
    • +
  • +
  • Asymptotics form the basis for frequency interpretation of probabilities +(the long run proportion of times an event occurs)
  • +
  • To understand asymptotics, we need a very basic understanding of limits.
  • +
+ +
+ +
+ + +
+

Numerical limits

+
+
+
    +
  • Imagine a sequence

    + +
      +
    • \(a_1 = .9\),
    • +
    • \(a_2 = .99\),
    • +
    • \(a_3 = .999\), ...
    • +
  • +
  • Clearly this sequence converges to \(1\)

  • +
  • Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on

  • +
+ +
+ +
+ + +
+

Limits of random variables

+
+
+
    +
  • The problem is harder for random variables
  • +
  • Consider \(\bar X_n\) the sample average of the first \(n\) of a collection of \(iid\) observations

    + +
      +
    • Example \(\bar X_n\) could be the average of the result of \(n\) coin flips (i.e. the sample proportion of heads)
    • +
  • +
  • We say that \(\bar X_n\) converges in probability to a limit if for any fixed distance the probability of \(\bar X_n\) being closer (further away) than that distance from the limit converges to one (zero)

  • +
+ +
+ +
+ + +
+

The Law of Large Numbers

+
+
+
    +
  • Establishing that a random sequence converges to a limit is hard
  • +
  • Fortunately, we have a theorem that does all the work for us, called +the Law of Large Numbers
  • +
  • The law of large numbers states that if \(X_1,\ldots, X_n\) are iid from a population with mean \(\mu\) and variance \(\sigma^2\) then \(\bar X_n\) converges in probability to \(\mu\)
  • +
  • (There are many variations on the LLN; we are using a particularly lazy version, my favorite kind of version)
  • +
+ +
+ +
+ + +
+

Law of large numbers in action

+
+
+
n <- 10000
+means <- cumsum(rnorm(n))/(1:n)
+plot(1:n, means, type = "l", lwd = 2, frame = FALSE, ylab = "cumulative means", 
+    xlab = "sample size")
+abline(h = 0)
+
+ +

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

Discussion

+
+
+
    +
  • An estimator is consistent if it converges to what you want to estimate + +
      +
    • Consistency is neither necessary nor sufficient for one estimator to be better than another
    • +
    • Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer
    • +
  • +
  • The LLN basically states that the sample mean is consistent
  • +
  • The sample variance and the sample standard deviation are consistent as well
  • +
  • Recall also that the sample mean and the sample variance are unbiased
  • +
  • (The sample standard deviation is biased, by the way)
  • +
+ +
+ +
+ + +
+

The Central Limit Theorem

+
+
+
    +
  • The Central Limit Theorem (CLT) is one of the most important theorems in statistics
  • +
  • For our purposes, the CLT states that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases
  • +
  • The CLT applies in an endless variety of settings
  • +
  • Let \(X_1,\ldots,X_n\) be a collection of iid random variables with mean \(\mu\) and variance \(\sigma^2\)
  • +
  • Let \(\bar X_n\) be their sample average
  • +
  • Then \(\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}\) has a distribution like that of a standard normal for large \(n\).
  • +
  • Remember the form +\[\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = +\frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}. +\]
  • +
  • Usually, replacing the standard error by its estimated value doesn't change the CLT
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Simulate a standard normal random variable by rolling \(n\) (six sided) dice
  • +
  • Let \(X_i\) be the outcome for die \(i\)
  • +
  • Then note that \(\mu = E[X_i] = 3.5\)
  • +
  • \(Var(X_i) = 2.92\)
  • +
  • SE \(\sqrt{2.92 / n} = 1.71 / \sqrt{n}\)
  • +
  • Standardized mean +\[ +\frac{\bar X_n - 3.5}{1.71/\sqrt{n}} +\]
  • +
+ +
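The numbers on this slide can be checked directly. The sketch below is not part of the original slides: it computes the mean and variance of one fair die roll, then standardizes the mean of a small sample. The seed and the choice of `n` are arbitrary, illustrative assumptions.

```r
# Exact mean and variance of a single roll of a fair six-sided die
faces  <- 1:6
mu     <- mean(faces)               # E[X_i] = 3.5
sigma2 <- mean((faces - mu)^2)      # Var(X_i) = 35/12, about 2.92
round(c(mean = mu, variance = sigma2, sd = sqrt(sigma2)), 2)

# Standardized mean of n rolls; seed and n are arbitrary illustrative choices
set.seed(1)
n    <- 6
xbar <- mean(sample(faces, n, replace = TRUE))
(xbar - mu) / (sqrt(sigma2) / sqrt(n))
```

Repeating the last three lines many times and plotting the results gives the kind of density shown on the simulation slide.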
+ +
+ + +
+

Simulation of mean of \(n\) dice

+
+
+

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+

Coin CLT

+
+
+
    +
  • Let \(X_i\) be the \(0\) or \(1\) result of the \(i^{th}\) flip of a possibly unfair coin + +
      +
    • The sample proportion, say \(\hat p\), is the average of the coin flips
    • +
    • \(E[X_i] = p\) and \(Var(X_i) = p(1-p)\)
    • +
    • Standard error of the mean is \(\sqrt{p(1-p)/n}\)
    • +
    • Then +\[ +\frac{\hat p - p}{\sqrt{p(1-p)/n}} +\] +will be approximately normally distributed
    • +
  • +
+ +
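As a rough check of the normalization above, one can simulate many sample proportions and standardize them. This is a minimal sketch rather than the code behind the slide's figures; `p`, `n`, `nosim` and the seed are illustrative assumptions.

```r
# Simulate sample proportions from a possibly unfair coin and standardize them
set.seed(2014)                       # arbitrary seed, only for reproducibility
p <- 0.1; n <- 20; nosim <- 10000    # illustrative values, not from the slides
phat <- rbinom(nosim, size = n, prob = p) / n
z <- (phat - p) / sqrt(p * (1 - p) / n)

# If the normalization is right, z has mean near 0 and standard deviation near 1
c(mean = mean(z), sd = sd(z))

# Compare an empirical probability with the normal approximation
c(empirical = mean(z <= 1.96), normal = pnorm(1.96))
```

With a small `n` and `p` far from 0.5 the two probabilities on the last line can differ noticeably, which is the finite sample caveat raised at the start of the lecture.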
+ +
+ + +
+

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

CLT in practice

+
+
+
    +
  • In practice the CLT is mostly useful as an approximation +\[ +P\left( \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq z \right) \approx \Phi(z). +\]
  • +
  • Recall \(1.96\) is a good approximation to the \(.975^{th}\) quantile of the standard normal
  • +
  • Consider +\[ +\begin{eqnarray*} + .95 & \approx & P\left( -1.96 \leq \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq 1.96 \right)\\ \\ + & = & P\left(\bar X_n +1.96 \sigma/\sqrt{n} \geq \mu \geq \bar X_n - 1.96\sigma/\sqrt{n} \right),\\ +\end{eqnarray*} +\]
  • +
+ +
+ +
+ + +
+

Confidence intervals

+
+
+
    +
  • Therefore, according to the CLT, the probability that the random interval \[\bar X_n \pm z_{1-\alpha/2}\sigma / \sqrt{n}\] contains \(\mu\) is approximately 100\((1-\alpha)\)%, where \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution
  • +
  • This is called a \(100(1 - \alpha)\)% confidence interval for \(\mu\)
  • +
  • We can replace the unknown \(\sigma\) with \(s\)
  • +
+ +
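The coverage claim can be explored by simulation. The sketch below, which is an addition and not part of the original slides, builds many intervals from means of die rolls (with \(\sigma\) replaced by \(s\)) and counts how often they contain the true mean. The sample size, number of simulations and seed are arbitrary assumptions.

```r
# Empirical coverage of the CLT interval xbar +/- z * s / sqrt(n)
set.seed(42)                          # arbitrary seed, only for reproducibility
n <- 50; nosim <- 2000; mu <- 3.5     # illustrative: means of 50 fair die rolls
covered <- replicate(nosim, {
  x  <- sample(1:6, n, replace = TRUE)
  ci <- mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]          # TRUE when the interval contains mu
})
mean(covered)                         # proportion of intervals that contain mu
```

The proportion should land near 0.95, though not exactly on it, since the CLT is only an approximation at any finite \(n\).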
+ +
+ + +
+

Give a confidence interval for the average height of sons

+
+
+

in Pearson's father and son data (the father.son data set in UsingR)

+ +
library(UsingR)
+data(father.son)
+x <- father.son$sheight
+(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12
+
+ +
## [1] 5.710 5.738
+
+ +
+ +
+ + +
+

Sample proportions

+
+
+
    +
  • In the event that each \(X_i\) is \(0\) or \(1\) with common success probability \(p\) then \(\sigma^2 = p(1 - p)\)
  • +
  • The interval takes the form +\[ +\hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} +\]
  • +
  • Replacing \(p\) by \(\hat p\) in the standard error results in what is called a Wald confidence interval for \(p\)
  • +
  • Also note that \(p(1-p) \leq 1/4\) for \(0 \leq p \leq 1\)
  • +
  • Let \(\alpha = .05\) so that \(z_{1 -\alpha/2} = 1.96 \approx 2\) then +\[ +2 \sqrt{\frac{p(1 - p)}{n}} \leq 2 \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}} +\]
  • +
  • Therefore \(\hat p \pm \frac{1}{\sqrt{n}}\) is a quick CI estimate for \(p\)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Your campaign advisor told you that in a random sample of 100 likely voters, +56 intend to vote for you. + +
      +
    • Can you relax? Do you have this race in the bag?
    • +
    • Without access to a computer or calculator, how precise is this estimate?
    • +
  • +
  • 1/sqrt(100)=.1 so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66) + +
      +
    • Not enough for you to relax, better go do more campaigning!
    • +
  • +
  • Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3.
  • +
+ +
round(1/sqrt(10^(1:6)), 3)
+
+ +
## [1] 0.316 0.100 0.032 0.010 0.003 0.001
+
+ +
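For comparison with the back of the envelope interval, here is a short sketch (an addition, not from the original slides) that computes the Wald interval for the same poll alongside the quick \(\hat p \pm 1/\sqrt{n}\) interval. The variable names are illustrative.

```r
# 56 of 100 likely voters: Wald interval versus the quick 1 / sqrt(n) interval
n    <- 100
phat <- 56 / n
wald  <- phat + c(-1, 1) * qnorm(.975) * sqrt(phat * (1 - phat) / n)
quick <- phat + c(-1, 1) / sqrt(n)
round(rbind(wald, quick), 3)
```

The quick interval is slightly wider because it uses the bound \(p(1-p) \leq 1/4\) and rounds 1.96 up to 2. Both intervals include 0.5 here.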
+ +
+ + +
+

Poisson interval

+
+
+
    +
  • A nuclear pump failed 5 times in 94.32 days. Give a 95% confidence interval for the failure rate per day.
  • +
  • \(X \sim Poisson(\lambda t)\).
  • +
  • Estimate \(\hat \lambda = X/t\)
  • +
  • \(Var(\hat \lambda) = \lambda / t\) +\[ +\frac{\hat \lambda - \lambda}{\sqrt{\hat \lambda / t}} += +\frac{X - t \lambda}{\sqrt{X}} +\rightarrow N(0,1) +\]
  • +
  • This isn't the best interval. + +
      +
    • There are better asymptotic intervals.
    • +
    • You can get an exact CI in this case.
    • +
  • +
+ +
+ +
+ + +
+

R code

+
+
+
x <- 5
+t <- 94.32
+lambda <- x/t
+round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
+
+ +
## [1] 0.007 0.099
+
+ +
poisson.test(x, T = 94.32)$conf
+
+ +
## [1] 0.01721 0.12371
+## attr(,"conf.level")
+## [1] 0.95
+
+ +
+ +
+ + +
+

In the regression class

+
+
+
exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log))))
+
+ +
## Waiting for profiling to be done...
+
+ +
##   2.5 %  97.5 % 
+## 0.01901 0.11393
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/02_02_Asymptopia/index.md b/06_StatisticalInference/02_02_Asymptopia/index.md index 1e6b29c78..c2a83dc2c 100644 --- a/06_StatisticalInference/02_02_Asymptopia/index.md +++ b/06_StatisticalInference/02_02_Asymptopia/index.md @@ -1,263 +1,270 @@ ---- -title : A trip to Asymptopia -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - -## Asymptotics -* Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits to infinity (or some other relevant number) -* (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.) -* Asymptotics are incredibly useful for simple statistical inference and approximations -* (Not covered in this class) Asymptotics often lead to nice understanding of procedures -* Asymptotics generally give no assurances about finite sample performance - * The kinds of asymptotics that do are orders of magnitude more difficult to work with -* Asymptotics form the basis for frequency interpretation of probabilities - (the long run proportion of times an event occurs) -* To understand asymptotics, we need a very basic understanding of limits. - - ---- -## Numerical limits - -- Imagine a sequence - - - $a_1 = .9$, - - $a_2 = .99$, - - $a_3 = .999$, ... 
- -- Clearly this sequence converges to $1$ -- Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on - ---- - -## Limits of random variables - -- The problem is harder for random variables -- Consider $\bar X_n$ the sample average of the first $n$ of a collection of $iid$ observations - - - Example $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. the sample proportion of heads) - -- We say that $\bar X_n$ converges in probability to a limit if for any fixed distance the probability of $\bar X_n$ being closer (further away) than that distance from the limit converges to one (zero) - ---- - -## The Law of Large Numbers - -- Establishing that a random sequence converges to a limit is hard -- Fortunately, we have a theorem that does all the work for us, called - the **Law of Large Numbers** -- The law of large numbers states that if $X_1,\ldots X_n$ are iid from a population with mean $\mu$ and variance $\sigma^2$ then $\bar X_n$ converges in probability to $\mu$ -- (There are many variations on the LLN; we are using a particularly lazy version, my favorite kind of version) - ---- -## Law of large numbers in action - -```r -n <- 10000; means <- cumsum(rnorm(n)) / (1 : n) -plot(1 : n, means, type = "l", lwd = 2, - frame = FALSE, ylab = "cumulative means", xlab = "sample size") -abline(h = 0) -``` - -
plot of chunk unnamed-chunk-1
- ---- -## Discussion -- An estimator is **consistent** if it converges to what you want to estimate - - Consistency is neither necessary nor sufficient for one estimator to be better than another - - Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer -- The LLN basically states that the sample mean is consistent -- The sample variance and the sample standard deviation are consistent as well -- Recall also that the sample mean and the sample variance are unbiased as well -- (The sample standard deviation is biased, by the way) - ---- - -## The Central Limit Theorem - -- The **Central Limit Theorem** (CLT) is one of the most important theorems in statistics -- For our purposes, the CLT states that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases -- The CLT applies in an endless variety of settings -- Let $X_1,\ldots,X_n$ be a collection of iid random variables with mean $\mu$ and variance $\sigma^2$ -- Let $\bar X_n$ be their sample average -- Then $\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}$ has a distribution like that of a standard normal for large $n$. -- Remember the form -$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = - \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}. -$$ -- Usually, replacing the standard error by its estimated value doesn't change the CLT - ---- - -## Example - -- Simulate a standard normal random variable by rolling $n$ (six sided) -- Let $X_i$ be the outcome for die $i$ -- Then note that $\mu = E[X_i] = 3.5$ -- $Var(X_i) = 2.92$ -- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$ -- Standardized mean -$$ - \frac{\bar X_n - 3.5}{1.71/\sqrt{n}} -$$ - ---- -## Simulation of mean of $n$ dice -
plot of chunk unnamed-chunk-2
- ---- - -## Coin CLT - - - Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin -- The sample proportion, say $\hat p$, is the average of the coin flips -- $E[X_i] = p$ and $Var(X_i) = p(1-p)$ -- Standard error of the mean is $\sqrt{p(1-p)/n}$ -- Then -$$ - \frac{\hat p - p}{\sqrt{p(1-p)/n}} -$$ -will be approximately normally distributed - ---- - -
plot of chunk unnamed-chunk-3
- - ---- - -## CLT in practice - -- In practice the CLT is mostly useful as an approximation -$$ - P\left( \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq z \right) \approx \Phi(z). -$$ -- Recall $1.96$ is a good approximation to the $.975^{th}$ quantile of the standard normal -- Consider -$$ - \begin{eqnarray*} - .95 & \approx & P\left( -1.96 \leq \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq 1.96 \right)\\ \\ - & = & P\left(\bar X_n +1.96 \sigma/\sqrt{n} \geq \mu \geq \bar X_n - 1.96\sigma/\sqrt{n} \right),\\ - \end{eqnarray*} -$$ - ---- - -## Confidence intervals - -- Therefore, according to the CLT, the probability that the random interval $$\bar X_n \pm z_{1-\alpha/2}\sigma / \sqrt{n}$$ contains $\mu$ is approximately 100$(1-\alpha)$%, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution -- This is called a $100(1 - \alpha)$% **confidence interval** for $\mu$ -- We can replace the unknown $\sigma$ with $s$ - ---- -## Give a confidence interval for the average height of sons -in Galton's data - -```r -library(UsingR);data(father.son); x <- father.son$sheight -(mean(x) + c(-1, 1) * qnorm(.975) * sd(x) / sqrt(length(x))) / 12 -``` - -``` -[1] 5.710 5.738 -``` - - ---- - -## Sample proportions - -- In the event that each $X_i$ is $0$ or $1$ with common success probability $p$ then $\sigma^2 = p(1 - p)$ -- The interval takes the form -$$ - \hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} -$$ -- Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$ -- Also note that $p(1-p) \leq 1/4$ for $0 \leq p \leq 1$ -- Let $\alpha = .05$ so that $z_{1 -\alpha/2} = 1.96 \approx 2$ then -$$ - 2 \sqrt{\frac{p(1 - p)}{n}} \leq 2 \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}} -$$ -- Therefore $\hat p \pm \frac{1}{\sqrt{n}}$ is a quick CI estimate for $p$ - ---- -## Example -* Your campaign advisor told you that in a random sample of 100 likely voters, - 56 intent to vote for you. 
- * Can you relax? Do you have this race in the bag? - * Without access to a computer or calculator, how precise is this estimate? -* `1/sqrt(100)=.1` so a back of the envelope calculation gives an approximate 95% interval of `(0.46, 0.66)` - * Not enough for you to relax, better go do more campaigning! -* Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3. - -```r -round(1 / sqrt(10 ^ (1 : 6)), 3) -``` - -``` -[1] 0.316 0.100 0.032 0.010 0.003 0.001 -``` - ---- -## Poisson interval -* A nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day? -* $X \sim Poisson(\lambda t)$. -* Estimate $\hat \lambda = X/t$ -* $Var(\hat \lambda) = \lambda / t$ -$$ -\frac{\hat \lambda - \lambda}{\sqrt{\hat \lambda / t}} -= -\frac{X - t \lambda}{\sqrt{X}} -\rightarrow N(0,1) -$$ -* This isn't the best interval. - * There are better asymptotic intervals. - * You can get an exact CI in this case. - ---- -### R code - -```r -x <- 5; t <- 94.32; lambda <- x / t -round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3) -``` - -``` -[1] 0.007 0.099 -``` - -```r -poisson.test(x, T = 94.32)$conf -``` - -``` -[1] 0.01721 0.12371 -attr(,"conf.level") -[1] 0.95 -``` - - ---- -## In the regression class - -```r -exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log)))) -``` - -``` - 2.5 % 97.5 % -0.01901 0.11393 -``` - - +--- +title : A trip to Asymptopia +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Asymptotics +* Asymptotics is the term for the behavior of statistics as the sample 
size (or some other relevant quantity) limits to infinity (or some other relevant number) +* (Asymptopia is my name for the land of asymptotics, where everything works out well and there's no messes. The land of infinite data is nice that way.) +* Asymptotics are incredibly useful for simple statistical inference and approximations +* (Not covered in this class) Asymptotics often lead to nice understanding of procedures +* Asymptotics generally give no assurances about finite sample performance + * The kinds of asymptotics that do are orders of magnitude more difficult to work with +* Asymptotics form the basis for frequency interpretation of probabilities + (the long run proportion of times an event occurs) +* To understand asymptotics, we need a very basic understanding of limits. + + +--- +## Numerical limits + +- Imagine a sequence + + - $a_1 = .9$, + - $a_2 = .99$, + - $a_3 = .999$, ... + +- Clearly this sequence converges to $1$ +- Definition of a limit: For any fixed distance we can find a point in the sequence so that the sequence is closer to the limit than that distance from that point on + +--- + +## Limits of random variables + +- The problem is harder for random variables +- Consider $\bar X_n$ the sample average of the first $n$ of a collection of $iid$ observations + + - Example $\bar X_n$ could be the average of the result of $n$ coin flips (i.e. 
the sample proportion of heads) + +- We say that $\bar X_n$ converges in probability to a limit if for any fixed distance the probability of $\bar X_n$ being closer (further away) than that distance from the limit converges to one (zero) + +--- + +## The Law of Large Numbers + +- Establishing that a random sequence converges to a limit is hard +- Fortunately, we have a theorem that does all the work for us, called + the **Law of Large Numbers** +- The law of large numbers states that if $X_1,\ldots X_n$ are iid from a population with mean $\mu$ and variance $\sigma^2$ then $\bar X_n$ converges in probability to $\mu$ +- (There are many variations on the LLN; we are using a particularly lazy version, my favorite kind of version) + +--- +## Law of large numbers in action + +```r +n <- 10000 +means <- cumsum(rnorm(n))/(1:n) +plot(1:n, means, type = "l", lwd = 2, frame = FALSE, ylab = "cumulative means", + xlab = "sample size") +abline(h = 0) +``` + +![plot of chunk unnamed-chunk-1](assets/fig/unnamed-chunk-1.png) + +--- +## Discussion +- An estimator is **consistent** if it converges to what you want to estimate + - Consistency is neither necessary nor sufficient for one estimator to be better than another + - Typically, good estimators are consistent; it's not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer +- The LLN basically states that the sample mean is consistent +- The sample variance and the sample standard deviation are consistent as well +- Recall also that the sample mean and the sample variance are unbiased as well +- (The sample standard deviation is biased, by the way) + +--- + +## The Central Limit Theorem + +- The **Central Limit Theorem** (CLT) is one of the most important theorems in statistics +- For our purposes, the CLT states that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases +- The CLT applies 
in an endless variety of settings +- Let $X_1,\ldots,X_n$ be a collection of iid random variables with mean $\mu$ and variance $\sigma^2$ +- Let $\bar X_n$ be their sample average +- Then $\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}$ has a distribution like that of a standard normal for large $n$. +- Remember the form +$$\frac{\bar X_n - \mu}{\sigma / \sqrt{n}} = + \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}. +$$ +- Usually, replacing the standard error by its estimated value doesn't change the CLT + +--- + +## Example + +- Simulate a standard normal random variable by rolling $n$ (six sided) +- Let $X_i$ be the outcome for die $i$ +- Then note that $\mu = E[X_i] = 3.5$ +- $Var(X_i) = 2.92$ +- SE $\sqrt{2.92 / n} = 1.71 / \sqrt{n}$ +- Standardized mean +$$ + \frac{\bar X_n - 3.5}{1.71/\sqrt{n}} +$$ + +--- +## Simulation of mean of $n$ dice +![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) + +--- + +## Coin CLT + + - Let $X_i$ be the $0$ or $1$ result of the $i^{th}$ flip of a possibly unfair coin +- The sample proportion, say $\hat p$, is the average of the coin flips +- $E[X_i] = p$ and $Var(X_i) = p(1-p)$ +- Standard error of the mean is $\sqrt{p(1-p)/n}$ +- Then +$$ + \frac{\hat p - p}{\sqrt{p(1-p)/n}} +$$ +will be approximately normally distributed + +--- + +![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) + + +--- + +## CLT in practice + +- In practice the CLT is mostly useful as an approximation +$$ + P\left( \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq z \right) \approx \Phi(z). 
+$$ +- Recall $1.96$ is a good approximation to the $.975^{th}$ quantile of the standard normal +- Consider +$$ + \begin{eqnarray*} + .95 & \approx & P\left( -1.96 \leq \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \leq 1.96 \right)\\ \\ + & = & P\left(\bar X_n +1.96 \sigma/\sqrt{n} \geq \mu \geq \bar X_n - 1.96\sigma/\sqrt{n} \right),\\ + \end{eqnarray*} +$$ + +--- + +## Confidence intervals + +- Therefore, according to the CLT, the probability that the random interval $$\bar X_n \pm z_{1-\alpha/2}\sigma / \sqrt{n}$$ contains $\mu$ is approximately 100$(1-\alpha)$%, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution +- This is called a $100(1 - \alpha)$% **confidence interval** for $\mu$ +- We can replace the unknown $\sigma$ with $s$ + +--- +## Give a confidence interval for the average height of sons +in Galton's data + +```r +library(UsingR) +data(father.son) +x <- father.son$sheight +(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12 +``` + +``` +## [1] 5.710 5.738 +``` + + +--- + +## Sample proportions + +- In the event that each $X_i$ is $0$ or $1$ with common success probability $p$ then $\sigma^2 = p(1 - p)$ +- The interval takes the form +$$ + \hat p \pm z_{1 - \alpha/2} \sqrt{\frac{p(1 - p)}{n}} +$$ +- Replacing $p$ by $\hat p$ in the standard error results in what is called a Wald confidence interval for $p$ +- Also note that $p(1-p) \leq 1/4$ for $0 \leq p \leq 1$ +- Let $\alpha = .05$ so that $z_{1 -\alpha/2} = 1.96 \approx 2$ then +$$ + 2 \sqrt{\frac{p(1 - p)}{n}} \leq 2 \sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}} +$$ +- Therefore $\hat p \pm \frac{1}{\sqrt{n}}$ is a quick CI estimate for $p$ + +--- +## Example +* Your campaign advisor told you that in a random sample of 100 likely voters, + 56 intent to vote for you. + * Can you relax? Do you have this race in the bag? + * Without access to a computer or calculator, how precise is this estimate? 
+* `1/sqrt(100)=.1` so a back of the envelope calculation gives an approximate 95% interval of `(0.46, 0.66)` + * Not enough for you to relax, better go do more campaigning! +* Rough guidelines, 100 for 1 decimal place, 10,000 for 2, 1,000,000 for 3. + +```r +round(1/sqrt(10^(1:6)), 3) +``` + +``` +## [1] 0.316 0.100 0.032 0.010 0.003 0.001 +``` + +--- +## Poisson interval +* A nuclear pump failed 5 times out of 94.32 days, give a 95% confidence interval for the failure rate per day? +* $X \sim Poisson(\lambda t)$. +* Estimate $\hat \lambda = X/t$ +* $Var(\hat \lambda) = \lambda / t$ +$$ +\frac{\hat \lambda - \lambda}{\sqrt{\hat \lambda / t}} += +\frac{X - t \lambda}{\sqrt{X}} +\rightarrow N(0,1) +$$ +* This isn't the best interval. + * There are better asymptotic intervals. + * You can get an exact CI in this case. + +--- +### R code + +```r +x <- 5 +t <- 94.32 +lambda <- x/t +round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3) +``` + +``` +## [1] 0.007 0.099 +``` + +```r +poisson.test(x, T = 94.32)$conf +``` + +``` +## [1] 0.01721 0.12371 +## attr(,"conf.level") +## [1] 0.95 +``` + + +--- +## In the regression class + +```r +exp(confint(glm(x ~ 1 + offset(log(t)), family = poisson(link = log)))) +``` + +``` +## Waiting for profiling to be done... 
+``` + +``` +## 2.5 % 97.5 % +## 0.01901 0.11393 +``` + + diff --git a/06_StatisticalInference/02_02_Asymptopia/index.pdf b/06_StatisticalInference/02_02_Asymptopia/index.pdf index 7fcb65b90..d43b4b4a8 100644 Binary files a/06_StatisticalInference/02_02_Asymptopia/index.pdf and b/06_StatisticalInference/02_02_Asymptopia/index.pdf differ diff --git a/06_StatisticalInference/02_03_tCIs/index.Rmd b/06_StatisticalInference/02_03_tCIs/index.Rmd index a28c1c306..d2f663f19 100644 --- a/06_StatisticalInference/02_03_tCIs/index.Rmd +++ b/06_StatisticalInference/02_03_tCIs/index.Rmd @@ -1,158 +1,161 @@ ---- -title : T Confidence Intervals -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -## Confidence intervals - -- In the previous, we discussed creating a confidence interval using the CLT -- In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution -- To discuss the $t$ distribution we must discuss the Chi-squared distribution -- Throughout we use the following general procedure for creating CIs - - a. Create a **Pivot** or statistic that does not depend on the parameter of interest - - b. 
Solve the probability that the pivot lies between bounds for the parameter - ---- - -## The Chi-squared distribution - -- Suppose that $S^2$ is the sample variance from a collection of iid $N(\mu,\sigma^2)$ data; then -$$ - \frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} -$$ -which reads: follows a Chi-squared distribution with $n-1$ degrees of freedom -- The Chi-squared distribution is skewed and has support on $0$ to $\infty$ -- The mean of the Chi-squared is its degrees of freedom -- The variance of the Chi-squared distribution is twice the degrees of freedom - ---- - -## Confidence interval for the variance - -Note that if $\chi^2_{n-1, \alpha}$ is the $\alpha$ quantile of the -Chi-squared distribution then - -$$ -\begin{eqnarray*} - 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ -& = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq -\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ -\end{eqnarray*} -$$ -So that -$$ -\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] -$$ -is a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ - ---- - -## Notes about this interval - -- This interval relies heavily on the assumed normality -- Square-rooting the endpoints yields a CI for $\sigma$ - ---- -## Example -### Confidence interval for the standard deviation of sons' heights from Galton's data -```{r} -library(UsingR); data(father.son); x <- father.son$sheight -s <- sd(x); n <- length(x) -round(sqrt( (n-1) * s ^ 2 / qchisq(c(.975, .025), n - 1) ), 3) -``` - ---- - -## Gosset's $t$ distribution - -- Invented by William Gosset (under the pseudonym "Student") in 1908 -- Has thicker tails than the normal -- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger -- Is obtained as -$$ -\frac{Z}{\sqrt{\frac{\chi^2}{df}}} -$$ -where $Z$ and $\chi^2$ are independent standard normals and -Chi-squared 
distributions respectively - ---- - -## Result - -- Suppose that $(X_1,\ldots,X_n)$ are iid $N(\mu,\sigma^2)$, then: - a. $\frac{\bar X - \mu}{\sigma / \sqrt{n}}$ is standard normal - b. $\sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma$ is the square root of a Chi-squared divided by its df - -- Therefore -$$ -\frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} -= \frac{\bar X - \mu}{S/\sqrt{n}} -$$ - follows Gosset's $t$ distribution with $n-1$ degrees of freedom - ---- - -## Confidence intervals for the mean - -- Notice that the $t$ statistic is a pivot, therefore we use it to create a confidence interval for $\mu$ -- Let $t_{df,\alpha}$ be the $\alpha^{th}$ quantile of the t distribution with $df$ degrees of freedom -$$ - \begin{eqnarray*} -& & 1 - \alpha \\ -& = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ -& = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu - \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) - \end{eqnarray*} -$$ -- Interval is $\bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n}$ - ---- - -## Note's about the $t$ interval - -- The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption -- It works well whenever the distribution of the data is roughly symmetric and mound shaped -- Paired observations are often analyzed using the $t$ interval by taking differences -- For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded -- For skewed distributions, the spirit of the $t$ interval assumptions are violated -- Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean -- In this case, consider taking logs or using a different summary like the median -- For highly discrete data, like binary, other intervals are available - ---- - -## Sleep data - -In R typing `data(sleep)` brings up the 
sleep data originally -analyzed in Gosset's Biometrika paper, which shows the increase in -hours for 10 patients on two soporific drugs. R treats the data as two -groups rather than paired. - ---- -## The data -```{r} -data(sleep) -head(sleep) -``` - ---- -```{r} -g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20] -difference <- g2 - g1 -mn <- mean(difference); s <- sd(difference); n <- 10 -mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n) -t.test(difference)$conf.int -``` - +--- +title : T Confidence Intervals +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Confidence intervals + +- In the previous, we discussed creating a confidence interval using the CLT +- In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution +- To discuss the $t$ distribution we must discuss the Chi-squared distribution +- Throughout we use the following general procedure for creating CIs + + a. Create a **Pivot** or statistic that does not depend on the parameter of interest + + b. 
Solve the probability that the pivot lies between bounds for the parameter + +--- + +## The Chi-squared distribution + +- Suppose that $S^2$ is the sample variance from a collection of iid $N(\mu,\sigma^2)$ data; then +$$ + \frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} +$$ +which reads: follows a Chi-squared distribution with $n-1$ degrees of freedom +- The Chi-squared distribution is skewed and has support on $0$ to $\infty$ +- The mean of the Chi-squared is its degrees of freedom +- The variance of the Chi-squared distribution is twice the degrees of freedom + +--- + +## Confidence interval for the variance + +Note that if $\chi^2_{n-1, \alpha}$ is the $\alpha$ quantile of the +Chi-squared distribution then + +$$ +\begin{eqnarray*} + 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ +& = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq +\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ +\end{eqnarray*} +$$ +So that +$$ +\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] +$$ +is a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ + +--- + +## Notes about this interval + +- This interval relies heavily on the assumed normality +- Square-rooting the endpoints yields a CI for $\sigma$ + +--- +## Example +### Confidence interval for the standard deviation of sons' heights from Galton's data +```{r} +library(UsingR); data(father.son); x <- father.son$sheight +s <- sd(x); n <- length(x) +round(sqrt( (n-1) * s ^ 2 / qchisq(c(.975, .025), n - 1) ), 3) +``` + +--- + +## Gosset's $t$ distribution + +- Invented by William Gosset (under the pseudonym "Student") in 1908 +- Has thicker tails than the normal +- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger +- Is obtained as +$$ +\frac{Z}{\sqrt{\frac{\chi^2}{df}}} +$$ +where $Z$ and $\chi^2$ are independent standard normals and +Chi-squared 
distributions respectively + +--- + +## Result + +- Suppose that $(X_1,\ldots,X_n)$ are iid $N(\mu,\sigma^2)$, then: + a. $\frac{\bar X - \mu}{\sigma / \sqrt{n}}$ is standard normal + b. $\sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma$ is the square root of a Chi-squared divided by its df + +- Therefore +$$ +\frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} += \frac{\bar X - \mu}{S/\sqrt{n}} +$$ + follows Gosset's $t$ distribution with $n-1$ degrees of freedom + +--- + +## Confidence intervals for the mean + +- Notice that the $t$ statistic is a pivot, therefore we use it to create a confidence interval for $\mu$ +- Let $t_{df,\alpha}$ be the $\alpha^{th}$ quantile of the t distribution with $df$ degrees of freedom +$$ + \begin{eqnarray*} +& & 1 - \alpha \\ +& = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ +& = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu + \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) + \end{eqnarray*} +$$ +- Interval is $\bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n}$ + +--- + +## Note's about the $t$ interval + +- The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption +- It works well whenever the distribution of the data is roughly symmetric and mound shaped +- Paired observations are often analyzed using the $t$ interval by taking differences +- For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded +- For skewed distributions, the spirit of the $t$ interval assumptions are violated +- Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean +- In this case, consider taking logs or using a different summary like the median +- For highly discrete data, like binary, other intervals are available + +--- + +## Sleep data + +In R typing `data(sleep)` brings up the 
sleep data originally +analyzed in Gosset's Biometrika paper, which shows the increase in +hours for 10 patients on two soporific drugs. R treats the data as two +groups rather than paired. + +--- +## The data +```{r} +data(sleep) +head(sleep) +``` + +--- +## Results +```{r, echo=TRUE} +g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20] +difference <- g2 - g1 +mn <- mean(difference); s <- sd(difference); n <- 10 +mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n) +t.test(difference)$conf.int +``` + + + diff --git a/06_StatisticalInference/02_03_tCIs/index.html b/06_StatisticalInference/02_03_tCIs/index.html index 8a5b3d4eb..8b2b6c9d9 100644 --- a/06_StatisticalInference/02_03_tCIs/index.html +++ b/06_StatisticalInference/02_03_tCIs/index.html @@ -1,399 +1,402 @@ - - - - T Confidence Intervals - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

T Confidence Intervals

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Confidence intervals

-
-
-
    -
  • In the previous lecture, we discussed creating a confidence interval using the CLT
  • -
  • In this lecture, we discuss some methods for small samples, notably Gosset's \(t\) distribution
  • -
  • To discuss the \(t\) distribution we must discuss the Chi-squared distribution
  • -
  • Throughout we use the following general procedure for creating CIs

    - -

    a. Create a Pivot, a function of the data and the parameter whose distribution does not depend on the parameter of interest

    - -

    b. Solve the probability statement that the pivot lies between bounds for the parameter

  • -
- -
- -
- - -
-

The Chi-squared distribution

-
-
-
    -
  • Suppose that \(S^2\) is the sample variance from a collection of iid \(N(\mu,\sigma^2)\) data; then -\[ -\frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} -\] -which reads: follows a Chi-squared distribution with \(n-1\) degrees of freedom
  • -
  • The Chi-squared distribution is skewed and has support on \(0\) to \(\infty\)
  • -
  • The mean of the Chi-squared is its degrees of freedom
  • -
  • The variance of the Chi-squared distribution is twice the degrees of freedom
  • -
- -
- -
- - -
-

Confidence interval for the variance

-
-
-

Note that if \(\chi^2_{n-1, \alpha}\) is the \(\alpha\) quantile of the -Chi-squared distribution then

- -

\[ -\begin{eqnarray*} - 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ -& = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq -\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ -\end{eqnarray*} -\] -So that -\[ -\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] -\] -is a \(100(1-\alpha)\%\) confidence interval for \(\sigma^2\)

- -
- -
- - -
-

Notes about this interval

-
-
-
    -
  • This interval relies heavily on the assumed normality
  • -
  • Square-rooting the endpoints yields a CI for \(\sigma\)
  • -
- -
- -
- - -
-

Example

-
-
-

Confidence interval for the standard deviation of sons' heights from Galton's data

- -
library(UsingR)
-data(father.son)
-x <- father.son$sheight
-s <- sd(x)
-n <- length(x)
-round(sqrt((n - 1) * s^2/qchisq(c(0.975, 0.025), n - 1)), 3)
-
- -
## [1] 2.701 2.939
-
- -
- -
- - -
-

Gosset's \(t\) distribution

-
-
-
    -
  • Invented by William Gosset (under the pseudonym "Student") in 1908
  • -
  • Has thicker tails than the normal
  • -
  • Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger
  • -
  • Is obtained as -\[ -\frac{Z}{\sqrt{\frac{\chi^2}{df}}} -\] -where \(Z\) and \(\chi^2\) are independent standard normal and -Chi-squared random variables, respectively
  • -
- -
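The construction above can be checked by simulation. This is a minimal sketch that is not part of the original slides; the degrees of freedom `df`, the number of simulations `nosim`, and the seed are arbitrary illustrative choices.

```r
# Build t variates from their definition, Z / sqrt(chisq / df), and compare
# simulated quantiles with those reported by qt().
set.seed(2)                      # arbitrary seed, for reproducibility
df <- 5; nosim <- 100000
z <- rnorm(nosim)                # standard normal numerator
chisq <- rchisq(nosim, df)       # independent Chi-squared denominator
tsim <- z / sqrt(chisq / df)
probs <- c(0.9, 0.95, 0.975)
round(rbind(simulated = quantile(tsim, probs), qt = qt(probs, df)), 2)
# Thicker tails than the normal: the upper t quantile exceeds the normal one
qt(0.975, df) > qnorm(0.975)
```

The simulated and theoretical rows should agree up to simulation error, and the final comparison should return `TRUE`.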
- -
- - -
-

Result

-
-
-
    -
  • Suppose that \((X_1,\ldots,X_n)\) are iid \(N(\mu,\sigma^2)\), then: -a. \(\frac{\bar X - \mu}{\sigma / \sqrt{n}}\) is standard normal -b. \(\sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma\) is the square root of a Chi-squared divided by its df

  • -
  • Therefore -\[ -\frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} -= \frac{\bar X - \mu}{S/\sqrt{n}} -\] -follows Gosset's \(t\) distribution with \(n-1\) degrees of freedom

  • -
- -
- -
- - -
-

Confidence intervals for the mean

-
-
-
    -
  • Notice that the \(t\) statistic is a pivot, therefore we use it to create a confidence interval for \(\mu\)
  • -
  • Let \(t_{df,\alpha}\) be the \(\alpha^{th}\) quantile of the t distribution with \(df\) degrees of freedom -\[ -\begin{eqnarray*} -& & 1 - \alpha \\ -& = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ -& = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu - \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) -\end{eqnarray*} -\]
  • -
  • Interval is \(\bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n}\)
  • -
- -
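One way to see what the $1 - \alpha$ statement means is to estimate the coverage of the interval by simulation. The sketch below is an illustration only, not from the original slides; the values of `n`, `mu`, `sigma`, `nosim` and the seed are assumptions chosen for the example.

```r
# Estimate how often the 95% t interval covers the true mean
# when the data are small iid normal samples.
set.seed(3)                      # arbitrary seed, for reproducibility
n <- 10; mu <- 5; sigma <- 2; nosim <- 10000
covered <- replicate(nosim, {
    x <- rnorm(n, mean = mu, sd = sigma)
    ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * sd(x) / sqrt(n)
    ci[1] <= mu && mu <= ci[2]   # TRUE when the interval contains mu
})
mean(covered)                    # should be close to the nominal 0.95
```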
- -
- - -
-

Notes about the \(t\) interval

-
-
-
    -
  • The \(t\) interval technically assumes that the data are iid normal, though it is robust to this assumption
  • -
  • It works well whenever the distribution of the data is roughly symmetric and mound shaped
  • -
  • Paired observations are often analyzed using the \(t\) interval by taking differences
  • -
  • For large degrees of freedom, \(t\) quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded
  • -
  • For skewed distributions, the spirit of the \(t\) interval assumptions is violated
  • -
  • Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
  • -
  • In this case, consider taking logs or using a different summary like the median
  • -
  • For highly discrete data, like binary, other intervals are available
  • -
- -
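The claim about large degrees of freedom can be made concrete by tabulating quantiles. This is a small sketch added for illustration; the particular degrees of freedom in `dfs` are arbitrary.

```r
# 0.975 quantiles of the t distribution for increasing degrees of freedom,
# alongside the standard normal quantile they approach from above.
dfs <- c(2, 5, 10, 30, 100, 1000)
tq <- qt(0.975, dfs)
round(rbind(df = dfs, t.quantile = tq), 3)
round(qnorm(0.975), 3)
```

The $t$ quantiles decrease toward the normal quantile as the degrees of freedom grow, so the $t$ interval is always at least as wide as the CLT interval built from the same $S$.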
- -
- - -
-

Sleep data

-
-
-

In R typing data(sleep) brings up the sleep data originally -analyzed in Gosset's Biometrika paper, which shows the increase in -hours for 10 patients on two soporific drugs. R treats the data as two -groups rather than paired.

- -
- -
- - -
-

The data

-
-
-
data(sleep)
-head(sleep)
-
- -
##   extra group ID
-## 1   0.7     1  1
-## 2  -1.6     1  2
-## 3  -0.2     1  3
-## 4  -1.2     1  4
-## 5  -0.1     1  5
-## 6   3.4     1  6
-
- -
- -
- - -
-
g1 <- sleep$extra[1:10]
-g2 <- sleep$extra[11:20]
-difference <- g2 - g1
-mn <- mean(difference)
-s <- sd(difference)
-n <- 10
-mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n)
-
- -
## [1] 0.7001 2.4599
-
- -
t.test(difference)$conf.int
-
- -
## [1] 0.7001 2.4599
-## attr(,"conf.level")
-## [1] 0.95
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + T Confidence Intervals + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

T Confidence Intervals

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Confidence intervals

+
+
+
    +
  • In the previous lecture, we discussed creating a confidence interval using the CLT
  • +
  • In this lecture, we discuss some methods for small samples, notably Gosset's \(t\) distribution
  • +
  • To discuss the \(t\) distribution we must discuss the Chi-squared distribution
  • +
  • Throughout we use the following general procedure for creating CIs

    + +

    a. Create a Pivot, a statistic whose distribution does not depend on the parameter of interest

    + +

    b. Solve the probability statement that the pivot lies between bounds for the parameter

  • +
+ +
+ +
+ + +
+

The Chi-squared distribution

+
+
+
    +
  • Suppose that \(S^2\) is the sample variance from a collection of iid \(N(\mu,\sigma^2)\) data; then +\[ +\frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} +\] +which reads: follows a Chi-squared distribution with \(n-1\) degrees of freedom
  • +
  • The Chi-squared distribution is skewed and has support on \(0\) to \(\infty\)
  • +
  • The mean of the Chi-squared is its degrees of freedom
  • +
  • The variance of the Chi-squared distribution is twice the degrees of freedom
  • +
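The properties listed above can be checked with a small simulation. This is only a sketch: the sample size, mean, standard deviation, and number of simulations (`n`, `mu`, `sigma`, `nosim`) are arbitrary illustrative choices, not values from the lecture.

```r
# Simulation sketch: check that (n - 1) * S^2 / sigma^2 behaves like a
# Chi-squared variable with n - 1 degrees of freedom.
# n, mu, sigma and nosim are arbitrary illustrative choices.
set.seed(1)
n <- 10; mu <- 5; sigma <- 2; nosim <- 10000
# each row is one simulated sample of n iid normal observations
x <- matrix(rnorm(nosim * n, mean = mu, sd = sigma), nrow = nosim)
s2 <- apply(x, 1, var)            # sample variance of each row
pivot <- (n - 1) * s2 / sigma^2   # the Chi-squared pivot
sim_mean <- mean(pivot)           # theory: n - 1
sim_var <- var(pivot)             # theory: 2 * (n - 1)
c(mean = sim_mean, variance = sim_var)
```

The simulated mean and variance should land near the degrees of freedom and twice the degrees of freedom, respectively.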
+ +
+ +
+ + +
+

Confidence interval for the variance

+
+
+

Note that if \(\chi^2_{n-1, \alpha}\) is the \(\alpha\) quantile of the +Chi-squared distribution then

+ +

\[ +\begin{eqnarray*} + 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ +& = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq +\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ +\end{eqnarray*} +\] +So that +\[ +\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] +\] +is a \(100(1-\alpha)\%\) confidence interval for \(\sigma^2\)

+ +
+ +
+ + +
+

Notes about this interval

+
+
+
    +
  • This interval relies heavily on the assumed normality
  • +
  • Square-rooting the endpoints yields a CI for \(\sigma\)
  • +
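To see how much the interval leans on normality, one can compare its coverage for normal data with its coverage for skewed data. The sketch below uses a hypothetical helper `covers()` and exponential data as the skewed example; both are assumptions made for illustration rather than part of the lecture.

```r
# Coverage sketch for the Chi-squared variance interval.
# covers() is a hypothetical helper; n, nosim and the exponential
# alternative are illustrative assumptions.
set.seed(2)
n <- 20; nosim <- 5000; alpha <- 0.05
# TRUE when the interval computed from x contains the true variance
covers <- function(x, true_var) {
  ends <- (n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), n - 1)
  ends[1] <= true_var && true_var <= ends[2]
}
# standard normal data: variance 1, assumptions hold
cover_normal <- mean(replicate(nosim, covers(rnorm(n), 1)))
# exponential(1) data: variance 1, but heavily skewed
cover_exp <- mean(replicate(nosim, covers(rexp(n), 1)))
c(normal = cover_normal, exponential = cover_exp)
```

Under these assumptions the normal coverage should sit near the nominal 95%, while the exponential coverage typically falls well below it.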
+ +
+ +
+ + +
+

Example

+
+
+

Confidence interval for the standard deviation of sons' heights from Galton's data

+ +
library(UsingR)
+data(father.son)
+x <- father.son$sheight
+s <- sd(x)
+n <- length(x)
+round(sqrt((n - 1) * s^2/qchisq(c(0.975, 0.025), n - 1)), 3)
+
+ +
## [1] 2.701 2.939
+
+ +
+ +
+ + +
+

Gosset's \(t\) distribution

+
+
+
    +
  • Invented by William Gosset (under the pseudonym "Student") in 1908
  • +
  • Has thicker tails than the normal
  • +
  • Is indexed by its degrees of freedom; gets more like a standard normal as df gets larger
  • +
  • Is obtained as +\[ +\frac{Z}{\sqrt{\frac{\chi^2}{df}}} +\] +where \(Z\) and \(\chi^2\) are independent standard normal and +Chi-squared random variables, respectively
  • +
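The thicker tails and the convergence to the normal can be seen by comparing upper quantiles. The degrees of freedom in `dfs` below are arbitrary values chosen for illustration.

```r
# Compare the 97.5th percentile of the t distribution with the
# standard normal one for increasing degrees of freedom.
# The values in dfs are arbitrary illustrative choices.
dfs <- c(2, 5, 10, 30, 100, 1000)
t_q <- qt(0.975, dfs)   # t quantiles, one per df
z_q <- qnorm(0.975)     # standard normal quantile
round(rbind(df = dfs, t_quantile = t_q, normal_quantile = z_q), 3)
```

Each t quantile exceeds the normal quantile, and the gap shrinks as the degrees of freedom increase.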
+ +
+ +
+ + +
+

Result

+
+
+
    +
  • Suppose that \((X_1,\ldots,X_n)\) are iid \(N(\mu,\sigma^2)\), then: +a. \(\frac{\bar X - \mu}{\sigma / \sqrt{n}}\) is standard normal +b. \(\sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma\) is the square root of a Chi-squared divided by its df

  • +
  • Therefore +\[ +\frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} += \frac{\bar X - \mu}{S/\sqrt{n}} +\] +follows Gosset's \(t\) distribution with \(n-1\) degrees of freedom

  • +
+ +
+ +
+ + +
+

Confidence intervals for the mean

+
+
+
    +
  • Notice that the \(t\) statistic is a pivot; therefore, we use it to create a confidence interval for \(\mu\)
  • +
  • Let \(t_{df,\alpha}\) be the \(\alpha^{th}\) quantile of the t distribution with \(df\) degrees of freedom +\[ +\begin{eqnarray*} +& & 1 - \alpha \\ +& = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ +& = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu + \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) +\end{eqnarray*} +\]
  • +
  • Interval is \(\bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n}\)
  • +
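A quick simulation illustrates why the \(t\) quantile matters for small samples. This is a sketch under assumed settings: `n`, `mu`, `sigma`, and `nosim` are illustrative values, and the "z interval" here simply swaps in the normal quantile while still using \(S\).

```r
# Coverage sketch: t interval versus the same interval built with a
# normal quantile, for a small sample of normal data.
# n, mu, sigma and nosim are illustrative assumptions.
set.seed(3)
n <- 5; mu <- 0; sigma <- 1; nosim <- 10000
x <- matrix(rnorm(nosim * n, mu, sigma), nrow = nosim)
xbar <- rowMeans(x)                  # sample mean of each row
se <- apply(x, 1, sd) / sqrt(n)      # estimated standard error
# proportion of intervals that contain the true mean
t_cover <- mean(abs(xbar - mu) <= qt(0.975, n - 1) * se)
z_cover <- mean(abs(xbar - mu) <= qnorm(0.975) * se)
c(t_interval = t_cover, z_interval = z_cover)
```

With only five observations, the interval using the normal quantile should undercover noticeably, while the \(t\) interval should stay near 95%.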
+ +
+ +
+ + +
+

Notes about the \(t\) interval

+
+
+
    +
  • The \(t\) interval technically assumes that the data are iid normal, though it is robust to this assumption
  • +
  • It works well whenever the distribution of the data is roughly symmetric and mound shaped
  • +
  • Paired observations are often analyzed using the \(t\) interval by taking differences
  • +
  • As the degrees of freedom grow, \(t\) quantiles converge to standard normal quantiles; therefore this interval converges to the interval that the CLT yielded
  • +
  • For skewed distributions, the spirit of the \(t\) interval assumptions is violated
  • +
  • Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
  • +
  • In this case, consider taking logs or using a different summary like the median
  • +
  • For highly discrete data, like binary, other intervals are available
  • +
+ +
+ +
+ + +
+

Sleep data

+
+
+

In R, typing data(sleep) loads the sleep data originally +analyzed in Gosset's Biometrika paper, which shows the increase in +hours of sleep for 10 patients on two soporific drugs. R treats the data as two +groups rather than paired.

+ +
+ +
+ + +
+

The data

+
+
+
data(sleep)
+head(sleep)
+
+ +
##   extra group ID
+## 1   0.7     1  1
+## 2  -1.6     1  2
+## 3  -0.2     1  3
+## 4  -1.2     1  4
+## 5  -0.1     1  5
+## 6   3.4     1  6
+
+ +
+ +
+ + +
+

Results

+
+
+
g1 <- sleep$extra[1:10]
+g2 <- sleep$extra[11:20]
+difference <- g2 - g1
+mn <- mean(difference)
+s <- sd(difference)
+n <- 10
+mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n)
+
+ +
## [1] 0.7001 2.4599
+
+ +
t.test(difference)$conf.int
+
+ +
## [1] 0.7001 2.4599
+## attr(,"conf.level")
+## [1] 0.95
+
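As a cross-check on the differencing approach, the same interval can be requested by passing both vectors to `t.test` with `paired = TRUE`. The unpaired, equal-variance interval is included only as a contrast; it is an illustrative addition, not part of the original analysis.

```r
# Paired interval computed two ways, plus an unpaired interval for contrast.
data(sleep)
g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
# paired analysis requested directly
paired_ci <- t.test(g2, g1, paired = TRUE)$conf.int
# one-sample analysis of the within-patient differences
diff_ci <- t.test(g2 - g1)$conf.int
# treating the groups as independent (illustrative contrast only)
unpaired_ci <- t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf.int
rbind(paired = paired_ci, differences = diff_ci, unpaired = unpaired_ci)
```

The first two rows should agree, and ignoring the pairing should give a wider interval for these data.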
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/02_03_tCIs/index.md b/06_StatisticalInference/02_03_tCIs/index.md index 5f8bd59ec..fbcd970b6 100644 --- a/06_StatisticalInference/02_03_tCIs/index.md +++ b/06_StatisticalInference/02_03_tCIs/index.md @@ -1,197 +1,200 @@ ---- -title : T Confidence Intervals -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -## Confidence intervals - -- In the previous, we discussed creating a confidence interval using the CLT -- In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution -- To discuss the $t$ distribution we must discuss the Chi-squared distribution -- Throughout we use the following general procedure for creating CIs - - a. Create a **Pivot** or statistic that does not depend on the parameter of interest - - b. 
Solve the probability that the pivot lies between bounds for the parameter - ---- - -## The Chi-squared distribution - -- Suppose that $S^2$ is the sample variance from a collection of iid $N(\mu,\sigma^2)$ data; then -$$ - \frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} -$$ -which reads: follows a Chi-squared distribution with $n-1$ degrees of freedom -- The Chi-squared distribution is skewed and has support on $0$ to $\infty$ -- The mean of the Chi-squared is its degrees of freedom -- The variance of the Chi-squared distribution is twice the degrees of freedom - ---- - -## Confidence interval for the variance - -Note that if $\chi^2_{n-1, \alpha}$ is the $\alpha$ quantile of the -Chi-squared distribution then - -$$ -\begin{eqnarray*} - 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ -& = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq -\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ -\end{eqnarray*} -$$ -So that -$$ -\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] -$$ -is a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ - ---- - -## Notes about this interval - -- This interval relies heavily on the assumed normality -- Square-rooting the endpoints yields a CI for $\sigma$ - ---- -## Example -### Confidence interval for the standard deviation of sons' heights from Galton's data - -```r -library(UsingR) -data(father.son) -x <- father.son$sheight -s <- sd(x) -n <- length(x) -round(sqrt((n - 1) * s^2/qchisq(c(0.975, 0.025), n - 1)), 3) -``` - -``` -## [1] 2.701 2.939 -``` - - ---- - -## Gosset's $t$ distribution - -- Invented by William Gosset (under the pseudonym "Student") in 1908 -- Has thicker tails than the normal -- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger -- Is obtained as -$$ -\frac{Z}{\sqrt{\frac{\chi^2}{df}}} -$$ -where $Z$ and $\chi^2$ are independent standard 
normals and -Chi-squared distributions respectively - ---- - -## Result - -- Suppose that $(X_1,\ldots,X_n)$ are iid $N(\mu,\sigma^2)$, then: - a. $\frac{\bar X - \mu}{\sigma / \sqrt{n}}$ is standard normal - b. $\sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma$ is the square root of a Chi-squared divided by its df - -- Therefore -$$ -\frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} -= \frac{\bar X - \mu}{S/\sqrt{n}} -$$ - follows Gosset's $t$ distribution with $n-1$ degrees of freedom - ---- - -## Confidence intervals for the mean - -- Notice that the $t$ statistic is a pivot, therefore we use it to create a confidence interval for $\mu$ -- Let $t_{df,\alpha}$ be the $\alpha^{th}$ quantile of the t distribution with $df$ degrees of freedom -$$ - \begin{eqnarray*} -& & 1 - \alpha \\ -& = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ -& = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu - \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) - \end{eqnarray*} -$$ -- Interval is $\bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n}$ - ---- - -## Note's about the $t$ interval - -- The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption -- It works well whenever the distribution of the data is roughly symmetric and mound shaped -- Paired observations are often analyzed using the $t$ interval by taking differences -- For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded -- For skewed distributions, the spirit of the $t$ interval assumptions are violated -- Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean -- In this case, consider taking logs or using a different summary like the median -- For highly discrete data, like binary, other intervals are available - ---- - -## Sleep data - -In R typing 
`data(sleep)` brings up the sleep data originally -analyzed in Gosset's Biometrika paper, which shows the increase in -hours for 10 patients on two soporific drugs. R treats the data as two -groups rather than paired. - ---- -## The data - -```r -data(sleep) -head(sleep) -``` - -``` -## extra group ID -## 1 0.7 1 1 -## 2 -1.6 1 2 -## 3 -0.2 1 3 -## 4 -1.2 1 4 -## 5 -0.1 1 5 -## 6 3.4 1 6 -``` - - ---- - -```r -g1 <- sleep$extra[1:10] -g2 <- sleep$extra[11:20] -difference <- g2 - g1 -mn <- mean(difference) -s <- sd(difference) -n <- 10 -mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n) -``` - -``` -## [1] 0.7001 2.4599 -``` - -```r -t.test(difference)$conf.int -``` - -``` -## [1] 0.7001 2.4599 -## attr(,"conf.level") -## [1] 0.95 -``` - - +--- +title : T Confidence Intervals +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Confidence intervals + +- In the previous, we discussed creating a confidence interval using the CLT +- In this lecture, we discuss some methods for small samples, notably Gosset's $t$ distribution +- To discuss the $t$ distribution we must discuss the Chi-squared distribution +- Throughout we use the following general procedure for creating CIs + + a. Create a **Pivot** or statistic that does not depend on the parameter of interest + + b. 
Solve the probability that the pivot lies between bounds for the parameter + +--- + +## The Chi-squared distribution + +- Suppose that $S^2$ is the sample variance from a collection of iid $N(\mu,\sigma^2)$ data; then +$$ + \frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} +$$ +which reads: follows a Chi-squared distribution with $n-1$ degrees of freedom +- The Chi-squared distribution is skewed and has support on $0$ to $\infty$ +- The mean of the Chi-squared is its degrees of freedom +- The variance of the Chi-squared distribution is twice the degrees of freedom + +--- + +## Confidence interval for the variance + +Note that if $\chi^2_{n-1, \alpha}$ is the $\alpha$ quantile of the +Chi-squared distribution then + +$$ +\begin{eqnarray*} + 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ +& = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq +\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ +\end{eqnarray*} +$$ +So that +$$ +\left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] +$$ +is a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ + +--- + +## Notes about this interval + +- This interval relies heavily on the assumed normality +- Square-rooting the endpoints yields a CI for $\sigma$ + +--- +## Example +### Confidence interval for the standard deviation of sons' heights from Galton's data + +```r +library(UsingR) +data(father.son) +x <- father.son$sheight +s <- sd(x) +n <- length(x) +round(sqrt((n - 1) * s^2/qchisq(c(0.975, 0.025), n - 1)), 3) +``` + +``` +## [1] 2.701 2.939 +``` + + +--- + +## Gosset's $t$ distribution + +- Invented by William Gosset (under the pseudonym "Student") in 1908 +- Has thicker tails than the normal +- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger +- Is obtained as +$$ +\frac{Z}{\sqrt{\frac{\chi^2}{df}}} +$$ +where $Z$ and $\chi^2$ are independent standard 
normals and +Chi-squared distributions respectively + +--- + +## Result + +- Suppose that $(X_1,\ldots,X_n)$ are iid $N(\mu,\sigma^2)$, then: + a. $\frac{\bar X - \mu}{\sigma / \sqrt{n}}$ is standard normal + b. $\sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma$ is the square root of a Chi-squared divided by its df + +- Therefore +$$ +\frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} += \frac{\bar X - \mu}{S/\sqrt{n}} +$$ + follows Gosset's $t$ distribution with $n-1$ degrees of freedom + +--- + +## Confidence intervals for the mean + +- Notice that the $t$ statistic is a pivot, therefore we use it to create a confidence interval for $\mu$ +- Let $t_{df,\alpha}$ be the $\alpha^{th}$ quantile of the t distribution with $df$ degrees of freedom +$$ + \begin{eqnarray*} +& & 1 - \alpha \\ +& = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ +& = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu + \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) + \end{eqnarray*} +$$ +- Interval is $\bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n}$ + +--- + +## Note's about the $t$ interval + +- The $t$ interval technically assumes that the data are iid normal, though it is robust to this assumption +- It works well whenever the distribution of the data is roughly symmetric and mound shaped +- Paired observations are often analyzed using the $t$ interval by taking differences +- For large degrees of freedom, $t$ quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded +- For skewed distributions, the spirit of the $t$ interval assumptions are violated +- Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean +- In this case, consider taking logs or using a different summary like the median +- For highly discrete data, like binary, other intervals are available + +--- + +## Sleep data + +In R typing 
`data(sleep)` brings up the sleep data originally +analyzed in Gosset's Biometrika paper, which shows the increase in +hours for 10 patients on two soporific drugs. R treats the data as two +groups rather than paired. + +--- +## The data + +```r +data(sleep) +head(sleep) +``` + +``` +## extra group ID +## 1 0.7 1 1 +## 2 -1.6 1 2 +## 3 -0.2 1 3 +## 4 -1.2 1 4 +## 5 -0.1 1 5 +## 6 3.4 1 6 +``` + + +--- +## Results + +```r +g1 <- sleep$extra[1:10] +g2 <- sleep$extra[11:20] +difference <- g2 - g1 +mn <- mean(difference) +s <- sd(difference) +n <- 10 +mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n) +``` + +``` +## [1] 0.7001 2.4599 +``` + +```r +t.test(difference)$conf.int +``` + +``` +## [1] 0.7001 2.4599 +## attr(,"conf.level") +## [1] 0.95 +``` + + + + diff --git a/06_StatisticalInference/02_03_tCIs/index.pdf b/06_StatisticalInference/02_03_tCIs/index.pdf index 855c3502b..19947b5e5 100644 Binary files a/06_StatisticalInference/02_03_tCIs/index.pdf and b/06_StatisticalInference/02_03_tCIs/index.pdf differ diff --git a/06_StatisticalInference/02_04_Likeklihood/assets/fig/unnamed-chunk-1.png b/06_StatisticalInference/02_04_Likeklihood/assets/fig/unnamed-chunk-1.png new file mode 100644 index 000000000..eea114c31 Binary files /dev/null and b/06_StatisticalInference/02_04_Likeklihood/assets/fig/unnamed-chunk-1.png differ diff --git a/06_StatisticalInference/02_04_Likeklihood/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/02_04_Likeklihood/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..d00011e8a Binary files /dev/null and b/06_StatisticalInference/02_04_Likeklihood/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/02_04_Likeklihood/index.Rmd b/06_StatisticalInference/02_04_Likeklihood/index.Rmd index 56f723f8d..d62b9a3c8 100644 --- a/06_StatisticalInference/02_04_Likeklihood/index.Rmd +++ b/06_StatisticalInference/02_04_Likeklihood/index.Rmd @@ -1,144 +1,128 @@ ---- -title : Likelihood -subtitle : Statistical Inference 
-author : Brian Caffo, Roger Peng, Jeff Leek -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## Likelihood - -- A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution -- The **likelihood** of a collection of data is the joint density evaluated as a function of the parameters with the data fixed -- Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter - ---- - -## Likelihood - -Given a statistical probability mass function or density, say $f(x, \theta)$, where $\theta$ is an unknown parameter, the **likelihood** is $f$ viewed as a function of $\theta$ for a fixed, observed value of $x$. - ---- - -## Interpretations of likelihoods - -The likelihood has the following properties: - -1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another. -2. 
Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood. -3. If $\{X_i\}$ are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the $X_i$ is simply the product of the individual likelihoods. - ---- - -## Example - -- Suppose that we flip a coin with success probability $\theta$ -- Recall that the mass function for $x$ - $$ - f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1]. - $$ - where $x$ is either $0$ (Tails) or $1$ (Heads) -- Suppose that the result is a head -- The likelihood is - $$ - {\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1]. - $$ -- Therefore, ${\cal L}(.5, 1) / {\cal L}(.25, 1) = 2$, -- There is twice as much evidence supporting the hypothesis that $\theta = .5$ to the hypothesis that $\theta = .25$ - ---- - -## Example continued - -- Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1 -- The likelihood is: -$$ - \begin{eqnarray*} - {\cal L}(\theta, 1,0,1,1) & = & \theta^1 (1 - \theta)^{1 - 1} - \theta^0 (1 - \theta)^{1 - 0} \\ -& \times & \theta^1 (1 - \theta)^{1 - 1} - \theta^1 (1 - \theta)^{1 - 1}\\ -& = & \theta^3(1 - \theta)^1 - \end{eqnarray*} -$$ -- This likelihood only depends on the total number of heads and the total number of tails; we might write ${\cal L}(\theta, 1, 3)$ for shorthand -- Now consider ${\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) = 5.33$ -- There is over five times as much evidence supporting the hypothesis that $\theta = .5$ over that $\theta = .25$ - ---- - -## Plotting likelihoods - -- Generally, we want to consider all the values of $\theta$ between 0 and 1 -- A **likelihood plot** displays $\theta$ by ${\cal L}(\theta,x)$ -- Because the likelihood measures *relative evidence*, dividing the curve by its maximum value (or any other 
value for that matter) does not change its interpretation - ---- -```{r, fig.height=4.5, fig.width=4.5} -pvals <- seq(0, 1, length = 1000) -plot(pvals, dbinom(3, 4, pvals) / dbinom(3, 4, 3/4), type = "l", frame = FALSE, lwd = 3, xlab = "p", ylab = "likelihood / max likelihood") -``` - - ---- - -## Maximum likelihood - -- The value of $\theta$ where the curve reaches its maximum has a special meaning -- It is the value of $\theta$ that is most well supported by the data -- This point is called the **maximum likelihood estimate** (or MLE) of $\theta$ - $$ - MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x). - $$ -- Another interpretation of the MLE is that it is the value of $\theta$ that would make the data that we observed most probable - ---- -## Some results -* $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$ the MLE of $\mu$ is $\bar X$ and the ML of $\sigma^2$ is the biased sample variance estimate. -* If $X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p)$ then the MLE of $p$ is $\bar X$ (the sample proportion of 1s). -* If $X_i \stackrel{iid}{\sim} Binomial(n_i, p)$ then the MLE of $p$ is $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i}$ (the sample proportion of 1s). -* If $X \stackrel{iid}{\sim} Poisson(\lambda t)$ then the MLE of $\lambda$ is $X/t$. -* If $X_i \stackrel{iid}{\sim} Poisson(\lambda t_i)$ then the MLE of $\lambda$ is - $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i}$ - ---- -## Example -* You saw 5 failure events per 94 days of monitoring a nuclear pump. 
-* Assuming Poisson, plot the likelihood - ---- -```{r, fig.height=4, fig.width=4, echo= TRUE} -lambda <- seq(0, .2, length = 1000) -likelihood <- dpois(5, 94 * lambda) / dpois(5, 5) -plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda)) -lines(rep(5/94, 2), 0 : 1, col = "red", lwd = 3) -lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2) -lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2) -``` - - - +--- +title : Likelihood +subtitle : Statistical Inference +author : Brian Caffo, Roger Peng, Jeff Leek +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Likelihood + +- A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution +- The **likelihood** of a collection of data is the joint density evaluated as a function of the parameters with the data fixed +- Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter + +--- + +## Likelihood + +Given a statistical probability mass function or density, say $f(x, \theta)$, where $\theta$ is an unknown parameter, the **likelihood** is $f$ viewed as a function of $\theta$ for a fixed, observed value of $x$. + +--- + +## Interpretations of likelihoods + +The likelihood has the following properties: + +1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another. +2. 
Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood. +3. If $\{X_i\}$ are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the $X_i$ is simply the product of the individual likelihoods. + +--- + +## Example + +- Suppose that we flip a coin with success probability $\theta$ +- Recall that the mass function for $x$ + $$ + f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1]. + $$ + where $x$ is either $0$ (Tails) or $1$ (Heads) +- Suppose that the result is a head +- The likelihood is + $$ + {\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1]. + $$ +- Therefore, ${\cal L}(.5, 1) / {\cal L}(.25, 1) = 2$, +- There is twice as much evidence supporting the hypothesis that $\theta = .5$ to the hypothesis that $\theta = .25$ + +--- + +## Example continued + +- Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1 +- The likelihood is: +$$ + \begin{eqnarray*} + {\cal L}(\theta, 1,0,1,1) & = & \theta^1 (1 - \theta)^{1 - 1} + \theta^0 (1 - \theta)^{1 - 0} \\ +& \times & \theta^1 (1 - \theta)^{1 - 1} + \theta^1 (1 - \theta)^{1 - 1}\\ +& = & \theta^3(1 - \theta)^1 + \end{eqnarray*} +$$ +- This likelihood only depends on the total number of heads and the total number of tails; we might write ${\cal L}(\theta, 1, 3)$ for shorthand +- Now consider ${\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) = 5.33$ +- There is over five times as much evidence supporting the hypothesis that $\theta = .5$ over that $\theta = .25$ + +--- + +## Plotting likelihoods + +- Generally, we want to consider all the values of $\theta$ between 0 and 1 +- A **likelihood plot** displays $\theta$ by ${\cal L}(\theta,x)$ +- Because the likelihood measures *relative evidence*, dividing the curve by its maximum value (or any other 
value for that matter) does not change its interpretation + +--- +```{r, fig.height=4.5, fig.width=4.5} +pvals <- seq(0, 1, length = 1000) +plot(pvals, dbinom(3, 4, pvals) / dbinom(3, 4, 3/4), type = "l", frame = FALSE, lwd = 3, xlab = "p", ylab = "likelihood / max likelihood") +``` + + +--- + +## Maximum likelihood + +- The value of $\theta$ where the curve reaches its maximum has a special meaning +- It is the value of $\theta$ that is most well supported by the data +- This point is called the **maximum likelihood estimate** (or MLE) of $\theta$ + $$ + MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x). + $$ +- Another interpretation of the MLE is that it is the value of $\theta$ that would make the data that we observed most probable + +--- +## Some results +* $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$ the MLE of $\mu$ is $\bar X$ and the ML of $\sigma^2$ is the biased sample variance estimate. +* If $X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p)$ then the MLE of $p$ is $\bar X$ (the sample proportion of 1s). +* If $X_i \stackrel{iid}{\sim} Binomial(n_i, p)$ then the MLE of $p$ is $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i}$ (the sample proportion of 1s). +* If $X \stackrel{iid}{\sim} Poisson(\lambda t)$ then the MLE of $\lambda$ is $X/t$. +* If $X_i \stackrel{iid}{\sim} Poisson(\lambda t_i)$ then the MLE of $\lambda$ is + $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i}$ + +--- +## Example +* You saw 5 failure events per 94 days of monitoring a nuclear pump. 
+* Assuming Poisson, plot the likelihood + +--- +```{r, fig.height=4, fig.width=4, echo= TRUE} +lambda <- seq(0, .2, length = 1000) +likelihood <- dpois(5, 94 * lambda) / dpois(5, 5) +plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda)) +lines(rep(5/94, 2), 0 : 1, col = "red", lwd = 3) +lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2) +lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2) +``` + + + diff --git a/06_StatisticalInference/02_04_Likeklihood/index.html b/06_StatisticalInference/02_04_Likeklihood/index.html index d7f7d2684..527e9b425 100644 --- a/06_StatisticalInference/02_04_Likeklihood/index.html +++ b/06_StatisticalInference/02_04_Likeklihood/index.html @@ -1,333 +1,334 @@ - - - - Likelihood - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Likelihood

-

Statistical Inference

-

Brian Caffo, Roger Peng, Jeff Leek
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Likelihood

-
-
-
    -
  • A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution
  • -
  • The likelihood of a collection of data is the joint density evaluated as a function of the parameters with the data fixed
  • -
  • Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter
  • -
- -
- -
- - -
-

Likelihood

-
-
-

Given a statistical probability mass function or density, say \(f(x, \theta)\), where \(\theta\) is an unknown parameter, the likelihood is \(f\) viewed as a function of \(\theta\) for a fixed, observed value of \(x\).

- -
- -
- - -
-

Interpretations of likelihoods

-
-
-

The likelihood has the following properties:

- -
    -
  1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another.
  2. -
  3. Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood.
  4. -
  5. If \(\{X_i\}\) are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the \(X_i\) is simply the product of the individual likelihoods.
  6. -
- -
- -
- - -
-

Example

-
-
-
    -
  • Suppose that we flip a coin with success probability \(\theta\)
  • -
  • Recall that the mass function for \(x\) is -\[ -f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1]. -\] -where \(x\) is either \(0\) (Tails) or \(1\) (Heads)
  • -
  • Suppose that the result is a head
  • -
  • The likelihood is -\[ -{\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1]. -\]
  • -
  • Therefore, \({\cal L}(.5, 1) / {\cal L}(.25, 1) = 2\),
  • -
  • There is twice as much evidence supporting the hypothesis that \(\theta = .5\) as supporting the hypothesis that \(\theta = .25\)
  • -
- -
- -
- - -
-

Example continued

-
-
-
    -
  • Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1
  • -
  • The likelihood is: -\[ -\begin{eqnarray*} -{\cal L}(\theta, 1,0,1,1) & = & \theta^1 (1 - \theta)^{1 - 1} -\theta^0 (1 - \theta)^{1 - 0} \\ -& \times & \theta^1 (1 - \theta)^{1 - 1} -\theta^1 (1 - \theta)^{1 - 1}\\ -& = & \theta^3(1 - \theta)^1 -\end{eqnarray*} -\]
  • -
  • This likelihood only depends on the total number of heads and the total number of tails; we might write \({\cal L}(\theta, 1, 3)\) for shorthand
  • -
  • Now consider \({\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) = 5.33\)
  • -
  • There is over five times as much evidence supporting the hypothesis that \(\theta = .5\) as supporting the hypothesis that \(\theta = .25\)
  • -
- -
- -
- - -
-

Plotting likelihoods

-
-
-
    -
  • Generally, we want to consider all the values of \(\theta\) between 0 and 1
  • -
  • A likelihood plot displays \(\theta\) by \({\cal L}(\theta,x)\)
  • -
  • Because the likelihood measures relative evidence, dividing the curve by its maximum value (or any other value for that matter) does not change its interpretation
  • -
- -
- -
- - -
-
pvals <- seq(0, 1, length = 1000)
-plot(pvals, dbinom(3, 4, pvals) / dbinom(3, 4, 3/4), type = "l", frame = FALSE, lwd = 3, xlab = "p", ylab = "likelihood / max likelihood")
-
- -
plot of chunk unnamed-chunk-1
- -
- -
- - -
-

Maximum likelihood

-
-
-
    -
  • The value of \(\theta\) where the curve reaches its maximum has a special meaning
  • -
  • It is the value of \(\theta\) that is best supported by the data
  • -
  • This point is called the maximum likelihood estimate (or MLE) of \(\theta\) -\[ -MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x). -\]
  • -
  • Another interpretation of the MLE is that it is the value of \(\theta\) that would make the data that we observed most probable
  • -
- -
- -
- - -
-

Some results

-
-
-
    -
  • If \(X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)\) then the MLE of \(\mu\) is \(\bar X\) and the MLE of \(\sigma^2\) is the biased sample variance estimate (with divisor \(n\)).
  • -
  • If \(X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p)\) then the MLE of \(p\) is \(\bar X\) (the sample proportion of 1s).
  • -
  • If \(X_i \sim Binomial(n_i, p)\) independently then the MLE of \(p\) is \(\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i}\) (the overall proportion of successes).
  • -
  • If \(X \sim Poisson(\lambda t)\) then the MLE of \(\lambda\) is \(X/t\).
  • -
  • If \(X_i \sim Poisson(\lambda t_i)\) independently then the MLE of \(\lambda\) is -\(\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i}\)
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • You saw 5 failure events per 94 days of monitoring a nuclear pump.
  • -
  • Assuming Poisson, plot the likelihood
  • -
- -
- -
- - -
-
lambda <- seq(0, .2, length = 1000)
-likelihood <- dpois(5, 94 * lambda) / dpois(5, 5)
-plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda))
-lines(rep(5/94, 2), 0 : 1, col = "red", lwd = 3)
-lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2)
-lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2)
-
- -
plot of chunk unnamed-chunk-2
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Likelihood + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Likelihood

+

Statistical Inference

+

Brian Caffo, Roger Peng, Jeff Leek
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Likelihood

+
+
+
    +
  • A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution
  • +
  • The likelihood of a collection of data is the joint density evaluated as a function of the parameters with the data fixed
  • +
  • Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter
  • +
+ +
+ +
+ + +
+

Likelihood

+
+
+

Given a statistical probability mass function or density, say \(f(x, \theta)\), where \(\theta\) is an unknown parameter, the likelihood is \(f\) viewed as a function of \(\theta\) for a fixed, observed value of \(x\).

+ +
+ +
+ + +
+

Interpretations of likelihoods

+
+
+

The likelihood has the following properties:

+ +
    +
  1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another.
  2. +
  3. Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood.
  4. +
  5. If \(\{X_i\}\) are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the \(X_i\) is simply the product of the individual likelihoods.
  6. +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose that we flip a coin with success probability \(\theta\)
  • +
  • Recall that the mass function for \(x\) is +\[ +f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1]. +\] +where \(x\) is either \(0\) (Tails) or \(1\) (Heads)
  • +
  • Suppose that the result is a head
  • +
  • The likelihood is +\[ +{\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1]. +\]
  • +
  • Therefore, \({\cal L}(.5, 1) / {\cal L}(.25, 1) = 2\),
  • +
  • There is twice as much evidence supporting the hypothesis that \(\theta = .5\) as supporting the hypothesis that \(\theta = .25\)
  • +
+ +
+ +
+ + +
+

Example continued

+
+
+
    +
  • Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1
  • +
  • The likelihood is: +\[ +\begin{eqnarray*} +{\cal L}(\theta, 1,0,1,1) & = & \theta^1 (1 - \theta)^{1 - 1} +\theta^0 (1 - \theta)^{1 - 0} \\ +& \times & \theta^1 (1 - \theta)^{1 - 1} +\theta^1 (1 - \theta)^{1 - 1}\\ +& = & \theta^3(1 - \theta)^1 +\end{eqnarray*} +\]
  • +
  • This likelihood only depends on the total number of heads and the total number of tails; we might write \({\cal L}(\theta, 1, 3)\) for shorthand
  • +
  • Now consider \({\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) = 5.33\)
  • +
  • There is over five times as much evidence supporting the hypothesis that \(\theta = .5\) as supporting the hypothesis that \(\theta = .25\)
  • +
+ +
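
The likelihood ratio for the sequence 1, 0, 1, 1 can be checked numerically. The following is a minimal R sketch, not part of the original slides; `coin_lik` is a hypothetical helper name introduced only for illustration.

```r
# Likelihood of theta for a sequence of independent Bernoulli flips
# (1 = head, 0 = tail). `coin_lik` is an illustrative helper, not from the slides.
coin_lik <- function(theta, flips) {
  # Independent flips: the individual likelihoods multiply.
  prod(theta^flips * (1 - theta)^(1 - flips))
}

flips <- c(1, 0, 1, 1)

# Relative evidence for theta = .5 versus theta = .25
ratio <- coin_lik(0.5, flips) / coin_lik(0.25, flips)
round(ratio, 2)
```

Because the likelihood depends only on the number of heads and tails, reordering the flips (for example `c(1, 1, 1, 0)`) should give the same value for any fixed `theta`.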
+ +
+ + +
+

Plotting likelihoods

+
+
+
    +
  • Generally, we want to consider all the values of \(\theta\) between 0 and 1
  • +
  • A likelihood plot displays \(\theta\) by \({\cal L}(\theta,x)\)
  • +
  • Because the likelihood measures relative evidence, dividing the curve by its maximum value (or any other value for that matter) does not change its interpretation
  • +
+ +
+ +
+ + +
+
pvals <- seq(0, 1, length = 1000)
+plot(pvals, dbinom(3, 4, pvals)/dbinom(3, 4, 3/4), type = "l", frame = FALSE, 
+    lwd = 3, xlab = "p", ylab = "likelihood / max likelihood")
+
+ +

plot of chunk unnamed-chunk-1

+ +
+ +
+ + +
+

Maximum likelihood

+
+
+
    +
  • The value of \(\theta\) where the curve reaches its maximum has a special meaning
  • +
  • It is the value of \(\theta\) that is best supported by the data
  • +
  • This point is called the maximum likelihood estimate (or MLE) of \(\theta\) +\[ +MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x). +\]
  • +
  • Another interpretation of the MLE is that it is the value of \(\theta\) that would make the data that we observed most probable
  • +
+ +
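
As a rough numerical illustration of the argmax definition, one could maximize the likelihood for 3 heads in 4 flips with base R's `optimize`. This is only a sketch under that assumed data; the names `lik` and `fit` are illustrative, and the answer is approximate, limited by the optimizer's tolerance.

```r
# Binomial likelihood of theta for 3 heads in 4 flips (assumed data for this sketch).
lik <- function(theta) dbinom(3, 4, theta)

# One-dimensional numerical search for the maximizer on [0, 1].
fit <- optimize(lik, interval = c(0, 1), maximum = TRUE)

fit$maximum           # numerical MLE, should be close to 3/4
fit$objective / lik(3/4)  # likelihood at the optimum relative to the value at 3/4
```

The numerical maximizer should agree with the sample proportion of heads, which is the closed-form MLE for Bernoulli data.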
+ +
+ + +
+

Some results

+
+
+
    +
  • If \(X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)\) then the MLE of \(\mu\) is \(\bar X\) and the MLE of \(\sigma^2\) is the biased sample variance estimate (with divisor \(n\)).
  • +
  • If \(X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p)\) then the MLE of \(p\) is \(\bar X\) (the sample proportion of 1s).
  • +
  • If \(X_i \sim Binomial(n_i, p)\) independently then the MLE of \(p\) is \(\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i}\) (the overall proportion of successes).
  • +
  • If \(X \sim Poisson(\lambda t)\) then the MLE of \(\lambda\) is \(X/t\).
  • +
  • If \(X_i \sim Poisson(\lambda t_i)\) independently then the MLE of \(\lambda\) is +\(\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i}\)
  • +
+ +
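
The pooled Poisson result can be sanity-checked by maximizing the joint log-likelihood numerically and comparing with the closed form. The sketch below is not from the original slides: the counts `x` and monitoring times `t_obs` are made up for illustration, and `loglik` is a hypothetical helper name.

```r
# Made-up event counts and monitoring times, for illustration only.
x <- c(5, 2, 7)
t_obs <- c(94, 40, 120)

# Joint log-likelihood of lambda for independent Poisson(lambda * t_i) counts;
# independence turns the product of likelihoods into a sum of logs.
loglik <- function(lambda) sum(dpois(x, lambda * t_obs, log = TRUE))

# Numerical maximizer over a plausible range of rates.
numeric_mle <- optimize(loglik, interval = c(1e-6, 1), maximum = TRUE)$maximum

# Closed-form MLE: total events divided by total monitoring time.
closed_form <- sum(x) / sum(t_obs)

c(numeric = numeric_mle, closed_form = closed_form)
```

Working on the log scale is a design choice for numerical stability; it does not change the location of the maximum.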
+ +
+ + +
+

Example

+
+
+
    +
  • You saw 5 failure events per 94 days of monitoring a nuclear pump.
  • +
  • Assuming Poisson, plot the likelihood
  • +
+ +
+ +
+ + +
+
lambda <- seq(0, 0.2, length = 1000)
+likelihood <- dpois(5, 94 * lambda)/dpois(5, 5)
+plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda))
+lines(rep(5/94, 2), 0:1, col = "red", lwd = 3)
+lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2)
+lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2)
+
+ +

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/02_04_Likeklihood/index.md b/06_StatisticalInference/02_04_Likeklihood/index.md index d7cc17b36..1e0739357 100644 --- a/06_StatisticalInference/02_04_Likeklihood/index.md +++ b/06_StatisticalInference/02_04_Likeklihood/index.md @@ -1,138 +1,137 @@ ---- -title : Likelihood -subtitle : Statistical Inference -author : Brian Caffo, Roger Peng, Jeff Leek -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## Likelihood - -- A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution -- The **likelihood** of a collection of data is the joint density evaluated as a function of the parameters with the data fixed -- Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter - ---- - -## Likelihood - -Given a statistical probability mass function or density, say $f(x, \theta)$, where $\theta$ is an unknown parameter, the **likelihood** is $f$ viewed as a function of $\theta$ for a fixed, observed value of $x$. - ---- - -## Interpretations of likelihoods - -The likelihood has the following properties: - -1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another. -2. Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood. -3. If $\{X_i\}$ are independent random variables, then their likelihoods multiply. 
That is, the likelihood of the parameters given all of the $X_i$ is simply the product of the individual likelihoods. - ---- - -## Example - -- Suppose that we flip a coin with success probability $\theta$ -- Recall that the mass function for $x$ - $$ - f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1]. - $$ - where $x$ is either $0$ (Tails) or $1$ (Heads) -- Suppose that the result is a head -- The likelihood is - $$ - {\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1]. - $$ -- Therefore, ${\cal L}(.5, 1) / {\cal L}(.25, 1) = 2$, -- There is twice as much evidence supporting the hypothesis that $\theta = .5$ to the hypothesis that $\theta = .25$ - ---- - -## Example continued - -- Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1 -- The likelihood is: -$$ - \begin{eqnarray*} - {\cal L}(\theta, 1,0,1,1) & = & \theta^1 (1 - \theta)^{1 - 1} - \theta^0 (1 - \theta)^{1 - 0} \\ -& \times & \theta^1 (1 - \theta)^{1 - 1} - \theta^1 (1 - \theta)^{1 - 1}\\ -& = & \theta^3(1 - \theta)^1 - \end{eqnarray*} -$$ -- This likelihood only depends on the total number of heads and the total number of tails; we might write ${\cal L}(\theta, 1, 3)$ for shorthand -- Now consider ${\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) = 5.33$ -- There is over five times as much evidence supporting the hypothesis that $\theta = .5$ over that $\theta = .25$ - ---- - -## Plotting likelihoods - -- Generally, we want to consider all the values of $\theta$ between 0 and 1 -- A **likelihood plot** displays $\theta$ by ${\cal L}(\theta,x)$ -- Because the likelihood measures *relative evidence*, dividing the curve by its maximum value (or any other value for that matter) does not change its interpretation - ---- - -```r -pvals <- seq(0, 1, length = 1000) -plot(pvals, dbinom(3, 4, pvals) / dbinom(3, 4, 3/4), type = "l", frame = FALSE, lwd = 3, xlab = "p", ylab = "likelihood / max 
likelihood") -``` - -
plot of chunk unnamed-chunk-1
- - - ---- - -## Maximum likelihood - -- The value of $\theta$ where the curve reaches its maximum has a special meaning -- It is the value of $\theta$ that is most well supported by the data -- This point is called the **maximum likelihood estimate** (or MLE) of $\theta$ - $$ - MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x). - $$ -- Another interpretation of the MLE is that it is the value of $\theta$ that would make the data that we observed most probable - ---- -## Some results -* $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$ the MLE of $\mu$ is $\bar X$ and the ML of $\sigma^2$ is the biased sample variance estimate. -* If $X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p)$ then the MLE of $p$ is $\bar X$ (the sample proportion of 1s). -* If $X_i \stackrel{iid}{\sim} Binomial(n_i, p)$ then the MLE of $p$ is $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i}$ (the sample proportion of 1s). -* If $X \stackrel{iid}{\sim} Poisson(\lambda t)$ then the MLE of $\lambda$ is $X/t$. -* If $X_i \stackrel{iid}{\sim} Poisson(\lambda t_i)$ then the MLE of $\lambda$ is - $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i}$ - ---- -## Example -* You saw 5 failure events per 94 days of monitoring a nuclear pump. -* Assuming Poisson, plot the likelihood - ---- - -```r -lambda <- seq(0, .2, length = 1000) -likelihood <- dpois(5, 94 * lambda) / dpois(5, 5) -plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda)) -lines(rep(5/94, 2), 0 : 1, col = "red", lwd = 3) -lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2) -lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2) -``` - -
plot of chunk unnamed-chunk-2
- - - - +--- +title : Likelihood +subtitle : Statistical Inference +author : Brian Caffo, Roger Peng, Jeff Leek +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Likelihood + +- A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution +- The **likelihood** of a collection of data is the joint density evaluated as a function of the parameters with the data fixed +- Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter + +--- + +## Likelihood + +Given a statistical probability mass function or density, say $f(x, \theta)$, where $\theta$ is an unknown parameter, the **likelihood** is $f$ viewed as a function of $\theta$ for a fixed, observed value of $x$. + +--- + +## Interpretations of likelihoods + +The likelihood has the following properties: + +1. Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another. +2. Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood. +3. If $\{X_i\}$ are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the $X_i$ is simply the product of the individual likelihoods. + +--- + +## Example + +- Suppose that we flip a coin with success probability $\theta$ +- Recall that the mass function for $x$ + $$ + f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1]. 
+ $$ + where $x$ is either $0$ (Tails) or $1$ (Heads) +- Suppose that the result is a head +- The likelihood is + $$ + {\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1]. + $$ +- Therefore, ${\cal L}(.5, 1) / {\cal L}(.25, 1) = 2$, +- There is twice as much evidence supporting the hypothesis that $\theta = .5$ to the hypothesis that $\theta = .25$ + +--- + +## Example continued + +- Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1 +- The likelihood is: +$$ + \begin{eqnarray*} + {\cal L}(\theta, 1,0,1,1) & = & \theta^1 (1 - \theta)^{1 - 1} + \theta^0 (1 - \theta)^{1 - 0} \\ +& \times & \theta^1 (1 - \theta)^{1 - 1} + \theta^1 (1 - \theta)^{1 - 1}\\ +& = & \theta^3(1 - \theta)^1 + \end{eqnarray*} +$$ +- This likelihood only depends on the total number of heads and the total number of tails; we might write ${\cal L}(\theta, 1, 3)$ for shorthand +- Now consider ${\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) = 5.33$ +- There is over five times as much evidence supporting the hypothesis that $\theta = .5$ over that $\theta = .25$ + +--- + +## Plotting likelihoods + +- Generally, we want to consider all the values of $\theta$ between 0 and 1 +- A **likelihood plot** displays $\theta$ by ${\cal L}(\theta,x)$ +- Because the likelihood measures *relative evidence*, dividing the curve by its maximum value (or any other value for that matter) does not change its interpretation + +--- + +```r +pvals <- seq(0, 1, length = 1000) +plot(pvals, dbinom(3, 4, pvals)/dbinom(3, 4, 3/4), type = "l", frame = FALSE, + lwd = 3, xlab = "p", ylab = "likelihood / max likelihood") +``` + +![plot of chunk unnamed-chunk-1](assets/fig/unnamed-chunk-1.png) + + + +--- + +## Maximum likelihood + +- The value of $\theta$ where the curve reaches its maximum has a special meaning +- It is the value of $\theta$ that is most well supported by the data +- This point is called the **maximum likelihood estimate** (or 
MLE) of $\theta$ + $$ + MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x). + $$ +- Another interpretation of the MLE is that it is the value of $\theta$ that would make the data that we observed most probable + +--- +## Some results +* $X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)$ the MLE of $\mu$ is $\bar X$ and the ML of $\sigma^2$ is the biased sample variance estimate. +* If $X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p)$ then the MLE of $p$ is $\bar X$ (the sample proportion of 1s). +* If $X_i \stackrel{iid}{\sim} Binomial(n_i, p)$ then the MLE of $p$ is $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i}$ (the sample proportion of 1s). +* If $X \stackrel{iid}{\sim} Poisson(\lambda t)$ then the MLE of $\lambda$ is $X/t$. +* If $X_i \stackrel{iid}{\sim} Poisson(\lambda t_i)$ then the MLE of $\lambda$ is + $\frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i}$ + +--- +## Example +* You saw 5 failure events per 94 days of monitoring a nuclear pump. +* Assuming Poisson, plot the likelihood + +--- + +```r +lambda <- seq(0, 0.2, length = 1000) +likelihood <- dpois(5, 94 * lambda)/dpois(5, 5) +plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l", xlab = expression(lambda)) +lines(rep(5/94, 2), 0:1, col = "red", lwd = 3) +lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2) +lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2) +``` + +![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) + + + + diff --git a/06_StatisticalInference/02_04_Likeklihood/index.pdf b/06_StatisticalInference/02_04_Likeklihood/index.pdf index 85787788c..4e3388a80 100644 Binary files a/06_StatisticalInference/02_04_Likeklihood/index.pdf and b/06_StatisticalInference/02_04_Likeklihood/index.pdf differ diff --git a/06_StatisticalInference/02_05_Bayes/index.Rmd b/06_StatisticalInference/02_05_Bayes/index.Rmd index b1aae880c..7c49de8da 100644 --- a/06_StatisticalInference/02_05_Bayes/index.Rmd +++ b/06_StatisticalInference/02_05_Bayes/index.Rmd @@ -1,204 +1,187 @@ 
---- -title : Bayesian inference -subtitle : Statistical Inference -author : Brian Caffo, Roger Peng, Jeff Leek -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## Bayesian analysis -- Bayesian statistics posits a *prior* on the parameter - of interest -- All inferences are then performed on the distribution of - the parameter given the data, called the posterior -- In general, - $$ - \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} - $$ -- Therefore (as we saw in diagnostic testing) the likelihood is - the factor by which our prior beliefs are updated to produce - conclusions in the light of the data - ---- -## Prior specification -- The beta distribution is the default prior - for parameters between $0$ and $1$. 
-- The beta density depends on two parameters $\alpha$ and $\beta$ -$$ -\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} - p ^ {\alpha - 1} (1 - p) ^ {\beta - 1} ~~~~\mbox{for} ~~ 0 \leq p \leq 1 -$$ -- The mean of the beta density is $\alpha / (\alpha + \beta)$ -- The variance of the beta density is -$$\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$ -- The uniform density is the special case where $\alpha = \beta = 1$ - ---- - -``` -## Exploring the beta density -library(manipulate) -pvals <- seq(0.01, 0.99, length = 1000) -manipulate( - plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE), - alpha = slider(0.01, 10, initial = 1, step = .5), - beta = slider(0.01, 10, initial = 1, step = .5) - ) -``` - ---- -## Posterior -- Suppose that we chose values of $\alpha$ and $\beta$ so that - the beta prior is indicative of our degree of belief regarding $p$ - in the absence of data -- Then using the rule that - $$ - \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} - $$ - and throwing out anything that doesn't depend on $p$, we have that -$$ -\begin{align} -\mbox{Posterior} &\propto p^x(1 - p)^{n-x} \times p^{\alpha -1} (1 - p)^{\beta - 1} \\ - & = p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} -\end{align} -$$ -- This density is just another beta density with parameters - $\tilde \alpha = x + \alpha$ and $\tilde \beta = n - x + \beta$ - - ---- -## Posterior mean - -$$ -\begin{align} -E[p ~|~ X] & = \frac{\tilde \alpha}{\tilde \alpha + \tilde \beta}\\ \\ -& = \frac{x + \alpha}{x + \alpha + n - x + \beta}\\ \\ -& = \frac{x + \alpha}{n + \alpha + \beta} \\ \\ -& = \frac{x}{n} \times \frac{n}{n + \alpha + \beta} + \frac{\alpha}{\alpha + \beta} \times \frac{\alpha + \beta}{n + \alpha + \beta} \\ \\ -& = \mbox{MLE} \times \pi + \mbox{Prior Mean} \times (1 - \pi) -\end{align} -$$ - ---- -## Thoughts - -- The posterior mean is a mixture of the MLE ($\hat p$) and the - prior mean -- $\pi$ goes to $1$ as $n$ gets large; 
for large $n$ the data swamps the prior -- For small $n$, the prior mean dominates -- Generalizes how science should ideally work; as data becomes - increasingly available, prior beliefs should matter less and less -- With a prior that is degenerate at a value, no amount of data - can overcome the prior - ---- -## Example - -- Suppose that in a random sample of an at-risk population -$13$ of $20$ subjects had hypertension. Estimate the prevalence -of hypertension in this population. -- $x = 13$ and $n=20$ -- Consider a uniform prior, $\alpha = \beta = 1$ -- The posterior is proportional to (see formula above) -$$ -p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^x (1 - p)^{n-x} -$$ -That is, for the uniform prior, the posterior is the likelihood -- Consider the instance where $\alpha = \beta = 2$ (recall this prior -is humped around the point $.5$) the posterior is -$$ -p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^{x + 1} (1 - p)^{n-x + 1} -$$ -- The "Jeffrey's prior" which has some theoretical benefits - puts $\alpha = \beta = .5$ - ---- -``` -pvals <- seq(0.01, 0.99, length = 1000) -x <- 13; n <- 20 -myPlot <- function(alpha, beta){ - plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE) - lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)), - lwd = 3, col = "darkred") - lines(pvals, dbinom(x,n,pvals) / dbinom(x,n,x/n), lwd = 3, col = "darkblue") - lines(pvals, dbeta(pvals, alpha+x, beta+(n-x)) / max(dbeta(pvals, alpha+x, beta+(n-x))), - lwd = 3, col = "darkgreen") - title("red=prior,green=posterior,blue=likelihood") -} -manipulate( - myPlot(alpha, beta), - alpha = slider(0.01, 100, initial = 1, step = .5), - beta = slider(0.01, 100, initial = 1, step = .5) - ) -``` - ---- -## Credible intervals -- A Bayesian credible interval is the Bayesian analog of a confidence - interval -- A $95\%$ credible interval, $[a, b]$ would satisfy - $$ - P(p \in [a, b] ~|~ x) = .95 - $$ -- The best credible intervals chop off the posterior 
with a horizontal - line in the same way we did for likelihoods -- These are called highest posterior density (HPD) intervals - ---- -## Getting HPD intervals for this example -- Install the \texttt{binom} package, then the command -```{r} -library(binom) -binom.bayes(13, 20, type = "highest") -``` -gives the HPD interval. -- The default credible level is $95\%$ and -the default prior is the Jeffrey's prior. - ---- -``` -pvals <- seq(0.01, 0.99, length = 1000) -x <- 13; n <- 20 -myPlot2 <- function(alpha, beta, cl){ - plot(pvals, dbeta(pvals, alpha+x, beta+(n-x)), type = "l", lwd = 3, - xlab = "p", ylab = "", frame = FALSE) - out <- binom.bayes(x, n, type = "highest", - prior.shape1 = alpha, - prior.shape2 = beta, - conf.level = cl) - p1 <- out$lower; p2 <- out$upper - lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha+x, beta+(n-x)), 0), - type = "l", lwd = 3, col = "darkred") -} -manipulate( - myPlot2(alpha, beta, cl), - alpha = slider(0.01, 10, initial = 1, step = .5), - beta = slider(0.01, 10, initial = 1, step = .5), - cl = slider(0.01, 0.99, initial = 0.95, step = .01) - ) -``` - +--- +title : Bayesian inference +subtitle : Statistical Inference +author : Brian Caffo, Roger Peng, Jeff Leek +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Bayesian analysis +- Bayesian statistics posits a *prior* on the parameter + of interest +- All inferences are then performed on the distribution of + the parameter given the data, called the posterior +- In general, + $$ + \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} + $$ +- Therefore (as we saw in diagnostic testing) the likelihood is + the factor by which our 
prior beliefs are updated to produce + conclusions in the light of the data + +--- +## Prior specification +- The beta distribution is the default prior + for parameters between $0$ and $1$. +- The beta density depends on two parameters $\alpha$ and $\beta$ +$$ +\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} + p ^ {\alpha - 1} (1 - p) ^ {\beta - 1} ~~~~\mbox{for} ~~ 0 \leq p \leq 1 +$$ +- The mean of the beta density is $\alpha / (\alpha + \beta)$ +- The variance of the beta density is +$$\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$ +- The uniform density is the special case where $\alpha = \beta = 1$ + +--- + +``` +## Exploring the beta density +library(manipulate) +pvals <- seq(0.01, 0.99, length = 1000) +manipulate( + plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE), + alpha = slider(0.01, 10, initial = 1, step = .5), + beta = slider(0.01, 10, initial = 1, step = .5) + ) +``` + +--- +## Posterior +- Suppose that we chose values of $\alpha$ and $\beta$ so that + the beta prior is indicative of our degree of belief regarding $p$ + in the absence of data +- Then using the rule that + $$ + \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} + $$ + and throwing out anything that doesn't depend on $p$, we have that +$$ +\begin{align} +\mbox{Posterior} &\propto p^x(1 - p)^{n-x} \times p^{\alpha -1} (1 - p)^{\beta - 1} \\ + & = p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} +\end{align} +$$ +- This density is just another beta density with parameters + $\tilde \alpha = x + \alpha$ and $\tilde \beta = n - x + \beta$ + + +--- +## Posterior mean + +$$ +\begin{align} +E[p ~|~ X] & = \frac{\tilde \alpha}{\tilde \alpha + \tilde \beta}\\ \\ +& = \frac{x + \alpha}{x + \alpha + n - x + \beta}\\ \\ +& = \frac{x + \alpha}{n + \alpha + \beta} \\ \\ +& = \frac{x}{n} \times \frac{n}{n + \alpha + \beta} + \frac{\alpha}{\alpha + \beta} \times \frac{\alpha + \beta}{n + \alpha + \beta} \\ \\ +& = \mbox{MLE} \times \pi + 
\mbox{Prior Mean} \times (1 - \pi) +\end{align} +$$ + +--- +## Thoughts + +- The posterior mean is a mixture of the MLE ($\hat p$) and the + prior mean +- $\pi$ goes to $1$ as $n$ gets large; for large $n$ the data swamps the prior +- For small $n$, the prior mean dominates +- Generalizes how science should ideally work; as data becomes + increasingly available, prior beliefs should matter less and less +- With a prior that is degenerate at a value, no amount of data + can overcome the prior + +--- +## Example + +- Suppose that in a random sample of an at-risk population +$13$ of $20$ subjects had hypertension. Estimate the prevalence +of hypertension in this population. +- $x = 13$ and $n=20$ +- Consider a uniform prior, $\alpha = \beta = 1$ +- The posterior is proportional to (see formula above) +$$ +p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^x (1 - p)^{n-x} +$$ +That is, for the uniform prior, the posterior is the likelihood +- Consider the instance where $\alpha = \beta = 2$ (recall this prior +is humped around the point $.5$) the posterior is +$$ +p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^{x + 1} (1 - p)^{n-x + 1} +$$ +- The "Jeffrey's prior" which has some theoretical benefits + puts $\alpha = \beta = .5$ + +--- +``` +pvals <- seq(0.01, 0.99, length = 1000) +x <- 13; n <- 20 +myPlot <- function(alpha, beta){ + plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE) + lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)), + lwd = 3, col = "darkred") + lines(pvals, dbinom(x,n,pvals) / dbinom(x,n,x/n), lwd = 3, col = "darkblue") + lines(pvals, dbeta(pvals, alpha+x, beta+(n-x)) / max(dbeta(pvals, alpha+x, beta+(n-x))), + lwd = 3, col = "darkgreen") + title("red=prior,green=posterior,blue=likelihood") +} +manipulate( + myPlot(alpha, beta), + alpha = slider(0.01, 100, initial = 1, step = .5), + beta = slider(0.01, 100, initial = 1, step = .5) + ) +``` + +--- +## Credible intervals +- A Bayesian credible interval is 
the Bayesian analog of a confidence + interval +- A $95\%$ credible interval, $[a, b]$ would satisfy + $$ + P(p \in [a, b] ~|~ x) = .95 + $$ +- The best credible intervals chop off the posterior with a horizontal + line in the same way we did for likelihoods +- These are called highest posterior density (HPD) intervals + +--- +## Getting HPD intervals for this example +- Install the \texttt{binom} package, then the command +```{r} +library(binom) +binom.bayes(13, 20, type = "highest") +``` +gives the HPD interval. +- The default credible level is $95\%$ and +the default prior is the Jeffrey's prior. + +--- +``` +pvals <- seq(0.01, 0.99, length = 1000) +x <- 13; n <- 20 +myPlot2 <- function(alpha, beta, cl){ + plot(pvals, dbeta(pvals, alpha+x, beta+(n-x)), type = "l", lwd = 3, + xlab = "p", ylab = "", frame = FALSE) + out <- binom.bayes(x, n, type = "highest", + prior.shape1 = alpha, + prior.shape2 = beta, + conf.level = cl) + p1 <- out$lower; p2 <- out$upper + lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha+x, beta+(n-x)), 0), + type = "l", lwd = 3, col = "darkred") +} +manipulate( + myPlot2(alpha, beta, cl), + alpha = slider(0.01, 10, initial = 1, step = .5), + beta = slider(0.01, 10, initial = 1, step = .5), + cl = slider(0.01, 0.99, initial = 0.95, step = .01) + ) +``` + diff --git a/06_StatisticalInference/02_05_Bayes/index.html b/06_StatisticalInference/02_05_Bayes/index.html index 5fb2d1b05..ac1863c48 100644 --- a/06_StatisticalInference/02_05_Bayes/index.html +++ b/06_StatisticalInference/02_05_Bayes/index.html @@ -1,403 +1,407 @@ - - - - Bayesian inference - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Bayesian inference

-

Statistical Inference

-

Brian Caffo, Roger Peng, Jeff Leek
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Bayesian analysis

-
-
-
    -
  • Bayesian statistics posits a prior on the parameter -of interest
  • -
  • All inferences are then performed on the distribution of -the parameter given the data, called the posterior
  • -
  • In general, -\[ -\mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} -\]
  • -
  • Therefore (as we saw in diagnostic testing) the likelihood is -the factor by which our prior beliefs are updated to produce -conclusions in the light of the data
  • -
- -
- -
- - -
-

Prior specification

-
-
-
    -
  • The beta distribution is the default prior -for parameters between \(0\) and \(1\).
  • -
  • The beta density depends on two parameters \(\alpha\) and \(\beta\) -\[ -\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} -p ^ {\alpha - 1} (1 - p) ^ {\beta - 1} ~~~~\mbox{for} ~~ 0 \leq p \leq 1 -\]
  • -
  • The mean of the beta density is \(\alpha / (\alpha + \beta)\)
  • -
  • The variance of the beta density is -\[\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\]
  • -
  • The uniform density is the special case where \(\alpha = \beta = 1\)
  • -
- -
- -
- - -
-
## Exploring the beta density
-library(manipulate)
-pvals <- seq(0.01, 0.99, length = 1000)
-manipulate(
-    plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE),
-    alpha = slider(0.01, 10, initial = 1, step = .5),
-    beta = slider(0.01, 10, initial = 1, step = .5)
-    )
-
- -
- -
- - -
-

Posterior

-
-
-
    -
  • Suppose that we chose values of \(\alpha\) and \(\beta\) so that -the beta prior is indicative of our degree of belief regarding \(p\) -in the absence of data
  • -
  • Then using the rule that -\[ -\mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} -\] -and throwing out anything that doesn't depend on \(p\), we have that -\[ -\begin{align} -\mbox{Posterior} &\propto p^x(1 - p)^{n-x} \times p^{\alpha -1} (1 - p)^{\beta - 1} \\ - & = p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} -\end{align} -\]
  • -
  • This density is just another beta density with parameters -\(\tilde \alpha = x + \alpha\) and \(\tilde \beta = n - x + \beta\)
  • -
- -
- -
- - -
-

Posterior mean

-
-
-

\[ -\begin{align} -E[p ~|~ X] & = \frac{\tilde \alpha}{\tilde \alpha + \tilde \beta}\\ \\ -& = \frac{x + \alpha}{x + \alpha + n - x + \beta}\\ \\ -& = \frac{x + \alpha}{n + \alpha + \beta} \\ \\ -& = \frac{x}{n} \times \frac{n}{n + \alpha + \beta} + \frac{\alpha}{\alpha + \beta} \times \frac{\alpha + \beta}{n + \alpha + \beta} \\ \\ -& = \mbox{MLE} \times \pi + \mbox{Prior Mean} \times (1 - \pi) -\end{align} -\]

- -
- -
- - -
-

Thoughts

-
-
-
    -
  • The posterior mean is a mixture of the MLE (\(\hat p\)) and the -prior mean
  • -
  • \(\pi\) goes to \(1\) as \(n\) gets large; for large \(n\) the data swamps the prior
  • -
  • For small \(n\), the prior mean dominates
  • -
  • Generalizes how science should ideally work; as data becomes -increasingly available, prior beliefs should matter less and less
  • -
  • With a prior that is degenerate at a value, no amount of data -can overcome the prior
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Suppose that in a random sample of an at-risk population -\(13\) of \(20\) subjects had hypertension. Estimate the prevalence -of hypertension in this population.
  • -
  • \(x = 13\) and \(n=20\)
  • -
  • Consider a uniform prior, \(\alpha = \beta = 1\)
  • -
  • The posterior is proportional to (see formula above) -\[ -p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^x (1 - p)^{n-x} -\] -That is, for the uniform prior, the posterior is the likelihood
  • -
  • Consider the instance where \(\alpha = \beta = 2\) (recall this prior -is humped around the point \(.5\)) the posterior is -\[ -p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^{x + 1} (1 - p)^{n-x + 1} -\]
  • -
  • The "Jeffreys prior", which has some theoretical benefits, -puts \(\alpha = \beta = .5\)
  • -
- -
- -
- - -
-
pvals <- seq(0.01, 0.99, length = 1000)
-x <- 13; n <- 20
-myPlot <- function(alpha, beta){
-    plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE)
-    lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)), 
-            lwd = 3, col = "darkred")
-    lines(pvals, dbinom(x,n,pvals) / dbinom(x,n,x/n), lwd = 3, col = "darkblue")
-    lines(pvals, dbeta(pvals, alpha+x, beta+(n-x)) / max(dbeta(pvals, alpha+x, beta+(n-x))),
-        lwd = 3, col = "darkgreen")
-    title("red=prior,green=posterior,blue=likelihood")
-}
-manipulate(
-    myPlot(alpha, beta),
-    alpha = slider(0.01, 10, initial = 1, step = .5),
-    beta = slider(0.01, 10, initial = 1, step = .5)
-    )
-
- -
- -
- - -
-

Credible intervals

-
-
-
    -
  • A Bayesian credible interval is the Bayesian analog of a confidence -interval
  • -
  • A \(95\%\) credible interval, \([a, b]\) would satisfy -\[ -P(p \in [a, b] ~|~ x) = .95 -\]
  • -
  • The best credible intervals chop off the posterior with a horizontal -line in the same way we did for likelihoods
  • -
  • These are called highest posterior density (HPD) intervals
  • -
- -
- -
- - -
-

Getting HPD intervals for this example

-
-
-
    -
  • Install the binom package; then the command
  • -
- -
library(binom)
-binom.bayes(13, 20, type = "highest")
-
- -
  method  x  n shape1 shape2   mean  lower  upper  sig
-1  bayes 13 20   13.5    7.5 0.6429 0.4423 0.8361 0.05
-
- -

gives the HPD interval.

- -
    -
  • The default credible level is \(95\%\) and -the default prior is the Jeffreys prior.
  • -
- -
- -
- - -
-
pvals <- seq(0.01, 0.99, length = 1000)
-x <- 13; n <- 20
-myPlot2 <- function(alpha, beta, cl){
-    plot(pvals, dbeta(pvals, alpha+x, beta+(n-x)), type = "l", lwd = 3,
-    xlab = "p", ylab = "", frame = FALSE)
-    out <- binom.bayes(x, n, type = "highest", 
-        prior.shape1 = alpha, 
-        prior.shape2 = beta, 
-        conf.level = cl)
-    p1 <- out$lower; p2 <- out$upper
-    lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha+x, beta+(n-x)), 0), 
-        type = "l", lwd = 3, col = "darkred")
-}
-manipulate(
-    myPlot2(alpha, beta, cl),
-    alpha = slider(0.01, 10, initial = 1, step = .5),
-    beta = slider(0.01, 10, initial = 1, step = .5),
-    cl = slider(0.01, 0.99, initial = 0.95, step = .01)
-    )
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Bayesian inference + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Bayesian inference

+

Statistical Inference

+

Brian Caffo, Roger Peng, Jeff Leek
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Bayesian analysis

+
+
+
    +
  • Bayesian statistics posits a prior on the parameter +of interest
  • +
  • All inferences are then performed on the distribution of +the parameter given the data, called the posterior
  • +
  • In general, +\[ +\mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} +\]
  • +
  • Therefore (as we saw in diagnostic testing) the likelihood is +the factor by which our prior beliefs are updated to produce +conclusions in the light of the data
  • +
+ +
+ +
+ + +
+

Prior specification

+
+
+
    +
  • The beta distribution is the default prior +for parameters between \(0\) and \(1\).
  • +
  • The beta density depends on two parameters \(\alpha\) and \(\beta\) +\[ +\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} +p ^ {\alpha - 1} (1 - p) ^ {\beta - 1} ~~~~\mbox{for} ~~ 0 \leq p \leq 1 +\]
  • +
  • The mean of the beta density is \(\alpha / (\alpha + \beta)\)
  • +
  • The variance of the beta density is +\[\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\]
  • +
  • The uniform density is the special case where \(\alpha = \beta = 1\)
  • +
+ +
+ +
+ + +
+
## Exploring the beta density
+library(manipulate)
+pvals <- seq(0.01, 0.99, length = 1000)
+manipulate(
+    plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE),
+    alpha = slider(0.01, 10, initial = 1, step = .5),
+    beta = slider(0.01, 10, initial = 1, step = .5)
+    )
+
+ +
+ +
+ + +
+

Posterior

+
+
+
    +
  • Suppose that we chose values of \(\alpha\) and \(\beta\) so that +the beta prior is indicative of our degree of belief regarding \(p\) +in the absence of data
  • +
  • Then using the rule that +\[ +\mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} +\] +and throwing out anything that doesn't depend on \(p\), we have that +\[ +\begin{align} +\mbox{Posterior} &\propto p^x(1 - p)^{n-x} \times p^{\alpha -1} (1 - p)^{\beta - 1} \\ + & = p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} +\end{align} +\]
  • +
  • This density is just another beta density with parameters +\(\tilde \alpha = x + \alpha\) and \(\tilde \beta = n - x + \beta\)
  • +
+ +
+ +
+ + +
+

Posterior mean

+
+
+

\[ +\begin{align} +E[p ~|~ X] & = \frac{\tilde \alpha}{\tilde \alpha + \tilde \beta}\\ \\ +& = \frac{x + \alpha}{x + \alpha + n - x + \beta}\\ \\ +& = \frac{x + \alpha}{n + \alpha + \beta} \\ \\ +& = \frac{x}{n} \times \frac{n}{n + \alpha + \beta} + \frac{\alpha}{\alpha + \beta} \times \frac{\alpha + \beta}{n + \alpha + \beta} \\ \\ +& = \mbox{MLE} \times \pi + \mbox{Prior Mean} \times (1 - \pi) +\end{align} +\]

+ +
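The weighted-average form of the posterior mean can be checked numerically. The sketch below uses illustrative values that are assumptions rather than part of the derivation (13 successes in 20 trials and a Beta(2, 2) prior); the names `alpha0`, `beta0` and `pi_wt` are hypothetical.

```r
# Illustrative data and prior (assumed values, chosen only for the check)
x <- 13; n <- 20
alpha0 <- 2; beta0 <- 2

# Conjugate update: the posterior is Beta(x + alpha0, n - x + beta0)
alpha_post <- x + alpha0
beta_post  <- n - x + beta0
post_mean  <- alpha_post / (alpha_post + beta_post)

# The same quantity written as MLE * pi + prior mean * (1 - pi)
mle        <- x / n
prior_mean <- alpha0 / (alpha0 + beta0)
pi_wt      <- n / (n + alpha0 + beta0)
weighted   <- mle * pi_wt + prior_mean * (1 - pi_wt)

# Both routes should agree up to floating point error
all.equal(post_mean, weighted)
```

Increasing `n` while holding `x / n` fixed pushes `pi_wt` toward 1, so the posterior mean moves toward the MLE.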
+ +
+ + +
+

Thoughts

+
+
+
    +
  • The posterior mean is a mixture of the MLE (\(\hat p\)) and the +prior mean
  • +
  • \(\pi\) goes to \(1\) as \(n\) gets large; for large \(n\) the data swamps the prior
  • +
  • For small \(n\), the prior mean dominates
  • +
  • Generalizes how science should ideally work; as data becomes +increasingly available, prior beliefs should matter less and less
  • +
  • With a prior that is degenerate at a value, no amount of data +can overcome the prior
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Suppose that in a random sample of an at-risk population +\(13\) of \(20\) subjects had hypertension. Estimate the prevalence +of hypertension in this population.
  • +
  • \(x = 13\) and \(n=20\)
  • +
  • Consider a uniform prior, \(\alpha = \beta = 1\)
  • +
  • The posterior is proportional to (see formula above) +\[ +p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^x (1 - p)^{n-x} +\] +That is, for the uniform prior, the posterior is the likelihood
  • +
  • Consider the instance where \(\alpha = \beta = 2\) (recall this prior +is humped around the point \(.5\)) the posterior is +\[ +p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^{x + 1} (1 - p)^{n-x + 1} +\]
  • +
  • The "Jeffreys prior", which has some theoretical benefits, +puts \(\alpha = \beta = .5\)
  • +
+ +
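As a rough check on the example, the three priors mentioned above can be compared side by side for \(x = 13\) and \(n = 20\). This is a minimal sketch in base R; the data frame and its column names are hypothetical and only serve the illustration.

```r
x <- 13; n <- 20

# The three priors discussed in the example
priors <- data.frame(
  name  = c("uniform", "Beta(2, 2)", "Jeffreys"),
  alpha = c(1, 2, 0.5),
  beta  = c(1, 2, 0.5)
)

# Conjugate update: posterior is Beta(alpha + x, beta + n - x)
priors$post_alpha <- priors$alpha + x
priors$post_beta  <- priors$beta + (n - x)

# Posterior mean and mode (the mode formula needs both shapes > 1,
# which holds for all three posteriors here)
priors$post_mean <- priors$post_alpha / (priors$post_alpha + priors$post_beta)
priors$post_mode <- (priors$post_alpha - 1) /
  (priors$post_alpha + priors$post_beta - 2)

print(priors)
```

Under the uniform prior the posterior mode equals the MLE \(x / n\), which is one way to see that the posterior has the same shape as the likelihood in that case.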
+ +
+ + +
+
pvals <- seq(0.01, 0.99, length = 1000)
+x <- 13; n <- 20
+myPlot <- function(alpha, beta){
+    plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE)
+    lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)), 
+            lwd = 3, col = "darkred")
+    lines(pvals, dbinom(x,n,pvals) / dbinom(x,n,x/n), lwd = 3, col = "darkblue")
+    lines(pvals, dbeta(pvals, alpha+x, beta+(n-x)) / max(dbeta(pvals, alpha+x, beta+(n-x))),
+        lwd = 3, col = "darkgreen")
+    title("red=prior,green=posterior,blue=likelihood")
+}
+manipulate(
+    myPlot(alpha, beta),
+    alpha = slider(0.01, 100, initial = 1, step = .5),
+    beta = slider(0.01, 100, initial = 1, step = .5)
+    )
+
+ +
+ +
+ + +
+

Credible intervals

+
+
+
    +
  • A Bayesian credible interval is the Bayesian analog of a confidence +interval
  • +
  • A \(95\%\) credible interval, \([a, b]\) would satisfy +\[ +P(p \in [a, b] ~|~ x) = .95 +\]
  • +
  • The best credible intervals chop off the posterior with a horizontal +line in the same way we did for likelihoods
  • +
  • These are called highest posterior density (HPD) intervals
  • +
+ +
+ +
+ + +
+

Getting HPD intervals for this example

+
+
+
    +
  • Install the binom package; then the command
  • +
+ +
library(binom)
+
+ +
## Error: there is no package called 'binom'
+
+ +
binom.bayes(13, 20, type = "highest")
+
+ +
## Error: could not find function "binom.bayes"
+
+ +

gives the HPD interval.

+ +
    +
  • The default credible level is \(95\%\) and +the default prior is the Jeffreys prior.
  • +
+ +
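If the binom package is not available, an HPD interval for a beta posterior can be approximated in base R. The sketch below is not the binom implementation. It assumes the posterior is unimodal (both shape parameters above 1), so that the shortest interval with the requested coverage is the HPD interval. The helper name `interval_width` is hypothetical.

```r
# Posterior under the Jeffreys prior (alpha = beta = 0.5) with x = 13, n = 20
x <- 13; n <- 20
shape1 <- 0.5 + x
shape2 <- 0.5 + (n - x)
level  <- 0.95

# Width of the interval that leaves probability `lower_tail` below it
# and covers `level` of the posterior mass
interval_width <- function(lower_tail) {
  qbeta(lower_tail + level, shape1, shape2) - qbeta(lower_tail, shape1, shape2)
}

# For a unimodal posterior, the shortest interval with this coverage
# is the HPD interval, so minimize the width over the lower tail mass
opt <- optimize(interval_width, interval = c(0, 1 - level), tol = 1e-8)
hpd <- c(lower = qbeta(opt$minimum, shape1, shape2),
         upper = qbeta(opt$minimum + level, shape1, shape2))
round(hpd, 4)
```

With binom installed, `binom.bayes(13, 20, type = "highest")` should report a similar interval, since it uses the same prior and credible level by default.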
+ +
+ + +
+
pvals <- seq(0.01, 0.99, length = 1000)
+x <- 13; n <- 20
+myPlot2 <- function(alpha, beta, cl){
+    plot(pvals, dbeta(pvals, alpha+x, beta+(n-x)), type = "l", lwd = 3,
+    xlab = "p", ylab = "", frame = FALSE)
+    out <- binom.bayes(x, n, type = "highest", 
+        prior.shape1 = alpha, 
+        prior.shape2 = beta, 
+        conf.level = cl)
+    p1 <- out$lower; p2 <- out$upper
+    lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha+x, beta+(n-x)), 0), 
+        type = "l", lwd = 3, col = "darkred")
+}
+manipulate(
+    myPlot2(alpha, beta, cl),
+    alpha = slider(0.01, 10, initial = 1, step = .5),
+    beta = slider(0.01, 10, initial = 1, step = .5),
+    cl = slider(0.01, 0.99, initial = 0.95, step = .01)
+    )
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/02_05_Bayes/index.md b/06_StatisticalInference/02_05_Bayes/index.md index ec53a034d..bb82da2b2 100644 --- a/06_StatisticalInference/02_05_Bayes/index.md +++ b/06_StatisticalInference/02_05_Bayes/index.md @@ -1,197 +1,200 @@ ---- -title : Bayesian inference -subtitle : Statistical Inference -author : Brian Caffo, Roger Peng, Jeff Leek -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## Bayesian analysis -- Bayesian statistics posits a *prior* on the parameter - of interest -- All inferences are then performed on the distribution of - the parameter given the data, called the posterior -- In general, - $$ - \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} - $$ -- Therefore (as we saw in diagnostic testing) the likelihood is - the factor by which our prior beliefs are updated to produce - conclusions in the light of the data - ---- -## Prior specification -- The beta distribution is the default prior - for parameters between $0$ and $1$. 
-- The beta density depends on two parameters $\alpha$ and $\beta$ -$$ -\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} - p ^ {\alpha - 1} (1 - p) ^ {\beta - 1} ~~~~\mbox{for} ~~ 0 \leq p \leq 1 -$$ -- The mean of the beta density is $\alpha / (\alpha + \beta)$ -- The variance of the beta density is -$$\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$ -- The uniform density is the special case where $\alpha = \beta = 1$ - ---- - -``` -## Exploring the beta density -library(manipulate) -pvals <- seq(0.01, 0.99, length = 1000) -manipulate( - plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE), - alpha = slider(0.01, 10, initial = 1, step = .5), - beta = slider(0.01, 10, initial = 1, step = .5) - ) -``` - ---- -## Posterior -- Suppose that we chose values of $\alpha$ and $\beta$ so that - the beta prior is indicative of our degree of belief regarding $p$ - in the absence of data -- Then using the rule that - $$ - \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} - $$ - and throwing out anything that doesn't depend on $p$, we have that -$$ -\begin{align} -\mbox{Posterior} &\propto p^x(1 - p)^{n-x} \times p^{\alpha -1} (1 - p)^{\beta - 1} \\ - & = p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} -\end{align} -$$ -- This density is just another beta density with parameters - $\tilde \alpha = x + \alpha$ and $\tilde \beta = n - x + \beta$ - - ---- -## Posterior mean - -$$ -\begin{align} -E[p ~|~ X] & = \frac{\tilde \alpha}{\tilde \alpha + \tilde \beta}\\ \\ -& = \frac{x + \alpha}{x + \alpha + n - x + \beta}\\ \\ -& = \frac{x + \alpha}{n + \alpha + \beta} \\ \\ -& = \frac{x}{n} \times \frac{n}{n + \alpha + \beta} + \frac{\alpha}{\alpha + \beta} \times \frac{\alpha + \beta}{n + \alpha + \beta} \\ \\ -& = \mbox{MLE} \times \pi + \mbox{Prior Mean} \times (1 - \pi) -\end{align} -$$ - ---- -## Thoughts - -- The posterior mean is a mixture of the MLE ($\hat p$) and the - prior mean -- $\pi$ goes to $1$ as $n$ gets large; 
for large $n$ the data swamps the prior -- For small $n$, the prior mean dominates -- Generalizes how science should ideally work; as data becomes - increasingly available, prior beliefs should matter less and less -- With a prior that is degenerate at a value, no amount of data - can overcome the prior - ---- -## Example - -- Suppose that in a random sample of an at-risk population -$13$ of $20$ subjects had hypertension. Estimate the prevalence -of hypertension in this population. -- $x = 13$ and $n=20$ -- Consider a uniform prior, $\alpha = \beta = 1$ -- The posterior is proportional to (see formula above) -$$ -p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^x (1 - p)^{n-x} -$$ -That is, for the uniform prior, the posterior is the likelihood -- Consider the instance where $\alpha = \beta = 2$ (recall this prior -is humped around the point $.5$) the posterior is -$$ -p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^{x + 1} (1 - p)^{n-x + 1} -$$ -- The "Jeffrey's prior" which has some theoretical benefits - puts $\alpha = \beta = .5$ - ---- -``` -pvals <- seq(0.01, 0.99, length = 1000) -x <- 13; n <- 20 -myPlot <- function(alpha, beta){ - plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE) - lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)), - lwd = 3, col = "darkred") - lines(pvals, dbinom(x,n,pvals) / dbinom(x,n,x/n), lwd = 3, col = "darkblue") - lines(pvals, dbeta(pvals, alpha+x, beta+(n-x)) / max(dbeta(pvals, alpha+x, beta+(n-x))), - lwd = 3, col = "darkgreen") - title("red=prior,green=posterior,blue=likelihood") -} -manipulate( - myPlot(alpha, beta), - alpha = slider(0.01, 10, initial = 1, step = .5), - beta = slider(0.01, 10, initial = 1, step = .5) - ) -``` - ---- -## Credible intervals -- A Bayesian credible interval is the Bayesian analog of a confidence - interval -- A $95\%$ credible interval, $[a, b]$ would satisfy - $$ - P(p \in [a, b] ~|~ x) = .95 - $$ -- The best credible intervals chop off the posterior 
with a horizontal - line in the same way we did for likelihoods -- These are called highest posterior density (HPD) intervals - ---- -## Getting HPD intervals for this example -- Install the \texttt{binom} package, then the command - -```r -library(binom) -binom.bayes(13, 20, type = "highest") -``` - -``` - method x n shape1 shape2 mean lower upper sig -1 bayes 13 20 13.5 7.5 0.6429 0.4423 0.8361 0.05 -``` - -gives the HPD interval. -- The default credible level is $95\%$ and -the default prior is the Jeffrey's prior. - ---- -``` -pvals <- seq(0.01, 0.99, length = 1000) -x <- 13; n <- 20 -myPlot2 <- function(alpha, beta, cl){ - plot(pvals, dbeta(pvals, alpha+x, beta+(n-x)), type = "l", lwd = 3, - xlab = "p", ylab = "", frame = FALSE) - out <- binom.bayes(x, n, type = "highest", - prior.shape1 = alpha, - prior.shape2 = beta, - conf.level = cl) - p1 <- out$lower; p2 <- out$upper - lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha+x, beta+(n-x)), 0), - type = "l", lwd = 3, col = "darkred") -} -manipulate( - myPlot2(alpha, beta, cl), - alpha = slider(0.01, 10, initial = 1, step = .5), - beta = slider(0.01, 10, initial = 1, step = .5), - cl = slider(0.01, 0.99, initial = 0.95, step = .01) - ) -``` - +--- +title : Bayesian inference +subtitle : Statistical Inference +author : Brian Caffo, Roger Peng, Jeff Leek +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Bayesian analysis +- Bayesian statistics posits a *prior* on the parameter + of interest +- All inferences are then performed on the distribution of + the parameter given the data, called the posterior +- In general, + $$ + \mbox{Posterior} \propto \mbox{Likelihood} \times 
\mbox{Prior} + $$ +- Therefore (as we saw in diagnostic testing) the likelihood is + the factor by which our prior beliefs are updated to produce + conclusions in the light of the data + +--- +## Prior specification +- The beta distribution is the default prior + for parameters between $0$ and $1$. +- The beta density depends on two parameters $\alpha$ and $\beta$ +$$ +\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} + p ^ {\alpha - 1} (1 - p) ^ {\beta - 1} ~~~~\mbox{for} ~~ 0 \leq p \leq 1 +$$ +- The mean of the beta density is $\alpha / (\alpha + \beta)$ +- The variance of the beta density is +$$\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$ +- The uniform density is the special case where $\alpha = \beta = 1$ + +--- + +``` +## Exploring the beta density +library(manipulate) +pvals <- seq(0.01, 0.99, length = 1000) +manipulate( + plot(pvals, dbeta(pvals, alpha, beta), type = "l", lwd = 3, frame = FALSE), + alpha = slider(0.01, 10, initial = 1, step = .5), + beta = slider(0.01, 10, initial = 1, step = .5) + ) +``` + +--- +## Posterior +- Suppose that we chose values of $\alpha$ and $\beta$ so that + the beta prior is indicative of our degree of belief regarding $p$ + in the absence of data +- Then using the rule that + $$ + \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior} + $$ + and throwing out anything that doesn't depend on $p$, we have that +$$ +\begin{align} +\mbox{Posterior} &\propto p^x(1 - p)^{n-x} \times p^{\alpha -1} (1 - p)^{\beta - 1} \\ + & = p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} +\end{align} +$$ +- This density is just another beta density with parameters + $\tilde \alpha = x + \alpha$ and $\tilde \beta = n - x + \beta$ + + +--- +## Posterior mean + +$$ +\begin{align} +E[p ~|~ X] & = \frac{\tilde \alpha}{\tilde \alpha + \tilde \beta}\\ \\ +& = \frac{x + \alpha}{x + \alpha + n - x + \beta}\\ \\ +& = \frac{x + \alpha}{n + \alpha + \beta} \\ \\ +& = \frac{x}{n} \times \frac{n}{n + \alpha + \beta} + 
\frac{\alpha}{\alpha + \beta} \times \frac{\alpha + \beta}{n + \alpha + \beta} \\ \\ +& = \mbox{MLE} \times \pi + \mbox{Prior Mean} \times (1 - \pi) +\end{align} +$$ + +--- +## Thoughts + +- The posterior mean is a mixture of the MLE ($\hat p$) and the + prior mean +- $\pi$ goes to $1$ as $n$ gets large; for large $n$ the data swamps the prior +- For small $n$, the prior mean dominates +- Generalizes how science should ideally work; as data becomes + increasingly available, prior beliefs should matter less and less +- With a prior that is degenerate at a value, no amount of data + can overcome the prior + +--- +## Example + +- Suppose that in a random sample of an at-risk population +$13$ of $20$ subjects had hypertension. Estimate the prevalence +of hypertension in this population. +- $x = 13$ and $n=20$ +- Consider a uniform prior, $\alpha = \beta = 1$ +- The posterior is proportional to (see formula above) +$$ +p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^x (1 - p)^{n-x} +$$ +That is, for the uniform prior, the posterior is the likelihood +- Consider the instance where $\alpha = \beta = 2$ (recall this prior +is humped around the point $.5$) the posterior is +$$ +p^{x + \alpha - 1} (1 - p)^{n - x + \beta - 1} = p^{x + 1} (1 - p)^{n-x + 1} +$$ +- The "Jeffrey's prior" which has some theoretical benefits + puts $\alpha = \beta = .5$ + +--- +``` +pvals <- seq(0.01, 0.99, length = 1000) +x <- 13; n <- 20 +myPlot <- function(alpha, beta){ + plot(0 : 1, 0 : 1, type = "n", xlab = "p", ylab = "", frame = FALSE) + lines(pvals, dbeta(pvals, alpha, beta) / max(dbeta(pvals, alpha, beta)), + lwd = 3, col = "darkred") + lines(pvals, dbinom(x,n,pvals) / dbinom(x,n,x/n), lwd = 3, col = "darkblue") + lines(pvals, dbeta(pvals, alpha+x, beta+(n-x)) / max(dbeta(pvals, alpha+x, beta+(n-x))), + lwd = 3, col = "darkgreen") + title("red=prior,green=posterior,blue=likelihood") +} +manipulate( + myPlot(alpha, beta), + alpha = slider(0.01, 100, initial = 1, step = .5), + beta = 
slider(0.01, 100, initial = 1, step = .5) + ) +``` + +--- +## Credible intervals +- A Bayesian credible interval is the Bayesian analog of a confidence + interval +- A $95\%$ credible interval, $[a, b]$ would satisfy + $$ + P(p \in [a, b] ~|~ x) = .95 + $$ +- The best credible intervals chop off the posterior with a horizontal + line in the same way we did for likelihoods +- These are called highest posterior density (HPD) intervals + +--- +## Getting HPD intervals for this example +- Install the \texttt{binom} package, then the command + +```r +library(binom) +``` + +``` +## Error: there is no package called 'binom' +``` + +```r +binom.bayes(13, 20, type = "highest") +``` + +``` +## Error: could not find function "binom.bayes" +``` + +gives the HPD interval. +- The default credible level is $95\%$ and +the default prior is the Jeffrey's prior. + +--- +``` +pvals <- seq(0.01, 0.99, length = 1000) +x <- 13; n <- 20 +myPlot2 <- function(alpha, beta, cl){ + plot(pvals, dbeta(pvals, alpha+x, beta+(n-x)), type = "l", lwd = 3, + xlab = "p", ylab = "", frame = FALSE) + out <- binom.bayes(x, n, type = "highest", + prior.shape1 = alpha, + prior.shape2 = beta, + conf.level = cl) + p1 <- out$lower; p2 <- out$upper + lines(c(p1, p1, p2, p2), c(0, dbeta(c(p1, p2), alpha+x, beta+(n-x)), 0), + type = "l", lwd = 3, col = "darkred") +} +manipulate( + myPlot2(alpha, beta, cl), + alpha = slider(0.01, 10, initial = 1, step = .5), + beta = slider(0.01, 10, initial = 1, step = .5), + cl = slider(0.01, 0.99, initial = 0.95, step = .01) + ) +``` + diff --git a/06_StatisticalInference/02_05_Bayes/index.pdf b/06_StatisticalInference/02_05_Bayes/index.pdf index c8043e4a1..ae65bc28f 100644 Binary files a/06_StatisticalInference/02_05_Bayes/index.pdf and b/06_StatisticalInference/02_05_Bayes/index.pdf differ diff --git a/06_StatisticalInference/03_01_TwoGroupIntervals/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/03_01_TwoGroupIntervals/assets/fig/unnamed-chunk-3.png new file mode 
100644 index 000000000..e5ca858ce Binary files /dev/null and b/06_StatisticalInference/03_01_TwoGroupIntervals/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/03_01_TwoGroupIntervals/index.Rmd b/06_StatisticalInference/03_01_TwoGroupIntervals/index.Rmd index d139bf681..226588ff5 100644 --- a/06_StatisticalInference/03_01_TwoGroupIntervals/index.Rmd +++ b/06_StatisticalInference/03_01_TwoGroupIntervals/index.Rmd @@ -1,186 +1,170 @@ ---- -title : Two group intervals -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## Independent group $t$ confidence intervals - -- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo -- We cannot use the paired t test because the groups are independent and may have different sample sizes -- We now present methods for comparing independent groups - ---- - -## Notation - -- Let $X_1,\ldots,X_{n_x}$ be iid 
$N(\mu_x,\sigma^2)$ -- Let $Y_1,\ldots,Y_{n_y}$ be iid $N(\mu_y, \sigma^2)$ -- Let $\bar X$, $\bar Y$, $S_x$, $S_y$ be the means and standard deviations -- Using the fact that linear combinations of normals are again normal, we know that $\bar Y - \bar X$ is also normal with mean $\mu_y - \mu_x$ and variance $\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})$ -- The pooled variance estimator $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ is a good estimator of $\sigma^2$ - ---- - -## Note - -- The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size -- If the sample sizes are the same the pooled variance estimate is the average of the group variances -- The pooled estimator is unbiased -$$ - \begin{eqnarray*} - E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ - & = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} - \end{eqnarray*} -$$ -- The pooled variance estimate is independent of $\bar Y - \bar X$ since $S_x$ is independent of $\bar X$ and $S_y$ is independent of $\bar Y$ and the groups are independent - ---- - -## Result - -- The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands -- Therefore -$$ - \begin{eqnarray*} - (n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ - & = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ - & = & \chi^2_{n_x + n_y - 2} - \end{eqnarray*} -$$ - ---- - -## Putting this all together - -- The statistic -$$ - \frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% - {\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} - = \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} -$$ -is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of 
freedom -- Therefore this statistic follows Gosset's $t$ distribution with $n_x + n_y - 2$ degrees of freedom -- Notice the form is (estimator - true value) / SE - ---- - -## Confidence interval - -- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is -$$ - \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} -$$ -- Remember this interval is assuming a constant variance across the two groups -- If there is some doubt, assume a different variance per group, which we will discuss later - ---- - - -## Example -### Based on Rosner, Fundamentals of Biostatistics - -- Comparing SBP for 8 oral contraceptive users versus 21 controls -- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg -- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg -- Pooled variance estimate -```{r} -sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2)) -132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5 -``` - ---- -```{r} -data(sleep) -x1 <- sleep$extra[sleep$group == 1] -x2 <- sleep$extra[sleep$group == 2] -n1 <- length(x1) -n2 <- length(x2) -sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2)) -md <- mean(x1) - mean(x2) -semd <- sp * sqrt(1 / n1 + 1/n2) -md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd -t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf -t.test(x1, x2, paired = TRUE)$conf -``` - ---- -## Ignoring pairing -```{r, echo = FALSE, fig.width=5, fig.height=5} -plot(c(0.5, 2.5), range(x1, x2), type = "n", frame = FALSE, xlab = "group", ylab = "Extra", axes = FALSE) -axis(2) -axis(1, at = 1 : 2, labels = c("Group 1", "Group 2")) -for (i in 1 : n1) lines(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "red") -for (i in 1 : n1) points(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "black", bg = "salmon", pch = 21, cex = 3) -``` - ---- - -## Unequal variances - -- Under unequal variances -$$ - \bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{s_x^2}{n_x} + 
\frac{\sigma_y^2}{n_y}\right) -$$ -- The statistic -$$ - \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)^{1/2}} -$$ -approximately follows Gosset's $t$ distribution with degrees of freedom equal to -$$ - \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} - {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + - \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} -$$ - ---- - -## Example - -- Comparing SBP for 8 oral contraceptive users versus 21 controls -- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg -- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg -- $df=15.04$, $t_{15.04, .975} = 2.13$ -- Interval -$$ -132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} -= [-8.91, 19.75] -$$ -- In R, `t.test(..., var.equal = FALSE)` - ---- -## Comparing other kinds of data -* For binomial data, there's lots of ways to compare two groups - * Relative risk, risk difference, odds ratio. - * Chi-squared tests, normal approximations, exact tests. -* For count data, there's also Chi-squared tests and exact tests. -* We'll leave the discussions for comparing groups of data for binary - and count data until covering glms in the regression class. -* In addition, Mathematical Biostatistics Boot Camp 2 covers many special - cases relevant to biostatistics. 
+--- +title : Two group intervals +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Independent group $t$ confidence intervals + +- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo +- We cannot use the paired t test because the groups are independent and may have different sample sizes +- We now present methods for comparing independent groups + +--- + +## Notation + +- Let $X_1,\ldots,X_{n_x}$ be iid $N(\mu_x,\sigma^2)$ +- Let $Y_1,\ldots,Y_{n_y}$ be iid $N(\mu_y, \sigma^2)$ +- Let $\bar X$, $\bar Y$, $S_x$, $S_y$ be the means and standard deviations +- Using the fact that linear combinations of normals are again normal, we know that $\bar Y - \bar X$ is also normal with mean $\mu_y - \mu_x$ and variance $\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})$ +- The pooled variance estimator $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ is a good estimator of $\sigma^2$ + +--- + +## Note + +- The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size +- If the sample sizes are the same the pooled variance estimate is the average of the group variances +- The pooled estimator is unbiased +$$ + \begin{eqnarray*} + E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ + & = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} + \end{eqnarray*} +$$ +- The pooled variance estimate is independent of $\bar Y - \bar X$ since $S_x$ is 
independent of $\bar X$ and $S_y$ is independent of $\bar Y$ and the groups are independent + +--- + +## Result + +- The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands +- Therefore +$$ + \begin{eqnarray*} + (n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ + & = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ + & = & \chi^2_{n_x + n_y - 2} + \end{eqnarray*} +$$ + +--- + +## Putting this all together + +- The statistic +$$ + \frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% + {\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} + = \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} +$$ +is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom +- Therefore this statistic follows Gosset's $t$ distribution with $n_x + n_y - 2$ degrees of freedom +- Notice the form is (estimator - true value) / SE + +--- + +## Confidence interval + +- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is +$$ + \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} +$$ +- Remember this interval is assuming a constant variance across the two groups +- If there is some doubt, assume a different variance per group, which we will discuss later + +--- + + +## Example +### Based on Rosner, Fundamentals of Biostatistics + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- Pooled variance estimate +```{r} +sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2)) +132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5 +``` + +--- +```{r} +data(sleep) +x1 <- 
sleep$extra[sleep$group == 1] +x2 <- sleep$extra[sleep$group == 2] +n1 <- length(x1) +n2 <- length(x2) +sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2)) +md <- mean(x1) - mean(x2) +semd <- sp * sqrt(1 / n1 + 1/n2) +md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd +t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf +t.test(x1, x2, paired = TRUE)$conf +``` + +--- +## Ignoring pairing +```{r, echo = FALSE, fig.width=5, fig.height=5} +plot(c(0.5, 2.5), range(x1, x2), type = "n", frame = FALSE, xlab = "group", ylab = "Extra", axes = FALSE) +axis(2) +axis(1, at = 1 : 2, labels = c("Group 1", "Group 2")) +for (i in 1 : n1) lines(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "red") +for (i in 1 : n1) points(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "black", bg = "salmon", pch = 21, cex = 3) +``` + +--- + +## Unequal variances + +- Under unequal variances +$$ + \bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) +$$ +- The statistic +$$ + \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)^{1/2}} +$$ +approximately follows Gosset's $t$ distribution with degrees of freedom equal to +$$ + \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} + {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} +$$ + +--- + +## Example + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- $df=15.04$, $t_{15.04, .975} = 2.13$ +- Interval +$$ +132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} += [-8.91, 19.75] +$$ +- In R, `t.test(..., var.equal = FALSE)` + +--- +## Comparing other kinds of data +* For binomial data, there's lots of ways to compare two groups + * Relative risk, risk difference, odds ratio. + * Chi-squared tests, normal approximations, exact tests. 
+* For count data, there's also Chi-squared tests and exact tests. +* We'll leave the discussions for comparing groups of data for binary + and count data until covering glms in the regression class. +* In addition, Mathematical Biostatistics Boot Camp 2 covers many special + cases relevant to biostatistics. diff --git a/06_StatisticalInference/03_01_TwoGroupIntervals/index.html b/06_StatisticalInference/03_01_TwoGroupIntervals/index.html index 910ae14e9..bb7feb5c5 100644 --- a/06_StatisticalInference/03_01_TwoGroupIntervals/index.html +++ b/06_StatisticalInference/03_01_TwoGroupIntervals/index.html @@ -1,408 +1,408 @@ - - - - Two group intervals - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Two group intervals

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Independent group \(t\) confidence intervals

-
-
-
    -
  • Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo
  • -
  • We cannot use the paired t test because the groups are independent and may have different sample sizes
  • -
  • We now present methods for comparing independent groups
  • -
- -
- -
- - -
-

Notation

-
-
-
    -
  • Let \(X_1,\ldots,X_{n_x}\) be iid \(N(\mu_x,\sigma^2)\)
  • -
  • Let \(Y_1,\ldots,Y_{n_y}\) be iid \(N(\mu_y, \sigma^2)\)
  • -
  • Let \(\bar X\), \(\bar Y\), \(S_x\), \(S_y\) be the means and standard deviations
  • -
  • Using the fact that linear combinations of normals are again normal, we know that \(\bar Y - \bar X\) is also normal with mean \(\mu_y - \mu_x\) and variance \(\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})\)
  • -
  • The pooled variance estimator \[S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)\] is a good estimator of \(\sigma^2\)
  • -
- -
- -
- - -
-

Note

-
-
-
    -
  • The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size
  • -
  • If the sample sizes are the same the pooled variance estimate is the average of the group variances
  • -
  • The pooled estimator is unbiased -\[ -\begin{eqnarray*} -E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ - & = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} -\end{eqnarray*} -\]
  • -
  • The pooled variance estimate is independent of \(\bar Y - \bar X\) since \(S_x\) is independent of \(\bar X\) and \(S_y\) is independent of \(\bar Y\) and the groups are independent
  • -
- -
- -
- - -
-

Result

-
-
-
    -
  • The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands
  • -
  • Therefore -\[ -\begin{eqnarray*} - (n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ - & = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ - & = & \chi^2_{n_x + n_y - 2} -\end{eqnarray*} -\]
  • -
- -
- -
- - -
-

Putting this all together

-
-
-
    -
  • The statistic -\[ -\frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% -{\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} -= \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} -\] -is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom
  • -
  • Therefore this statistic follows Gosset's \(t\) distribution with \(n_x + n_y - 2\) degrees of freedom
  • -
  • Notice the form is (estimator - true value) / SE
  • -
- -
- -
- - -
-

Confidence interval

-
-
-
    -
  • Therefore a \((1 - \alpha)\times 100\%\) confidence interval for \(\mu_y - \mu_x\) is -\[ -\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} -\]
  • -
  • Remember this interval is assuming a constant variance across the two groups
  • -
  • If there is some doubt, assume a different variance per group, which we will discuss later
  • -
- -
- -
- - -
-

Example

-
-
-

Based on Rosner, Fundamentals of Biostatistics

- -
    -
  • Comparing SBP for 8 oral contraceptive users versus 21 controls
  • -
  • \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg
  • -
  • \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
  • -
  • Pooled variance estimate
  • -
- -
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2))
-132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5
-
- -
[1] -9.521 20.361
-
- -
- -
- - -
-
data(sleep)
-x1 <- sleep$extra[sleep$group == 1]
-x2 <- sleep$extra[sleep$group == 2]
-n1 <- length(x1)
-n2 <- length(x2)
-sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2))
-md <- mean(x1) - mean(x2)
-semd <- sp * sqrt(1 / n1 + 1/n2)
-md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd
-
- -
[1] -3.3639  0.2039
-
- -
t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf
-
- -
[1] -3.3639  0.2039
-attr(,"conf.level")
-[1] 0.95
-
- -
t.test(x1, x2, paired = TRUE)$conf
-
- -
[1] -2.4599 -0.7001
-attr(,"conf.level")
-[1] 0.95
-
- -
- -
- - -
-

Ignoring pairing

-
-
-
plot of chunk unnamed-chunk-3
- -
- -
- - -
-

Unequal variances

-
-
-
    -
  • Under unequal variances -\[ -\bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) -\]
  • -
  • The statistic -\[ -\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)^{1/2}} -\] -approximately follows Gosset's \(t\) distribution with degrees of freedom equal to -\[ -\frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} -{\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + - \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} -\]
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Comparing SBP for 8 oral contraceptive users versus 21 controls
  • -
  • \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg
  • -
  • \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
  • -
  • \(df=15.04\), \(t_{15.04, .975} = 2.13\)
  • -
  • Interval -\[ -132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} -= [-8.91, 19.75] -\]
  • -
  • In R, t.test(..., var.equal = FALSE)
  • -
- -
- -
- - -
-

Comparing other kinds of data

-
-
-
    -
  • For binomial data, there's lots of ways to compare two groups - -
      -
    • Relative risk, risk difference, odds ratio.
    • -
    • Chi-squared tests, normal approximations, exact tests.
    • -
  • -
  • For count data, there's also Chi-squared tests and exact tests.
  • -
  • We'll leave the discussions for comparing groups of data for binary -and count data until covering glms in the regression class.
  • -
  • In addition, Mathematical Biostatistics Boot Camp 2 covers many special -cases relevant to biostatistics.
  • -
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Two group intervals + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Two group intervals

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Independent group \(t\) confidence intervals

+
+
+
    +
  • Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo
  • +
  • We cannot use the paired t test because the groups are independent and may have different sample sizes
  • +
  • We now present methods for comparing independent groups
  • +
+ +
+ +
+ + +
+

Notation

+
+
+
    +
  • Let \(X_1,\ldots,X_{n_x}\) be iid \(N(\mu_x,\sigma^2)\)
  • +
  • Let \(Y_1,\ldots,Y_{n_y}\) be iid \(N(\mu_y, \sigma^2)\)
  • +
  • Let \(\bar X\), \(\bar Y\), \(S_x\), \(S_y\) be the means and standard deviations
  • +
  • Using the fact that linear combinations of normals are again normal, we know that \(\bar Y - \bar X\) is also normal with mean \(\mu_y - \mu_x\) and variance \(\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})\)
  • +
  • The pooled variance estimator \[S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)\] is a good estimator of \(\sigma^2\)
  • +
+ +
+ +
+ + +
+

Note

+
+
+
    +
  • The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size
  • +
  • If the sample sizes are the same the pooled variance estimate is the average of the group variances
  • +
  • The pooled estimator is unbiased +\[ +\begin{eqnarray*} +E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ + & = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} +\end{eqnarray*} +\]
  • +
  • The pooled variance estimate is independent of \(\bar Y - \bar X\) since \(S_x\) is independent of \(\bar X\) and \(S_y\) is independent of \(\bar Y\) and the groups are independent
  • +
+ +
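The weighting described in this slide can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name `pooled_var` and the input values are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not from the slides): pooled variance computed
# from the group standard deviations and sample sizes.
pooled_var <- function(sx, sy, nx, ny) {
  ((nx - 1) * sx^2 + (ny - 1) * sy^2) / (nx + ny - 2)
}

# Equal sample sizes: the pooled variance is the plain average
# of the two group variances, here (4 + 16) / 2.
equal_n <- pooled_var(sx = 2, sy = 4, nx = 10, ny = 10)

# Unequal sample sizes: the estimate is pulled toward the variance
# of the larger group (16 in this made-up example).
unequal_n <- pooled_var(sx = 2, sy = 4, nx = 5, ny = 50)
```

Comparing `equal_n` with `unequal_n` shows how the group with more observations dominates the estimate.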
+ +
+ + +
+

Result

+
+
+
    +
  • The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands
  • +
  • Therefore +\[ +\begin{eqnarray*} + (n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ + & = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ + & = & \chi^2_{n_x + n_y - 2} +\end{eqnarray*} +\]
  • +
+ +
+ +
+ + +
+

Putting this all together

+
+
+
    +
  • The statistic +\[ +\frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% +{\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} += \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} +\] +is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom
  • +
  • Therefore this statistic follows Gosset's \(t\) distribution with \(n_x + n_y - 2\) degrees of freedom
  • +
  • Notice the form is (estimator - true value) / SE
  • +
+ +
+ +
+ + +
+

Confidence interval

+
+
+
    +
  • Therefore a \((1 - \alpha)\times 100\%\) confidence interval for \(\mu_y - \mu_x\) is +\[ +\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} +\]
  • +
  • Remember this interval is assuming a constant variance across the two groups
  • +
  • If there is some doubt, assume a different variance per group, which we will discuss later
  • +
+ +
+ +
+ + +
+

Example

+
+
+

Based on Rosner, Fundamentals of Biostatistics

+ +
    +
  • Comparing SBP for 8 oral contraceptive users versus 21 controls
  • +
  • \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg
  • +
  • \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
  • +
  • Pooled variance estimate
  • +
+ +
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2)/(8 + 21 - 2))
+132.86 - 127.44 + c(-1, 1) * qt(0.975, 27) * sp * (1/8 + 1/21)^0.5
+
+ +
## [1] -9.521 20.361
+
+ +
+ +
+ + +
+
data(sleep)
+x1 <- sleep$extra[sleep$group == 1]
+x2 <- sleep$extra[sleep$group == 2]
+n1 <- length(x1)
+n2 <- length(x2)
+sp <- sqrt(((n1 - 1) * sd(x1)^2 + (n2 - 1) * sd(x2)^2)/(n1 + n2 - 2))
+md <- mean(x1) - mean(x2)
+semd <- sp * sqrt(1/n1 + 1/n2)
+md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd
+
+ +
## [1] -3.3639  0.2039
+
+ +
t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf
+
+ +
## [1] -3.3639  0.2039
+## attr(,"conf.level")
+## [1] 0.95
+
+ +
t.test(x1, x2, paired = TRUE)$conf
+
+ +
## [1] -2.4599 -0.7001
+## attr(,"conf.level")
+## [1] 0.95
+
+ +
+ +
+ + +
+

Ignoring pairing

+
+
+

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

Unequal variances

+
+
+
    +
  • Under unequal variances +\[ +\bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) +\]
  • +
  • The statistic +\[ +\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)^{1/2}} +\] +approximately follows Gosset's \(t\) distribution with degrees of freedom equal to +\[ +\frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} +{\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} +\]
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Comparing SBP for 8 oral contraceptive users versus 21 controls
  • +
  • \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg
  • +
  • \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
  • +
  • \(df=15.04\), \(t_{15.04, .975} = 2.13\)
  • +
  • Interval +\[ +132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} += [-8.91, 19.75] +\]
  • +
  • In R, t.test(..., var.equal = FALSE)
  • +
+ +
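The unequal-variance interval above can be reproduced from the summary statistics alone. This is a sketch under the assumption that only the means, standard deviations, and sample sizes from the slide are available; the variable names are illustrative and not part of the original material. The degrees of freedom use the approximation given on the previous slide (commonly called the Welch-Satterthwaite formula).

```r
# Summary statistics from the slide (oral contraceptive users vs. controls).
m_oc <- 132.86; s_oc <- 15.34; n_oc <- 8
m_c  <- 127.44; s_c  <- 18.23; n_c  <- 21

# Estimated variance of each sample mean.
v_oc <- s_oc^2 / n_oc
v_c  <- s_c^2 / n_c

# Approximate degrees of freedom for the unequal-variance t statistic.
df_welch <- (v_oc + v_c)^2 / (v_oc^2 / (n_oc - 1) + v_c^2 / (n_c - 1))

# 95% interval for the difference in means without pooling the variances.
se_diff  <- sqrt(v_oc + v_c)
ci_welch <- (m_oc - m_c) + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(df_welch, 2)
round(ci_welch, 2)
```

With the raw observations in hand, `t.test(..., var.equal = FALSE)` performs the same calculation, so this manual version is mainly useful when only summary statistics are reported.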
+ +
+ + +
+

Comparing other kinds of data

+
+
+
    +
  • For binomial data, there's lots of ways to compare two groups + +
      +
    • Relative risk, risk difference, odds ratio.
    • +
    • Chi-squared tests, normal approximations, exact tests.
    • +
  • +
  • For count data, there's also Chi-squared tests and exact tests.
  • +
  • We'll leave the discussions for comparing groups of data for binary +and count data until covering glms in the regression class.
  • +
  • In addition, Mathematical Biostatistics Boot Camp 2 covers many special +cases relevant to biostatistics.
  • +
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_01_TwoGroupIntervals/index.md b/06_StatisticalInference/03_01_TwoGroupIntervals/index.md index 7bcda6797..d856f7887 100644 --- a/06_StatisticalInference/03_01_TwoGroupIntervals/index.md +++ b/06_StatisticalInference/03_01_TwoGroupIntervals/index.md @@ -1,197 +1,195 @@ ---- -title : Two group intervals -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## Independent group $t$ confidence intervals - -- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo -- We cannot use the paired t test because the groups are independent and may have different sample sizes -- We now present methods for comparing independent groups - ---- - -## Notation - -- Let $X_1,\ldots,X_{n_x}$ be iid $N(\mu_x,\sigma^2)$ -- Let $Y_1,\ldots,Y_{n_y}$ be iid $N(\mu_y, \sigma^2)$ -- Let $\bar X$, $\bar Y$, $S_x$, $S_y$ be the means and standard deviations -- Using the fact that linear combinations of normals are again normal, we know that $\bar Y - \bar X$ is also normal with mean $\mu_y - \mu_x$ and variance $\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})$ -- The pooled variance estimator $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ is a good estimator of $\sigma^2$ - ---- - -## Note - -- The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size -- If the sample sizes are the same the pooled variance 
estimate is the average of the group variances -- The pooled estimator is unbiased -$$ - \begin{eqnarray*} - E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ - & = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} - \end{eqnarray*} -$$ -- The pooled variance estimate is independent of $\bar Y - \bar X$ since $S_x$ is independent of $\bar X$ and $S_y$ is independent of $\bar Y$ and the groups are independent - ---- - -## Result - -- The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands -- Therefore -$$ - \begin{eqnarray*} - (n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ - & = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ - & = & \chi^2_{n_x + n_y - 2} - \end{eqnarray*} -$$ - ---- - -## Putting this all together - -- The statistic -$$ - \frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% - {\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} - = \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} -$$ -is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom -- Therefore this statistic follows Gosset's $t$ distribution with $n_x + n_y - 2$ degrees of freedom -- Notice the form is (estimator - true value) / SE - ---- - -## Confidence interval - -- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is -$$ - \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} -$$ -- Remember this interval is assuming a constant variance across the two groups -- If there is some doubt, assume a different variance per group, which we will discuss later - ---- - - -## Example -### Based on Rosner, Fundamentals of Biostatistics - -- Comparing SBP for 8 oral contraceptive 
users versus 21 controls -- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg -- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg -- Pooled variance estimate - -```r -sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2)) -132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5 -``` - -``` -[1] -9.521 20.361 -``` - - ---- - -```r -data(sleep) -x1 <- sleep$extra[sleep$group == 1] -x2 <- sleep$extra[sleep$group == 2] -n1 <- length(x1) -n2 <- length(x2) -sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2)) -md <- mean(x1) - mean(x2) -semd <- sp * sqrt(1 / n1 + 1/n2) -md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd -``` - -``` -[1] -3.3639 0.2039 -``` - -```r -t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf -``` - -``` -[1] -3.3639 0.2039 -attr(,"conf.level") -[1] 0.95 -``` - -```r -t.test(x1, x2, paired = TRUE)$conf -``` - -``` -[1] -2.4599 -0.7001 -attr(,"conf.level") -[1] 0.95 -``` - - ---- -## Ignoring pairing -
plot of chunk unnamed-chunk-3
- - ---- - -## Unequal variances - -- Under unequal variances -$$ - \bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) -$$ -- The statistic -$$ - \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{s_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)^{1/2}} -$$ -approximately follows Gosset's $t$ distribution with degrees of freedom equal to -$$ - \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} - {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + - \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} -$$ - ---- - -## Example - -- Comparing SBP for 8 oral contraceptive users versus 21 controls -- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg -- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg -- $df=15.04$, $t_{15.04, .975} = 2.13$ -- Interval -$$ -132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} -= [-8.91, 19.75] -$$ -- In R, `t.test(..., var.equal = FALSE)` - ---- -## Comparing other kinds of data -* For binomial data, there's lots of ways to compare two groups - * Relative risk, risk difference, odds ratio. - * Chi-squared tests, normal approximations, exact tests. -* For count data, there's also Chi-squared tests and exact tests. -* We'll leave the discussions for comparing groups of data for binary - and count data until covering glms in the regression class. -* In addition, Mathematical Biostatistics Boot Camp 2 covers many special - cases relevant to biostatistics. 
+--- +title : Two group intervals +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Independent group $t$ confidence intervals + +- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo +- We cannot use the paired t test because the groups are independent and may have different sample sizes +- We now present methods for comparing independent groups + +--- + +## Notation + +- Let $X_1,\ldots,X_{n_x}$ be iid $N(\mu_x,\sigma^2)$ +- Let $Y_1,\ldots,Y_{n_y}$ be iid $N(\mu_y, \sigma^2)$ +- Let $\bar X$, $\bar Y$, $S_x$, $S_y$ be the means and standard deviations +- Using the fact that linear combinations of normals are again normal, we know that $\bar Y - \bar X$ is also normal with mean $\mu_y - \mu_x$ and variance $\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})$ +- The pooled variance estimator $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ is a good estimator of $\sigma^2$ + +--- + +## Note + +- The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size +- If the sample sizes are the same the pooled variance estimate is the average of the group variances +- The pooled estimator is unbiased +$$ + \begin{eqnarray*} + E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ + & = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} + \end{eqnarray*} +$$ +- The pooled variance estimate is independent of $\bar Y - \bar X$ since $S_x$ is 
independent of $\bar X$ and $S_y$ is independent of $\bar Y$ and the groups are independent + +--- + +## Result + +- The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands +- Therefore +$$ + \begin{eqnarray*} + (n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ + & = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ + & = & \chi^2_{n_x + n_y - 2} + \end{eqnarray*} +$$ + +--- + +## Putting this all together + +- The statistic +$$ + \frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% + {\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} + = \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} +$$ +is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom +- Therefore this statistic follows Gosset's $t$ distribution with $n_x + n_y - 2$ degrees of freedom +- Notice the form is (estimator - true value) / SE + +--- + +## Confidence interval + +- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is +$$ + \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} +$$ +- Remember this interval is assuming a constant variance across the two groups +- If there is some doubt, assume a different variance per group, which we will discuss later + +--- + + +## Example +### Based on Rosner, Fundamentals of Biostatistics + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- Pooled variance estimate + +```r +sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2)/(8 + 21 - 2)) +132.86 - 127.44 + c(-1, 1) * qt(0.975, 27) * sp * (1/8 + 1/21)^0.5 +``` + +``` +## [1] -9.521 20.361 +``` + + +--- + +```r 
+data(sleep) +x1 <- sleep$extra[sleep$group == 1] +x2 <- sleep$extra[sleep$group == 2] +n1 <- length(x1) +n2 <- length(x2) +sp <- sqrt(((n1 - 1) * sd(x1)^2 + (n2 - 1) * sd(x2)^2)/(n1 + n2 - 2)) +md <- mean(x1) - mean(x2) +semd <- sp * sqrt(1/n1 + 1/n2) +md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd +``` + +``` +## [1] -3.3639 0.2039 +``` + +```r +t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf +``` + +``` +## [1] -3.3639 0.2039 +## attr(,"conf.level") +## [1] 0.95 +``` + +```r +t.test(x1, x2, paired = TRUE)$conf +``` + +``` +## [1] -2.4599 -0.7001 +## attr(,"conf.level") +## [1] 0.95 +``` + + +--- +## Ignoring pairing +![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) + + +--- + +## Unequal variances + +- Under unequal variances +$$ + \bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) +$$ +- The statistic +$$ + \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)^{1/2}} +$$ +approximately follows Gosset's $t$ distribution with degrees of freedom equal to +$$ + \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} + {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} +$$ + +--- + +## Example + +- Comparing SBP for 8 oral contraceptive users versus 21 controls +- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg +- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg +- $df=15.04$, $t_{15.04, .975} = 2.13$ +- Interval +$$ +132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} += [-8.91, 19.75] +$$ +- In R, `t.test(..., var.equal = FALSE)` + +--- +## Comparing other kinds of data +* For binomial data, there are many ways to compare two groups + * Relative risk, risk difference, odds ratio. + * Chi-squared tests, normal approximations, exact tests. +* For count data, there are also Chi-squared tests and exact tests. 
+* We'll leave the discussions for comparing groups of data for binary + and count data until covering glms in the regression class. +* In addition, Mathematical Biostatistics Boot Camp 2 covers many special + cases relevant to biostatistics. diff --git a/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf b/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf index a15898d7a..b46a7169f 100644 Binary files a/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf and b/06_StatisticalInference/03_01_TwoGroupIntervals/index.pdf differ diff --git a/06_StatisticalInference/03_02_HypothesisTesting/index.Rmd b/06_StatisticalInference/03_02_HypothesisTesting/index.Rmd index 3cfdbb6fd..79a50d8d6 100644 --- a/06_StatisticalInference/03_02_HypothesisTesting/index.Rmd +++ b/06_StatisticalInference/03_02_HypothesisTesting/index.Rmd @@ -1,215 +1,199 @@ ---- -title : Hypothesis testing -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## Hypothesis testing -* Hypothesis testing is 
concerned with making decisions using data -* A null hypothesis is specified that represents the status quo, - usually labeled $H_0$ -* The null hypothesis is assumed true and statistical evidence is required - to reject it in favor of a research or alternative hypothesis - ---- -## Example -* A respiratory disturbance index of more than $30$ events / hour, say, is - considered evidence of severe sleep disordered breathing (SDB). -* Suppose that in a sample of $100$ overweight subjects with other - risk factors for sleep disordered breathing at a sleep clinic, the - mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour. -* We might want to test the hypothesis that - * $H_0 : \mu = 30$ - * $H_a : \mu > 30$ - * where $\mu$ is the population mean RDI. - ---- -## Hypothesis testing -* The alternative hypotheses are typically of the form $<$, $>$ or $\neq$ -* Note that there are four possible outcomes of our statistical decision process - -Truth | Decide | Result | ----|---|---| -$H_0$ | $H_0$ | Correctly accept null | -$H_0$ | $H_a$ | Type I error | -$H_a$ | $H_a$ | Correctly reject null | -$H_a$ | $H_0$ | Type II error | - ---- -## Discussion -* Consider a court of law; the null hypothesis is that the - defendant is innocent -* We require evidence to reject the null hypothesis (convict) -* If we require little evidence, then we would increase the - percentage of innocent people convicted (type I errors); however we - would also increase the percentage of guilty people convicted - (correctly rejecting the null) -* If we require a lot of evidence, then we increase the the - percentage of innocent people let free (correctly accepting the - null) while we would also increase the percentage of guilty people - let free (type II errors) - ---- -## Example -* Consider our example again -* A reasonable strategy would reject the null hypothesis if - $\bar X$ was larger than some constant, say $C$ -* Typically, $C$ is chosen so that the probability of a 
Type I - error, $\alpha$, is $.05$ (or some other relevant constant) -* $\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct - ---- -## Example continued - - -$$ -\begin{align} -0.05 & = P\left(\bar X \geq C ~|~ \mu = 30 \right) \\ - & = P\left(\frac{\bar X - 30}{10 / \sqrt{100}} \geq \frac{C - 30}{10/\sqrt{100}} ~|~ \mu = 30\right) \\ - & = P\left(Z \geq \frac{C - 30}{1}\right) \\ -\end{align} -$$ - -* Hence $(C - 30) / 1 = 1.645$ implying $C = 31.645$ -* Since our mean is $32$ we reject the null hypothesis - ---- -## Discussion -* In general we don't convert $C$ back to the original scale -* We would just reject because the Z-score; which is how many - standard errors the sample mean is above the hypothesized mean - $$ - \frac{32 - 30}{10 / \sqrt{100}} = 2 - $$ - is greater than $1.645$ -* Or, whenever $\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}$ - ---- -## General rules -* The $Z$ test for $H_0:\mu = \mu_0$ versus - * $H_1: \mu < \mu_0$ - * $H_2: \mu \neq \mu_0$ - * $H_3: \mu > \mu_0$ -* Test statistic $ TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} $ -* Reject the null hypothesis when - * $TS \leq -Z_{1 - \alpha}$ - * $|TS| \geq Z_{1 - \alpha / 2}$ - * $TS \geq Z_{1 - \alpha}$ - ---- -## Notes -* We have fixed $\alpha$ to be low, so if we reject $H_0$ (either - our model is wrong) or there is a low probability that we have made - an error -* We have not fixed the probability of a type II error, $\beta$; - therefore we tend to say ``Fail to reject $H_0$'' rather than - accepting $H_0$ -* Statistical significance is no the same as scientific - significance -* The region of TS values for which you reject $H_0$ is called the - rejection region - ---- -## More notes -* The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough - for it to apply -* If $n$ is small, then a Gossett's $T$ test is performed exactly in the same way, - with the normal quantiles replaced by the 
appropriate Student's $T$ quantiles and - $n-1$ df -* The probability of rejecting the null hypothesis when it is false is called *power* -* Power is a used a lot to calculate sample sizes for experiments - ---- -## Example reconsidered -- Consider our example again. Suppose that $n= 16$ (rather than -$100$). Then consider that -$$ -.05 = P\left(\frac{\bar X - 30}{s / \sqrt{16}} \geq t_{1-\alpha, 15} ~|~ \mu = 30 \right) -$$ -- So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8 $, while the critical value is $t_{1-\alpha, 15} = 1.75$. -- We now fail to reject. - ---- -## Two sided tests -* Suppose that we would reject the null hypothesis if in fact the - mean was too large or too small -* That is, we want to test the alternative $H_a : \mu \neq 30$ - (doesn't make a lot of sense in our setting) -* Then note -$$ - \alpha = P\left(\left. \left|\frac{\bar X - 30}{s /\sqrt{16}}\right| > t_{1-\alpha/2,15} ~\right|~ \mu = 30\right) -$$ -* That is we will reject if the test statistic, $0.8$, is either - too large or too small, but the critical value is calculated using - $\alpha / 2$ -* In our example the critical value is $2.13$, so we fail to reject. - ---- -## T test in R -```{r} -library(UsingR); data(father.son) -t.test(father.son$sheight - father.son$fheight) -``` - ---- -## Connections with confidence intervals -* Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ -* Take the set of all possible values for which you fail to reject $H_0$, this set is a $(1-\alpha)100\%$ confidence interval for $\mu$ -* The same works in reverse; if a $(1-\alpha)100\%$ interval - contains $\mu_0$, then we *fail to* reject $H_0$ - ---- -## Exact binomial test -- Recall this problem, *Suppose a friend has $8$ children, $7$ of which are girls and none are twins* -- Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$ - - What is the relevant rejection region so that the probability of rejecting is (less than) 5%? 
- -Rejection region | Type I error rate | ----|---| -[0 : 8] | `r pbinom(-1, size = 8, p = .5, lower.tail = FALSE)` -[1 : 8] | `r pbinom( 0, size = 8, p = .5, lower.tail = FALSE)` -[2 : 8] | `r pbinom( 1, size = 8, p = .5, lower.tail = FALSE)` -[3 : 8] | `r pbinom( 2, size = 8, p = .5, lower.tail = FALSE)` -[4 : 8] | `r pbinom( 3, size = 8, p = .5, lower.tail = FALSE)` -[5 : 8] | `r pbinom( 4, size = 8, p = .5, lower.tail = FALSE)` -[6 : 8] | `r pbinom( 5, size = 8, p = .5, lower.tail = FALSE)` -[7 : 8] | `r pbinom( 6, size = 8, p = .5, lower.tail = FALSE)` -[8 : 8] | `r pbinom( 7, size = 8, p = .5, lower.tail = FALSE)` - ---- -## Notes -* It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. - * The closest is the rejection region [7 : 8] - * Any alpha level lower than `r 1 / 2 ^8` is not attainable. -* For larger sample sizes, we could do a normal approximation, but you already knew this. -* Two sided test isn't obvious. - * Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW) -* For these problems, people always create a P-value (next lecture) rather than computing the rejection region. 
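The rejection-region table above can be reproduced with a short base R sketch. The variable names (`n`, `p0`, `type1`, `k_min`) are illustrative assumptions and do not come from the slides; only `pbinom` is used, as in the table itself.

```r
# Exact binomial test of H_0: p = 0.5 versus H_a: p > 0.5 with n = 8 children
n     <- 8
p0    <- 0.5
alpha <- 0.05

# For each cutoff k the rejection region is [k : n];
# its Type I error rate is P(X >= k) when p = p0
k     <- 0:n
type1 <- pbinom(k - 1, size = n, prob = p0, lower.tail = FALSE)
names(type1) <- paste0("[", k, " : ", n, "]")
print(round(type1, 4))

# Smallest cutoff whose Type I error rate does not exceed alpha
k_min <- min(k[type1 <= alpha])
```

Under these assumptions `k_min` is 7, which corresponds to the region [7 : 8] with error rate 9/256.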
- - +--- +title : Hypothesis testing +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Hypothesis testing +* Hypothesis testing is concerned with making decisions using data +* A null hypothesis is specified that represents the status quo, + usually labeled $H_0$ +* The null hypothesis is assumed true and statistical evidence is required + to reject it in favor of a research or alternative hypothesis + +--- +## Example +* A respiratory disturbance index of more than $30$ events / hour, say, is + considered evidence of severe sleep disordered breathing (SDB). +* Suppose that in a sample of $100$ overweight subjects with other + risk factors for sleep disordered breathing at a sleep clinic, the + mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour. +* We might want to test the hypothesis that + * $H_0 : \mu = 30$ + * $H_a : \mu > 30$ + * where $\mu$ is the population mean RDI. 
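As a rough sketch, the one-sided test of these hypotheses can be carried out with a normal approximation in a few lines of R. The numbers are the ones quoted in the example; the variable names and the choice of $\alpha = 0.05$ are assumptions made for illustration.

```r
# One-sided Z test for the RDI example (numbers taken from the slide)
n     <- 100   # sample size
xbar  <- 32    # sample mean RDI, events / hour
s     <- 10    # sample standard deviation
mu0   <- 30    # mean under H_0
alpha <- 0.05  # assumed Type I error rate

se     <- s / sqrt(n)          # standard error of the mean
z      <- (xbar - mu0) / se    # standard errors above mu0
z_crit <- qnorm(1 - alpha)     # upper-tail normal quantile
cutoff <- mu0 + z_crit * se    # same threshold on the original scale

reject_z <- z >= z_crit        # TRUE means reject H_0
```

Comparing `z` with `z_crit` and comparing `xbar` with `cutoff` are the same decision expressed on two scales.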
+ +--- +## Hypothesis testing +* The alternative hypotheses are typically of the form $<$, $>$ or $\neq$ +* Note that there are four possible outcomes of our statistical decision process + +Truth | Decide | Result | +---|---|---| +$H_0$ | $H_0$ | Correctly accept null | +$H_0$ | $H_a$ | Type I error | +$H_a$ | $H_a$ | Correctly reject null | +$H_a$ | $H_0$ | Type II error | + +--- +## Discussion +* Consider a court of law; the null hypothesis is that the + defendant is innocent +* We require evidence to reject the null hypothesis (convict) +* If we require little evidence, then we would increase the + percentage of innocent people convicted (type I errors); however we + would also increase the percentage of guilty people convicted + (correctly rejecting the null) +* If we require a lot of evidence, then we increase the + percentage of innocent people let free (correctly accepting the + null) while we would also increase the percentage of guilty people + let free (type II errors) + +--- +## Example +* Consider our example again +* A reasonable strategy would reject the null hypothesis if + $\bar X$ was larger than some constant, say $C$ +* Typically, $C$ is chosen so that the probability of a Type I + error, $\alpha$, is $.05$ (or some other relevant constant) +* $\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct + +--- +## Example continued + + +$$ +\begin{align} +0.05 & = P\left(\bar X \geq C ~|~ \mu = 30 \right) \\ + & = P\left(\frac{\bar X - 30}{10 / \sqrt{100}} \geq \frac{C - 30}{10/\sqrt{100}} ~|~ \mu = 30\right) \\ + & = P\left(Z \geq \frac{C - 30}{1}\right) \\ +\end{align} +$$ + +* Hence $(C - 30) / 1 = 1.645$ implying $C = 31.645$ +* Since our mean is $32$ we reject the null hypothesis + +--- +## Discussion +* In general we don't convert $C$ back to the original scale +* We would just reject because the Z-score, which is how many + standard errors the sample mean is above the
hypothesized mean, + $$ + \frac{32 - 30}{10 / \sqrt{100}} = 2 + $$ + is greater than $1.645$ +* Or, whenever $\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}$ + +--- +## General rules +* The $Z$ test for $H_0:\mu = \mu_0$ versus + * $H_1: \mu < \mu_0$ + * $H_2: \mu \neq \mu_0$ + * $H_3: \mu > \mu_0$ +* Test statistic $TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$ +* Reject the null hypothesis when + * $TS \leq -Z_{1 - \alpha}$ + * $|TS| \geq Z_{1 - \alpha / 2}$ + * $TS \geq Z_{1 - \alpha}$ + +--- +## Notes +* We have fixed $\alpha$ to be low, so if we reject $H_0$, either + our model is wrong or there is a low probability that we have made + an error +* We have not fixed the probability of a type II error, $\beta$; + therefore we tend to say "Fail to reject $H_0$" rather than + accepting $H_0$ +* Statistical significance is not the same as scientific + significance +* The region of TS values for which you reject $H_0$ is called the + rejection region + +--- +## More notes +* The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough + for it to apply +* If $n$ is small, then Gosset's $T$ test is performed in exactly the same way, + with the normal quantiles replaced by the appropriate Student's $T$ quantiles and + $n-1$ df +* The probability of rejecting the null hypothesis when it is false is called *power* +* Power is used a lot to calculate sample sizes for experiments + +--- +## Example reconsidered +- Consider our example again. Suppose that $n= 16$ (rather than +$100$). Then consider that +$$ +.05 = P\left(\frac{\bar X - 30}{s / \sqrt{16}} \geq t_{1-\alpha, 15} ~|~ \mu = 30 \right) +$$ +- So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8$, while the critical value is $t_{1-\alpha, 15} = 1.75$. +- We now fail to reject. 
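The small-sample calculation above can be checked with a minimal R sketch. The variable names are illustrative assumptions; the numbers ($n = 16$, mean $32$, standard deviation $10$, $\mu_0 = 30$, $\alpha = 0.05$) are the ones used in the example.

```r
# One-sided t test for the RDI example with n = 16
n     <- 16    # sample size
xbar  <- 32    # sample mean RDI, events / hour
s     <- 10    # sample standard deviation
mu0   <- 30    # mean under H_0
alpha <- 0.05

ts     <- sqrt(n) * (xbar - mu0) / s   # test statistic
t_crit <- qt(1 - alpha, df = n - 1)    # upper-tail t quantile, 15 df

reject_t <- ts >= t_crit               # FALSE means fail to reject H_0
```

With the same mean and standard deviation, the smaller sample gives a larger standard error and a larger critical value, so the statistic no longer clears the threshold.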
+ +--- +## Two sided tests +* Suppose that we would reject the null hypothesis if in fact the + mean was too large or too small +* That is, we want to test the alternative $H_a : \mu \neq 30$ + (doesn't make a lot of sense in our setting) +* Then note +$$ + \alpha = P\left(\left. \left|\frac{\bar X - 30}{s /\sqrt{16}}\right| > t_{1-\alpha/2,15} ~\right|~ \mu = 30\right) +$$ +* That is we will reject if the test statistic, $0.8$, is either + too large or too small, but the critical value is calculated using + $\alpha / 2$ +* In our example the critical value is $2.13$, so we fail to reject. + +--- +## T test in R +```{r} +library(UsingR); data(father.son) +t.test(father.son$sheight - father.son$fheight) +``` + +--- +## Connections with confidence intervals +* Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ +* Take the set of all possible values for which you fail to reject $H_0$, this set is a $(1-\alpha)100\%$ confidence interval for $\mu$ +* The same works in reverse; if a $(1-\alpha)100\%$ interval + contains $\mu_0$, then we *fail to* reject $H_0$ + +--- +## Exact binomial test +- Recall this problem, *Suppose a friend has $8$ children, $7$ of which are girls and none are twins* +- Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$ + - What is the relevant rejection region so that the probability of rejecting is (less than) 5%? 
+ +Rejection region | Type I error rate | +---|---| +[0 : 8] | `r pbinom(-1, size = 8, p = .5, lower.tail = FALSE)` +[1 : 8] | `r pbinom( 0, size = 8, p = .5, lower.tail = FALSE)` +[2 : 8] | `r pbinom( 1, size = 8, p = .5, lower.tail = FALSE)` +[3 : 8] | `r pbinom( 2, size = 8, p = .5, lower.tail = FALSE)` +[4 : 8] | `r pbinom( 3, size = 8, p = .5, lower.tail = FALSE)` +[5 : 8] | `r pbinom( 4, size = 8, p = .5, lower.tail = FALSE)` +[6 : 8] | `r pbinom( 5, size = 8, p = .5, lower.tail = FALSE)` +[7 : 8] | `r pbinom( 6, size = 8, p = .5, lower.tail = FALSE)` +[8 : 8] | `r pbinom( 7, size = 8, p = .5, lower.tail = FALSE)` + +--- +## Notes +* It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. + * The closest is the rejection region [7 : 8] + * Any alpha level lower than `r 1 / 2 ^8` is not attainable. +* For larger sample sizes, we could do a normal approximation, but you already knew this. +* Two sided test isn't obvious. + * Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW) +* For these problems, people always create a P-value (next lecture) rather than computing the rejection region. + + diff --git a/06_StatisticalInference/03_02_HypothesisTesting/index.html b/06_StatisticalInference/03_02_HypothesisTesting/index.html index e8cc58b1c..a855fe1a6 100644 --- a/06_StatisticalInference/03_02_HypothesisTesting/index.html +++ b/06_StatisticalInference/03_02_HypothesisTesting/index.html @@ -1,582 +1,583 @@ - - - - Hypothesis testing - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Hypothesis testing

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Hypothesis testing

-
-
-
    -
  • Hypothesis testing is concerned with making decisions using data
  • -
  • A null hypothesis is specified that represents the status quo, -usually labeled \(H_0\)
  • -
  • The null hypothesis is assumed true and statistical evidence is required -to reject it in favor of a research or alternative hypothesis
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • A respiratory disturbance index of more than \(30\) events / hour, say, is -considered evidence of severe sleep disordered breathing (SDB).
  • -
  • Suppose that in a sample of \(100\) overweight subjects with other -risk factors for sleep disordered breathing at a sleep clinic, the -mean RDI was \(32\) events / hour with a standard deviation of \(10\) events / hour.
  • -
  • We might want to test the hypothesis that - -
      -
    • \(H_0 : \mu = 30\)
    • -
    • \(H_a : \mu > 30\)
    • -
    • where \(\mu\) is the population mean RDI.
    • -
  • -
- -
- -
- - -
-

Hypothesis testing

-
-
-
    -
  • The alternative hypotheses are typically of the form \(<\), \(>\) or \(\neq\)
  • -
  • Note that there are four possible outcomes of our statistical decision process
  • -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TruthDecideResult
\(H_0\)\(H_0\)Correctly accept null
\(H_0\)\(H_a\)Type I error
\(H_a\)\(H_a\)Correctly reject null
\(H_a\)\(H_0\)Type II error
- -
- -
- - -
-

Discussion

-
-
-
    -
  • Consider a court of law; the null hypothesis is that the -defendant is innocent
  • -
  • We require evidence to reject the null hypothesis (convict)
  • -
  • If we require little evidence, then we would increase the -percentage of innocent people convicted (type I errors); however we -would also increase the percentage of guilty people convicted -(correctly rejecting the null)
  • -
  • If we require a lot of evidence, then we increase the -percentage of innocent people let free (correctly accepting the -null) while we would also increase the percentage of guilty people -let free (type II errors)
  • -
- -
- -
- - -
-

Example

-
-
-
    -
  • Consider our example again
  • -
  • A reasonable strategy would reject the null hypothesis if -\(\bar X\) was larger than some constant, say \(C\)
  • -
  • Typically, \(C\) is chosen so that the probability of a Type I -error, \(\alpha\), is \(.05\) (or some other relevant constant)
  • -
  • \(\alpha\) = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct
  • -
- -
- -
- - -
-

Example continued

-
-
-

\[ -\begin{align} -0.05 & = P\left(\bar X \geq C ~|~ \mu = 30 \right) \\ - & = P\left(\frac{\bar X - 30}{10 / \sqrt{100}} \geq \frac{C - 30}{10/\sqrt{100}} ~|~ \mu = 30\right) \\ - & = P\left(Z \geq \frac{C - 30}{1}\right) \\ -\end{align} -\]

- -
    -
  • Hence \((C - 30) / 1 = 1.645\) implying \(C = 31.645\)
  • -
  • Since our mean is \(32\) we reject the null hypothesis
  • -
- -
- -
- - -
-

Discussion

-
-
-
    -
  • In general we don't convert \(C\) back to the original scale
  • -
  • We would just reject because the Z-score, which is how many -standard errors the sample mean is above the hypothesized mean, -\[ -\frac{32 - 30}{10 / \sqrt{100}} = 2 -\] -is greater than \(1.645\)
  • -
  • Or, whenever \(\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}\)
  • -
- -
- -
- - -
-

General rules

-
-
-
    -
  • The \(Z\) test for \(H_0:\mu = \mu_0\) versus - -
      -
    • \(H_1: \mu < \mu_0\)
    • -
    • \(H_2: \mu \neq \mu_0\)
    • -
    • \(H_3: \mu > \mu_0\)
    • -
  • -
  • Test statistic \(TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}\)
  • -
  • Reject the null hypothesis when - -
      -
    • \(TS \leq -Z_{1 - \alpha}\)
    • -
    • \(|TS| \geq Z_{1 - \alpha / 2}\)
    • -
    • \(TS \geq Z_{1 - \alpha}\)
    • -
  • -
- -
- -
- - -
-

Notes

-
-
-
    -
  • We have fixed \(\alpha\) to be low, so if we reject \(H_0\), either -our model is wrong or there is a low probability that we have made -an error
  • -
  • We have not fixed the probability of a type II error, \(\beta\); -therefore we tend to say "Fail to reject \(H_0\)" rather than -accepting \(H_0\)
  • -
  • Statistical significance is not the same as scientific -significance
  • -
  • The region of TS values for which you reject \(H_0\) is called the -rejection region
  • -
- -
- -
- - -
-

More notes

-
-
-
    -
  • The \(Z\) test requires the assumptions of the CLT and for \(n\) to be large enough -for it to apply
  • -
  • If \(n\) is small, then Gosset's \(T\) test is performed in exactly the same way, -with the normal quantiles replaced by the appropriate Student's \(T\) quantiles and -\(n-1\) df
  • -
  • The probability of rejecting the null hypothesis when it is false is called power
  • -
  • Power is used a lot to calculate sample sizes for experiments
  • -
- -
- -
- - -
-

Example reconsidered

-
-
-
    -
  • Consider our example again. Suppose that \(n= 16\) (rather than -\(100\)). Then consider that -\[ -.05 = P\left(\frac{\bar X - 30}{s / \sqrt{16}} \geq t_{1-\alpha, 15} ~|~ \mu = 30 \right) -\]
  • -
  • So that our test statistic is now \(\sqrt{16}(32 - 30) / 10 = 0.8\), while the critical value is \(t_{1-\alpha, 15} = 1.75\).
  • -
  • We now fail to reject.
  • -
- -
- -
- - -
-

Two sided tests

-
-
-
    -
  • Suppose that we would reject the null hypothesis if in fact the -mean was too large or too small
  • -
  • That is, we want to test the alternative \(H_a : \mu \neq 30\) -(doesn't make a lot of sense in our setting)
  • -
  • Then note -\[ -\alpha = P\left(\left. \left|\frac{\bar X - 30}{s /\sqrt{16}}\right| > t_{1-\alpha/2,15} ~\right|~ \mu = 30\right) -\]
  • -
  • That is we will reject if the test statistic, \(0.8\), is either -too large or too small, but the critical value is calculated using -\(\alpha / 2\)
  • -
  • In our example the critical value is \(2.13\), so we fail to reject.
  • -
- -
- -
- - -
-

T test in R

-
-
-
library(UsingR); data(father.son)
-t.test(father.son$sheight - father.son$fheight)
-
- -

-    One Sample t-test
-
-data:  father.son$sheight - father.son$fheight
-t = 11.79, df = 1077, p-value < 2.2e-16
-alternative hypothesis: true mean is not equal to 0
-95 percent confidence interval:
- 0.831 1.163
-sample estimates:
-mean of x 
-    0.997 
-
- -
- -
- - -
-

Connections with confidence intervals

-
-
-
    -
  • Consider testing \(H_0: \mu = \mu_0\) versus \(H_a: \mu \neq \mu_0\)
  • -
  • Take the set of all possible values for which you fail to reject \(H_0\), this set is a \((1-\alpha)100\%\) confidence interval for \(\mu\)
  • -
  • The same works in reverse; if a \((1-\alpha)100\%\) interval -contains \(\mu_0\), then we fail to reject \(H_0\)
  • -
- -
- -
- - -
-

Exact binomial test

-
-
-
    -
  • Recall this problem, Suppose a friend has \(8\) children, \(7\) of which are girls and none are twins
  • -
  • Perform the relevant hypothesis test. \(H_0 : p = 0.5\) \(H_a : p > 0.5\) - -
      -
    • What is the relevant rejection region so that the probability of rejecting is (less than) 5%?
    • -
  • -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rejection regionType I error rate
[0 : 8]1
[1 : 8]0.9961
[2 : 8]0.9648
[3 : 8]0.8555
[4 : 8]0.6367
[5 : 8]0.3633
[6 : 8]0.1445
[7 : 8]0.0352
[8 : 8]0.0039
- -
- -
- - -
-

Notes

-
-
-
    -
  • It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. - -
      -
    • The closest is the rejection region [7 : 8]
    • -
    • Any alpha level lower than 0.0039 is not attainable.
    • -
  • -
  • For larger sample sizes, we could do a normal approximation, but you already knew this.
  • -
  • Two sided test isn't obvious. - -
      -
    • Given a way to do two sided tests, we could take the set of values of \(p_0\) for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW)
    • -
  • -
  • For these problems, people always create a P-value (next lecture) rather than computing the rejection region.
  • -
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Hypothesis testing + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Hypothesis testing

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Hypothesis testing

+
+
+
    +
  • Hypothesis testing is concerned with making decisions using data
  • +
  • A null hypothesis is specified that represents the status quo, +usually labeled \(H_0\)
  • +
  • The null hypothesis is assumed true and statistical evidence is required +to reject it in favor of a research or alternative hypothesis
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • A respiratory disturbance index of more than \(30\) events / hour, say, is +considered evidence of severe sleep disordered breathing (SDB).
  • +
  • Suppose that in a sample of \(100\) overweight subjects with other +risk factors for sleep disordered breathing at a sleep clinic, the +mean RDI was \(32\) events / hour with a standard deviation of \(10\) events / hour.
  • +
  • We might want to test the hypothesis that + +
      +
    • \(H_0 : \mu = 30\)
    • +
    • \(H_a : \mu > 30\)
    • +
    • where \(\mu\) is the population mean RDI.
    • +
  • +
+ +
+ +
+ + +
+

Hypothesis testing

+
+
+
    +
  • The alternative hypotheses are typically of the form \(<\), \(>\) or \(\neq\)
  • +
  • Note that there are four possible outcomes of our statistical decision process
  • +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + +
TruthDecideResult
\(H_0\)\(H_0\)Correctly accept null
\(H_0\)\(H_a\)Type I error
\(H_a\)\(H_a\)Correctly reject null
\(H_a\)\(H_0\)Type II error
+ +
+ +
+ + +
+

Discussion

+
+
+
    +
  • Consider a court of law; the null hypothesis is that the +defendant is innocent
  • +
  • We require evidence to reject the null hypothesis (convict)
  • +
  • If we require little evidence, then we would increase the +percentage of innocent people convicted (type I errors); however we +would also increase the percentage of guilty people convicted +(correctly rejecting the null)
  • +
  • If we require a lot of evidence, then we increase the +percentage of innocent people let free (correctly accepting the +null) while we would also increase the percentage of guilty people +let free (type II errors)
  • +
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Consider our example again
  • +
  • A reasonable strategy would reject the null hypothesis if +\(\bar X\) was larger than some constant, say \(C\)
  • +
  • Typically, \(C\) is chosen so that the probability of a Type I +error, \(\alpha\), is \(.05\) (or some other relevant constant)
  • +
  • \(\alpha\) = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct
  • +
+ +
+ +
+ + +
+

Example continued

+
+
+

\[ +\begin{align} +0.05 & = P\left(\bar X \geq C ~|~ \mu = 30 \right) \\ + & = P\left(\frac{\bar X - 30}{10 / \sqrt{100}} \geq \frac{C - 30}{10/\sqrt{100}} ~|~ \mu = 30\right) \\ + & = P\left(Z \geq \frac{C - 30}{1}\right) \\ +\end{align} +\]

+ +
    +
  • Hence \((C - 30) / 1 = 1.645\) implying \(C = 31.645\)
  • +
  • Since our mean is \(32\) we reject the null hypothesis
  • +
+ +
+ +
+ + +
+

Discussion

+
+
+
    +
  • In general we don't convert \(C\) back to the original scale
  • +
  • We would just reject because the Z-score, which is how many +standard errors the sample mean is above the hypothesized mean, +\[ +\frac{32 - 30}{10 / \sqrt{100}} = 2 +\] +is greater than \(1.645\)
  • +
  • Or, whenever \(\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}\)
  • +
+ +
+ +
+ + +
+

General rules

+
+
+
    +
  • The \(Z\) test for \(H_0:\mu = \mu_0\) versus + +
      +
    • \(H_1: \mu < \mu_0\)
    • +
    • \(H_2: \mu \neq \mu_0\)
    • +
    • \(H_3: \mu > \mu_0\)
    • +
  • +
  • Test statistic \(TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}\)
  • +
  • Reject the null hypothesis when + +
      +
    • \(TS \leq -Z_{1 - \alpha}\)
    • +
    • \(|TS| \geq Z_{1 - \alpha / 2}\)
    • +
    • \(TS \geq Z_{1 - \alpha}\)
    • +
  • +
+ +
+ +
+ + +
+

Notes

+
+
+
    +
  • We have fixed \(\alpha\) to be low, so if we reject \(H_0\) (either +our model is wrong) or there is a low probability that we have made +an error
  • +
  • We have not fixed the probability of a type II error, \(\beta\); +therefore we tend to say ``Fail to reject \(H_0\)'' rather than +accepting \(H_0\)
  • +
  • Statistical significance is not the same as scientific +significance
  • +
  • The region of TS values for which you reject \(H_0\) is called the +rejection region
  • +
+ +
+ +
+ + +
+

More notes

+
+
+
    +
  • The \(Z\) test requires the assumptions of the CLT and for \(n\) to be large enough +for it to apply
  • +
  • If \(n\) is small, then Gosset's \(T\) test is performed in exactly the same way, +with the normal quantiles replaced by the appropriate Student's \(T\) quantiles with +\(n-1\) df
  • +
  • The probability of rejecting the null hypothesis when it is false is called power
  • +
  • Power is used a lot to calculate sample sizes for experiments
  • +
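As a rough illustration of how power feeds into sample size, base R's `power.t.test` can be applied to the RDI setting. The inputs below are assumptions made for this sketch: a true mean of 32 (so `delta = 2`), a standard deviation of 10, a one sided one sample test, and a target power of 0.8.

```r
# Power of the one sided, one sample T test with n = 16
pw16 <- power.t.test(n = 16, delta = 2, sd = 10, sig.level = 0.05,
                     type = "one.sample", alternative = "one.sided")$power
pw16

# Sample size needed to reach 80% power under the same assumptions
n80 <- power.t.test(power = 0.8, delta = 2, sd = 10, sig.level = 0.05,
                    type = "one.sample", alternative = "one.sided")$n
ceiling(n80)   # round up to a whole number of subjects
```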
+ +
+ +
+ + +
+

Example reconsidered

+
+
+
    +
  • Consider our example again. Suppose that \(n= 16\) (rather than +\(100\)). Then consider that +\[ +.05 = P\left(\frac{\bar X - 30}{s / \sqrt{16}} \geq t_{1-\alpha, 15} ~|~ \mu = 30 \right) +\]
  • +
  • Our test statistic is now \(\sqrt{16}(32 - 30) / 10 = 0.8\), while the critical value is \(t_{1-\alpha, 15} = 1.75\).
  • +
  • We now fail to reject.
  • +
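The same comparison can be checked in R. A minimal sketch with the numbers from the example (variable names are illustrative):

```r
n    <- 16
ts   <- sqrt(n) * (32 - 30) / 10   # test statistic, 0.8
crit <- qt(0.95, df = n - 1)       # one sided critical value, about 1.75
ts >= crit                         # FALSE, so we fail to reject H_0
```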
+ +
+ +
+ + +
+

Two sided tests

+
+
+
    +
  • Suppose that we would reject the null hypothesis if in fact the +mean was too large or too small
  • +
  • That is, we want to test the alternative \(H_a : \mu \neq 30\) +(doesn't make a lot of sense in our setting)
  • +
  • Then note +\[ +\alpha = P\left(\left. \left|\frac{\bar X - 30}{s /\sqrt{16}}\right| > t_{1-\alpha/2,15} ~\right|~ \mu = 30\right) +\]
  • +
  • That is, we will reject if the test statistic, \(0.8\), is either +too large or too small, but the critical value is calculated using +\(\alpha / 2\)
  • +
  • In our example the critical value is \(2.13\), so we fail to reject.
  • +
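For the two sided version, only the quantile changes. A short sketch with the same example values:

```r
ts        <- 0.8                  # test statistic from the n = 16 example
crit_two  <- qt(0.975, df = 15)   # uses alpha / 2 in each tail, about 2.13
abs(ts) > crit_two                # FALSE, so we fail to reject H_0
```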
+ +
+ +
+ + +
+

T test in R

+
+
+
library(UsingR)
+data(father.son)
+t.test(father.son$sheight - father.son$fheight)
+
+ +
## 
+##  One Sample t-test
+## 
+## data:  father.son$sheight - father.son$fheight
+## t = 11.79, df = 1077, p-value < 2.2e-16
+## alternative hypothesis: true mean is not equal to 0
+## 95 percent confidence interval:
+##  0.831 1.163
+## sample estimates:
+## mean of x 
+##     0.997
+
+ +
+ +
+ + +
+

Connections with confidence intervals

+
+
+
    +
  • Consider testing \(H_0: \mu = \mu_0\) versus \(H_a: \mu \neq \mu_0\)
  • +
  • The set of all possible values of \(\mu_0\) for which you fail to reject \(H_0\) is a \((1-\alpha)100\%\) confidence interval for \(\mu\)
  • +
  • The same works in reverse; if a \((1-\alpha)100\%\) interval +contains \(\mu_0\), then we fail to reject \(H_0\)
  • +
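The duality can be checked numerically. The sketch below uses simulated data (an assumption made purely for illustration, not data from the course) and compares, over a grid of candidate \(\mu_0\) values, whether the two sided T test fails to reject with whether \(\mu_0\) lies inside the 95% interval.

```r
set.seed(1)                            # simulated data, illustrative only
x  <- rnorm(16, mean = 32, sd = 10)
ci <- t.test(x)$conf.int               # 95% T interval for mu

mu0_grid <- seq(20, 45, by = 0.5)      # candidate null values
fail_to_reject <- sapply(mu0_grid,
                         function(m) t.test(x, mu = m)$p.value >= 0.05)
inside_ci <- mu0_grid >= ci[1] & mu0_grid <= ci[2]

agree <- all(fail_to_reject == inside_ci)
agree                                  # the two criteria coincide on the grid
```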
+ +
+ +
+ + +
+

Exact binomial test

+
+
+
    +
  • Recall this problem: suppose a friend has \(8\) children, \(7\) of whom are girls and none are twins
  • +
  • Perform the relevant hypothesis test: \(H_0 : p = 0.5\) versus \(H_a : p > 0.5\) + +
      +
    • What is the relevant rejection region so that the probability of rejecting under \(H_0\) is at most 5%?
    • +
  • +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Rejection region | Type I error rate |
---|---|
[0 : 8] | 1
[1 : 8] | 0.9961
[2 : 8] | 0.9648
[3 : 8] | 0.8555
[4 : 8] | 0.6367
[5 : 8] | 0.3633
[6 : 8] | 0.1445
[7 : 8] | 0.0352
[8 : 8] | 0.0039
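The Type I error rates in the table are upper tail binomial probabilities under \(H_0 : p = 0.5\). A minimal sketch that recomputes them (the object names are illustrative):

```r
k     <- 0:8                                   # lower end of each rejection region [k : 8]
type1 <- pbinom(k - 1, size = 8, prob = 0.5,
                lower.tail = FALSE)            # P(X >= k) under H_0
tab   <- data.frame(region = paste0("[", k, " : 8]"),
                    type1  = round(type1, 4))
tab
```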
+ +
+ +
+ + +
+

Notes

+
+
+
    +
  • It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. + +
      +
    • The closest is the rejection region [7 : 8]
    • +
    • Any alpha level lower than 0.0039 is not attainable.
    • +
  • +
  • For larger sample sizes, we could do a normal approximation, but you already knew this.
  • +
  • A two sided test isn't obvious. + +
      +
    • Given a way to do two sided tests, we could take the set of values of \(p_0\) for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW)
    • +
  • +
  • For these problems, people usually compute a P-value (next lecture) rather than computing the rejection region.
  • +
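Base R's `binom.test` carries out the exact test and also reports the Clopper/Pearson interval. A short sketch for the 7 girls out of 8 example:

```r
# One sided exact test of H_0: p = 0.5 versus H_a: p > 0.5
bt <- binom.test(7, 8, p = 0.5, alternative = "greater")
bt$p.value                      # P(X >= 7) under H_0, about 0.0352

# Two sided call; conf.int is the exact (Clopper/Pearson) 95% interval
ci <- binom.test(7, 8, p = 0.5)$conf.int
ci
```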
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_02_HypothesisTesting/index.md b/06_StatisticalInference/03_02_HypothesisTesting/index.md index 94bf98520..7f584af2c 100644 --- a/06_StatisticalInference/03_02_HypothesisTesting/index.md +++ b/06_StatisticalInference/03_02_HypothesisTesting/index.md @@ -1,217 +1,216 @@ ---- -title : Hypothesis testing -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## Hypothesis testing -* Hypothesis testing is concerned with making decisions using data -* A null hypothesis is specified that represents the status quo, - usually labeled $H_0$ -* The null hypothesis is assumed true and statistical evidence is required - to reject it in favor of a research or alternative hypothesis - ---- -## Example -* A respiratory disturbance index of more than $30$ events / hour, say, is - considered evidence of severe sleep disordered breathing (SDB). -* Suppose that in a sample of $100$ overweight subjects with other - risk factors for sleep disordered breathing at a sleep clinic, the - mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour. -* We might want to test the hypothesis that - * $H_0 : \mu = 30$ - * $H_a : \mu > 30$ - * where $\mu$ is the population mean RDI. 
- ---- -## Hypothesis testing -* The alternative hypotheses are typically of the form $<$, $>$ or $\neq$ -* Note that there are four possible outcomes of our statistical decision process - -Truth | Decide | Result | ----|---|---| -$H_0$ | $H_0$ | Correctly accept null | -$H_0$ | $H_a$ | Type I error | -$H_a$ | $H_a$ | Correctly reject null | -$H_a$ | $H_0$ | Type II error | - ---- -## Discussion -* Consider a court of law; the null hypothesis is that the - defendant is innocent -* We require evidence to reject the null hypothesis (convict) -* If we require little evidence, then we would increase the - percentage of innocent people convicted (type I errors); however we - would also increase the percentage of guilty people convicted - (correctly rejecting the null) -* If we require a lot of evidence, then we increase the the - percentage of innocent people let free (correctly accepting the - null) while we would also increase the percentage of guilty people - let free (type II errors) - ---- -## Example -* Consider our example again -* A reasonable strategy would reject the null hypothesis if - $\bar X$ was larger than some constant, say $C$ -* Typically, $C$ is chosen so that the probability of a Type I - error, $\alpha$, is $.05$ (or some other relevant constant) -* $\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct - ---- -## Example continued - - -$$ -\begin{align} -0.05 & = P\left(\bar X \geq C ~|~ \mu = 30 \right) \\ - & = P\left(\frac{\bar X - 30}{10 / \sqrt{100}} \geq \frac{C - 30}{10/\sqrt{100}} ~|~ \mu = 30\right) \\ - & = P\left(Z \geq \frac{C - 30}{1}\right) \\ -\end{align} -$$ - -* Hence $(C - 30) / 1 = 1.645$ implying $C = 31.645$ -* Since our mean is $32$ we reject the null hypothesis - ---- -## Discussion -* In general we don't convert $C$ back to the original scale -* We would just reject because the Z-score; which is how many - standard errors the sample mean is above the 
hypothesized mean - $$ - \frac{32 - 30}{10 / \sqrt{100}} = 2 - $$ - is greater than $1.645$ -* Or, whenever $\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}$ - ---- -## General rules -* The $Z$ test for $H_0:\mu = \mu_0$ versus - * $H_1: \mu < \mu_0$ - * $H_2: \mu \neq \mu_0$ - * $H_3: \mu > \mu_0$ -* Test statistic $ TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} $ -* Reject the null hypothesis when - * $TS \leq -Z_{1 - \alpha}$ - * $|TS| \geq Z_{1 - \alpha / 2}$ - * $TS \geq Z_{1 - \alpha}$ - ---- -## Notes -* We have fixed $\alpha$ to be low, so if we reject $H_0$ (either - our model is wrong) or there is a low probability that we have made - an error -* We have not fixed the probability of a type II error, $\beta$; - therefore we tend to say ``Fail to reject $H_0$'' rather than - accepting $H_0$ -* Statistical significance is no the same as scientific - significance -* The region of TS values for which you reject $H_0$ is called the - rejection region - ---- -## More notes -* The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough - for it to apply -* If $n$ is small, then a Gossett's $T$ test is performed exactly in the same way, - with the normal quantiles replaced by the appropriate Student's $T$ quantiles and - $n-1$ df -* The probability of rejecting the null hypothesis when it is false is called *power* -* Power is a used a lot to calculate sample sizes for experiments - ---- -## Example reconsidered -- Consider our example again. Suppose that $n= 16$ (rather than -$100$). Then consider that -$$ -.05 = P\left(\frac{\bar X - 30}{s / \sqrt{16}} \geq t_{1-\alpha, 15} ~|~ \mu = 30 \right) -$$ -- So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8 $, while the critical value is $t_{1-\alpha, 15} = 1.75$. -- We now fail to reject. 
- ---- -## Two sided tests -* Suppose that we would reject the null hypothesis if in fact the - mean was too large or too small -* That is, we want to test the alternative $H_a : \mu \neq 30$ - (doesn't make a lot of sense in our setting) -* Then note -$$ - \alpha = P\left(\left. \left|\frac{\bar X - 30}{s /\sqrt{16}}\right| > t_{1-\alpha/2,15} ~\right|~ \mu = 30\right) -$$ -* That is we will reject if the test statistic, $0.8$, is either - too large or too small, but the critical value is calculated using - $\alpha / 2$ -* In our example the critical value is $2.13$, so we fail to reject. - ---- -## T test in R - -```r -library(UsingR); data(father.son) -t.test(father.son$sheight - father.son$fheight) -``` - -``` - - One Sample t-test - -data: father.son$sheight - father.son$fheight -t = 11.79, df = 1077, p-value < 2.2e-16 -alternative hypothesis: true mean is not equal to 0 -95 percent confidence interval: - 0.831 1.163 -sample estimates: -mean of x - 0.997 -``` - - ---- -## Connections with confidence intervals -* Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ -* Take the set of all possible values for which you fail to reject $H_0$, this set is a $(1-\alpha)100\%$ confidence interval for $\mu$ -* The same works in reverse; if a $(1-\alpha)100\%$ interval - contains $\mu_0$, then we *fail to* reject $H_0$ - ---- -## Exact binomial test -- Recall this problem, *Suppose a friend has $8$ children, $7$ of which are girls and none are twins* -- Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$ - - What is the relevant rejection region so that the probability of rejecting is (less than) 5%? - -Rejection region | Type I error rate | ----|---| -[0 : 8] | 1 -[1 : 8] | 0.9961 -[2 : 8] | 0.9648 -[3 : 8] | 0.8555 -[4 : 8] | 0.6367 -[5 : 8] | 0.3633 -[6 : 8] | 0.1445 -[7 : 8] | 0.0352 -[8 : 8] | 0.0039 - ---- -## Notes -* It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. 
- * The closest is the rejection region [7 : 8] - * Any alpha level lower than 0.0039 is not attainable. -* For larger sample sizes, we could do a normal approximation, but you already knew this. -* Two sided test isn't obvious. - * Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW) -* For these problems, people always create a P-value (next lecture) rather than computing the rejection region. - - +--- +title : Hypothesis testing +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Hypothesis testing +* Hypothesis testing is concerned with making decisions using data +* A null hypothesis is specified that represents the status quo, + usually labeled $H_0$ +* The null hypothesis is assumed true and statistical evidence is required + to reject it in favor of a research or alternative hypothesis + +--- +## Example +* A respiratory disturbance index of more than $30$ events / hour, say, is + considered evidence of severe sleep disordered breathing (SDB). +* Suppose that in a sample of $100$ overweight subjects with other + risk factors for sleep disordered breathing at a sleep clinic, the + mean RDI was $32$ events / hour with a standard deviation of $10$ events / hour. +* We might want to test the hypothesis that + * $H_0 : \mu = 30$ + * $H_a : \mu > 30$ + * where $\mu$ is the population mean RDI. 
+ +--- +## Hypothesis testing +* The alternative hypotheses are typically of the form $<$, $>$ or $\neq$ +* Note that there are four possible outcomes of our statistical decision process + +Truth | Decide | Result | +---|---|---| +$H_0$ | $H_0$ | Correctly accept null | +$H_0$ | $H_a$ | Type I error | +$H_a$ | $H_a$ | Correctly reject null | +$H_a$ | $H_0$ | Type II error | + +--- +## Discussion +* Consider a court of law; the null hypothesis is that the + defendant is innocent +* We require evidence to reject the null hypothesis (convict) +* If we require little evidence, then we would increase the + percentage of innocent people convicted (type I errors); however we + would also increase the percentage of guilty people convicted + (correctly rejecting the null) +* If we require a lot of evidence, then we increase the the + percentage of innocent people let free (correctly accepting the + null) while we would also increase the percentage of guilty people + let free (type II errors) + +--- +## Example +* Consider our example again +* A reasonable strategy would reject the null hypothesis if + $\bar X$ was larger than some constant, say $C$ +* Typically, $C$ is chosen so that the probability of a Type I + error, $\alpha$, is $.05$ (or some other relevant constant) +* $\alpha$ = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct + +--- +## Example continued + + +$$ +\begin{align} +0.05 & = P\left(\bar X \geq C ~|~ \mu = 30 \right) \\ + & = P\left(\frac{\bar X - 30}{10 / \sqrt{100}} \geq \frac{C - 30}{10/\sqrt{100}} ~|~ \mu = 30\right) \\ + & = P\left(Z \geq \frac{C - 30}{1}\right) \\ +\end{align} +$$ + +* Hence $(C - 30) / 1 = 1.645$ implying $C = 31.645$ +* Since our mean is $32$ we reject the null hypothesis + +--- +## Discussion +* In general we don't convert $C$ back to the original scale +* We would just reject because the Z-score; which is how many + standard errors the sample mean is above the 
hypothesized mean + $$ + \frac{32 - 30}{10 / \sqrt{100}} = 2 + $$ + is greater than $1.645$ +* Or, whenever $\sqrt{n} (\bar X - \mu_0) / s > Z_{1-\alpha}$ + +--- +## General rules +* The $Z$ test for $H_0:\mu = \mu_0$ versus + * $H_1: \mu < \mu_0$ + * $H_2: \mu \neq \mu_0$ + * $H_3: \mu > \mu_0$ +* Test statistic $ TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} $ +* Reject the null hypothesis when + * $TS \leq -Z_{1 - \alpha}$ + * $|TS| \geq Z_{1 - \alpha / 2}$ + * $TS \geq Z_{1 - \alpha}$ + +--- +## Notes +* We have fixed $\alpha$ to be low, so if we reject $H_0$ (either + our model is wrong) or there is a low probability that we have made + an error +* We have not fixed the probability of a type II error, $\beta$; + therefore we tend to say ``Fail to reject $H_0$'' rather than + accepting $H_0$ +* Statistical significance is no the same as scientific + significance +* The region of TS values for which you reject $H_0$ is called the + rejection region + +--- +## More notes +* The $Z$ test requires the assumptions of the CLT and for $n$ to be large enough + for it to apply +* If $n$ is small, then a Gossett's $T$ test is performed exactly in the same way, + with the normal quantiles replaced by the appropriate Student's $T$ quantiles and + $n-1$ df +* The probability of rejecting the null hypothesis when it is false is called *power* +* Power is a used a lot to calculate sample sizes for experiments + +--- +## Example reconsidered +- Consider our example again. Suppose that $n= 16$ (rather than +$100$). Then consider that +$$ +.05 = P\left(\frac{\bar X - 30}{s / \sqrt{16}} \geq t_{1-\alpha, 15} ~|~ \mu = 30 \right) +$$ +- So that our test statistic is now $\sqrt{16}(32 - 30) / 10 = 0.8 $, while the critical value is $t_{1-\alpha, 15} = 1.75$. +- We now fail to reject. 
+ +--- +## Two sided tests +* Suppose that we would reject the null hypothesis if in fact the + mean was too large or too small +* That is, we want to test the alternative $H_a : \mu \neq 30$ + (doesn't make a lot of sense in our setting) +* Then note +$$ + \alpha = P\left(\left. \left|\frac{\bar X - 30}{s /\sqrt{16}}\right| > t_{1-\alpha/2,15} ~\right|~ \mu = 30\right) +$$ +* That is we will reject if the test statistic, $0.8$, is either + too large or too small, but the critical value is calculated using + $\alpha / 2$ +* In our example the critical value is $2.13$, so we fail to reject. + +--- +## T test in R + +```r +library(UsingR) +data(father.son) +t.test(father.son$sheight - father.son$fheight) +``` + +``` +## +## One Sample t-test +## +## data: father.son$sheight - father.son$fheight +## t = 11.79, df = 1077, p-value < 2.2e-16 +## alternative hypothesis: true mean is not equal to 0 +## 95 percent confidence interval: +## 0.831 1.163 +## sample estimates: +## mean of x +## 0.997 +``` + + +--- +## Connections with confidence intervals +* Consider testing $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ +* Take the set of all possible values for which you fail to reject $H_0$, this set is a $(1-\alpha)100\%$ confidence interval for $\mu$ +* The same works in reverse; if a $(1-\alpha)100\%$ interval + contains $\mu_0$, then we *fail to* reject $H_0$ + +--- +## Exact binomial test +- Recall this problem, *Suppose a friend has $8$ children, $7$ of which are girls and none are twins* +- Perform the relevant hypothesis test. $H_0 : p = 0.5$ $H_a : p > 0.5$ + - What is the relevant rejection region so that the probability of rejecting is (less than) 5%? 
+ +Rejection region | Type I error rate | +---|---| +[0 : 8] | 1 +[1 : 8] | 0.9961 +[2 : 8] | 0.9648 +[3 : 8] | 0.8555 +[4 : 8] | 0.6367 +[5 : 8] | 0.3633 +[6 : 8] | 0.1445 +[7 : 8] | 0.0352 +[8 : 8] | 0.0039 + +--- +## Notes +* It's impossible to get an exact 5% level test for this case due to the discreteness of the binomial. + * The closest is the rejection region [7 : 8] + * Any alpha level lower than 0.0039 is not attainable. +* For larger sample sizes, we could do a normal approximation, but you already knew this. +* Two sided test isn't obvious. + * Given a way to do two sided tests, we could take the set of values of $p_0$ for which we fail to reject to get an exact binomial confidence interval (called the Clopper/Pearson interval, BTW) +* For these problems, people always create a P-value (next lecture) rather than computing the rejection region. + + diff --git a/06_StatisticalInference/03_02_HypothesisTesting/index.pdf b/06_StatisticalInference/03_02_HypothesisTesting/index.pdf index 9e5f2ae42..c5db5a783 100644 Binary files a/06_StatisticalInference/03_02_HypothesisTesting/index.pdf and b/06_StatisticalInference/03_02_HypothesisTesting/index.pdf differ diff --git a/06_StatisticalInference/03_03_pValues/index.Rmd b/06_StatisticalInference/03_03_pValues/index.Rmd index bc6f09908..ea0bb8398 100644 --- a/06_StatisticalInference/03_03_pValues/index.Rmd +++ b/06_StatisticalInference/03_03_pValues/index.Rmd @@ -1,107 +1,92 @@ ---- -title : P-values -subtitle : Statistical inference -author : Brian Caffo, Jeffrey Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, 
warning = F, tidy = F} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -``` - -## P-values - -* Most common measure of "statistical significance" -* Their ubiquity, along with concern over their interpretation and use - makes them controversial among statisticians - * [http://warnercnr.colostate.edu/~anderson/thompson1.html](http://warnercnr.colostate.edu/~anderson/thompson1.html) - * Also see *Statistical Evidence: A Likelihood Paradigm* by Richard Royall - * *Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy* by Steve Goodman - * The hilariously titled: *The Earth is Round (p < .05)* by Cohen. -* Some positive comments - * [simply statistics](http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/) - * [normal deviate](http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/) - * [Error statistics](http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/) - ---- - - -## What is a P-value? - -__Idea__: Suppose nothing is going on - how unusual is it to see the estimate we got? - -__Approach__: - -1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (_null hypothesis_) -2. Calculate the summary/statistic with the data we have (_test statistic_) -3. 
Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (_p-value_) - ---- -## P-values -* The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone -* If the P-value is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false -* In our example the $T$ statistic was $0.8$. - * What's the probability of getting a $T$ statistic as large as $0.8$? -```{r} -pt(0.8, 15, lower.tail = FALSE) -``` -* Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is `r pt(0.8, 15, lower.tail = FALSE)` - ---- -## The attained significance level -* Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$. -* Notice that we rejected the one sided test when $\alpha = 0.05$, would we reject if $\alpha = 0.01$, how about $0.001$? -* The smallest value for alpha that you still reject the null hypothesis is called the *attained significance level* -* This is equivalent, but philosophically a little different from, the *P-value* - ---- -## Notes -* By reporting a P-value the reader can perform the hypothesis - test at whatever $\alpha$ level he or she choses -* If the P-value is less than $\alpha$ you reject the null hypothesis -* For two sided hypothesis test, double the smaller of the two one - sided hypothesis test Pvalues - ---- -## Revisiting an earlier example -- Suppose a friend has $8$ children, $7$ of which are girls and none are twins -- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? -```{r} -choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 -pbinom(6, size = 8, prob = .5, lower.tail = FALSE) -``` - ---- -## Poisson example -- Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1) during the last monitoring period. 
-- Assume that an infection rate of 0.05 is an important benchmark. -- Given the model, could the observed rate being larger than 0.05 be attributed to chance? -- Under $H_0: \lambda = 0.05$ so that $\lambda_0 100 = 5$ -- Consider $H_a: \lambda > 0.05$. - -```{r} -ppois(9, 5, lower.tail = FALSE) -``` - - - +--- +title : P-values +subtitle : Statistical inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## P-values + +* Most common measure of "statistical significance" +* Their ubiquity, along with concern over their interpretation and use + makes them controversial among statisticians + * [http://warnercnr.colostate.edu/~anderson/thompson1.html](http://warnercnr.colostate.edu/~anderson/thompson1.html) + * Also see *Statistical Evidence: A Likelihood Paradigm* by Richard Royall + * *Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy* by Steve Goodman + * The hilariously titled: *The Earth is Round (p < .05)* by Cohen. +* Some positive comments + * [simply statistics](http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/) + * [normal deviate](http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/) + * [Error statistics](http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/) + +--- + + +## What is a P-value? + +__Idea__: Suppose nothing is going on - how unusual is it to see the estimate we got? + +__Approach__: + +1. 
Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (_null hypothesis_) +2. Calculate the summary/statistic with the data we have (_test statistic_) +3. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (_p-value_) + +--- +## P-values +* The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone +* If the P-value is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false +* In our example the $T$ statistic was $0.8$. + * What's the probability of getting a $T$ statistic as large as $0.8$? +```{r} +pt(0.8, 15, lower.tail = FALSE) +``` +* Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is `r pt(0.8, 15, lower.tail = FALSE)` + +--- +## The attained significance level +* Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$. +* Notice that we rejected the one sided test when $\alpha = 0.05$, would we reject if $\alpha = 0.01$, how about $0.001$? +* The smallest value for alpha that you still reject the null hypothesis is called the *attained significance level* +* This is equivalent, but philosophically a little different from, the *P-value* + +--- +## Notes +* By reporting a P-value the reader can perform the hypothesis + test at whatever $\alpha$ level he or she choses +* If the P-value is less than $\alpha$ you reject the null hypothesis +* For two sided hypothesis test, double the smaller of the two one + sided hypothesis test Pvalues + +--- +## Revisiting an earlier example +- Suppose a friend has $8$ children, $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? 
+```{r} +choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 +pbinom(6, size = 8, prob = .5, lower.tail = FALSE) +``` + +--- +## Poisson example +- Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1) during the last monitoring period. +- Assume that an infection rate of 0.05 is an important benchmark. +- Given the model, could the observed rate being larger than 0.05 be attributed to chance? +- Under $H_0: \lambda = 0.05$ so that $\lambda_0 100 = 5$ +- Consider $H_a: \lambda > 0.05$. + +```{r} +ppois(9, 5, lower.tail = FALSE) +``` + + + diff --git a/06_StatisticalInference/03_03_pValues/index.html b/06_StatisticalInference/03_03_pValues/index.html index a92ee0403..8d31071c3 100644 --- a/06_StatisticalInference/03_03_pValues/index.html +++ b/06_StatisticalInference/03_03_pValues/index.html @@ -1,280 +1,280 @@ - - - - P-values - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

P-values

-

Statistical inference

-

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

P-values

-
-
-
    -
  • Most common measure of "statistical significance"
  • -
  • Their ubiquity, along with concern over their interpretation and use -makes them controversial among statisticians - -
      -
    • http://warnercnr.colostate.edu/~anderson/thompson1.html
    • -
    • Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall
    • -
    • Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman
    • -
    • The hilariously titled: The Earth is Round (p < .05) by Cohen.
    • -
  • -
  • Some positive comments - -
  • -
- -
- -
- - -
-

What is a P-value?

-
-
-

Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?

- -

Approach:

- -
    -
  1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (null hypothesis)
  2. -
  3. Calculate the summary/statistic with the data we have (test statistic)
  4. -
  5. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (p-value)
  6. -
- -
- -
- - -
-

P-values

-
-
-
    -
  • The P-value is the probability, under the null hypothesis, of obtaining evidence as extreme or more extreme than that actually obtained
  • -
  • If the P-value is small, then either \(H_0\) is true and we have observed a rare event or \(H_0\) is false
  • -
  • In our example the \(T\) statistic was \(0.8\). - -
      -
    • What's the probability of getting a \(T\) statistic as large as \(0.8\)?
    • -
  • -
- -
pt(0.8, 15, lower.tail = FALSE) 
-
- -
[1] 0.2181
-
- -
    -
  • Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under \(H_0\) is 0.2181
  • -
- -
- -
- - -
-

The attained significance level

-
-
-
    -
  • Our test statistic was \(2\) for \(H_0 : \mu = 30\) versus \(H_a:\mu > 30\).
  • -
  • Notice that we rejected the one sided test when \(\alpha = 0.05\). Would we reject if \(\alpha = 0.01\)? How about \(0.001\)?
  • -
  • The smallest value of alpha at which you still reject the null hypothesis is called the attained significance level
  • -
  • This is equivalent, but philosophically a little different from, the P-value
  • -
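For the Z statistic of \(2\), the attained significance level is the upper tail normal probability. A minimal sketch (variable names are illustrative):

```r
ts  <- 2                              # Z statistic from the RDI example
asl <- pnorm(ts, lower.tail = FALSE)  # attained significance level, about 0.0228

alphas <- c(0.05, 0.01, 0.001)
asl <= alphas                         # TRUE FALSE FALSE: reject only at 0.05
```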
- -
- -
- - -
-

Notes

-
-
-
    -
  • Reporting a P-value lets the reader perform the hypothesis -test at whatever \(\alpha\) level he or she chooses
  • -
  • If the P-value is less than \(\alpha\) you reject the null hypothesis
  • -
  • For a two sided hypothesis test, double the smaller of the two one -sided hypothesis test P-values
  • -
- -
- -
- - -
-

Revisiting an earlier example

-
-
-
    -
  • Suppose a friend has \(8\) children, \(7\) of whom are girls and none of whom are twins
  • -
  • If each gender has an independent \(50\)% probability for each birth, what's the probability of getting \(7\) or more girls out of \(8\) births?
  • -
- -
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 
-
- -
[1] 0.03516
-
- -
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
-
- -
[1] 0.03516
-
- -
- -
- - -
-

Poisson example

-
-
-
    -
  • Suppose that a hospital has an infection rate of 10 infections per 100 person-days at risk (rate of 0.1) during the last monitoring period.
  • -
  • Assume that an infection rate of 0.05 is an important benchmark.
  • -
  • Given the model, could the observed rate being larger than 0.05 be attributed to chance?
  • -
  • Under \(H_0: \lambda = 0.05\) so that \(\lambda_0 100 = 5\)
  • -
  • Consider \(H_a: \lambda > 0.05\).
  • -
- -
ppois(9, 5, lower.tail = FALSE)
-
- -
[1] 0.03183
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + P-values + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

P-values

+

Statistical inference

+

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

P-values

+
+
+
    +
  • Most common measure of "statistical significance"
  • +
  • Their ubiquity, along with concern over their interpretation and use, makes them controversial among statisticians + +
      +
    • http://warnercnr.colostate.edu/~anderson/thompson1.html
    • +
    • Also see Statistical Evidence: A Likelihood Paradigm by Richard Royall
    • +
    • Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy by Steve Goodman
    • +
    • The hilariously titled: The Earth is Round (p < .05) by Cohen.
    • +
  • +
  • Some positive comments + +
  • +
+ +
+ +
+ + +
+

What is a P-value?

+
+
+

Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?

+ +

Approach:

+ +
    +
  1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (null hypothesis)
  2. +
  3. Calculate the summary/statistic with the data we have (test statistic)
  4. +
  5. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (p-value)
  6. +
+ +
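The three steps above can be sketched in R. This is a minimal illustration, not part of the original slides: the data `x` are simulated, and the choice of a one-sample t statistic with null value `mu0 = 0` is an assumption made only for the example.

```r
# Hypothetical data: 16 draws whose true mean is slightly above 0
set.seed(1)
x <- rnorm(16, mean = 0.2, sd = 1)
n <- length(x)
mu0 <- 0  # assumed null value, H0: mu = mu0

# Step 1: under H0 the statistic below follows a t distribution with n - 1 df
# Step 2: calculate the test statistic from the data we have
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 3: compare to the null distribution (upper tail, for Ha: mu > mu0)
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Cross-check against R's built-in one-sample t test
p_check <- t.test(x, mu = mu0, alternative = "greater")$p.value
```

The manual calculation and `t.test` should agree, since `t.test` follows the same three steps internally.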
+ +
+ + +
+

P-values

+
+
+
    +
  • The P-value is the probability, under the null hypothesis, of obtaining evidence as extreme or more extreme than that actually observed
  • +
  • If the P-value is small, then either \(H_0\) is true and we have observed a rare event or \(H_0\) is false
  • +
  • In our example the \(T\) statistic was \(0.8\). + +
      +
    • What's the probability of getting a \(T\) statistic as large as \(0.8\)?
    • +
  • +
+ +
pt(0.8, 15, lower.tail = FALSE)
+
+ +
## [1] 0.2181
+
+ +
    +
  • Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under \(H_0\) is 0.2181
  • +
+ +
+ +
+ + +
+

The attained significance level

+
+
+
    +
  • Our test statistic was \(2\) for \(H_0 : \mu = 30\) versus \(H_a:\mu > 30\).
  • +
  • Notice that we rejected the one sided test when \(\alpha = 0.05\). Would we reject if \(\alpha = 0.01\)? How about \(0.001\)?
  • +
  • The smallest value of \(\alpha\) at which you still reject the null hypothesis is called the attained significance level
  • +
  • This is equivalent to, but philosophically a little different from, the P-value
  • +
+ +
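The questions on this slide can be answered by computing the P-value once and comparing it to each candidate \(\alpha\). The sketch below assumes 15 degrees of freedom (a sample of size 16, as in the `pt(0.8, 15, ...)` example); the slide itself does not state the degrees of freedom.

```r
t_stat <- 2
df <- 15  # assumption: n = 16, so n - 1 = 15 degrees of freedom

# One sided P-value for Ha: mu > 30
p_value <- pt(t_stat, df, lower.tail = FALSE)

# Reject at level alpha exactly when the P-value is below alpha
alphas <- c(0.05, 0.01, 0.001)
reject <- p_value < alphas
names(reject) <- alphas
reject
```

Under these assumptions the test rejects at \(0.05\) but not at \(0.01\) or \(0.001\); the attained significance level is the P-value itself.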
+ +
+ + +
+

Notes

+
+
+
    +
  • By reporting a P-value, the reader can perform the hypothesis test at whatever \(\alpha\) level he or she chooses
  • +
  • If the P-value is less than \(\alpha\) you reject the null hypothesis
  • +
  • For a two sided hypothesis test, double the smaller of the two one sided hypothesis test P-values
  • +
+ +
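As a small illustration of the doubling rule, the sketch below reuses the \(T\) statistic of \(0.8\) with 15 degrees of freedom from the earlier slide. The variable names are chosen for the example only.

```r
t_stat <- 0.8
df <- 15

# The two one sided P-values
p_upper <- pt(t_stat, df, lower.tail = FALSE)  # Ha: mu > mu0
p_lower <- pt(t_stat, df, lower.tail = TRUE)   # Ha: mu < mu0

# Two sided P-value: double the smaller of the two
p_two_sided <- 2 * min(p_upper, p_lower)
p_two_sided
```

For a symmetric null distribution such as the t, this is the same as `2 * pt(-abs(t_stat), df)`.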
+ +
+ + +
+

Revisiting an earlier example

+
+
+
    +
  • Suppose a friend has \(8\) children, \(7\) of whom are girls and none of whom are twins
  • +
  • If each gender has an independent \(50\)% probability for each birth, what's the probability of getting \(7\) or more girls out of \(8\) births?
  • +
+ +
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
+
+ +
## [1] 0.03516
+
+ +
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
+
+ +
## [1] 0.03516
+
+ +
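The same one sided P-value can also be obtained from R's exact binomial test. This is offered as a cross-check on the two calculations above, not as the slides' own method.

```r
# Exact one sided test of H0: p = 0.5 versus Ha: p > 0.5,
# with 7 girls observed in 8 births
p_binom <- binom.test(7, 8, p = 0.5, alternative = "greater")$p.value

# The tail probability computed directly, as on the slide
p_direct <- pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)

c(binom.test = p_binom, direct = p_direct)
```

Both values equal \(9/256\), which rounds to the 0.03516 shown above.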
+ +
+ + +
+

Poisson example

+
+
+
    +
  • Suppose that a hospital has an infection rate of 10 infections per 100 person-days at risk (rate of 0.1) during the last monitoring period.
  • +
  • Assume that an infection rate of 0.05 is an important benchmark.
  • +
  • Given the model, could the observed rate being larger than 0.05 be attributed to chance?
  • +
  • Under \(H_0: \lambda = 0.05\) so that \(\lambda_0 100 = 5\)
  • +
  • Consider \(H_a: \lambda > 0.05\).
  • +
+ +
ppois(9, 5, lower.tail = FALSE)
+
+ +
## [1] 0.03183
+
+ +
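The call `ppois(9, 5, lower.tail = FALSE)` gives \(P(X \geq 10)\) for a Poisson count with mean 5, because `lower.tail = FALSE` returns \(P(X > 9)\). As a hedged cross-check, the same number should come out of `poisson.test`, where `T` is the time base (100 person-days) and `r` is the null rate.

```r
# Tail probability as on the slide: P(X >= 10) when the expected count is 5
p_direct <- ppois(9, 5, lower.tail = FALSE)

# Exact Poisson test of H0: lambda = 0.05 versus Ha: lambda > 0.05,
# with 10 events observed over 100 person-days
p_test <- poisson.test(10, T = 100, r = 0.05, alternative = "greater")$p.value

c(direct = p_direct, poisson.test = p_test)
```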
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_03_pValues/index.md b/06_StatisticalInference/03_03_pValues/index.md index 3cec2d0b8..5dabc5349 100644 --- a/06_StatisticalInference/03_03_pValues/index.md +++ b/06_StatisticalInference/03_03_pValues/index.md @@ -1,119 +1,117 @@ ---- -title : P-values -subtitle : Statistical inference -author : Brian Caffo, Jeffrey Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## P-values - -* Most common measure of "statistical significance" -* Their ubiquity, along with concern over their interpretation and use - makes them controversial among statisticians - * [http://warnercnr.colostate.edu/~anderson/thompson1.html](http://warnercnr.colostate.edu/~anderson/thompson1.html) - * Also see *Statistical Evidence: A Likelihood Paradigm* by Richard Royall - * *Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy* by Steve Goodman - * The hilariously titled: *The Earth is Round (p < .05)* by Cohen. -* Some positive comments - * [simply statistics](http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/) - * [normal deviate](http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/) - * [Error statistics](http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/) - ---- - - -## What is a P-value? - -__Idea__: Suppose nothing is going on - how unusual is it to see the estimate we got? - -__Approach__: - -1. 
Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (_null hypothesis_) -2. Calculate the summary/statistic with the data we have (_test statistic_) -3. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (_p-value_) - ---- -## P-values -* The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone -* If the P-value is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false -* In our example the $T$ statistic was $0.8$. - * What's the probability of getting a $T$ statistic as large as $0.8$? - -```r -pt(0.8, 15, lower.tail = FALSE) -``` - -``` -[1] 0.2181 -``` - -* Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is 0.2181 - ---- -## The attained significance level -* Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$. -* Notice that we rejected the one sided test when $\alpha = 0.05$, would we reject if $\alpha = 0.01$, how about $0.001$? -* The smallest value for alpha that you still reject the null hypothesis is called the *attained significance level* -* This is equivalent, but philosophically a little different from, the *P-value* - ---- -## Notes -* By reporting a P-value the reader can perform the hypothesis - test at whatever $\alpha$ level he or she choses -* If the P-value is less than $\alpha$ you reject the null hypothesis -* For two sided hypothesis test, double the smaller of the two one - sided hypothesis test Pvalues - ---- -## Revisiting an earlier example -- Suppose a friend has $8$ children, $7$ of which are girls and none are twins -- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? 
- -```r -choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8 -``` - -``` -[1] 0.03516 -``` - -```r -pbinom(6, size = 8, prob = .5, lower.tail = FALSE) -``` - -``` -[1] 0.03516 -``` - - ---- -## Poisson example -- Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1) during the last monitoring period. -- Assume that an infection rate of 0.05 is an important benchmark. -- Given the model, could the observed rate being larger than 0.05 be attributed to chance? -- Under $H_0: \lambda = 0.05$ so that $\lambda_0 100 = 5$ -- Consider $H_a: \lambda > 0.05$. - - -```r -ppois(9, 5, lower.tail = FALSE) -``` - -``` -[1] 0.03183 -``` - - - - +--- +title : P-values +subtitle : Statistical inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## P-values + +* Most common measure of "statistical significance" +* Their ubiquity, along with concern over their interpretation and use + makes them controversial among statisticians + * [http://warnercnr.colostate.edu/~anderson/thompson1.html](http://warnercnr.colostate.edu/~anderson/thompson1.html) + * Also see *Statistical Evidence: A Likelihood Paradigm* by Richard Royall + * *Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy* by Steve Goodman + * The hilariously titled: *The Earth is Round (p < .05)* by Cohen. 
+* Some positive comments + * [simply statistics](http://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/) + * [normal deviate](http://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/) + * [Error statistics](http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/) + +--- + + +## What is a P-value? + +__Idea__: Suppose nothing is going on - how unusual is it to see the estimate we got? + +__Approach__: + +1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (_null hypothesis_) +2. Calculate the summary/statistic with the data we have (_test statistic_) +3. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (_p-value_) + +--- +## P-values +* The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone +* If the P-value is small, then either $H_0$ is true and we have observed a rare event or $H_0$ is false +* In our example the $T$ statistic was $0.8$. + * What's the probability of getting a $T$ statistic as large as $0.8$? + +```r +pt(0.8, 15, lower.tail = FALSE) +``` + +``` +## [1] 0.2181 +``` + +* Therefore, the probability of seeing evidence as extreme or more extreme than that actually obtained under $H_0$ is 0.2181 + +--- +## The attained significance level +* Our test statistic was $2$ for $H_0 : \mu_0 = 30$ versus $H_a:\mu > 30$. +* Notice that we rejected the one sided test when $\alpha = 0.05$, would we reject if $\alpha = 0.01$, how about $0.001$? 
+* The smallest value for alpha that you still reject the null hypothesis is called the *attained significance level* +* This is equivalent, but philosophically a little different from, the *P-value* + +--- +## Notes +* By reporting a P-value the reader can perform the hypothesis + test at whatever $\alpha$ level he or she choses +* If the P-value is less than $\alpha$ you reject the null hypothesis +* For two sided hypothesis test, double the smaller of the two one + sided hypothesis test Pvalues + +--- +## Revisiting an earlier example +- Suppose a friend has $8$ children, $7$ of which are girls and none are twins +- If each gender has an independent $50$% probability for each birth, what's the probability of getting $7$ or more girls out of $8$ births? + +```r +choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8 +``` + +``` +## [1] 0.03516 +``` + +```r +pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE) +``` + +``` +## [1] 0.03516 +``` + + +--- +## Poisson example +- Suppose that a hospital has an infection rate of 10 infections per 100 person/days at risk (rate of 0.1) during the last monitoring period. +- Assume that an infection rate of 0.05 is an important benchmark. +- Given the model, could the observed rate being larger than 0.05 be attributed to chance? +- Under $H_0: \lambda = 0.05$ so that $\lambda_0 100 = 5$ +- Consider $H_a: \lambda > 0.05$. 
+ + +```r +ppois(9, 5, lower.tail = FALSE) +``` + +``` +## [1] 0.03183 +``` + + + + diff --git a/06_StatisticalInference/03_03_pValues/index.pdf b/06_StatisticalInference/03_03_pValues/index.pdf index 85d8bd9d1..ba31db25c 100644 Binary files a/06_StatisticalInference/03_03_pValues/index.pdf and b/06_StatisticalInference/03_03_pValues/index.pdf differ diff --git a/06_StatisticalInference/03_04_Power/assets/fig/unnamed-chunk-2.png b/06_StatisticalInference/03_04_Power/assets/fig/unnamed-chunk-2.png new file mode 100644 index 000000000..c1c383311 Binary files /dev/null and b/06_StatisticalInference/03_04_Power/assets/fig/unnamed-chunk-2.png differ diff --git a/06_StatisticalInference/03_04_Power/index.Rmd b/06_StatisticalInference/03_04_Power/index.Rmd index ae0446258..53898e8cc 100644 --- a/06_StatisticalInference/03_04_Power/index.Rmd +++ b/06_StatisticalInference/03_04_Power/index.Rmd @@ -1,153 +1,137 @@ ---- -title : Power -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## Power -- 
Power is the probability of rejecting the null hypothesis when it is false -- Ergo, power (as it's name would suggest) is a good thing; you want more power -- A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called $\beta$ -- Note Power $= 1 - \beta$ - ---- -## Notes -- Consider our previous example involving RDI -- $H_0: \mu = 30$ versus $H_a: \mu > 30$ -- Then power is -$$P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~|~ \mu = \mu_a \right)$$ -- Note that this is a function that depends on the specific value of $\mu_a$! -- Notice as $\mu_a$ approaches $30$ the power approaches $\alpha$ - - ---- -## Calculating power for Gaussian data -Assume that $n$ is large and that we know $\sigma$ -$$ -\begin{align} -1 -\beta & = -P\left(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ -& = P\left(\frac{\bar X - \mu_a + \mu_a - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ \\ -& = P\left(\frac{\bar X - \mu_a}{\sigma /\sqrt{n}} > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ -& = P\left(Z > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ -\end{align} -$$ - ---- -## Example continued -- Suppose that we wanted to detect a increase in mean RDI - of at least 2 events / hour (above 30). -- Assume normality and that the sample in question will have a standard deviation of $4$; -- What would be the power if we took a sample size of $16$? - - $Z_{1-\alpha} = 1.645$ - - $\frac{\mu_a - 30}{\sigma /\sqrt{n}} = 2 / (4 /\sqrt{16}) = 2$ - - $P(Z > 1.645 - 2) = P(Z > -0.355) = 64\%$ -```{r} -pnorm(-0.355, lower.tail = FALSE) -``` - ---- -## Note -- Consider $H_0 : \mu = \mu_0$ and $H_a : \mu > \mu_0$ with $\mu = \mu_a$ under $H_a$. 
-- Under $H_0$ the statistic $Z = \frac{\sqrt{n}(\bar X - \mu_0)}{\sigma}$ is $N(0, 1)$ -- Under $H_a$ $Z$ is $N\left( \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}, 1\right)$ -- We reject if $Z > Z_{1-\alpha}$ - -``` -sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05 -plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = false, xlab = "Z value", ylab = "") -xvals <- seq(-3, 6, length = 1000) -lines(xvals, dnorm(xvals), type = "l", lwd = 3) -lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3) -abline(v = qnorm(1 - alpha)) -``` - ---- -```{r, fig.height=5, fig.width=5, echo = FALSE} -sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05 -plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = false, xlab = "Z value", ylab = "") -xvals <- seq(-3, 6, length = 1000) -lines(xvals, dnorm(xvals), type = "l", lwd = 3) -lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3) -abline(v = qnorm(1 - alpha)) -``` - - ---- -## Question -- When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then -$$1 - \beta = P\left(Z > z_{1-\alpha} - \frac{\mu_a - \mu_0}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right) = P(Z > z_{\beta})$$ -- This yields the equation -$$z_{1-\alpha} - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = z_{\beta}$$ -- Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$ -- Knowns: $\mu_0$, $\alpha$ -- Specify any 3 of the unknowns and you can solve for the remainder - ---- -## Notes -- The calculation for $H_a:\mu < \mu_0$ is similar -- For $H_a: \mu \neq \mu_0$ calculate the one sided power using - $\alpha / 2$ (this is only approximately right, it excludes the probability of - getting a large TS in the opposite direction of the truth) -- Power goes up as $\alpha$ gets larger -- Power of a one sided test is greater than the power of the - associated two sided test -- Power goes up as $\mu_1$ gets further away from $\mu_0$ -- Power goes up as $n$ goes up -- Power doesn't need $\mu_a$, $\sigma$ and $n$, instead only $\frac{\sqrt{n}(\mu_a - 
\mu_0)}{\sigma}$ - - The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard deviation units. - - Being unit free, it has some hope of interpretability across settings - ---- -## T-test power -- Consider calculating power for a Gossett's $T$ test for our example -- The power is - $$ - P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~|~ \mu = \mu_a \right) - $$ -- Calcuting this requires the non-central t distribution. -- `power.t.test` does this very well - - Omit one of the arguments and it solves for it - ---- -## Example -```{r} -power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$power -power.t.test(n = 16, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$power -power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power -``` - ---- -## Example -```{r} -power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$n -power.t.test(power = .8, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$n -power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n -``` - +--- +title : Power +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Power +- Power is the probability of rejecting the null hypothesis when it is false +- Ergo, power (as it's name would suggest) is a good thing; you want more power +- A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually 
called $\beta$ +- Note Power $= 1 - \beta$ + +--- +## Notes +- Consider our previous example involving RDI +- $H_0: \mu = 30$ versus $H_a: \mu > 30$ +- Then power is +$$P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~|~ \mu = \mu_a \right)$$ +- Note that this is a function that depends on the specific value of $\mu_a$! +- Notice as $\mu_a$ approaches $30$ the power approaches $\alpha$ + + +--- +## Calculating power for Gaussian data +Assume that $n$ is large and that we know $\sigma$ +$$ +\begin{align} +1 -\beta & = +P\left(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ +& = P\left(\frac{\bar X - \mu_a + \mu_a - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ \\ +& = P\left(\frac{\bar X - \mu_a}{\sigma /\sqrt{n}} > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ +& = P\left(Z > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ +\end{align} +$$ + +--- +## Example continued +- Suppose that we wanted to detect a increase in mean RDI + of at least 2 events / hour (above 30). +- Assume normality and that the sample in question will have a standard deviation of $4$; +- What would be the power if we took a sample size of $16$? + - $Z_{1-\alpha} = 1.645$ + - $\frac{\mu_a - 30}{\sigma /\sqrt{n}} = 2 / (4 /\sqrt{16}) = 2$ + - $P(Z > 1.645 - 2) = P(Z > -0.355) = 64\%$ +```{r} +pnorm(-0.355, lower.tail = FALSE) +``` + +--- +## Note +- Consider $H_0 : \mu = \mu_0$ and $H_a : \mu > \mu_0$ with $\mu = \mu_a$ under $H_a$. 
+- Under $H_0$ the statistic $Z = \frac{\sqrt{n}(\bar X - \mu_0)}{\sigma}$ is $N(0, 1)$ +- Under $H_a$ $Z$ is $N\left( \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}, 1\right)$ +- We reject if $Z > Z_{1-\alpha}$ + +``` +sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05 +plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = FALSE, xlab = "Z value", ylab = "") +xvals <- seq(-3, 6, length = 1000) +lines(xvals, dnorm(xvals), type = "l", lwd = 3) +lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3) +abline(v = qnorm(1 - alpha)) +``` + +--- +```{r, fig.height=5, fig.width=5, echo = FALSE} +sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05 +plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = FALSE, xlab = "Z value", ylab = "") +xvals <- seq(-3, 6, length = 1000) +lines(xvals, dnorm(xvals), type = "l", lwd = 3) +lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3) +abline(v = qnorm(1 - alpha)) +``` + + +--- +## Question +- When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then +$$1 - \beta = P\left(Z > z_{1-\alpha} - \frac{\mu_a - \mu_0}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right) = P(Z > z_{\beta})$$ +- This yields the equation +$$z_{1-\alpha} - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = z_{\beta}$$ +- Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$ +- Knowns: $\mu_0$, $\alpha$ +- Specify any 3 of the unknowns and you can solve for the remainder + +--- +## Notes +- The calculation for $H_a:\mu < \mu_0$ is similar +- For $H_a: \mu \neq \mu_0$ calculate the one sided power using + $\alpha / 2$ (this is only approximately right, it excludes the probability of + getting a large TS in the opposite direction of the truth) +- Power goes up as $\alpha$ gets larger +- Power of a one sided test is greater than the power of the + associated two sided test +- Power goes up as $\mu_1$ gets further away from $\mu_0$ +- Power goes up as $n$ goes up +- Power doesn't need $\mu_a$, $\sigma$ and $n$, instead only $\frac{\sqrt{n}(\mu_a - 
\mu_0)}{\sigma}$ + - The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard deviation units. + - Being unit free, it has some hope of interpretability across settings + +--- +## T-test power +- Consider calculating power for a Gossett's $T$ test for our example +- The power is + $$ + P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~|~ \mu = \mu_a \right) + $$ +- Calcuting this requires the non-central t distribution. +- `power.t.test` does this very well + - Omit one of the arguments and it solves for it + +--- +## Example +```{r} +power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$power +power.t.test(n = 16, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$power +power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power +``` + +--- +## Example +```{r} +power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$n +power.t.test(power = .8, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$n +power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n +``` + diff --git a/06_StatisticalInference/03_04_Power/index.html b/06_StatisticalInference/03_04_Power/index.html index 84415fc6d..6465ed50f 100644 --- a/06_StatisticalInference/03_04_Power/index.html +++ b/06_StatisticalInference/03_04_Power/index.html @@ -1,382 +1,382 @@ - - - - Power - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Power

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Power

-
-
-
    -
  • Power is the probability of rejecting the null hypothesis when it is false
  • -
  • Ergo, power (as its name would suggest) is a good thing; you want more power
  • -
  • A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called \(\beta\)
  • -
  • Note Power \(= 1 - \beta\)
  • -
- -
- -
- - -
-

Notes

-
-
-
    -
  • Consider our previous example involving RDI
  • -
  • \(H_0: \mu = 30\) versus \(H_a: \mu > 30\)
  • -
  • Then power is \[P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~|~ \mu = \mu_a \right)\]
  • -
  • Note that this is a function that depends on the specific value of \(\mu_a\)!
  • -
  • Notice as \(\mu_a\) approaches \(30\) the power approaches \(\alpha\)
  • -
- -
- -
- - -
-

Calculating power for Gaussian data

-
-
-

Assume that \(n\) is large and that we know \(\sigma\)
\[
\begin{align}
1 - \beta & = P\left(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\
& = P\left(\frac{\bar X - \mu_a + \mu_a - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\
& = P\left(\frac{\bar X - \mu_a}{\sigma /\sqrt{n}} > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\
& = P\left(Z > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)
\end{align}
\]

- -
- -
- - -
-

Example continued

-
-
-
    -
  • Suppose that we wanted to detect an increase in mean RDI of at least 2 events / hour (above 30).
  • -
  • Assume normality and that the sample in question will have a standard deviation of \(4\);
  • -
  • What would be the power if we took a sample size of \(16\)? - -
      -
    • \(Z_{1-\alpha} = 1.645\)
    • -
    • \(\frac{\mu_a - 30}{\sigma /\sqrt{n}} = 2 / (4 /\sqrt{16}) = 2\)
    • -
    • \(P(Z > 1.645 - 2) = P(Z > -0.355) = 64\%\)
    • -
  • -
- -
pnorm(-0.355, lower.tail = FALSE)
-
- -
[1] 0.6387
-
- -
- -
- - -
-

Note

-
-
-
    -
  • Consider \(H_0 : \mu = \mu_0\) and \(H_a : \mu > \mu_0\) with \(\mu = \mu_a\) under \(H_a\).
  • -
  • Under \(H_0\) the statistic \(Z = \frac{\sqrt{n}(\bar X - \mu_0)}{\sigma}\) is \(N(0, 1)\)
  • -
  • Under \(H_a\) \(Z\) is \(N\left( \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}, 1\right)\)
  • -
  • We reject if \(Z > Z_{1-\alpha}\)
  • -
- -
sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05
-plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = false, xlab = "Z value", ylab = "")
-xvals <- seq(-3, 6, length = 1000)
-lines(xvals, dnorm(xvals), type = "l", lwd = 3)
-lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3)
-abline(v = qnorm(1 - alpha))
-
- -
- -
- - -
-
plot of chunk unnamed-chunk-2
- -
- -
- - -
-

Question

-
-
-
    -
  • When testing \(H_a : \mu > \mu_0\), notice if power is \(1 - \beta\), then \[1 - \beta = P\left(Z > z_{1-\alpha} - \frac{\mu_a - \mu_0}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right) = P(Z > z_{\beta})\]
  • -
  • This yields the equation \[z_{1-\alpha} - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = z_{\beta}\]
  • -
  • Unknowns: \(\mu_a\), \(\sigma\), \(n\), \(\beta\)
  • -
  • Knowns: \(\mu_0\), \(\alpha\)
  • -
  • Specify any 3 of the unknowns and you can solve for the remainder
  • -
- -
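The equation above can be rearranged to solve for \(n\) once \(\mu_a\), \(\sigma\) and \(\beta\) are specified. The helper `n_for_power` below is hypothetical (it is not in the slides) and relies on the large-sample, known-\(\sigma\) setting of the derivation; \(z_\beta\) is taken to be the \(\beta\) quantile of the standard normal, `qnorm(beta)`.

```r
# Hypothetical helper: solve
#   z_{1-alpha} - sqrt(n) * (mu_a - mu_0) / sigma = z_beta
# for n, giving n = ((z_{1-alpha} - z_beta) * sigma / (mu_a - mu_0))^2
n_for_power <- function(mu_a, mu_0, sigma, power = 0.8, alpha = 0.05) {
  beta <- 1 - power
  ((qnorm(1 - alpha) - qnorm(beta)) * sigma / (mu_a - mu_0))^2
}

# RDI example: detect an increase of 2 above 30 with sigma = 4
n_needed <- n_for_power(mu_a = 32, mu_0 = 30, sigma = 4)
n_needed

# Check: rounding up to a whole sample size should reach at least 80% power
n_int <- ceiling(n_needed)
power_at_n <- pnorm(qnorm(0.95) - sqrt(n_int) * (32 - 30) / 4, lower.tail = FALSE)
```

The normal-based answer (roughly 25) is a little smaller than the 26.14 that `power.t.test` reports for the same inputs, which is expected because the t test also has to estimate \(\sigma\).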
- -
- - -
-

Notes

-
-
-
    -
  • The calculation for \(H_a:\mu < \mu_0\) is similar
  • -
  • For \(H_a: \mu \neq \mu_0\) calculate the one sided power using \(\alpha / 2\) (this is only approximately right; it excludes the probability of getting a large test statistic in the opposite direction of the truth)
  • -
  • Power goes up as \(\alpha\) gets larger
  • -
  • Power of a one sided test is greater than the power of the associated two sided test
  • -
  • Power goes up as \(\mu_a\) gets further away from \(\mu_0\)
  • -
  • Power goes up as \(n\) goes up
  • -
  • Power doesn't need \(\mu_a\), \(\sigma\) and \(n\) individually, only the combination \(\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}\) - -
      -
    • The quantity \(\frac{\mu_a - \mu_0}{\sigma}\) is called the effect size, the difference in the means in standard deviation units.
    • -
    • Being unit free, it has some hope of interpretability across settings
    • -
  • -
- -
- -
- - -
-

T-test power

-
-
-
    -
  • Consider calculating power for Gosset's \(T\) test for our example
  • -
  • The power is \[P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~|~ \mu = \mu_a \right)\]
  • -
  • Calculating this requires the non-central t distribution.
  • -
  • power.t.test does this very well - -
      -
    • Omit one of the arguments and it solves for it
    • -
  • -
- -
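To make the role of the non-central t distribution concrete, the sketch below computes the one sided, one-sample power directly with `pt(..., ncp = ...)` and compares it with `power.t.test`. The inputs (\(n = 16\), a difference of 2, standard deviation 4) are the RDI example values; the variable names are chosen for illustration.

```r
n <- 16
delta <- 2   # mu_a - mu_0
s <- 4       # standard deviation
alpha <- 0.05

# Under Ha the T statistic is non-central t with n - 1 df and
# non-centrality parameter sqrt(n) * delta / s
ncp <- sqrt(n) * delta / s
power_nct <- pt(qt(1 - alpha, df = n - 1), df = n - 1, ncp = ncp,
                lower.tail = FALSE)

# Reference value from power.t.test
power_ref <- power.t.test(n = n, delta = delta, sd = s, sig.level = alpha,
                          type = "one.sample",
                          alternative = "one.sided")$power

c(noncentral_t = power_nct, power.t.test = power_ref)
```

Only the ratio `delta / s` enters through `ncp`, which is why rescaling both by the same factor leaves the power unchanged.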
- -
- - -
-

Example

-
-
-
power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample",  alt = "one.sided")$power
-
- -
[1] 0.604
-
- -
power.t.test(n = 16, delta = 2, sd=4, type = "one.sample",  alt = "one.sided")$power
-
- -
[1] 0.604
-
- -
power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power
-
- -
[1] 0.604
-
- -
- -
- - -
-

Example

-
-
-
power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample",  alt = "one.sided")$n
-
- -
[1] 26.14
-
- -
power.t.test(power = .8, delta = 2, sd=4, type = "one.sample",  alt = "one.sided")$n
-
- -
[1] 26.14
-
- -
power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n
-
- -
[1] 26.14
-
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Power + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Power

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Power

+
+
+
    +
  • Power is the probability of rejecting the null hypothesis when it is false
  • +
  • Ergo, power (as its name would suggest) is a good thing; you want more power
  • +
  • A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called \(\beta\)
  • +
  • Note that power \(= 1 - \beta\)
  • +
+ +
+ +
+ + +
+

Notes

+
+
+
    +
  • Consider our previous example involving RDI
  • +
  • \(H_0: \mu = 30\) versus \(H_a: \mu > 30\)
  • +
  • Then power is +\[P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~|~ \mu = \mu_a \right)\]
  • +
  • Note that this is a function that depends on the specific value of \(\mu_a\)!
  • +
  • Notice as \(\mu_a\) approaches \(30\) the power approaches \(\alpha\)
  • +
+ +
+ +
+ + +
+

Calculating power for Gaussian data

+
+
+

Assume that \(n\) is large and that we know \(\sigma\) +\[ +\begin{align} +1 -\beta & = +P\left(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ +& = P\left(\frac{\bar X - \mu_a + \mu_a - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ \\ +& = P\left(\frac{\bar X - \mu_a}{\sigma /\sqrt{n}} > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ +& = P\left(Z > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ +\end{align} +\]

+ +
+ +
+ + +
+

Example continued

+
+
+
    +
  • Suppose that we wanted to detect an increase in mean RDI +of at least 2 events / hour (above 30).
  • +
  • Assume normality and that the sample in question will have a standard deviation of \(4\).
  • +
  • What would be the power if we took a sample size of \(16\)? + +
      +
    • \(z_{1-\alpha} = 1.645\)
    • +
    • \(\frac{\mu_a - 30}{\sigma /\sqrt{n}} = 2 / (4 /\sqrt{16}) = 2\)
    • +
    • \(P(Z > 1.645 - 2) = P(Z > -0.355) = 64\%\)
    • +
  • +
+ +
pnorm(-0.355, lower.tail = FALSE)
+
+ +
## [1] 0.6387
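The hand calculation above can be wrapped in a small function. This is only a sketch under the slide's assumptions (known \(\sigma\), one sided alternative \(\mu > \mu_0\)); the name `z_power` and its arguments are hypothetical and do not come from the course materials.

```r
# Hypothetical helper (not part of the slides): one sided power of a z-test
# for H_0: mu = mu0 versus H_a: mu > mu0, assuming sigma is known.
z_power <- function(mu_a, mu0, sigma, n, alpha = 0.05) {
  # Mean of the Z statistic when the true mean is mu_a
  shift <- sqrt(n) * (mu_a - mu0) / sigma
  # P(Z > z_{1 - alpha} - shift) for a standard normal Z
  pnorm(qnorm(1 - alpha) - shift, lower.tail = FALSE)
}

# RDI example: mu0 = 30, mu_a = 32, sigma = 4, n = 16
z_power(mu_a = 32, mu0 = 30, sigma = 4, n = 16)
```

For the RDI numbers this returns roughly 0.64, in line with the calculation on the slide, and setting `mu_a` equal to `mu0` returns `alpha`.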
+
+ +
+ +
+ + +
+

Note

+
+
+
    +
  • Consider \(H_0 : \mu = \mu_0\) and \(H_a : \mu > \mu_0\) with \(\mu = \mu_a\) under \(H_a\).
  • +
  • Under \(H_0\) the statistic \(Z = \frac{\sqrt{n}(\bar X - \mu_0)}{\sigma}\) is \(N(0, 1)\)
  • +
  • Under \(H_a\) \(Z\) is \(N\left( \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}, 1\right)\)
  • +
  • We reject if \(Z > z_{1-\alpha}\)
  • +
+ +
sigma <- 10; mu_0 <- 0; mu_a <- 2; n <- 100; alpha <- 0.05
+plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = FALSE, xlab = "Z value", ylab = "")
+xvals <- seq(-3, 6, length.out = 1000)
+lines(xvals, dnorm(xvals), type = "l", lwd = 3)
+lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3)
+abline(v = qnorm(1 - alpha))
+
+ +
+ +
+ + +
+

plot of chunk unnamed-chunk-2

+ +
+ +
+ + +
+

Question

+
+
+
    +
  • When testing \(H_a : \mu > \mu_0\), notice if power is \(1 - \beta\), then +\[1 - \beta = P\left(Z > z_{1-\alpha} - \frac{\mu_a - \mu_0}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right) = P(Z > z_{\beta})\]
  • +
  • This yields the equation +\[z_{1-\alpha} - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = z_{\beta}\]
  • +
  • Unknowns: \(\mu_a\), \(\sigma\), \(n\), \(\beta\)
  • +
  • Knowns: \(\mu_0\), \(\alpha\)
  • +
  • Specify any 3 of the unknowns and you can solve for the remainder
  • +
+ +
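As one illustration of solving the equation above for a single unknown, the following sketch solves for \(n\) given the effect \(\mu_a - \mu_0\), \(\sigma\), \(\alpha\) and the desired power. The function name `z_sample_size` and its argument names are hypothetical; the sketch assumes a one sided z-test with known \(\sigma\).

```r
# Hypothetical helper: smallest integer n reaching the requested power
# in a one sided z-test with known sigma.
z_sample_size <- function(delta, sigma, power = 0.8, alpha = 0.05) {
  # Rearranging z_{1 - alpha} - sqrt(n) * delta / sigma = z_beta,
  # where beta = 1 - power, gives sqrt(n) below.
  root_n <- (qnorm(1 - alpha) - qnorm(1 - power)) * sigma / delta
  # Round up because n must be a whole number of observations
  ceiling(root_n^2)
}

# RDI example: detect an increase of 2 with sigma = 4 at 80% power
z_sample_size(delta = 2, sigma = 4)
```

With the RDI numbers this gives \(n = 25\). Only the ratio `delta / sigma` matters, so `z_sample_size(delta = 100, sigma = 200)` gives the same answer.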
+ +
+ + +
+

Notes

+
+
+
    +
  • The calculation for \(H_a:\mu < \mu_0\) is similar
  • +
  • For \(H_a: \mu \neq \mu_0\) calculate the one sided power using +\(\alpha / 2\) (this is only approximately right; it excludes the probability of +getting a large test statistic in the opposite direction of the truth)
  • +
  • Power goes up as \(\alpha\) gets larger
  • +
  • Power of a one sided test is greater than the power of the +associated two sided test
  • +
  • Power goes up as \(\mu_a\) gets further away from \(\mu_0\)
  • +
  • Power goes up as \(n\) goes up
  • +
  • Power doesn't depend on \(\mu_a\), \(\sigma\) and \(n\) separately, only on \(\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}\) + +
      +
    • The quantity \(\frac{\mu_a - \mu_0}{\sigma}\) is called the effect size, the difference in the means in standard deviation units.
    • +
    • Being unit free, it has some hope of interpretability across settings
    • +
  • +
+ +
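To see how close the \(\alpha / 2\) approximation for the two sided test is, the sketch below computes both the approximate power and the exact power that also counts rejections in the wrong tail. It assumes a z-test with known \(\sigma\); the name `two_sided_z_power` is hypothetical.

```r
# Hypothetical helper: approximate versus exact power of a two sided z-test.
two_sided_z_power <- function(delta, sigma, n, alpha = 0.05) {
  shift <- sqrt(n) * delta / sigma   # mean of Z under the alternative
  crit <- qnorm(1 - alpha / 2)       # two sided critical value
  # One sided power at level alpha / 2 (the approximation in the notes)
  approximate <- pnorm(crit - shift, lower.tail = FALSE)
  # Probability of rejecting in the direction opposite to the truth
  wrong_tail <- pnorm(-crit - shift)
  c(approximate = approximate, exact = approximate + wrong_tail)
}

two_sided_z_power(delta = 2, sigma = 4, n = 16)
```

The two values differ only by the wrong-tail probability, which is tiny unless the effect size is close to zero.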
+ +
+ + +
+

T-test power

+
+
+
    +
  • Consider calculating power for Gosset's \(T\) test for our example
  • +
  • The power is +\[ +P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~|~ \mu = \mu_a \right) +\]
  • +
  • Calculating this requires the non-central t distribution.
  • +
  • power.t.test does this very well + +
      +
    • Omit one of the arguments and it solves for it
    • +
  • +
+ +
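The non-central t calculation can also be written out directly with `pt` and `qt`, which may help show what `power.t.test` is doing in the one sample, one sided case. This is a sketch only; the function name `t_power` is hypothetical, and the non-centrality parameter is assumed to be \(\sqrt{n}(\mu_a - \mu_0)/\sigma\).

```r
# Hypothetical helper: one sided, one sample t-test power
# computed from the non-central t distribution.
t_power <- function(n, delta, sd, alpha = 0.05) {
  df <- n - 1
  # Non-centrality parameter: sqrt(n) times the effect size
  ncp <- sqrt(n) * delta / sd
  # P(T > t_{1 - alpha, n - 1}) when T is non-central t with this ncp
  pt(qt(1 - alpha, df), df, ncp = ncp, lower.tail = FALSE)
}

# RDI example: n = 16, difference of 2, standard deviation 4
t_power(n = 16, delta = 2, sd = 4)
```

For the RDI inputs this should agree with `power.t.test(n = 16, delta = 2, sd = 4, type = "one.sample", alternative = "one.sided")$power`, and like that function it depends on `delta` and `sd` only through their ratio.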
+ +
+ + +
+

Example

+
+
+
power.t.test(n = 16, delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$power
+
+ +
## [1] 0.604
+
+ +
power.t.test(n = 16, delta = 2, sd = 4, type = "one.sample", alt = "one.sided")$power
+
+ +
## [1] 0.604
+
+ +
power.t.test(n = 16, delta = 100, sd = 200, type = "one.sample", alt = "one.sided")$power
+
+ +
## [1] 0.604
+
+ +
+ +
+ + +
+

Example

+
+
+
power.t.test(power = 0.8, delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$n
+
+ +
## [1] 26.14
+
+ +
power.t.test(power = 0.8, delta = 2, sd = 4, type = "one.sample", alt = "one.sided")$n
+
+ +
## [1] 26.14
+
+ +
power.t.test(power = 0.8, delta = 100, sd = 200, type = "one.sample", alt = "one.sided")$n
+
+ +
## [1] 26.14
+
+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_04_Power/index.md b/06_StatisticalInference/03_04_Power/index.md index 1b289c777..37484c7b4 100644 --- a/06_StatisticalInference/03_04_Power/index.md +++ b/06_StatisticalInference/03_04_Power/index.md @@ -1,179 +1,177 @@ ---- -title : Power -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - -## Power -- Power is the probability of rejecting the null hypothesis when it is false -- Ergo, power (as it's name would suggest) is a good thing; you want more power -- A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called $\beta$ -- Note Power $= 1 - \beta$ - ---- -## Notes -- Consider our previous example involving RDI -- $H_0: \mu = 30$ versus $H_a: \mu > 30$ -- Then power is -$$P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~|~ \mu = \mu_a \right)$$ -- Note that this is a function that depends on the specific value of $\mu_a$! 
-- Notice as $\mu_a$ approaches $30$ the power approaches $\alpha$ - - ---- -## Calculating power for Gaussian data -Assume that $n$ is large and that we know $\sigma$ -$$ -\begin{align} -1 -\beta & = -P\left(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ -& = P\left(\frac{\bar X - \mu_a + \mu_a - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ \\ -& = P\left(\frac{\bar X - \mu_a}{\sigma /\sqrt{n}} > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ -& = P\left(Z > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ -\end{align} -$$ - ---- -## Example continued -- Suppose that we wanted to detect a increase in mean RDI - of at least 2 events / hour (above 30). -- Assume normality and that the sample in question will have a standard deviation of $4$; -- What would be the power if we took a sample size of $16$? - - $Z_{1-\alpha} = 1.645$ - - $\frac{\mu_a - 30}{\sigma /\sqrt{n}} = 2 / (4 /\sqrt{16}) = 2$ - - $P(Z > 1.645 - 2) = P(Z > -0.355) = 64\%$ - -```r -pnorm(-0.355, lower.tail = FALSE) -``` - -``` -[1] 0.6387 -``` - - ---- -## Note -- Consider $H_0 : \mu = \mu_0$ and $H_a : \mu > \mu_0$ with $\mu = \mu_a$ under $H_a$. -- Under $H_0$ the statistic $Z = \frac{\sqrt{n}(\bar X - \mu_0)}{\sigma}$ is $N(0, 1)$ -- Under $H_a$ $Z$ is $N\left( \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}, 1\right)$ -- We reject if $Z > Z_{1-\alpha}$ - -``` -sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05 -plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = false, xlab = "Z value", ylab = "") -xvals <- seq(-3, 6, length = 1000) -lines(xvals, dnorm(xvals), type = "l", lwd = 3) -lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3) -abline(v = qnorm(1 - alpha)) -``` - ---- -
plot of chunk unnamed-chunk-2
- - - ---- -## Question -- When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then -$$1 - \beta = P\left(Z > z_{1-\alpha} - \frac{\mu_a - \mu_0}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right) = P(Z > z_{\beta})$$ -- This yields the equation -$$z_{1-\alpha} - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = z_{\beta}$$ -- Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$ -- Knowns: $\mu_0$, $\alpha$ -- Specify any 3 of the unknowns and you can solve for the remainder - ---- -## Notes -- The calculation for $H_a:\mu < \mu_0$ is similar -- For $H_a: \mu \neq \mu_0$ calculate the one sided power using - $\alpha / 2$ (this is only approximately right, it excludes the probability of - getting a large TS in the opposite direction of the truth) -- Power goes up as $\alpha$ gets larger -- Power of a one sided test is greater than the power of the - associated two sided test -- Power goes up as $\mu_1$ gets further away from $\mu_0$ -- Power goes up as $n$ goes up -- Power doesn't need $\mu_a$, $\sigma$ and $n$, instead only $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$ - - The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard deviation units. - - Being unit free, it has some hope of interpretability across settings - ---- -## T-test power -- Consider calculating power for a Gossett's $T$ test for our example -- The power is - $$ - P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~|~ \mu = \mu_a \right) - $$ -- Calcuting this requires the non-central t distribution. 
-- `power.t.test` does this very well - - Omit one of the arguments and it solves for it - ---- -## Example - -```r -power.t.test(n = 16, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$power -``` - -``` -[1] 0.604 -``` - -```r -power.t.test(n = 16, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$power -``` - -``` -[1] 0.604 -``` - -```r -power.t.test(n = 16, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$power -``` - -``` -[1] 0.604 -``` - - ---- -## Example - -```r -power.t.test(power = .8, delta = 2 / 4, sd=1, type = "one.sample", alt = "one.sided")$n -``` - -``` -[1] 26.14 -``` - -```r -power.t.test(power = .8, delta = 2, sd=4, type = "one.sample", alt = "one.sided")$n -``` - -``` -[1] 26.14 -``` - -```r -power.t.test(power = .8, delta = 100, sd=200, type = "one.sample", alt = "one.sided")$n -``` - -``` -[1] 26.14 -``` - - +--- +title : Power +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- + +## Power +- Power is the probability of rejecting the null hypothesis when it is false +- Ergo, power (as it's name would suggest) is a good thing; you want more power +- A type II error (a bad thing, as its name would suggest) is failing to reject the null hypothesis when it's false; the probability of a type II error is usually called $\beta$ +- Note Power $= 1 - \beta$ + +--- +## Notes +- Consider our previous example involving RDI +- $H_0: \mu = 30$ versus $H_a: \mu > 30$ +- Then power is +$$P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~|~ \mu = \mu_a \right)$$ +- Note that this is a function that depends 
on the specific value of $\mu_a$! +- Notice as $\mu_a$ approaches $30$ the power approaches $\alpha$ + + +--- +## Calculating power for Gaussian data +Assume that $n$ is large and that we know $\sigma$ +$$ +\begin{align} +1 -\beta & = +P\left(\frac{\bar X - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ +& = P\left(\frac{\bar X - \mu_a + \mu_a - 30}{\sigma /\sqrt{n}} > z_{1-\alpha} ~|~ \mu = \mu_a \right)\\ \\ +& = P\left(\frac{\bar X - \mu_a}{\sigma /\sqrt{n}} > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ +& = P\left(Z > z_{1-\alpha} - \frac{\mu_a - 30}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right)\\ \\ +\end{align} +$$ + +--- +## Example continued +- Suppose that we wanted to detect a increase in mean RDI + of at least 2 events / hour (above 30). +- Assume normality and that the sample in question will have a standard deviation of $4$; +- What would be the power if we took a sample size of $16$? + - $Z_{1-\alpha} = 1.645$ + - $\frac{\mu_a - 30}{\sigma /\sqrt{n}} = 2 / (4 /\sqrt{16}) = 2$ + - $P(Z > 1.645 - 2) = P(Z > -0.355) = 64\%$ + +```r +pnorm(-0.355, lower.tail = FALSE) +``` + +``` +## [1] 0.6387 +``` + + +--- +## Note +- Consider $H_0 : \mu = \mu_0$ and $H_a : \mu > \mu_0$ with $\mu = \mu_a$ under $H_a$. 
+- Under $H_0$ the statistic $Z = \frac{\sqrt{n}(\bar X - \mu_0)}{\sigma}$ is $N(0, 1)$ +- Under $H_a$ $Z$ is $N\left( \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}, 1\right)$ +- We reject if $Z > Z_{1-\alpha}$ + +``` +sigma <- 10; mu_0 = 0; mu_a = 2; n <- 100; alpha = .05 +plot(c(-3, 6),c(0, dnorm(0)), type = "n", frame = FALSE, xlab = "Z value", ylab = "") +xvals <- seq(-3, 6, length = 1000) +lines(xvals, dnorm(xvals), type = "l", lwd = 3) +lines(xvals, dnorm(xvals, mean = sqrt(n) * (mu_a - mu_0) / sigma), lwd =3) +abline(v = qnorm(1 - alpha)) +``` + +--- +![plot of chunk unnamed-chunk-2](assets/fig/unnamed-chunk-2.png) + + + +--- +## Question +- When testing $H_a : \mu > \mu_0$, notice if power is $1 - \beta$, then +$$1 - \beta = P\left(Z > z_{1-\alpha} - \frac{\mu_a - \mu_0}{\sigma /\sqrt{n}} ~|~ \mu = \mu_a \right) = P(Z > z_{\beta})$$ +- This yields the equation +$$z_{1-\alpha} - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = z_{\beta}$$ +- Unknowns: $\mu_a$, $\sigma$, $n$, $\beta$ +- Knowns: $\mu_0$, $\alpha$ +- Specify any 3 of the unknowns and you can solve for the remainder + +--- +## Notes +- The calculation for $H_a:\mu < \mu_0$ is similar +- For $H_a: \mu \neq \mu_0$ calculate the one sided power using + $\alpha / 2$ (this is only approximately right, it excludes the probability of + getting a large TS in the opposite direction of the truth) +- Power goes up as $\alpha$ gets larger +- Power of a one sided test is greater than the power of the + associated two sided test +- Power goes up as $\mu_1$ gets further away from $\mu_0$ +- Power goes up as $n$ goes up +- Power doesn't need $\mu_a$, $\sigma$ and $n$, instead only $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$ + - The quantity $\frac{\mu_a - \mu_0}{\sigma}$ is called the effect size, the difference in the means in standard deviation units. 
+ - Being unit free, it has some hope of interpretability across settings + +--- +## T-test power +- Consider calculating power for a Gossett's $T$ test for our example +- The power is + $$ + P\left(\frac{\bar X - \mu_0}{S /\sqrt{n}} > t_{1-\alpha, n-1} ~|~ \mu = \mu_a \right) + $$ +- Calcuting this requires the non-central t distribution. +- `power.t.test` does this very well + - Omit one of the arguments and it solves for it + +--- +## Example + +```r +power.t.test(n = 16, delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$power +``` + +``` +## [1] 0.604 +``` + +```r +power.t.test(n = 16, delta = 2, sd = 4, type = "one.sample", alt = "one.sided")$power +``` + +``` +## [1] 0.604 +``` + +```r +power.t.test(n = 16, delta = 100, sd = 200, type = "one.sample", alt = "one.sided")$power +``` + +``` +## [1] 0.604 +``` + + +--- +## Example + +```r +power.t.test(power = 0.8, delta = 2/4, sd = 1, type = "one.sample", alt = "one.sided")$n +``` + +``` +## [1] 26.14 +``` + +```r +power.t.test(power = 0.8, delta = 2, sd = 4, type = "one.sample", alt = "one.sided")$n +``` + +``` +## [1] 26.14 +``` + +```r +power.t.test(power = 0.8, delta = 100, sd = 200, type = "one.sample", alt = "one.sided")$n +``` + +``` +## [1] 26.14 +``` + + diff --git a/06_StatisticalInference/03_04_Power/index.pdf b/06_StatisticalInference/03_04_Power/index.pdf index a5daf8abd..b5e3024dc 100644 Binary files a/06_StatisticalInference/03_04_Power/index.pdf and b/06_StatisticalInference/03_04_Power/index.pdf differ diff --git a/06_StatisticalInference/03_05_MultipleTesting/assets/fig/unnamed-chunk-3.png b/06_StatisticalInference/03_05_MultipleTesting/assets/fig/unnamed-chunk-3.png new file mode 100644 index 000000000..556c3a44b Binary files /dev/null and b/06_StatisticalInference/03_05_MultipleTesting/assets/fig/unnamed-chunk-3.png differ diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.Rmd b/06_StatisticalInference/03_05_MultipleTesting/index.Rmd index 6c19901a0..4d5cc68a4 100644 
--- a/06_StatisticalInference/03_05_MultipleTesting/index.Rmd +++ b/06_StatisticalInference/03_05_MultipleTesting/index.Rmd @@ -1,271 +1,253 @@ ---- -title : Multiple testing -subtitle : Statistical Inference -author : Brian Caffo, Jeffrey Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -``` - -## Key ideas - -* Hypothesis testing/significance analysis is commonly overused -* Correcting for multiple testing avoids false positives or discoveries -* Two key components - * Error measure - * Correction - - ---- - -## Three eras of statistics - -__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? - -The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? 
- -__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? - -[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) - ---- - -## Reasons for multiple testing - - - - ---- - -## Why correct for multiple tests? - - - - -[http://xkcd.com/882/](http://xkcd.com/882/) - ---- - -## Why correct for multiple tests? - - - -[http://xkcd.com/882/](http://xkcd.com/882/) - - ---- - -## Types of errors - -Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. -

- - | $\beta=0$ | $\beta\neq0$ | Hypotheses ---------------------|-------------|----------------|--------- -Claim $\beta=0$ | $U$ | $T$ | $m-R$ -Claim $\beta\neq 0$ | $V$ | $S$ | $R$ - Claims | $m_0$ | $m-m_0$ | $m$ - -

- -__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does - -__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't - - ---- - -## Error rates - -__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* - -__Family wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ - -__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ - -* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) - ---- - -## Controlling the false positive rate - -If P-values are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. - -Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. - -Suppose that you call all $P < 0.05$ significant. - -The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. - -__How do we avoid so many false positives?__ - - ---- - -## Controlling family-wise error rate (FWER) - - -The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. - -__Basic idea__: -* Suppose you do $m$ tests -* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ -* Calculate P-values normally -* Set $\alpha_{fwer} = \alpha/m$ -* Call all $P$-values less than $\alpha_{fwer}$ significant - -__Pros__: Easy to calculate, conservative -__Cons__: May be very conservative - - ---- - -## Controlling false discovery rate (FDR) - -This is the most popular correction when performing _lots_ of tests say in genomics, imaging, astronomy, or other signal-processing disciplines. 
- -__Basic idea__: -* Suppose you do $m$ tests -* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right]$ -* Calculate P-values normally -* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$ -* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant - -__Pros__: Still pretty easy to calculate, less conservative (maybe much less) - -__Cons__: Allows for more false positives, may behave strangely under dependence - ---- - -## Example with 10 P-values - - - -Controlling all error rates at $\alpha = 0.20$ - ---- - -## Adjusted P-values - -* One approach is to adjust the threshold $\alpha$ -* A different approach is to calculate "adjusted p-values" -* They _are not p-values_ anymore -* But they can be used directly without adjusting $\alpha$ - -__Example__: -* Suppose P-values are $P_1,\ldots,P_m$ -* You could adjust them by taking $P_i^{fwer} = \max{m \times P_i,1}$ for each P-value. -* Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER. 
- ---- - -## Case study I: no true positives - -```{r createPvals,cache=TRUE} -set.seed(1010093) -pValues <- rep(NA,1000) -for(i in 1:1000){ - y <- rnorm(20) - x <- rnorm(20) - pValues[i] <- summary(lm(y ~ x))$coeff[2,4] -} - -# Controls false positive rate -sum(pValues < 0.05) -``` - ---- - -## Case study I: no true positives - -```{r, dependson="createPvals"} -# Controls FWER -sum(p.adjust(pValues,method="bonferroni") < 0.05) -# Controls FDR -sum(p.adjust(pValues,method="BH") < 0.05) -``` - - ---- - -## Case study II: 50% true positives - -```{r createPvals2,cache=TRUE} -set.seed(1010093) -pValues <- rep(NA,1000) -for(i in 1:1000){ - x <- rnorm(20) - # First 500 beta=0, last 500 beta=2 - if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)} - pValues[i] <- summary(lm(y ~ x))$coeff[2,4] -} -trueStatus <- rep(c("zero","not zero"),each=500) -table(pValues < 0.05, trueStatus) -``` - ---- - - -## Case study II: 50% true positives - -```{r, dependson="createPvals2"} -# Controls FWER -table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus) -# Controls FDR -table(p.adjust(pValues,method="BH") < 0.05,trueStatus) -``` - - ---- - - -## Case study II: 50% true positives - -__P-values versus adjusted P-values__ -```{r, dependson="createPvals2",fig.height=4,fig.width=8} -par(mfrow=c(1,2)) -plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19) -plot(pValues,p.adjust(pValues,method="BH"),pch=19) -``` - - ---- - - -## Notes and resources - -__Notes__: -* Multiple testing is an entire subfield -* A basic Bonferroni/BH correction is usually enough -* If there is strong dependence between tests there may be problems - * Consider method="BY" - -__Further resources__: -* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) -* [Statistical significance for genome-wide 
studies](http://www.pnas.org/content/100/16/9440.full) -* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) - +--- +title : Multiple testing +subtitle : Statistical Inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Key ideas + +* Hypothesis testing/significance analysis is commonly overused +* Correcting for multiple testing avoids false positives or discoveries +* Two key components + * Error measure + * Correction + + +--- + +## Three eras of statistics + +__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? + +The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? + +__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? 
+ +[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) + +--- + +## Reasons for multiple testing + + + + +--- + +## Why correct for multiple tests? + + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + +--- + +## Why correct for multiple tests? + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + + +--- + +## Types of errors + +Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. +

+ + | $\beta=0$ | $\beta\neq0$ | Hypotheses +--------------------|-------------|----------------|--------- +Claim $\beta=0$ | $U$ | $T$ | $m-R$ +Claim $\beta\neq 0$ | $V$ | $S$ | $R$ + Claims | $m_0$ | $m-m_0$ | $m$ + +

+ +__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does + +__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't + + +--- + +## Error rates + +__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* + +__Family wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ + +__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ + +* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) + +--- + +## Controlling the false positive rate + +If P-values are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. + +Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. + +Suppose that you call all $P < 0.05$ significant. + +The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. + +__How do we avoid so many false positives?__ + + +--- + +## Controlling family-wise error rate (FWER) + + +The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. + +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ +* Calculate P-values normally +* Set $\alpha_{fwer} = \alpha/m$ +* Call all $P$-values less than $\alpha_{fwer}$ significant + +__Pros__: Easy to calculate, conservative +__Cons__: May be very conservative + + +--- + +## Controlling false discovery rate (FDR) + +This is the most popular correction when performing _lots_ of tests say in genomics, imaging, astronomy, or other signal-processing disciplines. 
+ +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right]$ +* Calculate P-values normally +* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$ +* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant + +__Pros__: Still pretty easy to calculate, less conservative (maybe much less) + +__Cons__: Allows for more false positives, may behave strangely under dependence + +--- + +## Example with 10 P-values + + + +Controlling all error rates at $\alpha = 0.20$ + +--- + +## Adjusted P-values + +* One approach is to adjust the threshold $\alpha$ +* A different approach is to calculate "adjusted p-values" +* They _are not p-values_ anymore +* But they can be used directly without adjusting $\alpha$ + +__Example__: +* Suppose P-values are $P_1,\ldots,P_m$ +* You could adjust them by taking $P_i^{fwer} = \max{m \times P_i,1}$ for each P-value. +* Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER. 
+ +--- + +## Case study I: no true positives + +```{r createPvals,cache=TRUE} +set.seed(1010093) +pValues <- rep(NA,1000) +for(i in 1:1000){ + y <- rnorm(20) + x <- rnorm(20) + pValues[i] <- summary(lm(y ~ x))$coeff[2,4] +} + +# Controls false positive rate +sum(pValues < 0.05) +``` + +--- + +## Case study I: no true positives + +```{r, dependson="createPvals"} +# Controls FWER +sum(p.adjust(pValues,method="bonferroni") < 0.05) +# Controls FDR +sum(p.adjust(pValues,method="BH") < 0.05) +``` + + +--- + +## Case study II: 50% true positives + +```{r createPvals2,cache=TRUE} +set.seed(1010093) +pValues <- rep(NA,1000) +for(i in 1:1000){ + x <- rnorm(20) + # First 500 beta=0, last 500 beta=2 + if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)} + pValues[i] <- summary(lm(y ~ x))$coeff[2,4] +} +trueStatus <- rep(c("zero","not zero"),each=500) +table(pValues < 0.05, trueStatus) +``` + +--- + + +## Case study II: 50% true positives + +```{r, dependson="createPvals2"} +# Controls FWER +table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus) +# Controls FDR +table(p.adjust(pValues,method="BH") < 0.05,trueStatus) +``` + + +--- + + +## Case study II: 50% true positives + +__P-values versus adjusted P-values__ +```{r, dependson="createPvals2",fig.height=4,fig.width=8} +par(mfrow=c(1,2)) +plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19) +plot(pValues,p.adjust(pValues,method="BH"),pch=19) +``` + + +--- + + +## Notes and resources + +__Notes__: +* Multiple testing is an entire subfield +* A basic Bonferroni/BH correction is usually enough +* If there is strong dependence between tests there may be problems + * Consider method="BY" + +__Further resources__: +* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) +* [Statistical significance for genome-wide 
studies](http://www.pnas.org/content/100/16/9440.full) +* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) + diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.html b/06_StatisticalInference/03_05_MultipleTesting/index.html index bb3a271db..dfc498eeb 100644 --- a/06_StatisticalInference/03_05_MultipleTesting/index.html +++ b/06_StatisticalInference/03_05_MultipleTesting/index.html @@ -1,581 +1,585 @@ - - - - Multiple testing - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Multiple testing

-

Statistical Inference

-

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

Key ideas

-
-
-
    -
  • Hypothesis testing/significance analysis is commonly overused
  • -
  • Correcting for multiple testing avoids false positives or discoveries
  • -
  • Two key components - -
      -
    • Error measure
    • -
    • Correction
    • -
  • -
- -
- -
- - -
-

Three eras of statistics

-
-
-

The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?

- -

The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple: Is treatment A better than treatment B?

- -

The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?

- -

http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf

- -
- -
- - -
-

Reasons for multiple testing

-
-
-

- -
- -
- - -
-

Why correct for multiple tests?

-
- - -
- - -
-

Why correct for multiple tests?

-
- - -
- - -
-

Types of errors

-
-
-

Suppose you are testing a hypothesis that a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes. -

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
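                      | \(\beta=0\) | \(\beta\neq0\) | Hypotheses
----------------------|-------------|----------------|-----------
Claim \(\beta=0\)     | \(U\)       | \(T\)          | \(m-R\)
Claim \(\beta\neq 0\) | \(V\)       | \(S\)          | \(R\)
Claims                | \(m_0\)     | \(m-m_0\)      | \(m\)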
- -



- -

Type I error or false positive (\(V\)): saying that the parameter does not equal zero when it actually does

- -

Type II error or false negative (\(T\)): saying that the parameter equals zero when it actually does not

- -
- -
- - -
-

Error rates

-
-
-

False positive rate - The rate at which false results (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)*

- -

Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)

- -

False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)

- - - -
- -
- - -
-

Controlling the false positive rate

-
-
-

If P-values are correctly calculated, calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

- -

Problem: Suppose that you perform 10,000 tests and \(\beta = 0\) for all of them.

- -

Suppose that you call all \(P < 0.05\) significant.

- -

The expected number of false positives is: \(10,000 \times 0.05 = 500\) false positives.

- -

How do we avoid so many false positives?

- -
- -
- - -
-

Controlling family-wise error rate (FWER)

-
-
-

The Bonferroni correction is the oldest multiple testing correction.

- -

Basic idea:

- -
    -
  • Suppose you do \(m\) tests
  • -
  • You want to control FWER at level \(\alpha\) so \(Pr(V \geq 1) < \alpha\)
  • -
  • Calculate P-values normally
  • -
  • Set \(\alpha_{fwer} = \alpha/m\)
  • -
  • Call all \(P\)-values less than \(\alpha_{fwer}\) significant
  • -
- -

Pros: Easy to calculate, conservative
-Cons: May be very conservative

- -
- -
- - -
-

Controlling false discovery rate (FDR)

-
-
-

This is the most popular correction when performing lots of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.

- -

Basic idea:

- -
    -
  • Suppose you do \(m\) tests
  • -
  • You want to control FDR at level \(\alpha\) so \(E\left[\frac{V}{R}\right] \leq \alpha\)
  • -
  • Calculate P-values normally
  • -
  • Order the P-values from smallest to largest \(P_{(1)},...,P_{(m)}\)
  • -
  • Call any \(P_{(i)} \leq \alpha \times \frac{i}{m}\) significant
  • -
- -

Pros: Still pretty easy to calculate, less conservative (maybe much less)

- -

Cons: Allows for more false positives, may behave strangely under dependence

- -
- -
- - -
-

Example with 10 P-values

-
-
-

- -

Controlling all error rates at \(\alpha = 0.20\)

- -
- -
- - -
-

Adjusted P-values

-
-
-
    -
  • One approach is to adjust the threshold \(\alpha\)
  • -
  • A different approach is to calculate "adjusted p-values"
  • -
  • They are not p-values anymore
  • -
  • But they can be used directly without adjusting \(\alpha\)
  • -
- -

Example:

- -
    -
  • Suppose P-values are \(P_1,\ldots,P_m\)
  • -
  • You could adjust them by taking \(P_i^{fwer} = \min(m \times P_i, 1)\) for each P-value.
  • -
  • Then if you call all \(P_i^{fwer} < \alpha\) significant you will control the FWER.
  • -
- -
- -
- - -
-

Case study I: no true positives

-
-
-
set.seed(1010093)
-pValues <- rep(NA,1000)
-for(i in 1:1000){
-  y <- rnorm(20)
-  x <- rnorm(20)
-  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
-}
-
-# Controls false positive rate
-sum(pValues < 0.05)
-
- -
[1] 51
-
- -
- -
- - -
-

Case study I: no true positives

-
-
-
# Controls FWER 
-sum(p.adjust(pValues,method="bonferroni") < 0.05)
-
- -
[1] 0
-
- -
# Controls FDR 
-sum(p.adjust(pValues,method="BH") < 0.05)
-
- -
[1] 0
-
- -
- -
- - -
-

Case study II: 50% true positives

-
-
-
set.seed(1010093)
-pValues <- rep(NA,1000)
-for(i in 1:1000){
-  x <- rnorm(20)
-  # First 500 beta=0, last 500 beta=2
-  if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
-  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
-}
-trueStatus <- rep(c("zero","not zero"),each=500)
-table(pValues < 0.05, trueStatus)
-
- -
       trueStatus
-        not zero zero
-  FALSE        0  476
-  TRUE       500   24
-
- -
- -
- - -
-

Case study II: 50% true positives

-
-
-
# Controls FWER 
-table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
-
- -
       trueStatus
-        not zero zero
-  FALSE       23  500
-  TRUE       477    0
-
- -
# Controls FDR 
-table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
-
- -
       trueStatus
-        not zero zero
-  FALSE        0  487
-  TRUE       500   13
-
- -
- -
- - -
-

Case study II: 50% true positives

-
-
-

P-values versus adjusted P-values

- -
par(mfrow=c(1,2))
-plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
-plot(pValues,p.adjust(pValues,method="BH"),pch=19)
-
- -
plot of chunk unnamed-chunk-3
- -
- -
- - -
-

Notes and resources

-
- - -
- - -
- - - - - - - - - - - - - - + + + + Multiple testing + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Multiple testing

+

Statistical Inference

+

Brian Caffo, Jeffrey Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

Key ideas

+
+
+
    +
  • Hypothesis testing/significance analysis is commonly overused
  • +
  • Correcting for multiple testing avoids false positives or discoveries
  • +
  • Two key components + +
      +
    • Error measure
    • +
    • Correction
    • +
  • +
+ +
+ +
+ + +
+

Three eras of statistics

+
+
+

The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?

+ +

The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple: Is treatment A better than treatment B?

+ +

The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?

+ +

http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf

+ +
+ +
+ + +
+

Reasons for multiple testing

+
+
+

+ +
+ +
+ + +
+

Why correct for multiple tests?

+
+ + +
+ + +
+

Why correct for multiple tests?

+
+ + +
+ + +
+

Types of errors

+
+
+

Suppose you are testing a hypothesis that a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
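                      | \(\beta=0\) | \(\beta\neq0\) | Hypotheses
----------------------|-------------|----------------|-----------
Claim \(\beta=0\)     | \(U\)       | \(T\)          | \(m-R\)
Claim \(\beta\neq 0\) | \(V\)       | \(S\)          | \(R\)
Claims                | \(m_0\)     | \(m-m_0\)      | \(m\)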
+ +



+ +

Type I error or false positive (\(V\)): saying that the parameter does not equal zero when it actually does

+ +

Type II error or false negative (\(T\)): saying that the parameter equals zero when it actually does not

+ +
+ +
+ + +
+

Error rates

+
+
+

False positive rate - The rate at which false results (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)*

+ +

Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)

+ +

False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)

+ + + +
+ +
+ + +
+

Controlling the false positive rate

+
+
+

If P-values are correctly calculated, calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.

+ +

Problem: Suppose that you perform 10,000 tests and \(\beta = 0\) for all of them.

+ +

Suppose that you call all \(P < 0.05\) significant.

+ +

The expected number of false positives is: \(10,000 \times 0.05 = 500\) false positives.

+ +

How do we avoid so many false positives?
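
The arithmetic on this slide can be checked with a quick simulation. This is only a sketch: it assumes the 10,000 null P-values are independent and uniform on (0, 1), which is how correctly calculated P-values behave when \(\beta = 0\). The names `m`, `alpha` and `nullP` are illustrative and do not come from the slides.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
m     <- 10000                   # number of tests, all with beta = 0
alpha <- 0.05

# Expected number of false positives when every P < alpha is called significant
expectedFP <- m * alpha

# Assumption: null P-values are independent Uniform(0, 1) draws
nullP      <- runif(m)
observedFP <- sum(nullP < alpha)

c(expected = expectedFP, observed = observedFP)
```

The observed count should land near 500 and varies from seed to seed.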

+ +
+ +
+ + +
+

Controlling family-wise error rate (FWER)

+
+
+

The Bonferroni correction is the oldest multiple testing correction.

+ +

Basic idea:

+ +
    +
  • Suppose you do \(m\) tests
  • +
  • You want to control FWER at level \(\alpha\) so \(Pr(V \geq 1) < \alpha\)
  • +
  • Calculate P-values normally
  • +
  • Set \(\alpha_{fwer} = \alpha/m\)
  • +
  • Call all \(P\)-values less than \(\alpha_{fwer}\) significant
  • +
+ +

Pros: Easy to calculate, conservative
+Cons: May be very conservative
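
To see what the Bonferroni threshold buys, the family-wise error rate can be estimated by simulation. The sketch below assumes every null is true and that the P-values within a family are independent and uniform on (0, 1). The family size `m = 20`, the number of simulated families `nSim`, and the helper `anyFalsePositive` are all made up for illustration.

```r
set.seed(2)                      # arbitrary seed
m     <- 20                      # tests per family (illustrative)
alpha <- 0.05
nSim  <- 2000                    # number of simulated families

# Assumption: all m nulls are true and P-values are independent Uniform(0, 1)
anyFalsePositive <- function(threshold) {
  p <- runif(m)
  any(p < threshold)             # TRUE if V >= 1 in this family
}

fwerRaw  <- mean(replicate(nSim, anyFalsePositive(alpha)))      # no correction
fwerBonf <- mean(replicate(nSim, anyFalsePositive(alpha / m)))  # alpha_fwer = alpha / m

c(uncorrected = fwerRaw, bonferroni = fwerBonf)
```

Under these assumptions the uncorrected estimate should sit near \(1 - 0.95^{20} \approx 0.64\), while the Bonferroni estimate should stay close to or below \(\alpha\).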

+ +
+ +
+ + +
+

Controlling false discovery rate (FDR)

+
+
+

This is the most popular correction when performing lots of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.

+ +

Basic idea:

+ +
    +
  • Suppose you do \(m\) tests
  • +
  • You want to control FDR at level \(\alpha\) so \(E\left[\frac{V}{R}\right] \leq \alpha\)
  • +
  • Calculate P-values normally
  • +
  • Order the P-values from smallest to largest \(P_{(1)},...,P_{(m)}\)
  • +
  • Call any \(P_{(i)} \leq \alpha \times \frac{i}{m}\) significant
  • +
+ +

Pros: Still pretty easy to calculate, less conservative (maybe much less)

+ +

Cons: Allows for more false positives, may behave strangely under dependence
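
The steps above can be sketched by hand and compared with `p.adjust`. One detail to be aware of: the procedure as implemented in `p.adjust(method = "BH")` is a step-up rule, so it finds the largest rank \(i\) with \(P_{(i)} \leq \alpha \times \frac{i}{m}\) and calls that P-value and every smaller one significant. The helper `bhSignificant` and the ten P-values below are hypothetical and only for illustration.

```r
# Hypothetical helper, not part of base R: returns TRUE for P-values called
# significant by the BH step-up rule at level alpha.
bhSignificant <- function(p, alpha) {
  m      <- length(p)
  ord    <- order(p)                          # indices that sort p ascending
  passes <- p[ord] <= alpha * seq_len(m) / m  # P_(i) <= alpha * i / m
  k      <- if (any(passes)) max(which(passes)) else 0
  out    <- rep(FALSE, m)
  if (k > 0) out[ord[seq_len(k)]] <- TRUE     # reject the k smallest P-values
  out
}

# Ten made-up P-values, for illustration only
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
manual  <- bhSignificant(p, alpha = 0.20)
builtin <- p.adjust(p, method = "BH") <= 0.20
cbind(p, manual, builtin)
```

With these values the seven smallest P-values are called significant at \(\alpha = 0.20\), and the hand-rolled rule agrees with the built-in adjustment.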

+ +
+ +
+ + +
+

Example with 10 P-values

+
+
+

+ +

Controlling all error rates at \(\alpha = 0.20\)

+ +
+ +
+ + +
+

Adjusted P-values

+
+
+
    +
  • One approach is to adjust the threshold \(\alpha\)
  • +
  • A different approach is to calculate "adjusted p-values"
  • +
  • They are not p-values anymore
  • +
  • But they can be used directly without adjusting \(\alpha\)
  • +
+ +

Example:

+ +
    +
  • Suppose P-values are \(P_1,\ldots,P_m\)
  • +
  • You could adjust them by taking \(P_i^{fwer} = \min(m \times P_i, 1)\) for each P-value.
  • +
  • Then if you call all \(P_i^{fwer} < \alpha\) significant you will control the FWER.
  • +
+ +
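
As a small check of the example above: a Bonferroni-adjusted P-value is \(m \times P_i\) capped at 1, that is, the smaller of the two, which is `pmin` in R. The sketch below uses made-up P-values and illustrative variable names. It compares the hand calculation with `p.adjust` and confirms that thresholding adjusted P-values at \(\alpha\) gives the same calls as thresholding raw P-values at \(\alpha / m\).

```r
set.seed(3)                         # arbitrary seed
p     <- runif(8)^2                 # made-up P-values, for illustration only
m     <- length(p)
alpha <- 0.05

pFwer <- pmin(m * p, 1)             # Bonferroni-adjusted P-values, capped at 1

# Same values as the built-in adjustment
sameValues <- isTRUE(all.equal(pFwer, p.adjust(p, method = "bonferroni")))

# Adjusted P-values vs alpha gives the same calls as raw P-values vs alpha / m
sameCalls <- identical(pFwer < alpha, p < alpha / m)

c(sameValues = sameValues, sameCalls = sameCalls)
```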
+ +
+ + +
+

Case study I: no true positives

+
+
+
set.seed(1010093)
+pValues <- rep(NA, 1000)
+for (i in 1:1000) {
+    y <- rnorm(20)
+    x <- rnorm(20)
+    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
+}
+
+# Controls false positive rate
+sum(pValues < 0.05)
+
+ +
## [1] 51
+
+ +
+ +
+ + +
+

Case study I: no true positives

+
+
+
# Controls FWER
+sum(p.adjust(pValues, method = "bonferroni") < 0.05)
+
+ +
## [1] 0
+
+ +
# Controls FDR
+sum(p.adjust(pValues, method = "BH") < 0.05)
+
+ +
## [1] 0
+
+ +
+ +
+ + +
+

Case study II: 50% true positives

+
+
+
set.seed(1010093)
+pValues <- rep(NA, 1000)
+for (i in 1:1000) {
+    x <- rnorm(20)
+    # First 500 beta=0, last 500 beta=2
+    if (i <= 500) {
+        y <- rnorm(20)
+    } else {
+        y <- rnorm(20, mean = 2 * x)
+    }
+    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
+}
+trueStatus <- rep(c("zero", "not zero"), each = 500)
+table(pValues < 0.05, trueStatus)
+
+ +
##        trueStatus
+##         not zero zero
+##   FALSE        0  476
+##   TRUE       500   24
+
+ +
+ +
+ + +
+

Case study II: 50% true positives

+
+
+
# Controls FWER
+table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus)
+
+ +
##        trueStatus
+##         not zero zero
+##   FALSE       23  500
+##   TRUE       477    0
+
+ +
# Controls FDR
+table(p.adjust(pValues, method = "BH") < 0.05, trueStatus)
+
+ +
##        trueStatus
+##         not zero zero
+##   FALSE        0  487
+##   TRUE       500   13
+
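
The counts in the tables above can be turned into the quantities from the error-rate slide. The sketch below copies the counts by hand from the printed output, so the variable names are illustrative. It reports the observed proportion \(V/R\) from this single simulation, whereas the FDR is the expectation of that proportion.

```r
# Counts copied from the tables above
falsePos <- c(uncorrected = 24,  bonferroni = 0,   BH = 13)   # V: beta = 0, called significant
truePos  <- c(uncorrected = 500, bonferroni = 477, BH = 500)  # S: beta != 0, called significant
falseNeg <- 500 - truePos                                     # T: beta != 0, missed

discoveries <- falsePos + truePos                             # R = V + S
fdp         <- falsePos / discoveries                         # observed V / R

round(rbind(V = falsePos, R = discoveries, FDP = fdp, T = falseNeg), 3)
```

In this run Bonferroni makes no false discoveries but misses 23 true effects. BH finds all 500 true effects at the cost of 13 false discoveries out of 513, roughly 2.5%.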
+ +
+ +
+ + +
+

Case study II: 50% true positives

+
+
+

P-values versus adjusted P-values

+ +
par(mfrow = c(1, 2))
+plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
+plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)
+
+ +

plot of chunk unnamed-chunk-3

+ +
+ +
+ + +
+

Notes and resources

+
+ + +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.md b/06_StatisticalInference/03_05_MultipleTesting/index.md index 317eae1a7..08f1afa2f 100644 --- a/06_StatisticalInference/03_05_MultipleTesting/index.md +++ b/06_StatisticalInference/03_05_MultipleTesting/index.md @@ -1,309 +1,308 @@ ---- -title : Multiple testing -subtitle : Statistical Inference -author : Brian Caffo, Jeffrey Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - - - -## Key ideas - -* Hypothesis testing/significance analysis is commonly overused -* Correcting for multiple testing avoids false positives or discoveries -* Two key components - * Error measure - * Correction - - ---- - -## Three eras of statistics - -__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? - -The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? - -__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. 
But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? - -[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) - ---- - -## Reasons for multiple testing - - - - ---- - -## Why correct for multiple tests? - - - - -[http://xkcd.com/882/](http://xkcd.com/882/) - ---- - -## Why correct for multiple tests? - - - -[http://xkcd.com/882/](http://xkcd.com/882/) - - ---- - -## Types of errors - -Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. -

- - | $\beta=0$ | $\beta\neq0$ | Hypotheses ---------------------|-------------|----------------|--------- -Claim $\beta=0$ | $U$ | $T$ | $m-R$ -Claim $\beta\neq 0$ | $V$ | $S$ | $R$ - Claims | $m_0$ | $m-m_0$ | $m$ - -

- -__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does - -__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't - - ---- - -## Error rates - -__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* - -__Family wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ - -__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ - -* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) - ---- - -## Controlling the false positive rate - -If P-values are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. - -Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. - -Suppose that you call all $P < 0.05$ significant. - -The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. - -__How do we avoid so many false positives?__ - - ---- - -## Controlling family-wise error rate (FWER) - - -The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. - -__Basic idea__: -* Suppose you do $m$ tests -* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ -* Calculate P-values normally -* Set $\alpha_{fwer} = \alpha/m$ -* Call all $P$-values less than $\alpha_{fwer}$ significant - -__Pros__: Easy to calculate, conservative -__Cons__: May be very conservative - - ---- - -## Controlling false discovery rate (FDR) - -This is the most popular correction when performing _lots_ of tests say in genomics, imaging, astronomy, or other signal-processing disciplines. 
-
-__Basic idea__:
-* Suppose you do $m$ tests
-* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right] \leq \alpha$
-* Calculate P-values normally
-* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$
-* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant
-
-__Pros__: Still pretty easy to calculate, less conservative (maybe much less)
-
-__Cons__: Allows for more false positives, may behave strangely under dependence
-
----
-
-## Example with 10 P-values
-
-
-
-Controlling all error rates at $\alpha = 0.20$
-
----
-
-## Adjusted P-values
-
-* One approach is to adjust the threshold $\alpha$
-* A different approach is to calculate "adjusted p-values"
-* They _are not p-values_ anymore
-* But they can be used directly without adjusting $\alpha$
-
-__Example__:
-* Suppose P-values are $P_1,\ldots,P_m$
-* You could adjust them by taking $P_i^{fwer} = \min(m \times P_i, 1)$ for each P-value.
-* Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER.
- ---- - -## Case study I: no true positives - - -```r -set.seed(1010093) -pValues <- rep(NA,1000) -for(i in 1:1000){ - y <- rnorm(20) - x <- rnorm(20) - pValues[i] <- summary(lm(y ~ x))$coeff[2,4] -} - -# Controls false positive rate -sum(pValues < 0.05) -``` - -``` -[1] 51 -``` - - ---- - -## Case study I: no true positives - - -```r -# Controls FWER -sum(p.adjust(pValues,method="bonferroni") < 0.05) -``` - -``` -[1] 0 -``` - -```r -# Controls FDR -sum(p.adjust(pValues,method="BH") < 0.05) -``` - -``` -[1] 0 -``` - - - ---- - -## Case study II: 50% true positives - - -```r -set.seed(1010093) -pValues <- rep(NA,1000) -for(i in 1:1000){ - x <- rnorm(20) - # First 500 beta=0, last 500 beta=2 - if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)} - pValues[i] <- summary(lm(y ~ x))$coeff[2,4] -} -trueStatus <- rep(c("zero","not zero"),each=500) -table(pValues < 0.05, trueStatus) -``` - -``` - trueStatus - not zero zero - FALSE 0 476 - TRUE 500 24 -``` - - ---- - - -## Case study II: 50% true positives - - -```r -# Controls FWER -table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus) -``` - -``` - trueStatus - not zero zero - FALSE 23 500 - TRUE 477 0 -``` - -```r -# Controls FDR -table(p.adjust(pValues,method="BH") < 0.05,trueStatus) -``` - -``` - trueStatus - not zero zero - FALSE 0 487 - TRUE 500 13 -``` - - - ---- - - -## Case study II: 50% true positives - -__P-values versus adjusted P-values__ - -```r -par(mfrow=c(1,2)) -plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19) -plot(pValues,p.adjust(pValues,method="BH"),pch=19) -``` - -
plot of chunk unnamed-chunk-3
- - - ---- - - -## Notes and resources - -__Notes__: -* Multiple testing is an entire subfield -* A basic Bonferroni/BH correction is usually enough -* If there is strong dependence between tests there may be problems - * Consider method="BY" - -__Further resources__: -* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) -* [Statistical significance for genome-wide studies](http://www.pnas.org/content/100/16/9440.full) -* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) - +--- +title : Multiple testing +subtitle : Statistical Inference +author : Brian Caffo, Jeffrey Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} +--- +## Key ideas + +* Hypothesis testing/significance analysis is commonly overused +* Correcting for multiple testing avoids false positives or discoveries +* Two key components + * Error measure + * Correction + + +--- + +## Three eras of statistics + +__The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions__: Are there more male than female births? Is the rate of insanity rising? + +The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who __developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment__. The questions dealt with still tended to be simple Is treatment A better than treatment B? 
+ +__The era of scientific mass production__, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information? + +[http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf](http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf) + +--- + +## Reasons for multiple testing + + + + +--- + +## Why correct for multiple tests? + + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + +--- + +## Why correct for multiple tests? + + + +[http://xkcd.com/882/](http://xkcd.com/882/) + + +--- + +## Types of errors + +Suppose you are testing a hypothesis that a parameter $\beta$ equals zero versus the alternative that it does not equal zero. These are the possible outcomes. +

+ + | $\beta=0$ | $\beta\neq0$ | Hypotheses +--------------------|-------------|----------------|--------- +Claim $\beta=0$ | $U$ | $T$ | $m-R$ +Claim $\beta\neq 0$ | $V$ | $S$ | $R$ + Claims | $m_0$ | $m-m_0$ | $m$ + +

+ +__Type I error or false positive ($V$)__ Say that the parameter does not equal zero when it does + +__Type II error or false negative ($T$)__ Say that the parameter equals zero when it doesn't + + +--- + +## Error rates + +__False positive rate__ - The rate at which false results ($\beta = 0$) are called significant: $E\left[\frac{V}{m_0}\right]$* + +__Family wise error rate (FWER)__ - The probability of at least one false positive ${\rm Pr}(V \geq 1)$ + +__False discovery rate (FDR)__ - The rate at which claims of significance are false $E\left[\frac{V}{R}\right]$ + +* The false positive rate is closely related to the type I error rate [http://en.wikipedia.org/wiki/False_positive_rate](http://en.wikipedia.org/wiki/False_positive_rate) + +--- + +## Controlling the false positive rate + +If P-values are correctly calculated calling all $P < \alpha$ significant will control the false positive rate at level $\alpha$ on average. + +Problem: Suppose that you perform 10,000 tests and $\beta = 0$ for all of them. + +Suppose that you call all $P < 0.05$ significant. + +The expected number of false positives is: $10,000 \times 0.05 = 500$ false positives. + +__How do we avoid so many false positives?__ + + +--- + +## Controlling family-wise error rate (FWER) + + +The [Bonferroni correction](http://en.wikipedia.org/wiki/Bonferroni_correction) is the oldest multiple testing correction. + +__Basic idea__: +* Suppose you do $m$ tests +* You want to control FWER at level $\alpha$ so $Pr(V \geq 1) < \alpha$ +* Calculate P-values normally +* Set $\alpha_{fwer} = \alpha/m$ +* Call all $P$-values less than $\alpha_{fwer}$ significant + +__Pros__: Easy to calculate, conservative +__Cons__: May be very conservative + + +--- + +## Controlling false discovery rate (FDR) + +This is the most popular correction when performing _lots_ of tests say in genomics, imaging, astronomy, or other signal-processing disciplines. 
+
+__Basic idea__:
+* Suppose you do $m$ tests
+* You want to control FDR at level $\alpha$ so $E\left[\frac{V}{R}\right] \leq \alpha$
+* Calculate P-values normally
+* Order the P-values from smallest to largest $P_{(1)},...,P_{(m)}$
+* Call any $P_{(i)} \leq \alpha \times \frac{i}{m}$ significant
+
+__Pros__: Still pretty easy to calculate, less conservative (maybe much less)
+
+__Cons__: Allows for more false positives, may behave strangely under dependence
+
+---
+
+## Example with 10 P-values
+
+
+
+Controlling all error rates at $\alpha = 0.20$
+
+---
+
+## Adjusted P-values
+
+* One approach is to adjust the threshold $\alpha$
+* A different approach is to calculate "adjusted p-values"
+* They _are not p-values_ anymore
+* But they can be used directly without adjusting $\alpha$
+
+__Example__:
+* Suppose P-values are $P_1,\ldots,P_m$
+* You could adjust them by taking $P_i^{fwer} = \min(m \times P_i, 1)$ for each P-value.
+* Then if you call all $P_i^{fwer} < \alpha$ significant you will control the FWER.
+ +--- + +## Case study I: no true positives + + +```r +set.seed(1010093) +pValues <- rep(NA, 1000) +for (i in 1:1000) { + y <- rnorm(20) + x <- rnorm(20) + pValues[i] <- summary(lm(y ~ x))$coeff[2, 4] +} + +# Controls false positive rate +sum(pValues < 0.05) +``` + +``` +## [1] 51 +``` + + +--- + +## Case study I: no true positives + + +```r +# Controls FWER +sum(p.adjust(pValues, method = "bonferroni") < 0.05) +``` + +``` +## [1] 0 +``` + +```r +# Controls FDR +sum(p.adjust(pValues, method = "BH") < 0.05) +``` + +``` +## [1] 0 +``` + + + +--- + +## Case study II: 50% true positives + + +```r +set.seed(1010093) +pValues <- rep(NA, 1000) +for (i in 1:1000) { + x <- rnorm(20) + # First 500 beta=0, last 500 beta=2 + if (i <= 500) { + y <- rnorm(20) + } else { + y <- rnorm(20, mean = 2 * x) + } + pValues[i] <- summary(lm(y ~ x))$coeff[2, 4] +} +trueStatus <- rep(c("zero", "not zero"), each = 500) +table(pValues < 0.05, trueStatus) +``` + +``` +## trueStatus +## not zero zero +## FALSE 0 476 +## TRUE 500 24 +``` + + +--- + + +## Case study II: 50% true positives + + +```r +# Controls FWER +table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus) +``` + +``` +## trueStatus +## not zero zero +## FALSE 23 500 +## TRUE 477 0 +``` + +```r +# Controls FDR +table(p.adjust(pValues, method = "BH") < 0.05, trueStatus) +``` + +``` +## trueStatus +## not zero zero +## FALSE 0 487 +## TRUE 500 13 +``` + + + +--- + + +## Case study II: 50% true positives + +__P-values versus adjusted P-values__ + +```r +par(mfrow = c(1, 2)) +plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19) +plot(pValues, p.adjust(pValues, method = "BH"), pch = 19) +``` + +![plot of chunk unnamed-chunk-3](assets/fig/unnamed-chunk-3.png) + + + +--- + + +## Notes and resources + +__Notes__: +* Multiple testing is an entire subfield +* A basic Bonferroni/BH correction is usually enough +* If there is strong dependence between tests there may be problems + * Consider method="BY" + +__Further 
resources__: +* [Multiple testing procedures with applications to genomics](http://www.amazon.com/Multiple-Procedures-Applications-Genomics-Statistics/dp/0387493166/ref=sr_1_2/102-3292576-129059?ie=UTF8&s=books&qid=1187394873&sr=1-2) +* [Statistical significance for genome-wide studies](http://www.pnas.org/content/100/16/9440.full) +* [Introduction to multiple testing](http://ies.ed.gov/ncee/pubs/20084018/app_b.asp) + diff --git a/06_StatisticalInference/03_05_MultipleTesting/index.pdf b/06_StatisticalInference/03_05_MultipleTesting/index.pdf index 190c24c34..88d17ad14 100644 Binary files a/06_StatisticalInference/03_05_MultipleTesting/index.pdf and b/06_StatisticalInference/03_05_MultipleTesting/index.pdf differ diff --git a/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-4.png b/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-4.png new file mode 100644 index 000000000..d88aaaddd Binary files /dev/null and b/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-4.png differ diff --git a/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-5.png b/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-5.png new file mode 100644 index 000000000..57f92bdae Binary files /dev/null and b/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-5.png differ diff --git a/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-7.png b/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-7.png new file mode 100644 index 000000000..8542ea3a4 Binary files /dev/null and b/06_StatisticalInference/03_06_resampledInference/assets/fig/unnamed-chunk-7.png differ diff --git a/06_StatisticalInference/03_06_resampledInference/index.Rmd b/06_StatisticalInference/03_06_resampledInference/index.Rmd index 775099c76..3f25e8c38 100644 --- a/06_StatisticalInference/03_06_resampledInference/index.Rmd +++ 
b/06_StatisticalInference/03_06_resampledInference/index.Rmd @@ -1,259 +1,243 @@ ---- -title : Resampled inference -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} - ---- -```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} -# make this an external chunk that can be included in any file -options(width = 100) -opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') - -options(xtable.type = 'html') -knit_hooks$set(inline = function(x) { - if(is.numeric(x)) { - round(x, getOption('digits')) - } else { - paste(as.character(x), collapse = ', ') - } -}) -knit_hooks$set(plot = knitr:::hook_plot_html) -runif(1) -``` - -## The jackknife - -- The jackknife is a tool for estimating standard errors and the bias of estimators -- As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools -- Both the jackknife and the bootstrap involve *resampling* data; that is, repeatedly creating new data sets from the original data - ---- - -## The jackknife - -- The jackknife deletes each observation and calculates an estimate based on the remaining $n-1$ of them -- It uses this collection of estimates to do things like estimate the bias and the standard error -- Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means and what their standard errors are - 
---- - -## The jackknife - -- We'll consider the jackknife for univariate data -- Let $X_1,\ldots,X_n$ be a collection of data used to estimate a parameter $\theta$ -- Let $\hat \theta$ be the estimate based on the full data set -- Let $\hat \theta_{i}$ be the estimate of $\theta$ obtained by *deleting observation $i$* -- Let $\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}$ - ---- - -## Continued - -- Then, the jackknife estimate of the bias is - $$ - (n - 1) \left(\bar \theta - \hat \theta\right) - $$ - (how far the average delete-one estimate is from the actual estimate) -- The jackknife estimate of the standard error is - $$ - \left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} - $$ -(the deviance of the delete-one estimates from the average delete-one estimate) - ---- - -## Example -### We want to estimate the bias and standard error of the median - -```{r, results='hide'} -library(UsingR) -data(father.son) -x <- father.son$sheight -n <- length(x) -theta <- median(x) -jk <- sapply(1 : n, - function(i) median(x[-i]) - ) -thetaBar <- mean(jk) -biasEst <- (n - 1) * (thetaBar - theta) -seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2)) -``` - ---- - -## Example - -```{r} -c(biasEst, seEst) -library(bootstrap) -temp <- jackknife(x, median) -c(temp$jack.bias, temp$jack.se) -``` - ---- - -## Example - -- Both methods (of course) yield an estimated bias of `r temp$jack.bias` and a se of `r temp$jack.se` -- Odd little fact: the jackknife estimate of the bias for the median is always $0$ when the number of observations is even -- It has been shown that the jackknife is a linear approximation to the bootstrap -- Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties - ---- - -## Pseudo observations - -- Another interesting way to think about the jackknife uses pseudo observations -- Let -$$ - \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} -$$ -- Think of these as 
``whatever observation $i$ contributes to the estimate of $\theta$'' -- Note when $\hat \theta$ is the sample mean, the pseudo observations are the data themselves -- Then the sample standard error of these observations is the previous jackknife estimated standard error. -- The mean of these observations is a bias-corrected estimate of $\theta$ - ---- - -## The bootstrap - -- The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics -- For example, how would one derive a confidence interval for the median? -- The bootstrap procedure follows from the so called bootstrap principle - ---- - -## The bootstrap principle - -- Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution -- The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution - ---- - -## The bootstrap in practice - -- In practice, the bootstrap principle is always carried out using simulation -- We will cover only a few aspects of bootstrap resampling -- The general procedure follows by first simulating complete data sets from the observed data with replacement - - - This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution - -- Calculate the statistic for each simulated data set -- Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error - ---- -## Nonparametric bootstrap algorithm example - -- Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations - - i. Sample $n$ observations **with replacement** from the observed data resulting in one simulated complete data set - - ii. Take the median of the simulated data set - - iii. 
Repeat these two steps $B$ times, resulting in $B$ simulated medians - - iv. These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can - - - Draw a histogram of them - - Calculate their standard deviation to estimate the standard error of the median - - Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median - ---- - -## Example code - -```{r} -B <- 1000 -resamples <- matrix(sample(x, - n * B, - replace = TRUE), - B, n) -medians <- apply(resamples, 1, median) -sd(medians) -quantile(medians, c(.025, .975)) -``` - ---- -## Histogram of bootstrap resamples - -```{r, fig.height=5, fig.width=5} -hist(medians) -``` - ---- - -## Notes on the bootstrap - -- The bootstrap is non-parametric -- Better percentile bootstrap confidence intervals correct for bias -- There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap"" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information - - ---- -## Group comparisons -- Consider comparing two independent groups. 
-- Example, comparing sprays B and C - -```{r, fig.height=4, fig.width=4} -data(InsectSprays) -boxplot(count ~ spray, data = InsectSprays) -``` - ---- -## Permutation tests -- Consider the null hypothesis that the distribution of the observations from each group is the same -- Then, the group labels are irrelevant -- We then discard the group levels and permute the combined data -- Split the permuted data into two groups with $n_A$ and $n_B$ - observations (say by always treating the first $n_A$ observations as - the first group) -- Evaluate the probability of getting a statistic as large or - large than the one observed -- An example statistic would be the difference in the averages between the two groups; - one could also use a t-statistic - ---- -## Variations on permutation testing -Data type | Statistic | Test name ----|---|---| -Ranks | rank sum | rank sum test -Binary | hypergeometric prob | Fisher's exact test -Raw data | | ordinary permutation test - -- Also, so-called *randomization tests* are exactly permutation tests, with a different motivation. 
-- For matched data, one can randomize the signs - - For ranks, this results in the signed rank test -- Permutation strategies work for regression as well - - Permuting a regressor of interest -- Permutation tests work very well in multivariate settings - ---- -## Permutation test for pesticide data -```{r} -subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),] -y <- subdata$count -group <- as.character(subdata$spray) -testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"]) -observedStat <- testStat(y, group) -permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group))) -observedStat -mean(permutations > observedStat) -``` - ---- -## Histogram of permutations -```{r, echo= FALSE, fig.width=5, fig.height=5} -hist(permutations) -``` - - +--- +title : Resampled inference +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} + +--- + +## The jackknife + +- The jackknife is a tool for estimating standard errors and the bias of estimators +- As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools +- Both the jackknife and the bootstrap involve *resampling* data; that is, repeatedly creating new data sets from the original data + +--- + +## The jackknife + +- The jackknife deletes each observation and calculates an estimate based on the remaining $n-1$ of them +- It uses this collection of estimates to do things like estimate the bias and the standard error +- Note that estimating the bias and having a standard error are not needed for 
things like sample means, which we know are unbiased estimates of population means and what their standard errors are + +--- + +## The jackknife + +- We'll consider the jackknife for univariate data +- Let $X_1,\ldots,X_n$ be a collection of data used to estimate a parameter $\theta$ +- Let $\hat \theta$ be the estimate based on the full data set +- Let $\hat \theta_{i}$ be the estimate of $\theta$ obtained by *deleting observation $i$* +- Let $\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}$ + +--- + +## Continued + +- Then, the jackknife estimate of the bias is + $$ + (n - 1) \left(\bar \theta - \hat \theta\right) + $$ + (how far the average delete-one estimate is from the actual estimate) +- The jackknife estimate of the standard error is + $$ + \left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} + $$ +(the deviance of the delete-one estimates from the average delete-one estimate) + +--- + +## Example +### We want to estimate the bias and standard error of the median + +```{r, results='hide'} +library(UsingR) +data(father.son) +x <- father.son$sheight +n <- length(x) +theta <- median(x) +jk <- sapply(1 : n, + function(i) median(x[-i]) + ) +thetaBar <- mean(jk) +biasEst <- (n - 1) * (thetaBar - theta) +seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2)) +``` + +--- + +## Example test + +```{r} +c(biasEst, seEst) +library(bootstrap) +temp <- jackknife(x, median) +c(temp$jack.bias, temp$jack.se) +``` + +--- + +## Example + +- Both methods (of course) yield an estimated bias of `r temp$jack.bias` and a se of `r temp$jack.se` +- Odd little fact: the jackknife estimate of the bias for the median is always $0$ when the number of observations is even +- It has been shown that the jackknife is a linear approximation to the bootstrap +- Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties + +--- + +## Pseudo observations + +- Another interesting way to think about the 
jackknife uses pseudo observations +- Let +$$ + \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} +$$ +- Think of these as ``whatever observation $i$ contributes to the estimate of $\theta$'' +- Note when $\hat \theta$ is the sample mean, the pseudo observations are the data themselves +- Then the sample standard error of these observations is the previous jackknife estimated standard error. +- The mean of these observations is a bias-corrected estimate of $\theta$ + +--- + +## The bootstrap + +- The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics +- For example, how would one derive a confidence interval for the median? +- The bootstrap procedure follows from the so called bootstrap principle + +--- + +## The bootstrap principle + +- Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution +- The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution + +--- + +## The bootstrap in practice + +- In practice, the bootstrap principle is always carried out using simulation +- We will cover only a few aspects of bootstrap resampling +- The general procedure follows by first simulating complete data sets from the observed data with replacement + + - This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution + +- Calculate the statistic for each simulated data set +- Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error + +--- +## Nonparametric bootstrap algorithm example + +- Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations + + i. 
Sample $n$ observations **with replacement** from the observed data resulting in one simulated complete data set + + ii. Take the median of the simulated data set + + iii. Repeat these two steps $B$ times, resulting in $B$ simulated medians + + iv. These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can + + - Draw a histogram of them + - Calculate their standard deviation to estimate the standard error of the median + - Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median + +--- + +## Example code + +```{r} +B <- 1000 +resamples <- matrix(sample(x, + n * B, + replace = TRUE), + B, n) +medians <- apply(resamples, 1, median) +sd(medians) +quantile(medians, c(.025, .975)) +``` + +--- +## Histogram of bootstrap resamples + +```{r, fig.height=5, fig.width=5} +hist(medians) +``` + +--- + +## Notes on the bootstrap + +- The bootstrap is non-parametric +- Better percentile bootstrap confidence intervals correct for bias +- There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap"" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information + + +--- +## Group comparisons +- Consider comparing two independent groups. 
+- Example, comparing sprays B and C + +```{r, fig.height=4, fig.width=4} +data(InsectSprays) +boxplot(count ~ spray, data = InsectSprays) +``` + +--- +## Permutation tests +- Consider the null hypothesis that the distribution of the observations from each group is the same +- Then, the group labels are irrelevant +- We then discard the group levels and permute the combined data +- Split the permuted data into two groups with $n_A$ and $n_B$ + observations (say by always treating the first $n_A$ observations as + the first group) +- Evaluate the probability of getting a statistic as large or + large than the one observed +- An example statistic would be the difference in the averages between the two groups; + one could also use a t-statistic + +--- +## Variations on permutation testing +Data type | Statistic | Test name +---|---|---| +Ranks | rank sum | rank sum test +Binary | hypergeometric prob | Fisher's exact test +Raw data | | ordinary permutation test + +- Also, so-called *randomization tests* are exactly permutation tests, with a different motivation. 
+- For matched data, one can randomize the signs + - For ranks, this results in the signed rank test +- Permutation strategies work for regression as well + - Permuting a regressor of interest +- Permutation tests work very well in multivariate settings + +--- +## Permutation test for pesticide data +```{r} +subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),] +y <- subdata$count +group <- as.character(subdata$spray) +testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"]) +observedStat <- testStat(y, group) +permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group))) +observedStat +mean(permutations > observedStat) +``` + +--- +## Histogram of permutations +```{r, echo= FALSE, fig.width=5, fig.height=5} +hist(permutations) +``` + + diff --git a/06_StatisticalInference/03_06_resampledInference/index.html b/06_StatisticalInference/03_06_resampledInference/index.html index 15515b719..ce1da45ea 100644 --- a/06_StatisticalInference/03_06_resampledInference/index.html +++ b/06_StatisticalInference/03_06_resampledInference/index.html @@ -1,614 +1,609 @@ - - - - Resampled inference - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Resampled inference

-

Statistical Inference

-

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

-
-
-
- - - - -
-

The jackknife

-
-
-
    -
  • The jackknife is a tool for estimating standard errors and the bias of estimators
  • -
  • As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools
  • -
  • Both the jackknife and the bootstrap involve resampling data; that is, repeatedly creating new data sets from the original data
  • -
- -
- -
- - -
-

The jackknife

-
-
-
    -
  • The jackknife deletes each observation and calculates an estimate based on the remaining \(n-1\) of them
  • -
  • It uses this collection of estimates to do things like estimate the bias and the standard error
  • -
  • Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means and what their standard errors are
  • -
- -
- -
- - -
-

The jackknife

-
-
-
    -
  • We'll consider the jackknife for univariate data
  • -
  • Let \(X_1,\ldots,X_n\) be a collection of data used to estimate a parameter \(\theta\)
  • -
  • Let \(\hat \theta\) be the estimate based on the full data set
  • -
  • Let \(\hat \theta_{i}\) be the estimate of \(\theta\) obtained by deleting observation \(i\)
  • -
  • Let \(\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}\)
  • -
- -
- -
- - -
-

Continued

-
-
-
    -
  • Then, the jackknife estimate of the bias is -\[ -(n - 1) \left(\bar \theta - \hat \theta\right) -\] -(how far the average delete-one estimate is from the actual estimate)
  • -
  • The jackknife estimate of the standard error is -\[ -\left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} -\] -(the deviance of the delete-one estimates from the average delete-one estimate)
  • -
- -
- -
- - -
-

Example

-
-
-

We want to estimate the bias and standard error of the median

- -
library(UsingR)
-data(father.son)
-x <- father.son$sheight
-n <- length(x)
-theta <- median(x)
-jk <- sapply(1 : n,
-             function(i) median(x[-i])
-             )
-thetaBar <- mean(jk)
-biasEst <- (n - 1) * (thetaBar - theta) 
-seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))
-
- -
- -
- - -
-

Example

-
-
-
c(biasEst, seEst)
-
- -
[1] 0.0000 0.1014
-
- -
library(bootstrap)
-temp <- jackknife(x, median)
-c(temp$jack.bias, temp$jack.se)
-
- -
[1] 0.0000 0.1014
-
- -
- -
- - -
-

Example

-
-
-
    -
  • Both methods (of course) yield an estimated bias of 0 and a se of 0.1014
  • -
  • Odd little fact: the jackknife estimate of the bias for the median is always \(0\) when the number of observations is even
  • -
  • It has been shown that the jackknife is a linear approximation to the bootstrap
  • -
  • Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties
  • -
- -
- -
- - -
-

Pseudo observations

-
-
-
    -
  • Another interesting way to think about the jackknife uses pseudo observations
  • -
  • Let -\[ - \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} -\]
  • -
  • Think of these as ``whatever observation \(i\) contributes to the estimate of \(\theta\)''
  • -
  • Note when \(\hat \theta\) is the sample mean, the pseudo observations are the data themselves
  • -
  • Then the sample standard error of these observations is the previous jackknife estimated standard error.
  • -
  • The mean of these observations is a bias-corrected estimate of \(\theta\)
  • -
- -
- -
- - -
-

The bootstrap

-
-
-
    -
  • The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
  • -
  • For example, how would one derive a confidence interval for the median?
  • -
  • The bootstrap procedure follows from the so called bootstrap principle
  • -
- -
- -
- - -
-

The bootstrap principle

-
-
-
    -
  • Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution
  • -
  • The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution
  • -
- -
- -
- - -
-

The bootstrap in practice

-
-
-
    -
  • In practice, the bootstrap principle is always carried out using simulation
  • -
  • We will cover only a few aspects of bootstrap resampling
  • -
  • The general procedure follows by first simulating complete data sets from the observed data with replacement

    - -
      -
    • This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution
    • -
  • -
  • Calculate the statistic for each simulated data set

  • -
  • Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error

  • -
- -
- -
- - -
-

Nonparametric bootstrap algorithm example

-
-
-
    -
  • Bootstrap procedure for calculating confidence interval for the median from a data set of \(n\) observations

    - -

    i. Sample \(n\) observations with replacement from the observed data resulting in one simulated complete data set

    - -

    ii. Take the median of the simulated data set

    - -

    iii. Repeat these two steps \(B\) times, resulting in \(B\) simulated medians

    - -

    iv. These medians are approximately drawn from the sampling distribution of the median of \(n\) observations; therefore we can

    - -
      -
    • Draw a histogram of them
    • -
    • Calculate their standard deviation to estimate the standard error of the median
    • -
    • Take the \(2.5^{th}\) and \(97.5^{th}\) percentiles as a confidence interval for the median
    • -
  • -
- -
- -
- - -
-

Example code

-
-
-
B <- 1000
-resamples <- matrix(sample(x,
-                           n * B,
-                           replace = TRUE),
-                    B, n)
-medians <- apply(resamples, 1, median)
-sd(medians)
-
- -
[1] 0.08546
-
- -
quantile(medians, c(.025, .975))
-
- -
 2.5% 97.5% 
-68.43 68.82 
-
- -
- -
- - -
-

Histogram of bootstrap resamples

-
-
-
hist(medians)
-
- -
plot of chunk unnamed-chunk-4
- -
- -
- - -
-

Notes on the bootstrap

-
-
-
    -
  • The bootstrap is non-parametric
  • -
  • Better percentile bootstrap confidence intervals correct for bias
  • -
  • There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap"" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information
  • -
- -
- -
- - -
-

Group comparisons

-
-
-
    -
  • Consider comparing two independent groups.
  • -
  • Example, comparing sprays B and C
  • -
- -
data(InsectSprays)
-boxplot(count ~ spray, data = InsectSprays)
-
- -
plot of chunk unnamed-chunk-5
- -
- -
- - -
-

Permutation tests

-
-
-
    -
  • Consider the null hypothesis that the distribution of the observations from each group is the same
  • -
  • Then, the group labels are irrelevant
  • -
  • We then discard the group levels and permute the combined data
  • -
  • Split the permuted data into two groups with \(n_A\) and \(n_B\) -observations (say by always treating the first \(n_A\) observations as -the first group)
  • -
  • Evaluate the probability of getting a statistic as large or -large than the one observed
  • -
  • An example statistic would be the difference in the averages between the two groups; -one could also use a t-statistic
  • -
- -
- -
- - -
-

Variations on permutation testing

-
-
- - - - - - - - - - - - - - - - - - - - - - -
Data typeStatisticTest name
Ranksrank sumrank sum test
Binaryhypergeometric probFisher's exact test
Raw dataordinary permutation test
- -
    -
  • Also, so-called randomization tests are exactly permutation tests, with a different motivation.
  • -
  • For matched data, one can randomize the signs - -
      -
    • For ranks, this results in the signed rank test
    • -
  • -
  • Permutation strategies work for regression as well - -
      -
    • Permuting a regressor of interest
    • -
  • -
  • Permutation tests work very well in multivariate settings
  • -
- -
- -
- - -
-

Permutation test for pesticide data

-
-
-
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
-y <- subdata$count
-group <- as.character(subdata$spray)
-testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
-observedStat <- testStat(y, group)
-permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
-observedStat
-
- -
[1] 13.25
-
- -
mean(permutations > observedStat)
-
- -
[1] 0
-
- -
- -
- - -
-

Histogram of permutations

-
-
-
plot of chunk unnamed-chunk-7
- -
- -
- - -
- - - - - - - - - - - - - - + + + + Resampled inference + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Resampled inference

+

Statistical Inference

+

Brian Caffo, Jeff Leek, Roger Peng
Johns Hopkins Bloomberg School of Public Health

+
+
+
+ + + + +
+

The jackknife

+
+
+
    +
  • The jackknife is a tool for estimating standard errors and the bias of estimators
  • +
  • As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools
  • +
  • Both the jackknife and the bootstrap involve resampling data; that is, repeatedly creating new data sets from the original data
  • +
+ +
+ +
+ + +
+

The jackknife

+
+
+
    +
  • The jackknife deletes each observation and calculates an estimate based on the remaining \(n-1\) of them
  • +
  • It uses this collection of estimates to do things like estimate the bias and the standard error
  • +
  • Note that jackknife estimates of the bias and standard error are not needed for things like sample means, which we know are unbiased estimates of population means and whose standard errors we already know
  • +
+ +
+ +
+ + +
+

The jackknife

+
+
+
    +
  • We'll consider the jackknife for univariate data
  • +
  • Let \(X_1,\ldots,X_n\) be a collection of data used to estimate a parameter \(\theta\)
  • +
  • Let \(\hat \theta\) be the estimate based on the full data set
  • +
  • Let \(\hat \theta_{i}\) be the estimate of \(\theta\) obtained by deleting observation \(i\)
  • +
  • Let \(\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}\)
  • +
+ +
+ +
+ + +
+

Continued

+
+
+
    +
  • Then, the jackknife estimate of the bias is +\[ +(n - 1) \left(\bar \theta - \hat \theta\right) +\] +(how far the average delete-one estimate is from the actual estimate)
  • +
  • The jackknife estimate of the standard error is +\[ +\left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} +\] +(the deviation of the delete-one estimates from the average delete-one estimate)
  • +
+ +
+ +
+ + +
+

Example

+
+
+

We want to estimate the bias and standard error of the median

+ +
library(UsingR)
+data(father.son)
+x <- father.son$sheight
+n <- length(x)
+theta <- median(x)
+jk <- sapply(1:n, function(i) median(x[-i]))
+thetaBar <- mean(jk)
+biasEst <- (n - 1) * (thetaBar - theta)
+seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))
+
+ +
+ +
+ + +
+

Example test

+
+
+
c(biasEst, seEst)
+
+ +
## [1] 0.0000 0.1014
+
+ +
library(bootstrap)
+temp <- jackknife(x, median)
+c(temp$jack.bias, temp$jack.se)
+
+ +
## [1] 0.0000 0.1014
+
+ +
+ +
+ + +
+

Example

+
+
+
    +
  • Both methods (of course) yield an estimated bias of 0 and a se of 0.1014
  • +
  • Odd little fact: the jackknife estimate of the bias for the median is always \(0\) when the number of observations is even
  • +
  • It has been shown that the jackknife is a linear approximation to the bootstrap
  • +
  • Generally, do not use the jackknife for sample quantiles like the median, as it has been shown to have some poor properties
  • +
+ +
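The "odd little fact" above can be checked numerically. The following is a minimal sketch on simulated data (the `rnorm` sample and the seed are assumptions for illustration, not the `father.son` heights used in the slides). With an even number of distinct observations, every delete-one median is one of the two middle order statistics, each occurring half the time, so their average equals the full-data median.

```r
# Sketch: jackknife bias of the median for an even sample size.
# Simulated data (an assumption for illustration only).
set.seed(42)
x <- rnorm(100)                      # even number of observations
n <- length(x)
theta <- median(x)                   # full-data estimate

# Delete-one estimates: the median with observation i removed
jk <- sapply(seq_len(n), function(i) median(x[-i]))
thetaBar <- mean(jk)

# Jackknife bias estimate, same formula as in the slides
biasEst <- (n - 1) * (thetaBar - theta)
biasEst                              # zero up to floating-point rounding

# Only the two middle order statistics ever appear as delete-one medians
length(unique(jk))
```

This also hints at why the jackknife behaves poorly for the median: the delete-one estimates take only two distinct values, so they carry little information about the sampling variability.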
+ +
+ + +
+

Pseudo observations

+
+
+
    +
  • Another interesting way to think about the jackknife uses pseudo observations
  • +
  • Let +\[ + \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} +\]
  • +
  • Think of these as "whatever observation \(i\) contributes to the estimate of \(\theta\)"
  • +
  • Note when \(\hat \theta\) is the sample mean, the pseudo observations are the data themselves
  • +
  • Then the sample standard error of these observations is the previous jackknife estimated standard error.
  • +
  • The mean of these observations is a bias-corrected estimate of \(\theta\)
  • +
+ +
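The pseudo-observation identities above can be verified directly. This is a minimal sketch on simulated data; the choice of `sd` as the statistic, the `rexp` sample, and all variable names other than `biasEst` and `seEst` are assumptions for illustration.

```r
# Sketch: pseudo observations and their relation to the jackknife.
# Simulated data and the statistic `stat` are illustrative assumptions.
set.seed(1)
x <- rexp(50)
n <- length(x)
stat <- function(v) sd(v)            # hypothetical statistic of interest

theta <- stat(x)                                       # full-data estimate
jk <- sapply(seq_len(n), function(i) stat(x[-i]))      # delete-one estimates
thetaBar <- mean(jk)
biasEst <- (n - 1) * (thetaBar - theta)                # jackknife bias
seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))       # jackknife standard error

# Pseudo observations: n * theta-hat - (n - 1) * theta-hat_i
pseudo <- n * theta - (n - 1) * jk

# Their sample standard error reproduces the jackknife standard error
seFromPseudo <- sd(pseudo) / sqrt(n)

# Their mean is the bias-corrected estimate, theta-hat minus the jackknife bias
correctedEst <- mean(pseudo)

# For the sample mean, the pseudo observations are the data themselves
pseudoMean <- n * mean(x) - (n - 1) * sapply(seq_len(n), function(i) mean(x[-i]))

c(seEst, seFromPseudo)
c(theta - biasEst, correctedEst)
```

The two printed pairs should agree up to floating-point rounding, and `pseudoMean` should match `x` element by element.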
+ +
+ + +
+

The bootstrap

+
+
+
    +
  • The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
  • +
  • For example, how would one derive a confidence interval for the median?
  • +
  • The bootstrap procedure follows from the so-called bootstrap principle
  • +
+ +
+ +
+ + +
+

The bootstrap principle

+
+
+
    +
  • Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution
  • +
  • The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution
  • +
+ +
+ +
+ + +
+

The bootstrap in practice

+
+
+
    +
  • In practice, the bootstrap principle is always carried out using simulation
  • +
  • We will cover only a few aspects of bootstrap resampling
  • +
  • The general procedure follows by first simulating complete data sets from the observed data with replacement

    + +
      +
    • This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution
    • +
  • +
  • Calculate the statistic for each simulated data set

  • +
  • Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error

  • +
+ +
+ +
+ + +
+

Nonparametric bootstrap algorithm example

+
+
+
    +
  • Bootstrap procedure for calculating confidence interval for the median from a data set of \(n\) observations

    + +

    i. Sample \(n\) observations with replacement from the observed data resulting in one simulated complete data set

    + +

    ii. Take the median of the simulated data set

    + +

    iii. Repeat these two steps \(B\) times, resulting in \(B\) simulated medians

    + +

    iv. These medians are approximately drawn from the sampling distribution of the median of \(n\) observations; therefore we can

    + +
      +
    • Draw a histogram of them
    • +
    • Calculate their standard deviation to estimate the standard error of the median
    • +
    • Take the \(2.5^{th}\) and \(97.5^{th}\) percentiles as a confidence interval for the median
    • +
  • +
+ +
+ +
+ + +
+

Example code

+
+
+
B <- 1000
+resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)
+medians <- apply(resamples, 1, median)
+sd(medians)
+
+ +
## [1] 0.08834
+
+ +
quantile(medians, c(0.025, 0.975))
+
+ +
##  2.5% 97.5% 
+## 68.41 68.82
+
+ +
+ +
+ + +
+

Histogram of bootstrap resamples

+
+
+
hist(medians)
+
+ +

plot of chunk unnamed-chunk-4

+ +
+ +
+ + +
+

Notes on the bootstrap

+
+
+
    +
  • The bootstrap is non-parametric
  • +
  • Better percentile bootstrap confidence intervals correct for bias
  • +
  • There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information
  • +
+ +
+ +
+ + +
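One way to read the note about bias-correcting percentile intervals is the bias-corrected (BC) percentile interval, which shifts the percentile levels according to how far the bootstrap distribution sits from the original estimate. The sketch below assumes simulated skewed data and the simple BC form without an acceleration term; it is an illustration, not necessarily the variant the lecture has in mind.

```r
# Illustrative skewed data (an assumption, not the lecture's father.son data)
set.seed(2)
x <- rexp(100, rate = 1)
n <- length(x)
B <- 2000

# Bootstrap medians, following the same pattern as the example code
thetaHat <- median(x)
resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)
medians <- apply(resamples, 1, median)

# Plain percentile interval
alpha <- 0.05
pct <- quantile(medians, c(alpha / 2, 1 - alpha / 2))

# Bias-correction constant: normal quantile of the share of bootstrap
# medians that fall below the original estimate
z0 <- qnorm(mean(medians < thetaHat))

# Adjusted percentile levels, then the BC interval
adj <- pnorm(2 * z0 + qnorm(c(alpha / 2, 1 - alpha / 2)))
bc <- quantile(medians, adj)

pct
bc
```

When half of the bootstrap medians fall below the estimate, `z0` is zero and the BC interval reduces to the plain percentile interval.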
+

Group comparisons

+
+
+
    +
  • Consider comparing two independent groups.
  • +
  • Example, comparing sprays B and C
  • +
+ +
data(InsectSprays)
+boxplot(count ~ spray, data = InsectSprays)
+
+ +

plot of chunk unnamed-chunk-5

+ +
+ +
+ + +
+

Permutation tests

+
+
+
    +
  • Consider the null hypothesis that the distribution of the observations from each group is the same
  • +
  • Then, the group labels are irrelevant
  • +
  • We then discard the group labels and permute the combined data
  • +
  • Split the permuted data into two groups with \(n_A\) and \(n_B\) +observations (say by always treating the first \(n_A\) observations as +the first group)
  • +
  • Evaluate the probability of getting a statistic as large or +larger than the one observed
  • +
  • An example statistic would be the difference in the averages between the two groups; +one could also use a t-statistic
  • +
+ +
+ +
+ + +
+

Variations on permutation testing

+
+
+ + + + + + + + + + + + + + + + + + + + + + +
Data typeStatisticTest name
Ranksrank sumrank sum test
Binaryhypergeometric probFisher's exact test
Raw dataordinary permutation test
+ +
    +
  • Also, so-called randomization tests are exactly permutation tests, with a different motivation.
  • +
  • For matched data, one can randomize the signs + +
      +
    • For ranks, this results in the signed rank test
    • +
  • +
  • Permutation strategies work for regression as well + +
      +
    • Permuting a regressor of interest
    • +
  • +
  • Permutation tests work very well in multivariate settings
  • +
+ +
+ +
+ + +
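The sign-randomization idea for matched data can be sketched as follows. This assumes hypothetical paired differences generated by simulation and a null hypothesis under which the differences are symmetric about zero, so each sign is equally likely.

```r
# Hypothetical paired differences (e.g. after minus before) for 20 matched pairs
set.seed(3)
d <- rnorm(20, mean = 0.8, sd = 1)
observedStat <- mean(d)

# Under the null, flipping the sign of any difference is equally likely,
# so we randomize the signs and recompute the statistic each time
B <- 10000
flipped <- replicate(B, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))

# One-sided Monte Carlo p-value; the +1 counts the observed arrangement itself
pval <- (sum(flipped >= observedStat) + 1) / (B + 1)
pval
```

Replacing the differences by the signed ranks of their absolute values gives a randomization version of the signed rank test, which base R offers as `wilcox.test(d)`.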
+

Permutation test for pesticide data

+
+
+
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ]
+y <- subdata$count
+group <- as.character(subdata$spray)
+testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
+observedStat <- testStat(y, group)
+permutations <- sapply(1:10000, function(i) testStat(y, sample(group)))
+observedStat
+
+ +
## [1] 13.25
+
+ +
mean(permutations > observedStat)
+
+ +
## [1] 0
+
+ +
+ +
+ + +
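A reported P-value of exactly 0 reflects the finite number of permutations rather than a truly impossible event. A common convention, sketched here as an assumption rather than as the lecture's method, is to use "as large or larger" (`>=`) and to count the observed labelling as one of the permutations.

```r
# Same data and statistic as the slide; only the p-value convention differs
data(InsectSprays)
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ]
y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)

# Permute the labels B times (seed set only to make the sketch reproducible)
set.seed(4)
B <- 10000
permutations <- sapply(seq_len(B), function(i) testStat(y, sample(group)))

# Monte Carlo p-value that can never be exactly zero
pval <- (sum(permutations >= observedStat) + 1) / (B + 1)
pval
```

With this convention the smallest attainable value is `1 / (B + 1)`, which makes the resolution of the simulation explicit.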
+

Histogram of permutations

+
+
+

plot of chunk unnamed-chunk-7

+ +
+ +
+ + +
+ + + + + + + + + + + + + + \ No newline at end of file diff --git a/06_StatisticalInference/03_06_resampledInference/index.md b/06_StatisticalInference/03_06_resampledInference/index.md index d1544df72..448841d3a 100644 --- a/06_StatisticalInference/03_06_resampledInference/index.md +++ b/06_StatisticalInference/03_06_resampledInference/index.md @@ -1,294 +1,287 @@ ---- -title : Resampled inference -subtitle : Statistical Inference -author : Brian Caffo, Jeff Leek, Roger Peng -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : tomorrow # -url: - lib: ../../librariesNew - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} - ---- - - - -## The jackknife - -- The jackknife is a tool for estimating standard errors and the bias of estimators -- As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools -- Both the jackknife and the bootstrap involve *resampling* data; that is, repeatedly creating new data sets from the original data - ---- - -## The jackknife - -- The jackknife deletes each observation and calculates an estimate based on the remaining $n-1$ of them -- It uses this collection of estimates to do things like estimate the bias and the standard error -- Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means and what their standard errors are - ---- - -## The jackknife - -- We'll consider the jackknife for univariate data -- Let $X_1,\ldots,X_n$ be a collection of data used to estimate a parameter $\theta$ -- Let $\hat \theta$ be the estimate based on the full data set -- Let $\hat \theta_{i}$ be the estimate of $\theta$ 
obtained by *deleting observation $i$* -- Let $\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}$ - ---- - -## Continued - -- Then, the jackknife estimate of the bias is - $$ - (n - 1) \left(\bar \theta - \hat \theta\right) - $$ - (how far the average delete-one estimate is from the actual estimate) -- The jackknife estimate of the standard error is - $$ - \left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} - $$ -(the deviance of the delete-one estimates from the average delete-one estimate) - ---- - -## Example -### We want to estimate the bias and standard error of the median - - -```r -library(UsingR) -data(father.son) -x <- father.son$sheight -n <- length(x) -theta <- median(x) -jk <- sapply(1 : n, - function(i) median(x[-i]) - ) -thetaBar <- mean(jk) -biasEst <- (n - 1) * (thetaBar - theta) -seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2)) -``` - - ---- - -## Example - - -```r -c(biasEst, seEst) -``` - -``` -[1] 0.0000 0.1014 -``` - -```r -library(bootstrap) -temp <- jackknife(x, median) -c(temp$jack.bias, temp$jack.se) -``` - -``` -[1] 0.0000 0.1014 -``` - - ---- - -## Example - -- Both methods (of course) yield an estimated bias of 0 and a se of 0.1014 -- Odd little fact: the jackknife estimate of the bias for the median is always $0$ when the number of observations is even -- It has been shown that the jackknife is a linear approximation to the bootstrap -- Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties - ---- - -## Pseudo observations - -- Another interesting way to think about the jackknife uses pseudo observations -- Let -$$ - \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} -$$ -- Think of these as ``whatever observation $i$ contributes to the estimate of $\theta$'' -- Note when $\hat \theta$ is the sample mean, the pseudo observations are the data themselves -- Then the sample standard error of these observations is the previous jackknife 
estimated standard error. -- The mean of these observations is a bias-corrected estimate of $\theta$ - ---- - -## The bootstrap - -- The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics -- For example, how would one derive a confidence interval for the median? -- The bootstrap procedure follows from the so called bootstrap principle - ---- - -## The bootstrap principle - -- Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution -- The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution - ---- - -## The bootstrap in practice - -- In practice, the bootstrap principle is always carried out using simulation -- We will cover only a few aspects of bootstrap resampling -- The general procedure follows by first simulating complete data sets from the observed data with replacement - - - This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution - -- Calculate the statistic for each simulated data set -- Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error - ---- -## Nonparametric bootstrap algorithm example - -- Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations - - i. Sample $n$ observations **with replacement** from the observed data resulting in one simulated complete data set - - ii. Take the median of the simulated data set - - iii. Repeat these two steps $B$ times, resulting in $B$ simulated medians - - iv. 
These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can - - - Draw a histogram of them - - Calculate their standard deviation to estimate the standard error of the median - - Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median - ---- - -## Example code - - -```r -B <- 1000 -resamples <- matrix(sample(x, - n * B, - replace = TRUE), - B, n) -medians <- apply(resamples, 1, median) -sd(medians) -``` - -``` -[1] 0.08546 -``` - -```r -quantile(medians, c(.025, .975)) -``` - -``` - 2.5% 97.5% -68.43 68.82 -``` - - ---- -## Histogram of bootstrap resamples - - -```r -hist(medians) -``` - -
plot of chunk unnamed-chunk-4
- - ---- - -## Notes on the bootstrap - -- The bootstrap is non-parametric -- Better percentile bootstrap confidence intervals correct for bias -- There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap"" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information - - ---- -## Group comparisons -- Consider comparing two independent groups. -- Example, comparing sprays B and C - - -```r -data(InsectSprays) -boxplot(count ~ spray, data = InsectSprays) -``` - -
plot of chunk unnamed-chunk-5
- - ---- -## Permutation tests -- Consider the null hypothesis that the distribution of the observations from each group is the same -- Then, the group labels are irrelevant -- We then discard the group levels and permute the combined data -- Split the permuted data into two groups with $n_A$ and $n_B$ - observations (say by always treating the first $n_A$ observations as - the first group) -- Evaluate the probability of getting a statistic as large or - large than the one observed -- An example statistic would be the difference in the averages between the two groups; - one could also use a t-statistic - ---- -## Variations on permutation testing -Data type | Statistic | Test name ----|---|---| -Ranks | rank sum | rank sum test -Binary | hypergeometric prob | Fisher's exact test -Raw data | | ordinary permutation test - -- Also, so-called *randomization tests* are exactly permutation tests, with a different motivation. -- For matched data, one can randomize the signs - - For ranks, this results in the signed rank test -- Permutation strategies work for regression as well - - Permuting a regressor of interest -- Permutation tests work very well in multivariate settings - ---- -## Permutation test for pesticide data - -```r -subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),] -y <- subdata$count -group <- as.character(subdata$spray) -testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"]) -observedStat <- testStat(y, group) -permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group))) -observedStat -``` - -``` -[1] 13.25 -``` - -```r -mean(permutations > observedStat) -``` - -``` -[1] 0 -``` - - ---- -## Histogram of permutations -
plot of chunk unnamed-chunk-7
- - - +--- +title : Resampled inference +subtitle : Statistical Inference +author : Brian Caffo, Jeff Leek, Roger Peng +job : Johns Hopkins Bloomberg School of Public Health +logo : bloomberg_shield.png +framework : io2012 # {io2012, html5slides, shower, dzslides, ...} +highlighter : highlight.js # {highlight.js, prettify, highlight} +hitheme : tomorrow # +url: + lib: ../../librariesNew + assets: ../../assets +widgets : [mathjax] # {mathjax, quiz, bootstrap} +mode : selfcontained # {standalone, draft} + +--- + +## The jackknife + +- The jackknife is a tool for estimating standard errors and the bias of estimators +- As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools +- Both the jackknife and the bootstrap involve *resampling* data; that is, repeatedly creating new data sets from the original data + +--- + +## The jackknife + +- The jackknife deletes each observation and calculates an estimate based on the remaining $n-1$ of them +- It uses this collection of estimates to do things like estimate the bias and the standard error +- Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means and what their standard errors are + +--- + +## The jackknife + +- We'll consider the jackknife for univariate data +- Let $X_1,\ldots,X_n$ be a collection of data used to estimate a parameter $\theta$ +- Let $\hat \theta$ be the estimate based on the full data set +- Let $\hat \theta_{i}$ be the estimate of $\theta$ obtained by *deleting observation $i$* +- Let $\bar \theta = \frac{1}{n}\sum_{i=1}^n \hat \theta_{i}$ + +--- + +## Continued + +- Then, the jackknife estimate of the bias is + $$ + (n - 1) \left(\bar \theta - \hat \theta\right) + $$ + (how far the average delete-one estimate is from the actual estimate) +- The jackknife estimate of the standard error is + $$ + 
\left[\frac{n-1}{n}\sum_{i=1}^n (\hat \theta_i - \bar\theta )^2\right]^{1/2} + $$ +(the deviance of the delete-one estimates from the average delete-one estimate) + +--- + +## Example +### We want to estimate the bias and standard error of the median + + +```r +library(UsingR) +data(father.son) +x <- father.son$sheight +n <- length(x) +theta <- median(x) +jk <- sapply(1:n, function(i) median(x[-i])) +thetaBar <- mean(jk) +biasEst <- (n - 1) * (thetaBar - theta) +seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2)) +``` + + +--- + +## Example test + + +```r +c(biasEst, seEst) +``` + +``` +## [1] 0.0000 0.1014 +``` + +```r +library(bootstrap) +temp <- jackknife(x, median) +c(temp$jack.bias, temp$jack.se) +``` + +``` +## [1] 0.0000 0.1014 +``` + + +--- + +## Example + +- Both methods (of course) yield an estimated bias of 0 and a se of 0.1014 +- Odd little fact: the jackknife estimate of the bias for the median is always $0$ when the number of observations is even +- It has been shown that the jackknife is a linear approximation to the bootstrap +- Generally do not use the jackknife for sample quantiles like the median; as it has been shown to have some poor properties + +--- + +## Pseudo observations + +- Another interesting way to think about the jackknife uses pseudo observations +- Let +$$ + \mbox{Pseudo Obs} = n \hat \theta - (n - 1) \hat \theta_{i} +$$ +- Think of these as ``whatever observation $i$ contributes to the estimate of $\theta$'' +- Note when $\hat \theta$ is the sample mean, the pseudo observations are the data themselves +- Then the sample standard error of these observations is the previous jackknife estimated standard error. +- The mean of these observations is a bias-corrected estimate of $\theta$ + +--- + +## The bootstrap + +- The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics +- For example, how would one derive a confidence interval for the median? 
+- The bootstrap procedure follows from the so called bootstrap principle + +--- + +## The bootstrap principle + +- Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution +- The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution + +--- + +## The bootstrap in practice + +- In practice, the bootstrap principle is always carried out using simulation +- We will cover only a few aspects of bootstrap resampling +- The general procedure follows by first simulating complete data sets from the observed data with replacement + + - This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution + +- Calculate the statistic for each simulated data set +- Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error + +--- +## Nonparametric bootstrap algorithm example + +- Bootstrap procedure for calculating confidence interval for the median from a data set of $n$ observations + + i. Sample $n$ observations **with replacement** from the observed data resulting in one simulated complete data set + + ii. Take the median of the simulated data set + + iii. Repeat these two steps $B$ times, resulting in $B$ simulated medians + + iv. 
These medians are approximately drawn from the sampling distribution of the median of $n$ observations; therefore we can + + - Draw a histogram of them + - Calculate their standard deviation to estimate the standard error of the median + - Take the $2.5^{th}$ and $97.5^{th}$ percentiles as a confidence interval for the median + +--- + +## Example code + + +```r +B <- 1000 +resamples <- matrix(sample(x, n * B, replace = TRUE), B, n) +medians <- apply(resamples, 1, median) +sd(medians) +``` + +``` +## [1] 0.08834 +``` + +```r +quantile(medians, c(0.025, 0.975)) +``` + +``` +## 2.5% 97.5% +## 68.41 68.82 +``` + + +--- +## Histogram of bootstrap resamples + + +```r +hist(medians) +``` + +![plot of chunk unnamed-chunk-4](assets/fig/unnamed-chunk-4.png) + + +--- + +## Notes on the bootstrap + +- The bootstrap is non-parametric +- Better percentile bootstrap confidence intervals correct for bias +- There are lots of variations on bootstrap procedures; the book "An Introduction to the Bootstrap"" by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information + + +--- +## Group comparisons +- Consider comparing two independent groups. 
+- Example, comparing sprays B and C + + +```r +data(InsectSprays) +boxplot(count ~ spray, data = InsectSprays) +``` + +![plot of chunk unnamed-chunk-5](assets/fig/unnamed-chunk-5.png) + + +--- +## Permutation tests +- Consider the null hypothesis that the distribution of the observations from each group is the same +- Then, the group labels are irrelevant +- We then discard the group labels and permute the combined data +- Split the permuted data into two groups with $n_A$ and $n_B$ + observations (say by always treating the first $n_A$ observations as + the first group) +- Evaluate the probability of getting a statistic as large or + larger than the one observed +- An example statistic would be the difference in the averages between the two groups; + one could also use a t-statistic + +--- +## Variations on permutation testing +Data type | Statistic | Test name +---|---|---| +Ranks | rank sum | rank sum test +Binary | hypergeometric prob | Fisher's exact test +Raw data | | ordinary permutation test + +- Also, so-called *randomization tests* are exactly permutation tests, with a different motivation. 
+- For matched data, one can randomize the signs + - For ranks, this results in the signed rank test +- Permutation strategies work for regression as well + - Permuting a regressor of interest +- Permutation tests work very well in multivariate settings + +--- +## Permutation test for pesticide data + +```r +subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ] +y <- subdata$count +group <- as.character(subdata$spray) +testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"]) +observedStat <- testStat(y, group) +permutations <- sapply(1:10000, function(i) testStat(y, sample(group))) +observedStat +``` + +``` +## [1] 13.25 +``` + +```r +mean(permutations > observedStat) +``` + +``` +## [1] 0 +``` + + +--- +## Histogram of permutations +![plot of chunk unnamed-chunk-7](assets/fig/unnamed-chunk-7.png) + + + diff --git a/06_StatisticalInference/03_06_resampledInference/index.pdf b/06_StatisticalInference/03_06_resampledInference/index.pdf index df08fe3df..ce8822a3c 100644 Binary files a/06_StatisticalInference/03_06_resampledInference/index.pdf and b/06_StatisticalInference/03_06_resampledInference/index.pdf differ diff --git a/06_StatisticalInference/cp.R b/06_StatisticalInference/cp.R new file mode 100644 index 000000000..bea948ee7 --- /dev/null +++ b/06_StatisticalInference/cp.R @@ -0,0 +1,18 @@ +## A program for copying the index.pdf files and naming them +## appropriately in the lectured directory +## Brian Caffo +## +## Has to be run within the directory and won't overwrite +## unless you change this to TRUE +overwrite = FALSE + +## Get the directory names (they all start with 0) +dirNames <- dir(pattern = "^0") + +## Loop over them and copy the files +sapply(dirNames, function(x) + file.copy(from = paste(x, "/index.pdf", sep = ""), + to = paste("lectures/", x, ".pdf", sep = ""), + overwrite = overwrite + ) + ) diff --git a/06_StatisticalInference/homework/hw1.Rmd b/06_StatisticalInference/homework/hw1.Rmd index 7f31c63a0..f5476f5c7 100644 --- 
a/06_StatisticalInference/homework/hw1.Rmd +++ b/06_StatisticalInference/homework/hw1.Rmd @@ -40,22 +40,22 @@ Creating Data Products --- &radio -Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 15% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 6% while that the mother contracted the disease is 5%. What is the probability that both contracted influenza expressed as a whole number percentage? +Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 15% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 10% while that the mother contracted the disease is 9%. What is the probability that both contracted influenza expressed as a whole number percentage? 1. 15% -2. 6% -3. 5% -4. _2%_ +2. 10% +3. 9% +4. _4%_ *** .hint -$A = Father$, $P(A) = .06$, $B = Mother$, $P(B) = .05$ +$A = Father$, $P(A) = .10$, $B = Mother$, $P(B) = .09$ $P(A\cup B) = .15$, *** .explanation -$P(A\cup B) = P(A) + P(B) - 2 P(AB)$ thus -$$.15 = .06 + .05 - 2 P(AB)$$ +$P(A\cup B) = P(A) + P(B) - P(AB)$ thus +$$.15 = .10 + .09 - P(AB)$$ ```{r} -(0.15 - .06 - .05) / 2 +.10 + .09 - .15 ``` --- &radio diff --git a/06_StatisticalInference/homework/hw2.Rmd b/06_StatisticalInference/homework/hw2.Rmd index 3a568425c..b38eab299 100644 --- a/06_StatisticalInference/homework/hw2.Rmd +++ b/06_StatisticalInference/homework/hw2.Rmd @@ -157,10 +157,10 @@ Let $p=.5$ and $X$ be binomial *** .explanation -`r round(pbinom(4, prob = .5, size = 6, lower.tail = TRUE) * 100, 1)` +`r round(pbinom(4, prob = .5, size = 6, lower.tail = FALSE) * 100, 1)` ```{r} -round(pbinom(4, prob = .5, size = 6, lower.tail = TRUE) * 100, 1) +round(pbinom(4, prob = .5, size = 6, lower.tail = FALSE) * 100, 1) ``` --- &multitext @@ -210,9 +210,9 @@ If you roll ten standard dice, take their average, then repeat 
this process over $$Var(\bar X) = \sigma^2 /n$$ *** .explanation -The answer will be `r round( mean(1 : 6 - 3.5) ^2 / 100, 3)` -since the variance of the sampling distribution of the mean is $\sigma^2/12$ -and the variance of a die roll is +The answer will be `r round( mean( (1 : 6 - 3.5) ^2) / 10, 3)` +since the variance of the sampling distribution of the mean is $\sigma^2/10$ +where $\sigma^2$ is the variance of a single die roll, which is ```{r} mean((1 : 6 - 3.5)^2) diff --git a/06_StatisticalInference/homework/hw4.Rmd b/06_StatisticalInference/homework/hw4.Rmd index bf5a8da3b..ae4a27b02 100644 --- a/06_StatisticalInference/homework/hw4.Rmd +++ b/06_StatisticalInference/homework/hw4.Rmd @@ -35,3 +35,316 @@ knit_hooks$set(plot = knitr:::hook_plot_html) Creating Data Products - Please help improve this with pull requests here (https://github.com/bcaffo/courses) + + +--- &multitext +Load the data set `mtcars` in the `datasets` R package. You want +to test whether the MPG is $\mu_0$ or smaller using a one sided +5% level test. ($H_0 : \mu = \mu_0$ versus $H_a : \mu < \mu_0$). +Using that data set and a Z test: + +1. what is the smallest value of $\mu_0$ that you would reject for? + +Both to two decimal places. + +*** .hint +This is the inversion of a one sided hypothesis test. It yields confidence +bounds. (Note inverting a two sidded test yields confidence intervals.) +Think about the derivation of the confidence interval. + +*** .explanation +We want to solve +$$ +\frac{\sqrt{n}(\bar{X} - \mu_0)}{s} = Z_{0.05} +$$ +Or $$\mu_0 = \bar{X} - Z_{0.05} s / \sqrt{n} = \bar{X} + Z_{0.95} s / \sqrt{n}$$. Note that the quantile is negative. + +```{r} +mn <- mean(mtcars$mpg); s <- sd(mtcars$mpg); z <- qnorm(.05) +mu0 <- mn - z * s / sqrt(nrow(mtcars)) +``` +Note, it's easy to get tripped up in this problem on signs. 
If you get a value +that's less than $\bar X$, then clearly it's wrong, since you'd never reject for +$H_0:\mu = \mu_0$ versus $H_a : \mu < \mu_0$ if $\mu_0$ was less than your observed mean. +Also note the answer to "What is the largest value for which you would reject?" is +infinity. + +`r round(mu0, 2) ` + + +--- &multitext +Consider again the `mtcars` dataset. Use a two group t-test to test +the hypothesis that the 4 and 6 cyl cars have the same mpg. Use +a two sided test with unequal variances. + +1. Do you reject at the 5% level (enter 0 for no, 1 for yes). +2. What is the P-value to 4 decimal places expressed as a proportion? + + +*** .hint +Use `t.test` with the options `var.equal=FALSE`, `paired=FALSE`, `alternative` as `two.sided`. + +*** .explanation + +```{r} +m4 <- mtcars$mpg[mtcars$cyl == 4] +m6 <- mtcars$mpg[mtcars$cyl == 6] +p <- t.test(m4, m6, paired = FALSE, alternative="two.sided", var.equal=FALSE)$p.value +``` +The answer to 1. is `r as.integer(p < .05)`
+The answer to 2. is `r round(p, 4)` + + +--- &multitext +A sample of 100 men yielded an average PSA level of 3.0 with a sd of 1.1. What +are the complete set of values that a 5% two sided Z test of $H_0 : \mu = \mu_0$ +would reject the null hypothesis for? + +1. Enter the lower value to 2 decimal places. +2. Enter the upper value to 2 decimal places. + +*** .hint +This is equivalent to the confidence interval. + +*** .explanation +The answer to 1 is + `r round(3 - qnorm(.975) * 1.1 / sqrt(100), 2)`
+The answer to 2 is `r round(3 + qnorm(.975) * 1.1 / sqrt(100), 2)` + + +--- &multitext +You believe the coin that you're flipping is biased towards heads. You get 55 heads out of +100 flips. + +1. What's the exact relevant pvalue to 4 decimal places (expressed as a proportion)? +2. Would you reject a 1 sided hypothesis at $\alpha = .05$? (0 for no 1 for yes)? + +*** .hint +Use `pbinom` for a hypothesis that $p=.5$ versus $p>.5$ where $p$ is the binomial success +probability. + +*** .explanation +Note you have to start at 54 as `lower.tail = FALSE` gives the strictly greater than +probabilities +```{r} +ans <- round(pbinom(54, prob = .5, size = 100, lower.tail = FALSE),4) +``` +The answer to 1 is `r ans`
+The answer to 2 is `r as.integer(ans < .05)`
+ + +--- &multitext + +A web site was monitored for a year and it received 520 hits per day. In the first +30 days in the next year, the site received 15,800 hits. Assuming that web hits +are Poisson. + +1. Give an exact one sided P-value to the hypothesis that web hits are up this year over last +to four significant digits (expressed as a proportion). +2. Does the one sided test reject (0 for no 1 for yes)? + + + +*** .hint +Consider using `ppois` with $\lambda=520 * 30$. Note this is nearly exactly Gaussian, +so one could get away with the Gaussian calculation. + +*** .explanation +This test comes with the important assumption that web hits are a Poisson process. + +```{r} +pv <- ppois(15800 - 1, lambda = 520 * 30, lower.tail = FALSE) +``` + +The answer to 1 is `r round(pv, 4)`
+The answer to 2 is `r as.integer(pv < .05)`
+ +Also, compare with the Gaussian approximation where $\hat \lambda \sim N(\lambda, \lambda / t)$ +```{r} +pnorm(15800 / 30, mean = 520, sd = sqrt(520 / 30), lower.tail = FALSE) +``` +As $t\rightarrow \infty$ this becomes more Gaussian. The approximation is pretty good for this +setting. + + +--- &multitext + +Suppose that in an AB test, one advertising scheme led to an average of 10 purchases per day for a sample of 100 days, while the other led to 11 purchases per day, also for a sample of 100 days. +Assume a common standard deviation of 4 purchases per day. +Assuming that the groups are independent and that the days are iid, perform a Z test of +equivalence. + +1. What is the P-value reported to 3 digits expressed as a proportion? +2. Do you reject the null hypothesis? (0 for no, 1 for yes). + +*** .hint +The standard error is +$$ +s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} +$$ + +*** .explanation +```{r} +m1 <- 10; m2 <- 11 +n1 <- n2 <- 100 +s <- 4 +se <- s * sqrt(1 / n1 + 1 / n2) +ts <- (m2 - m1) / se +pv <- 2 * pnorm(-abs(ts)) +``` + +The answer to 1 is `r round(pv, 3)`
+The answer to 2 is `r as.integer(pv < .05)` + + +--- &radio + +A confidence interval for the mean contains: + +1. _All of the values of the hypothesized mean for which we would fail to reject with +$\alpha = 1 - Conf. Level$._ +2. All of the values of the hypothesized mean for which we would fail to reject with +$2 \alpha = 1 - Conf. Level$. +3. All of the values of the hypothesized mean for which we would reject with +$\alpha = 1 - Conf. Level$. +4. All of the values of the hypothesized mean for which we would fail to reject with +$2 \alpha = 1 - Conf. Level$. + +*** .hint +This is directly from the notes. Note that a confidence interval gives +values of $\mu$ that are supported by the data whereas a test rejects +for values of $\mu$ different from $\mu_0$. + +*** .explanation +The only complicated part of this is the 2. Note that a 95% interval corresponds +to a 5% level two sided test. So it's $\alpha = 1 - Conf.Level$. The confusion is that +for both the two sided test and confidence interval, one needs to calculate +$Z_{1 - \alpha / 2}$ (or the relevant T quantile). + + +--- &multitext +Consider two problems previous. Assume that 10 purchases per day is a benchmark null value, +that days are iid and that the standard deviation is 4 purchases per day. Suppose that you +plan on sampling 100 days. What would be the power for a one sided 5% +Z mean test that purchases per day +have increased under the alternative of $\mu = 11$ purchases per day? + + +1. Give your result as a proportion to 3 decimal places. + +*** .hint +Under $H_0$ $\bar X \sim N(10, .4)$. Under $H_a$ $\bar X \sim N(11, .4)$. We reject +when $\bar X \geq 10 + Z_{.95} .4$. + + +*** .explanation +The hint pretty much gives it away. +```{r} +power <- pnorm(10 + qnorm(.95) * .4, mean = 11, sd = .4, lower.tail = FALSE) +``` +The answer is `r round(power, 3)` + + +--- &multitext +Researchers would like to conduct a study of healthy adults to detect a four year mean brain volume loss of .01 mm3. 
Assume that the standard deviation of four year volume loss in this population is .04 mm3. + +1. What is the necessary sample size for the study for a 5% one sided test versus a null hypothesis of no volume loss to achieve 80% power? (Always round up) + + + +*** .hint +Under $H_0$ $\bar X$ is $N(0, .04 / \sqrt{n})$ and is $N(.01, .04 / \sqrt{n})$ under $H_a$. +We will reject if +$$ +\bar X \geq Z_{.95} s / \sqrt{n} +$$ +which has probability 0.05 under $H_0$. Under $H_a$ it has probability +$$ +P\left( \frac{\bar X - 0.01}{s / \sqrt{n}} \geq z_{.95} - \frac{.01}{s / \sqrt{n}} \right) += P\left( Z \geq z_{.95} - \frac{.01}{s / \sqrt{n}}\right) +$$ + +*** .explanation +Looking at the hint we set +$$ +z_{.95} - \frac{.01}{s / \sqrt{n}} = z_{.2} +$$ +$$ +n = \frac{(z_{.95} - z_{.2})^2 s^2}{.01^2} = \frac{ (z_{.95} + z_{.8})^2 s^2}{.01^2} +$$ +So we get +```{r} +n <- (qnorm(.95) + qnorm(.8)) ^ 2 * .04 ^ 2 / .01^2 +``` +The answer is `r ceiling(n)` + + +--- &radio + +In a court of law, all things being equal, if via policy you require a lower +standard of evidence to convict people then + +1. Fewer guilty people will be convicted. +2. _More innocent people will be convicted._ +3. More innocent people will not be convicted. + + +*** .hint +Think about it. + +*** .explanation +If you require less evidence to convict, then you will convict more people, guilty and +innocent. Relate this property back to hypothesis tests. + + +--- &multitext +Consider the `mtcars` data set. + +1. Give the p-value for a t-test assuming +constant variance comparing MPG for 6 and 8 cylinder cars as a proportion to 3 decimal places. +2. Give the associated P-value for a z test. +3. Give the common standard deviation estimate for MPG across cylinders to 3 decimal places. +4. Would the t test reject at the two sided 0.05 level (0 for no 1 for yes)? 
+
+
+*** .hint
+```{r}
+mpg8 <- mtcars$mpg[mtcars$cyl == 8]
+mpg6 <- mtcars$mpg[mtcars$cyl == 6]
+m8 <- mean(mpg8); m6 <- mean(mpg6)
+s8 <- sd(mpg8); s6 <- sd(mpg6)
+n8 <- length(mpg8); n6 <- length(mpg6)
+```
+
+*** .explanation
+```{r}
+p <- t.test(mpg8, mpg6, paired = FALSE, alternative="two.sided", var.equal=TRUE)$p.value
+mixprob <- (n8 - 1) / (n8 + n6 - 2)
+s <- sqrt(mixprob * s8 ^ 2 + (1 - mixprob) * s6 ^ 2)
+z <- (m8 - m6) / (s * sqrt(1 / n8 + 1 / n6))
+pz <- 2 * pnorm(-abs(z))
+## Hand calculating the T just to check
+#2 * pt(-abs(z), df = n8 + n6 - 2)
+````
+
+1. `r round(p, 3)`
+2. `r round(pz, 3)`
+3. `r round(s, 3)`
+4. `r as.integer(p < .05)`
+
+
+--- &radio
+The Bonferroni correction controls this
+
+1. False discovery rate
+2. _The familywise error rate_
+3. The rate of true rejections
+4. The number of true rejections
+
+*** .hint
+This is pretty much straight out of the notes
+
+*** .explanation
+The Bonferroni correction is the classic correction for the familywise error rate.
+
+
diff --git a/06_StatisticalInference/homework/hw4.html b/06_StatisticalInference/homework/hw4.html
index 565621a6d..cc90107c4 100644
--- a/06_StatisticalInference/homework/hw4.html
+++ b/06_StatisticalInference/homework/hw4.html
@@ -59,6 +59,500 @@

About these slides

+ +
+ +
+

Load the data set mtcars in the datasets R package. You want +to test whether the MPG is \(\mu_0\) or smaller using a one sided +5% level test. (\(H_0 : \mu = \mu_0\) versus \(H_a : \mu < \mu_0\)). +Using that data set and a Z test:

+ +
    +
  1. what is the smallest value of \(\mu_0\) that you would reject for?
  2. +
+ +

Both to two decimal places.

+ + + + + + +
+

This is the inversion of a one sided hypothesis test. It yields confidence +bounds. (Note inverting a two sided test yields confidence intervals.) +Think about the derivation of the confidence interval.

+ +
+
+

We want to solve +\[ +\frac{\sqrt{n}(\bar{X} - \mu_0)}{s} = Z_{0.05} +\] +Or \[\mu_0 = \bar{X} - Z_{0.05} s / \sqrt{n} = \bar{X} + Z_{0.95} s / \sqrt{n}\]. Note that the quantile is negative.

+ +
mn <- mean(mtcars$mpg); s <- sd(mtcars$mpg); z <- qnorm(.05)
+mu0 <- mn - z * s / sqrt(nrow(mtcars))
+
+ +

Note, it's easy to get tripped up on signs in this problem. If you get a value +that's less than \(\bar X\), then clearly it's wrong, since you'd never reject for +\(H_0:\mu = \mu_0\) versus \(H_a : \mu < \mu_0\) if \(\mu_0\) was less than your observed mean. +Also note the answer to "What is the largest value for which you would reject?" is +infinity.

+ +

21.84

+ +
+
+
+ +
+ + +
+ +
+

Consider again the mtcars dataset. Use a two group t-test to test +the hypothesis that the 4 and 6 cyl cars have the same mpg. Use +a two sided test with unequal variances.

+ +
    +
  1. Do you reject at the 5% level (enter 0 for no, 1 for yes).
  2. +
  3. What is the P-value to 4 decimal places expressed as a proportion?
  4. +
+ + + + + + +
+

Use t.test with the options var.equal=FALSE, paired=FALSE, alternative as two.sided.

+ +
+
+
m4 <- mtcars$mpg[mtcars$cyl == 4]
+m6 <- mtcars$mpg[mtcars$cyl == 6]
+p <- t.test(m4, m6, paired = FALSE, alternative="two.sided", var.equal=FALSE)$p.value
+
+ +

The answer to 1. is 1
+The answer to 2. is 4e-04

+ +
+
+
+ +
+ + +
+ +
+

A sample of 100 men yielded an average PSA level of 3.0 with a sd of 1.1. What +is the complete set of values of \(\mu_0\) for which a 5% two sided Z test of \(H_0 : \mu = \mu_0\) +would fail to reject the null hypothesis?

+ +
    +
  1. Enter the lower value to 2 decimal places.
  2. +
  3. Enter the upper value to 2 decimal places.
  4. +
+ + + + + + +
+

This is equivalent to the confidence interval.

+ +
+
+

The answer to 1 is + 2.78
+The answer to 2 is 3.22
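
No code accompanies this problem, so here is a minimal sketch of the interval calculation. It assumes only the summary statistics from the problem statement; the variable names are illustrative.

```r
# Summary statistics taken from the problem statement
xbar <- 3.0; s <- 1.1; n <- 100
# 95% Z interval: the values of mu0 that a 5% two sided Z test fails to reject
ci <- xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)
round(ci, 2)
# 2.78 3.22
```

Values of \(\mu_0\) outside this interval are the ones the test rejects.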

+ +
+
+
+ +
+ + +
+ +
+

You believe the coin that you're flipping is biased towards heads. You get 55 heads out of +100 flips.

+ +
    +
  1. What's the exact relevant pvalue to 4 decimal places (expressed as a proportion)?
  2. +
  3. Would you reject a 1 sided hypothesis at \(\alpha = .05\)? (0 for no 1 for yes)?
  4. +
+ + + + + + +
+

Use pbinom for a hypothesis that \(p=.5\) versus \(p>.5\) where \(p\) is the binomial success +probability.

+ +
+
+

Note you have to start at 54 as lower.tail = FALSE gives the strictly greater than +probabilities

+ +
ans <- round(pbinom(54, prob = .5, size = 100, lower.tail = FALSE),4)
+
+ +

The answer to 1 is 0.1841
+The answer to 2 is 0
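
As a cross-check, the same exact P-value should be obtainable from `binom.test` in the stats package, which handles the off-by-one in the tail for you. This is a sketch for comparison, not part of the original solution.

```r
# Manual upper tail: P(X >= 55) = P(X > 54)
p_manual <- pbinom(54, size = 100, prob = 0.5, lower.tail = FALSE)
# Exact one sided binomial test of p = .5 versus p > .5
p_exact <- binom.test(55, n = 100, p = 0.5, alternative = "greater")$p.value
all.equal(p_manual, p_exact)
```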

+ +
+
+
+ +
+ + +
+ +
+

A web site was monitored for a year and it received an average of 520 hits per day. In the first +30 days in the next year, the site received 15,800 hits. Assume that web hits +are Poisson.

+ +
    +
  1. Give an exact one sided P-value to the hypothesis that web hits are up this year over last +to four significant digits (expressed as a proportion).
  2. +
  3. Does the one sided test reject (0 for no 1 for yes)?
  4. +
+ + + + + + +
+

Consider using ppois with \(\lambda=520 * 30\). Note this is nearly exactly Gaussian, +so one could get away with the Gaussian calculation.

+ +
+
+

This test comes with the important assumption that web hits are a Poisson process.

+ +
pv <- ppois(15800 - 1, lambda = 520 * 30, lower.tail = FALSE)
+
+ +

The answer to 1 is 0.0553
+The answer to 2 is 0

+ +

Also, compare with the Gaussian approximation where \(\hat \lambda \sim N(\lambda, \lambda / t)\)

+ +
pnorm(15800 / 30, mean = 520, sd = sqrt(520 / 30), lower.tail = FALSE)
+
+ +
[1] 0.05466
+
+ +

As \(t\rightarrow \infty\) this becomes more Gaussian. The approximation is pretty good for this +setting.
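
The exact calculation can also be written with `poisson.test` from the stats package, which takes the count, the monitoring time and the null rate directly. A minimal sketch, assuming the same Poisson model as above:

```r
# Manual upper tail: P(X >= 15800) under lambda = 520 * 30
p_manual <- ppois(15800 - 1, lambda = 520 * 30, lower.tail = FALSE)
# Exact one sided Poisson test: 15800 events in T = 30 days versus a null rate of 520 per day
p_test <- poisson.test(15800, T = 30, r = 520, alternative = "greater")$p.value
all.equal(p_manual, p_test)
```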

+ +
+
+
+ +
+ + +
+ +
+

Suppose that in an AB test, one advertising scheme led to an average of 10 purchases per day for a sample of 100 days, while the other led to 11 purchases per day, also for a sample of 100 days. +Assume a common standard deviation of 4 purchases per day. +Assuming that the groups are independent and that the days are iid, perform a Z test of +equivalence.

+ +
    +
  1. What is the P-value reported to 3 digits expressed as a proportion?
  2. +
  3. Do you reject the null hypothesis? (0 for no 1 for yes).
  4. +
+ + + + + + +
+

The standard error is +\[ +s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} +\]

+ +
+
+
m1 <- 10; m2 <- 11
+n1 <- n2 <- 100
+s <- 4
+se <- s * sqrt(1 / n1 + 1 / n2)
+ts <- (m2 - m1) / se
+pv <- 2 * pnorm(-abs(ts))
+
+ +

The answer to 1 is 0.077
+The answer to 2 is 0

+ +
+
+
+ +
+ + +
+ +
+

A confidence interval for the mean contains:

+ +
    +
  1. All of the values of the hypothesized mean for which we would fail to reject with +\(\alpha = 1 - Conf. Level\).
  2. +
  3. All of the values of the hypothesized mean for which we would fail to reject with +\(2 \alpha = 1 - Conf. Level\).
  4. +
  5. All of the values of the hypothesized mean for which we would reject with +\(\alpha = 1 - Conf. Level\).
  6. +
  7. All of the values of the hypothesized mean for which we would fail to reject with +\(2 \alpha = 1 - Conf. Level\).
  8. +
+ + + + + + +
+

This is directly from the notes. Note that a confidence interval gives +values of \(\mu\) that are supported by the data whereas a test rejects +for values of \(\mu\) different from \(\mu_0\).

+ +
+
+

The only complicated part of this is the 2. Note that a 95% interval corresponds +to a 5% level two sided test. So it's \(\alpha = 1 - Conf.Level\). The confusion is that +for both the two sided test and confidence interval, one needs to calculate +\(Z_{1 - \alpha / 2}\) (or the relevant T quantile).
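
The duality can be checked numerically. The sketch below uses `mtcars$mpg` purely as illustrative data (an assumption, not part of this question): testing at either endpoint of the 95% interval gives a two sided P-value of 0.05, and values inside the interval give larger P-values.

```r
# 95% t interval for mean MPG
ci <- t.test(mtcars$mpg, conf.level = 0.95)$conf.int
# Two sided P-values when the hypothesized mean sits exactly on each endpoint
p_at_bounds <- sapply(ci, function(m0) t.test(mtcars$mpg, mu = m0)$p.value)
# A hypothesized mean inside the interval (its midpoint) is not rejected
p_inside <- t.test(mtcars$mpg, mu = mean(ci))$p.value
p_at_bounds
p_inside
```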

+ +
+
+
+ +
+ + +
+ +
+

Consider the problem two questions back. Assume that 10 purchases per day is a benchmark null value, +that days are iid and that the standard deviation is 4 purchases per day. Suppose that you +plan on sampling 100 days. What would be the power for a one sided 5% +Z mean test that purchases per day +have increased under the alternative of \(\mu = 11\) purchases per day?

+ +
    +
  1. Give your result as a proportion to 3 decimal places.
  2. +
+ + + + + + +
+

Under \(H_0\) \(\bar X \sim N(10, .4)\). Under \(H_a\) \(\bar X \sim N(11, .4)\). We reject +when \(\bar X \geq 10 + Z_{.95} .4\).

+ +
+
+

The hint pretty much gives it away.

+ +
power <- pnorm(10 + qnorm(.95) * .4, mean = 11, sd = .4, lower.tail = FALSE)
+
+ +

The answer is 0.804

+ +
+
+
+ +
+ + +
+ +
+

Researchers would like to conduct a study of healthy adults to detect a four year mean brain volume loss of .01 mm3. Assume that the standard deviation of four year volume loss in this population is .04 mm3.

+ +
    +
  1. What is the necessary sample size for the study for a 5% one sided test versus a null hypothesis of no volume loss to achieve 80% power? (Always round up)
  2. +
+ + + + + + +
+

Under \(H_0\) \(\bar X\) is \(N(0, .04 / \sqrt{n})\) and is \(N(.01, .04 / \sqrt{n})\) under \(H_a\). +We will reject if +\[ +\bar X \geq Z_{.95} s / \sqrt{n} +\] +which has probability 0.05 under \(H_0\). Under \(H_a\) it has probability +\[ +P\left( \frac{\bar X - 0.01}{s / \sqrt{n}} \geq z_{.95} - \frac{.01}{s / \sqrt{n}} \right) += P\left( Z \geq z_{.95} - \frac{.01}{s / \sqrt{n}} \right) +\]

+ +
+
+

Looking at the hint we set +\[ +z_{.95} - \frac{.01}{s / \sqrt{n}} = z_{.2} +\] +\[ +n = \frac{(z_{.95} - z_{.2})^2 s^2}{.01^2} = \frac{ (z_{.95} + z_{.8})^2 s^2}{.01^2} +\] +So we get

+ +
n <- (qnorm(.95) + qnorm(.8)) ^ 2 * .04 ^ 2 / .01^2
+
+ +

The answer is 99
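
One way to sanity check the rounded-up answer is to evaluate the power of the one sided Z test directly at neighbouring sample sizes. The helper `power_z` below is a hypothetical name introduced for this sketch; its defaults come from the problem statement.

```r
# Power of a one sided level-alpha Z test of H0: mu = 0 versus Ha: mu = delta
power_z <- function(n, delta = 0.01, sigma = 0.04, alpha = 0.05) {
  pnorm(qnorm(1 - alpha) - delta / (sigma / sqrt(n)), lower.tail = FALSE)
}
power_z(98)  # just under 0.8
power_z(99)  # just over 0.8
```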

+ +
+
+
+ +
+ + +
+ +
+

In a court of law, all things being equal, if via policy you require a lower +standard of evidence to convict people then

+ +
    +
  1. Fewer guilty people will be convicted.
  2. +
  3. More innocent people will be convicted.
  4. +
  5. More innocent people will not be convicted.
  6. +
+ + + + + + +
+

Think about it.

+ +
+
+

If you require less evidence to convict, then you will convict more people, guilty and +innocent. Relate this property back to hypothesis tests.
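
In testing terms, lowering the standard of evidence corresponds to raising \(\alpha\): the Type I error rate goes up, and so does power. The sketch below borrows the numbers from the purchases problem (null mean 10, alternative mean 11, standard error 0.4) as an assumed setting; `power_at` is a hypothetical helper.

```r
# Power of a one sided Z test as a function of the level alpha
power_at <- function(alpha, mu0 = 10, mua = 11, se = 0.4) {
  cutoff <- mu0 + qnorm(1 - alpha) * se   # reject when the sample mean exceeds this
  pnorm(cutoff, mean = mua, sd = se, lower.tail = FALSE)
}
alphas <- c(0.01, 0.05, 0.10)
power <- sapply(alphas, power_at)
data.frame(alpha = alphas, power = power)
```

Each increase in \(\alpha\) convicts more of the "guilty" (higher power) at the cost of convicting more of the "innocent" (higher Type I error rate).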

+ +
+
+
+ +
+ + +
+ +
+

Consider the mtcars data set.

+ +
    +
  1. Give the p-value for a t-test assuming +constant variance comparing MPG for 6 and 8 cylinder cars as a proportion to 3 decimal places.
  2. +
  3. Give the associated P-value for a z test.
  4. +
  5. Give the common standard deviation estimate for MPG across cylinders to 3 decimal places.
  6. +
  7. Would the t test reject at the two sided 0.05 level (0 for no 1 for yes)?
  8. +
+ + + + + + +
+
mpg8 <- mtcars$mpg[mtcars$cyl == 8]
+mpg6 <- mtcars$mpg[mtcars$cyl == 6]
+m8 <- mean(mpg8); m6 <- mean(mpg6)
+s8 <- sd(mpg8); s6 <- sd(mpg6)
+n8 <- length(mpg8); n6 <- length(mpg6)
+
+ +
+
+
p <- t.test(mpg8, mpg6, paired = FALSE, alternative="two.sided", var.equal=TRUE)$p.value
+mixprob <- (n8 - 1) / (n8 + n6 - 2)
+s <- sqrt(mixprob * s8 ^ 2  +  (1 - mixprob) * s6 ^ 2)
+z <- (m8 - m6) / (s * sqrt(1 / n8 + 1 / n6))
+pz <- 2 * pnorm(-abs(z))
+## Hand calculating the T just to check
+#2 * pt(-abs(z), df = n8 + n6 - 2)
+
+ +
    +
  1. 0
  2. +
  3. 0
  4. +
  5. 2.27
  6. +
  7. 1
  8. +
+ +
+
+
+ +
+ + +
+ +
+

The Bonferroni correction controls this

+ +
    +
  1. False discovery rate
  2. +
  3. The familywise error rate
  4. +
  5. The rate of true rejections
  6. +
  7. The number of true rejections
  8. +
+ + + + + + +
+

This is pretty much straight out of the notes

+ +
+
+

The Bonferroni correction is the classic correction for the familywise error rate.
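
For a concrete picture, base R's `p.adjust` applies the correction by multiplying each P-value by the number of tests (capped at 1). The P-values below are made up for illustration.

```r
p <- c(0.01, 0.02, 0.04)                       # hypothetical raw P-values from 3 tests
p_adj <- p.adjust(p, method = "bonferroni")    # min(1, 3 * p)
reject <- p_adj < 0.05                         # controls the familywise error rate at 5%
p_adj
reject
```

Only the first test survives the correction here, even though all three raw P-values are below 0.05.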

+ +
+
+
+ +
+ - - - - - - - - - - - - - - - - - - - - - - -
-

Math notation in R markdown

-

-


Johns Hopkins Bloomberg School of Public Health

-
-
- - - -
-

Why math notation in R markdown?

-
-
-
    -
  • Math notation is the standard for technical descriptions of machine learning/statistical models.
  • -
  • You may want to intersperse your technical decriptions with plain language descriptions.
  • -
  • Math notation allows you to be precise - -
      -
    • "We fit a linear model with terms for age, sex" versus \(Y_i = \alpha + \beta_a A_i + \beta_s S_i + \epsilon_i\)
    • -
  • -
  • Math notation allows you to be concise - -
      -
    • "We estimated the intercept to be 3.3" versus \(\hat{\alpha}=3.3\).
    • -
  • -
- -
- -
- - -
-

What notation system does R markdown use?

-
-
-
    -
  • R markdown uses the same system as Latex
  • -
  • The basic idea: - -
      -
    • You write your document in R markdown
    • -
    • You include symbols for math notation
    • -
    • You indicate math by wrapping the symbols with certain text
    • -
  • -
  • Often the symbols are intuitive: \alpha gives you \(\alpha\)
  • -
- -
- -
- - -
-

How to write math inline

-
-
-

Including math in a sentence involves wrapping the symbols in $ symbols. For -example if you write this in an R markdown file:

- -

"The intercept was estimated as $\hat{\alpha} = 4$"

- -

Then you get the following text after running knit2html or slidify on your document.

- -

"The intercept was estimated as \(\hat{\alpha} = 4\)"

- -
- -
- - -
-

How to write math on a separate line

-
-
-

Sometimes you have several equations you would like to line up. The -way that you do that is with a double dollar sign $$ and the align command.

- -

For example if you write

- -

- -

Then you get the following text after running knit2html or slidify on your document.

- -

\[ - \begin{aligned} - y &= \beta_0 + \beta_1 + x_1 + \epsilon\\ - x &= \gamma z \\ - z &\sim N(0,1) - \end{aligned} -\]

- -
- -
- - -
-

Common symbols

-
-
-
    -
  • Subscripts to get \(a_{b}\) write: $a_{b}$
  • -
  • Superscripts write \(a^{b}\) write: $a^{b}$
  • -
  • Greek letters like \(\alpha, \beta, \ldots\) write: $\alpha, \beta, \ldots$
  • -
  • Sums like \(\sum_{n=1}^N\) write: $\sum_{n=1}^N$
  • -
  • Multiplication like \(\times\) write: $\times$
  • -
  • Products like \(\prod_{n=1}^N\) write: $\prod_{n=1}^N$
  • -
  • Inequalities like \(<, \leq, \geq\) write: $<, \leq, \geq$
  • -
  • Distributed like \(\sim\) write: $\sim$
  • -
  • Hats like \(\widehat{\alpha}\) write: $\widehat{\alpha}$
  • -
  • Averages like \(\bar{x}\) write: $\bar{x}$
  • -
  • Fractions like \(\frac{a}{b}\) write: $\frac{a}{b}$
  • -
  • Big parentheses like \(\left(\frac{a}{b}\right)\) write: $\left(\frac{a}{b}\right)$
  • -
- -
- -
- - -
-

For more information

-
- - -
- - -
- - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/06_StatisticalInference/mathNotation/index.md b/06_StatisticalInference/mathNotation/index.md deleted file mode 100644 index fcb77d8e8..000000000 --- a/06_StatisticalInference/mathNotation/index.md +++ /dev/null @@ -1,105 +0,0 @@ ---- -title : Math notation in R markdown -subtitle : -author : -job : Johns Hopkins Bloomberg School of Public Health -logo : bloomberg_shield.png -framework : io2012 # {io2012, html5slides, shower, dzslides, ...} -highlighter : highlight.js # {highlight.js, prettify, highlight} -hitheme : zenburn # -url: - lib: ../../libraries - assets: ../../assets -widgets : [mathjax] # {mathjax, quiz, bootstrap} -mode : selfcontained # {standalone, draft} ---- - - - - - -## Why math notation in R markdown? - -* Math notation is the standard for technical descriptions of machine learning/statistical models. -* You may want to intersperse your technical decriptions with plain language descriptions. -* Math notation allows you to be precise - * "We fit a linear model with terms for age, sex" versus $Y_i = \alpha + \beta_a A_i + \beta_s S_i + \epsilon_i$ -* Math notation allows you to be concise - * "We estimated the intercept to be 3.3" versus $\hat{\alpha}=3.3$. - ---- - -## What notation system does R markdown use? - -* R markdown uses the same system as [Latex](http://en.wikipedia.org/wiki/LaTeX) -* The basic idea: - * You write your document in R markdown - * You include symbols for math notation - * You indicate math by wrapping the symbols with certain text -* Often the symbols are intuitive: _\alpha_ gives you $\alpha$ - ---- - -## How to write math inline - -Including math in a sentence involves wrapping the symbols in $ symbols. For -example if you write this in an R markdown file: - -
"The intercept was estimated as `$\hat{\alpha} = 4$`"
- -Then you get the following text after running _knit2html_ or _slidify_ on your document. - -
"The intercept was estimated as $\hat{\alpha} = 4$"
- ---- - -## How to write math on a separate line - -Sometimes you have several equations you would like to line up. The -way that you do that is with a double dollar sign \$$ and the align command. - -For example if you write - - - -Then you get the following text after running _knit2html_ or _slidify_ on your document. - -$$ - \begin{aligned} - y &= \beta_0 + \beta_1 + x_1 + \epsilon\\ - x &= \gamma z \\ - z &\sim N(0,1) - \end{aligned} -$$ - ---- - -## Common symbols - -* Subscripts to get $a_{b}$ write: `$a_{b}$` -* Superscripts write $a^{b}$ write: `$a^{b}$` -* Greek letters like $\alpha, \beta, \ldots$ write: `$\alpha, \beta, \ldots$` -* Sums like $\sum_{n=1}^N$ write: `$\sum_{n=1}^N$` -* Multiplication like $\times$ write: `$\times$` -* Products like $\prod_{n=1}^N$ write: `$\prod_{n=1}^N$` -* Inequalities like $<, \leq, \geq$ write: `$<, \leq, \geq$` -* Distributed like $\sim$ write: `$\sim$` -* Hats like $\widehat{\alpha}$ write: `$\widehat{\alpha}$` -* Averages like $\bar{x}$ write: `$\bar{x}$` -* Fractions like $\frac{a}{b}$ write: `$\frac{a}{b}$` -* Big parentheses like $\left(\frac{a}{b}\right)$ write: `$\left(\frac{a}{b}\right)$` - - ---- - -## For more information - -* Rstudio's equations page: - * http://www.rstudio.com/ide/docs/authoring/using_markdown_equations -* Lists of Latex symbols - * http://www.rpi.edu/dept/arc/training/latex/LaTeX_symbols.pdf - * http://www.giss.nasa.gov/tools/latex/ltx-2.html - * http://omega.albany.edu:8008/Symbols.html - - - diff --git a/09_DevelopingDataProducts/rStudioPresent/index.md b/09_DevelopingDataProducts/rStudioPresent/index.md index b998542ae..6b5c3334e 100644 --- a/09_DevelopingDataProducts/rStudioPresent/index.md +++ b/09_DevelopingDataProducts/rStudioPresent/index.md @@ -1,7 +1,7 @@ RStudio Presenter === author: Brian Caffo, Jeff Leek Roger Peng -date: May 21 2014 +date: May 24 2014 transition: rotate