STA 2000 Notes
Statistics notes, kept only as a personal study record; correctness is not guaranteed.
Index
- Index
- Chapter 1: Defining and Collecting Data
- Chapter 2: Organizing and Visualizing Variables
- Chapter 3: Numerical Descriptive Measures
- 3.1 Measures of Central Tendency
- 3.2 Measures of Variation
- 3.3 Shape of a Distribution
- 3.4 Quartile Measures
- 3.5 Five Number Summary
- 3.6 Numerical Descriptive Measures for Populations
- 3.7 Empirical Rule
- 3.8 Chebyshev's Rule
- 3.9 Covariance
- 3.10 Correlation Coefficient
- 3.11 Pitfalls in Numerical Descriptive Measures
- Chapter 4: Basic Probability
- Chapter 5: Discrete Probability Distributions
- Chapter 6: The Normal Distribution
- Chapter 7: Sampling Distributions
- Chapter 8: Confidence Interval Estimation
- Chapter 9: Fundamentals of Hypothesis Testing: One-Sample Tests
- Chapter 10: Two-Sample Tests and One-Way ANOVA
Chapter 1: Defining and Collecting Data
1.1 Variables
Categorical (qualitative): a variable that can be placed into a specific category according to some characteristic or attribute.
- Nominal: no natural ordering of the categories
- Ordinal: natural ordering of the categories
Numerical (quantitative): a variable that can be measured numerically.
- Discrete: values arise from a counting process
- Continuous: values arise from a measuring process
1.1.1 Measurement Scales
Interval scale: an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
Ratio scale: an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point.
1.2 Population and Sample
Population: the set of all elements of interest in a particular study.
- contains all measurements of interest to the investigator
Sample: a subset of the population.
- a part of the population selected for analysis
1.2.1 Parameter and Statistic
Population parameter: a numerical measure that describes an aspect of a population.
Sample statistic: a numerical measure that describes an aspect of a sample.
1.2.2 Sources of Data
Primary sources: data that are generated by the investigator conducting the study.
- from a political survey
- collected from an experiment
- observed data
Secondary sources: data that were produced by someone other than the investigator conducting the study.
- analyzing census data
- examining data from print journals or data published on the Internet
1.2.3 Probability Sample
In a probability sample, items in the sample are chosen on the basis of known probabilities.
- Simple random sample: every individual or item from the frame has an equal chance of being selected
- Systematic sample: the items are selected according to a specified time or item interval in the sampling frame
  - divide the frame of $N$ individuals into $n$ groups of $k$ individuals: $k = N/n$
- Stratified sample: divide the population into two or more subgroups (strata) according to some characteristic that is important to the study
- Cluster sample: population is divided into several "clusters" or sections, then some of those clusters are randomly selected and all members of the selected clusters are used as the sample
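The four probability sampling methods above can be sketched with Python's standard `random` module. This is only an illustration: the frame of 40 item IDs, the strata and cluster groupings, and all function names are hypothetical, and the stratified version assumes the same sampling fraction in every stratum.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every item in the frame has an equal chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start in the first group, then take every k-th item."""
    k = len(frame) // n          # group size, k = N / n (rounded down)
    start = rng.randrange(k)     # random position within the first group
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from every stratum."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep all of their members."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)           # fixed seed so the run is repeatable
frame = list(range(1, 41))       # hypothetical frame of N = 40 item IDs
srs = simple_random_sample(frame, 8, rng)
sys_sample = systematic_sample(frame, 8, rng)   # k = 40 / 8 = 5
strata = {"A": frame[:30], "B": frame[30:]}
strat = stratified_sample(strata, 0.2, rng)     # 6 from A, 2 from B
clusters = {c: frame[c * 10:(c + 1) * 10] for c in range(4)}
clus = cluster_sample(clusters, 2, rng)         # 2 clusters of 10 items
```

Note that the systematic sample randomizes only the starting point; every later pick is fixed by the interval $k$.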
Chapter 2: Organizing and Visualizing Variables
2.1 Organizing Categorical Data
- Summary table: tallies the frequencies or percentages of items in a set of categories so that you can see differences between categories
- Contingency table: a table that classifies sample observations according to two or more identifiable categories so that the relationship between the categories can be studied
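As a rough sketch of both tables, `collections.Counter` can tally one categorical variable (summary table) or pairs of categories (contingency table). The survey responses below are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses: (payment method, customer type)
responses = [
    ("card", "new"), ("cash", "returning"), ("card", "returning"),
    ("card", "new"), ("mobile", "new"), ("cash", "returning"),
    ("card", "returning"), ("mobile", "returning"),
]

# Summary table: frequency and percentage for one categorical variable
method_counts = Counter(method for method, _ in responses)
total = sum(method_counts.values())
summary = {m: (c, 100 * c / total) for m, c in method_counts.items()}

# Contingency table: joint frequencies for two categorical variables
contingency = Counter(responses)

for method, (count, pct) in summary.items():
    print(f"{method:<8}{count:>3}{pct:>8.1f}%")
print(contingency[("card", "new")])   # count of new customers paying by card
```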
2.2 Organizing Numerical Data
- Ordered array: a sequence of data, in rank order, from the smallest value to the largest value.
- Frequency distribution: a summary table in which the data are arranged into numerically ordered classes
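A minimal sketch of both ideas, assuming a small made-up data set and hand-picked class boundaries (width 10, starting at 20):

```python
# Hypothetical data: 12 meal costs in dollars
data = [34, 21, 47, 29, 52, 38, 25, 41, 33, 58, 27, 45]

ordered_array = sorted(data)          # smallest value to largest value

# Classes of equal width; the boundaries here are chosen by hand.
class_width = 10
lower_bound = 20
num_classes = 4

frequency = [0] * num_classes
for value in data:
    index = (value - lower_bound) // class_width   # class the value falls in
    frequency[index] += 1

for i, count in enumerate(frequency):
    low = lower_bound + i * class_width
    print(f"{low} but less than {low + class_width}: {count}")
```

Each value lands in exactly one class because the classes do not overlap and together cover the whole range of the data.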
2.3 Visualizing Categorical Data
- Bar chart: visualizes a categorical variable as a series of bars
- Pie chart: a circle broken up into slices that represent categories
- Doughnut chart: the outer part of a circle broken up into pieces that represent categories
- Pareto chart: a vertical bar chart, where categories are shown in descending order of frequency
- Side by side bar chart: a bar chart that compares two or more categories
- Doughnut chart(contingency): a doughnut chart that compares two or more categories
2.4 Visualizing Numerical Data
- Stem-and-leaf display: a simple way to see how the data are distributed and where concentrations of data exist
- Histogram: a vertical bar chart of the data in a frequency distribution
- Percentage polygon: formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages
2.4.1 Visualizing Two Numerical Variables
- Scatter plot: used for numerical data consisting of paired observations taken from two numerical variables
- Time series plot: used to study patterns in the values of a numeric variable over time
Chapter 3: Numerical Descriptive Measures
Central tendency: the extent or inclination to which the values of a numerical variable group or cluster around a typical or central value.
3.1 Measures of Central Tendency
Measure of central tendency: a single value that attempts to describe a set of data by identifying the central position within that set of data.
3.1.1 Mean
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
$\bar{X}$ -> sample mean; pronounced "X-bar"
$n$ -> sample size
$X_i$ -> the $i$th value in the sample
$X_1, X_2, \ldots, X_n$ -> the observed values
3.1.2 Median
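- Sample size is odd: the median is the $\frac{n+1}{2}$th ranked value
- Sample size is even: the median is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th ranked values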
3.1.3 Mode
Mode: the value that occurs most frequently in a data set.
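The three measures of central tendency can be computed by hand in a few lines. The sketch below uses a hypothetical sample of eight values and takes the median as the middle ranked value (odd $n$) or the average of the two middle ranked values (even $n$).

```python
import statistics

data = [3, 7, 7, 2, 9, 7, 4, 5]      # hypothetical sample, n = 8

mean = sum(data) / len(data)          # sum of the values divided by n

ranked = sorted(data)
n = len(ranked)
if n % 2 == 1:
    median = ranked[(n + 1) // 2 - 1]     # the (n+1)/2 ranked value
else:
    # average of the two middle ranked values
    median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2

mode = statistics.mode(data)          # most frequent value
```

The standard library's `statistics.mean` and `statistics.median` give the same results; the manual version is shown only to mirror the definitions.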
3.2 Measures of Variation
Measure of variation: gives information on the spread or variability or dispersion of the data values.
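3.2.1 Range
$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$
3.2.2 Sample Variance
Sample variance: the sum of squared deviations of the values from the mean, divided by $n - 1$ (approximately the average squared deviation).
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
$\bar{X}$ -> arithmetic mean
$n$ -> sample size
$X_i$ -> the $i$th value in the sample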
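3.2.3 Sample Standard Deviation
$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
3.2.4 Coefficient of Variation
$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
3.2.5 Z-Score
Z-Score: the number of standard deviations that a given value is above or below the mean.
$Z = \frac{X - \bar{X}}{S}$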
- a data value is considered an extreme outlier if its Z-Score is less than -3 or greater than +3
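The measures of variation in this section can be computed together. This is a sketch on a hypothetical six-value sample, using the sample forms (dividing by $n - 1$) and the outlier rule stated above.

```python
import math

data = [4, 8, 6, 5, 3, 10]            # hypothetical sample

n = len(data)
mean = sum(data) / n
data_range = max(data) - min(data)                          # largest minus smallest
variance = sum((x - mean) ** 2 for x in data) / (n - 1)     # sample variance
std_dev = math.sqrt(variance)                               # sample standard deviation
cv = std_dev / mean * 100                                   # coefficient of variation, in percent
z_scores = [(x - mean) / std_dev for x in data]

# Values whose Z-score is below -3 or above +3
extreme = [x for x, z in zip(data, z_scores) if abs(z) > 3]
```

Z-scores always sum to zero, because the deviations from the mean do.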
3.3 Shape of a Distribution
3.3.1 Skewness
Skewness: a measure of the degree of asymmetry of a distribution.
3.3.2 Kurtosis
Kurtosis: a measure of the degree of peakedness of a distribution.
3.4 Quartile Measures
Quartile: a value that divides a data set into four groups containing (as far as possible) an equal number of observations.
$Q_1$ -> first quartile
$Q_2$ -> second quartile; the median
$Q_3$ -> third quartile
3.4.1 Locating Quartiles
$Q_1 = \frac{n+1}{4}$ ranked value
$Q_2 = \frac{n+1}{2}$ ranked value
$Q_3 = \frac{3(n+1)}{4}$ ranked value
$n$ -> the number of observed values
3.4.2 Interquartile Range
Interquartile range: the difference between the third and first quartiles.
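A sketch of locating quartiles by rank, placing $Q_1$, $Q_2$, and $Q_3$ at ranks $(n+1)/4$, $(n+1)/2$, and $3(n+1)/4$. How to handle a fractional rank is an assumption here: this version interpolates linearly between the two neighbouring ranked values, whereas some textbooks round the rank instead. The data and function names are hypothetical.

```python
def ranked_value(ranked, position):
    """Value at a 1-based rank; fractional ranks are linearly interpolated."""
    lower = int(position)                 # whole part of the rank
    fraction = position - lower
    if fraction == 0:
        return float(ranked[lower - 1])
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

def quartiles(data):
    """Return (Q1, Q2, Q3) for a sample with at least three values."""
    ranked = sorted(data)
    n = len(ranked)
    q1 = ranked_value(ranked, (n + 1) / 4)
    q2 = ranked_value(ranked, (n + 1) / 2)
    q3 = ranked_value(ranked, 3 * (n + 1) / 4)
    return q1, q2, q3

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical sample, n = 9
q1, q2, q3 = quartiles(data)                  # ranks 2.5, 5 and 7.5
iqr = q3 - q1                                 # interquartile range
```

Because the interquartile range ignores the lowest and highest quarters of the data, it is not affected by extreme values.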
3.5 Five Number Summary
Five number summary: the five numbers that help describe the center, spread and shape of data.
$X_{\text{smallest}}$, $Q_1$, $Q_2$ / Median, $Q_3$, $X_{\text{largest}}$
3.5.1 Relationships Among the Five Number Summary and Distribution Shape
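Left-Skewed | Symmetric | Right-Skewed |
---|---|---|
$\text{Median} - X_{\text{smallest}} > X_{\text{largest}} - \text{Median}$ | $\text{Median} - X_{\text{smallest}} \approx X_{\text{largest}} - \text{Median}$ | $\text{Median} - X_{\text{smallest}} < X_{\text{largest}} - \text{Median}$ |
$Q_1 - X_{\text{smallest}} > X_{\text{largest}} - Q_3$ | $Q_1 - X_{\text{smallest}} \approx X_{\text{largest}} - Q_3$ | $Q_1 - X_{\text{smallest}} < X_{\text{largest}} - Q_3$ |
$\text{Median} - Q_1 > Q_3 - \text{Median}$ | $\text{Median} - Q_1 \approx Q_3 - \text{Median}$ | $\text{Median} - Q_1 < Q_3 - \text{Median}$ |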
3.5.2 Boxplot
Boxplot: a graphical display of the five number summary.
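The five numbers a boxplot draws can be collected as follows. This sketch relies on `statistics.quantiles`, whose default "exclusive" method places the quartiles at ranks $(n+1)/4$, $(n+1)/2$, and $3(n+1)/4$. The data are hypothetical, and the skew check compares only the two tails around the median, which is a simplification.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 20]          # hypothetical sample, n = 7

# Quartiles at ranks (n+1)/4, (n+1)/2 and 3(n+1)/4 (default "exclusive" method)
q1, median, q3 = statistics.quantiles(data, n=4)
five_numbers = (min(data), q1, median, q3, max(data))

# Compare the two tails to guess the direction of skew.
left_tail = median - min(data)
right_tail = max(data) - median
if left_tail > right_tail:
    shape = "left-skewed"
elif left_tail < right_tail:
    shape = "right-skewed"
else:
    shape = "symmetric"
```

Here the single large value stretches the right tail, so the sample is flagged as right-skewed.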
3.6 Numerical Descriptive Measures for Populations
3.6.1 Population Mean
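$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
$\mu$ -> population mean
$N$ -> population size
$X_i$ -> the $i$th value in the population
3.6.2 Population Variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
3.6.3 Population Standard Deviation
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$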
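A short sketch of the population measures, assuming the eight hypothetical values below are the entire population. The last line treats the same values as a sample to show the effect of dividing by $n - 1$ instead of $N$.

```python
import statistics

population = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population, N = 8

N = len(population)
mu = sum(population) / N                                  # population mean
sigma_sq = sum((x - mu) ** 2 for x in population) / N     # divide by N
sigma = sigma_sq ** 0.5                                   # population standard deviation

# Treating the same values as a sample divides by n - 1 instead.
s_sq = statistics.variance(population)
```

The sample variance is always the larger of the two for the same values, since its divisor is smaller.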
3.7 Empirical Rule
Empirical rule: approximates the variation of data in a symmetric mound-shaped distribution.
- approximately 68% of the data values lie within one standard deviation of the mean
- approximately 95% of the data values lie within two standard deviations of the mean
- approximately 99.7% of the data values lie within three standard deviations of the mean
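These percentages match the normal distribution, the standard example of a symmetric mound-shaped distribution. As a quick check, the sketch below uses `statistics.NormalDist` to compute the share of a normal distribution within $k$ standard deviations of the mean; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The computed shares are about 68.3%, 95.4%, and 99.7%, which the rule rounds to 68%, 95%, and 99.7%.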
3.8 Chebyshev's Rule
Chebyshev's rule: applies to any data set, regardless of the shape of the distribution.
- at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the data values lie within $k$ standard deviations of the mean, where $k$ is any value greater than 1
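To see that the bound holds even for a badly skewed data set, the sketch below computes the bound $1 - 1/k^2$ and compares it with the observed share. The data are hypothetical, and the standard deviation is computed in its population form (dividing by the number of values).

```python
def chebyshev_bound(k):
    """Minimum share of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 50]   # hypothetical, heavily skewed
n = len(data)
mean = sum(data) / n
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5   # population form

k = 2
within = sum(abs(x - mean) <= k * std for x in data) / n   # observed share
```

For $k = 2$ the rule guarantees at least 75%, and for $k = 3$ at least about 88.9%. These are lower bounds, so the observed share is usually higher.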
3.9 Covariance
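Covariance: a measure of the direction of the linear association between two numerical variables.
$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
$\text{cov}(X, Y) > 0$ -> positive covariance; as $X$ increases, $Y$ tends to increase
$\text{cov}(X, Y) < 0$ -> negative covariance; as $X$ increases, $Y$ tends to decrease
$\text{cov}(X, Y) = 0$ -> no linear relationship between $X$ and $Y$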
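A minimal sketch of the sample covariance, computed as the sum of the products of paired deviations from the two means, divided by $n - 1$. The paired observations and the function name are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sum of cross-deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]                    # hypothetical paired observations
y_up = [2, 4, 5, 4, 6]                 # tends to rise with x
y_down = [6, 4, 5, 4, 2]               # tends to fall as x rises

cov_up = sample_covariance(x, y_up)        # positive
cov_down = sample_covariance(x, y_down)    # negative
```

The sign is informative, but the magnitude depends on the units of both variables, so covariance alone does not show how strong the relationship is.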
3.10 Correlation Coefficient
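Correlation coefficient: a measure of the relative strength and direction of the linear association between two numerical variables.
$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$, where $S_X$ and $S_Y$ are the sample standard deviations of $X$ and $Y$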
3.10.1 Features of the Correlation Coefficient
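The population coefficient of correlation, $\rho$, and the sample coefficient of correlation, $r$, share the following features:
- unit free
- range between -1 and +1
- the closer to -1, the stronger the negative linear relationship
- the closer to +1, the stronger the positive linear relationship
- the closer to 0, the weaker the linear relationship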
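The sample coefficient can be sketched as the sample covariance divided by the product of the two sample standard deviations. The height and weight figures below are hypothetical; the same heights are also converted to inches to show that changing units does not change the coefficient.

```python
import math

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # The (n - 1) divisors cancel, so they are left out.
    return sxy / math.sqrt(sxx * syy)

heights_cm = [150, 160, 170, 180, 190]        # hypothetical data
weights_kg = [52, 58, 69, 75, 88]
heights_in = [h / 2.54 for h in heights_cm]   # same heights, different unit

r_cm = sample_correlation(heights_cm, weights_kg)
r_in = sample_correlation(heights_in, weights_kg)   # matches r_cm up to rounding
```

Python 3.10 and later also provide `statistics.correlation`, which computes the same quantity.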
3.11 Pitfalls in Numerical Descriptive Measures
- Data analysis is objective
- Data interpretation is subjective
Chapter 4: Basic Probability
Sample space: the set of all possible outcomes of an experiment.
4.1 Events
Simple event: an event described by a single characteristic or an event that is a set of outcomes of an experiment.
Joint event: an event described by two or more characteristics.
Complement of an event A: $A'$
- all events that are not part of event A
4.2 Probability
Probability: the numerical value representing the chance, likelihood, or possibility that a certain event will occur.
- always between 0 and 1, inclusive
Impossible event: an event that has no chance of occurring (probability 0).
Certain event: an event that is sure to occur (probability 1).
Mutually exclusive events: events that cannot occur at the same time.
Collectively exhaustive events: the set of events that covers the entire sample space.
- one of the events must occur
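These definitions map naturally onto Python sets. The sketch below assumes a hypothetical experiment, one roll of a fair six-sided die, in which every outcome is equally likely.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (hypothetical experiment)
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event):
    """Equally likely outcomes: favourable outcomes over all outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
odd = sample_space - even              # complement of "even"
impossible = set()                     # no outcome satisfies the event

mutually_exclusive = even.isdisjoint(odd)                # cannot occur together
collectively_exhaustive = (even | odd) == sample_space   # one of them must occur
```

An event and its complement are always both mutually exclusive and collectively exhaustive, so their probabilities add up to 1.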
4.2.1 Three Approaches to Assigning Probability
A priori probability: a probability assignment based upon prior knowledge of the process involved.
- Example: randomly selecting a day from the year 2019. What is the probability that the day is in January? Since 2019 has 365 days and January has 31, $P(\text{January}) = \frac{31}{365}$
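The January example can be worked out from the calendar alone, with no observations needed. A minimal sketch using the standard `calendar` and `fractions` modules, assuming every day of the year is equally likely to be selected:

```python
import calendar
from fractions import Fraction

year = 2019
days_in_january = calendar.monthrange(year, 1)[1]      # number of days in month 1
days_in_year = 366 if calendar.isleap(year) else 365

# A priori probability: favourable outcomes over all equally likely outcomes
p_january = Fraction(days_in_january, days_in_year)
print(p_january, float(p_january))
```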
Empirical probability: a probability assignment based upon observations obtained from probability experiments.
- Example: