Statistics notes, kept only as a personal study record; correctness is not guaranteed.

-----

Chapter 1: Defining and Collecting Data

1.1 Variables

Categorical (qualitative): a variable whose values can be placed into a specific category according to some characteristic or attribute.

  • Nominal: no natural ordering of the categories
  • Ordinal: natural ordering of the categories

Numerical (quantitative): a variable whose values are measured numerically.

  • Discrete: values arise from a counting process
  • Continuous: values arise from a measuring process

1.1.1 Measurement Scales

Interval scale: an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.

Ratio scale: an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point.

1.2 Population and Sample

Population: the set of all elements of interest in a particular study.

  • contains all measurements of interest to the investigator

Sample: a subset of the population.

  • a part of the population selected for analysis

1.2.1 Parameter and Statistic

Population parameter: a numerical measure that describes an aspect of a population.

Sample statistic: a numerical measure that describes an aspect of a sample.

1.2.2 Sources of Data

Primary sources: data that are generated by the investigator conducting the study.

  • from a political survey
  • collected from an experiment
  • observed data

Secondary sources: data that were produced by someone other than the investigator conducting the study.

  • analyzing census data
  • examining data from print journals or data published on the Internet

1.2.3 Probability Sample

In a probability sample, items in the sample are chosen on the basis of known probabilities.

  • Simple random sample: every individual or item from the frame has an equal chance of being selected
  • Systematic sample: the items are selected according to a specified time or item interval in the sampling frame
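    • divide the frame of $N$ individuals into $n$ groups of $k$ individuals each, where $k = \frac{N}{n}$; randomly select one individual from the first group, then select every $k$th individual after it
  • Stratified sample: divide the population into two or more subgroups (strata) according to some characteristic that is important to the study, then select a simple random sample from each stratum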
  • Cluster sample: population is divided into several "clusters" or sections, then some of those clusters are randomly selected and all members of the selected clusters are used as the sample
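
The four sampling schemes above can be sketched in Python with the standard `random` module. This is only an illustrative sketch: the frame of 40 item IDs, the two strata, the four clusters, and all function names are made up for the example, and the systematic sample assumes that $N$ is a multiple of $n$.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every item in the frame has the same chance of being selected."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Random start inside the first group of k = N // n items, then every k-th item."""
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every member of each one."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
frame = list(range(1, 41))        # hypothetical frame of N = 40 item IDs

srs = simple_random_sample(frame, 8, rng)
sys_sample = systematic_sample(frame, 8, rng)   # k = 40 // 8 = 5

strata = {"low": frame[:20], "high": frame[20:]}            # hypothetical strata
strat = stratified_sample(strata, 4, rng)

clusters = {f"c{i}": frame[i * 10:(i + 1) * 10] for i in range(4)}  # hypothetical clusters
clus = cluster_sample(clusters, 2, rng)
```

Note the difference between the last two: the stratified sample takes a few items from every stratum, while the cluster sample takes every item from a few clusters.
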
-----

Chapter 2: Organizing and Visualizing Variables

2.1 Organizing Categorical Data

  • Summary table: tallies the frequencies or percentages of items in a set of categories so that you can see differences between categories
  • Contingency table: a table that classifies sample observations according to two or more identifiable categories so that the relationship between the categories can be studied

2.2 Organizing Numerical Data

  • Ordered array: a sequence of data, in rank order, from the smallest value to the largest value.
  • Frequency distribution: a summary table in which the data are arranged into numerically ordered classes
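
As a small illustration of these two ways of organizing numerical data, the sketch below builds an ordered array and a frequency distribution in Python. The data values, the class width of 10, and the helper name `frequency_distribution` are assumptions made for the example; the helper also assumes every value is at least `start`.

```python
from collections import Counter

def frequency_distribution(values, class_width, start):
    """Count how many values fall in each class [lower, lower + class_width)."""
    counts = Counter((v - start) // class_width for v in values)
    n_classes = max(counts) + 1
    return [
        (start + i * class_width, start + (i + 1) * class_width, counts.get(i, 0))
        for i in range(n_classes)
    ]

data = [23, 35, 12, 47, 31, 28, 19, 44, 38, 26]   # made-up values

ordered = sorted(data)   # ordered array: smallest to largest
table = frequency_distribution(data, class_width=10, start=10)

for lower, upper, count in table:
    print(f"{lower} but less than {upper}: {count}")
```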

2.3 Visualizing Categorical Data

  • Bar chart: visualizes a categorical variable as a series of bars
  • Pie chart: a circle broken up into slices that represent categories
  • Doughnut chart: the outer part of a circle broken up into pieces that represent categories
  • Pareto chart: a vertical bar chart, where categories are shown in descending order of frequency
  • Side-by-side bar chart: a bar chart that compares two or more categories
  • Doughnut chart (contingency): a doughnut chart that compares two or more categories

2.4 Visualizing Numerical Data

  • Stem-and-leaf display: a simple way to see how the data are distributed and where concentrations of data exist
  • Histogram: a vertical bar chart of the data in a frequency distribution
  • Percentage polygon: formed by letting the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages

2.4.1 Visualizing Two Numerical Variables

  • Scatter plot: used for numerical data consisting of paired observations taken from two numerical variables
  • Time series plot: used to study patterns in the values of a numeric variable over time
-----

Chapter 3: Numerical Descriptive Measures

Central tendency: the extent or inclination to which the values of a numerical variable group or cluster around a typical or central value.

3.1 Measures of Central Tendency

Measure of central tendency: a single value that attempts to describe a set of data by identifying the central position within that set of data.

3.1.1 Mean

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

  • $\bar{X}$ -> sample mean; pronounced "X-bar"
  • $n$ -> sample size
  • $X_i$ -> the $i$th value in the sample
  • $X_1, X_2, \ldots, X_n$ -> the observed values

3.1.2 Median

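  1. Sample size $n$ is odd: the median is the middle ranked value.

$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ ranked value}$$

  2. Sample size $n$ is even: the median is the average of the two middle ranked values.

$$\text{Median} = \text{average of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2} + 1\right)^{\text{th}} \text{ ranked values}$$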

3.1.3 Mode

Mode: the value that occurs most frequently in a data set.
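
The three measures of central tendency can be checked on a small made-up sample with Python's standard `statistics` module. The data values below are invented for illustration; `manual_median` applies the odd-sample position rule directly.

```python
import statistics

data = [3, 7, 7, 2, 9, 7, 4]        # made-up sample, n = 7 (odd)

mean = statistics.mean(data)        # (3 + 7 + 7 + 2 + 9 + 7 + 4) / 7 = 39 / 7
median = statistics.median(data)    # ordered: 2, 3, 4, 7, 7, 7, 9 -> 4th value
mode = statistics.mode(data)        # 7 occurs three times

# Position rule for an odd sample size: the (n + 1) / 2-th ranked value (1-based).
ordered = sorted(data)
n = len(ordered)
manual_median = ordered[(n + 1) // 2 - 1]

# Even sample size: average of the two middle ranked values.
even = [1, 3, 5, 8]
median_even = statistics.median(even)   # (3 + 5) / 2
```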

3.2 Measures of Variation

Measure of variation: gives information on the spread or variability or dispersion of the data values.

3.2.1 Range

$$\text{Range} = \text{Maximum} - \text{Minimum}$$

3.2.2 Sample Variance

Sample variance: the sum of the squared deviations of the values from the sample mean, divided by $n - 1$ (approximately the average squared deviation).

$$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$$

  • $\bar{X}$ -> arithmetic mean of the sample
  • $n$ -> sample size
  • $X_i$ -> the $i$th value in the sample

3.2.3 Sample Standard Deviation

$$S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$

3.2.4 Coefficient of Variation

$$CV = \frac{S}{\bar{X}} \times 100\%$$

3.2.5 Z-Score

Z-Score: the number of standard deviations that a given value $X$ is above or below the mean.

  • a data value is considered an extreme outlier if its Z-Score is less than -3 or greater than +3

$$Z = \frac{X - \bar{X}}{S}$$
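
The measures of variation in this section can be computed step by step from their formulas. The following is a minimal sketch on a made-up sample; the variable names are chosen for the example, and `statistics.variance` is used only as a cross-check of the $n - 1$ formula.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up sample
n = len(data)
mean = sum(data) / n                # 40 / 8 = 5.0

data_range = max(data) - min(data)                        # 9 - 2 = 7
variance = sum((x - mean) ** 2 for x in data) / (n - 1)   # 32 / 7
std = math.sqrt(variance)
cv = std / mean * 100                                     # coefficient of variation, in percent
z_scores = [(x - mean) / std for x in data]

# Flag extreme outliers using the |Z| > 3 convention from the notes.
extreme = [x for x, z in zip(data, z_scores) if abs(z) > 3]

library_variance = statistics.variance(data)   # also divides by n - 1
```

Because each Z-score is a deviation from the mean divided by the same $S$, the Z-scores of a sample always sum to zero (up to rounding).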

3.3 Shape of a Distribution

3.3.1 Skewness

Skewness: a measure of the degree of asymmetry of a distribution.

3.3.2 Kurtosis

Kurtosis: a measure of the degree of peakedness of a distribution.

3.4 Quartile Measures

Quartile: a value that divides a data set into four groups containing (as far as possible) an equal number of observations.

  • $Q_1$ -> first quartile
  • $Q_2$ -> second quartile; the median
  • $Q_3$ -> third quartile

3.4.1 Locating Quartiles

  1. $Q_1$ = the value at ranked position $\frac{n+1}{4}$
  2. $Q_2$ = the value at ranked position $\frac{n+1}{2}$
  3. $Q_3$ = the value at ranked position $\frac{3(n+1)}{4}$
  • $n$ -> the number of observed values

3.4.2 Interquartile Range

Interquartile range: the difference between the third and first quartiles.

$$IQR = Q_3 - Q_1$$
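
A short Python sketch of locating the quartiles by ranked position and computing the interquartile range follows. The notes do not say what to do when a ranked position is fractional, so this sketch assumes linear interpolation between the two neighbouring ranked values; other textbooks round instead. The data and the helper name `ranked_value` are made up, and the helper assumes $n \ge 3$ so that every position falls inside the data.

```python
import statistics

def ranked_value(ordered, position):
    """Return the value at a 1-based ranked position.

    Fractional positions are resolved by linear interpolation between the
    two neighbouring ranked values (an assumption of this sketch).
    """
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [16, 11, 22, 13, 17, 12, 21, 16, 18]   # made-up sample, n = 9
ordered = sorted(data)
n = len(ordered)

q1 = ranked_value(ordered, (n + 1) / 4)       # position 2.5
q2 = ranked_value(ordered, (n + 1) / 2)       # position 5
q3 = ranked_value(ordered, 3 * (n + 1) / 4)   # position 7.5
iqr = q3 - q1

# The default ("exclusive") method of statistics.quantiles uses the same
# (n + 1)-based positions with linear interpolation.
library_quartiles = statistics.quantiles(data, n=4)
```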

3.5 Five Number Summary

Five number summary: the five numbers that help describe the center, spread and shape of data.

  • minimum
  • $Q_1$
  • $Q_2$ / Median
  • $Q_3$
  • maximum

3.5.1 Relationships Among the Five Number Summary and Distribution Shape

| Left-Skewed | Symmetric | Right-Skewed |
| --- | --- | --- |
| $\text{Median} - \text{minimum} > \text{maximum} - \text{Median}$ | $\text{Median} - \text{minimum} \approx \text{maximum} - \text{Median}$ | $\text{Median} - \text{minimum} < \text{maximum} - \text{Median}$ |
| $Q_1 - \text{minimum} > \text{maximum} - Q_3$ | $Q_1 - \text{minimum} \approx \text{maximum} - Q_3$ | $Q_1 - \text{minimum} < \text{maximum} - Q_3$ |
| $\text{Median} - Q_1 > Q_3 - \text{Median}$ | $\text{Median} - Q_1 \approx Q_3 - \text{Median}$ | $\text{Median} - Q_1 < Q_3 - \text{Median}$ |
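
To see how the five number summary hints at shape, the sketch below computes it for a made-up sample with a long right tail and counts how many of the three comparisons point to right skew. The data and names are invented for illustration, and the quartiles come from `statistics.quantiles` with its default method, which is one of several quartile conventions.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4)   # default "exclusive" method
    return ordered[0], q1, q2, q3, ordered[-1]

data = [1, 2, 2, 3, 3, 4, 5, 8, 15]   # made-up sample with a long right tail
minimum, q1, median, q3, maximum = five_number_summary(data)

# Each pair is (left-hand distance, right-hand distance) from the comparison table.
comparisons = [
    (median - minimum, maximum - median),
    (q1 - minimum, maximum - q3),
    (median - q1, q3 - median),
]
right_skew_votes = sum(left < right for left, right in comparisons)
```

Here all three left-hand distances are smaller than the right-hand ones, which matches the right-skewed column.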

3.5.2 Boxplot

Boxplot: a graphical display of the five number summary.

3.6 Numerical Descriptive Measures for Populations

3.6.1 Population Mean

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

  • $\mu$ -> population mean
  • $N$ -> population size
  • $X_i$ -> the $i$th value in the population

3.6.2 Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$$

3.6.3 Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}}$$
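
The population formulas divide by $N$, whereas the sample formulas in 3.2 divide by $n - 1$. A minimal sketch of the difference, treating a made-up list of eight values first as a whole population and then as a sample:

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
N = len(values)

# Treating the values as the entire population: divide by N.
mu = sum(values) / N                                   # 5.0
sigma_sq = sum((x - mu) ** 2 for x in values) / N      # 32 / 8 = 4.0
sigma = math.sqrt(sigma_sq)                            # 2.0

# Treating the same values as a sample: divide by n - 1.
sample_variance = statistics.variance(values)          # 32 / 7

# Standard-library equivalents of the population formulas.
library_sigma_sq = statistics.pvariance(values)
library_sigma = statistics.pstdev(values)
```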

3.7 Empirical Rule

Empirical rule: approximates the variation of data in a symmetric mound-shaped distribution.

  • 68% of the data values lie within one standard deviation of the mean
  • 95% of the data values lie within two standard deviations of the mean
  • 99.7% of the data values lie within three standard deviations of the mean
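
The three percentages are the rounded probabilities of a normal distribution, which is the usual model for a symmetric mound-shaped distribution. Assuming that model, they can be reproduced with `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```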

3.8 Chebyshev's Rule

Chebyshev's rule: applies to any data set, regardless of the shape of the distribution.

  • at least $\left(1-\frac{1}{k^2}\right) \times 100\%$ of the data values lie within $k$ standard deviations of the mean, where $k$ is any value greater than 1
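
A quick numerical check of the bound for a few values of $k$, written as a small Python sketch. The function name `chebyshev_lower_bound` is made up for this example.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4)}

for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.2%}")
```

For $k = 2$ the bound is 75%, noticeably weaker than the 95% of the empirical rule, because Chebyshev's rule makes no assumption about the shape of the distribution.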

3.9 Covariance

Covariance: a measure of the linear association between two variables.

$$Cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

  • $Cov(X,Y) > 0$ -> positive covariance; as $X$ increases, $Y$ increases
  • $Cov(X,Y) < 0$ -> negative covariance; as $X$ increases, $Y$ decreases
  • $Cov(X,Y) = 0$ -> no linear relationship between $X$ and $Y$
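
The sample covariance formula translates directly into a few lines of Python. The paired data below are made up to show one positive and one negative case; `sample_covariance` is a name chosen for this sketch.

```python
def sample_covariance(xs, ys):
    """Sample covariance of paired observations, dividing by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]      # tends to rise with x
y_down = [5, 4, 3, 2, 1]    # falls as x rises

cov_up = sample_covariance(x, y_up)       # 6 / 4 = 1.5
cov_down = sample_covariance(x, y_down)   # -10 / 4 = -2.5
```

The sign gives the direction of the relationship, but the size depends on the units of $X$ and $Y$, so the magnitude alone does not say how strong the relationship is.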

3.10 Correlation Coefficient

Correlation coefficient: a unit-free measure of the relative strength of the linear association between two numerical variables, obtained by dividing the covariance by the product of the two standard deviations.

$$r = \frac{Cov(X,Y)}{S_X S_Y}$$

  • $Cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
  • $S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
  • $S_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$
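
Putting the three pieces together, a minimal Python sketch of the sample correlation coefficient looks like this. The data sets and the function name `sample_correlation` are made up for illustration.

```python
import math

def sample_correlation(xs, ys):
    """r = Cov(X, Y) / (S_X * S_Y), with every term divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

x = [1, 2, 3, 4, 5]

r_perfect = sample_correlation(x, [2 * v + 1 for v in x])   # points on a rising line
r_negative = sample_correlation(x, [5, 4, 3, 2, 1])         # points on a falling line
r_partial = sample_correlation(x, [2, 4, 5, 4, 5])          # rising, but with scatter
```

Points lying exactly on a rising or falling straight line give $r$ equal to $+1$ or $-1$ (up to floating-point rounding); scatter around the line pulls $r$ toward 0.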

3.10.1 Features of the Correlation Coefficient

The population coefficient of correlation, $\rho$, measures the linear association between two variables over the whole population.

The sample coefficient of correlation, $r$, measures the linear association between two variables in a sample.

3.11 Pitfalls in Numerical Descriptive Measures

  • Data analysis is objective
  • Data interpretation is subjective
-----

Chapter 4: Basic Probability

Sample space: the set of all possible outcomes of an experiment.

4.1 Events

Simple event: an event described by a single characteristic or an event that is a set of outcomes of an experiment.

Joint event: an event described by two or more characteristics.

Complement of an event A:

  • all events that are not part of event A

4.2 Probability

Probability: the numerical value representing the chance, likelihood, or possibility that a certain event will occur.

  • always between 0 and 1, inclusive

Impossible event: an event that has no chance of occurring.

Certain event: an event that is sure to occur.

Mutually exclusive events: events that cannot occur at the same time.

Collectively exhaustive events: the set of events that covers the entire sample space.

  • one of the events must occur

4.2.1 Three Approaches to Assigning Probability

A priori probability: a probability assignment based upon prior knowledge of the process involved.

  • $P(A) = \frac{\text{number of outcomes in } A}{\text{total number of outcomes}}$
  • Example: a day is randomly selected from the year 2019. What is the probability that the day is in January?
    • $P(\text{January}) = \frac{31}{365}$
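
The January example can be reproduced by counting outcomes with Python's `calendar` module and keeping the result as an exact fraction. This is only a sketch of the counting argument; the variable names are chosen for the example.

```python
import calendar
from fractions import Fraction

year = 2019

# Outcomes in the event: the days of January.
days_in_january = calendar.monthrange(year, 1)[1]

# All equally likely outcomes: every day of the year.
days_in_year = 366 if calendar.isleap(year) else 365

p_january = Fraction(days_in_january, days_in_year)
print(p_january, float(p_january))
```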

Empirical probability: a probability assignment based upon observations obtained from probability experiments.

  • $P(A) = \frac{\text{number of times } A \text{ occurs}}{\text{number of times the experiment is repeated}}$
  • Example: $P(\text{male taking stats}) = \frac{\text{number of males taking stats}}{\text{total number of people}}$
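
An empirical probability is just a relative frequency computed from observed data. The sketch below uses a small set of invented survey records (not real data) to mirror the example above.

```python
from fractions import Fraction

# Hypothetical survey records: (gender, takes_stats). Not real data.
records = [
    ("male", True), ("female", True), ("male", False), ("female", False),
    ("male", True), ("female", True), ("male", False), ("male", True),
]

males_taking_stats = sum(
    1 for gender, takes_stats in records if gender == "male" and takes_stats
)
total_people = len(records)

p_male_taking_stats = Fraction(males_taking_stats, total_people)
print(p_male_taking_stats)
```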