Course Note
Essay by Song Ye • February 26, 2017 • Course Note • 1,346 Words (6 Pages) • 892 Views
Basic Definitions
- Experimental unit – an object (e.g., person, thing, transaction, or event) upon which we collect data
- Population – the set of units that we are interested in studying
- Sample – a subset of the units of a population
- Variable – a characteristic or property of an individual experimental unit
- Statistical inference – making an estimate, prediction, or some other generalization about a population based on information contained in a sample
Elements of Descriptive Statistics
- (1) Data set (population or sample), (2) variable of interest, (3) graphs or numerical measures, (4) conclusions about the data pattern
Elements of Statistical Inference
- (1) Population, (2) variable of interest, (3) sample, (4) inference, (5) measure of reliability
Types of data
- 1) Quantitative data – can be described numerically
- Age, height, weight, size of family, income, GDP, CPI, stock market average, monthly sales revenue
- Quantitative data may be cross-sectional or time series
- 2) Qualitative data – not inherently numerical
- Also called categorical data or attributes data
- Color of eyes, accurate / not, left or right handed, yes/no variables, employment status, defect/no defect, occupation code…
Collecting Data & Sampling Techniques
- Published source
- Designed experiment
- Observational study (Survey)
- Random sample selected from the target population of interest
- Selection bias – a subset of the experimental units in the population is excluded from the sample (those units have no chance of being selected)
- Nonresponse bias – researchers are unable to obtain data on all experimental units selected for the sample. (Solution – sample the non-respondents to determine their characteristics.)
Sampling Designs
- 1) Stratified random sample
- Split up the population into strata
- Randomly sample from each stratum
- Can give a more representative sample, but this depends on how the researcher defines the strata
- 2) Systematic random sampling
- Take every kth item
- Useful for processes
- Watch out for systematic sampling biases (cycles)
- Example: security lines at airports (TSA)
- 3) Cluster and multi-stage clustering
- Cluster the population into subpopulations
- Randomly select clusters to get to the elements of the population
- 4) Convenience samples – sample elements are selected that are convenient to the researchers
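The first three designs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of units numbered 1 to 100, the even/odd strata, the blocks of 20 used as clusters, and all function names are hypothetical and do not come from the course material.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Split the units into strata, then draw a random sample from each stratum."""
    strata = {}
    for u in units:
        strata.setdefault(stratum_of(u), []).append(u)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(units, k, rng):
    """Take every k-th unit, starting from a random offset in [0, k)."""
    start = rng.randrange(k)
    return units[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for name in chosen for u in clusters[name]]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical units numbered 1..100

# Hypothetical strata: even-numbered and odd-numbered units
by_parity = stratified_sample(
    population, lambda u: "even" if u % 2 == 0 else "odd", 5, rng)

# Every 10th unit
every_10th = systematic_sample(population, 10, rng)

# Hypothetical clusters: five consecutive blocks of 20 units
clusters = {f"block{i}": population[i * 20:(i + 1) * 20] for i in range(5)}
two_blocks = cluster_sample(clusters, 2, rng)

print(by_parity)
print(every_10th)
print(len(two_blocks))
```

Note that the systematic sample picks up any cycle whose period matches k, which is the bias the notes warn about.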
Types of Sample Design Errors
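- Sampling error – the difference between the estimator (sample statistic) and the true population parameter.
- Arises because a sample, rather than the whole population, is observed
- It is the error caused by the sampling method itself. When a sample is drawn at random from a population, which sample is drawn is a matter of chance; the deviation between the sample statistic $\bar{x}$ computed from that sample and the population parameter $\mu$ is called the actual sampling error. When the population is large, so many different samples are possible that the actual sampling errors cannot all be listed, and the average sampling error is used instead to describe their typical size.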
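- Nonsampling (measurement) error – all other errors that cause a difference between an estimator and a population parameter.
- Nonsampling error is the sum of all errors other than sampling error. It can arise at every stage of a market survey, and a mistake at any stage may increase it and distort the data. "Controlling error" in everyday usage mainly means controlling nonsampling error.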
- Poor sampling design
- Interviewer errors / interviewer biases
- False information provided by respondent
- Poorly worded or loaded questions
- Data errors
- Undercoverage
- Nonresponse
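Sampling error, as defined above, can be illustrated with a small simulation. The sketch below draws repeated random samples from a made-up population and records the actual sampling error (sample mean minus population mean) for each one. The population (10,000 values generated around 50), the sample sizes, and the number of repeats are all assumptions chosen for illustration, and the mean absolute error is used here as one simple way to summarize the typical size of the errors.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values; mu is its true mean.
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.fmean(population)

def sampling_errors(n, repeats):
    """Actual sampling error (x-bar minus mu) for `repeats` random samples of size n."""
    return [statistics.fmean(rng.sample(population, n)) - mu
            for _ in range(repeats)]

avg_error = {}
for n in (10, 100, 1000):
    errors = sampling_errors(n, 200)
    # Average size of the actual sampling errors across the 200 samples
    avg_error[n] = statistics.fmean([abs(e) for e in errors])
    print(f"n = {n:4d}   average |x-bar - mu| = {avg_error[n]:.3f}")
```

The simulation covers sampling error only. Nonsampling errors such as loaded questions or undercoverage do not shrink as the sample grows, so they cannot be reproduced this way.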
Descriptive Statistics
- Four things to know about any distribution
- 1) Measures of Location (Central Tendency)
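- Midrange
$\text{Midrange} = \dfrac{x_{\max} + x_{\min}}{2}$
- Mode – the value that occurs most often
- Median – the middle value: 50% of the observations lie above it and 50% lie below it
- Mean
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
- Trimmed mean: the mean of the middle x% of the data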
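The measures of location listed above can be compared on a small sample. This is a minimal sketch: the data set is made up (it contains one extreme value on purpose), and `midrange` and `trimmed_mean` are hypothetical helper names, since Python's `statistics` module provides only the mode, median, and mean directly.

```python
import statistics

def midrange(data):
    """Average of the largest and smallest observations."""
    return (max(data) + min(data)) / 2

def trimmed_mean(data, keep_percent):
    """Mean of the middle keep_percent of the sorted data (80 keeps the middle 80%)."""
    ordered = sorted(data)
    cut = len(ordered) * (100 - keep_percent) // 200  # values dropped from each end
    return statistics.fmean(ordered[cut:len(ordered) - cut])

data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]  # made-up sample with one extreme value

print("midrange    ", midrange(data))
print("mode        ", statistics.mode(data))
print("median      ", statistics.median(data))
print("mean        ", statistics.mean(data))
print("trimmed mean", trimmed_mean(data, 80))
```

The extreme value 53 pulls the midrange and the mean upward, while the median and the trimmed mean stay close to the bulk of the data.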
- 2) Measures of Variability (Dispersion)
- Range = max - min
- Interquartile range (IQR) = QU - QL
- Upper (3rd) quartile - lower (1st) quartile
- 75th percentile - 25th percentile
- ‘Upper management’ - range of the middle 50% of the distribution
- Position of QU in the ordered data = 3/4 (n + 1), rounded to the nearest integer
- Position of QL in the ordered data = 1/4 (n + 1), rounded to the nearest integer
- Box plot
- IQR is the box, median is the line in the box
- Hinge points are at the edges of the box
- QU and QL
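- Inner fences: QL - 1.5 (IQR) and QU + 1.5 (IQR)
- Values between the inner and outer fences are suspect outliers
- Outer fences: QL - 3 (IQR) and QU + 3 (IQR)
- Values beyond the outer fences are highly suspect outliers
- Whiskers – lines from the box to the most extreme observations that lie inside the inner fences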
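The quartile positions and fences above translate directly into code. The sketch below follows the (n + 1) position rule from the notes; the rounding of exact .5 ties (rounded up here), the function names, and the sample data are assumptions made for illustration. Statistical packages often interpolate between observations instead, so their quartiles may differ slightly.

```python
import math

def quartile_positions(n):
    """1-based positions of QL and QU in the ordered data, using the (n + 1) rule."""
    # Exact .5 ties are rounded up here; that tie-breaking rule is an assumption.
    lower = math.floor((n + 1) / 4 + 0.5)
    upper = math.floor(3 * (n + 1) / 4 + 0.5)
    return max(1, lower), min(n, upper)

def box_plot_summary(data):
    """Quartiles, IQR, fences, and outlier candidates for a box plot."""
    ordered = sorted(data)
    lo_pos, hi_pos = quartile_positions(len(ordered))
    q_l, q_u = ordered[lo_pos - 1], ordered[hi_pos - 1]
    iqr = q_u - q_l
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
    outer = (q_l - 3 * iqr, q_u + 3 * iqr)
    return {
        "QL": q_l,
        "QU": q_u,
        "IQR": iqr,
        "inner": inner,
        "outer": outer,
        # Between the inner and outer fences: suspect outliers
        "suspect": [x for x in ordered
                    if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]],
        # Beyond the outer fences: highly suspect outliers
        "highly_suspect": [x for x in ordered if x < outer[0] or x > outer[1]],
    }

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 21, 30]  # made-up sample, n = 11
summary = box_plot_summary(data)
for key, value in summary.items():
    print(key, value)
```

With n = 11 the positions are 3 and 9, so QL = 4, QU = 10, and IQR = 6. The value 21 falls between the inner fence (19) and the outer fence (28), and 30 lies beyond the outer fence.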
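- Variance: population / sample
- Population variance
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- Sample variance
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Computational equation for sample variance
$s^2 = \dfrac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
- Measures how far, on average, the values lie from the mean (as an average squared distance)
- Standard deviation
- Population: σ Sample: s
$\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$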
- Coefficient of variation – compares the relative dispersion of two data sets
$CV = \dfrac{s}{\bar{x}} \times 100\%$
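A short numerical check shows that the definitional and computational forms of the sample variance agree, and how the standard deviation and coefficient of variation follow from them. The five data values are made up for illustration, and the coefficient of variation is expressed as a percentage here, which is a common convention rather than something stated in the notes.

```python
import math
import statistics

data = [4, 7, 8, 10, 11]  # made-up sample
n = len(data)
mean = sum(data) / n

# Definitional form: sum of squared deviations divided by n - 1
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational form: avoids computing each deviation separately
s2_computational = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# Population variance divides by the number of values instead of n - 1
pop_variance = sum((x - mean) ** 2 for x in data) / n

s = math.sqrt(s2_definition)  # sample standard deviation
cv = s / mean * 100           # coefficient of variation, in percent

print("sample variance (definition)   ", s2_definition)
print("sample variance (computational)", s2_computational)
print("population variance            ", pop_variance)
print("sample standard deviation      ", round(s, 4))
print("coefficient of variation (%)   ", round(cv, 2))

# Cross-check against the standard library
print(statistics.variance(data), statistics.pvariance(data))
```

Because the coefficient of variation divides by the mean, it has no units, which is what makes it usable for comparing the spread of data sets measured on different scales.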
- 3) Shape
- Symmetry / skewness / mathematical form
- Mound-shaped distributions: use the Empirical Rule
- z-score (standard score):
$z = \dfrac{x - \bar{x}}{s}$ (sample), $z = \dfrac{x - \mu}{\sigma}$ (population)
- Approximately 68% of the data lie within 1 standard deviation of the mean
- Approximately 95% of the data lie within 2 standard deviations of the mean
- Approximately 99.7% (essentially all) of the data lie within 3 standard deviations of the mean
- |z| > 2: possible outlier; |z| > 3: outlier.
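- For a distribution of any shape: Chebyshev’s inequality, where k is the number of standard deviations
At least $1 - \dfrac{1}{k^2}$ of the data lie within k standard deviations of the mean, for any k > 1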
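The z-score rule of thumb and Chebyshev's bound can be checked on a sample that is clearly not mound-shaped. In the sketch below the skewed data set and the function names are made up for illustration, and the sample standard deviation s is used in the z-scores.

```python
import statistics

def z_scores(data):
    """Sample z-scores: (x - x_bar) / s for every observation."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]

def fraction_within(data, k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(z) <= k for z in z_scores(data)) / len(data)

def chebyshev_bound(k):
    """Chebyshev's lower bound 1 - 1/k^2 (informative only for k > 1)."""
    return 1 - 1 / k ** 2

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]  # made-up, strongly skewed sample

possible_outliers = [x for x, z in zip(data, z_scores(data)) if abs(z) > 2]
print("possible outliers (|z| > 2):", possible_outliers)

for k in (2, 3):
    print(f"k = {k}: observed {fraction_within(data, k):.2f}, "
          f"Chebyshev guarantees at least {chebyshev_bound(k):.2f}")
```

The Empirical Rule percentages are not expected to hold for data this skewed, but the Chebyshev bound still does, which is the point of having a rule that makes no assumption about shape.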
- 4) Data patterns (for time series data)
- Time series
Graphical Techniques
- Bar chart
- Vertical axis: frequency or relative frequency
- Horizontal axis: variable of interest (Xi)
- Pareto diagram - Bar chart with bars ordered by frequency (highest to lowest)
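The ordering step behind a Pareto diagram can be sketched without a plotting library. The example below counts categories, sorts them from most to least frequent, and prints a text bar for each. The complaint categories and the function name are hypothetical and serve only as an illustration.

```python
from collections import Counter

def pareto_table(observations):
    """Rows of (category, frequency, relative frequency), most frequent first."""
    counts = Counter(observations)
    total = sum(counts.values())
    return [(category, freq, freq / total)
            for category, freq in counts.most_common()]

# Hypothetical qualitative data: reasons for customer complaints
complaints = ["damaged"] * 3 + ["late delivery"] * 5 + ["wrong item"] * 2

rows = pareto_table(complaints)
for category, freq, rel in rows:
    # One '#' per observation gives a simple horizontal bar
    print(f"{category:15s} {'#' * freq:6s} {freq:2d}  {rel:.0%}")
```

Dropping the sort (iterating over `counts.items()` instead of `most_common()`) gives the data for an ordinary bar chart, since the ordering is the only difference between the two.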
Random Variables and Probability Distributions
- Types of random variables: discrete / continuous
- Discrete – a random variable that can take on only a countable number of values
- Examples: number of defects per product, occupation code, type of failure, reason for customer return, type of customer complaint
- Continuous – a random variable that can take on any value within an interval (measurement)
- Examples: wait time at a fast-food window, strength of a laptop case, response time of a computer system
...
...