We're now into the second half of the data analysis course series, and things are starting to look the part.
This one covers the fundamentals of statistics. Plenty of technical terms come up, and looking up the one-to-one Japanese translation for each of them would be ~~a pain~~ too time-consuming, so I'll just jot down the key points in English as they are.
Module 1: Introduction to statistics
Statistics is the study of the collection, analysis and interpretation of data.
Data professionals use statistics to predict
- Future sales revenue
- The success of a new ad campaign
- The rate of return on a financial investment
- The number of downloads for a new app
Econometrics: a branch of economics that uses statistics to analyze economic problems.
A/B testing is a way to compare two versions of something to find out which version performs better.
Statistical significance refers to the claim that the results of a test or experiment are not explainable by chance alone.
Descriptive statistics describe or summarize the main features of a dataset.
- Visuals, like graphs and tables
- Summary stats
Summary statistics let you summarize your data using a single number.
Inferential statistics allow data professionals to make inferences about a dataset based on a sample of the data.
The population includes every possible element that you are interested in measuring. As we’ve discussed, a sample is a subset of a population.
A representative sample is a sample that accurately reflects the population.
A parameter is a characteristic of a population. A statistic is a characteristic of a sample.
Measures of central tendency:
- Mean
- Median
- Mode
When to use mean and median:
- If outliers, use median
- If no outliers, use mean
Measures of dispersion:
- Range
- Standard deviation
The formula for sample standard deviation:
\(s = \sqrt{\displaystyle \frac{\sum (x-\bar{x})^2}{n-1}}\)
Measures of position help you determine the position of a value in relation to other values in a dataset.
Measures of position:
- Percentiles
- Quartiles
- Interquartile range
- Five number summary
A percentile is the value below which a given percentage of the data falls.
A quartile divides the values in a dataset into four equal parts.
The interquartile range (IQR) is the distance between the first quartile (Q1) and the third quartile (Q3).
You can summarize the major divisions in your dataset with the five number summary. The five numbers include: the minimum, the first quartile, the median or second quartile, the third quartile, and the maximum.
For a boxplot, the horizontal lines extending from each side of the box are known as whiskers.
whisker: any of the long, stiff hairs growing on the face of a cat, mouse, or other mammal (Cambridge Dictionary)
So the "whiskers" of a box-and-whisker plot take their name from the word for the whiskers of a cat or mouse.
Module 2: Probability
Probability is the branch of mathematics that deals with measuring and quantifying uncertainty. In other words, probability uses math to describe the likelihood of something happening.
Types of probability:
- Objective probability
  - Classical probability
  - Empirical probability
- Subjective probability
Objective probability is based on statistics, experiments, and mathematical measurements.
Subjective probability is based on personal feelings, experience, or judgment.
Classical probability is based on formal reasoning about events with equally likely outcomes. To calculate classical probability for an event, you divide the number of desired outcomes by the total number of possible outcomes.
Empirical probability is based on experimental or historical data; it represents the likelihood of an event occurring based on the previous results of an experiment or past events. To calculate empirical probability, you divide the number of times a specific event occurs by the total number of events.
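To make the two objective types concrete, here is a minimal Python sketch; the die-roll scenario and the 10,000 simulated trials are my own illustration, not from the course:

```python
# Classical vs. empirical probability, using a fair six-sided die.
import random

# Classical: desired outcomes / total possible outcomes.
p_classical = 1 / 6  # P(rolling a 6)

# Empirical: times the event occurred / total number of trials.
random.seed(42)  # reproducible simulated "history"
rolls = [random.randint(1, 6) for _ in range(10_000)]
p_empirical = rolls.count(6) / len(rolls)

print(p_classical, p_empirical)  # the two estimates should be close
```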
A random experiment is a process whose outcome cannot be predicted with certainty.
Probability notation:
- P(A) = probability of event A
- P(B) = probability of event B
- P(A’) = probability of not event A
In stats, the complement of an event is the event not occurring.
Complement rule: P(A’) = 1 – P(A)
Two events are mutually exclusive if they cannot occur at the same time.
Addition rule (for mutually exclusive events):
P(A or B) = P(A) + P(B)
Two events are independent if the occurrence of one event does not change the probability of the other event.
Multiplication rule (for independent events):
P(A and B) = P(A) * P(B)
Conditional probability refers to the probability of an event occurring given that another event has already occurred.
Two events are dependent if the occurrence of one event changes the probability of the other event.
General multiplication rule (for dependent events; see the sketch after this list): P(A and B) = P(A) * P(B|A)
- P(A and B): probability of event A and event B
- P(A): probability of event A
- P(B|A): probability of event B given event A
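A quick sketch of these rules with a standard 52-card deck; the deck example is mine, not the course's:

```python
# Probability rules on a standard 52-card deck.
p_heart = 13 / 52                       # P(A): drawing a heart
p_spade = 13 / 52                       # P(B): drawing a spade

# Complement rule: P(A') = 1 - P(A)
p_not_heart = 1 - p_heart               # 0.75

# Addition rule (hearts and spades are mutually exclusive):
p_heart_or_spade = p_heart + p_spade    # 0.5

# Multiplication rule (independent events: two draws WITH replacement):
p_two_hearts_indep = p_heart * p_heart  # 0.0625

# Dependent events (two draws WITHOUT replacement):
# P(A and B) = P(A) * P(B|A), where P(B|A) = 12 hearts left / 51 cards left
p_two_hearts_dep = p_heart * (12 / 51)

print(p_not_heart, p_heart_or_spade, p_two_hearts_indep, p_two_hearts_dep)
```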
Bayes’ theorem, also known as Bayes’ rule, is a math formula for determining conditional probability. It’s named after Thomas Bayes, an 18th century mathematician from London, England.
\(P(A \vert B) = \displaystyle \frac{P(B \vert A) * P(A)}{P(B)} \)
In Bayesian statistics, prior probability refers to the probability of an event before new data is collected. Posterior probability is the updated probability of an event based on new data.
Bayes’ theorem is the foundation for the field of Bayesian statistics, also known as Bayesian inference, which is a powerful method for analyzing and interpreting data in modern data analytics.
In the theorem, prior probability is the probability of event A. Posterior probability, or what you’re trying to calculate, is the probability of event A given event B.
- P(A): Prior probability
- P(A|B): Posterior probability
Sometimes, statisticians and data professionals use the term “likelihood” to refer to the probability of event B given event A, and the term “evidence” to refer to the probability of event B.
- P(B|A): Likelihood
- P(B): Evidence
Bayes’ theorem (expanded version)
\(P(A\vert B) = \displaystyle \frac{P(B\vert A)*P(A)}{P(B\vert A)*P(A)+P(B\vert \text{not } A)*P(\text{not } A)}\)
This longer version of Bayes' theorem is often used to evaluate tests such as medical diagnostic tests, quality control tests, or software tests such as spam filters.
A false positive is a test result that indicates something is present when it really is not.
A false negative is a test result that indicates something is not present when it really is.
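As a worked sketch of the expanded theorem, here is a hypothetical diagnostic test; every number below is made up for illustration:

```python
# Expanded Bayes' theorem for a hypothetical diagnostic test.
p_disease = 0.01               # P(A): prior probability of having the disease
p_pos_given_disease = 0.95     # P(B|A): likelihood (true positive rate)
p_pos_given_healthy = 0.05     # P(B|not A): false positive rate

# Denominator: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(A|B), probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.161: most positives are false positives
```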
A probability distribution describes the likelihood of the possible outcomes of a random event.
A random variable represents the values for the possible outcomes of a random event.
- Discrete random variable: has a countable number of possible values.
- Continuous random variable: takes all the possible values in some range of numbers.
Discrete vs continuous variables:
- COUNT the number of outcomes -> discrete
- MEASURE the outcome -> continuous
The term sample space describes the set of all possible values for a random variable.
The binomial distribution is a discrete distribution that models the probability of events with only two possible outcomes, success or failure.
Binomial distribution formula:
\(P(X=k)=\displaystyle \frac {n!}{k!(n-k)!}p^{k} (1-p)^{n-k}\)
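A minimal check of this formula, by hand and with scipy.stats; the coin-toss numbers are my own example:

```python
# Binomial probability: by hand and via scipy.stats.
from math import comb
from scipy import stats

n, p, k = 10, 0.5, 7          # e.g. 7 heads in 10 fair coin tosses

p_manual = comb(n, k) * p**k * (1 - p)**(n - k)
p_scipy = stats.binom.pmf(k, n, p)
print(p_manual, p_scipy)      # both ≈ 0.117
```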
The Poisson distribution is a probability distribution that models the probability that a certain number of events will occur during a specific time period.
\( \begin{eqnarray}
P(X=k)=\displaystyle \frac{\lambda ^{k} e^{-\lambda}}{k!}
\end{eqnarray} \)
- λ: the mean number of events that occur during a specific time period
- k: the number of events
- e: a constant equal to approximately 2.71828
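A quick sketch of the Poisson formula; the "five customers per hour" scenario is a made-up example:

```python
# Poisson probability: a store averaging 5 customers per hour (lambda = 5),
# asking for the probability of exactly 3 customers in an hour.
from math import exp, factorial
from scipy import stats

lam, k = 5, 3
p_manual = lam**k * exp(-lam) / factorial(k)
p_scipy = stats.poisson.pmf(k, lam)
print(p_manual, p_scipy)      # both ≈ 0.140
```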
| Distribution | Given | Want to find |
| --- | --- | --- |
| Poisson | The average number of events happening in a specific time period | The probability of a certain number of events happening in that time period |
| Binomial | An exact probability of an event happening | The probability of the event happening a certain number of times in a repeated trial |
The Bernoulli distribution is similar to the binomial distribution as it also models events that have only two possible outcomes (success or failure). The only difference is that the Bernoulli distribution refers to only a single trial of an experiment, while the binomial refers to repeated trials. A classic example of a Bernoulli trial is a single coin toss.
The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean and bell-shaped.
The empirical rule:
- 68% of values fall within 1 standard deviation of the mean
- 95% of values fall within 2 standard deviations of the mean
- 99.7% of values fall within 3 standard deviations of the mean
Two types of probability functions:
- Probability Mass Functions (PMFs) represent discrete random variables
- Probability Density Functions (PDFs) represent continuous random variables
A z-score is a measure of how many standard deviations below or above the population mean a data point is.
In statistics, standardization is the process of putting different variables on the same scale.
A standard normal distribution is just a normal distribution with a mean of 0 and a standard deviation of 1.
\(z = \displaystyle \frac{x- \mu}{\sigma}\)
SciPy is open-source software you can use for solving mathematical, scientific, engineering, and technical problems. It allows you to manipulate and visualize data with a wide range of Python commands. SciPy stats is a module designed specifically for statistics.
Statsmodels is a Python package that lets you explore data, work with statistical models, and perform statistical tests. It includes an extensive list of stats functions for different types of data.
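As a small demo of scipy.stats tying together z-scores, standardization, and the empirical rule; the simulated scores are my own example:

```python
# z-scores and the empirical rule with scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=1000)   # simulated scores

# z-score for a single value against known population parameters:
z = (130 - 100) / 15                              # z = 2.0
print(z)

# Standardize the whole sample (mean 0, sd 1):
standardized = stats.zscore(data)

# Empirical rule check on the standard normal distribution:
for k in (1, 2, 3):
    within = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(within, 4))   # ≈ 0.6827, 0.9545, 0.9973
```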
Module 3: Sampling
Sampling is the process of selecting a subset of data from a population.
Inferential statistics use sample data to draw conclusions or make predictions about a larger population.
Five steps for the sampling process:
- Identify the target population
- Select the sampling frame
- Choose the sampling method
- Determine the sampling size
- Collect the sample data
The target population is the complete set of elements that you’re interested in knowing more about.
A sampling frame is a list of all the items in your target population.
The difference between a target population and a sampling frame is that the population is general and the frame is specific.
There are two main types of sampling methods: probability sampling and non-probability sampling.
- Probability sampling uses random selection to generate a sample.
- Non-probability sampling is often based on convenience or the personal preferences of the researcher rather than random selection.
Sample size refers to the number of individuals or items chosen for a study or experiment.
Four different probability sampling methods:
- Simple random sampling — every member of a population is selected randomly and has an equal chance of being chosen.
- Stratified random sampling — you divide a population into groups and randomly select some members from each group to be in the sample.
- Cluster random sampling — you divide a population into clusters, randomly select certain clusters, and include all members from the chosen clusters in the sample.
- Systematic random sampling — you put every member of a population into an ordered sequence. Then you choose a random starting point in the sequence and select members for your sample at regular intervals.
Non-probability sampling methods
- Convenience sampling — you choose members of a population that are easy to contact or reach.
- Voluntary sampling — consists of members of a population who volunteer to participate in a study.
- Snowball sampling — researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study.
- Purposive sampling — researchers select participants based on the purpose of their study.
Undercoverage bias occurs when some members of a population are inadequately represented in the sample.
Nonresponse bias occurs when certain groups of people are less likely to provide responses.
A point estimate uses a single value to estimate a population parameter.
The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. In other words, as your sample increases, your sampling distribution assumes the shape of a bell curve.
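A simulation sketch of the central limit theorem, using an exponential (clearly non-normal) population of my own choosing:

```python
# CLT via simulation: sample means from a non-normal population
# still form a bell curve as sample size grows.
import numpy as np

rng = np.random.default_rng(42)          # random seed for reproducibility
population = rng.exponential(scale=2.0, size=100_000)

sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(2_000)]

print(np.mean(sample_means))             # ≈ population mean (2.0)
print(np.std(sample_means))              # ≈ sigma / sqrt(n)
```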
A population proportion refers to the percentage of individuals or elements in a population that share a certain characteristic.
Standard error of the sample proportion:
\( SE(\hat{p}) = \sqrt{ \displaystyle \frac{\hat{p}(1-\hat{p})}{n}}\)
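Plugging made-up numbers into this formula:

```python
# Standard error of a sample proportion.
import math

p_hat = 0.62    # e.g. 62% of sampled users clicked (invented sample estimate)
n = 500         # sample size

se = math.sqrt(p_hat * (1 - p_hat) / n)
print(round(se, 4))   # ≈ 0.0217
```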
When a population element can be selected only one time, you are sampling without replacement.
A random seed is a starting point for generating random numbers.
Module 4: Confidence intervals
A confidence interval is a range of values that describes the uncertainty surrounding an estimate.
Frequentist vs. Bayesian:
- Confidence intervals (Frequentist)
- Credible intervals (Bayesian)
Confidence intervals give data professionals a way to express the uncertainty caused by randomness and provide a more reliable estimate.
The margin of error represents the maximum expected difference between a population parameter and a sample estimate.
The confidence level describes the likelihood that a particular sampling method will produce a confidence interval that includes the population parameter.
What you can say is that if you take repeated random samples from the population, and construct a confidence interval for each sample using the same method, you can expect 95% of your intervals to capture the population mean.
Misinterpretation 1: 95% refers to the probability that the population mean falls within the constructed interval.
Misinterpretation 2: 95% refers to the percentage of data values that fall within the interval.
Misinterpretation 3: 95% refers to the percentage of sample means that fall within the interval.
Steps for constructing a confidence interval:
- Identify a sample statistic
- Choose a confidence level
- Find the margin of error
- Calculate the interval
z-scores for large sample sizes, t-scores for small sample sizes.
For small sample sizes, you need to use a different distribution, called the t-distribution. Statistically speaking, this is because there is more uncertainty involved in estimating the standard error for small sample sizes.
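A sketch of the four construction steps with scipy.stats, using the t-distribution since the sample is small and the population standard deviation is unknown; the sample values are invented:

```python
# 95% confidence interval for the mean with the t-distribution.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
n = len(sample)
mean = sample.mean()        # step 1: sample statistic
sem = stats.sem(sample)     # standard error of the mean

# steps 2-4: 95% confidence level, margin of error, interval
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(mean, (low, high))
```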
Module 5: Introduction to hypothesis testing
Hypothesis testing is a statistical procedure that uses sample data to evaluate an assumption about a population parameter.
Steps for performing a hypothesis test:
- State the null hypothesis and the alternative hypothesis
- Choose a significance level
- Find the p-value
- Reject or fail to reject the null hypothesis
The null hypothesis is a statement that is assumed to be true unless there is convincing evidence to the contrary. The null hypothesis typically assumes that there is no effect in the population, and that your observed data occurs by chance.
The alternative hypothesis is a statement that contradicts the null hypothesis, and is accepted as true only if there is convincing evidence for it. The alternative hypothesis typically assumes that there is an effect in the population, and that your observed data does not occur by chance.
Note: The null and alternative hypotheses are always claims about the population. That’s because the aim of hypothesis testing is to make inferences about a population based on a sample.
Characteristics of the null hypothesis:
- In statistics, the null hypothesis is often abbreviated as H sub zero (\(H_{0}\)).
- When written in mathematical terms, the null hypothesis always includes an equality symbol (usually =, but sometimes ≤ or ≥).
- Null hypotheses often include phrases such as “no effect,” “no difference,” “no relationship,” or “no change.”
Characteristics of the alternative hypothesis:
- In statistics, the alternative hypothesis is often abbreviated as H sub a (\(H_{a}\)).
- When written in mathematical terms, the alternative hypothesis always includes an inequality symbol (usually ≠, but sometimes < or >).
- Alternative hypotheses often include phrases such as “an effect,” “a difference,” “a relationship,” or “a change.”
The significance level is the threshold at which a result is considered statistically significant. It is also the probability of rejecting the null hypothesis when it is true.
P-value refers to the probability of observing results as or more extreme than those observed when the null hypothesis is true.
Drawing conclusions:
- If p-value < significance level: reject the null hypothesis
- If p-value > significance level: fail to reject the null hypothesis
Types of errors in hypothesis testing
- Type I error
- Type II error
Type I error (false positive): The rejection of a null hypothesis that is actually true.
Type II error (false negative): The failure to reject a null hypothesis that is actually false.
The probability of making a Type I error is called alpha (α). Your significance level, or alpha (α), represents the probability of making a Type I error.
The probability of making a Type II error is called beta (β), and beta is related to the power of a hypothesis test (power = 1- β). Power refers to the likelihood that a test can correctly detect a real effect when there is one.
A Type I error means rejecting a null hypothesis which is actually true. In general, making a Type I error often leads to implementing changes that are unnecessary and ineffective, and which waste valuable time and resources.
A Type II error means failing to reject a null hypothesis which is actually false. In general, making a Type II error may result in missed opportunities for positive change and innovation. A lack of innovation can be costly for people and organizations.
A one-sample test determines whether or not a population parameter, like a mean or proportion, is equal to a specific value.
A two-sample test determines whether or not two population parameters, such as two means or two proportions, are equal to each other.
One-sample z-test assumptions:
- The data is a random sample from a normally distributed population
- The population standard deviation is known
A hypothesis test produces a test statistic that indicates how closely your data match the null hypothesis. For a z-test, your test statistic is a z-score; for a t-test, it's a t-score.
It also produces a corresponding p-value that tells you the probability of obtaining a result at least as extreme as the observed result if the null hypothesis is true.
A z-score is a measure of how many standard deviations below or above the population mean a data point is.
\(z= \displaystyle \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\)
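Plugging made-up numbers into the one-sample z-test formula (null mean 100, known σ = 15):

```python
# One-sample z-test by hand; all numbers are invented for illustration.
import math
from scipy import stats

mu, sigma = 100, 15          # null-hypothesis mean, known population sd
x_bar, n = 104, 50           # sample mean and sample size

z = (x_bar - mu) / (sigma / math.sqrt(n))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-tailed p-value

print(round(z, 3), round(p_value, 4))        # z ≈ 1.886, p ≈ 0.0593
# p > 0.05, so fail to reject the null hypothesis at the 5% level.
```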
In data analytics, two-sample tests are frequently used for A/B testing.
Two-sample t-test for means assumptions:
- The two samples are independent of each other
- For each sample, the data is drawn randomly from a normally distributed population
- The population standard deviation is unknown
T-score:
\( t = \displaystyle \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} }} \)
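A sketch of this two-sample t-test with scipy.stats; `ttest_ind` with `equal_var=False` matches the unpooled (Welch) formula above. The session-length data is simulated, not from the course:

```python
# Two-sample t-test (Welch's) for an A/B test on simulated session lengths.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.2, size=40)   # control
group_b = rng.normal(loc=5.6, scale=1.1, size=40)   # treatment

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(round(t_stat, 3), round(p_value, 4))
# If p_value < 0.05, reject the null hypothesis of equal means.
```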
T-tests DO NOT apply to proportions.
Z-score for comparing the two-sample proportions:
\( z = \displaystyle \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{ \hat{p}_0 \left( 1 - \hat{p}_0 \right) \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }} \)
\(\hat{p}_0\): The pooled proportion, a weighted average of the proportions from your two samples: \(\hat{p}_0 = \frac{x_1 + x_2}{n_1 + n_2}\), where \(x_1\) and \(x_2\) are the numbers of successes in each sample.
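A sketch of the two-proportion z-test with statsmodels, whose `proportions_ztest` computes the pooled proportion internally; the click counts are invented:

```python
# Two-sample z-test for proportions, e.g. comparing click-through rates.
from statsmodels.stats.proportion import proportions_ztest

clicks = [120, 150]     # successes in each version (A, B)
visits = [1000, 1000]   # sample sizes n1, n2

z_stat, p_value = proportions_ztest(count=clicks, nobs=visits)
print(round(z_stat, 3), round(p_value, 4))
# If p_value < 0.05, the difference in proportions is statistically significant.
```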
A/B testing is a way to compare two versions of something to find out which version performs better.
Three main features of a typical A/B test:
- Test design
- Sampling
- Hypothesis testing
An A/B test is a basic version of what’s known as a randomized controlled experiment. In a randomized controlled experiment, test subjects are randomly assigned to a control group and a treatment group. The treatment is the new change being tested in the experiment. The control group is not exposed to the treatment. The treatment group is exposed to the treatment. The difference in metric values between the two groups measures the treatment’s effect on the test subjects.
Experimental design refers to planning an experiment in order to collect data to answer your research question.
Module 6: Course 4 end-of-course project
The final module, as usual, is a role-play where you pick one of three scenarios and carry out a data analysis.
You choose from taxi data analysis, short-video data analysis, or map-app data analysis.
Basically, it's safest to pick the same scenario as in the other courses of the Google Advanced Data Analytics Specialization, for consistency.
This time, as the course name "The Power of Statistics" suggests, the big themes are statistics-driven A/B testing, statistical hypothesis testing, descriptive statistics, and the like.
Summary
That's it for this one. To be honest, this has gradually become too much of a hassle, and these posts have turned into English-only memos for myself. If I find the time, I'd like to add Japanese translations too. (I probably won't.)
Having worked through the whole course, I don't think the content is particularly difficult. It's statistics, but fairly basic statistics, so with some prior knowledge you should get through it smoothly.
Personally, I'd say that if you can comfortably pass Grade 2 of the QC Kentei (the Japanese quality control certification exam), you can consider yourself to already understand pretty much all of the statistics content in this course.
Since statistics is a specialized field in its own right, there is quite a bit of terminology to memorize. In particular, watch out for English words used with meanings different from their everyday ones.
Next up is regression analysis: the course called Regression Analysis: Simplify Complex Data Relationships.
See you next time!