Regression Analysis: Simplify Complex Data Relationships – Coursera


Continuing on, let's look at regression analysis. Like the statistics course before it, this is fairly technical material, so these notes may end up being mostly in English.

I'll add my own comments at the key points.

Module 1: Introduction to complex data relationships

Regression analysis or regression models are a group of statistical techniques that use existing data to estimate the relationship between a single dependent variable and one or more independent variables.

Model assumptions are statements about the data that must be true to justify the use of particular data techniques.

A line is a collection of an infinite number of points extending in two opposite directions.

Linear regression is a technique that estimates the linear relationship between a continuous dependent variable y and one or more independent variables x.

Module 1 is the introduction, so the important concepts are only covered briefly.

Module 2: Simple linear regression

The dependent variable is the variable a given model estimates. Sometimes the dependent variable is also called a response or outcome variable and is commonly represented with the letter y.

We assume that the dependent variable tends to vary based on the values of independent variables, typically represented by an x. Independent variables are also referred to as explanatory variables or predictor variables.

The slope refers to the amount we expect y, the dependent variable, to increase or decrease per one unit increase of x, the independent variable.

The intercept is the value of y, the dependent variable, when x, the independent variable, equals 0.

Positive correlation is a relationship between two variables that tend to increase or decrease together.

Negative correlation, on the other hand, is an inverse relationship between two variables. When one variable increases, the other variable tends to decrease.

It is important to note that correlation is not causation. For example, in your cake shop, people buying coffee does not cause cake sales to increase. When modeling variable relationships, a data scientist must be mindful of the extent of their claims.

Causation describes a cause and effect relationship where one variable directly causes the other to change in a particular way.

Linear regression equation (for population parameters):

\(\mu\{Y|X\} = \beta_{0} + \beta_{1}X \)

\(\beta_{0}\): Intercept

\(\beta_{1}\): Slope

In statistics, we write the intercept as \(\beta_{0}\), which we sometimes call "beta naught", and the slope as \(\beta_{1}\). The mean \(\mu\{Y|X\}\) and the betas are sometimes called parameters.

So the digit zero can also be referred to as "naught", not just "zero". In other words, "beta naught" simply means "beta zero".

naught: the number 0 or zero (Cambridge Dictionary)

Linear regression equation (for estimates of parameters):

\(\hat{\mu}\{Y|X\} = \hat{\beta}_{0} + \hat{\beta}_{1}X \)

Regression coefficients: the estimated betas in a regression model. Represented as \(\hat{\beta}_{i}\)

Ordinary Least Squares Estimation (OLS): Common way to calculate linear regression coefficients \(\hat{\beta}_{i}\)

Loss function: A function that measures the distance between the observed values and the model’s estimated values

Logistic regression is a technique that models a categorical variable based on one or more independent variables.

Logistic regression model:

\( \mu\{Y|X\} = Prob(Y=1|X) = p\)

\( g(p)=\beta_{\tiny 0} + \beta_{\tiny 1}X \)

g(p): link function

A link function is a non-linear function that connects or links the dependent variable to the independent variables mathematically.

Simple linear regression is a technique that estimates the linear relationship between one independent variable, X, and one continuous dependent variable, Y.

Best fit line: The line that fits the data best by minimizing a loss function or error.

The predicted values are the estimated y values for each x calculated by a model.

Residual: The difference between observed or actual values and the predicted values of the regression line.

Residual = Observed − Predicted
\( \varepsilon_{i} = y_{i} - \hat{y}_{i} \)

The sum of the residuals is always equal to zero for OLS estimators.

Sum of Squared Residuals (SSR): The sum of the squared differences between each observed value and its associated predicted value

\( \displaystyle \sum_{i=1}^{n}(\text{Observed} - \text{Predicted})^{2} \)

\( \displaystyle \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2} \)

Ordinary least squares, also known as OLS, is a method that minimizes the sum of squared residuals to estimate parameters in a linear regression model.

For simple linear regression, one way to write the formulas is as follows:

  • \( \displaystyle \hat{\beta}_{1} =
    \frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}
    \)
  • \( \displaystyle
    \hat{\beta}_{0} = \bar{Y}-\hat{\beta}_{1}\bar{X}
    \)
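As a sanity check of these two formulas, here is a minimal NumPy sketch (the x and y arrays are made-up example data) that computes \(\hat{\beta}_{1}\) and \(\hat{\beta}_{0}\) by hand and cross-checks them against NumPy's own least-squares fit:

```python
import numpy as np

# Made-up sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = y_bar - beta1_hat * x_bar
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)

# Cross-check with a degree-1 least-squares polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)
```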

Linear regression assumptions:

  • Linearity
  • Normality
  • Independent observations
  • Homoscedasticity

The quantile-quantile plot (Q-Q plot) is a graphical tool used to compare two probability distributions by plotting their quantiles against each other.

A scatterplot matrix is a series of scatterplots that show the relationship between pairs of variables.

A confidence band is the area surrounding the line that describes the uncertainty around the predicted outcomes at every value of X.

Common evaluation metrics:

  • R2
  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)

R2 (the coefficient of determination) measures the proportion of variation in the dependent variable Y that is explained by the independent variable(s) X.

\(\displaystyle
R^{2}=1-\frac{\text{Sum of squared residuals}}{\text{Total sum of squares}}
\)


At most, R-squared can equal 1, which would mean that X explains 100% of the variance in Y. If R-squared equals 0, then X explains 0% of the variance in Y.

A hold-out sample is a random sample of observed data that is not used to fit the model.

MSE (mean squared error) is the average of the squared difference between the predicted and actual values.

MAE (mean absolute error) is the average of the absolute difference between the predicted and actual values.
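As a minimal sketch, these three metrics could be computed with scikit-learn like this (the observed and predicted values below are made up):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed values and model predictions
y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.8, 5.4, 7.0, 9.3]

print("R^2:", r2_score(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
```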

I had only dabbled in statistics, so the Q-Q plot was new to me.

In short, it is a tool for checking how closely the actual distribution of the data matches a theoretical distribution (the normal distribution, in the context of this course).

This article explains it very clearly. Many thanks:
https://qiita.com/kenmatsu4/items/59605dc745707e8701e0
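To make that concrete, here is a minimal sketch of a normal Q-Q plot with SciPy and matplotlib (the "residuals" here are just random example data, not output from a real model):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Random example data standing in for a model's residuals
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Plot sample quantiles against theoretical normal quantiles;
# points falling close to the diagonal suggest approximate normality
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()
```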

Module 3: Multiple linear regression

While simple linear regression only allows one independent variable X, multiple linear regression allows us to have many independent variables that are associated with changes in the one continuous dependent variable Y.

Multiple linear regression, also known as multiple regression, is a technique that estimates the linear relationship between one continuous dependent variable and two or more independent variables.

Full multiple regression equation:
\(
y = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+ \cdots +\beta_{n}X_{n}
\)

One hot encoding is a data transformation technique that turns one categorical variable into several binary variables.
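For example, a minimal pandas sketch of one hot encoding (the column name and categories are hypothetical):

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"ad_channel": ["TV", "radio", "social", "TV"]})

# One binary column is created per category level;
# drop_first=True would drop one level to avoid perfect multicollinearity
encoded = pd.get_dummies(df, columns=["ad_channel"], drop_first=False)
print(encoded)
```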

No multicollinearity assumption: The no multicollinearity assumption states that no two independent variables, Xi and Xj, can be highly correlated with each other.

The variance inflation factor (VIF) quantifies how correlated each independent variable is with all of the other independent variables.
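A minimal sketch of computing VIF with statsmodels (the DataFrame and its columns are made up for illustration); a common rule of thumb is that a VIF above roughly 5 to 10 signals problematic multicollinearity:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical independent variables
X = pd.DataFrame({
    "radio":  [10, 20, 30, 40, 50, 60],
    "social": [12, 18, 33, 41, 52, 59],
    "tv":     [5, 3, 8, 2, 7, 4],
})

# Add an intercept column so each VIF is computed against a model with a constant
X_const = add_constant(X)

vif = pd.DataFrame({
    "variable": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif)  # the "const" row can be ignored
```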

Multiple linear regression assumptions:

  1. (Multivariate) normality: The errors are normally distributed.*
  2. Independent observations: Each observation in the dataset is independent.
  3. Homoscedasticity: The variation of the errors is constant or similar across the model.*
  4. No multicollinearity: No two independent variables (Xi and Xj) can be highly correlated with each other.

* Note on errors and residuals

An interaction term represents how the relationship between two independent variables is associated with changes in the mean of the dependent variable.
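A minimal statsmodels formula sketch of adding an interaction term (the DataFrame and the column names sales, radio, and tv are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up example data
df = pd.DataFrame({
    "sales": [10.0, 14.0, 19.0, 24.0, 31.0, 35.0, 40.0, 47.0],
    "radio": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "tv":    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
})

# "radio:tv" adds only the interaction term; writing "radio * tv" instead
# would expand to radio + tv + radio:tv (main effects plus interaction)
model = smf.ols("sales ~ radio + tv + radio:tv", data=df).fit()
print(model.params)
```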

Overfitting is when a model fits the observed or training data too specifically, making it unable to generate suitable estimates for the general population.

Adjusted R-squared is a variation of the R-squared evaluation metric that penalizes unnecessary independent variables.

An underfitting model has a low R-squared value.

Training data is used to build the model, and test data is used to evaluate the model’s performance after it has been built. Splitting the sample data in this way is also called holdout sampling, with the holdout sample being the test data.

The holdout sample might also be called the validation data. Regardless, the general idea remains the same: this is the data that is used to evaluate the model.

Overfitting causes a model to perform well on training data, but its performance is considerably worse when evaluated using the unseen test data. That’s why data scientists compare model performance on training data versus test data to identify overfitting.
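A minimal scikit-learn sketch of that comparison (the data is randomly generated for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Made-up data: one informative predictor plus two noise predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Hold out 25% of the data as the test (holdout) sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting
print("Train R^2:", r2_score(y_train, model.predict(X_train)))
print("Test R^2: ", r2_score(y_test, model.predict(X_test)))
```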

An overfitting model fits the observed or training data too specifically, making the model unable to generate suitable estimates for the general population. This multiple regression model has captured the signal (i.e. the relationship between the predictors and the outcome variable) and the noise (i.e. the randomness in the dataset that is not part of that relationship). You cannot use an overfitting model to draw conclusions for the population because this model only applies to the data used to build it.

A model that underfits the sample data is described as having a high bias whereas a model that does not perform well on new data is described as having high variance. In data science, there is a phenomenon known as the bias versus variance tradeoff.

Variable selection, also known as feature selection, is the process of determining which variables or features to include in a given model.

Forward selection is a stepwise variable selection process. It begins with the null model with zero independent variables and considers all possible variables to add. It incorporates the independent variable that contributes the most explanatory power to the model based on the chosen metric and threshold.

Backwards elimination is a stepwise variable selection process that begins with the full model with all possible independent variables and removes the independent variable that adds the least explanatory power to the model based on the chosen metric and threshold.

The extra sum of squares F-test quantifies how much of the variance left unexplained by a reduced model is explained by the full model.

The bias variance tradeoff balances two model qualities, bias and variance, to minimize overall error for unobserved data.

Bias: Simplifies the model's predictions by making assumptions about the variable relationships. A highly biased model may oversimplify the relationship, underfitting the observed data and generating inaccurate estimates.

Variance: Refers to model flexibility and complexity, which lets the model learn from existing data. A model with high variance can overfit the observed data and generate inaccurate estimates for unseen data.

Note this variance is not to be confused with the variance of a distribution.

Regularization: A set of regression techniques that shrinks regression coefficient estimates toward zero, adding in bias, to reduce variance.

Regularized regression:

  • Lasso regression
  • Ridge regression
  • Elastic-net regression

Module 4: Advanced hypothesis testing

The chi-squared (χ2) tests will help us determine if two categorical variables are associated with one another, and whether a categorical variable follows an expected distribution.

The χ2 goodness of fit test determines whether an observed categorical variable follows an expected distribution.

\( \displaystyle
\chi^{2}=\sum{
\frac{(\text{Observed} - \text{Expected})^{2}}{\text{Expected}}
}
\)

Note that the χ2 goodness of fit test does not produce reliable results when there are any expected values of less than five.

When the model is fully specified (i.e., you know all the possible categorical levels), then:

degrees of freedom = number of categorical levels – 1.

This is because the count of each level is free to fluctuate, but once you know the counts of all the other levels, the last level cannot vary: summed with the others, it must equal the overall total.
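As a minimal SciPy sketch (the observed counts are made up, and the expected distribution is assumed to be uniform):

```python
from scipy.stats import chisquare

# Made-up observed counts for four categories
observed = [50, 30, 40, 60]

# Expected counts under a uniform distribution; they must sum to the same total
total = sum(observed)
expected = [total / 4] * 4

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)  # degrees of freedom = 4 levels - 1 = 3
```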

The χ2 test for independence determines whether or not two categorical variables are associated with each other.

\( \displaystyle
\text{expected value}=\frac{\text{row total} \times \text{column total}}{\text{overall total}}
\)
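A minimal SciPy sketch of the χ2 test for independence on a made-up 2x2 contingency table; chi2_contingency derives the expected counts using exactly this row-total times column-total over overall-total formula:

```python
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = device type, columns = subscribed yes / no
table = [[30, 20],
         [25, 45]]

stat, p_value, dof, expected = chi2_contingency(table)
print("chi2:", stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts:\n", expected)
```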

Analysis of variance, commonly called ANOVA, is a group of statistical techniques that test the difference of means between three or more groups. ANOVA is an extension of t-tests: while t-tests examine the difference of means between two groups, ANOVA can test means between several groups.

One-way ANOVA: Compares the means of one continuous dependent variable based on three or more groups of one categorical variable.

Two-way ANOVA: Compares the means of one continuous dependent variable based on three or more groups of two categorical variables.

Five steps in performing a one-way ANOVA test:

  • Calculate group means and grand (overall) mean
  • Calculate the sum of squares between groups (SSB) and the sum of squares within groups (SSW)
  • Calculate mean squares for both SSB and SSW
  • Compute the F-statistic
  • Use the F-distribution and the F-statistic to get a p-value, which you use to decide whether to reject the null hypothesis

The sum of squares between groups (SSB):

\( \displaystyle
\text{SSB}=\sum_{g=1}^{k} n_{g}\left(M_{g}-M_{G}\right)^{2}
\)
where:
\(n_{g}\) = the number of samples in the \(g\)th group
\(M_{g}\) = mean of the \(g\)th group
\(M_{G}\) = grand mean

The sum of squares within groups (SSW):

\( \displaystyle
\text{SSW}=\sum_{g=1}^{k}\sum_{i=1}^{n_{g}}\left(x_{gi}-M_{g}\right)^{2}
\)
where:
\(x_{gi}\) = sample \(i\) of the \(g\)th group
\(M_{g}\) = mean of the \(g\)th group

Mean squares between groups (MSSB):

\( \displaystyle
\text{MSSB}=\frac{\text{SSB}}{k-1}
\)
where:
k = the number of groups
Note: k−1 represents the degrees of freedom between groups

Mean squares within groups (MSSW):

\( \displaystyle
\text{MSSW}=\frac{\text{SSW}}{n-k}
\)
where:
n = the total number of samples in all groups
k = the number of groups
Note: n−k represents the degrees of freedom within groups

The F-statistic is the ratio of the mean sum of squares between groups (MSSB) to the mean sum of squares within groups (MSSW):

\( \displaystyle
\text{F-statistic}=\frac{\text{MSSB}}{\text{MSSW}}
\)

A higher F-statistic indicates a greater variability between group means relative to the variability within groups, suggesting that at least one group mean is significantly different from the others.
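A minimal sketch that walks through the five steps above by hand and cross-checks the result against SciPy's built-in one-way ANOVA (the three groups are made-up measurements):

```python
import numpy as np
from scipy import stats

# Made-up measurements for three groups
groups = [np.array([4.1, 5.0, 5.9, 4.8]),
          np.array([6.2, 7.1, 6.8, 7.4]),
          np.array([5.5, 5.1, 6.0, 5.7])]

grand_mean = np.concatenate(groups).mean()
k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of samples

# SSB: weighted squared distances of group means from the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SSW: squared distances of each observation from its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

mssb = ssb / (k - 1)                 # mean squares between groups
mssw = ssw / (n - k)                 # mean squares within groups
f_stat = mssb / mssw
p_value = stats.f.sf(f_stat, k - 1, n - k)   # right tail of the F-distribution
print(f_stat, p_value)

# Cross-check with SciPy's built-in one-way ANOVA
print(stats.f_oneway(*groups))
```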

Assumptions of ANOVA:

  1. The dependent values for each group come from normal distributions
  2. The variances across groups are equal
  3. Observations are independent of each other

Regression analysis vs. ANOVA:

  • Regression analysis: IF and by HOW MUCH variables impact an outcome variable
  • ANOVA: Pairwise comparisons; understand nuance among the elements that fueled the regression analysis

Post hoc test: Performs a pairwise comparison between all available groups while controlling for the error rate.

There are many post hoc tests that can be run. One of the most common ANOVA post hoc tests is Tukey's HSD (honestly significant difference) test.
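A minimal statsmodels sketch of Tukey's HSD (the scores and group labels are made up):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up scores and their group labels
scores = np.array([4.1, 5.0, 5.9, 4.8, 6.2, 7.1, 6.8, 7.4, 5.5, 5.1, 6.0, 5.7])
labels = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)

# Pairwise comparison of every group against every other group,
# controlling the family-wise error rate at 5%
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(result.summary())
```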

ANCOVA (Analysis of covariance): A statistical technique that tests the difference of means between three or more groups while controlling for the effects of covariates, or variable(s) irrelevant to your test.

※ Covariates are the variables that are not of direct interest to the question we are trying to address.

MANOVA (Multivariate analysis of variance): An extension of ANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables.

  • One-way MANOVA:
    • One categorical independent variable for continuous outcome variables
  • Two-way MANOVA:
    • Two categorical independent variables for continuous outcome variables

MANCOVA (Multivariate analysis of covariance): An extension of ANCOVA and MANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables, while controlling for covariates.

Module 5: Logistic regression

Logistic regression: A technique that models a categorical dependent variable (Y) based on one or more independent variables (X)

Binomial logistic regression: A technique that models the probability of an observation falling into one of two categories, based on one or more independent variables

Binomial logistic regression linearity assumptions: There should be a linear relationship between each X variable and the logit of the probability that Y equals 1.

\( \displaystyle
\text{Odds}=\frac{p}{1-p}
\)

Logit (log-odds): The logarithm of the odds of a given probability. So the logit of probability p is equal to the logarithm of p divided by 1 minus p.

\( \displaystyle
\text{logit}(p)=\log{\left(\frac{p}{1-p}\right)}
\)

Logit in terms of X variables:

\( \displaystyle
\text{logit}(p)=\log{\left(\frac{p}{1-p}\right)}
=
\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots+\beta_{n}X_{n}
\)


Maximum likelihood estimation (MLE): A technique for estimating the beta parameters that maximize the likelihood of the model producing the observed data.

Likelihood: The probability of observing the actual data, given some set of beta parameters.
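As a minimal sketch, statsmodels fits a binomial logistic regression by maximum likelihood; the data below (hours studied vs. whether an exam was passed) is entirely made up:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: hours studied and a binary pass/fail outcome
df = pd.DataFrame({
    "hours":  [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
    "passed": [0,   0,   0,   0,   1,   0,   1,   1,   1,   1],
})

# logit(p) = beta0 + beta1 * hours, estimated by maximum likelihood
model = smf.logit("passed ~ hours", data=df).fit()
print(model.params)       # estimated betas on the log-odds scale
print(model.predict(df))  # predicted probabilities p for each row
```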

Binomial logistic regression assumptions:

  • Linearity
  • Independent observations
  • No multicollinearity
  • No extreme outliers

Confusion matrix: A graphical representation of how accurate a classifier is at predicting the labels for a categorical variable.

Evaluation metrics for binomial logistic regression:

  • Precision: The proportion of positive predictions that were true positives.
    • \( \displaystyle
      \text{Precision}=\frac{\text{True Positives}}{\text{True Positives + False Positives}}
      \)
  • Recall: The proportion of positives the model was able to identify correctly.
    • \( \displaystyle
      \text{Recall}=\frac{\text{True Positives}}{\text{True Positives + False Negatives}}
      \)
  • Accuracy: The proportion of data points that were correctly categorized.
    • \( \displaystyle
      \text{Accuracy}=\frac{\text{True Positives + True Negatives}}{\text{Total Predictions}}
      \)
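A minimal scikit-learn sketch of the confusion matrix and these three metrics (the true and predicted labels are made up, with 1 as the positive class):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up true labels and classifier predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
```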

An ROC curve helps in visualizing the performance of a logistic regression classifier. ROC curve stands for receiver operating characteristic curve. To visualize the performance of a classifier at different classification thresholds, you can graph an ROC curve. In the context of binary classification, a classification threshold is a cutoff for differentiating the positive class from the negative class.

AUC stands for area under the ROC curve. AUC provides an aggregate measure of performance across all possible classification thresholds. AUC ranges in value from 0.0 to 1.0. A model whose predictions are 100% wrong has an AUC of 0.0, and a model whose predictions are 100% correct has an AUC of 1.0.
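A minimal scikit-learn sketch of plotting an ROC curve and computing AUC (the labels and predicted probabilities are made up):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3]

# Each point on the curve corresponds to one classification threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, marker="o")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve")
plt.show()
```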

Module 6: Course 5 end-of-course project

As usual, the course wraps up with a final role-play project.

The project covers what was learned in Course 5, and this time the main theme is Multiple Linear Regression. As always, working through the PACE framework and writing an executive summary are also part of the assessment.

Summary

Course complete.

From around this point, the lectures move into areas I am not that familiar with, so I am enjoying the learning.

Regression analysis can be used to predict future outcomes from training data and has a very wide range of applications, so it is well worth mastering properly.

There are many evaluation metrics for the regression models you build, which can get confusing, so you need to get used to them through practice.

It is also important to understand which distribution provides the theoretical backing for each analysis method.

The nature of the dependent variable you want to estimate is the criterion for deciding which of linear regression, ANOVA, or logistic regression to use.

Also, from Course 5 onward, the amount of Python code you write increases considerably overall (especially in the final end-of-course project). It is not just a matter of calling external statistics modules; you also define your own functions and apply them to DataFrames, and so on.

It is reasonably challenging, but it is a rewarding course, so it is best to enjoy working through it.

Next, I will move on to The Nuts and Bolts of Machine Learning, the last of the lecture courses.
