This is, for all practical purposes, the final content course of the Google Advanced Data Analytics Professional Certificate. It covers the foundations of modern machine learning.
Module 1: The different types of machine learning
Supervised machine learning: Uses labeled datasets to train algorithms to classify or predict outcomes.
Unsupervised machine learning: Uses algorithms to analyze and cluster unlabeled datasets.
Reinforcement learning is often used in robotics and is based on rewarding or punishing a computer’s behaviors.
Deep learning models are made of layers of interconnected nodes. Each layer of nodes receives signals from its preceding layer. Nodes that are activated by the input they receive then pass transformed signals either to another layer or to a final output.
There’s one aspect of machine learning and data science that every data professional should know: quality is more important than quantity. A small amount of diverse and representative data is often more valuable for data professionals than a large amount of biased and unrepresentative data.
Recommendation systems: Unsupervised learning techniques that use unlabeled data to offer relevant suggestions to users.
Content-based filtering: Comparisons are made based on attributes of content.
Popularity bias: The phenomenon of more popular items being recommended too frequently.
Integrated Development Environment (IDE): A piece of software that has an interface to write, run, and test a piece of code.
Module 1 just skims quickly through the basics, so there shouldn't be anything particularly difficult here.
Module 2: Workflow for building complex models
Module 2 walks through the workflow for building a machine learning model, following the PACE framework.
Feature engineering: The process of using practical, statistical, and data science knowledge to select, transform, or extract characteristics, properties, and attributes from raw data.
Feature selection: Selecting the features in the data that contribute the most to predicting your response variable.
Feature transformation: Modifying existing features in a way that improves accuracy when training the model.
Feature extraction: Taking multiple features to create a new one that improves the accuracy of the algorithm.
A log-normal distribution is a continuous distribution whose logarithm is normally distributed.
Scaling is when you adjust the range of a feature’s values by applying a normalization function to them. Scaling helps prevent features with very large values from having undue influence over a model compared to features with smaller values, but which may be equally important as predictors.
Normalization (e.g., MinMaxScaler in scikit-learn) transforms data to reassign each value to fall within the range [0, 1]. When applied to a feature, the feature’s minimum value becomes zero and its maximum value becomes one.
\( \displaystyle
x_{i,\ normalized}=\frac{x_{i}-x_{min}}{x_{max}-x_{min}}
\)
Standardization (e.g., StandardScaler in scikit-learn) transforms each value within a feature so they collectively have a mean of zero and a standard deviation of one.
\( \displaystyle
x_{i,\ standardized}=\frac{x_{i}-x_{mean}}{x_{stand.dev.}}
\)
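To make the two scalers concrete, here is a minimal sketch using scikit-learn's MinMaxScaler and StandardScaler; the feature matrix is made up purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix; the values are made up for this example only.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Normalization: rescales each column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescales each column to mean 0, standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```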
Variable encoding is the process of converting categorical data to numerical data. Consider the bank churn dataset.
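As a rough illustration of one-hot (dummy) encoding with pandas; the column names below are assumptions made for the sketch, not the exact columns of the course's churn dataset.

```python
import pandas as pd

# Hypothetical slice of a churn-style dataset (column names assumed for illustration).
df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "Exited": [1, 0, 0],
})

# One-hot encoding turns each category into its own 0/1 column;
# drop_first avoids a redundant (perfectly collinear) column per feature.
df_encoded = pd.get_dummies(df, columns=["Geography", "Gender"], drop_first=True)
print(df_encoded)
```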
Class imbalance: When a dataset's target variable contains many more instances of one outcome (class) than another.
Class balancing refers to the process of changing the data by altering the number of samples in order to make the ratios of classes in the target variable less asymmetrical.
Downsampling is the process of making the minority class represent a larger share of the whole dataset simply by removing observations from the majority class.
Upsampling is basically the opposite of downsampling, and is done when the dataset doesn’t have a very large number of observations in the first place. Instead of removing observations from the majority class, you increase the number of observations in the minority class.
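A hedged sketch of both balancing approaches using scikit-learn's resample utility on a made-up frame; the target column name is an assumption for the example.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 'churned' is an assumed target column name.
df = pd.DataFrame({"feature": range(10),
                   "churned": [0] * 8 + [1] * 2})

majority = df[df["churned"] == 0]
minority = df[df["churned"] == 1]

# Upsampling: sample the minority class with replacement until it matches
# the size of the majority class.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, minority_up])

# Downsampling: sample the majority class without replacement down to the
# size of the minority class.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced_down = pd.concat([majority_down, minority])
```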
Naïve Bayes: A supervised classification technique that is based on Bayes’ Theorem with an assumption of independence among predictors.
Posterior probability: The probability of an event occurring after taking into consideration new information.
\( \displaystyle
P(c|x)=\frac{P(x|c)P(c)}{P(x)}
\)
\(P(c|x)\): Posterior probability
\(P(x|c)\): Likelihood of a predictor x given a class c
\(P(c)\): Class prior probability
\(P(x)\): Predictor prior probability
\(
P(c|X)=P(x_{1}|c)\cdot P(x_{2}|c)\cdots P(x_{n}|c)\cdot P(c)
\)
\(P(c|X)\): Posterior probability
\(P(x_{n}|c)\): Conditional probability
Naive Bayes implementations in scikit-learn:
- GaussianNB (used for continuous, normally distributed features)
- MultinomialNB (used for discrete features)
- BernoulliNB (used for Boolean features)
- CategoricalNB (used for categorical features)
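As a quick usage sketch, assuming continuous features, GaussianNB is fit like any other scikit-learn estimator; the built-in iris dataset stands in for the course data here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris data as a stand-in: continuous, roughly normally distributed features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))  # mean accuracy on the held-out split
```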
F1 score: The harmonic mean of precision and recall. The idea behind this metric is that it penalizes low values of either metric, which prevents one very strong factor—precision or recall—from “carrying” the other, when it is weaker.
\( \displaystyle
F_{1}=2\cdot
\left(
\frac{\text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}}
\right)
\)
Fβ score: A score that treats one of recall or precision as more important than the other. In an Fβ score, β is a factor representing how many times more important recall is compared to precision. In the F1 score, β = 1, so recall is 1x as important as precision (i.e., they are equally important).
\( \displaystyle
F_{\beta}=\left(1+\beta^{2}\right)\cdot
\left(
\frac{\text{precision}\cdot \text{recall}}{\left(\beta^{2}\cdot \text{precision}\right)+\text{recall}}
\right)
\)
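For reference, scikit-learn exposes both metrics directly; a minimal sketch with made-up labels:

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy labels, made up purely for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(f1_score(y_true, y_pred))             # beta = 1: precision and recall weighted equally
print(fbeta_score(y_true, y_pred, beta=2))  # beta = 2: recall weighted more heavily
```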
Module 3: Unsupervised learning techniques
K-means works by minimizing intracluster variance. In other words, it aims to minimize the distance between points and their centroids. This means that K-means works best when the clusters are round.
K-means' characteristics:
- Unsupervised learning
- Partitioning algorithm
- Cluster unlabeled data
Centroid: The center of a cluster determined by the mathematical mean of all the points in that cluster.
Steps for K-means:
- Randomly place centroids in the data space.
- Assign each point to its nearest centroid.
- Update the location of each centroid to the mean position of all the points assigned to it.
- Repeat steps 2 and 3 until the model converges (i.e., all centroid locations remain unchanged with successive iterations).
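All of the steps above run inside a single fit() call in scikit-learn; a minimal sketch on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic blobs as stand-in data for the example.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the k you choose; the iterative centroid updates happen in fit().
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # final centroid locations
print(kmeans.inertia_)          # sum of squared distances to the nearest centroid
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points
```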
Clustering vs. Partitioning:
Note that even though K-means is a partitioning algorithm, data professionals typically talk about it as a clustering algorithm. The difference is that outlying points in clustering algorithms can exist outside of the clusters. However, for partitioning algorithms, all points must be assigned to a cluster. In other words, K-means does not allow unassigned outliers.
DBSCAN stands for density-based spatial clustering of applications with noise. Instead of trying to minimize variance between points in each cluster, DBSCAN searches your data space for continuous regions of high density.
Hyperparameters are external configuration variables that data scientists use to manage machine learning model training.
The most important hyperparameters for DBSCAN in scikit-learn are:
- eps: Epsilon (ε) – The radius of your search area from any given point
- min_samples: The number of samples in an ε-neighborhood for a point to be considered a core point (including itself)
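A minimal usage sketch of those two hyperparameters on synthetic crescent-shaped data, where K-means tends to struggle; the eps and min_samples values are arbitrary choices for the example.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Crescent-shaped data: dense, continuous regions that are not round.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the search radius; min_samples is the neighborhood size (including the point itself).
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))  # cluster labels; -1 marks points treated as noise
```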
Agglomerative clustering works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance.
Agglomerative clustering requires that you specify a desired number of clusters or a distance threshold, which is the linkage distance (explained further in the next section) above which clusters will not be merged.
There are different ways to measure the distances that determine whether or not to merge the clusters. This is known as the linkage. Some of the most common are:
- Single: The minimum pairwise distance between clusters
- Complete: The maximum pairwise distance between clusters
- Average: The average pairwise distance between the points in one cluster and the points in the other.
- Ward: This is not a distance measurement. Instead, it merges the two clusters whose merging will result in the lowest inertia.
Important hyperparameters available for agglomerative clustering in scikit-learn:
- n_clusters: The number of clusters you want in your final model
- linkage: The linkage method to use to determine which clusters to merge (as described above)
- affinity: The metric used to calculate the distance between clusters. Default = euclidean distance.
- distance_threshold: The distance above which clusters will not be merged (as described above)
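A short sketch showing both ways of stopping the merging, by cluster count or by distance threshold; note that in recent scikit-learn versions the affinity parameter has been renamed to metric, so only the version-stable parameters are used here.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

# Either n_clusters or distance_threshold is set, never both.
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
print(agg.fit_predict(X)[:10])  # cluster labels for the first 10 points

# Alternative: let the linkage-distance threshold decide how many clusters survive.
agg_thresh = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=10.0,
                                     linkage="ward")
agg_thresh.fit(X)
print(agg_thresh.n_clusters_)   # number of clusters found at this threshold
```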
Inertia is a measurement of intracluster distance. It indicates how compact the clusters are in a model. Specifically, inertia is the sum of the squared distance between each point and the centroid of the cluster that it’s assigned to.
\( \displaystyle
\text{Inertia}=\sum_{i=1}^{n}\left(x_{i}-C_{k}\right)^{2}
\)
- \(n\) = the number of observations in the data
- \(x_i\) = the location of a particular observation
- \(C_k\) = the location of the centroid of cluster \(k\), which is the cluster to which point \(x_i\) is assigned
A silhouette analysis is the comparison of different models’ silhouette scores. To calculate a model’s silhouette score, first, a silhouette coefficient is calculated for each instance in the data.
\( \displaystyle
\text{Silhouette coefficient}=\frac{(b-a)}{\text{max}(a, b)}
\)
- a = the mean distance between the instance and each other instance in the same cluster
- b = the mean distance from the instance to each instance in the nearest other cluster (i.e., excluding the cluster that the instance is assigned to)
- max(a, b) = whichever value is greater, a or b
The silhouette score is the mean silhouette coefficient over all the observations in a model. The greater the silhouette score, the better defined the model clusters, because the points in a given cluster are closer to each other, and the clusters themselves are more separated from each other.
Note that, unlike inertia, silhouette coefficients contain information about both intracluster distance (captured by the variable a) and intercluster distance (captured by the variable b).
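A common workflow is to compare inertia and silhouette score across candidate values of k; here is a minimal sketch on synthetic data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit a K-means model per candidate k and compare the two metrics.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    score = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={score:.3f}")
```

Inertia always decreases as k grows, so it is read for an "elbow," while the silhouette score can simply be maximized.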
Module 4: Tree-based modeling
Tree-based learning: A type of supervised machine learning that performs classification and regression tasks
Decision tree: Flow-chart-like supervised classification model, and a representation of various solutions that are available to solve a given problem, based on the possible outcomes of related choices.
Decision tree characteristics:
- Require no assumptions regarding distribution of data
- Handles collinearity very easily
- Often doesn’t require data preprocessing
Root node: The first node of the tree, where the first decision is made.
Decision node: Nodes of the tree where decisions are made.
Leaf node: The nodes where a final prediction is made.
Child node: A node that is pointed to from another node.
Gini impurity:
\( \displaystyle
\text{Gini impurity}=1-\sum_{i=1}^{N}P(i)^2
\)
where i = class
P(i) = the probability of samples belonging to class i in a given node.
\( \displaystyle
Gini_{total}=
\left(
\frac{\text{number of samples in LEFT child}}{\text{number of samples in BOTH child nodes}}
\right)
\cdot Gini_{\text{left child}}
+
\left(
\frac{\text{number of samples in RIGHT child}}{\text{number of samples in BOTH child nodes}}
\right)
\cdot Gini_{\text{right child}}
\)
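A tiny worked example with made-up counts, just to show how the node impurity and the weighted split impurity are computed:

```python
# Gini impurity of a node holding 8 samples of class A and 2 of class B.
p_a, p_b = 8 / 10, 2 / 10
gini_node = 1 - (p_a ** 2 + p_b ** 2)
print(gini_node)  # 0.32

# Weighted Gini of a split: left child has 6 samples (Gini 0.0),
# right child has 4 samples (Gini 0.5).
gini_total = (6 / 10) * 0.0 + (4 / 10) * 0.5
print(gini_total)  # 0.2
```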
Hyperparameters: Parameters that can be set before the model is trained.
Max depth: Defines how “long” a decision tree can get.
Min samples leaf: Defines the minimum number of samples for a leaf node.
GridSearch: A tool to confirm that a model achieves its intended purpose by systematically checking every combination of hyperparameters to identify which set produces the best results, based on the selected metric
Model validation is the whole process of evaluating different models, selecting one, and then continuing to analyze the performance of the selected model to better understand its strengths and limitations.
Cross-validation: A process that uses different portions of the data to test and train a model on different iterations. Cross-validation splits the training data into k number of folds, trains a model on k – 1 folds, and uses the fold that was held out to get a validation score. This process repeats k times, each time using a different fold as the validation set.
Model validation includes:
- Using validation sets
- Cross validation
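Putting GridSearch and cross-validation together, here is a hedged sketch that tunes the two decision-tree hyperparameters mentioned above; the grid values and dataset are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Every combination of these hyperparameters is evaluated with
# 5-fold cross-validation on the training data.
param_grid = {"max_depth": [3, 5, 7, None],
              "min_samples_leaf": [1, 5, 10]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)            # best mean cross-validation F1 score
print(grid.score(X_test, y_test))  # refit best model, scored on the held-out split
```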
Ensemble learning (or ensembling): Building multiple models and aggregating their outputs to make a prediction.
Base learner: Each individual model that comprises an ensemble.
Weak learner: A model that performs slightly better than randomly guessing.
Bagging stands for Bootstrap + aggregating.
Bootstrapping refers to sampling with replacement. In ensemble modeling architectures, this means that for each base learner, the same observation can and will be sampled multiple times.
Random forest: Ensemble of decision trees trained on bootstrapped data with randomly selected features. So to speak, “Bagging + random feature sampling = Random forest.”
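A minimal random forest sketch; max_features controls the random feature sampling at each split, and the other values are arbitrary choices for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree trains on a bootstrapped sample and, at every split,
# considers only a random subset of the features.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5, scoring="f1").mean())
```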
Decision trees continue splitting until:
- Leaf nodes are all pure
- The minimum samples per leaf or the maximum depth is reached
- A performance metric is achieved
Magic commands (“magics”): Commands built into IPython to simplify common tasks. They always begin with either “%” or “%%”.
Boosting: Technique that builds an ensemble of weak learners sequentially, with each consecutive learner trying to correct the errors of the one that preceded it.
A weak learner is a model whose prediction is only slightly better than a random guess, and a base learner is any individual model in an ensemble.
Boosting differences from random forest and bagging:
- Learners are built sequentially, not in parallel
- Not limited to tree-based learners
Adaptive boosting (AdaBoost): A boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner.
Gradient boosting: A boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that preceded it.
Gradient boosting machines (GBMs): Model ensembles that use gradient boosting
Advantages of gradient boosting machines (GBMs):
- High accuracy
- Generally scalable
- Work well with missing data
- Don’t require scaling
Disadvantages of gradient boosting machines (GBMs):
- Tuning many hyperparameters can be time-consuming
- Difficult to interpret
- Have difficulty with extrapolation
- Prone to overfitting if too many hyperparameters are tuned
Black-box model: Any model whose predictions cannot be precisely explained.
Extrapolation is a model’s ability to predict new values that fall outside of the range of values in the training data.
XGBoost: Extreme gradient boosting, an optimized GBM package.
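A hedged usage sketch of XGBoost's scikit-learn-style interface; the xgboost package is installed separately from scikit-learn, and the hyperparameter values here are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # requires the separate xgboost package

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each boosting round fits a new tree to the errors of the current ensemble.
xgb = XGBClassifier(n_estimators=300, learning_rate=0.1,
                    max_depth=4, eval_metric="logloss")
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))  # accuracy on the held-out split
```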
Module 5: Course 6 end-of-course project
The final end-of-course project is the usual role-play exercise.
- Automatidata
- TikTok
- Waze
You choose one of these three scenarios and work through it as a role-play. This time, the task is building a machine learning model.
Remember, sometimes your data simply will not be predictive of your chosen target. This is common. Machine learning is a powerful tool, but it is not magic. If your data does not contain predictive signal, even the most complex algorithm will not be able to deliver consistent and accurate predictions. Do not be afraid to draw this conclusion.
Summary
And with that, the course is done.
It covers the fundamentals of machine learning from end to end.
Terms like XGBoost and K-means rarely come up in everyday life, so going in completely cold would make this fairly tough.
Even just skimming the book below beforehand means the machine-learning-specific vocabulary won't catch you off guard. It also makes generous use of plotted charts and data structures, so it is worth a read simply to get comfortable with those as well.
Introduction to Machine Learning with Python
I thought I had a rough grasp of how these methods work, but having them broken down in Google's excellent material brought plenty of new discoveries. I especially appreciated the rough conceptual diagrams of each algorithm's characteristics, because written definitions alone can be hard to picture.
Actually building machine learning models and tuning hyperparameters took longer than I expected (10 to 20 minutes even on the sample data used in this course), so deciding what hardware to use will likely be a sticking point in real-world work.
Maybe the quickest option is simply to spin up a dedicated instance in the cloud?
In any case, the certificate program is now in its final stretch. Next up is the Google Advanced Data Analytics Capstone.
That's all for now!