Learning#

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd

Overview#

There is no consensus on the precise definitions of data science, machine learning, deep learning and artificial intelligence. For our purposes, we consider the following definitions:

  • Data Science (DS): a cross-disciplinary field that employs scientific approaches, processes, algorithms and systems to extract meaning and insights from data.

  • Artificial Intelligence (AI): a field of research aiming to develop artificial systems with human-level learning and reasoning abilities, possessing the qualities of intentionality, intelligence and adaptability.

  • Machine Learning (ML): a subset of the field of AI which involves algorithms with the ability to learn without being explicitly programmed. These algorithms learn from data to improve their accuracy, adaptability and utility.

A little ML history: Arthur Samuel was an American computer scientist who is credited with coining the term “machine learning” through his research in computer systems, first at UIUC (where he initiated the ILLIAC project) and then at IBM, where he developed the first checkers program on IBM’s first commercial computer in 1959. Robert Nealey, a self-proclaimed checkers master, played the game on an IBM 7094 computer in 1962, and he lost to the computer. Compared to what can be done today, this feat seems trivial, but it is considered a major milestone in the field of artificial intelligence.

  • Artificial neural networks (ANNs): comprised of node layers, containing an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to another and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network by that node.

  • Deep Learning (DL): a subset of ML in which artificial neural networks adapt and learn from large datasets. The “deep” in deep learning is just referring to the number of layers in a neural network. A neural network that consists of more than three layers—which would be inclusive of the input and the output—can be considered a deep learning algorithm or a deep neural network. A neural network that only has three layers is just a basic neural network. You can think of deep learning as “scalable machine learning” that eliminates some of the human intervention required (through flexible frameworks) and enables the use of larger data sets to continually improve the model performance.
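The thresholded node described above can be sketched in a few lines of NumPy. This is only an illustration: the weights and thresholds below are hand-picked assumptions (nothing is learned), chosen so that a three-layer network (input, one hidden layer, output) computes XOR, and the helper names `step` and `forward` are hypothetical.

```python
import numpy as np

def step(z, threshold):
    """A node is activated (outputs 1) only if its weighted input exceeds its threshold."""
    return (z > threshold).astype(float)

def forward(x, layers):
    """Propagate an input through a list of (weights, thresholds) layers."""
    a = np.asarray(x, dtype=float)
    for W, t in layers:
        a = step(W @ a, t)
    return a

# Hand-picked (not learned) parameters, assumed for illustration only.
hidden = (np.array([[1.0, 1.0],     # node 1: fires if x1 OR x2
                    [1.0, 1.0]]),   # node 2: fires if x1 AND x2
          np.array([0.5, 1.5]))
output = (np.array([[1.0, -1.0]]),  # fires if OR is on and AND is off
          np.array([0.5]))
xor_net = [hidden, output]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, forward(x, xor_net))
```

Practical networks replace the hard threshold with a smooth activation function so that the weights can be learned by gradient-based optimization.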

The figure below summarizes these definitions and relationships:

https://raw.githubusercontent.com/illinois-ipaml/MachineLearningForPhysics/main/img/Learning-AI.png

Types of Learning#

https://raw.githubusercontent.com/illinois-ipaml/MachineLearningForPhysics/main/img/Learning-TypesOfLearning.png

Supervised Learning#

Supervised learning, also known as supervised machine learning, is where machines are taught by example. It is defined by its use of labeled datasets to train algorithms to classify new data or predict outcomes accurately. As input data is fed into the model, the model adjusts its weights until it has been fitted appropriately. This occurs as part of the cross validation process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam in a separate folder from your inbox. Some methods used in supervised learning include neural networks, naïve Bayes, linear regression, logistic regression, random forests, and support vector machines (SVMs).

There are two main categories of supervised learning that are mentioned below:

  • Classification: Classification is a process of categorizing data or objects into predefined categories based on their features or attributes and determining to what category new observations belong.

  • Regression: Regression is a process to estimate the relationships among variables when the output variable is a real or continuous value.
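A minimal sketch of the two categories on synthetic data, using only NumPy. The data, the least-squares line fit, and the nearest-centroid rule are assumptions chosen for brevity, not methods prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression: the target y is continuous; fit y ~ a*x + b by least squares.
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
A = np.column_stack([x, np.ones_like(x)])
(a_fit, b_fit), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted slope = {a_fit:.3f}, intercept = {b_fit:.3f}")

# Classification: the target is a discrete label (0 or 1).
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(100, 2))  # labeled 0
X1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(100, 2))  # labeled 1
centroids = np.stack([X0.mean(axis=0), X1.mean(axis=0)])

def classify(points):
    """Assign each point the label of the nearest class centroid."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

print(classify(np.array([[-3.0, 0.0], [3.0, 0.0]])))
```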

Advantages of Supervised Machine Learning:

  • Supervised learning models can have high accuracy as they are trained on labeled data.

  • The process of decision-making in supervised learning models is often interpretable.

  • Pre-trained models can often be reused, which saves time and resources compared with developing new models from scratch.

Disadvantages of Supervised Machine Learning:

  • It may struggle with unseen or unexpected patterns that are not present in the training data.

  • It can be time-consuming and costly as it relies on labeled data only.

  • It may generalize poorly to new data.

Unsupervised Learning#

Unsupervised learning is a machine learning technique in which an algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn’t involve providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters within the data, which can then be used for various purposes, such as data exploration, visualization, dimensionality reduction, and more.

There are two main categories of unsupervised learning that we have already studied extensively:

  • Clustering: Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. In essence, it groups objects on the basis of their similarity and dissimilarity.

  • Dimensionality Reduction: Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data. This can be done for a variety of reasons, such as to reduce the complexity of a model, to improve the performance of a learning algorithm, or to make it easier to visualize the data.
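Both categories can be sketched with NumPy alone. The example below is a simplified illustration on synthetic data: `kmeans` is a hypothetical helper implementing Lloyd's algorithm with a farthest-point initialization (an assumption made to keep the result deterministic), and the dimensionality reduction is a principal component analysis computed with an SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic blobs in 3D, separated along the first axis.
blob_a = rng.normal(loc=[-3.0, 0.0, 0.0], scale=0.5, size=(150, 3))
blob_b = rng.normal(loc=[+3.0, 0.0, 0.0], scale=0.5, size=(150, 3))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, n_iter=10):
    """Lloyd's algorithm; assumes no cluster becomes empty (true for this data)."""
    centers = [X[0]]
    for _ in range(k - 1):  # farthest-point initialization
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # assign to nearest center
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, k=2)

# PCA via SVD: project the centered data onto the top two principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of variance per component
Z = Xc @ Vt[:2].T                 # reduced (300, 2) representation

print(np.bincount(labels), explained.round(3))
```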

Advantages of Unsupervised Machine Learning:

  • It helps to discover hidden patterns and various relationships between the data.

  • It is used for tasks such as anomaly detection and data exploration. Techniques such as autoencoders and dimensionality reduction can be used to extract meaningful features from raw data.

  • It does not require labeled data and reduces the effort of data labeling.

Disadvantages of Unsupervised Machine Learning:

  • Without using labels, it may be difficult to predict the quality of the model’s output.

  • Clusters may be hard to interpret and need not correspond to meaningful categories.

Semi-Supervised Learning#

Semi-supervised learning is a type of machine learning that falls in between supervised and unsupervised learning. It is a method that uses a small amount of labeled data and a large amount of unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that can accurately predict the output variable based on the input variables, similar to supervised learning. However, unlike supervised learning, the algorithm is trained on a dataset that contains both labeled and unlabeled data. Semi-supervised learning is particularly useful when there is a large amount of unlabeled data available, but it’s too expensive or difficult to label all of it.
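One simple semi-supervised strategy is self-training: fit a classifier on the few labeled points, use it to assign pseudo-labels to the unlabeled points, then refit on everything and repeat. The sketch below assumes synthetic two-class data, three labeled points per class, and a nearest-centroid classifier; all of these are illustrative choices rather than a recommended recipe.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([+2.0, 0.0], 1.0, size=(200, 2))])
y_true = np.repeat([0, 1], 200)   # used only for the labeled subset and for scoring

labeled = np.r_[0:3, 200:203]     # indices of the few labeled samples
y_labeled = y_true[labeled]

def fit_centroids(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(X, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Supervised baseline: use the labeled points only.
centroids = fit_centroids(X[labeled], y_labeled)
acc_supervised = np.mean(predict(X, centroids) == y_true)

# Self-training: alternate pseudo-labeling and refitting.
for _ in range(10):
    pseudo = predict(X, centroids)
    pseudo[labeled] = y_labeled   # never overwrite the true labels
    centroids = fit_centroids(X, pseudo)
acc_self_training = np.mean(predict(X, centroids) == y_true)

print(acc_supervised, acc_self_training)
```

Whether the unlabeled data helps depends on how well the model's assumptions match the data; here the two classes form well-separated clusters, which is the favorable case.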

Advantages of Semi-supervised Machine Learning:

  • It can lead to better generalization than supervised learning on the labeled data alone, as it uses both labeled and unlabeled data.

  • Can be applied to a wide range of data.

Disadvantages of Semi-supervised Machine Learning:

  • Semi-supervised methods can be more complex to implement compared to other approaches.

  • It still requires some labeled data that might not always be available or easy to obtain.

  • Unlabeled data that is noisy or unrepresentative can degrade the model performance.

Reinforcement Learning#

Reinforcement learning is a learning method that interacts with an environment by producing actions and discovering errors. It is the science of decision making: learning the optimal behavior in an environment to obtain maximum reward. Trial, error, and delay are the most relevant characteristics of reinforcement learning. These methods allow machines to become autonomous self-learners that automatically determine the ideal behavior within a specific context in order to maximize performance. This type of learning is crucial for applications that involve decision-making in unpredictable environments.
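The trial-and-error loop can be sketched with a multi-armed bandit and an epsilon-greedy agent. This is a deliberately reduced setting, assumed here for brevity: there are no states and no delayed rewards, so it illustrates only the trade-off between exploring actions and exploiting the best one found so far. The function `run_bandit` and its parameters are hypothetical.

```python
import numpy as np

def run_bandit(true_means, n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Gaussian bandit with unit-variance rewards."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)          # running estimate of each arm's reward
    counts = np.zeros(n_arms, dtype=int)  # how often each arm was pulled
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: random action
        else:
            arm = int(np.argmax(estimates))   # exploit: best action so far
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total += reward
    return estimates, counts, total / n_steps

estimates, counts, avg_reward = run_bandit([0.0, 0.5, 2.0])
print(estimates.round(2), counts, round(avg_reward, 2))
```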

Advantages of Reinforcement Machine Learning:

  • Its autonomous decision-making is well suited to tasks that require a sequence of decisions without human guidance, such as robotics and game-playing. Promising science and engineering applications include beam control in particle accelerators and steering of high-temperature plasma in fusion systems, to name just a few.

  • It is well suited to pursuing long-term objectives that are otherwise very difficult to achieve.

  • It is used to solve complex problems that cannot be solved by conventional techniques.

Disadvantages of Reinforcement Machine Learning:

  • Training Reinforcement Learning agents can be computationally expensive and time-consuming.

  • Reinforcement learning is not preferable for solving simple problems.

  • It needs a lot of data and a lot of computation, which can make it impractical and costly.

Learning in a Probabilistic Context#

Learning a Model#

We have focused recently on two questions:

  • Assuming a model, which parameters best explain my data?

  • Given competing models, what are the relative odds that they explain my data?

In the framework of Bayesian inference, we answer the first question by estimating the posterior

\[ \Large P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\, P(\Theta_M\mid M)}{P(D\mid M)} \]

for the assumed model \(M\) and its parameters \(\Theta_M\). We answer the second question by estimating the posterior ratio:

\[ \Large \frac{P(M_1\mid D)}{P(M_2\mid D)} = \frac{P(D\mid M_1)\, P(M_1)}{P(D\mid M_2)\, P(M_2)} \; . \]

In either case, the fundamental object is the joint probability,

\[ \Large P(D, \Theta_M, M), \]

or, with competing models, \(M_1\) and \(M_2\), the pair of joint probabilities,

\[ \Large P(D, \Theta_{M_1}, M_1) \quad , \quad P(D, \Theta_{M_2}, M_2) \; . \]

These are the fundamental objects since any conditional or marginalized probability can be derived from them. Note that the observed random variables \(D\) are given, but the unobserved (latent) random variables \(\Theta_M\) require that we make a choice of model(s).
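As a concrete toy case, both questions can be answered on a grid for coin-flip data. The setup is assumed purely for illustration: the data \(D\) is a sequence of 20 flips containing 15 heads, \(M_1\) is a fair coin with no free parameters, and \(M_2\) has an unknown bias \(\theta\) with a flat prior.

```python
import numpy as np
from math import factorial

def integrate(f, x):
    """Trapezoid rule on a 1D grid."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

n, k = 20, 15   # hypothetical data D: k heads in n flips

# M2: unknown bias theta with flat prior P(theta | M2) = 1 on [0, 1].
theta = np.linspace(0.0, 1.0, 2001)
prior = np.ones_like(theta)
likelihood = theta**k * (1.0 - theta)**(n - k)   # P(D | theta, M2)
joint = likelihood * prior                        # P(D, theta | M2)
evidence_M2 = integrate(joint, theta)             # P(D | M2)
posterior = joint / evidence_M2                   # P(theta | D, M2)

# M1: fair coin, so the evidence is just the likelihood at theta = 0.5.
evidence_M1 = 0.5**n

# With equal prior odds P(M1) = P(M2), the posterior odds equal this ratio.
bayes_factor = evidence_M2 / evidence_M1

# Cross-check against the closed form of the Beta integral.
exact_M2 = factorial(k) * factorial(n - k) / factorial(n + 1)
print(theta[np.argmax(posterior)], bayes_factor, evidence_M2 / exact_M2)
```

Every quantity here is obtained from the joint probability by conditioning or marginalizing, which is the sense in which the joint is the fundamental object.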

Unsupervised Learning#

The questions above concern models and their parameters. However, we can also ask questions about “new” data, where “new” means either not yet observed or else already observed but omitted from our analysis. For example:

  • KEY QUESTION: Given observed data \(D\), how likely is unobserved data \(D'\)?

This is the fundamental question of unsupervised learning, and can be framed in probabilistic language starting from the joint probability

\[ \Large P(D, D', \Theta_M, M) \]

assuming the model \(M\) with parameters \(\Theta_M\).

To answer this question, we must estimate:

\[\begin{split} \Large \begin{aligned} P(D'\mid D, M) &= \int d\Theta_M\, P(D',\Theta_M\mid D, M) \\ &= \int d\Theta_M\, P(D'\mid D,\Theta_M, M)\,P(\Theta_M\mid D, M)\\ &= \int d\Theta_M\, \frac{P(D',D\mid \Theta_M, M)}{P(D\mid \Theta_M, M)}\, P(\Theta_M\mid D, M) \;. \\ \end{aligned} \end{split}\]

If we assume that \(D\) and \(D'\) are statistically independent datasets, then

\[ \Large P(D',D\mid \Theta_M, M) = P(D'\mid\Theta_M, M)\, P(D\mid \Theta_M, M) \; , \]

and we can simplify

\[ \Large P(D'\mid D, M) = \int d\Theta_M\, P(D'\mid\Theta_M, M)\, P(\Theta_M\mid D, M) \; . \]

Note that in order to evaluate the RHS, we must have already learned the model \(M\) and determined the posterior \(P(\Theta_M\mid D, M)\) of its parameters.
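A small numerical illustration of this integral, under assumed toy conditions: \(D\) is 15 heads in 20 coin flips, the model has a single bias parameter \(\theta\) with a flat prior, and \(D'\) is the event that the next flip is heads, so that \(P(D'\mid\theta, M) = \theta\).

```python
import numpy as np

def integrate(f, x):
    """Trapezoid rule on a 1D grid."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

n, k = 20, 15   # hypothetical observed data D
theta = np.linspace(0.0, 1.0, 2001)

# First learn the model: posterior P(theta | D, M) with a flat prior.
posterior = theta**k * (1.0 - theta)**(n - k)
posterior /= integrate(posterior, theta)

# Then marginalize: P(D' | D, M) = integral of P(D' | theta) P(theta | D, M).
p_heads = integrate(theta * posterior, theta)

# For comparison, plugging in the single most probable theta ignores
# the posterior uncertainty and gives a different answer.
theta_map = theta[np.argmax(posterior)]
print(p_heads, theta_map)
```

For this model the integral has the closed form \((k+1)/(n+2)\), which the grid estimate should reproduce closely.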

Supervised Learning#

If we split the features of our data \(D\) into two categories, \(X\) and \(Y\), we can ask a new question about unobserved data \(D'\) that splits into \(X'\) and \(Y'\):

  • KEY QUESTION: Given observed data \((X,Y)\) from \(D\) and \(X'\) from \(D'\), how likely is the remaining unobserved data \(Y'\)?

This is the fundamental question of supervised learning, and the relevant joint probability is now

\[ \Large P(D, D', \Theta_M, M) = P(X, Y, X', Y', \Theta_M, M) \]

for the assumed model \(M\) with parameters \(\Theta_M\).

To answer this question, we can estimate:

\[\begin{split} \Large \begin{aligned} P(Y'\mid X, Y, X', M) &= \int d\Theta_M\, ~P(Y',\Theta_M\mid X, Y, X', M) \\ &= \int d\Theta_M\, ~P(Y'\mid X, Y, X',\Theta_M, M)\, ~P(\Theta_M\mid X, Y, X', M) \\ &= \int d\Theta_M\, \frac{P(X, Y, X', Y'\mid \Theta_M, M)}{P(X, Y, X'\mid \Theta_M, M)}\, ~P(\Theta_M\mid X, Y, X', M) \; . \end{aligned} \end{split}\]

If we again assume that \(D\) and \(D'\) are statistically independent, then

\[ \Large P(X,Y,X',Y'\mid\Theta_M,M) = P(X',Y'\mid\Theta_M,M)\, ~P(X,Y\mid\Theta_M,M) \; , \]

and (after integrating out \(Y'\)),

\[ \Large P(X,Y,X'\mid\Theta_M,M) = P(X'\mid\Theta_M,M)\, ~P(X,Y\mid\Theta_M,M) \; . \]

We can then simplify:

\[ \Large P(Y'\mid X, Y, X', M) = \int d\Theta_M\, \frac{P(X', Y'\mid \Theta_M, M)}{P(X'\mid \Theta_M, M)}\, ~P(\Theta_M\mid X, Y, X', M) \; . \]

Note that this formulation of the problem has \(P(\Theta_M\mid X, Y, X', M)\) on the RHS, which indicates that we must re-learn the model \(M\) each time we are given new data \(X'\).

An alternative formulation reveals that, while valid, this is not necessary: start from the unsupervised result above, with \(D=(X,Y)\) and \(D'=(X',Y')\),

\[ \Large P(X',Y'\mid X,Y, M) = \int d\Theta_M\, ~P(X',Y'\mid\Theta_M, M)\, ~P(\Theta_M\mid X,Y, M) \]

then integrate out \(Y'\),

\[\begin{split} \Large \begin{aligned} P(X'\mid X,Y, M) &= \int d\Theta_M\, \left[ \int dY'\, ~P(X',Y'\mid\Theta_M, M)\right] ~P(\Theta_M\mid X,Y, M) \\ &= \int d\Theta_M\, ~P(X'\mid\Theta_M, M)\, ~P(\Theta_M\mid X,Y, M) \; . \end{aligned} \end{split}\]

We can now answer our original question using

\[\begin{split} \Large \begin{aligned} P(Y'\mid X,Y,X',M) &= \frac{P(X',Y'\mid X,Y, M)}{P(X'\mid X,Y, M)} \\ &= \frac{\int d\Theta_M\, ~P(X',Y'\mid\Theta_M, M)\, ~P(\Theta_M\mid X,Y, M)} {\int d\Theta_M\, ~P(X'\mid\Theta_M, M)\, ~P(\Theta_M\mid X,Y, M)} \; . \end{aligned} \end{split}\]

Note how this formulation allows us to learn the model \(M\) once with the original data \(D=(X,Y)\) but requires two separate marginalizations (integrals) over the model parameters \(\Theta_M\).
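A grid-based sketch of this "learn once, marginalize twice" recipe. The model is an assumption made for illustration: \(y = \theta x + \epsilon\) with known Gaussian noise, \(x\) drawn from a unit Gaussian independently of \(\theta\), and a flat prior on \(\theta\). The numerator marginalizes \(P(X',Y'\mid\Theta_M, M)\) and the denominator marginalizes \(P(X'\mid\Theta_M, M)\), both over the same posterior learned from \((X, Y)\).

```python
import numpy as np

def gauss(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def integrate(f, x):
    """Trapezoid rule on a 1D grid."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

rng = np.random.default_rng(3)
sigma, theta_true = 0.5, 1.5            # assumed noise level and true slope
X = rng.normal(size=50)
Y = theta_true * X + rng.normal(scale=sigma, size=X.size)

# Learn the model once: posterior P(theta | X, Y, M) on a grid (flat prior).
theta = np.linspace(-1.0, 4.0, 5001)
resid = Y[None, :] - theta[:, None] * X[None, :]
log_like = -0.5 * np.sum((resid / sigma) ** 2, axis=1)
post = np.exp(log_like - log_like.max())
post /= integrate(post, theta)

def predictive(y_new, x_new):
    """P(Y' | X, Y, X', M) as a ratio of two marginalizations over theta."""
    p_x = gauss(x_new, 0.0, 1.0)                           # P(X' | theta, M)
    joint_new = gauss(y_new, theta * x_new, sigma) * p_x   # P(X', Y' | theta, M)
    numerator = integrate(joint_new * post, theta)
    denominator = integrate(p_x * post, theta)
    return numerator / denominator

x_new = 2.0
y_grid = np.linspace(-2.0, 8.0, 1001)
p_y = np.array([predictive(y, x_new) for y in y_grid])
norm = integrate(p_y, y_grid)
mean = integrate(y_grid * p_y, y_grid)
theta_hat = np.sum(X * Y) / np.sum(X * X)   # least-squares slope, for comparison
print(norm, mean, theta_hat * x_new)
```

In this particular model \(P(X'\mid\Theta_M, M)\) does not depend on \(\theta\), so the denominator reduces to a constant; the ratio form is kept to mirror the general expression.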

Terminology#

The data \(D\) used to learn the model in unsupervised or supervised learning is referred to as the training data. In supervised learning, the features appearing in \(X\) are referred to as the input features and those in \(Y\) as the target features.

We use different terminology (and approaches to modeling) for supervised learning depending on the type of target features we wish to learn:

  • regression: predict continuous-valued target features.

  • classification: predict discrete-valued target features.

Note that the target features might be a mix of continuous and discrete features, so this terminology is incomplete.

Most of the high-profile machine learning applications from Google, Facebook, etc., involve classification rather than regression, so proportionally more effort has gone into developing and optimizing classification algorithms. However, most scientific applications are more naturally expressed as regression problems: this presents both a challenge and an opportunity to the scientific ML community! Also note that a regression problem can always be converted into a classification problem by binning the output (assuming you don’t need infinite accuracy), which can be surprisingly effective.
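The binning step can be sketched with `np.digitize`. The choice of five equal-population (quantile) bins is an assumption for illustration; the accuracy lost by the conversion is bounded by half the widest bin.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(size=1000)          # a continuous regression target

n_bins = 5
edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))

# Class labels 0 .. n_bins-1, using only the interior bin edges.
labels = np.digitize(y, edges[1:-1])

# Map a predicted class back to a continuous value via the bin center.
centers = 0.5 * (edges[:-1] + edges[1:])
y_reconstructed = centers[labels]

print(np.bincount(labels), np.max(np.abs(y - y_reconstructed)))
```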

All unsupervised and supervised learning algorithms involve priors \(P(\Theta_M\mid M)\), but they are not always stated explicitly. Sometimes priors are expressed implicitly via terms that are referred to as regularization conditions.
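A standard example of a prior hiding inside a regularization term is ridge regression: for a linear model with Gaussian noise of width \(\sigma\) and an independent Gaussian prior of width \(\tau\) on each weight, the posterior maximum is the least-squares solution with an added penalty \(\lambda = \sigma^2/\tau^2\) on the squared weights. The sketch below checks this numerically on synthetic data; the specific values of \(\sigma\), \(\tau\) and the weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
sigma, tau = 1.0, 2.0              # assumed noise width and prior width
y = X @ w_true + rng.normal(scale=sigma, size=30)

# Ridge regression with penalty lam * |w|^2.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Ordinary least squares (equivalent to a flat prior) for comparison.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

def neg_log_posterior(w):
    """-log[ P(y | w) P(w) ] up to an additive constant."""
    return 0.5 * np.sum((y - X @ w) ** 2) / sigma**2 + 0.5 * np.sum(w**2) / tau**2

# The gradient of the negative log-posterior vanishes at the ridge solution.
grad = -(X.T @ (y - X @ w_ridge)) / sigma**2 + w_ridge / tau**2
print(np.linalg.norm(grad), np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```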

Model Selection Revisited#

When learning a model \(M\), our quantitative measure of how well it explains the data \(D\) is the evidence \(P(D\mid M)\).

Since unsupervised and supervised learning both require first learning the model, we could use the same measure, but a different measure is also useful: how well does \(M\) explain unobserved data \(D'\)? In other words, how well does the learned model generalize and offer predictive power? In order to answer this question, in practice, you must hold back some of your observed data when learning the model and can then measure how well the model “postdicts” the held-back data. The held-back data is referred to as the test sample, and this process is known as cross validation.
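A minimal hold-out sketch, assuming synthetic data and polynomial models of increasing degree: each model is fitted on the training subset only and then scored on both the training data and the held-back test sample.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Hold back 10 of the 40 samples as the test sample.
idx = rng.permutation(x.size)
train, test = idx[:30], idx[30:]

def mse(degree, eval_idx):
    """Fit on the training subset, then evaluate on the requested subset."""
    coef = np.polyfit(x[train], y[train], degree)
    return np.mean((np.polyval(coef, x[eval_idx]) - y[eval_idx]) ** 2)

degrees = (1, 3, 9)
train_mse = {d: mse(d, train) for d in degrees}
test_mse = {d: mse(d, test) for d in degrees}
for d in degrees:
    print(d, train_mse[d], test_mse[d])
```

The training error can only decrease as the degree grows, because each polynomial family contains the previous one; the test error carries no such guarantee, which is what makes it a useful measure of generalization.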

In cases where the model and its parameters have some physical reality (for example, projectile motion modeled with Newtonian physics and parameterized by \(g\), …), these measures are essentially the same since the model is dictated by some independent reality and not chosen specifically to explain the observed data \(D\).

However, when predicting future data is the main goal and there is no first-principles model available, the model and its parameters are essentially unconstrained and these two measures can easily diverge. In particular, optimizing how well the model explains the observed data leads to over-fitting and poor ability to generalize to new data.

For example, suppose the observed data \(D\) consists of \(N\) samples \(x_i\) of a single feature \(x\). Then a model \(M\) with the likelihood (where \(\delta_D\) is the Dirac delta function)

\[ \Large P(x\mid \Theta_M, M) = \frac{1}{N}\, \sum_{i=1}^N \delta_D(x - x_i) \]

trivially explains the data perfectly with the parameters

\[ \Large \Theta_M = \{ x_1, x_2, \ldots, x_N \} \; . \]

This purely empirical approach to model building is an extreme case of over-fitting and offers no generalization power. (Note that the likelihood above is a kernel density estimate with a Dirac delta function kernel.)
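The delta-function limit can be approached numerically with a Gaussian-kernel density estimate whose bandwidth \(h\) shrinks toward zero. The sketch below uses assumed synthetic data and bandwidth values; the small floor added inside the logarithm is only there to avoid taking the log of zero when the kernels underflow.

```python
import numpy as np

rng = np.random.default_rng(7)
train = rng.normal(size=100)   # observed data D
test = rng.normal(size=100)    # held-back data D'

def mean_log_likelihood(points, centers, h):
    """Mean log-density of `points` under a Gaussian KDE centered on `centers`."""
    z = (points[:, None] - centers[None, :]) / h
    density = np.mean(np.exp(-0.5 * z**2), axis=1) / (h * np.sqrt(2.0 * np.pi))
    return np.mean(np.log(density + 1e-300))

bandwidths = [0.3, 0.1, 0.01, 0.001]
results = {h: (mean_log_likelihood(train, train, h),
               mean_log_likelihood(test, train, h)) for h in bandwidths}
for h, (ll_train, ll_test) in results.items():
    print(h, ll_train, ll_test)
```

As \(h\) shrinks, the model explains the training data ever better while its description of the held-back data collapses, which is the over-fitting behavior described above.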


Acknowledgments#

  • Initial version: Mark Neubauer

© Copyright 2023