INITIALIZATION
Knowledgebase: ki-dev-large
Base Query: What is machine learning?
Model: gpt-4o
Use Curl?: 
==================================================
**Elapsed Time: 0.00 seconds**
==================================================


ROUTING
Query type: summary
==================================================
**Elapsed Time: 1.85 seconds**
==================================================


RAG PARAMETERS
Max Context To Include: 100
Lowest Score to Consider: 0.1
==================================================
**Elapsed Time: 0.00 seconds**
==================================================


VECTOR SEARCH ALGORITHM TO USE 
Use MMR search?: True
Use Similarity search?: False
==================================================
**Elapsed Time: 0.10 seconds**
==================================================


VECTOR SEARCH DONE 
==================================================
**Elapsed Time: 6.13 seconds**
==================================================


PRIMER 
Primer: 
            You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that
            can read knowledgebases through the "CONTEXT" that is included in the user's chat message.

        
            Your role is to act as an expert at summarization and analysis.

            In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality.

            Be honest by admitting when a topic falls outside your scope of knowledge, and suggest
            alternative avenues for obtaining information when necessary.

            Make effective use of chat history to avoid redundancy and enhance response relevance, continuously
            adapting to integrate all necessary details in your interactions.

            Use as much tokens as possible to provide a detailed response.
        
==================================================
**Elapsed Time: 0.18 seconds**
==================================================


FINAL QUERY 
Final Query: CONTEXT: ##########
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 1

Context: # A Brief Introduction to Machine Learning for Engineers

(2018), “A Brief Introduction to Machine Learning for Engineers”, Vol. XX, No. XX, pp 1–231. DOI: XXX.

Osvaldo Simeone  
Department of Informatics  
King’s College London  
osvaldo.simeone@kcl.ac.uk
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 2

Context: # Contents

I Basics 5

1 Introduction 6  
   1.1 What is Machine Learning? 6  
   1.2 When to Use Machine Learning? 8  
   1.3 Goals and Outline 11  

2 A Gentle Introduction through Linear Regression 15  
   2.1 Supervised Learning 15  
   2.2 Inference 17  
   2.3 Frequentist Approach 19  
   2.4 Bayesian Approach 36  
   2.5 Minimum Description Length (MDL) 42  
   2.6 Information-Theoretic Metrics 44  
   2.7 Interpretation and Causality 47  
   2.8 Summary 49  

3 Probabilistic Models for Learning 51  
   3.1 Preliminaries 52  
   3.2 The Exponential Family 53  
   3.3 Frequentist Learning 59
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 3

Context: # Learning Theory

## Table of Contents

3.4 Bayesian Learning ......................................... 63  
3.5 Supervised Learning via Generalized Linear Models (GLM) ... 69  
3.6 Maximum Entropy Property* ................................. 71  
3.7 Energy-based Models* ...................................... 72  
3.8 Some Advanced Topics* ..................................... 73  
3.9 Summary ................................................... 74  

## II Supervised Learning

### 75  

4 Classification ................................................ 76  
4.1 Preliminaries: Stochastic Gradient Descent ............ 77  
4.2 Classification as a Supervised Learning Problem ...... 78  
4.3 Discriminative Deterministic Models ..................... 80  
4.4 Discriminative Probabilistic Models: Generalized Linear Models ............................................... 90  
4.5 Discriminative Probabilistic Models: Beyond GLM ..... 96  
4.6 Generative Probabilistic Models .......................... 102  
4.7 Boosting .................................................... 106  
4.8 Summary ................................................... 107  

## 5 Statistical Learning Theory*

### 113  

5.1 A Formal Framework for Supervised Learning .......... 114  
5.2 PAC Learnability and Sample Complexity ............... 119  
5.3 PAC Learnability for Finite Hypothesis Classes ....... 120  
5.4 VC Dimension and Fundamental Theorem of PAC Learning .................................................... 124  
5.5 Summary ................................................... 126  

## III Unsupervised Learning

### 129  

6 Unsupervised Learning ....................................... 130  
6.1 Unsupervised Learning ................................... 131  
6.2 K-Means Clustering ...................................... 134  
6.3 ML, ELBO and EM ......................................... 136  
6.4 Directed Generative Models ............................. 148  
6.5 Undirected Generative Models ............................ 155
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 9

Context: # Acronyms

- **AI:** Artificial Intelligence
- **AMP:** Approximate Message Passing
- **BN:** Bayesian Network
- **DAG:** Directed Acyclic Graph
- **ELBO:** Evidence Lower Bound
- **EM:** Expectation Maximization
- **ERM:** Empirical Risk Minimization
- **GAN:** Generative Adversarial Network
- **GLM:** Generalized Linear Model
- **HMM:** Hidden Markov Model
- **i.i.d.:** independent identically distributed
- **KL:** Kullback-Leibler
- **LASSO:** Least Absolute Shrinkage and Selection Operator
- **LBP:** Loopy Belief Propagation
- **LL:** Log-Likelihood
- **LLR:** Log-Likelihood Ratio
- **LS:** Least Squares
- **MC:** Monte Carlo
- **MCMC:** Markov Chain Monte Carlo
- **MDL:** Minimum Description Length
- **MFVI:** Mean Field Variational Inference
- **MI:** Maximum Likelihood
- **NRF:** Markov Random Field
- **NLL:** Negative Log-Likelihood
- **PAC:** Probably Approximately Correct
- **pdf:** probability density function
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 12

Context: # 1 Introduction

Having taught courses on machine learning, I am often asked by colleagues and students with a background in engineering to suggest “the best place to start” to get into this subject. I typically respond with a list of books – for a general, but slightly outdated introduction, read this book; for a detailed survey of methods based on probabilistic models, check this other reference; to learn about statistical learning, I found this text useful; and so on. This answer strikes me, and most likely also my interlocutors, as quite unsatisfactory. This is especially so since the size of many of these books may be discouraging for busy professionals and students working on other projects. This monograph is an attempt to offer a basic and compact reference that describes key ideas and principles in simple terms and within a unified treatment, encompassing also more recent developments and pointers to the literature for further study.

## 1.1 What is Machine Learning?

A useful way to introduce the machine learning methodology is by means of a comparison with the conventional engineering design flow.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 13

Context: # 1.1. What is Machine Learning?

This starts with an in-depth analysis of the problem domain, which culminates with the definition of a mathematical model. The mathematical model is meant to capture the key features of the problem under study and is typically the result of the work of a number of experts. The mathematical model is finally leveraged to derive hand-crafted solutions to the problem.

For instance, consider the problem of defining a chemical process to produce a given molecule. The conventional flow requires chemists to leverage their knowledge of models that predict the outcome of individual chemical reactions, in order to craft a sequence of suitable steps that synthesize the desired molecule. Another example is the design of speech translation or image/video compression algorithms. Both of these tasks involve the definition of models and algorithms by teams of experts, such as linguists, psychologists, and signal processing practitioners, not infrequently during the course of long standardization meetings.

The engineering design flow outlined above may be too costly and inefficient for problems in which faster or less expensive solutions are desirable. The machine learning alternative is to collect large data sets, e.g., of labeled speech, images, or videos, and to use this information to train general-purpose learning machines to carry out the desired task. While the standard engineering flow relies on domain knowledge and on design optimized for the problem at hand, machine learning lets large amounts of data dictate algorithms and solutions. To this end, rather than requiring a precise model of the set-up under study, machine learning requires the specification of an objective, of a model to be trained, and of an optimization technique.

Returning to the first example above, a machine learning approach would proceed by training a general-purpose machine to predict the outcome of known chemical reactions based on a large data set, and then by using the trained algorithm to explore ways to produce more complex molecules. In a similar manner, large data sets of images or videos would be used to train a general-purpose algorithm with the aim of obtaining compressed representations from which the original input can be recovered with some distortion.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 14

Context: # 1.2 When to Use Machine Learning?

Based on the discussion above, machine learning can offer an efficient alternative to the conventional engineering flow when development cost and time are the main concerns, or when the problem appears to be too complex to be studied in its full generality. On the flip side, the approach has the key disadvantages of providing generally suboptimal performance, or hindering interpretability of the solution, and to apply only to a limited set of problems.

In order to identify tasks for which machine learning methods may be useful, reference [31] suggests the following criteria:

1. The task involves a function that maps well-defined inputs to well-defined outputs.
2. Large data sets exist or can be created containing input-output pairs.
3. The task provides clear feedback with clearly definable goals and metrics.
4. The task does not involve long chains of logic or reasoning that depend on diverse background knowledge or common sense.
5. The task does not require detailed explanations for how the decision was made.
6. The task has a tolerance for error and no need for provably correct or optimal solutions.
7. The phenomenon or function being learned should not change rapidly over time.
8. No specialized dexterity, physical skills, or mobility is required.

These criteria are useful guidelines for the decision of whether machine learning methods are suitable for a given task of interest. They also offer a convenient demarcation line between machine learning as is intended today, with its focus on training and computational statistics tools, and more general notions of Artificial Intelligence (AI) based on knowledge and common sense [87] (see [126] for an overview on AI research).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 15

Context: # 1.2. When to Use Machine Learning?

## 1.2.1 Learning Tasks

We can distinguish among three different main types of machine learning problems, which are briefly introduced below. The discussion reflects the focus of this monograph on parametric probabilistic models, as further elaborated on in the next section.

1. **Supervised learning:** We have \( n \) labeled training examples \( D = \{(x_i, y_i)\}_{i=1}^n \), where \( x_i \) represents a covariate, or explanatory variable, while \( y_i \) is the corresponding label, or response. For instance, variable \( y_i \) may represent the text of an email, while \( y_i \) may be a binary variable indicating whether the email is spam or not. The goal of supervised learning is to predict the value of the label \( y \) for an input \( x \) that is not in the training set. In other words, supervised learning aims at generalizing the observations in the data set \( D \) to new inputs. For example, an algorithm trained on a set of emails should be able to classify a new email not present in the data set \( D \).

   We can generally distinguish between classification problems, in which the label \( y \) is discrete, as in the example above, and regression problems, in which variable \( y \) is continuous. An example of a regression task is the prediction of tomorrow’s temperature \( t \) based on today’s meteorological observations \( x \).

   An effective way to learn a predictor is to identify from the data set \( D \) a predictive distribution \( p(y|x) \) from a set of parametrized distributions. The conditional distribution \( p(y|x) \) defines a profile of beliefs over all possible values of the label \( y \) given the input \( x \). For instance, for temperature prediction, one could learn mean and variance of a Gaussian distribution \( p(y|x) \) as a function of the input \( x \). As a special case, the output of a supervised learning algorithm may be in the form of a deterministic predictive function \( \hat{y} = f(x) \).

2. **Unsupervised learning:** Suppose now that we have an unlabeled set of training examples \( D = \{x_i\}_{i=1}^n \). Less well defined than supervised learning, unsupervised learning generally refers to the task of learning properties of the mechanism that generates this data set. Specific tasks and applications include clustering, which is the problem of grouping similar examples \( x_i \); dimensionality reduction, feature extraction, and representation learning, all related to the problem of simplifying data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 16

Context: # Introduction

representing the data in a smaller or more convenient space; and generative modelling, which is the problem of learning a generating mechanism to produce artificial examples that are similar to available data in the data set \( D \).

As a generalization of both supervised and unsupervised learning, **semi-supervised learning** refers to scenarios in which not all examples are labelled, with the unlabelled examples providing information about the distribution of the covariates \( x \).

## 3. Reinforcement learning

Reinforcement learning refers to the problem of inferring optimal sequential decisions based on rewards or punishments received as a result of previous actions. Under supervised learning, the “label” refers to an action to be taken when the learner is in an informational state about the environment given by a variable \( x \). Upon taking an action \( a \) in a state \( x \), the learner is provided with feedback on the immediate reward accrued via this decision, and the environment moves on to a different state. As an example, an agent can be trained to navigate a given environment in the presence of obstacles by penalizing decisions that result in collisions.

Reinforcement learning is hence neither supervised, since the learner is not provided with the optimal actions \( t \) to select in a given state \( x \); nor is it fully unsupervised, given the availability of feedback on the quality of the chosen action. Reinforcement learning is also distinguished from supervised and unsupervised learning due to the influence of previous actions on future states and rewards.

This monograph focuses on supervised and unsupervised learning. These general tasks can be further classified along the following dimensions:

- **Passive vs. active learning**: A passive learner is given the training examples, while an active learner can affect the choice of training examples on the basis of prior observations.
  
- **Offline vs. online learning**: Offline learning operates over a batch of training samples, while online learning processes samples in a streaming fashion. Note that reinforcement learning operates inherently in an online manner, while supervised and unsupervised learning can be carried out by following either offline or online formulations.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 17

Context: # 1.3 Goals and Outline

This monograph considers only passive and offline learning.

## 1.3 Goals and Outline

This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. Unlike other texts that are focused on one particular aspect of the field, an effort has been made here to provide a broad but concise overview in which the main ideas and techniques are systematically presented. Specifically, the material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra. A prior exposure to information theory is useful but not required.

Detailed discussions are provided on basic concepts and ideas, including overfitting and generalization, Maximum Likelihood and regularization, and Bayesian inference. The text also endeavors to provide intuitive explanations and pointers to advanced topics and research directions. Sections and subsections containing more advanced material that may be regarded at a first reading are marked with a star (+).

The reader will find here further discussions on computing platform or programming frameworks, such as map-reduce, nor details on specific applications involving large data sets. These can be easily found in a vast and growing body of work. Furthermore, rather than providing exhaustive details on the existing myriad solutions in each specific category, techniques have been selected that are useful to illustrate the most salient aspects. Historical notes have also been provided only for a few selected milestone events.

Finally, the monograph attempts to strike a balance between the algorithmic and theoretical viewpoints. In particular, all learning algorithms are presented in a manner that emphasizes their theoretical foundations while also providing practical insights.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 18

Context: # Introduction

Algorithms are introduced on the basis of theoretical arguments, often based on information-theoretic measures. Moreover, a chapter is devoted to statistical learning theory, demonstrating how to set the field of supervised learning on solid theoretical foundations. This chapter is more theoretically involved than the others, and proofs of some key results are included in order to illustrate the theoretical underpinnings of learning. This contrasts with other chapters, in which proofs of few theoretical results are kept at a minimum in order to focus on the main ideas.

The rest of the monograph is organized into five parts. The first part covers introductory material. Specifically, Chapter 2 introduces the frequentist, Bayesian and Minimum Description Length (MDL) learning frameworks; the discriminative and generative categories of probabilistic models; as well as key concepts such as training, loss, generalization, and overfitting – all in the context of a simple linear regression problem. Information-theoretic metrics are also briefly introduced, as well as the advanced topics of interpretation and causality. Chapter 3 then provides an introduction to the exponential family of probabilistic models, to Generalized Linear Models (GLMs), and to energy-based models, emphasizing main properties that will be invoked in later chapters.

The second part concerns supervised learning. Chapter 4 covers linear and non-linear classification methods via discriminative and generative models, including Support Vector Machines (SVMs), kernel methods, logistic regression, multi-layer neural networks, and boosting. Chapter 5 is a brief introduction to the statistical learning framework of the Probably Approximately Correct (PAC) theory, covering the Vapnik-Chervonenkis (VC) dimension and the fundamental theorem of PAC learning.

The third part, consisting of a single chapter, introduces unsupervised learning. In particular, in Chapter 6, unsupervised learning models are addressed by distinguishing among directed models, for which Expectation Maximization (EM) is derived as the iterative maximization of the Evidence Lower Bound (ELBO); undirected models, for which Restricted Boltzmann Machines (RBMs) are discussed as a representative example; and discriminative models trained using the InfoMax.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 21

Context: # 2
## A Gentle Introduction through Linear Regression

In this chapter, we introduce the frequentist, Bayesian, and MDL learning frameworks, as well as key concepts in supervised learning, such as discriminative and generative models, training loss, generalization, and overfitting. This is done by considering a simple linear regression problem as a recurring example. We start by introducing the problem of supervised learning and by presenting some background on inference. We then present the frequentist, Bayesian, and MDL learning approaches in this order. The treatment of MDL is limited to an introductory discussion, as the rest of the monograph concentrates on frequentist and Bayesian viewpoints. We conclude with an introduction to the important topic of information-theoretic metrics, and with a brief introduction to the advanced topics of causal inference and interpretation.

## 2.1 Supervised Learning

In the standard formulation of a supervised learning problem, we are given a training set \( D \) containing \( N \) training points \( (x_n, y_n) \), \( n = 1, \ldots, N \). The observations \( x_n \) are considered to be free variables, and known as covariates, domain points, or explanatory variables; while the target
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 22

Context: ```markdown
variables \( t_n \) are assumed to be dependent on \( x_n \), and are referred to as dependent variables, labels, or responses. An example is illustrated in Fig. 2.1. We use the notation \( \mathbf{x}_D = (x_1, \ldots, x_N)^T \) for the covariates and \( \mathbf{t}_D = (t_1, \ldots, t_N)^T \) for the labels in the training set \( D \). Based on this data, the goal of supervised learning is to identify an algorithm to predict the label \( t \) for a new, that is, as of yet unobserved, domain point \( x \).

![Figure 2.1: Example of a training set \( D \) with \( N = 10 \) points \( (x_n, t_n) \), \( n = 1, \ldots, N \).](path_to_image)

The outlined learning task is clearly impossible in the absence of additional information on the mechanism relating variables \( z \) and \( t \). With reference to Fig. 2.1, unless we assume, say, that \( z \) and \( t \) are related by a function \( t = f(z) \) with some properties, such as smoothness, we have no way of predicting the label \( t \) for an unobserved domain point \( z \). This observation is formalized by the no free lunch theorem to be reviewed in Chapter 5: one cannot learn rules that generalize to unseen examples without making assumptions about the mechanism generating the data. The set of all assumptions made by the learning algorithm is known as the inductive bias.

This discussion points to a key difference between memorizing and learning. While the former amounts to mere retrieval of a value \( t_n \), 
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 23

Context: 2.2. Inference
====================

Corresponding to an already observed pair \((x_t, y_t) \in \mathcal{D}\), learning entails the capability to predict the value \(t\) for an unseen domain point \(x\). Learning, in other words, converts experience – in the form of \(\mathcal{D}\) – into expertise or knowledge – in the form of a predictive algorithm. This is well captured by the following quote by Jorge Luis Borges: “To think is to forget details, generalize, make abstractions.” [138]

By and large, the goal of supervised learning is that of identifying a predictive algorithm that minimizes the generalization loss, that is, the error in the prediction of a new label \(t\) for an unobserved explanatory variable \(x\). How exactly to formulate this problem, however, depends on one’s viewpoint on the nature of the model that is being learned. This leads to the distinction between the frequentist and the Bayesian approaches, which is central to this chapter. As will be discussed, the MDL philosophy deviates from the mentioned focus on prediction as the goal of learning, by targeting instead a parsimonious description of the data set \(\mathcal{D}\).

### 2.2 Inference

Before we start our discussion of learning, it is useful to review some basic concepts concerning statistical inference, as they will be needed throughout this chapter and in the rest of this monograph. We specifically consider the inference problem of predicting a \(y\) given the observation of another \(x\) under the assumption that their joint distribution \(p(x, y)\) is known. As a matter of terminology, it is noted that here we will use the term “inference” as it is typically intended in the literature on probabilistic graphical models (see, e.g., [81]), thereby diverging from its use in other branches of the machine learning literature (see, e.g., [23]).

In order to define the problem of optimal inference, one needs to define a non-negative loss function \(\ell(t, \hat{t})\). This defines the cost, or loss or risk, incurred when the correct value is \(t\) while the estimate is \(\hat{t}\). An important example is the \(\ell_t\) loss:

\[
\ell_t(\hat{t}) = |t - \hat{t}|, \quad (2.1)
\]

which includes as a special case the quadratic loss \(\ell_q(t, \hat{t}) = (t - \hat{t})^2\).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 25

Context: # 2.3 Frequentist Approach

The goal of supervised learning methods is broadly speaking that of obtaining a predictor \( \hat{t}(x) \) that performs close to the optimal predictor \( t^* \) based only on the training set \( D \), and hence without knowledge of the joint distribution \( p(x, t) \). The closeness in performance is measured by the difference between the generalization loss \( L_p(\hat{t}) \) achieved by the trained predictor and the minimum generalization loss \( L_p(t^*) \) of the optimal predictor, which depends on the true distribution \( p(x, t) \). Strictly speaking, this statement applies only for the frequentist approach, which is discussed next. As will be explained later in the chapter, in fact, while the Bayesian approach still centers around the goal of prediction, its modelling assumptions are different. Furthermore, the MDL approach concentrates on the task of data compression rather than prediction.

## 2.3 Frequentist Approach

According to the frequentist viewpoint, the training data points \( (x_n, t_n) \in D \) are independent identically distributed (i.i.d.) rvs drawn from a true and unknown distribution \( p(x, t) \):

\[
(x_n, t_n) \sim p(x, t), \quad i = 1, \ldots, N.
\]
  
The new observation \( (x, t) \) is also independently generated from the same true distribution \( p(x, t) \); the domain point \( x \) is observed and the label \( t \) must be predicted. Since the probabilistic model \( p(x, t) \) is not known, one cannot solve directly problem (2.3) to find the optimal prediction that minimizes the generalization loss \( L_p \) in (2.2).

Before discussing the available solutions to this problem, it is worth observing that the definition of the “true” distribution \( p(x, t) \) depends on how the data is collected. As in the example of the “beauty AI” context, if the rankings \( t_n \) assigned to pictures \( x_n \) of faces are affected by racial biases, the distribution \( p(x, t) \) will reflect these prejudices and produce skewed results [62].

## Taxonomy of solutions

There are two main ways to address the problem of learning how to perform inference when not knowing the distribution \( p(x, t) \):
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 26

Context: # A Gentle Introduction through Linear Regression

- **Separate learning and (plug-in) inference:** Learn first an approximation, say \( p_\theta(t|x) \), of the conditional distribution \( p(t|x) \) based on the data \( D \), and then plug this approximation in (2.3) to obtain an approximation of the optimal decision as

  \[
  \hat{t}_\theta(x) = \arg \min_t E_{p_\theta(t|x)}[\ell(t, \widehat{y})]. \tag{2.6}
  \]

- **Direct inference via Empirical Risk Minimization (ERM):** Learn directly an approximation \( \hat{t}_\theta(\cdot) \) of the optimal decision rule by minimizing an empirical estimate of the generalization loss (2.2) obtained from the data set as

  \[
  \hat{t}_\theta(\cdot) = \arg \min_t L_p(\hat{t}). \tag{2.7}
  \]

  where the empirical risk, or empirical loss, is

  \[
  L_p(\hat{t}) = \frac{1}{N} \sum_{n=1}^{N} \ell(t_n, \hat{t}(x_n)). \tag{2.8}
  \]

The notation \( L_p(\hat{t}) \) highlights the dependence of the empirical loss on the predictor \( \hat{t} \) and on the training set \( D \).

In practice, as we will see, both approaches optimize a set of parameters that define the probabilistic model or the predictor. Furthermore, the first approach is generally more flexible, since having an estimate \( p(t|x) \) of the posterior distribution \( p(t|x) \) allows the prediction (2.6) to be computed for any loss function. In contrast, the ERM solution (2.7) is tied to a specific choice of the loss function \( \ell \). In the rest of this section, we will start by taking the first approach, and discuss later how this relates to the ERM formulation.

## Linear regression example.

For concreteness, in the following, we will consider the following running example inspired by [23]. In the example, data is generated according to the true distribution \( p(t|x) = p(x)p(t|x) \), where \( x \) is 

\[
x = z \sim N(\sin(2\pi z), 0.1). \tag{2.9}
\]

The training set in Fig. 2.1 was generated from this distribution. If this true distribution were known, the optimal predictor under the \( \ell_2 \) loss would be
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 29

Context: # 2.3. Frequentist Approach

Determination of a specific model \( p(y|x, \theta) \) in the class called for the selection of the parameter vector \( \theta = (\nu, \beta) \). As we will see, these two types of variables play a significantly different role during learning and should be clearly distinguished, as discussed next.

## 1. Model order \( M \) (and hyperparameters)

The model order defines the "capacity" of the hypothesis class, that is, the number of degrees of freedom in the model. The larger \( M \) is, the more capable a model is to fit the available data. For instance, in the linear regression example, the model order determines the size of the weight vector \( w \). More generally, variables that define the class of models to be learned are known as hyperparameters. As we will see, determining the model order, and more broadly the hyperparameters, requires a process known as validation.

## 2. Model parameters \( \theta \)

Assigning specific values to the model parameters \( \theta \) identifies a hypothesis within the given hypothesis class. This can be done by using learning criteria such as Maximum Likelihood (ML) and Maximum a Posteriori (MAP).

We postpone a discussion of validation to the next section, and we start by introducing the ML and MAP learning criteria.

### 2.3.3 Maximum Likelihood (ML) Learning

Assume now that the model order \( M \) is fixed, and that we are interested in learning the model parameters \( \theta \). The ML criterion selects a value of \( \theta \) under which the training set \( D \) has the maximum probability of being observed. In other words, the value \( \theta \) selected by ML is the most likely to have generated the observed training set. Note that there might be more than one such value.

To proceed, we need to write the probability (density) of the observed labels \( y \) in the training set \( D \) given the corresponding domain points \( x \). Under the assumed discriminative model, this is given as:

\[
p(y|x, w, \beta) = \prod_{n=1}^{N} p(y_n | x_n, w, \beta)
\]

\[
= \prod_{n=1}^{N} N(t_n|h_n(x_n), \beta^{-1})
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 31

Context: # 2.3. Frequentist Approach

Note that, in (2.18), we have assumed the typical overdetermined case in which the inequality \( N > (M + 1) \) holds. More generally, one has the ML solution \( w_{ML} = X^{\top} \beta \). Finally, differentiating the NLL with respect to \( \beta \) yields the ML estimate

\[
\frac{1}{\beta_{ML}} = L_D(w_{ML}).
\]

## Figure 2.2: 
Illustration of underfitting and overfitting in ML learning: The dashed line is the optimal predictor (2.10), which depends on the unknown true distribution, while the other lines correspond to the predictor \( \hat{y}(x) = \mu(x, w_{ML}) \) learned via ML with different model orders \( M \).

### Overfitting and Underfitting
Adopting the \( \ell_2 \) loss, let us now compare the predictor \( \hat{y}_{ML}(x) = \mu(x, w_{ML}) \) learned via ML with the optimal, but unknown, predictor \( r^*(x) \) in (2.10). To this end, Fig. 2.2 shows the optimal predictor \( r^*(x) \) as a dashed line and the ML-based predictor \( \hat{y}_{ML}(x) \) obtained with different values of the model order \( M \) for the training set \( D \) in Fig. 2.1 (also shown in Fig. 2.2 for reference).

We begin by observing that, with \( M = 1 \), the ML predictor underfits the data: the model is not rich enough to capture the variations present in the data. As a result, the training loss \( L_D(w_{ML}) \) in (2.16) is large.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 33

Context: # 2.3. Frequentist Approach

Since the latter accounts for all pairs \((x, y) \sim p(x, y)\), while the former includes only the training points used for learning. More importantly, the key observation here is that increasing \(M\) allows one to better fit—and possibly overfit—the training set, hence reducing \(L_p(M)\). The generalization loss \(L_p(M)\) instead tends to decrease at first, as we move away from the underfitting regime, but it eventually increases for sufficiently large \(M\). The widening of the gap between training and generalization provides evidence that overfitting is occurring. From Fig. 2.3, we can hence conclude that, in this example, model orders larger than \(M = 7\) should be avoided since they lead to overfitting, while model orders less than \(M = 3\) should also not be considered in order to avoid underfitting.

![Figure 2.4](#)  
**Figure 2.4**: Square root of the generalization loss \(L_p(M)\) and of the training loss \(L_p(M)\) as a function of the training set size \(N\). The asymptote of the generalization and training losses is given by the minimum generalization loss \(L_p(w^*)\) (cf. (2.21)) achievable for the given model order (see Fig. 2.5).

## What if we had more data?

Extrapolating from the behavior observed in Fig. 2.2, we can surmise that, as the number \(N\) of data points increases, overfitting is avoided even for larger values of \(M\). In fact, when the training set is big as compared to the number of parameters in \(\theta\),
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 38

Context: # A Gentle Introduction through Linear Regression

The manifestation of overfitting is the large value of norm \( \| w \| \) of the vector of weights. We can use this observation as prior information, that is as part of the inductive bias, in designing a learning algorithm.

To this end, we can impose a prior distribution on the vector of weights that gives lower probability to large values. A possible, but not the only way to do this, is to assume a Gaussian prior as

$$
w \sim \mathcal{N}(0, \alpha^{-1}I),
$$

so that all weights are a priori i.i.d. zero-mean Gaussian variables with variance \( \alpha^{-1} \). Increasing \( \alpha \) forces the weights to be smaller as it reduces the probability associated with large weights. The precision variable \( \alpha \) is an example of a hyperparameter. In a Bayesian framework, hyperparameters control the distribution of the model parameters. As anticipated, hyperparameters are determined via validation.

Rather than maximizing the LL, that is, the probability density \( p(t|x, w, \beta) \) of the labels in the training set, for ML, the MAP criterion prescribes the maximization of the joint probability distribution of weights and labels given the prior \( p(w) = \mathcal{N}(w|0, \alpha^{-1}I) \), that is,

$$
p(t|x, w, \beta) = p(w) \prod_{n=1}^{N} p(t_n|x_n, w, \beta).
$$

Note that a prior probability can also be assumed for the parameter \( \beta \), but in this example we leave \( \beta \) as a deterministic parameter. The MAP learning criterion can hence be formulated as

$$
\min_{w} - \sum_{n=1}^{N} \ln p(t_n|x_n, w, \beta) - \ln p(w).
$$

The name “Maximum a Posteriori” is justified by the fact that problem (2.26) is equivalent to maximizing the posterior distribution of the parameters given the available data, as we will further discuss in the next section. This yields the following problem for the weight vector:

$$
\min_{w} L_D(w) + \frac{\lambda}{2} \| w \|^2,
$$

where we have defined \( \lambda = \alpha / \beta \) and we recall that the training loss is \( L_D(w) = N^{-1} \| t - Xw \|^2 \).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 41

Context: ## 2.3 Frequentist Approach

### 2.3.5 Regularization

We have seen above that the MAP learning criterion amounts to the addition of a regularization function \( R(\theta) \) to the ML or ERM learning losses. This function penalizes values of the weight vector \( \mathbf{w} \) that are likely to occur in the presence of overfitting, or, generally, that are improbable on the basis of the available prior information. The net effect of this addition is to effectively decrease the capacity of the model, as the set of values for the parameter vector \( \theta \) that the learning algorithm is likely to choose from is reduced. As a result, seen, regularization can control overfitting and its optimization requires validation.

Regularization generally refers to techniques that aid in reducing overfitting during training. The discussion in the previous subsection has focused on a specific form of regularization that is grounded in a probabilistic interpretation in terms of MAP learning. We note that the same techniques, such as ridge regression and LASSO, can also be introduced independently of a probabilistic framework in an ERM formulation. Furthermore, apart from the discussed addition of regularization terms to the empirical risk, there are other ways to perform regularization.

One approach is to modify the optimization scheme by using techniques such as early stopping [56]. Another is to augment the training set by generating artificial examples and hence effectively increasing the number \( N \) of training examples. Related to this idea is the technique known as bagging. With bagging, we first create a number \( K \) of bootstrap data sets. Bootstrap data sets are obtained by selecting \( N \) data points uniformly with replacement from \( D \) so that the same data point generally appears multiple times in the bootstrap data set. Then, we train the model \( K \) times, each time over a different bootstrap set. Finally, we average the results obtained from the models using equal weights. If the errors accrued by the different models were independent, bagging would yield an estimation error that decreases with \( K \). In practice, significantly smaller gains are obtained, particularly for large \( K \), given that the bootstrap data sets are all drawn from \( D \) and hence the estimation errors are not independent [23].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 44

Context: # Regression

Figure 2.7: Bayesian Network (BN) describing the joint distribution (2.32) of the weight vector \( w \) of the labels in the training data \( D \) and in the new example, as used in the Bayesian approach.

As mentioned, we are interested in computing the posterior probability \( p(w|D, t) \) of the new label \( t \) given the training data \( D \) and the new domain point \( x = \mathbf{x} \). Dropping again the domain variables to simplify the notation, we apply standard rules of probability to obtain:

\[ 
p(t|D) = \frac{p(t|D, w) p(w|D)}{p(t|D)} = \int p(t|w)p(w|D)dw \tag{2.33} 
\]

where the second equality follows from the marginalization rule \( p(t|D, t) = \int p(t|w, t)dw \) and the last equality from Bayes' theorem. Putting back the dependence on the domain variables, we obtain the predictive distribution as

\[ 
p(t|x, D) = \int p(w|D) p(t|x, w)dw \tag{2.34} 
\]

This is the key equation. Accordingly, the Bayesian approach considers the predictive probability \( p(t|x, w) \) associated with each value of the weight vector \( w \) weighted by the posterior belief 

\[ 
p(w|D) = \frac{p(D|w)p(w)}{p(D)} \tag{2.35} 
\]

The posterior belief \( p(w|D) \), which defines the weight of the parameter vector \( w \), is therefore proportional to the prior belief \( p(w) \) multiplied by the correction \( p(D|w) \) due to the observed data.

Computing the posterior \( p(w|D) \), and a fortiori also the predictive distribution \( p(t|x, D) \), is generally a difficult task that requires the adoption of approximate inference techniques covered in Chapter 8. For this
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 48

Context: 42  
# A Gentle Introduction through Linear Regression  

In fact, one still needs to set the hyperparameter \( \alpha \). As seen in Fig. 2.9, varying \( \alpha \) can lead to different conclusions on the optimal value of \( M \). An alternative approach would be to treat \( \alpha \) and \( M \) as rvs with given priors to be specified (see, e.g., [131]). This would not obviate the problem of selecting hyperparameters—now defining the prior distributions of \( \alpha \) and \( M \)—but it can lead to powerful hierarchical models. The necessary tools will be discussed in Chapter 7.  

As a final note, rather than using often impractical exhaustive search methods, the optimization over the hyperparameters and the model under \( M \) for all criteria discussed so far, namely ML, MAP, and Bayesian, can be carried out using so-called Bayesian optimization tools [132]. A drawback of these techniques is that they have their own hyperparameters that need to be selected.  

## 2.5 Minimum Description Length (MDL)*  

In this section, we briefly introduce a third, conceptually distinct, learning philosophy—the MDL criterion. The reader is warned that the treatment here is rather superficial, and that a more formal definition of the MDL criterion would require a more sophisticated discussion, which can be found in [10]. Furthermore, some background in information theory is preferable in order to fully benefit from this discussion.  

To start, we first recall from the treatment above that learning requires the identification of a model, or hypothesis class—here the model under \( M \)—and of a specific hypothesis, defined by parameters \( \theta \) (here \( \theta = (y, \beta) \)) within the class. While MDL can be used for both tasks, we will focus here only on the first.  

To build the necessary background, we now need to review the relationship between probabilistic models and compression. To this end, consider a signal \( t \) taking values in a finite alphabet \( \mathcal{X} \), e.g., a pixel.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 52

Context: 46 A Gentle Introduction through Linear Regression

where equality holds if and only if the two distributions \( p \) and \( q \) are identical. Nevertheless, the KL divergence has also some seemingly unfavorable features, such as its non-symmetry, that is, the inequality \( KL(p||q) \neq KL(q||p) \). We will see in Chapter 8 that the absence of symmetry can be leveraged to define different types of approximate inference and learning techniques.

Importantly, the KL divergence can be written as

\[ 
KL(p||q) = \mathbb{E}_{x \sim p(x)}\left[-\ln g(x)\right] - \mathbb{E}_{x \sim p(x)}\left[-\ln p(x)\right] 
\]

(2.45)

where the first term, \( H(p||q) \), is known as cross-entropy between \( p(x) \) and \( q(x) \) and plays an important role as a learning criterion as discussed below; while the second term, \( H(q) \), is the entropy of distribution \( p(x) \), which is a measure of randomness. We refer to Appendix A for further discussion on the entropy.

Based on the decomposition (2.45), we observe that the cross-entropy \( H(p||q) \) can also be taken as a measure of divergence between two distributions when one is interested in optimizing over the distribution \( q(x) \), since the latter does not appear in the entropy term. Note that the cross-entropy, unlike the KL divergence, can be negative.

Using the definition (2.41), the expected log-loss on the right-hand side of (2.40) can be expressed as

\[ 
\mathbb{E}_{(x,y)}\left[-\ln p(y|x,w,\beta)\right] = \mathbb{E}_{q(x)}\left[H(p(y|x)||p(y|x,\beta))\right] 
\]

(2.46)

which can be easily verified by using the law of iterated expectations. Therefore, the average log-loss is the average over the domain point \( x \) of the cross-entropy between the real predictive distribution \( p(y|x) \) and the predictive distribution \( p(y|x, w) \) dictated by the model. From (2.46), the ML problem (2.15) can be interpreted as an attempt to make the model-based discriminative distribution \( p(y|x, w, \beta) \) as close as possible to the actual posterior \( p(y|x) \). This is done by minimizing the KL divergence, or equivalently the cross-entropy, upon averaging over \( p(x) \).

As final remarks, in machine learning, it is common to use the notation \( KL(p||q) \) even when \( q(x) \) is unnormalized, that is, when \( q(x) \) is
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 54

Context: # A Gentle Introduction through Linear Regression

![Figure 2.10: Illustration of Simpson's paradox [113]](path_to_image)

The way out of this conundrum is to leverage prior information we have about the problem domain. In fact, we can explain away this spurious correlation by including another measurable variable in the model, namely age. To see this, consider the same data, now redrawn by highlighting the age of the individual corresponding to each data point. The resulting plot, seen in Fig. 2.10 (bottom), reveals that older people — within the observed bracket — tend to have a higher cholesterol as well as to exercise more. Therefore, age is a common cause of both exercise and cholesterol levels. In order to capture the causal relationship between the latter variables, we hence need to adjust for age. Doing this requires to consider the trend within each age separately, recovering the expected conclusion that exercising is useful to lower one’s cholesterol.  

We conclude that in this example the correlation between \( z \) and \( t \), while useful for prediction, should not be acted upon for decision making. When assessing the causal relationship between \( z \) and \( t \), we should first understand which other variables may explain the observations and then discount any spurious correlations.

This discussion reveals an important limitation of most existing machine learning algorithms when it comes to identifying causality relationships, or, more generally, to answering counterfactual queries [112]. The study of causality can be carried out within the elegant framework of footnotes.  

7 This example is an instance of the so-called Simpson's paradox: patterns visible in the data disappear, or are even reversed, on the segregated data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 55

Context: # 2.8 Summary

In this chapter, we have reviewed three key learning frameworks, namely frequentist, Bayesian and MDL, within a parametric probabilistic setup. The frequentist viewpoint postulates the presence of a true unknown distribution for the data and aims at learning a predictor that generalizes well on unseen data drawn from this distribution. This can be done either by learning a probabilistic model to be plugged into the expression of the optimal predictor or by directly solving the ERM problem over the predictor. The Bayesian approach outputs a predictive distribution that combines prior information with the data by solving the inference problem of computing the posterior distribution over the unseen label. Finally, the MDL method aims at selecting a model that allows the data to be described with the smallest number of bits, hence doing away with the need to define the task of generalizing over unobserved examples.

The chapter has also focused extensively on the key problem of overfitting, demonstrating how the performance of a learning algorithm can be understood in terms of bias and estimation error. In particular, while choosing a hypothesis class is essential in order to enable learning, choosing the “wrong” class constitutes an irrecoverable bias that can make learning impossible. As a real-world example, as reported in [109], including as independent variables in the ZIP code of an individual seeking credit at a bank may disadvantage certain applicants or minorities. Another example of this phenomenon is the famous experiment by B. F. Skinner on pigeons [133].

We conclude this chapter by emphasizing an important fact about the probabilistic models that are used in modern machine learning applications. In frequentist methods, typically at least two (possibly conditional) distributions are involved: the empirical data distribution and the model distribution. The former amounts to the histogram of the data which, by the law of large numbers, tends to the real distribution.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 56

Context: 50  
# A Gentle Introduction through Linear Regression

When the number of data points goes to infinity; while the latter is parameterized and is subject to optimization. For this reason, divergence metrics between the two distributions play an important role in the development of learning algorithms. We will see in the rest of the monograph that other frequentist methods may involve a single distribution rather than two, as discussed in Sec. 6.6 or an additional, so-called variational distribution, as covered in Sec. 6.3 and Chapter 8.

In contrast, Bayesian methods posit a single coherent distribution over the data and the parameters, and frame the learning problem as one of inference of unobserved variables. As we will discuss in Chapter 8, variational Bayesian methods also introduce an additional variational distribution and are a building block for frequentist learning in the presence of unobserved variables.

The running example in this chapter has been one of linear regression for a Gaussian model. The next chapter provides the necessary tools to construct and learn more general probabilistic models.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 65

Context: # 3.3 Frequentist Learning

In this section, we provide general results concerning ML and MAP learning when the probabilistic model belongs to the exponential family. As seen in the previous chapter, with ML and MAP, one postulates that the \( N \) available data points \( x_p = \{x_1, \ldots, x_N\} \) are i.i.d. realizations from the probabilistic model \( p(x|\eta) \) as

\[
x_n \sim i.i.d. \, p(x|\eta), \quad n = 1, \ldots, N. 
\]

This data is used to estimate the natural parameters \( \eta \), or the corresponding mean parameters \( \mu \).

Using (3.5), the LL of the natural parameter vector given the observation \( x_p \) can be written as

\[
\ln p(x_p|\eta) = \sum_{n=1}^{N} \ln p(x_n|\eta) = -N A(\eta) + \eta^T \sum_{n=1}^{N} u(x_n) + \sum_{n=1}^{N} \ln z(x_n). 
\]

Therefore, neglecting terms independent of \( \eta \), we can write

\[
\ln p(x_p|\eta) = -N A(\eta) + \eta^T u(x_p). 
\]

### Mean Parameter

The mapping between the natural parameters and the mean parameters \( \mu = [\mu_0, \ldots, \mu_{J-1}]^T \) can be expressed in terms of the softmax function

\[
\mu = \text{softmax}(\eta). 
\]

The softmax function \( \text{softmax}(\eta) \) converts a vector of “scores” \( \eta \) into a probability vector. Furthermore, the function has the property that, as \( \eta \) grows to infinity, \( \text{softmax}(\eta) \) tends to a vector with all zero entries except for the position corresponding to the maximum value \( \eta_k \) (assuming that it is unique). This justifies its name.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 73

Context: # 3.4. Bayesian Learning

Figure 3.2: Probability that the next review is positive using the predictive distribution (3.44) for Example 3.3.

![Figure 3.2](path/to/image)

The conjugate prior is the Dirichlet distribution, a generalization of the Beta distribution:

$$
p(\mu | \alpha) \propto \text{Dir}(\mu | \alpha) \propto C^{-1} \prod_{k=0}^{C-1} \alpha_k^{\mu_k - 1}, \tag{3.46}
$$

where \( \alpha_k \) is the hyperparameter representing the number of "prior" observations equal to \( k \). Note that the Dirichlet distribution is a joint pdf for the entries of the mean vector \( \mu \). Mean and mode vectors for the Dirichlet distribution are given as:

$$
E_{\mu | \text{Dir}(\alpha)}[ \mu ] = \frac{\alpha}{\sum_{j=0}^{C-1} \alpha_j}, \tag{3.47}
$$

$$
\text{mode}_{\mu | \text{Dir}(\alpha)}[ \mu ] = \frac{\alpha - 1}{\sum_{j=0}^{C-1} \alpha_j - C}. \tag{3.48}
$$

The posterior of the parameters is the Dirichlet distribution:

$$
p(\mu | D, \alpha) \propto C^{-1} \prod_{k=0}^{C-1} N_k^{\mu_k + \alpha_k - 1} = \text{Dir}(\mu | N), \tag{3.49}
$$

in which we can again interpret \( \alpha_k + N_k \) as the effective number of observations equal to \( k \). From this distribution, we can obtain the MAP estimate.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 75

Context: # 3.5 Supervised Learning via Generalized Linear Models (GLM)

Distributions in the exponential family are not directly suitable to serve as discriminative probabilistic models to be used in supervised learning tasks. In this section, we introduce Generalized Linear Models (GLMs), which are popular probabilistic discriminative models that build on members of the exponential family.

## Figure 3.3: Gaussian-Gaussian model

The prior distribution \( N(\mu | \phi_0 = 1, \sigma^2 = 3) \) (dotted), true distribution \( N(\mu | \theta_0 = 0, \sigma^2 = 1) \) (dashed), \( N \) observations (circles), ML solution (diamond), the MAP estimate (star), and Bayesian predictive distribution (solid line).

\[
\mu = 0: \quad (i) \text{ the MAP estimate } \hat{\mu}_{MAP} \text{ tends to the ML estimate } \hat{\mu}_{ML}; \text{ and } (ii) \text{ the Bayesian predictive distribution tends to the ML predictive distribution, which in turn coincides with the true distribution due to (i).}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 79

Context: # 3.8 Some Advanced Topics*

With energy-based models, the key formula (3.28) of the gradient of the LL with respect to the model's parameters generalizes as

$$
\frac{1}{N} \nabla_{n} \ln p(z | \eta) = - \frac{1}{N} \sum_{n=1}^{N} \nabla_{n} E_{c}(z_{n} | \eta) + \sum_{c} E_{c} - p(z_{c}) \nabla_{n} E_{c}(x_{i}).
$$
(3.58)

Generalizing the discussion around (3.28), the first term in (3.58) is the “positive” component that points in a direction that minimizes the energy of the observations \( x_{i} \); while the second term is the “negative” component that pushes up the energy of the unobserved configurations.

In gradient-ascent methods, the application of the first term is typically referred to as the positive phase, while the second is referred to as the negative phase. (The negative phase is even taken by some authors to model the working of the brain while dreaming!) [56] While for the exponential family the expectation in the negative phase readily yields the mean parameters, for more general models, the evaluation of this term is generally prohibitive and typically requires Monte Carlo approximations, which are discussed in Chapter 8.

## 3.8 Some Advanced Topics*

The previous sections have focused on the important class of parametric probabilistic models in the exponential family. Here we briefly put the content of this chapter in the broader context of probabilistic models for machine learning. First, it is often useful to encode additional information about the relationships among the model variables by means of a graphical formalism that will be discussed in Chapter 7. Second, the problem of learning the distribution of given observations, which has been studied here using parametric models, can also be tackled using a non-parametric approach. Accordingly, the distribution is inferred making only assumptions regarding its local smoothness. Typical techniques in this family include Kernel Density Estimation and Nearest Neighbor Density Estimation (see, e.g., [140]).

Furthermore, rather than learning individual probability densities, in some applications, it is more useful to directly estimate ratios of densities. This is the case, for instance, when one wishes to estimate the
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 81

Context: # Part II
## Supervised Learning
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 82

Context: # 4
## Classification

The previous chapters have covered important background material on learning and probabilistic models. In this chapter, we use the principles and ideas covered so far to study the supervised learning problem of classification. Classification is arguably the quintessential machine learning problem, with the most advanced state of the art and the most extensive application to problems as varied as email spam detection and medical diagnosis. 

Due to space limitations, this chapter cannot provide an exhaustive review of all existing techniques and latest developments, particularly in the active field of neural network research. For instance, we do not cover decision trees here (see, e.g., [155]). Rather, we will provide a principled taxonomy of approaches, and offer a few representative techniques for each category within a unified framework. We will specifically proceed by first introducing as preliminary material the Stochastic Gradient Descent optimization method. Then, we will discuss deterministic and probabilistic discriminative models, and finally we will cover probabilistic generative models.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 83

Context: # 4.1 Preliminaries: Stochastic Gradient Descent

In this section, we review a technique that is extensively used in the solution of optimization problems that define learning problems such as ML and MAP (see Chapter 2). The technique is known as Stochastic Gradient Descent (SGD). SGD is introduced here and applied throughout this monograph to other learning problems, including unsupervised learning and reinforcement learning. Discussions about convergence and about more advanced optimization techniques, which may be skipped at a first reading, can be found in the Appendix A of this chapter.

SGD addresses optimization problems of the form:

$$
\min_{\theta} \sum_{i=1}^{N} f_i(\theta),
$$

where \( \theta \) is the vector of variables to be optimized. The cost function \( f_i(\theta) \) typically depends on the \( n \)th example in the training set \( D \). Following the notation set in Chapter 2, for example, in the case of discriminative deterministic models, the conventional form for the cost functions is:

$$
f_i(\theta) = \ell(t_n, \hat{y}(x_n, \theta)), \quad (4.2)
$$

where \( \ell \) is a loss function; \( (x_n, t_n) \) is the \( n \)th training example; and \( \hat{y}(\cdot, \theta) \) is a predictor parameterized by vector \( \theta \).

SGD requires the differentiability of cost functions \( f_i(\cdot) \). The idea is to move at each iteration in the direction of maximum descent for the cost function in \( (4.1) \), when the latter is evaluated \( S \) of samples from the training set. Given a learning rate schedule \( \eta(t) \) and an initialization \( \theta(0) \) of the parameters, SGD repeats in each iteration until convergence the following two steps:

1. Pick a mini-batch \( S \) of indices from the set \( \{1, \ldots, N\} \) according to some predetermined order or randomly;
2. Update the parameters according to:

   $$
   \theta(t+1) = \theta(t) - \eta(t) \nabla f_i(\theta(t)),
   $$

where \( \nabla f_i(\theta(t)) \) is the gradient of the cost function evaluated at the current parameters.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 84

Context: - Update the weights in the direction of steepest local descent as

  \[
  \theta^{(t)} \gets \theta^{(t-1)} - \frac{\eta^{(t)}}{S} \sum_{s \in S} \nabla f_s(\theta^{(t-1)}).
  \] (4.3)

The learning rate \(\eta^{(t)}\) as a function of the iteration \(t\) is generally considered to be part of the hyperparameters to be optimized via validation. More discussion on this can be found in Appendix A of this chapter.

## 4.2 Classification as a Supervised Learning Problem

Classification is a supervised learning problem in which the label \(t\) can take a discrete finite number of values. We refer to Sec. 2.1 for an introduction to supervised learning. In binary classification, each domain point \(x\) is assigned to either one of two classes, which are denoted as \(C_0\) and \(C_1\) and identified by the value of the label \(t\) as follows:

- \( x \in C_0 \; \text{if} \; t = -1 \) (4.4a)  
- \( x \in C_1 \; \text{if} \; t = 1. \) (4.4b)  

Note that we will find it convenient to use either the label \(t = 0\) or \(t = -1\) to identify class \(C_0\). In the more general case of \(K\) classes \(C_0, C_1, \ldots, C_{K-1}\), we will instead prefer to use one-hot encoding (Sec. 3.2) by labelling a point \(x \in C_k\) with a \(K \times 1\) label that contains all zeros except for a “1” entry at position \(k + 1\).

### Example 4.1. 

Examples of binary classification include email spam detection and creditworthiness assessment. In the former case, the domain point \(x\) may be encoded using the bag-of-words model, so that each entry represents the count of the number of times that each term in a given set appears in the email. In the latter application, the domain vector \(x\) generally includes valuable information to decide on whether a customer should be granted credit, such as credit score and salary (see, e.g., [2]). 

*While less useful, the "hot dog" vs. "not hot dog" classifier designed in the "Silicon Valley" HBO show (Season 4) is also a valid example.*
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 89

Context: # 4.3 Discriminative Deterministic Models

The geometric structure of the data. In particular, in the example, the two classes are not linearly separable in the space of the covariates—no hyperplane separates exactly the domain points in the two classes. In such cases, classifiers of the form (4.5) yield large average losses due to the bias induced by the choice of the model (see Sec. 2.3.3).

1. **Overfitting:** When D is large and the data points N are insufficient, learning the D + 1 weights of the classifier may cause overfitting.  
2. **Data-dependent domain size:** In some applications, the dimension D may even change from data point to data point, that is, it may vary with the index n. For example, a text \( x_n \), e.g., represented in ASCII format, will have a different dimension \( D_n \) depending on the number of words in the text.

To address these problems, a powerful approach is that of working with feature vectors \( \phi(x) \), \( k = 1, \ldots, D \), rather than directly with the covariates \( z \) as the input to the classifier. A feature \( \phi(x_n) \) is a generally non-linear function of the vector \( z \). It is important to emphasize that these functions are fixed and not learned.

Choosing a number of features \( D' > D \), which yields an overcomplete representation of the data point \( x_n \), may help against bias; while opting for an undercomplete representation with \( D' < D \) may help solve the problem of overfitting. Furthermore, the same number of features \( D' \), e.g., word counts in a bag-of-words model, may be selected irrespective of the size of the data point, addressing also the last problem listed above.

The feature-based model can be expressed as (4.5) with activation

\[
a(z, \theta) = \sum_{k=1}^{D} w_k \phi_k(z) = \mathbf{w}^T \boldsymbol{\phi}(z).
\]

(4.8)

where we have defined the feature vector \( \phi(x) = [\phi_1(x) \cdots \phi_D(x)]^T \). Note that model (4.5) is a special case of (4.8) with the choice \( \phi(z) = [1, z^T]^T \).

## 4.3.2 Learning

As seen in Sec. 2.3.3, learning of deterministic discriminative models can be carried out by means of ERM for a given loss function \( \ell \). Further...
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 91

Context: # 4.3 Discriminative Deterministic Models

on the original loss function. In this way, the actual average loss is guaranteed to be smaller than the value attained under the surrogate loss function. Examples of surrogate functions that will be considered in the following can be found in Fig. 4.4.

![Some notable surrogate loss functions for binary classification along with the 0-1 loss.](fig4.4.png)

## Perceptron Algorithm

The perceptron algorithm is one of the very first machine learning and AI algorithms. It was introduced by Frank Rosenblatt at the Cornell Aeronautical Laboratory in 1957 to much fanfare in the popular press – it is “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence,” reported *The New York Times* [144].

Using the feature-based model for generality, the perceptron algorithm attempts to solve problem (4.11) with the surrogate perceptron.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 92

Context: ```
86 Classification

The loss function is defined as 

$$
\ell(t(x, \hat{w})) = \max(0, -t \cdot a(x, \hat{w}))
$$
(4.12)

The perceptron process assigns zero cost to a correctly classified example \( x \), whose functional margin \( t \cdot a(x, \hat{w}) \) is positive, and a cost equal to the absolute value of the functional margin for a misclassified example, whose functional margin is negative. A comparison with the 0-1 loss is shown in Fig. 4.4. The perceptron algorithm tackles problem (4.11) with \( \lambda = 0 \) via SGD with mini-batch size \( S = 1 \). The resulting algorithm works as follows. First, the weights \( \hat{w}^{(0)} \) are initialized. Then, for each iteration \( i = 1, 2, \ldots \):

1. Pick a training example \( (x_n, t_n) \) uniformly with replacement from \( D \):
   - If the example is correctly classified, i.e., if \( t_n a(x_n, \hat{w}^{(i)}) \geq 0 \), do not update the weights: 
     $$
     \hat{w}^{(i)} = \hat{w}^{(i-1)}.
     $$
   - If the example is not correctly classified, i.e., if \( t_n a(x_n, \hat{w}^{(i)}) < 0 \), update the weights as:
     $$
     \hat{w}^{(i)} = \hat{w}^{(i-1)} - \nabla_{\hat{w}} \ell(t_n, a(x_n, \hat{w}^{(i)})) \big|_{\hat{w} = \hat{w}^{(i-1)}} = \hat{w}^{(i-1)} + t_n a(x_n, \hat{w}^{(i-1)}).
     $$
(4.13)

It can be proved that, at each step, the algorithm reduces the term \( \ell(t_n, a(x_n, \hat{w})) \) in the perception loss related to the selected training example even if the latter is misclassified. It can also be shown that, if the training set is linearly separable, the perceptron algorithm finds a weight vector \( \hat{w} \) that separates the two classes exactly in a finite number of steps [23]. However, convergence can be slow. More importantly, the perceptron fails on training sets that are not linearly separable, such as the XOR training set \( D = \{(0, 0), (0, 1), (1, 0), (1, 1)\} \) [97]. This realization came as a disappointment and contributed to the first soc-called AI winter period characterized by a reduced funding for AI and machine learning research [154].

## Support Vector Machine (SVM)

SVM, introduced in its modern form by Cortes and Vapnik [37] in 1995, was among the main causes for a renewal of interest in machine learning.
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 95

Context: # 4.3. Discriminative Deterministic Models

zero Lagrange multipliers. Note that the Lagrange multipliers are returned by standard solvers such as the ones implemented by the CVX toolbox in MATLAB [58].

![Example of binary classification with SVM using polynomial features up to degree \( M \) (λ/N = 0.2).](./figure_4_5.png)

## Example 4.2
In the example in Fig. 4.5, the illustrated \( N = 80 \) training samples are fed to a SVM using the monomial feature vector \( \phi(\mathbf{x}) = [1, x_1^2, \ldots, x_1^M, x_2^M]^\top \) and \( \lambda/N = 0.2 \) for given model orders \( M \). The decision boundary is shown using dashed and solid lines. It is seen that, using a sufficiently large order (here \( M = 3 \)), SVM is able to effectively partition the two samples in the two classes. Furthermore, even with larger values of \( M \) (here \( M = 8 \)), SVM appears not to suffer from significant overfitting thanks to the quadratic regularization term.

The optimization problem (4.16) may be conveniently tackled by using Lagrange duality techniques. This approach also allows one to naturally introduce the powerful tool of kernel methods. The interested reader can find this discussion in Appendix B of this chapter.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 98

Context: ```
## 4.4.2 Learning

Consider first ML. The NLL function can be written as

\[
-\ln p(t|x, \mathbf{w}) = -\sum_{n=1}^{N} \ln p(t_n|x_n, \mathbf{w}) \tag{4.23}
\]

\[
= -\sum_{n=1}^{N} \{t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n)\} \tag{4.24}
\]

where we have defined \(y_n = \sigma(\mathbf{w}^T \phi(x_n))\). The NLL (4.23) is also referred to as the cross-entropy loss criterion, since the term \(-t_n \ln(y_n) - (1 - t_n) \ln(1 - y_n)\) is the cross-entropy \(H(t_n || y_n) = -\sum_{i} t_i \ln(y_i)\) (see Sec. 2.6). We note that the cross-entropy can be used to obtain upper bounds on the probability of error (see, e.g., [50]). The ML problem of minimizing the NLL is convex (see Sec. 3.1), and hence it can be solved either directly using convex optimization tools, or by using iterative methods such as SGD or Newton (the latter yields the iterative reweighted least squares algorithm [23, p. 207]).

The development of these methods leverages the expression of the gradient of the LL function (3.28) (used with \(N = 1\) for the exponential family). To elaborate, using the chain rule for differentiation, we can write the gradient

\[
\nabla_{\mathbf{w}} p(t | x, \mathbf{w}) = \nabla_{t} \text{Bern}(t_n) |_{t_n=t} \cdot \nabla_{\mathbf{w}} \sigma(\mathbf{w}^T \phi(x_n)) \tag{4.25}
\]

which, recalling that \(\nabla_{\ln(\text{Bern}(t_n))} = (t - \sigma(\cdot))\) (cf. (3.28)), yields

\[
\nabla_{\mathbf{w}} p(t | \mathbf{w}) = (t - y) \sigma \tag{4.26}
\]

Evaluating the exact posterior distribution for the Bayesian approach turns out to be generally intractable due to the difficulty in normalizing the posterior

\[
p(\mathbf{w}) \propto p(t) \prod_{n=1}^{N} p(t_n | x_n, \mathbf{w}) \tag{4.27}
\]

We refer to [23, pp. 217-220] for an approximate solution based on Laplace approximation. Other useful approximate methods will be discussed in Chapter 8.
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 100

Context: # 4.4.3 Multi-Class Classification

Given their larger capacity, mixture models may be prone to overfitting. A way to control overfitting will be discussed in Sec. 4.7.

In the case of \( K \) classes, the relevant exponential family distribution is Categorical with natural parameters depending linearly on the feature vector. This yields the following discriminative model as a generalization of logistic regression:

\[
t | x, W \sim \text{Cat}(t | \mathbf{W}(x)).
\]

where the label vector \( t \) is defined using one-hot encoding (Chapter 3) and \( W \) is a matrix of weights. We can also equivalently write the vector of probabilities for the \( K \) classes as

\[
y = \text{softmax}(\mathbf{W}(x)) = \left( \frac{e^{\mathbf{W}_1(x)}}{\sum_{j=1}^K e^{\mathbf{W}_j(x)}}, \ldots, \frac{e^{\mathbf{W}_K(x)}}{\sum_{j=1}^K e^{\mathbf{W}_j(x)}} \right).
\]

where \( y = [y_1, \ldots, y_K]^T \) with \( y_h = p(c_h | x) \); and \( \mathbf{W}_k \) being the \( k \)-th row of the weight matrix \( W \).

Learning follows as for logistic regression. To briefly elaborate on this point, the NLL can be written as the cross-entropy function:

\[
-\ln p(t | x; W) = -\sum_{n=1}^{N} \ln p(t_n | x, W)
= -\sum_{n=1}^{N} t_n^T \ln(y_n).
\]

where the logarithm is applied element by element, and we have \( y_n = \text{softmax}(\mathbf{W}(x_n)) \). Note that each term in (4.32) can be expressed as the cross-entropy \( -\mathbf{t}_n \ln(y_n) = H(t_n || y_n) \). The ML problem is again convex and hence efficiently solvable. The gradient of the NLL can be again found using the general formula (3.28) for exponential models and the chain rule for derivatives. We can write

\[
\nabla W \ln p(t | x; W) = \nabla \ln p(t | \mathbf{W}(x)) \propto \nabla \mathbf{W}(\mathbf{W}(x)).
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 102

Context: # 4.5 Discriminative Probabilistic Models: Beyond GLM

As depicted in Fig. 4.6, GLMs can be interpreted as three-layer neural networks in which the only hidden layer computes fixed features. The fixed features are then processed by the output classification layer. 

In various applications, determining suitable features is a complex and time-consuming task that requires significant domain knowledge. Moving beyond GLMs allows us to work with models that learn not only the weights used by the output layer for classification, but also the vector of features \( \phi(x) \) on which the output layer operates. This approach yields a much richer set of models, which, along with suitable learning algorithms, has led to widely publicized breakthroughs in applications ranging from speech translation to medical diagnosis.

As a prominent example of beyond-GLM classification models, we describe here feed-forward multi-layer neural networks, or deep neural networks.

## Figure 4.7: A multi-layer neural network.

\[
\begin{array}{ccccccc}
a^{1} & h^{1} & h^{2} & \cdots & h^{k-1} & g(\mathbf{x}) & y \\
x^{1} & W^{1} & W^{2} & \cdots & W^{k} & \text{softmax} &
\end{array}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 104

Context: # Classification

## 4.5.2 Learning

Training deep neural networks is an art [56]. The basic underlying algorithm is backpropagation, first proposed in 1969 and then reinvented in the mid-1980s [126, 125]. However, in practice, a number of tricks are required in order to obtain state-of-the-art performance, including methods to combat overfitting, such as dropout. Covering these solutions would require a separate treatise, and here we refer to [64, 56] for an extensive discussion.

Backpropagation — or backprop for short — extends the derivation done in (4.34) to evaluate the gradient of the log-likelihood to be used within an SGD-based algorithm. Again, the main ingredients are the general formula (3.28) for exponential models and the chain rule for derivatives. To elaborate, select a given training example \((x_n, y_n) = (x, t)\) to be used in an iteration of SGD. Backprop computes the derivative of the negative log-likelihood, or cross-entropy loss function, \(L(W) = -\ln p(t|x; W) = -t^T \ln y\) (cf. (4.32)), where the output \(y\) is obtained via the relations (4.35). It is important to note that, unlike the linear models studied above, the cross-entropy for multi-layer is generally a non-convex function of the weights.

Backprop computes the gradients with respect to the weight matrix by carrying out the following phases:

1. **Forward pass:** Given \(a^1, h^1, a^2, h^2, \ldots, a^L, h^L\), evaluate \(a^{L+1}, y\), and \(t\).

2. **Backward pass:** Given \(a^1, h^1, a^2, h^2, \ldots, a^L, h^L, a^{L+1}, y\), and \(t\), compute the gradients.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 106

Context: # Classification

![Figure 4.8](path/to/your/image)  
**Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity. The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression.

## Example 4.3
In the example in Fig. 4.8, the illustrated \( N = 300 \) training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity \( h(z) = \sigma(x) \). The logistic model uses linear features \( \sigma(x) = [1]^T \). Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot the probability that the class labels the same as for the examples marked with circles as color map. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification.

## 4.5.3 Some Advanced Topics
We conclude this section by noting a few important aspects of the ongoing research on deep neural networks.

A first issue concerns the theoretical understanding of the generalization properties of deep neural networks. On the face of it, the success

Image Analysis: 

### Comprehensive Examination of the Visual Content

#### 1. Localization and Attribution
- **Image Position:**
  - **Image 1:** Located centrally on the page.

#### 2. Object Detection and Classification
- **Image 1:**
  - **Objects Detected:**
    - Data points represented by blue circles and red crosses.
    - Color map background in grayscale.
    - Decision boundary indicated by a solid line.

#### 3. Scene and Activity Analysis
- **Image 1:**
  - **Scene Description:**
    - The image depicts a scatter plot overlaid with a color map and a decision boundary.
  - **Activities:**
    - The scatter plot shows data points classified into two categories (blue circles and red crosses).
    - The color map indicates the probability of classification.
    - A solid line separates the regions, representing the decision boundary from logistic regression.

#### 4. Text Analysis
- **Extracted Text:**
    - **Caption:**
      - **Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (\(L=1\)) and six hidden neurons (with sigmoid non-linearity). The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression.
    - **Example 4.3:**
      - In the example in Fig. 4.8, the illustrated \(N=300\) training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (\(L=1\)) and six hidden neurons with sigmoid non-linearity \(h(z)=\sigma(z)\). The logistic model uses linear features \(\sigma(z)=[1 \, z^T]\). Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot the probability that the class label is the same as for the examples marked with circles as color map. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification.

#### 5. Diagram and Chart Analysis
- **Image 1:**
  - **Axes and Scales:**
    - X-axis (\(Z_1\)): Ranges approximately from -4 to 4.
    - Y-axis (\(Z_2\)): Ranges approximately from -4 to 4.
  - **Legends:**
    - Red crosses (\( \times \)): Data points of class 1.
    - Blue circles (\( \circ \)): Data points of class 2.
    - Color map: Represents the probability, with a bar on the right showing the gradient from 0 to 1.
  - **Key Insights:**
    - The decision boundary (solid line) shows the separation made by logistic regression.
    - The color map illustrates that the feed-forward multi-layer network better captures the structure of the data compared to the linear model.

#### 8. Color Analysis
- **Image 1:**
  - **Dominant Colors:**
    - Red and blue are used to differentiate classes.
    - Black, white, and shades of gray indicate the probability regions.
  - **Impact on Perception:**
    - The color scheme helps to distinguish between different classes and the corresponding decision areas effectively.

#### 9. Perspective and Composition
- **Image 1:**
  - **Perspective:**
    - Direct, front-facing view typical of scatter plots.
  - **Composition:**
    - Center-aligned with data points spread across the plot.
    - Decision boundary centrally positioned, highlighting the separation between classes.
    - Color bar on the right side provides probability information.

#### 10. Contextual Significance
- **Image 1:**
  - **Overall Document Contribution:**
    - The plot supports the explanation of Example 4.3 in comparing the effectiveness of logistic regression vs. multi-layer networks.
    - It visually demonstrates how different models handle training data and classification.

By considering these aspects, we can see that this page offers a detailed explanation of machine learning classification techniques with a practical example illustrated by the scatter plot and color map. The image effectively highlights the differences in performance and decision-making capabilities of logistic regression and a multi-layer network.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 108

Context: Furthermore, beside ML, there exist Bayesian learning algorithms [53], including simplified approaches such as dropout [56, 64]. Significant progress has also been made on applications such as image recognition by modeling the underlying structure, or geometry, of the data [30]. As an important case in point, convolutional neural networks leverage the stationarity, locality and spatial invariance of image features by limiting the receptive field of the neurons (i.e., by setting to zero weights that connect to "distant" pixels) and by tying the weights of neurons in the same layer.

Another recent development is the design of event-driven spiking neural networks that can be implemented on neuromorphic computing platforms with extremely low energy consumption (see, e.g., [4, 11]).

## 4.6 Generative Probabilistic Models

As discussed in Chapter 2, discriminative models do not attempt to model the distribution of the domain points \( \mathbf{x} \), learning only the predictive distribution \( p(\mathbf{y}) \). In contrast, generative models aim at modeling the joint distribution by specifying parameterized versions of the prior distribution \( p(\mathbf{t}) \), or \( p(\mathbf{c}) \), and of the class-conditional probability distribution \( p(\mathbf{x} | \mathbf{c}) \). As a result, generative models make more assumptions about the data by considering also the distribution of the covariates \( \mathbf{x} \). As such, generative models may suffer from bias when the model is incorrectly selected. However, the capability to capture the properties of the distribution of the explanatory variables \( \mathbf{x} \) can improve learning if the class-conditional distribution \( p(\mathbf{x} | \mathbf{c}) \) has significant structure.

### 4.6.1 Model

Generative models for binary classification are typically defined as follows:

\[
t \sim \text{Bern}(\pi) \tag{4.38a}
\]

\[
x | t \sim \text{exponential}(\eta_t) \tag{4.38b}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 110

Context: By the general rules derived in Chapter 2 for the exponential family, ML selects the moment matching estimates

\[
\pi_{ML} = \frac{N[1]}{N} \tag{4.42a}
\]

\[
\mu_{k, ML} = \frac{1}{N[|k|]} \sum_{n=1}^{N} x_n \tag{4.42b}
\]

\[
\Sigma_{k, ML} = \frac{1}{N[|k|]} \sum_{n=1}^{N} (x_n - \mu_k)(x_n - \mu_k)^{T} \tag{4.42c}
\]

The resulting predictive distribution for the label of a new sample is then given by (4.39) by plugging in the estimates above as

\[
p(C_k | x) = \frac{\pi_{ML} N(x | \mu_{k, ML}, \Sigma_{k, ML})}{\pi_{ML} N(x | \mu_{1, ML}, \Sigma_{1, ML}) + (1 - \pi_{ML}) N(x | \mu_{2, ML}, \Sigma_{2, ML})} \tag{4.43}
\]

### Linear Discriminant Analysis (LDA)
Setting \(\Sigma_k = \Sigma\) for both classes \(k = 1, 2\) yields the Linear Discriminant Analysis (LDA) model \[[104]\]. Imposing that two generally distinct parameters, such as \(\Sigma_1\) and \(\Sigma_2\) are equal is an example of parameter tying or sharing. By reducing the number of parameters to be learned, parameter sharing can reduce overfitting at the potential cost of introducing bias (see Chapter 2).

The use of the assumption of conjugate priors, and of a prior independence of the parameters, MAP and Bayesian approaches can be directly obtained by following the derivations discussed in Chapter 2. We refer to [23, 15, 104] for details.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 112

Context: where \( t \) is encoded using one-hot encoding, so that the label of each example is given by the vector \( t_n = [t_{n,1}, \ldots, t_{n,K}]^T \). Following the discussion above, moment matching yields the ML estimates

\[
\pi_k, ML = \frac{N |[k]}{N} \sum_{n=1}^N t_{n,k} \quad (4.45a)
\]

\[
\mu_{k,ML} = \frac{1}{N |[k]|} \sum_{n=1}^N t_{n,k} x_n \quad (4.45b)
\]

\[
\Sigma_{k,ML} = \frac{1}{N |[k]|} \sum_{n=1}^N t_{n,k} (x_n - \mu_{k}) (x_n - \mu_{k})^T \quad (4.45c)
\]

### 4.7 Boosting*

In this section, we return to the mixture models of the form (4.29) and discuss a popular training approach to reduce overfitting. We focus on deterministic discriminative models with activations \( \alpha_k(x, \tilde{\theta}) \), \( k = 1, \ldots, K \), in which the mixture predictor is given as 

\[
a(x, \tilde{w}) = \sum_{k=1}^K \pi_k \alpha_k(x, \tilde{w}) \quad (4.46)
\]

with learnable parameters \( \{\pi_k\} \) and \( \{\tilde{w}\} \). The technique, known as boosting, trains one model \( \alpha_k(x, \tilde{w}) \) at a time in a sequential fashion, from \( h = 1 \) to \( h = K \), hence adding one predictor at each training step \( k \). As a result, boosting increases the capacity of the model in a sequential manner, controlling the sum in (4.46) over a growing number of predictors. In this way, one starts by training a model with a large bias, or approximation error; and progressively decreases the bias at the cost of a potentially larger estimation error (see Chapter 2 and the next chapter for further discussion on bias and estimation error). As we will discuss below, each model is trained by solving an ERM problem in which the contribution of a training example is weighted by the rate of the previously trained models.

To elaborate, boosting—more specifically the AdaBoost scheme—can be described as solving an ERM problem with the exponential loss function \( l(t, a(x, \tilde{w})) = \exp(-t \cdot a(x, \tilde{w})) \), as plotted in Fig. 4.4.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 116

Context: # Classification

Tom methods, which require calculation of the Hessian to evaluate the local curvature of the objective, and approximated Newton methods, which leverage an estimate of the Hessian. The practical and theoretical implications of these methods is still under discussion [157].

Related to second-order methods is the natural gradient approach, which applies most naturally to probabilistic models in which the function to be optimized is a LL [5]. The conventional gradient method updates the parameters by moving in the direction that minimizes the cost function under a constraint on the norm of the update vector in the parameter space. A potentially problematic aspect of this method is that the Euclidean distance \( \|\theta' - \theta''\|^2 \) between two parameter vectors \( \theta' \) and \( \theta'' \), two mean parameters in a model within the exponential family, does not provide a direct measure of the distance of the two corresponding distributions in terms of relevant metrics such as the KL divergence. The natural gradient method addresses this issue by measuring the size of the update directly in terms of the KL divergence between distributions. This modifies the update by pre-multiplying the gradient with the inverse of the Fisher information matrix [5].

The discussion above focuses on the common case of differentiable cost functions. ERM problems typically include possibly non-differentiable regularization terms. To tackle these problems, techniques such as the subgradient method and proximal gradient can be used in lieu of SGD [22]. Other important aspects of optimization schemes include parallelism and non-convexity (see, e.g., [130, 141, 161]). Alternatives to gradient methods that do not require differentiability include evolutionary schemes [128].

## Appendix B: Kernel Methods

In this section, we provide a brief introduction to kernel methods. This section requires some background in Lagrange duality.

We start by revisiting the problem (4.16) solved by SVM. Using Lagrange duality, the optimization (4.16) can be solved in the dual domain, that is, by optimizing over the dual variables or the Lagrange multipliers. Referring for details to [23, Chapter 7], the resulting problem…
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 118

Context: Before leaving the subject of kernel methods, it is worth noting that an important class of methods including **k–Nearest Neighbor (k-NN)** uses kernels that are data-dependent. k-NN is also an example of non-parametric learning rules. In contrast to the other schemes studied here, it does not rely on a parametric model of the (probabilistic) relationship between input and output. Instead, k-NN leverages the assumption that the labels of nearby points \( x \) should be similar [81].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 119

Context: # 5
## Statistical Learning Theory*

Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number \( N \) of available data points and the generalization performance of a trained machine. The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. 

This chapter is of mathematical nature, and it departs from the algorithmic focus of the text so far. While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. 

To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 120

Context: # 5.1 A Formal Framework for Supervised Learning

In this chapter, we concentrate on discriminative deterministic models for binary classification, as it is typically done in statistical learning theory. We also focus on the standard 0-1 loss \( \ell(t, \hat{t}) = \mathbb{I}\{t \neq \hat{t}\} \), for which the generalization loss is the probability of error. The labels \( t \) for the two classes take values in the set \{0, 1\} (cf. (4.4)).

The learning problem is formalized as follows. Assume that a model, or hypothesis class, \( \mathcal{H} \) has been selected. This set contains a possibly uncountable number of predictors \( i \) that map each point \( x \) in the domain space to a label \( \ell(x) \in \{0, 1\} \). We would like to choose a specific hypothesis, or predictor, \( \hat{f} \in \mathcal{H} \) that minimizes the generalization error (cf. (2.2)).

\[
L_{\mathcal{H}}(\hat{f}) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(t, \hat{f}(x))]. \tag{5.1}
\]

Solving this inference problem would yield an optimal model within class \( \mathcal{H} \) as 

\[
\hat{f} \in \arg\min_{f \in \mathcal{H}} L(f). \tag{5.2}
\]

The notation in (5.2) emphasizes that there may be multiple optimal hypotheses returned by the minimization of the generalization error \( L(\hat{f}) \). Nevertheless, to fix the ideas, it is useful to think of the case in which there is a unique optimal hypothesis. This is, for instance, the case when the loss function is strictly convex. Obtaining the optimal predictor (5.2) requires knowledge of the true distribution \( p(x, t) \), which is not available.

## Example 5.1

For the linear (deterministic) methods studied in Chapter 4, the model is defined as

\[
\mathcal{H} = \{t_b(x) = \langle w, x \rangle + w_0\} \tag{5.3}
\]

with \( w \in \mathbb{R}^d \), and similarly for the feature-based version. Identifying a hypothesis within this class requires the selection of the weight vector \( w \in \mathbb{R}^d \).

In lieu of the true distribution \( p(x, t) \), what is available is an i.i.d. training set 

\[
\mathcal{D} = \{(x_n, t_n)\}_{n=1}^N \sim p(x, t). \tag{5.4}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 124

Context: # Statistical Learning Theory

the fraction of training examples that are correctly classified. It is observed that, as \( N \) increases, the training loss, \( L_D(\hat{\theta}) \), or empirical risk, becomes an increasingly reliable estimate of the generalization loss uniformly for all hypotheses, parameterized by \( \theta \), in the model.

![Generalization and training losses for a scalar threshold classifier model.](path/to/image.png)

### Figure 5.3: Generalization and training losses for a scalar threshold classifier model.

As suggested by the example, if \( N \) is large enough, the empirical risk, or training loss, \( L_D(\hat{\theta}) \), approximates increasingly well (with high probability) the generalization loss \( L_{\mathcal{H}}(\hat{\theta}) \) for any fixed hypothesis in \( \hat{\theta} \in \mathcal{H} \) by the law of large numbers. It would then seem that the problem is solved: Since \( L_D(\hat{\theta}) \approx L_{\mathcal{H}}(\hat{\theta}) \) for any \( \hat{\theta} \), the ERM solution \( \hat{\theta}_{\text{ERM}} \), which minimizes the training loss \( L_D(\hat{\theta}) \), should also approximately minimize the generalization loss \( L_{\mathcal{H}}(\hat{\theta}) \), and hence we have \( \hat{\theta}_{\text{ERM}} \approx \hat{\theta}^* \).

However, this argument is incorrect. In fact, we need the training loss \( L_D(\hat{\theta}) \) to be an accurate approximation of the generalization loss \( L_{\mathcal{H}}(\hat{\theta}) \) uniformly for all hypotheses in \( \hat{\theta} \in \mathcal{H} \) in order to ensure the condition:

\[
\hat{\theta}_{\text{ERM}} \approx \hat{\theta}^*
\]

As we will see in the rest of this chapter, guaranteeing this condition requires to observe a number of samples \( N \) that grows with the "capacity" of the model \( \mathcal{H} \), that is, roughly, with the number of parameters defining the hypotheses in \( \mathcal{H} \). Moreover, some models
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 126

Context: # Statistical Learning Theory

Let \( x \in \mathbb{R}^D \) yield no information on the value of \( i(x) \) for other values of \( x \). As another, less obvious, example the class 

\[
\mathcal{H} = \{ h_w(x) = 1(\sin(wx) > 0) \}
\]

is not PAC learnable despite being parameterized by a single scalar [133].

## Definition 5.3
The sample complexity \( N_{\mathcal{H}}( \epsilon, \delta ) \) of model \( \mathcal{H} \) is the minimal value of \( N_{\mathcal{H}}( \epsilon, \delta ) \) that satisfies the requirements of PAC learning for \( \mathcal{H} \).

We will see next that the sample complexity depends on the capacity of the model \( \mathcal{H} \). Note that the sample complexity of the two examples above is infinite since they are not PAC learnable. We also remark that PAC learnability may be alternatively defined under the additional conditions on the scaling of \( N_{\mathcal{H}}( \epsilon, \delta ) \) as a function of \( \epsilon \) and \( \delta \), as well as on the computational complexity of the learning rule. We will not consider these more refined definitions here, and we refer the reader to [51, 133] for discussion.

## 5.3 PAC Learnability for Finite Hypothesis Classes

In this section, we consider models with a finite number of hypotheses. The main result is summarized in the following theorem, which is proved below in Sec. 5.3.1.

### Theorem 5.1
A finite hypothesis class \( \mathcal{H} \) is PAC learnable with sample complexity satisfying the inequality

\[
N_{\mathcal{H}}( \epsilon, \delta ) \leq \frac{2 \ln | \mathcal{H} | + 2 \ln(2 / \delta)}{\epsilon^2} \leq N_{\text{ERM}}^{\mathcal{H}}( \epsilon, \delta ). 
\]

Moreover, the ERM algorithm achieves the upper bound \( N_{\text{ERM}}^{\mathcal{H}}( \epsilon, \delta ) \).

The previous theorem shows that all finite classes are PAC learnable. Furthermore, for all classes \( \mathcal{H} \), ERM is a PAC learning rule for any desired levels of accuracy and confidence \( (\epsilon, \delta) \), as long as \( N \) is larger than the threshold \( N_{\text{ERM}}^{\mathcal{H}}( \epsilon, \delta ) \). This threshold, which we will refer to,
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 134

Context: # Statistical Learning Theory

For a richer model, it is more difficult to find a representative distribution \( q(x) \) that yields an average coding length close to the minimum \( H(p(x) || q(x)) \) for all values of \( x \).

It turns out that the minimax redundancy equals the capacity \( C(p(x) || p(\theta)) \), which is defined as \( C(p(x) || p(\theta)) = \max_{q(x)} I(q; \theta) \) [38]. This is shown by the following sequence of equalities:

\[
\Delta R(H) = \min_{q(x)} \max_{p(\theta)} \sum_{x} p(\theta) p(x | \theta) \log \frac{p(x | \theta)}{q(x)} \tag{5.27a}
\]

\[
= \max_{p(\theta)} \min_{q(x)} \sum_{x} p(\theta) p(x | \theta) \log \frac{p(x | \theta)}{q(x)} \tag{5.27b}
\]

\[
= \max_{p(\theta)} \sum_{x} p(\theta) p(x | \theta) \log \frac{p(x | \theta)}{p(x)} \tag{5.27c}
\]

\[
= C(p(x) || p(\theta)) \tag{5.27d}
\]

where the first equality follows since the average of a set of numbers is no larger than any of the numbers; the second is a consequence of the minimax theorem since the term \( \sum_{x} \sum_{y} p(\theta) p(x | \theta) \log \frac{q(x | \theta)}{q(x)} \) is convex in \( q \) and concave in \( p \); and the third equality follows by Gibbs' inequality (see Sec. 2.6 and A.5 in Appendix A). As a final note, the mutual information \( I(x; \theta) \) between model parameter and data also plays a central role in obtaining bounds on the performance of estimation [91].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 135

Context: # Part III
## Unsupervised Learning

### 1. Introduction
Unsupervised learning is a type of machine learning that does not require labeled output. It relies on patterns and structures in input data to generate insights.

### 2. Common Algorithms
- K-means Clustering
- Hierarchical Clustering
- DBSCAN

### 3. Applications
- Market Basket Analysis
- Image Compression
- Anomaly Detection

### 4. Conclusion
Unsupervised learning provides significant value across various domains, enabling organizations to extract meaningful information from unstructured data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 136

Context: # 6 Unsupervised Learning

Unlike supervised learning, unsupervised learning tasks operate over unlabelled data sets. Apart from this general statement, unsupervised learning is more loosely defined than supervised learning, and it also lacks a strong theoretical framework to mirror the PAC learning theory covered in the previous chapter. Nevertheless, it is widely expected that future breakthroughs in machine learning will come mainly from advances in the theory and design of unsupervised learning algorithms. This is due to the availability of huge repositories of unlabelled data, as well as to the broader applicability of learning tasks whereby the machine learns, as it were, without supervision or feedback. Unsupervised learning is also considered by some as the key to the development of general, as opposed to task-specific AI [137] (see also [142, 87] for more on general AI).

Generally speaking, unsupervised learning algorithms aim at learning some properties of interest of the mechanism underlying the generation of the data. In this sense, unsupervised learning concerns the study of generative models, although, as we will see, this statement comes with some caveats. A common aspect of many models used for unsupervised learning is the presence of hidden, or latent, variables.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 138

Context: # Unsupervised Learning

may wish to cluster a set \( D \) of text documents according to their topics, by modeling the latter as an unobserved label \( z_n \). Broadly speaking, this requires to group together documents that are similar according to some metric. It is important at this point to emphasize the distinction between classification and clustering: While the former assumes the availability of a labelled set of training examples and evaluates its (generalization) performance on a separate set of unlabelled examples, the latter works with a single, unlabelled, set of examples. The different notation used for the labels \( z_n \) in lieu of \( t_n \) is meant to provide a reminder of this key difference.

## Dimensions reduction and representation:
Given the set \( D \), we would like to represent the data points \( x_n \in D \) in a space of lower dimensionality. This makes it possible to highlight independent explanatory factors, and/or to ease visualization and interpretation [93], e.g., for text analysis via vector embedding (see, e.g., [124]).

## Feature extraction:
Feature extraction is the task of deriving functions of the data points \( x_n \) that provide useful lower-dimensional inputs for tasks such as supervised learning. The extracted features are unobserved, and hence latent, variables. As an example, the hidden layer of a deep neural network extracts features from the data for use by the output layer (see Sec. 4.5).

## Generation of new samples:
The goal here is to train a machine that is able to produce samples that are approximately distributed according to the true distribution \( p(x) \). For example, in computer graphics for filmmaking or gaming, one may want to train a software that is able to produce artistic scenes based on a given description.

The variety of tasks and the difficulty in providing formal definitions, e.g., on the realism of an artificially generated image, may hinder unsupervised learning, at least in its current state, a less formal field than supervised learning. Often, loss criteria in unsupervised learning measure the divergence between the learned model and the empirical data distribution, but there are important exceptions, as we will see.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 142

Context: # Unsupervised Learning

We obtain

\[
\mu^{new}_{k} = \frac{1}{\sum_{n=1}^{N} \delta_{k,z_{n}}} \sum_{n=1}^{N} \delta_{k,z_{n}} x_{n}
\]

The new cluster representative \(\mu^{new}_{k}\) for each cluster \(k\) is the mean of the data points assigned to cluster \(k\).

- If a convergence criterion is not satisfied, set \( \mu^{old}_{k} \leftarrow \mu^{new}_{k} \) and return to the E step.

Since both the E step and M step minimize the objective function in (6.4), respectively over the cluster assignment variables \(z_n\) and the cluster representatives \(\mu_k\), the value of the objective function is non-decreasing across the iterations. This ensures convergence. Illustrations of convergence and examples can be found in [23, Chapter 9]. As a note, the algorithm is also known as Lloyd-Max quantization [54].

At a high level, K-means alternates between: 
1. Making inferences about the hidden variables \(\{z_{n}\}\) based on the current model defined by the representatives \(\{\mu_{k}\}\) in the E step; and 
2. Updating the model \(\{\mu_{k}\}\) to match the data \(\{x_{n}\}\) and the inferred variables \(\{z_{n}\}\) in the M step. We will see that a similar algorithmic structure is applied by many unsupervised learning algorithms.

Before we move on to discussing more general solutions, it is worth spending a few words on the problem of selecting the number \(K\) of clusters. A first possibility is to add or remove clusters until certain heuristic criteria are satisfied, such as the “purity” of clusters. A second approach is hierarchical clustering, whereby one builds a tree, known as a dendrogram, that includes clustering solutions with an increasing number of clusters as one moves away from the root (see, e.g., [51]). Yet another solution is to let \(K\) be selected automatically by adopting a non-parametric Bayesian approach via a Dirichlet process prior [104].

## 6.3 ML, ELBO and EM

In this section, we discuss two key technical tools that are extensively used in tackling unsupervised learning problems, namely the Evidence Lower Bound (ELBO) and the EM algorithm. The starting point is the fundamental problem of learning a probabilistic, directed or undirected,
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 143

Context: # 6.3 ML, ELBO and EM

## 6.3.1 ML Learning

From Sec. 2.6, we know that ML, asymptotically in \( N \), tends to minimize a KL divergence between the true distribution \( p(x) \) and the selected hypothesis in the model. This is a criterion that is well suited for many of the unsupervised learning tasks mentioned above, such as density estimation or generation of new samples, and it is hence useful to start the discussion with ML learning.

Before we do that, a few remarks are in order. First, it is often useful to choose divergences other than KL, which are tailored to the specific application of interest [8]. We will further discuss this aspect in Sec. 6.4.3. Second, when the goal is representation learning, the machine aims at obtaining useful features \( z \). Hence, minimizing the KL divergence to match the true distribution \( p(x) \) does not directly address the objective of representation learning, unless appropriate restrictions are imposed on the model \( p(x|z, \theta) \). If the generative model \( p(x|z, \theta) \) is too powerful, then it can disregard the features \( z \) and still obtain a high likelihood for the data (see, e.g., [69]). We will get back to this point in Sec. 6.6. Finally, obtaining ML solutions is often impractical, particularly in large models. Nevertheless, understanding the ML problem allows one to better gauge the impact of the approximations and simplifications made to obtain efficient algorithms.

To proceed, we consider probabilistic, directed or undirected, models for \( x \), for the purpose of simplifying the notation, on a data set with a single observation \( (N = 1) \). The extension to a data set with arbitrary number \( N \) of points only requires adding an outer sum over the sample index in the LL function. We also acknowledge discrete hidden RVs, and the generalization to continuous RVs is obtained by substituting sums with suitable integrals. The ML problem can be written as the maximization of the LL function as:

\[
\max_\theta \ln p(x|\theta) = \ln \left( \sum_z p(x, z | \theta) \right) \tag{6.7}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 144

Context: # 138 Unsupervised Learning

where \( x \) denotes the data and \( z \) the hidden or latent variables. Note the marginalization with respect to the hidden variables in (6.7). This problem should be contrasted with the supervised learning ML problem obtained when both \( x \) and \( z \) are observed, namely

\[
\max_{\theta} \ln p(x, z | \theta). \tag{6.8}
\]

## Example 6.1

Consider a directed generative Bernoulli-Gaussian model characterized as

- \( z \sim \text{Bern}(0.5) \tag{6.9a} \)
- \( x | z = 0 \sim \mathcal{N}(2, 1) \tag{6.9b} \)
- \( x | z = 1 \sim \mathcal{N}(\theta, 1) \tag{6.9c} \)

This corresponds to a mixture of Gaussians model, in which the only parameter \( \theta \) is the mean of one of the two Gaussian components. Assume that we observe \( x = 0 \). Consider first the supervised learning case, where we assume that we also measure \( z = 1 \). In this case, the LL function in (6.8) is 

\[
\ln p(x | z = 1) = \ln \mathcal{N}(x | \theta, 1) + \ln(0.5).
\]

In contrast, with unsupervised learning, the LL function in (6.7) is 

\[
\ln p(x | \theta) = \ln(0.5 \cdot \mathcal{N}(x | 0, 1) + 0.5 \cdot \mathcal{N}(x | \theta, 1)).
\]

The LL functions are shown in Fig. 6.2. Unlike supervised learning, the LL for unsupervised learning is seen to be non-concave. In this example, this is a consequence of the fact that, when \( \theta \) is sufficiently large in absolute value, the probability of the data \( x \) resulting from the fixed Gaussian distribution centered at \( x = 2 \) makes the contribution of the Gaussian centered \( \theta \) increasingly irrelevant.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 154

Context: ```
6.4 Directed Generative Models
In this section, we discuss directed generative models, in which, as seen in Fig. 6.1(a), one posits a parameterized prior \( p(z|\theta) \) of the latent variables \( z \) and a parameterized conditional decoding distribution \( p(x|z, \theta) \). As discussed, the goal can be that of learning the distribution \( p(x) \) of the true data to extract useful features \( z \). In the latter case, the use of directed generative model is referred to as performing analysis by synthesis. We first discuss two prototypical applications of EM to perform ML learning. We then outline an alternative approach, known as GAN, which can be thought of as a generalization of ML. Accordingly, rather than selecting a priority by ML, GANs learn at the same time divergence and generative model.

Multi-layer extensions of generative directed models discussed here fall in the category of Helmholtz machines. We refer to, e.g., [42], for a discussion about the task of training these networks via an approximation of the EM algorithm, which uses tools covered in Chapter 8.

6.4.1 Mixture of Gaussians Model
The mixture of Gaussians model can be described by the following directed generative model:

\[
z_n \sim \text{Cat}(\pi) \tag{6.28a}
\]
\[
x_n | z_n = k \sim \mathcal{N}(\mu_k, \Sigma_k). \tag{6.28b}
\]

with parameter vector \( \theta = [\pi, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K] \). We use one-hot encoding for the categorical variables \( z_n = [z_{n, 1}, \ldots, z_{n, K}] \). Note that this model can be thought of as an unsupervised version of the QDA model for the fully observed, or supervised, case studied in Sec. 4.6.
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 156

Context: # Unsupervised Learning

![Figure 6.5: True distribution (dashed line) and mixture of Gaussians model learned via EM (solid line).](path/to/image)

## Example 6.3

Consider data generated from the multi-modal distribution shown in Fig. 6.5 as a dashed line, which was obtained as a mixture of two Gaussians and an exponential distribution. When \( M = 1 \), the mixture of Gaussian distribution learned via EM corresponds to the conventional ML estimate, and is hence obtained via moment matching (see Chapter 3). This is plotted in the figure for a given data realization with \( N = 100 \). Running EM with the larger values \( M = 2 \) and \( M = 3 \) also returns the same distribution. This distribution is seen to be inclusive of the entire support of the true distribution and to smooth out the edges of the original distribution. As we will further discuss in Sec. 6.4.3, this is a consequence of the fact that ML minimizes the KL divergence over the second argument.

A similar model that applies to binary data, rather than continuous data, such as black-and-white images, is the mixture of Bernoulli models. The EM algorithm can be derived by following the same steps detailed here [23].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 162

Context: # Unsupervised Learning

## 6.5.1 Restricted Boltzmann Machines (RBM)

RBMs are typically characterized by an \( M \)-dimensional binary hidden vector \( z \in \{0, 1\}^M \), while the observed variables may be discrete or continuous. Here we consider the case of binary observed variables, which is suitable to model, e.g., black-and-white images or positive/negative recommendations. The RBM model is defined as (6.3), where the joint distribution of visible and latent variables is a log-linear model, and hence part of the exponential family. Mathematically, we have:

\[
p(x | z) = \frac{1}{Z(\theta)} \exp(-E(x, z | \theta)), \tag{6.43}
\]

with the energy function given as

\[
E(x, z | \theta) = -a^T z - b^T x - z^T W z, \tag{6.44}
\]

and the partition function as 

\[
Z(\theta) = \sum_{x, z} \exp(-E(x, z | \theta)). 
\]

The parameter vector \(\theta\) includes the \( M \times 1 \) vector \( a \), the \( D \times 1 \) vector \( b \), and the \( D \times M \) matrix \( W \).

The qualifier "restricted" captures the fact that the energy function (6.44) only features cross-terms that include one variable from \( x \) and one from \( z \), and no term includes both variables in either \( x \) or \( z \). In other words, the model accounts for the affinities between variables in \( x \) and \( z \), but directly between variables in either \( x \) or \( z \). For instance, if \( (i, j) \)th entry \( w_{ij} \) of matrix \( W \) is a large positive number, variables \( x_i \) and \( z_j \) tend to have high signals in order to minimize the energy and hence maximize the probability; and the opposite is true when \( w_{ij} \) is negative. As we will see in the next chapter, this type of probabilistic relationship can be represented by the undirected graphical model known as MRF shown in Fig. 6.6.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 164

Context: Learning is typically done by means of an approximate SGD method that leverages MC techniques. Recalling the general formula (3.28) for the exponential family, the gradient at the current iteration \( \theta = \theta_{\text{gold}} \) can be computed using (6.22) as

\[
\nabla_{\theta} \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = E_{x \sim p_{\theta_{\text{gold}}}(x)}[z_i z_j] - E_{x \sim p_{\theta}(x)}[z_i z_j | \theta].
\]

which can be further simplified using (6.45). The gradient presents the structure, noted in Chapter 3, given by the difference between a positive component, which depends on the data \( z \), and a negative component, which instead requires an ensemble average over all other possible samples \( x \sim p_{\theta_{\text{gold}}} \). Similar relations can be obtained for the gradients with respect to \( a \) and \( b \), namely

\[
\nabla_a \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = E_{x \sim p_{\theta_{\text{gold}}}(x)}[z] - E_{x \sim p_{\theta}(x)}[z],
\]

and

\[
\nabla_b \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = z_j - E_{x \sim p_{\theta}(x)}[x_j].
\]

In order to evaluate the expectations in the negative components of the gradients, one typically uses a method known as Markov Chain Monte Carlo (MCMC) to be discussed in Chapter 8. More specifically, a simplified approach known as Contrastive Divergence (CD) has been found to be an effective approximation. Accordingly, one “clamps” the visible variables to the observed variables \( x = x \) and generates the sequence \( x \to x^{(1)} \to x^{(2)} \), using the conditional probability (6.45a) to generate \( z \) from \( x \). The resulting samples are used to approximate the expectations as

\[
\nabla_{\theta} \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = E_{x^{(0)}}[z_i^{(0)}] - E_{x^{(1)}}[z_i^{(0)}].
\]

and similar expressions apply for the other gradients. The CD scheme can also be generalized to CD-k by increasing the Markov chain sequence to \( k \) steps, and using the resulting samples \( x^{(k)} \) in lieu of \( x^{(1)} \) and \( x^{(0)} \).

Extensions and application of the RBM are discussed in [20, 153]. Generalizations of RBMs with multiple layers of hidden variables are referred to as Deep Boltzmann Machines. We refer to [64] for discussion about training and applications.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 168

Context: # Unsupervised Learning

The function \( F_E(x) = W^T x \) and the decoder as \( G_D(z) = W z \). The parameters \( \theta \) are hence given by the \( D \times M \) matrix \( W \). By these definitions, the \( M \) columns of \( W \) have the interpretation of linear features as for PCA.

With a quadratic loss function, the learning problem (6.53) is given as:

\[
\min_{W} \frac{1}{N} \sum_{n=1}^{N} \|x_n - W W^T x_n\|^2
\]

This problem can be solved in closed form. The solution is in fact given by the \( M \) principal eigenvectors of the sample covariance matrix \( N^{-1} \sum_{n=1}^{N} x_n x_n^T \). Extensions of PCA that use the kernel trick (Chapter 4) can be also devised (see, e.g., [104]). Furthermore, PCA can be seen to be a special case of dictionary learning by setting in the latter \( \mu = 0 \) and by taking the limit \( \sigma \to 0 \).

## Dictionary Learning

In dictionary learning, the decoder is linear, i.e., \( G_D(z) = Wz \). However, the encoder is limited only by constraints on the set of feasible latent variables \( z \). A typical such constraint is sparsity: the latent vector should have a small number of non-zero elements. Defining as \( C \) the set of feasible values for \( z \), the dictionary learning problem can be formulated as:

\[
\min_{z_n \in C} \frac{1}{N} \sum_{n=1}^{N} \|x_n - W z_n\|^2
\]

The name of the method accounts for the fact that matrix \( W \) can be thought of as the dictionary of \( M \) features – its columns – that are used to describe the data. The problem above is typically solved using alternate optimization. Accordingly, one optimizes over \( W \) for a fixed set of latent vectors \( \{z_n\} \), which is a standard least squares problem; and then the optimizer over each latent vector \( z_n \) for a fixed matrix can be tackled by using standard sparsity-based methods, such as LASSO (Chapter 2).

## Multi-layer Autoencoders

Multi-layer autoencoders.* One can also represent both encoder and decoder as multi-layer neural networks. In this case, the weights of encoder and decoders are typically tied to be the transpose of one another, in a manner similar to PCA. Training is often done first layer-by-layer, e.g., using RBM training, and then via backpropagation across the entire network [64].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 170

Context: ```markdown
where \( 0 < d < 1 \) is a parameter. Hence, the rank of a page is a weighted sum of a generic rank of equal to 1, which enables the choice of new pages, and of an aggregate “vote” from other pages. The latter term is such that if any other page \( j \) votes for page \( i \) it links to page \( i \) and its vote equals its own rank divided by the total number of outgoing links, that is, \( r_j/C_j \). In essence, a page \( i \) is highly ranked if it is linked by pages with high rank. The equation (6.58) can be solved recursively to obtain the ranks of all pages. A variation of PageRank that tailors ranking to an individual’s preferences can be obtained as described in [63].

## 6.9 Summary

In this chapter, we have reviewed the basics of unsupervised learning. A common aspect of all considered approaches is the presence of hidden, or latent, variables that help explain the structure of the data. We have first reviewed ML learning via EM, and variations thereof, for directed and undirected models. We have then introduced the GAN method as a generalization of ML in which the KL divergence is replaced by a divergence measure that is learned from the data. We have then reviewed discriminative models, which may be trained via the InfoMax principle, and autoencoders.

In the next section, we broaden the expressive power of the probabilistic models considered so far by discussing the powerful framework of probabilistic graphical models.
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 172

Context: # 7  
## Probabilistic Graphical Models  

As we have seen in the previous chapters, probabilistic models are widely used in machine learning. Using Fig. 6.1 as an example, we have encountered both directed and undirected models, which have been used to carry out supervised and unsupervised learning tasks. Graphical models encode structural information about the random variables of interest, both observed and latent. They hence provide a principled way to define parametric probabilistic models with desired features.  

The selection of a probabilistic graphical model hence follows the same general rules that have been discussed so far: A more specialized, or structured, model may help reduce overfitting, and hence the generalization gap. This is done, as we will see, by reducing the number of parameters to be learned. On the flip side, specialization may come at the cost of an irrecoverable bias.  

In this chapter, we provide an introduction to the vast field of probabilistic graphical models, which is a powerful framework that allows us to represent and learn structured probabilistic models. The goal here is to introduce the main concepts and tools, while referring to the extensive treatments in [81, 15, 104, 151] for additional information.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 181

Context: # 7.2 Bayesian Networks

## 7.2.2 Global Conditional Independence

As we have discussed, BNs are defined by local conditional independence properties that are encoded in the factorization of a joint distribution and in the supporting DAG. BNs can also be used to assess generic global conditional independence queries of the type: Is a given subset of variables A independent of another set B conditioned on a third subset C of variables? The d-separation algorithm outputs either a positive response or a “maybe” answer to this question. It is briefly described as follows:

1. **Build a subgraph G'** from the original DAG by keeping all vertices in the subsets A, B, and C, as well as all the edges and vertices encountered by moving backwards on the DAG or one more edges from the vertices in A, B, and C.
2. **Build a subgraph G''** from G' by deleting all edges coming out of the vertices in C.
3. **If there is no path**, neglecting the directionality of the edges, between a node in A and a node in B, then the conditional independence relation A ⫫ B | C holds. Else, if one such path exists, then there is at least one joint distribution that factorizes as for the given DAG for which the condition A ⫫ B | C does not hold.

### Example 7.6

Consider the so-called V-structure x → y ← z. Using d-separation, it can be seen that the conditional independence x ⫫ y | z does not hold in grain.

## 7.2.3 Learning

Assume that the DAG of a BN is given. Structure learning, that is, the decision of which edges should be included in the graph based on available training data, is also an important problem that will need to be considered here. Making explicit the dependence of the probability factors on learnable parameters μ, the joint distribution encoded by a BN can be written as 

\[ p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k | \text{Pa}(x_k); \mu_k) \tag{7.12} \]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 182

Context: # Probabilistic Graphical Models

where \(\mu_{k|x_{p}} \) are the parameters defining the conditional distribution \(p(x_{k}|x_{p}) \). Note that the parameters \(\mu_{k|x_{p}} \) are generally different for different values of \(k\) and the parents’ variables \(x_{p}\) (see, e.g., (7.11)). In most cases of interest, the probability distribution \(p(x_{k}|x_{p}; \mu_{k|x_{p}})\) is in the exponential family or is a GLM (see Chapter 3).

As we have already seen in previous examples, the parameters \(\mu_{k|x_{p}}\) can either be **separate**, that is, distinct for each \(k\) and each value of \(x_{p}\), or they can be **tied**. In the latter case, some of the parameters \(\mu_{k|x_{p}}\) are constrained to be equal across different values of \(x_{p}\) and/or across different values of \(k\). As a special case of tied parameters, the value of \(\mu_{k|x_{p}}\) may also be independent of \(x_{p}\), such as in the case for GLMs.

As for the data, we have seen that the terms \(x_{1}, \ldots, x_{N}\) can be either fully observed in the training set, as in supervised learning, or they can be partially observed in unsupervised learning.

For the sake of brevity, we describe learning only for the case of fully observed data with separate parameters, and we briefly mention extensions to the other cases.

## Fully Observed Data with Separate Parameters

We are given a fully observed data set \(D = \{x_{n}\}_{n=1}^{N} \) with each data point written as \(x_{n} = [x_{n,1}, \ldots, x_{n,K}]^{T}\). For concreteness, assume that all variables are categorical. Denoting as \(x_{p,n}\) the parents of variable \(x_{n,k}\), the log likelihood function can be factorized as:

\[
\ln p(D|\mu) = \sum_{n=1}^{N} \ln p(x_{n}|x_{p,n}; \mu_{k|x_{p}}) \tag{7.13a}
\]

\[
= \sum_{k=1}^{K} \sum_{n=1}^{N} \ln p(x_{n,k}|x_{p,n}; \mu_{k|x_{p}}) \tag{7.13b}
\]

\[
= \sum_{k=1}^{K} \sum_{x_{n,k} \in \mathcal{A}_{k}} \ln p(x_{n,k}|x_{p,n}; \mu_{k|x_{p}}). \tag{7.13c}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 198

Context: # Approximate Inference and Learning

is that the M-projection is obtained by moment matching. Specifically, one needs to find a value of parameter ϕ such that the expectations of the sufficient statistics of the model \( q(z|ϕ) \) under distribution \( q(z|ϕ) \) match with the same expectations under the true distribution \( p(z|x) \).

In mathematical terms, the M-projection in the exponential family model \( q(z|ϕ) \propto \exp(ϕ^T u(z)) \), with sufficient statistics \( u(z) \), yields natural parameters \( ϕ^* \) that satisfy the moment matching condition:

\[ 
E_{p(z|x)}[u(z)] = E_{q(z|ϕ)}[u(z)]. 
\]

This derivation is detailed in the Appendix of this chapter. Amortized inference can be defined in a similar way as for I-projection.

![Figure 8.1: Example of I- and M-projections of a mixture of Gaussians distribution (dashed line).](image_link)

## Example 8.2

This simple example is meant to provide an intuitive comparison between the approximations produced by I- and M-projections. To this end, consider a mixture of Gaussians distribution:

\[
p(x|z) = 0.3N(x|μ_1 = -1, σ_1^2 = 0.3) + 0.7N(x|μ_2 = 1, σ_2^2 = 0.3). 
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 204

Context: # Approximate Inference and Learning

where we have defined the learning signal 

\[
\psi_{\theta}(x, z) = \ln q_{\theta}(z) - \ln p(x, z).
\] (8.26)

To obtain (8.25), we have used the identity \(\nabla_{\theta}\ln q_{\theta}(z) = \nabla_{\theta}q_{\theta}(z)/q_{\theta}(z)\), which follows from the chain rule for differentiation, as well as the equality \(\mathbb{E}_{q_{\theta}(z)}[\nabla_{\theta}\ln q_{\theta}(z)] = 0\) (see [25] for details). 

MC-based VI methods evaluate an MC approximation of the gradient (8.25) at a given value \( \phi \) by drawing one or more samples \( z_m \sim q_{\theta}(z) \), with \( m = 1, \ldots, M \), and then computing

\[
\nabla_{\theta}\text{KL}(q_{\theta}(z) || p(x, z)) = \frac{1}{M} \sum_{m=1}^{M} \left[\nabla_{\theta} z_m \right] \psi_{\theta}(x, z_m). \quad (8.27)
\]

This estimate is also known as the likelihood ratio or REINFORCE gradient, and it can be used to update the value of \( \phi \) using SGD (see Sec. 4.1). The name reflects the origin and the importance of the approach in the reinforcement learning literature, as we briefly discuss in Chapter 9.

In practice, these gradients have high variance. This is intuitively due to the fact that this estimator does not use any information about how a change in \( z \) affects the learning signal \( \phi(x, z) \), since it only depends on the value of this signal. Therefore, techniques such as Rao-Blackwellization or control variates need to be introduced in order to reduce its variance (see [80] for a review). Simplifications arise because when the variational posterior is assumed to factorize, e.g., according to the mean-field factorization, this approach is used in the black-box inference method [120].

> Note: In reinforcement learning, it is useful to compute the gradient of the average reward \(\mathbb{E}_{q_{\theta}(z)}[R(x, a)]\) with respect to the parameters \(\theta\), defining the distribution \(q_{\theta}(z)\) of the action \(a\) given the current state \(x\) of the environment. The reward \(R(x, a)\) is a function, possibly stochastic, of the action of the state.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 206

Context: 200
# Approximate Inference and Learning

can directly benefit from approximate inference in evaluating the posterior of the parameters and the variational posterior, we focus here on the frequentist viewpoint.

## ML Learning in the Presence of Hidden Variables

ML learning in the presence of hidden variables can be approximated by maximizing the ELBO with respect to both the model parameters \( \theta \) that define the forward model \( p(z | x) \) and the parameters \( \phi \) of the variational (amortized) model \( q(z | x, \phi) \). To understand what is accomplished with this optimization, it is useful to write the ELBO over a data set \( D = \{x_n\}_{n=1}^N \) using (6.14) as

\[
\sum_{n=1}^N \log p(x_n | \theta) - \text{KL}(q(z | x_n, \phi) \| p(z | x_n, \theta)). \tag{8.30}
\]

As such, for any fixed \( \phi \), optimizing the ELBO over \( \theta \) maximizes the likelihood function in the first term under a variational regularization term that penalizes posteriors \( p(z | x) \) that are significantly different from the selected variational posteriors \( q(z | x, \phi) \). The choice of a given model for the variational posterior hence drives learning, and should be treated as model or hyperparameter selection [68].

The maximization of the ELBO over both model and variational parameters can be carried out in different ways. As a first approach, one can use EM by performing the E step via approximate inference to evaluate the posterior of the latent variables. When V is used for this purpose, the resulting scheme is known as variational EM. Alternatively, one can use SGD with respect to both parameter vectors \( \theta \) and \( \phi \) by leveraging the REINFORCE method or the reparametrization trick. The reparametrization trick approach is for instance used in the VAE method for generative modelling [80], also known as Deep Generative Latent Models [122] (see also [99]). The KL divergence can also be substituted by an adversarially learned divergence as in the GAN approach [47] (see Sec. 6.4.3).

When the variational parameters are updated using an M-projection, rather than the I-projection that results from the maximization of the ELBO, the approach of jointly optimizing model and variational parameters yields the wake-sleep algorithm [65].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 209

Context: # 9
## Concluding Remarks

This monograph has provided a brief introduction to machine learning by focusing on parametric probabilistic models for supervised and unsupervised learning problems. It has endeavored to describe fundamental concepts within a unified treatment starting from first principles. Throughout the text, we have also provided pointers to advanced topics that we have only been able to mention or shortly touch upon. Here, we offer a brief list of additional important aspects and open problems that have not been covered in the preceding chapters.

- **Privacy:** In many applications, data sets used to train machine learning algorithms contain sensitive private information, such as personal preferences for recommendation systems. It is hence important to ensure that the learned model does not reveal any information about the individual entries of the training set. This constraint can be formulated using the concept of differential privacy. Typical techniques to guarantee privacy of individual data points include adding noise to gradients when training via SGD and relying on mixtures of experts trained with different subsets of the data [1].

- **Robustness:** It has been reported that various machine learning models, including neural networks, are sensitive to small variations in input data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 210

Context: # Concluding Remarks

The data, producing the wrong response upon minor, properly chosen, changes in the explanatory variables. Note that such adversarially chosen examples, which cause a specific machine to fail, are conceptually different from the randomly selected examples that are assumed when defining the generalization properties of a network. There is evidence that finding such examples is possible even without knowing the internal structure of a machine, but solely based on black-box observations [11]. Modifying the training procedure in order to ensure robustness to adversarial examples is an active area of research with important practical implications [55].

## Computing Platforms and Programming Frameworks

In order to scale up machine learning applications, it is necessary to leverage distributed computing architectures and related standard programming frameworks [17, 7]. As a complementary and more futuristic approach, recent work has been proposed to leverage the capabilities of an annealing-based quantum computers as samplers [82] or discrete optimizers [103].

## Transfer Learning

Machines trained for a certain task currently need to be re-trained in order to be re-purposed for a different task. For instance, a machine that learned how to drive a car would need to be retrained in order to learn how to drive a truck. The field of transfer learning covers scenarios in which one wishes to transfer the expertise acquired from some tasks to others. Transfer learning includes different related paradigms, such as multitask learning, lifelong learning, zero-shot learning, and domain adaptation [149]. 

In multitask learning, several tasks are learned simultaneously. Typical solutions for multitask learning based on neural networks preserve the presence of common representations among neural networks trained for different tasks [19]. Lifelong learning utilizes a machine trained on a number of tasks to carry out a new task by leveraging the knowledge accumulated during the previous training phases [143]. Zero-shot learning refers to models capable of recognizing unseen classes with training examples available only for related, but different, classes. This often entails the task of learning representation of classes, such as prototype vectors, that generate data in the class through a fixed probabilistic mechanism [52]. Domain adaptation will be discussed separately in the next point.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 211

Context: - **Domain adaptation:** In many learning problems, the available data has a different distribution from the data on which the algorithm will be tested. For instance, in speech recognition, one has available data for a given user at learning time, but it may be desirable, after learning, to use the same machine for another user. For supervised learning, this is typically modeled by assuming that the distribution of the covariates \( x \) is different during training and test, while the discriminative conditional distribution \( p(y|x) \) is the same for both phases [149]. A generalization of PAC theory analyzes domain adaptation, obtaining bounds on the generalization error under the desired test distribution as a function of the difference between the training and test distributions [26].

- **Communication-efficient learning:** In distributed computing platforms, data is typically partitioned among the processors and communication among the processors entails latency and energy consumption. An important research problem is that of characterizing the best trade-off between learning performance and communication overhead [160].

- **Reinforcement learning:** Reinforcement learning is at the heart of recent successes of machine learning methods in acquiring the skills necessary to play video games or games against human opponents (see, e.g., [99]). In reinforcement learning, one wishes to learn the optimal mapping, say \( \phi(x, \theta) \), between the observed state \( x \) of the world and an action \( a \). Unlike supervised learning, the optimal action \( a \) is not known, but the machine observes a reward/punishment signal depending on the effect of the action. A popular approach, referred to as deep reinforcement learning, models the mapping \( \phi(x, \theta) \) using a deep neural network. This is trained to maximize the average reward via SGD by using the REINFORCE method (Chapter 8) to estimate the gradient [135, 88, 7, 9].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 235

Context: # References

[136] Smith, S. L., P.-J. Kindermans, and Q. V. Le. 2017. “Don’t Decay the Learning Rate, Increase the Batch Size.” ArXiv e-prints. Nov. arXiv:1711.00168 [cs.LG].

[137] Spectrum. “Will the Future of AI Learning Depend More on Nature or Nurture?” URL: https://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence.

[138] Stigler, S. M. 2016. *The Seven Pillars of Statistical Wisdom.* Harvard University Press.

[139] Subramanian, S., T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. 2006. “Online outlier detection in sensor data using non-parametric models.” In *Proceedings of the 32nd International Conference on Very Large Data Bases.* VLDB Endowment. 187–198.

[140] Sugiyama, M., T. Suzuki, and T. Kanamori. 2012. *Density Estimation in Machine Learning.* Cambridge University Press.

[141] Sun, Y., P. Babu, and D. P. Palomar. 2017. “Majorization-minimization algorithms in signal processing, communications, and machine learning.” *IEEE Transactions on Signal Processing.* 65(3): 794–816.

[142] Tegmark, M. 2017. *Life 3.0: Being Human in the Age of Artificial Intelligence.* Allen Lane.

[143] Thrun, S. 1996. “Is learning the n-th thing any easier than learning the first?” In *Advances in Neural Information Processing Systems.* 640–646.

[144] Times, T. N. Y. 1958. *New Navy Device Learns by Doing; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser.* URL: http://www.nytimes.com/1958/07/08/archives/new-baby-device-learns-by-doing-psychologist-shows-embryo-of.html.

[145] Tishby, N., F. C. Pereira, and W. Bialek. 2000. “The information bottleneck method.” *arXiv preprint* physics/0004057.

[146] Tsybakov, A. B. 2009. *Introduction to Nonparametric Estimation.* 

[147] Turner, R. E. and M. Sahayani. 2011. “Two problems with variable-selection maximisation for time-series models.” *Bayesian Time Series Models:* 115–138.

[148] Ur, Pyg. *Deep Universal Probabilistic Programming.* URL: http://pyro.ai/.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 1

Context: # A First Encounter with Machine Learning

**Max Welling**  
*Donald Bren School of Information and Computer Science*  
*University of California Irvine*

November 4, 2011
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 3

Context: # Contents

Preface iii  
Learning and Intuition vii  

1. Data and Information  
   1.1 Data Representation ...................................................... 2  
   1.2 Preprocessing the Data .................................................. 4  

2. Data Visualization .................................................................. 7  

3. Learning  
   3.1 In a Nutshell ............................................................... 15  

4. Types of Machine Learning  
   4.1 In a Nutshell ............................................................... 20  

5. Nearest Neighbors Classification  
   5.1 The Idea In a Nutshell .................................................. 23  

6. The Naive Bayesian Classifier  
   6.1 The Naive Bayes Model .................................................. 25  
   6.2 Learning a Naive Bayes Classifier ............................... 27  
   6.3 Class-Prediction for New Instances ............................. 28  
   6.4 Regularization ............................................................... 30  
   6.5 Remarks .................................................................... 31  
   6.6 The Idea In a Nutshell .................................................. 32  

7. The Perceptron  
   7.1 The Perceptron Model .................................................. 34
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 4

Context: ```
# CONTENTS

7.2 A Different Cost function: Logistic Regression .......................... 37  
7.3 The Idea In a Nutshell .................................................. 38  

# 8 Support Vector Machines ................................................ 39  
8.1 The Non-Separable case ................................................. 43  

# 9 Support Vector Regression .............................................. 47  

# 10 Kernel ridge Regression ............................................... 51  
10.1 Kernel Ridge Regression ............................................... 52  
10.2 An alternative derivation ............................................. 53  

# 11 Kernel K-means and Spectral Clustering ............................... 55  

# 12 Kernel Principal Components Analysis ................................ 59  
12.1 Centering Data in Feature Space ....................................... 61  

# 13 Fisher Linear Discriminant Analysis .................................. 63  
13.1 Kernel Fisher LDA .................................................... 66  
13.2 A Constrained Convex Programming Formulation of FDA ................. 68  

# 14 Kernel Canonical Correlation Analysis ................................ 69  
14.1 Kernel CCA ............................................................ 71  

# A Essentials of Convex Optimization ..................................... 73  
A.1 Lagrangians and all that ............................................... 73  

# B Kernel Design .......................................................... 77  
B.1 Polynomials Kernels .................................................... 77  
B.2 All Subsets Kernel ...................................................... 78  
B.3 The Gaussian Kernel ..................................................... 79  
```
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 5

Context: # Preface

In winter quarter 2007, I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at a graduate level, it became soon clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is built upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities, and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them to be too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction into the concepts of machine learning. 

A first read to wet the appetite so to speak, a prelude to the more technical and advanced textbooks. Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer.

Machine learning is a relatively recent discipline that emerged from the general field of artificial intelligence only quite recently. To build intelligent machines, researchers realized that these machines should learn from and adapt to their environment. It is simply too costly and impractical to design intelligent systems by first gathering all the expert knowledge ourselves and then hard-wiring it into a machine. For instance, after many years of intense research, we can now recognize faces in images to a high degree of accuracy. But the world has approximately 30,000 visual object categories according to some estimates (Biederman). Should we invest the same effort to build good classifiers for monkeys, chairs, pencils, axes, etc., or should we build systems to can observe millions of training images, some with labels (e.g., in these pixels in the image correspond to a car) but most of them without side information? Although there currently is no system which can recognize even in the order of 1000 object categories (the best system can get).
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 6

Context: # PREFACE

about 60% correct on 100 categories, the fact that we pull it off seemingly effortlessly serves as a “proof of concept” that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data.

The first reason for the recent successes of machine learning and the growth of the field as a whole is rooted in its multidisciplinary character. Machine learning emerged from AI but quickly incorporated ideas from fields as diverse as statistics, probability, computer science, information theory, convex optimization, control theory, cognitive science, theoretical neuroscience, physics, and more. To give an example, the main conference in this field is called: **advances in neural information processing systems**, referring to information theory and theoretical neuroscience and cognitive science.

The second, perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. While the field is built on theory and tools developed statistics machine learning recognizes that the most exciting progress can be made to leverage the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text, and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data all previous satellites produced together. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on such larger data-sets (the latter sometimes called **data-mining**). Hence the emphasis shifted to algorithmic efficiency and as a result many machine learning faculty (like myself) can typically be found in computer science departments. To give some examples of recent successes of this approach one would only have to turn on one computer and perform an internet search. Modern search engines do not run terribly sophisticated algorithms, but they manage to store and sift through almost the entire content of the internet to return sensible search results. There has also been much success in the field of machine translation, not because a new model was invented but because many more translated documents became available.

The field of machine learning is multifaceted and expanding fast. To sample a few sub-disciplines: statistical learning, kernel methods, graphical models, artificial neural networks, fuzzy logic, Bayesian methods, and so on. The field also covers many types of learning problems, such as supervised learning, supervised learning, semi-supervised learning, active learning, reinforcement learning, etc. I will only cover the most basic approaches in this book from a highly per
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 10

Context: # LEARNING AND INTUITION

Baroque features or a more “dull” representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvelous book “XXX” (Hardamard).

By building accurate visual representations of abstract ideas, we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way, I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her database which therefore needs no and very little updating.

Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you’ll find that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work, suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with.

In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 13

Context: # Chapter 1

## Data and Information

Data is everywhere in abundant amounts. Surveillance cameras continuously capture video; every time you make a phone call your name and location get recorded. Often, your clicking pattern is recorded when surfing the web. Most financial transactions are recorded; satellites and observatories generate terabytes of data every year; the FBI maintains a DNA database of most convicted criminals. Soon, all written text from our libraries is digitized. Need I go on?

But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? When analyzing data, we typically have a specific question in mind, such as: 

- “How many types of car can be discerned in this video?” 
- “What will be the weather next week?”

The answer can take the form of a single number (there are 5 cars), or a sequence of numbers (or the temperature next week) or a complicated pattern (the cloud configuration next week). If the answer to our query is complex, we like to visualize it using graphs, bar plots, or even little movies. One should keep in mind that the particular analysis depends on the task one has in mind.

Let me spell out a few tasks that are typically considered in machine learning:

### Prediction

Here we ask ourselves whether we can extrapolate the information in the data to new use cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold, etc., and another database of attributes for Ferraris, then we can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather given all the recorded weather patterns in the past: can we predict the weather next week, or the stock prices?
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 15

Context: # 1.1. DATA REPRESENTATION

Most datasets can be represented as a matrix, \( X = [X_{n,k}] \), with rows indexed by "attribute-index" \( i \) and columns indexed by "data-index" \( n \). The value \( X_{n,k} \) for attribute \( i \) and data-case \( n \) can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure weight and color of 100 cars, the matrix \( X \) is \( 2 \times 100 \) dimensional and \( X_{1,20} = 20,684.57 \) is the weight of car nr. 20 in some units (a real value) while \( X_{2,20} = 2 \) is the color of car nr. 20 (say one of 6 predefined colors).

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often a word was present. Say the word "book" is defined to have nr. 10, 568 in the vocabulary then \( X_{1,1005} = 4 \) would mean: the word book appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images about cats. You’ll retrieve a large variety of images most with a different number of pixels. We can either try to resize the images to a common size or we can simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but it couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document some letters will not be recognized. We’ll use a question mark “?” to indicate that that entry wasn’t observed.

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in some representation the structure may be obvious while in other representation it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as, “Humans and Ferraries can be separated with a line,” see figure ?. While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to re-code these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about in which representation the structure is as obvious as possible and transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations. It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you if the structure is a good match for the algorithm you had in mind for further analysis. Chapter ? will discuss some elementary visualization techniques.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 23

Context: # Chapter 3

## Learning

This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter.

Ok, let’s start with an example. Alice has a rather strange ailment. She is not able to recognize objects by their visual appearance. At home, she is doing just fine: her mother explained to Alice for every object in her house what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room, she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting." 

What is the matter with Alice? Nothing is wrong with her memory because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object, she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice’s disease is so rare is understandable; there must have been a strong selection pressure against this disease. 

Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to infer the possibility that this animal may be dangerous logically. Alice’s contemporaries noticed that the animal was yellow-brown, had manes, etc., and immediately un-
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 24

Context: # CHAPTER 3. LEARNING

They understood that this was a lion. They understood that all lions have these particular characteristics in common, but may differ in some other ones (like the presence of a scar or scar someplace).

Bob has another disease which is called over-generalization. Once he has seen an object he believes almost everything is some, perhaps twisted instance of the same object class (In fact, I seem to suffer from this so now and then when I think all of machine learning can be explained by this one new exciting principle). If ancestral Bob walks the savanna and he has just encountered an instance of a lion and fled into a tree with his buddies, the next time he sees a squirrel he believes it is a small instance of a dangerous lion and flees into the trees again. Over-generalization seems to be rather common among small children.

One of the main conclusions from this discussion is that we should neither over-generalize nor over-fit. We need to be on the edge of being just right. But just right about what? It doesn’t seem there is one correct God-given definition of the category chairs. We seem to all agree, but one can surely find examples that would be difficult to classify. When do we generalize exactly right? The magic word is **PREDICTION**. From an evolutionary standpoint, all we have to do is make correct predictions about aspects of life that help us survive. Nobody really cares about the definition of lion, but we do care about our responses to the various animals (run away for lion, chase for deer). And there are a lot of things that can be predicted in the world. This food kills me but food is good for me. Drumming my fists on my hairy chest in front of a female generates opportunities for sex; sticking my hand into that yellow-orange flickering “flame” hurts my hand and so on. The world is wonderfully predictable and we are very good at predicting it.

So why do we care about object categories in the first place? Well, apparently they help us organize the world and make accurate predictions. The category *lions* is an abstraction and abstractions help us to generalize. In a certain sense, learning is all about finding useful abstractions or concepts that describe the world. Take the concept “fluid”; it describes all watery substances and summarizes some of their physical properties. Or the concept of “weight”: an abstraction that describes a certain property of objects.

Here is one very important corollary for you: 

> “machine learning is not in the business of remembering and regurgitating observed information, it is in the business of transferring (generalizing) properties from observed data onto new, yet unobserved data.” 

This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the final exam).

The information we receive from the world has two components to it: there
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 26

Context: # CHAPTER 3. LEARNING

Now let’s think for a moment what we really mean with “a model”. A model represents our prior knowledge of the world. It imposes structure that is not necessarily present in the data. We call this the **inductive bias**. Our inductive bias often comes in the form of a parameterized model. That is to say, we define a family of models but let the data determine which of these models is most appropriate. A strong inductive bias means that we don’t leave flexibility in the model for the data to work on. We are so convinced of ourselves that we basically ignore the data. The downside is that if we are creating a “bad bias” towards to wrong model. On the other hand, if we are correct, we can learn the remaining degrees of freedom in our model from very few data-cases. Conversely, we may leave the door open for a huge family of possible models. If we now let the data zoom in on the model that best explains the training data it will overfit to the peculiarities of that data. 

Now imagine you sampled 10 datasets of the same size \(N\) and train these very flexible models separately on each of these datasets (note that in reality you only have access to one such dataset but please play along in this thought experiment). Let’s say we want to determine the value of some parameter \(\theta\). Because the models are so flexible, we can actually model the idiosyncrasies of each dataset. The result is that the value for \(\theta\) is likely to be very different for each dataset. But because we didn’t impose much inductive bias the average of many of such estimates will be about right. We say that the bias is small, but the variance is high. In the case of very restrictive models the opposite happens: the bias is potentially large but the variance small. Note that not only is a large bias bad (for obvious reasons), a large variance is bad as well: because we only have one dataset of size \(N\), our estimate could be very far off simply we were unlucky with the dataset we were given. 

What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. If we don’t trust our prior knowledge we should let the data speak. However, letting the data speak too much might lead to overfitting, so we need to find the boundary between too complex and too simple a model and get its complexity just right. Access to more data means that the data can speak more relative to prior knowledge. That, in a nutshell, is what machine learning is all about.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 27

Context: # 3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. 

Learning = generalization = abstraction = compression.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 28

Context: # CHAPTER 3. LEARNING

## Introduction

Learning involves acquiring knowledge, skills, attitudes, or competencies. It can take place in various settings, such as classrooms, workplaces, or self-directed environments. This chapter discusses the types and processes of learning.

## Types of Learning

1. **Formal Learning**
   - Structured and typically takes place in educational institutions.
   - Includes degrees, diplomas, and certifications.

2. **Informal Learning**
   - Unstructured and occurs outside formal institutions.
   - Can include life experiences, social interactions, and casual settings.

3. **Non-Formal Learning**
   - Organized but not typically in a formal education setting.
   - Often community-based, such as workshops and training sessions.

## Learning Processes

- **Cognitive Learning**: Involves mental processes and understanding. 
- **Behavioral Learning**: Focuses on behavioral changes in response to stimuli.
- **Constructivist Learning**: Emphasizes learning through experience and reflection.

## Table of Learning Theories

| Theory                    | Key Contributor      | Description                                      |
|--------------------------|----------------------|--------------------------------------------------|
| Behaviorism              | B.F. Skinner         | Learning as a change in behavior due to reinforcement. |
| Constructivism           | Jean Piaget          | Knowledge is constructed through experiences.    |
| Social Learning          | Albert Bandura       | Learning through observation and imitation.      |

## Conclusion

Understanding the different types and processes of learning can help educators and learners optimize educational experiences. Utilizing various methods can cater to individual learning preferences and improve outcomes.

## References

- Smith, J. (2020). *Learning Theories: A Comprehensive Approach*. Educational Press.
- Doe, A. (2019). *Techniques for Effective Learning*. Learning Publishers.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 29

Context: # Chapter 4

## Types of Machine Learning

We now will turn our attention and discuss some learning problems that we will encounter in this book. The most well-studied problem in ML is that of **supervised learning**. To explain this, let’s first look at an example. Bob wants to learn how to distinguish between bobcats and mountain lions. He types these words into Google Image Search and closely studies all catlike images of bobcats on the one hand and mountain lions on the other. Some months later on a hiking trip in the San Bernardino mountains he sees a big cat... 

The data that Bob collected was labelled because Google is supposed to only return pictures of bobcats when you search for the word "bobcat" (and similarly for mountain lions). Let’s call the images \( X_1, \ldots, X_n \), and the labels \( Y_1, \ldots, Y_n \). Note that \( X_i \) are much higher dimensional objects because they represent all the information extracted from the image (approximately 1 million pixel color values), while \( Y_i \) is simply −1 or 1 depending on how we choose to label our classes. So, that would be a ratio of about 1 million to 1 in terms of information content! The classification problem can usually be posed as finding (a.k.a. learning) a function \( f(x) \) that approximates the correct class labels for any input \( x \). For instance, we may decide that \( \text{sign}(f(x)) \) is the predictor for our class label. In the following, we will be studying quite a few of these classification algorithms.

There is also a different family of learning problems known as **unsupervised learning** problems. In this case there are no labels \( Y \) involved, just the features \( X \). Our task is not to classify, but to organize the data, or to discover the structure in the data. This may be very useful for visualization data, compressing data, or organizing data for easy accessibility. Extracting structure in data often leads to the discovery of concepts, topics, abstractions, factors, causes, and more such terms that all really mean the same thing. These are the underlying semantic structures.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 30

Context: # CHAPTER 4. TYPES OF MACHINE LEARNING

Factors that can explain the data. Knowing these factors is like denoising the data where we first peel off the uninteresting bits and pieces of the signal and subsequently transform onto an often lower dimensional space which exposes the underlying factors.

There are two dominant classes of unsupervised learning algorithms: clustering-based algorithms assume that the data organizes into groups. Finding these groups is then the task of the ML algorithm and the identity of the group is the semantic factor. Another class of algorithms strives to project the data onto a lower dimensional space. This mapping can be nonlinear, but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Unrolling that manifold is then the task of the learning algorithm. In this case, the dimensions should be interpreted as semantic factors.

There are many variations on the above themes. For instance, one is often confronted with a situation where you have access to many more unlabeled data (only \(X_i\)) and many fewer labeled instances (both \((X_i, Y_i)\)). Take the task of classifying news articles by topic (weather, sports, national news, international etc.). Some people may have labeled some news-articles by hand but there won’t be all that many of those. However, we do have a very large digital library of scanned newspapers available. Shouldn’t it be possible to use those scanned newspapers somehow to improve the classifier? Imagine that the data naturally clusters into well-separated groups (for instance because news articles reporting on different topics use very different words). This is depicted in Figure ??. Note that there are only very few cases which have labels attached to them. From this figure it becomes clear that the expected optimal decision boundary nicely separates these clusters. In other words, you do not expect that the decision boundary will cut through one of the clusters. Yet that is exactly what would happen if you would only be using the labeled data. Hence, by simply requiring that decision boundaries do not cut through regions of high probability we can improve our classifier. The subfield that studies how to improve classification algorithms using unlabeled data goes under the name “semi-supervised learning”.

A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food. While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). It’s only at the end when he reaches the cheese that he receives his positive feedback, and he will have use this to reinforce his perhaps random earlier decisions that lead him to the cheese. These problems are generally known as reinforcement learning.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 31

Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generally also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. 

This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it?

Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would be pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze. Given that it needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning".

One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem if recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a fan of a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don’t have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it is hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning".
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 32

Context: # Chapter 4: Types of Machine Learning

## 4.1 In a Nutshell

There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes, unsupervised learning tries to discover interesting structure in data, semi-supervised learning uses both labeled and unlabeled data to improve predictive performance, reinforcement learning can handle simple feedback in the form of delayed reward, active learning optimizes the next sample to include in the learning algorithm, and multi-task learning deals with sharing common model components between related learning tasks.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 36

Context: # CHAPTER 5. NEAREST NEIGHBORS CLASSIFICATION

## 5.1 Introduction

The nearest neighbors classification is a type of instance-based learning. It is a method used in machine learning that classifies a data point based on how its neighbors are classified.

## 5.2 Algorithm

The basic algorithm for k-nearest neighbors (k-NN) can be summarized as follows:

1. **Choose the number of neighbors, k.**
2. **Calculate the distance** between the new data point and all existing data points.
3. **Sort the distances** and determine the nearest k neighbors.
4. **Vote** for the most common class among the k neighbors.

## 5.3 Distance Metrics

Common distance metrics include:

- **Euclidean Distance**
- **Manhattan Distance**
- **Minkowski Distance**

| Distance Metric | Formula                                    |
|------------------|--------------------------------------------|
| Euclidean        | \(d(x, y) = \sqrt{\sum{(x_i - y_i)^2}}\) |
| Manhattan        | \(d(x, y) = \sum{|x_i - y_i|}\)           |
| Minkowski        | \(d(x, y) = \left( \sum{|x_i - y_i|^p} \right)^{1/p}\) |

## 5.4 Applications

Nearest neighbors classification can be applied in various fields such as:

- **Image Recognition**
- **Recommendation Systems**
- **Medical Diagnosis**

## 5.5 Pros and Cons

### Pros
- Simple and easy to implement.
- Effective in high-dimensional spaces.

### Cons
- Computation cost can be high with large datasets.
- Sensitive to irrelevant features and the scale of the data.

## 5.6 Conclusion

The nearest neighbors classification is a powerful technique in machine learning. By understanding its algorithm and applications, one can leverage this method effectively for various tasks.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 39

Context: ```
6.2 Learning a Naive Bayes Classifier
===============================

Given a dataset, \(\{(X_i, Y_i), i = 1...D, n = 1...N\}\), we wish to estimate what these probabilities are. To start with the simplest one, what would be a good estimate for the number of the percentage of spam versus ham emails that our imaginary entity uses to generate emails? Well, we can simply count how many spam and ham emails we have in our data. This is given by:

\[
P(spam) = \frac{# \, spam \, emails}{total \, emails} = \frac{\sum I_{Y_i = 1}}{N} \tag{6.1}
\]

Here we mean with \(I[A = a]\) a function that is only equal to 1 if its argument is satisfied, and zero otherwise. Hence, in the equation above it counts the number of instances that \(Y_n = 1\). Since the remainder of the emails must be ham, we also find that

\[
P(ham) = 1 - P(spam) = \frac{# \, ham \, emails}{total \, emails} = \frac{\sum I_{Y_i = 0}}{N} \tag{6.2}
\]

where we have used that \(P(ham) + P(spam) = 1\) since an email is either ham or spam.

Next, we need to estimate how often we expect to see a certain word or phrase in either a spam or a ham email. In our example, we could for instance ask ourselves what the probability is that we find the word "viagra" \(k\) times, with \(k = 0, 1, > 1\), in a spam email. Let's recode this as \(X_{viagra} = 0\) meaning that we didn't observe "viagra", \(X_{viagra} = 1\) meaning that we observed it once and \(X_{viagra} = 2\) meaning that we observed it more than once. The answer is again that we can count how often these events happened in our data and use that as an estimate for the real probabilities according to which it generated emails. First for spam we find,

\[
P_{spam}(X_i = j) = \frac{# \, spam \, emails \, for \, which \, the \, word \, j \, was \, found \, t \, times}{total \, of \, spam \, emails} = \frac{\sum I_{X_n = j \land Y_n = 1}}{\sum I_{Y_i = 1}} \tag{6.3}
\]

Here we have defined the symbol \(\land\) to mean that both statements to the left and right of this symbol should hold true in order for the entire sentence to be true.
```
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 44

Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER

## Introduction

The Naive Bayesian Classifier is a simple yet powerful algorithm based on Bayes' theorem, used extensively in machine learning for classification tasks.

## Key Concepts

1. **Bayes' Theorem**: 
   \[
   P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
   \]
   
2. **Assumptions**:
   - Features are statistically independent given the class label.
   - The prior distribution of each class.

## Types of Naive Bayes Classifiers

- **Gaussian Naive Bayes**: Assumes continuous data follows a normal distribution.
- **Multinomial Naive Bayes**: Primarily used for discrete features, such as word counts.
- **Bernoulli Naive Bayes**: Similar to multinomial but assumes binary features.

## Classification Process

1. **Calculate Prior Probabilities**:
   \[
   P(C_k) = \frac{N_k}{N}
   \]

2. **Calculate Likelihoods**:
    - For Gaussian:
    \[
    P(x|C_k) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
    \]

3. **Apply Bayes' Theorem**:
    \[
    P(C_k|x) = \frac{P(x|C_k) \cdot P(C_k)}{P(x)}
    \]

4. **Choose Class with Maximum Posterior**:
   \[
   \hat{C} = \arg\max_{C_k} P(C_k|x)
   \]

## Advantages

- Easy to implement.
- Requires a small amount of training data.
- Performs well in multi-class problems.

## Disadvantages

- Assumption of feature independence is often violated in real-world applications.
- Poor performance when features are highly correlated.

## Conclusion

The Naive Bayesian Classifier serves as an excellent baseline for classification tasks and is crucial for understanding more advanced techniques in machine learning.

## References

- [Pattern Recognition and Machine Learning](http://www.example.com)
- [Machine Learning: A Probabilistic Perspective](http://www.example.com)
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 45

Context: # Chapter 7

## The Perceptron

We will now describe one of the simplest parametric classifiers: the **perceptron** and its cousin the **logistic regression** classifier. However, despite its simplicity, it should not be under-estimated! It is the workhorse for most companies involved with some form of machine learning (perhaps tying with the **decision tree** classifier). One could say that it represents the canonical parametric approach to classification where we believe that a straight line is sufficient to separate the two classes of interest. An example of this is given in Figure ??, where the assumption that the two classes can be separated by a line is clearly valid.

However, this assumption need not always be true. Looking at Figure ??, we clearly observe that there is no straight line that will do the job for us. What can we do? Our first inclination is probably to try and fit a more complicated separation boundary. However, there is another trick that we will be using often in this book. Instead, we can increase the dimensionality of the space by “measuring” more things about the data. The features can be highly nonlinear functions. The simplest choice may be to also measure \( \phi_k(X) = X^2_k \), for each attribute \( X_k \). But we may also measure cross-products such as \( \phi_{ij}(X) = X_i X_j \), \( \forall i, j \). The latter will allow you to explicitly model correlations between attributes. For example, if \( X_i \) represents the presence (1) or absence (0) of the word “viagra” and similarly for \( X_j \) the presence/absence of the word “dysfunction”, then the cross product feature \( X_i X_j \) lets you model the presence of both words simultaneously (which should be helpful in trying to find out what this document is about). We can add as many features as we like, adding another dimension for every new feature. In this higher dimensional space, we can now be more confident in assuming that the data can be separated by a line.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 48

Context: # CHAPTER 7. THE PERCEPTRON

where we have rewritten \( w^TX_n = \sum_k w_kX_{nk} \). If we minimize this cost, then \( w^TX_n - \alpha \) tends to be positive when \( Y_n = +1 \) and negative when \( Y_n = -1 \). This is what we want! Once optimized, we can then easily use our optimal parameters to perform prediction on new test data \( X_{\text{test}} \) as follows:

\[
\hat{Y}_{\text{test}} = \text{sign}\left(\sum_k w_k^T X_{\text{test}} - \alpha\right) \tag{7.3}
\]

where \( \hat{Y} \) is used to indicate the predicted value for \( Y \).

So far so good, but how do we obtain our values for \((w^*, \alpha^*)\)? The simplest approach is to compute the gradient and slowly descend on the cost function (see appendix ?? for background). In this case, the gradients are simple:

\[
\nabla_w C(w, \alpha) = -\frac{1}{n} \sum_{i=1}^{n}(Y_n - w^TX_n + \alpha)X_n = -X(Y - X^Tw + \alpha) \tag{7.4}
\]

\[
\nabla_\alpha C(w, \alpha) = \frac{1}{n} \sum_{i=1}^{n}(Y_n - w^TX_n + \alpha) = (Y - X^Tw + \alpha) \tag{7.5}
\]

where in the latter matrix expression we have used the convention that \( X \) is the matrix with elements \( X_{nk} \). Our gradient descent is now simply given as,

\[
w_{t+1} = w_t - \eta \nabla_w C(w_t, \alpha_t) \tag{7.6}
\]

\[
\alpha_{t+1} = \alpha_t - \eta \nabla_\alpha C(w_t, \alpha_t) \tag{7.7}
\]

Iterating these equations until convergence will minimize the cost function. One may criticize plain vanilla gradient descent for many reasons. For example, you need to be carefully choose the stepsize \( \eta \) or risk either excruciatingly slow convergence or exploding values of the iterates \( w_t \). Even if convergence is achieved asymptotically, it is typically slow. Using a Newton-Raphson method will improve convergence properties considerably but is also very expensive. Many methods have been developed to improve the optimization of the cost function, but that is not the focus of this book.

However, I do want to mention a very popular approach to optimization on very large datasets known as “stochastic gradient descent”. The idea is to select a single data-item randomly and perform an update on the parameters based on that:

\[
w_{t+1} = w_t + \eta(Y_n - w^TX_n + \alpha)X_n \tag{7.8}
\]

\[
\alpha_{t+1} = \alpha_t + \eta(Y_n - w^TX_n + \alpha) \tag{7.9}
\]
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 58

Context: # CHAPTER 8: SUPPORT VECTOR MACHINES

## Introduction

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outliers detection.

## Key Concepts

1. **Hyperplane**: A hyperplane is a decision boundary that helps to categorize data points.

2. **Support Vectors**: Support vectors are the data points that are closest to the hyperplane and influence its position and orientation.

3. **Margin**: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin.

## Implementation Steps

- **Step 1**: Choose the appropriate kernel (linear, polynomial, RBF).
- **Step 2**: Train the SVM model using the training dataset.
- **Step 3**: Evaluate the model using a test dataset.

## Advantages of SVM

- Effective in high-dimensional spaces.
- Robust against overfitting, especially in high-dimensional datasets.

## Disadvantages of SVM

- Less effective on very large datasets.
- Poor performance with overlapping classes.

## Conclusion

Support Vector Machines are powerful tools for classification and regression tasks, offering advantages in high-dimensional spaces while having limitations in very large datasets.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 80

Context: # CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS

## Introduction

Fisher Linear Discriminant Analysis (LDA) is a supervised learning method used mainly for classification. It projects data onto a lower-dimensional space while maximizing class separability.

## Key Concepts

1. **Discriminant Function**  
   The goal is to find a linear combination of features that separates two or more classes of objects or events.

2. **LDA vs. PCA**  
   - LDA focuses on maximizing the separation between multiple classes, while Principal Component Analysis (PCA) focuses on variance.

3. **Regularization**  
   Techniques can be applied to improve results when the number of features is high compared to the number of samples.

## Mathematical Formulation

Given a set of classes \(C_1, C_2, ..., C_k\), let \( \mu_i \) be the mean of class \( C_i \) and \( S_w \) and \( S_b \) the within-class and between-class scatter matrices, respectively. The LDA objective is to maximize:

\[
J(w) = \frac{|S_b|}{|S_w|}
\]

where \(w\) is the projection vector.

## Implementation Steps

1. Calculate the mean vectors for each class.
2. Compute the scatter matrices \( S_w \) and \( S_b \).
3. Solve the generalized eigenvalue problem for the matrix \( S_w^{-1}S_b \).
4. Choose the top eigenvectors to form a new feature space.

## Use Cases

- Face recognition
- Medical diagnosis
- Marketing segmentation

## Conclusion

Fisher LDA is a powerful tool for classification tasks with a focus on maximizing the separation between classes. Its mathematical foundation allows for effective implementation across various domains. 

## References

1. McLachlan, G. J. (2004). *Discriminant Analysis and Statistical Pattern Recognition*. Wiley-Interscience. 
2. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). *Pattern Classification*. Wiley-Interscience. 

## Further Reading

- [Wikipedia: Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
- [Scikit-learn: Linear Discriminant Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)
##########

"""QUERY: What is machine learning?"""

Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown.Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead!


Final Files Sources: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 1, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 2, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 3, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 9, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 12, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 13, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 14, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 15, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 16, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 17, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 18, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 21, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 22, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 23, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 25, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 26, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 29, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 31, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 33, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 38, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 41, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 44, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 48, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 52, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 54, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 55, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 56, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 65, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 73, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 75, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 79, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 81, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 82, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 83, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 84, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 89, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 91, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 92, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 95, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 98, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 100, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 102, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 104, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 106, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 108, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 110, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 112, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 116, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 118, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 120, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 124, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 126, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 134, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 135, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 136, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 138, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 142, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 143, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 144, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 154, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 156, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 162, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 164, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 168, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 170, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 172, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 181, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 182, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 198, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 204, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 206, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 209, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 210, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 211, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 235, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 1, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 3, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 4, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 5, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 6, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 10, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 13, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 15, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 23, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 24, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 26, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 27, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 28, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 29, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 30, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 32, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 36, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 39, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 44, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 45, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 48, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 58, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 80
==================================================
**Elapsed Time: 17.79 seconds**
==================================================


FINAL ANSWER 
Answer: Machine learning is a field of study that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead, these systems learn from data and improve their performance over time. Here are some key points about machine learning based on the provided context:

### Definition and Purpose
- **Machine Learning**: It is a discipline that emerged from artificial intelligence (AI) and is concerned with the development of algorithms that allow computers to learn from and make predictions or decisions based on data.
- **Objective**: The primary goal of machine learning is to extract relevant information from data and make it available to the user. This involves tasks such as prediction, classification, clustering, and more.

### Types of Machine Learning
1. **Supervised Learning**: Involves learning a function that maps an input to an output based on example input-output pairs. It is used for tasks where the data comes with labels, such as classification and regression.
   - **Example**: Predicting the type of car (Hummer or Ferrari) from a new set of attributes based on a labeled dataset of car attributes.
2. **Unsupervised Learning**: Involves learning patterns from unlabeled data. The goal is to discover the underlying structure of the data.
   - **Example**: Clustering news articles by topic without predefined labels.
3. **Semi-Supervised Learning**: Uses both labeled and unlabeled data to improve learning accuracy. It is useful when labeled data is scarce but unlabeled data is abundant.
   - **Example**: Improving a news article classifier by leveraging a large corpus of unlabeled articles.
4. **Reinforcement Learning**: Involves learning to make decisions by taking actions in an environment to maximize cumulative reward. Feedback is given in the form of rewards or penalties.
   - **Example**: A mouse learning to navigate a maze to find cheese.
5. **Active Learning**: Focuses on selecting the most informative data points to label next, optimizing the learning process.
   - **Example**: Choosing the next news article to label that would be most informative for improving a classifier.
6. **Multi-Task Learning**: Involves learning multiple related tasks simultaneously, sharing commonalities among them to improve overall performance.
   - **Example**: Recommending movies to different users by learning personalized models that share features.

### Key Concepts in Machine Learning
- **Generalization**: The ability of a model to perform well on new, unseen data. It is not about memorizing the training data but about transferring properties from observed data to new data.
- **Inductive Bias**: The set of assumptions a model makes to generalize from the training data to unseen data. A strong inductive bias can lead to underfitting, while a weak inductive bias can lead to overfitting.
- **Model Complexity**: The balance between a model's ability to fit the training data and its ability to generalize to new data. More complex models can capture more details but are prone to overfitting.

### Learning Algorithms
- **Nearest Neighbors Classification**: Classifies a data point based on the majority class among its nearest neighbors.
- **Naive Bayesian Classifier**: Uses Bayes' theorem with the assumption of feature independence to classify data points.
- **Perceptron**: A simple linear classifier that updates its weights based on misclassified examples.
- **Support Vector Machines (SVMs)**: Finds the hyperplane that maximizes the margin between different classes.
- **Fisher Linear Discriminant Analysis (LDA)**: Projects data onto a lower-dimensional space to maximize class separability.

### Practical Applications
- **Image Recognition**: Identifying objects or features in images.
- **Recommendation Systems**: Suggesting products or content based on user preferences.
- **Medical Diagnosis**: Classifying medical conditions based on patient data.
- **Market Basket Analysis**: Identifying associations between products in transaction data.
- **Anomaly Detection**: Detecting unusual patterns that do not conform to expected behavior.

### Conclusion
Machine learning is a multifaceted field that leverages data to build models capable of making predictions, discovering patterns, and making decisions. It encompasses various types of learning, each suited to different kinds of problems and data structures. The ultimate goal is to create systems that can learn and adapt autonomously, improving their performance over time.
==================================================
**Elapsed Time: 0.00 seconds**
==================================================