{
    "query": "what is your model?",
    "namespace": "e791d701-717d-4c06-9602-2878acdbb6ad",
    "messages": [
        {
            "role": "assistant",
            "content": "Hallo diverse NF SepTest,\n\nich bin Simon, dein KI-Chatbot, der mit dem gro\u00dfen Sprachmodell gemini-1.5-flash angetrieben wird. Ich kann dir bei allen deinen Fragen helfen. \n\nBasierend auf den B\u00fcchern, die du mir gezeigt hast, k\u00f6nnen wir \u00fcber verschiedene Themen sprechen. Hier sind ein paar Vorschl\u00e4ge:\n\n* **Diskrete Mathematik:** \n    * Mengenlehre \ud83e\uddee\n    * Relationen \ud83d\udd17\n    * Wahrscheinlichkeit \ud83c\udfb2\n    * Datenstrukturen wie Graphen und B\u00e4ume \ud83c\udf33\n    * Kombinatorik \ud83d\udd22\n    * Zahlensysteme \ud83e\uddee\n    * Logik \ud83e\udde0\n    * Beweisf\u00fchrung \u270d\ufe0f\n* **Maschinelles Lernen:**\n    * \u00dcberwachtes Lernen \ud83e\udd16\n    * Unsupervisiertes Lernen \ud83e\udde0\n    * Klassifizierung \ud83d\uddc3\ufe0f\n    * Regression \ud83d\udcc8\n    * Wahrscheinlichkeitstheorie \ud83c\udfb2\n    * Bayes'sche Inferenz \ud83e\udde0\n    * Kernel-Methoden \ud83e\uddee\n    * Neuronale Netze \ud83e\udde0\n    * Statistische Lerntheorie \ud83d\udcca\n    * Approximation von Inferenz und Lernen \ud83e\udde0\n    * Probabilistische Graphische Modelle \ud83c\udf33\n\nWas m\u00f6chtest du dir aus dieser Liste genauer ansehen? \ud83e\udd14\n"
        },
        {
            "role": "user",
            "content": "halo"
        },
        {
            "role": "assistant",
            "content": "Hello! \ud83d\udc4b  How can I help you today? \ud83d\ude0a \n\nIs there anything specific you'd like to know or discuss? \n"
        },
        {
            "role": "user",
            "content": "what is your model?"
        },
        {
            "role": "assistant",
            "content": ""
        }
    ],
    "stream": true,
    "language_level": "",
    "chat_channel": ":3979:1726553257050",
    "language": "German",
    "tone": "neutral",
    "writing_style": "standard",
    "model": "gemini-1.5-flash",
    "knowledgebase": "ki-dev-large",
    "seed": 1055,
    "client_id": 1726553257050,
    "all_context": false,
    "follow_up_for": null,
    "knowledgebase_files_count": 3,
    "override_command": null,
    "disable_clarity_check": false,
    "custom_primer": "",
    "logging": true,
    "query_route": ""
}


INITIALIZATION
Knowledgebase: ki-dev-large
Base Query: what is your model?
Model: gemini-1.5-flash
Use Curl?: None
==================================================
**Elapsed Time: 0.00 seconds**
==================================================


ROUTING
Query type: summary
==================================================
**Elapsed Time: 1.80 seconds**
==================================================


RAG PARAMETERS
Max Context To Include: 100
Lowest Score to Consider: 0.1
==================================================
**Elapsed Time: 0.14 seconds**
==================================================


VECTOR SEARCH ALGORITHM TO USE 
Use MMR search?: True
Use Similarity search?: False
==================================================
**Elapsed Time: 0.10 seconds**
==================================================


VECTOR SEARCH DONE 
==================================================
**Elapsed Time: 6.56 seconds**
==================================================


PRIMER 
Primer: 
            You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that
            can read knowledgebases through the "CONTEXT" that is included in the user's chat message.

        
            Your role is to act as an expert at summarization and analysis.

            In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality.

            Be honest by admitting when a topic falls outside your scope of knowledge, and suggest
            alternative avenues for obtaining information when necessary.

            Make effective use of chat history to avoid redundancy and enhance response relevance, continuously
            adapting to integrate all necessary details in your interactions.

            Use as much tokens as possible to provide a detailed response.
        
==================================================
**Elapsed Time: 0.19 seconds**
==================================================


FINAL QUERY 
Final Query: CONTEXT: ##########
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 1

Context: # A Brief Introduction to Machine Learning for Engineers

(2018), “A Brief Introduction to Machine Learning for Engineers”, Vol. XX, No. XX, pp 1–231. DOI: XXX.

Osvaldo Simeone  
Department of Informatics  
King’s College London  
osvaldo.simeone@kcl.ac.uk
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 2

Context: # Contents

I Basics 5

1 Introduction 6  
   1.1 What is Machine Learning? 6  
   1.2 When to Use Machine Learning? 8  
   1.3 Goals and Outline 11  

2 A Gentle Introduction through Linear Regression 15  
   2.1 Supervised Learning 15  
   2.2 Inference 17  
   2.3 Frequentist Approach 19  
   2.4 Bayesian Approach 36  
   2.5 Minimum Description Length (MDL) 42  
   2.6 Information-Theoretic Metrics 44  
   2.7 Interpretation and Causality 47  
   2.8 Summary 49  

3 Probabilistic Models for Learning 51  
   3.1 Preliminaries 52  
   3.2 The Exponential Family 53  
   3.3 Frequentist Learning 59
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 4

Context: ```
6.6 Discriminative Models........................................159  
6.7 Autoencoders...............................................163  
6.8 Ranking*...................................................164  
6.9 Summary....................................................164  

IV Advanced Modelling and Inference..............................165  

7 Probabilistic Graphical Models................................166  
7.1 Introduction...............................................167  
7.2 Bayesian Networks..........................................170  
7.3 Markov Random Fields.......................................178  
7.4 Bayesian Inference in Probabilistic Graphical Models......182  
7.5 Summary....................................................185  

8 Approximate Inference and Learning............................186  
8.1 Monte Carlo Methods........................................187  
8.2 Variational Inference.......................................189  
8.3 Monte Carlo-Based Variational Learning*....................197  
8.4 Approximate Learning*......................................199  
8.5 Summary....................................................201  

V Conclusions...................................................202  

9 Concluding Remarks............................................203  

Appendices.......................................................206  

A Appendix A: Information Measures..............................207  
A.1 Entropy....................................................207  
A.2 Conditional Entropy and Mutual Information................210  
A.3 Divergence Measures.......................................212  

B Appendix B: KL Divergence and Exponential Family..........215  

Acknowledgements...............................................217  
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 5

Context: References
===========
218
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 10

Context: # Acronyms

| Acronym | Meaning                                         |
|---------|------------------------------------------------|
| pmf     | probability mass function                       |
| PCA     | Principal Component Analysis                    |
| PPCA    | Probabilistic Principal Component Analysis      |
| QDA     | Quadratic Discriminant Analysis                |
| RBM     | Restricted Boltzmann Machine                   |
| SGD     | Stochastic Gradient Descent                     |
| SVM     | Support Vector Machine                          |
| rv      | random variable or random vector (depending on the context) |
| s.t.    | subject to                                     |
| VAE     | Variational AutoEncoder                        |
| VC      | Vapnik–Chervonenkis                            |
| VI      | Variational Inference                          |
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 11

Context: # Part I
## Basics
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 13

Context: # 1.1. What is Machine Learning?

This starts with an in-depth analysis of the problem domain, which culminates with the definition of a mathematical model. The mathematical model is meant to capture the key features of the problem under study and is typically the result of the work of a number of experts. The mathematical model is finally leveraged to derive hand-crafted solutions to the problem.

For instance, consider the problem of defining a chemical process to produce a given molecule. The conventional flow requires chemists to leverage their knowledge of models that predict the outcome of individual chemical reactions, in order to craft a sequence of suitable steps that synthesize the desired molecule. Another example is the design of speech translation or image/video compression algorithms. Both of these tasks involve the definition of models and algorithms by teams of experts, such as linguists, psychologists, and signal processing practitioners, not infrequently during the course of long standardization meetings.

The engineering design flow outlined above may be too costly and inefficient for problems in which faster or less expensive solutions are desirable. The machine learning alternative is to collect large data sets, e.g., of labeled speech, images, or videos, and to use this information to train general-purpose learning machines to carry out the desired task. While the standard engineering flow relies on domain knowledge and on design optimized for the problem at hand, machine learning lets large amounts of data dictate algorithms and solutions. To this end, rather than requiring a precise model of the set-up under study, machine learning requires the specification of an objective, of a model to be trained, and of an optimization technique.

Returning to the first example above, a machine learning approach would proceed by training a general-purpose machine to predict the outcome of known chemical reactions based on a large data set, and then by using the trained algorithm to explore ways to produce more complex molecules. In a similar manner, large data sets of images or videos would be used to train a general-purpose algorithm with the aim of obtaining compressed representations from which the original input can be recovered with some distortion.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 17

Context: # 1.3 Goals and Outline

This monograph considers only passive and offline learning.

## 1.3 Goals and Outline

This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. Unlike other texts that are focused on one particular aspect of the field, an effort has been made here to provide a broad but concise overview in which the main ideas and techniques are systematically presented. Specifically, the material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra. A prior exposure to information theory is useful but not required.

Detailed discussions are provided on basic concepts and ideas, including overfitting and generalization, Maximum Likelihood and regularization, and Bayesian inference. The text also endeavors to provide intuitive explanations and pointers to advanced topics and research directions. Sections and subsections containing more advanced material that may be regarded at a first reading are marked with a star (+).

The reader will find here further discussions on computing platform or programming frameworks, such as map-reduce, nor details on specific applications involving large data sets. These can be easily found in a vast and growing body of work. Furthermore, rather than providing exhaustive details on the existing myriad solutions in each specific category, techniques have been selected that are useful to illustrate the most salient aspects. Historical notes have also been provided only for a few selected milestone events.

Finally, the monograph attempts to strike a balance between the algorithmic and theoretical viewpoints. In particular, all learning algorithms are presented in a manner that emphasizes their theoretical foundations while also providing practical insights.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 20

Context: # Introduction

from *The Computer and the Brain*, 1958.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 22

Context: ```markdown
variables \( t_n \) are assumed to be dependent on \( x_n \), and are referred to as dependent variables, labels, or responses. An example is illustrated in Fig. 2.1. We use the notation \( \mathbf{x}_D = (x_1, \ldots, x_N)^T \) for the covariates and \( \mathbf{t}_D = (t_1, \ldots, t_N)^T \) for the labels in the training set \( D \). Based on this data, the goal of supervised learning is to identify an algorithm to predict the label \( t \) for a new, that is, as of yet unobserved, domain point \( x \).

![Figure 2.1: Example of a training set \( D \) with \( N = 10 \) points \( (x_n, t_n) \), \( n = 1, \ldots, N \).](path_to_image)

The outlined learning task is clearly impossible in the absence of additional information on the mechanism relating variables \( z \) and \( t \). With reference to Fig. 2.1, unless we assume, say, that \( z \) and \( t \) are related by a function \( t = f(z) \) with some properties, such as smoothness, we have no way of predicting the label \( t \) for an unobserved domain point \( z \). This observation is formalized by the no free lunch theorem to be reviewed in Chapter 5: one cannot learn rules that generalize to unseen examples without making assumptions about the mechanism generating the data. The set of all assumptions made by the learning algorithm is known as the inductive bias.

This discussion points to a key difference between memorizing and learning. While the former amounts to mere retrieval of a value \( t_n \), 
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 23

Context: 2.2. Inference
====================

Corresponding to an already observed pair \((x_t, y_t) \in \mathcal{D}\), learning entails the capability to predict the value \(t\) for an unseen domain point \(x\). Learning, in other words, converts experience – in the form of \(\mathcal{D}\) – into expertise or knowledge – in the form of a predictive algorithm. This is well captured by the following quote by Jorge Luis Borges: “To think is to forget details, generalize, make abstractions.” [138]

By and large, the goal of supervised learning is that of identifying a predictive algorithm that minimizes the generalization loss, that is, the error in the prediction of a new label \(t\) for an unobserved explanatory variable \(x\). How exactly to formulate this problem, however, depends on one’s viewpoint on the nature of the model that is being learned. This leads to the distinction between the frequentist and the Bayesian approaches, which is central to this chapter. As will be discussed, the MDL philosophy deviates from the mentioned focus on prediction as the goal of learning, by targeting instead a parsimonious description of the data set \(\mathcal{D}\).

### 2.2 Inference

Before we start our discussion of learning, it is useful to review some basic concepts concerning statistical inference, as they will be needed throughout this chapter and in the rest of this monograph. We specifically consider the inference problem of predicting a \(y\) given the observation of another \(x\) under the assumption that their joint distribution \(p(x, y)\) is known. As a matter of terminology, it is noted that here we will use the term “inference” as it is typically intended in the literature on probabilistic graphical models (see, e.g., [81]), thereby diverging from its use in other branches of the machine learning literature (see, e.g., [23]).

In order to define the problem of optimal inference, one needs to define a non-negative loss function \(\ell(t, \hat{t})\). This defines the cost, or loss or risk, incurred when the correct value is \(t\) while the estimate is \(\hat{t}\). An important example is the \(\ell_t\) loss:

\[
\ell_t(\hat{t}) = |t - \hat{t}|, \quad (2.1)
\]

which includes as a special case the quadratic loss \(\ell_q(t, \hat{t}) = (t - \hat{t})^2\).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 25

Context: # 2.3 Frequentist Approach

The goal of supervised learning methods is broadly speaking that of obtaining a predictor \( \hat{t}(x) \) that performs close to the optimal predictor \( t^* \) based only on the training set \( D \), and hence without knowledge of the joint distribution \( p(x, t) \). The closeness in performance is measured by the difference between the generalization loss \( L_p(\hat{t}) \) achieved by the trained predictor and the minimum generalization loss \( L_p(t^*) \) of the optimal predictor, which depends on the true distribution \( p(x, t) \). Strictly speaking, this statement applies only for the frequentist approach, which is discussed next. As will be explained later in the chapter, in fact, while the Bayesian approach still centers around the goal of prediction, its modelling assumptions are different. Furthermore, the MDL approach concentrates on the task of data compression rather than prediction.

## 2.3 Frequentist Approach

According to the frequentist viewpoint, the training data points \( (x_n, t_n) \in D \) are independent identically distributed (i.i.d.) rvs drawn from a true and unknown distribution \( p(x, t) \):

\[
(x_n, t_n) \sim p(x, t), \quad i = 1, \ldots, N.
\]
  
The new observation \( (x, t) \) is also independently generated from the same true distribution \( p(x, t) \); the domain point \( x \) is observed and the label \( t \) must be predicted. Since the probabilistic model \( p(x, t) \) is not known, one cannot solve directly problem (2.3) to find the optimal prediction that minimizes the generalization loss \( L_p \) in (2.2).

Before discussing the available solutions to this problem, it is worth observing that the definition of the “true” distribution \( p(x, t) \) depends on how the data is collected. As in the example of the “beauty AI” context, if the rankings \( t_n \) assigned to pictures \( x_n \) of faces are affected by racial biases, the distribution \( p(x, t) \) will reflect these prejudices and produce skewed results [62].

## Taxonomy of solutions

There are two main ways to address the problem of learning how to perform inference when not knowing the distribution \( p(x, t) \):
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 28

Context: # A Gentle Introduction through Linear Regression

## 1. Discriminative probabilistic model

With this first class of models, the posterior, or predictive, distribution \( p(y|x) \) is assumed to belong to a hypothesis class \( p(y| \theta) \) defined by a parameter vector \( \theta \). The parameter vector \( \theta \) is learned from the data set \( D \). For a given parameter vector \( \theta \), the conditional distribution \( p(y|x, \theta) \) allows the different values of the label \( y \) to be discriminated on the basis of their posterior probability. In particular, once the model is learned, one can directly compute the predictor (2.6) for any loss function.

As an example, for the linear regression problem, once a vector of parameters \( \theta = (\beta_0, \beta_1) \) is identified based on the data \( D \) during learning, the optimal prediction under the \( \ell_2 \) loss is the conditional mean \( E[y(x) | x] \), that is, \( \hat{y}(x) = \mu(x, \theta) \).

## 2. Generative probabilistic model

Instead of learning directly the posterior \( p(y|x) \), one can model the joint distribution \( p(x, y) \) as being part of a parametric family \( p(x, y | \theta) \). Note that, as opposed to discriminative models, the joint distribution \( p(x, y | \theta) \) models also the distribution of the covariates \( x \). Accordingly, the term “generative” reflects the capacity of this type of model to generate a realization of the covariates by using the marginal \( p(x) \).

Once the joint distribution \( p(x, y | \theta) \) is learned from the data, one can compute the posterior \( p(y|x, \theta) \) using Bayes’ theorem and, from it, the optimal predictor (2.6) can be evaluated for any loss function. Generative models make stronger assumptions by modeling also the distribution of the explanatory variables. As a result, an improper selection of the model may lead to more significant biases. However, there are potential advantages, such as the ability to deal with missing data or latent variables, such as in semi-supervised learning. Refer to Chapter 6 for further discussion (see also [23]).

In the rest of this section, for completeness, we consider discriminative probabilistic models \( p(y|x) \), although the main definitions will apply also to generative models.

### 2.3.2 Model Order and Model Parameters

In the linear regression example, the selection of the hypothesis class (2.12) required the definition of the polynomial degree \( M \), while the...
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 29

Context: # 2.3. Frequentist Approach

Determination of a specific model \( p(y|x, \theta) \) in the class called for the selection of the parameter vector \( \theta = (\nu, \beta) \). As we will see, these two types of variables play a significantly different role during learning and should be clearly distinguished, as discussed next.

## 1. Model order \( M \) (and hyperparameters)

The model order defines the "capacity" of the hypothesis class, that is, the number of degrees of freedom in the model. The larger \( M \) is, the more capable a model is to fit the available data. For instance, in the linear regression example, the model order determines the size of the weight vector \( w \). More generally, variables that define the class of models to be learned are known as hyperparameters. As we will see, determining the model order, and more broadly the hyperparameters, requires a process known as validation.

## 2. Model parameters \( \theta \)

Assigning specific values to the model parameters \( \theta \) identifies a hypothesis within the given hypothesis class. This can be done by using learning criteria such as Maximum Likelihood (ML) and Maximum a Posteriori (MAP).

We postpone a discussion of validation to the next section, and we start by introducing the ML and MAP learning criteria.

### 2.3.3 Maximum Likelihood (ML) Learning

Assume now that the model order \( M \) is fixed, and that we are interested in learning the model parameters \( \theta \). The ML criterion selects a value of \( \theta \) under which the training set \( D \) has the maximum probability of being observed. In other words, the value \( \theta \) selected by ML is the most likely to have generated the observed training set. Note that there might be more than one such value.

To proceed, we need to write the probability (density) of the observed labels \( y \) in the training set \( D \) given the corresponding domain points \( x \). Under the assumed discriminative model, this is given as:

\[
p(y|x, w, \beta) = \prod_{n=1}^{N} p(y_n | x_n, w, \beta)
\]

\[
= \prod_{n=1}^{N} N(t_n|h_n(x_n), \beta^{-1})
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 30

Context: ```
24  A Gentle Introduction through Linear Regression

where we have used the independence of different data points. Taking the logarithm yields the Log-Likelihood (LL) function

ln p(t|xD, w, θ) = -\frac{N}{2} \sum_{n=1}^{N} \ln(t_n|x_n, w) - \frac{N}{2} \frac{\beta}{2\pi} ( \mu(x_n, w) - t_n )^2 \tag{2.14}

The LL function should be considered as a function of the model parameters \( \theta = (w, \beta) \), since the data set \( D \) is fixed and given. The ML learning problem is defined by the minimization of the Negative LL (NLL) function as

\[
\min_{w, \beta} -\frac{1}{N} \sum_{n=1}^{N} \ln(p(t_n|x_n, w, \beta)) \tag{2.15}
\]

This criterion is also referred to as cross-entropy or log-loss, as further discussed in Sec. 2.6.

If one is only interested in learning only the posterior mean, as is the case when the loss function is \( \ell_2 \), then one can tackle problem (2.14) only over the weights \( w \), yielding the optimization

\[
\min_{L_D(w)} = \frac{1}{N} \sum_{n=1}^{N} \| \mu(x_n, w) - t_n \|^2 \tag{2.16}
\]

The quantity \( L_D(w) \) is known as the training loss. An interesting observation is that this criterion coincides with the ERM problem (2.7) for the \( \ell_2 \) loss if one parametrizes the predictor as \( \hat{y}(x) = \mu(x, w) \).

The ERM problem (2.16) can be solved in closed form. To this end, we write the empirical loss as

\[
L_D(w) = N^{-1} \|y - X_D w\|^2, \tag{2.17}
\]

where 

\[
X_D = [\phi(x_1) \cdots \phi(x_N)]^T.
\]

Its minimization hence amounts to a Least Squares (LS) problem, which yields the solution

\[
w_{ML} = (X_D^T X_D)^{-1} X_D^T y_D, \tag{2.18}
\]
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 35

Context: 2.3. Frequentist Approach
=========================

Performance is dominated by the bias induced by the selection of the model.

![Figure 2.5: Illustration of the bias and training error based on the decomposition (2.22).](path/to/image)

### Bias and Generalization Gap

The previous paragraph introduced the notions of estimation error and bias associated with the selection of a given model order \( M \). While Chapter 5 will provide a more extensive discussion on these concepts, it is useful to briefly review them here in the context of ML learning. Estimation error and bias refer to the following decomposition of the generalization loss achieved by the given solution \( w_{ML} \):

\[
L_{p}(w_{ML}) = L^{*} + (L_{p}(w^{*}) - L_{p}(t^{*})) + (L_{p}(w_{ML}) - L_{p}(w^{*})). \tag{2.22}
\]

This decomposition is illustrated in Fig. 2.5 for \( M = 1 \) in (2.22). The term \( L^{*}(t) = 0 \) (the figure shows the square root) is, as seen, the minimum achievable generalization loss without any constraint on the hypothesis class. The term \( (L_{p}(w^{*}) - L^{*}(t^{*})) \) represents the bias, or approximation error, caused by the choice of the given hypothesis class, and hence also by the choice of \( M \). This is because, by (2.21),
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 36

Context: L(w^*) is the best generalization loss for the given model. Recall that the loss L(w^*) can be achieved when N is large enough. Finally, the term (L(w_M) - L(w^*)) is the estimation error or generalization gap[^1] that is incurred due to the fact that N is not large enough and hence we have w_M ≠ w^*.

From the decomposition (2.22), a large N allows us to reduce the estimation error, but it has no effect on the bias. This is seen in Fig. 2.4, where the asymptotic achieved by the generalization loss L(w^*) increases equals the minimum generalization loss L(w^*) for the given model. Choosing a small value of M in the regime of large data points provides a floor on the achievable generalization loss that no amount of additional data can overcome.

## Validation and Testing

In the discussion above, it was assumed that the generalization loss L(w) can somehow be evaluated. Since this depends on the true unknown distribution p(z,t), this evaluation is, strictly speaking, not possible. How then to estimate the generalization loss in order to enable model order selection using a plot as in Fig. 2.33? The standard solution is to use validation.

The most basic form of validation prescribes the division of the available data into two sets: a hold-out, or validation, set and the training set. The validation set is used to evaluate an approximation of the generalization loss L(w) via the empirical average:

L(w) = \frac{1}{N_v} \sum_{n=1}^{N_v} \ell(t_n, w) \quad (2.23)

where the sum is done over the N_v elements of the validation set.

The just described hold-out approach to validation has a clear drawback, as part of the available data needs to be set aside and not used for training. This means that the number of data points that can be used for training is smaller than the number of overall available data points. To partially alleviate this problem, a more sophisticated, and commonly used, approach to validation is k-fold cross-validation. With this method, the available data points are partitioned, typically at random, into k equally sized subsets. The generalization loss is then esti[^1]: This is also defined as generalisation error in some references.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 40

Context: # A Gentle Introduction through Linear Regression

## Figure 2.6
Square root of the generalization loss \( L_{\mathcal{G}}(w_{MAP}) \) and of the training loss \( L_D(w_{MAP}) \) as a function of \( \lambda \) (in logarithmic scale) for the training data set in Fig. 2.1 with \( M = 9 \).

![Figure 2.6](path/to/image)

Fig. 2.6 shows the squared root of the generalization loss \( L_{\mathcal{G}}(w_{MAP}) \) and of the training loss \( L_D(w_{MAP}) \) as a function of \( \lambda \) (in logarithmic scale) for the training data set in Fig. 2.1 with \( M = 9 \). The generalization loss is estimated using validation. We observe that increasing \( \lambda \), and hence the relevance of the regularization term, has a similar impact as decreasing the model order \( M \). A larger \( \lambda \) reduces the effective capacity of the model. Stated in different words, increasing \( \lambda \) reduces overfitting but may entail a larger bias.

Other standard examples for the prior distribution include the Laplace pdf, which yields the \( l_1 \) norm regularization function \( R(w) = \|w\|_1 = \sum_{j=1}^n |w_j| \). This term promotes the sparsity of the solution, which is useful in many signal recovery algorithms [14] and in non-parametric function estimation [146]. The corresponding optimization problem is

\[
\min_{w} L_D(w) + \lambda \|w\|_1 \tag{2.30}
\]

is known as LASSO (Least Absolute Shrinkage and Selection Operator).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 41

Context: ## 2.3 Frequentist Approach

### 2.3.5 Regularization

We have seen above that the MAP learning criterion amounts to the addition of a regularization function \( R(\theta) \) to the ML or ERM learning losses. This function penalizes values of the weight vector \( \mathbf{w} \) that are likely to occur in the presence of overfitting, or, generally, that are improbable on the basis of the available prior information. The net effect of this addition is to effectively decrease the capacity of the model, as the set of values for the parameter vector \( \theta \) that the learning algorithm is likely to choose from is reduced. As a result, seen, regularization can control overfitting and its optimization requires validation.

Regularization generally refers to techniques that aid in reducing overfitting during training. The discussion in the previous subsection has focused on a specific form of regularization that is grounded in a probabilistic interpretation in terms of MAP learning. We note that the same techniques, such as ridge regression and LASSO, can also be introduced independently of a probabilistic framework in an ERM formulation. Furthermore, apart from the discussed addition of regularization terms to the empirical risk, there are other ways to perform regularization.

One approach is to modify the optimization scheme by using techniques such as early stopping [56]. Another is to augment the training set by generating artificial examples and hence effectively increasing the number \( N \) of training examples. Related to this idea is the technique known as bagging. With bagging, we first create a number \( K \) of bootstrap data sets. Bootstrap data sets are obtained by selecting \( N \) data points uniformly with replacement from \( D \) so that the same data point generally appears multiple times in the bootstrap data set. Then, we train the model \( K \) times, each time over a different bootstrap set. Finally, we average the results obtained from the models using equal weights. If the errors accrued by the different models were independent, bagging would yield an estimation error that decreases with \( K \). In practice, significantly smaller gains are obtained, particularly for large \( K \), given that the bootstrap data sets are all drawn from \( D \) and hence the estimation errors are not independent [23].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 47

Context: 2.4. Bayesian Approach
==========================

quantity \( p(t | \mathbf{X}, \mathbf{w}_L) \) can only increase by choosing a larger model order \( M \). In fact, a larger \( M \) entails more degrees of freedom in the optimization (2.16) of the LL. A similar discussion applies also to MAP. However, this is not the case for (2.37): a larger \( M \) implies a more “spread-out” prior distribution of the weights, which may result in a more diffuse distribution of the labels in (2.37). Hence, increasing \( M \) may yield a smaller marginal likelihood.

![Figure 2.9: Marginal likelihood versus the model order M for the training set of Fig. 2.1 (\( \beta = 10, \alpha = 10^{-3} \)).](path/to/image)

To illustrate this point, Fig. 2.9 plots the marginal likelihood for the data set in Fig. 2.1 for \( \beta = 10 \) and three different values of \( \alpha \) as a function of \( M \). The marginal likelihood in this example can be easily computed since we have

\[
p(t | \mathbf{X}) = \int p(t | \mathbf{X}, \mathbf{w}) p(\mathbf{w}) d\mathbf{w}.
\] (2.38)

It is observed that the marginal likelihood presents a peak at a given value of \( M \), while decreasing when moving away from the optimal value. Therefore, we could take the value of \( M \) at which the marginal likelihood is maximized as the selected model order.

Does this mean that validation is really unnecessary when adopting a Bayesian viewpoint? Unfortunately, this is not necessarily the case.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 48

Context: 42  
# A Gentle Introduction through Linear Regression  

In fact, one still needs to set the hyperparameter \( \alpha \). As seen in Fig. 2.9, varying \( \alpha \) can lead to different conclusions on the optimal value of \( M \). An alternative approach would be to treat \( \alpha \) and \( M \) as rvs with given priors to be specified (see, e.g., [131]). This would not obviate the problem of selecting hyperparameters—now defining the prior distributions of \( \alpha \) and \( M \)—but it can lead to powerful hierarchical models. The necessary tools will be discussed in Chapter 7.  

As a final note, rather than using often impractical exhaustive search methods, the optimization over the hyperparameters and the model under \( M \) for all criteria discussed so far, namely ML, MAP, and Bayesian, can be carried out using so-called Bayesian optimization tools [132]. A drawback of these techniques is that they have their own hyperparameters that need to be selected.  

## 2.5 Minimum Description Length (MDL)*  

In this section, we briefly introduce a third, conceptually distinct, learning philosophy—the MDL criterion. The reader is warned that the treatment here is rather superficial, and that a more formal definition of the MDL criterion would require a more sophisticated discussion, which can be found in [10]. Furthermore, some background in information theory is preferable in order to fully benefit from this discussion.  

To start, we first recall from the treatment above that learning requires the identification of a model, or hypothesis class—here the model under \( M \)—and of a specific hypothesis, defined by parameters \( \theta \) (here \( \theta = (y, \beta) \)) within the class. While MDL can be used for both tasks, we will focus here only on the first.  

To build the necessary background, we now need to review the relationship between probabilistic models and compression. To this end, consider a signal \( t \) taking values in a finite alphabet \( \mathcal{X} \), e.g., a pixel.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 49

Context: # 2.5. Minimum Description Length (MDL)

In a gray scale image, fix some probability mass function (pmf) \( p(x) \) on this alphabet. A key result in information theory states that it is possible to design a lossless compression scheme that uses \( [-\log p(x)] \) bits to represent value \( x \).

By virtue of this result, the choice of a probability distribution \( p(x) \) is akin to the selection of a lossless compression scheme that produces a description of around \( -\log p(x) \) bits to represent value \( x \). Note that the description length \( -\log p(x) \) decreases with the probability assigned by \( p(x) \) to value \( x \): more likely values under \( p(x) \) are assigned a smaller description. Importantly, a decoder would need to know \( p(x) \) in order to recover \( x \) from the bit description.

At an informal level, the MDL criterion prescribes the selection of a model that compresses the training data to the shortest possible description. In other words, the model selected by MDL defines a compression scheme that describes the data set \( D \) with the minimum number of bits. As such, the MDL principle can be thought of as a formalization of Occam’s razor: choose the model that yields the simplest explanation of the data. As we will see below, this criterion naturally leads to a solution that penalizes overfitting.

### What is the length of the description of a data set \( D \) that results from the selection of a specific value of \( M \)?

The answer is not straightforward, since, for a given value of \( M \), there are as many probability distributions as there are values for the corresponding parameters \( \theta \) to choose from. A formal calculation of the description length would hence require the introduction of the concept of universal compression for a given probabilistic model \( \theta \). Here, we will limit ourselves to a particular class of universal codes known as two-part codes.

Using two-part codes, we can compute the description length for the data \( D \) that results from the choice of a model over \( M \) as follows. First, we obtain the ML solution \( (m_{ML}, \hat{\theta}_{ML}) \). Then, we describe the data set by using a compression scheme defined by the probability \( p(x | m_{ML}, \hat{\theta}_{ML}) \). As discussed, this pro-
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 50

Context: ```
44
# A Gentle Introduction through Linear Regression

Produces a description of approximately 
$$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} $$
bits. This description is, however, not sufficient, since the decoder of the description should also be informed about the parameters $(w_{ML}, \beta_{ML})$.

Using quantization, the parameters can be described by means of a number $C(M)$ of bits that is proportional to the number of parameters, here $M + 2$. Concatenating these bits with the description produced by the ML model yields the overall description length

$$ - \sum_{n=1}^{N} \log{p[y_n | x_n, w_{ML}, \beta_{ML}]} + C(M). \quad (2.39) $$

MDL – in the simplified form discussed here – selects the model order $M$ that minimizes the description length (2.39). Accordingly, the term $C(M)$ acts as a regularizer. The optimal value of $M$ for the MDL criterion is hence the result of the trade-off between the minimization of the overhead $C(M)$, which calls for a small value of $M$, and the minimization of the NLL, which decreases with $M$.

Under some technical assumptions, the overhead term can be often evaluated in the form 
$$ \left( \frac{K}{2} \right) N \log N + c, $$ 
where $K$ is the number of parameters in the model and $c$ is a constant. This expression is not quite useful in practice, but it provides intuition about the mechanism used by MDL to combat overfitting.

## 2.6 Information-Theoretic Metrics

We now provide a brief introduction to information-theoretic metrics by leveraging the example studied in this chapter. As we will see in the following chapters, information-theoretic metrics are used extensively in the definition of learning algorithms. Appendix A provides a detailed introduction to information-theoretic measures in terms of inferential tasks. Here we introduce the key metrics of Kullback-Leibler (KL) divergence and entropy by examining the asymptotic behavior of ML in the regime of large $N$. The case with finite $N$ is covered in Chapter 6 (see Sec. 6.4.3).
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 53

Context: # 2.7 Interpretation and Causality

Having learned a predictive model using any of the approaches discussed above, an important, and often overlooked, issue is the interpretation of the results returned by the learned algorithm. This has in fact grown into a separate field within the active research area of deep neural networks (see Chapter 4) [102]. Here, we describe a typical pitfall of interpretation that relates to the assessment of causality relationships between the variables in the model. We follow an example in [113].

![Fig. 2.10](path/to/figure) shows a possible distribution of data points on the plane defined by coordinates \( x = \text{exercise} \) and \( t = \text{cholesterol} \) (the numerical values are arbitrary). Learning a model that relates \( x \) to \( t \) would clearly identify an upward trend—an individual that exercises more can be predicted to have a higher cholesterol level. This prediction is legitimate and supported by the available data, but can we also conclude that exercising less would reduce one’s cholesterol? In other words, can we conclude that there exists a causal relationship between \( x \) and \( t \)? We know the answer to be no, but this cannot be ascertained from the data in the figure.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 56

Context: 50  
# A Gentle Introduction through Linear Regression

When the number of data points goes to infinity; while the latter is parameterized and is subject to optimization. For this reason, divergence metrics between the two distributions play an important role in the development of learning algorithms. We will see in the rest of the monograph that other frequentist methods may involve a single distribution rather than two, as discussed in Sec. 6.6 or an additional, so-called variational distribution, as covered in Sec. 6.3 and Chapter 8.

In contrast, Bayesian methods posit a single coherent distribution over the data and the parameters, and frame the learning problem as one of inference of unobserved variables. As we will discuss in Chapter 8, variational Bayesian methods also introduce an additional variational distribution and are a building block for frequentist learning in the presence of unobserved variables.

The running example in this chapter has been one of linear regression for a Gaussian model. The next chapter provides the necessary tools to construct and learn more general probabilistic models.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 57

Context: # 3  
## Probabilistic Models for Learning  

In the previous chapter, we have introduced the frequentist, Bayesian, and MDL learning frameworks. As we have seen, parametric probabilistic models play a key role for all three of them. The linear regression example considered in the previous chapter was limited to a simple linear Gaussian model, which is insufficient to capture the range of learning problems that are encountered in practice. For instance, scenarios of interest may encompass discrete variables or non-negative quantities.  

In this chapter, we introduce a family of probabilistic models, known as the **exponential family**, whose members are used as components in many of the most common probabilistic models and learning algorithms. The treatment here will be leveraged in the rest of the monograph in order to provide the necessary mathematical background. Throughout this chapter, we will specifically emphasize the common properties of the models in the exponential family, which will prove useful for deriving learning algorithms in the following chapters.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 58

Context: # 3.1 Preliminaries

We start with a brief review of some definitions that will be used throughout the chapter and elsewhere in the monograph (see [28] for more details). Readers with a background in convex analysis and calculus may just review the concept of sufficient statistic in the last paragraph.

First, we define a **convex set** as a subset of \(\mathbb{R}^D\), for some \(D\), that contains all segments between any two points in the set. Geometrically, convex sets hence cannot have “indentations.” Function \(f(x)\) is convex if its domain is a convex set and if it satisfies the inequality \(f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)\) for all \(x_1\) and \(x_2\) in its domain and for all \(0 \leq \lambda \leq 1\). Geometrically, this condition says that the function is “U”-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve. A function is strictly convex if the inequality above is strict except for \(\lambda = 0\) or \(\lambda = 1\); a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.”

The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28].

We also introduce two useful concepts from calculus. The **gradient** of a differentiable function \(f(x)\) with \(x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D\) is defined as the \(D \times 1\) vector \(\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T\) containing all partial derivatives. At any point \(x\) in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian \(\nabla^2 f(x)\) is the \(D \times D\) matrix with \((i,j)\) element given by the second-order derivative \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). It captures the local curvature of the function.

1. A statistic is a function of the data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 60

Context: # Probabilistic Models for Learning

since the latter is generally easier to evaluate.

In short, distributions belonging to the exponential family are such that the logarithm of the unnormalized distribution \(\tilde{p}(\mathbf{x}|\eta)\), which is also known as the exponential function, is linear in the natural parameters, i.e.,

\[
\ln p(\mathbf{x}|\eta) = \langle \eta, u(\mathbf{x}) \rangle + h(m(\mathbf{x})). \tag{3.4}
\]

For this reason, models of the form (3.1) are also referred to as log-linear. Including the partition function, the LL of the natural parameters can be written as

\[
\ln p(\mathbf{x}|\eta) = \langle \eta, u(\mathbf{x}) - A(\eta) \rangle + h(m(\mathbf{x})), \tag{3.5}
\]

where

\[
A(\eta) = \ln Z(\eta) \tag{3.6}
\]

is the log-partition function.

As per (3.1), a probabilistic model belonging to the exponential family is identified by the set of sufficient statistics \(\{u_k(\mathbf{x})\}_{k=1}^K\), whose order is irrelevant, and by the measure \(m(\mathbf{x})\). A specific hypothesis within the model is selected by determining the natural parameter vector \(\eta\). The set of feasible values for the natural parameters contains all, and only, the vectors for which the unnormalized distribution \(\tilde{p}(\mathbf{x})\) can be normalized, that is, for which the inequality \(A(\eta) < \infty\) holds. We will see below that this set is convex.

## Example 3.1 (Gaussian pdf)

As a first example, consider the Gaussian pdf

\[
N(\mathbf{x}|\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^{\frac{1}{2}}} \exp\left(-\frac{(\mathbf{x} - \mu)^2}{2\sigma^2}\right).
\]

This can be written in the form (3.1) upon identification of the base measure \(m(\mathbf{x}) = 1\) and of the sufficient statistics \(u(\mathbf{x}) = \mathbf{x}^2\). Note that, in order to do this, we need to map the parameters \((\mu, \sigma^2)\) to the  \(\mathbf{x}^2\), more precisely, after giving the presence of an additive constant.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 64

Context: ```markdown
setting \( C = 2 \) recovers the Bernoulli distribution. PMFs in this model are given as:

\[
\text{Cat}(x| \mu) = \prod_{k=0}^{C-1} \mu_k^{I(x=k)} \times (1 - \sum_{k=1}^{C-1} \mu_k) \quad (3.16)
\]

where we have defined \( \mu_k = \Pr[x = k] \) for \( k = 1, \ldots, C - 1 \) and \( \mu_0 = 1 - \sum_{k=1}^{C-1} \mu_k = \Pr[x = 0] \). The log-likelihood (LL) function is given as:

\[
\ln(\text{Cat}(x|\mu)) = \sum_{k=1}^{C-1} I(x=k) \ln \mu_k + \ln \mu_0 \quad (3.17)
\]

This demonstrates that the categorical model is in the exponential family, with sufficient statistics vector \( u(x) = [I(x = 0) \ldots I(x = C - 1)]^T \) and measure function \( m(x) = 1 \). Furthermore, the mean parameters \( \mu = [\mu_1, \ldots, \mu_{C-1}]^T \) are related to the natural parameter vector \( \eta = [\eta_1, \ldots, \eta_{C-1}]^T \) by the mapping:

\[
\eta_k = \ln \left( \frac{\mu_k}{1 - \sum_{j=1}^{C-1} \mu_j} \right) \quad (3.18)
\]

which again takes the form of an LLR. The inverse mapping is given by:

\[
\mu = \left( \frac{e^{\eta_k}}{1 + \sum_{j=1}^{C-1} e^{\eta_j}} \right) \quad (3.19)
\]

The parameterization given here is minimal, since the sufficient statistics \( u(x) \) are linearly independent. An overcomplete representation would instead include in the vector of sufficient statistics also the function \( I(x=0) \). In this case, the resulting vector of sufficient statistics is:

\[
u(x) = 
\begin{cases} 
1 & \text{if } x = 0 \\ 
1 & \text{if } x = C - 1 
\end{cases} \quad (3.20)
\]

which is known as one-hot encoding of the categorical variable, since only one entry equals 1 while the others are zero. Furthermore, with this...
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 69

Context: 3.4 Bayesian Learning
======================

As we discussed in the previous chapter, the Bayesian viewpoint is to treat all variables as jointly distributed, including the model parameters \( \mu \) and the new, unobserved value \( x \). The joint distribution is given as

\[
p(x, \mu | \alpha) = p(\mu | \alpha) \, p(x | \mu) \, p(\alpha) \tag{3.35}
\]
\textit{a prior distribution, likelihood of new data}

where \( \alpha \) represents the vector of hyperparameters defining the prior distribution. The problem of inferring the unobserved value \( x \) is solved by evaluating the predictive distribution

\[
p(x | \alpha) = \int p(\mu | x, \alpha) \, p(x | \mu) \, d\mu. \tag{3.36}
\]

This distribution accounts for the weighted average of the contributions from all values of the parameter vector \( \mu \) according to the posterior distribution \( p(\mu | x, \alpha) \). Note that, for clarity, we left indicated the dependence on the hyperparameters \( \alpha \). Using Bayes' theorem, the posterior of the parameter vector can be written as

\[
p(\mu | x, \alpha) = \frac{p(\alpha | x) \, p(x | \mu)}{p(x | \alpha)} \propto p(\mu | \alpha) \, p(x | \mu). \tag{3.37}
\]

As discussed in Chapter 2, this relationship highlights the dependence of the posterior on both the prior distribution and the likelihood \( p(x | \mu) \). We also note the denominator in (3.37) is the marginal likelihood.

### Prior Distribution
The first issue we should address is the choice of the prior distribution. There are two main approaches:

1. **Conjugate prior**: Choose the prior \( p(\alpha) \) so that posterior \( p(\alpha | x) \) has the same distribution as the prior \( p(\alpha) \) but with generally different parameters.
2. **Non-informative prior**: Choose the prior that is least informative given the observed data [23, pp. 117-120]. Here, we will work with conjugate priors, which are more commonly adopted in applications. In fact, a key advantage of models in the exponential family is that they all admit conjugate priors, and the conjugate priors are also members of the exponential family.

Rather than providing a general discussion, which may be of limited practical use, we proceed by means of representative examples. A table of models with corresponding prior distributions can be found in [155].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 70

Context: # 3.4.1 Beta-Bernoulli Model

The Beta-Bernoulli model is suitable to study binary data. Conditioned on the parameter \( \mu = \Pr\{X = 1\} \), the pmf of the \( N \) i.i.d. available observations \( x_p \) with \( x_n \in \{0, 1\} \) is given as

\[
p(x | \mu) = \prod_{n=1}^{N} \text{Bern}(x_n | \mu) = \mu^{\sum_{n=1}^{N} x_n}(1 - \mu)^{N - \sum_{n=1}^{N} x_n}. \tag{3.38}
\]

As seen, a conjugate prior should be such that the posterior (3.37) has the same distribution of the prior but with different parameters. For the likelihood (3.38), the conjugate prior is the Beta distribution, which is defined as

\[
p(\alpha | a, b) = \text{Beta}(\alpha | a, b) \propto \alpha^{a - 1}(1 - \alpha)^{b - 1}, \tag{3.39}
\]

where \( a \) and \( b \) are hyperparameters and the normalization constant is not made explicit in order to simplify the notation. It is worth emphasizing that (3.39) is a probability distribution on a probability \( \alpha \). Plots of the beta pdf for different values of \( a, b \geq 1 \) can be found in Fig. 3.1.

![Figure 3.1: Beta distribution with different values of the hyperparameters \( a, b \geq 1 \).](#)
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 74

Context: 68  
# Probabilistic Models for Learning  

estimate as the mode. Finally, the Bayesian predictive distribution is  

$$  
p(z | x) = \frac{N |k| \alpha_k}{N + \sum_{j=0}^J \alpha_j}  
$$  
(3.50)  

One can check that the behavior in the two regimes of small and large \(N\) is consistent with the discussion on the Beta-Bernoulli model.  

## 3.4.3 Gaussian-Gaussian Model  

As a last example, we consider continuous observations that are assumed to have a Gaussian distribution \( \mathcal{N}(\mu, \sigma^2) \) with unknown mean \( \mu \) but known variance \( \sigma^2 \). The likelihood function is given by  

$$  
p(\mathbf{x} | \mu) = \prod_{i=1}^N \mathcal{N}(x_i | \mu, \sigma^2)  
$$  

The conjugate prior is also Gaussian, namely \( \mu \sim \mathcal{N}(\mu_0, \sigma_0^2) \). The posterior \( p(\mu | \mathbf{x}, \sigma^2) \) is  

$$  
\mathcal{N}(\mu | \mu_N, \sigma_N^2)  
$$  
(3.51)  

where  

$$  
\mu_N = \frac{\sigma^2/N \mu_0 + \sigma_0^2/N \bar{x}}{\sigma_0^2/N + \sigma^2/N}  
$$  
(3.52)  

recalling that the ML estimate is \(\mu_{ML} = \frac{\sum_{i=1}^N x_i}{N}\). Note that, since mean and mode are equal for the Gaussian distribution, the mean \(\mu_N\) is also the MAP estimate \(\mu_{MAP}\) of \( \mu \). Finally, the predictive distribution is also Gaussian and given as  

$$  
p(x | \mathbf{x}, \sigma^2) = \mathcal{N}(x | \mu_N, \sigma_N^2 + \sigma^2)  
$$  

Once again, as \(N\) grows large, the predictive distribution tends to that returned by the ML approach, namely \(\mathcal{N}(\mu_{ML}, \sigma^2)\).  

![Fig. 3.3](path_to_figure)  illustrates the relationship among ML, MAP and Bayesian solutions for the Gaussian-Gaussian model. In the panels, the dotted line represents the prior distribution, which is characterized by the parameters \((\mu_0, \sigma_0^2)\), and the dashed line is the true distribution (the one with \( \sigma^2 = 1\)), hence belonging to the assumed model. Each subplot plots a realization of observations (circles), along with the ML solution (diamond) and the MAP estimate (star). The solid lines represent the Bayesian predictive distribution.  

As \(N\) increases, we observe the following, already discussed phenomena: (i) the ML estimate \(\mu_{ML}\) consistently estimates the true value.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 75

Context: # 3.5 Supervised Learning via Generalized Linear Models (GLM)

Distributions in the exponential family are not directly suitable to serve as discriminative probabilistic models to be used in supervised learning tasks. In this section, we introduce Generalized Linear Models (GLMs), which are popular probabilistic discriminative models that build on members of the exponential family.

## Figure 3.3: Gaussian-Gaussian model

The prior distribution \( N(\mu | \phi_0 = 1, \sigma^2 = 3) \) (dotted), true distribution \( N(\mu | \theta_0 = 0, \sigma^2 = 1) \) (dashed), \( N \) observations (circles), ML solution (diamond), the MAP estimate (star), and Bayesian predictive distribution (solid line).

\[
\mu = 0: \quad (i) \text{ the MAP estimate } \hat{\mu}_{MAP} \text{ tends to the ML estimate } \hat{\mu}_{ML}; \text{ and } (ii) \text{ the Bayesian predictive distribution tends to the ML predictive distribution, which in turn coincides with the true distribution due to (i).}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 76

Context: To elaborate, let us denote as `exponential(η)` a probabilistic model in the exponential family, that is, a model of the form (3.1) with natural parameters η. We also write `exponential(μ|θ)` for a probabilistic model in the exponential family with mean parameters μ.

Using the notation adopted in the previous chapter, in its most common form, a GLM defines the probability of a target variable as:

\[ 
p(y|x, W) = \text{exponential}(η|Wz) \tag{3.53} 
\]

where we recall that \( z \) is the vector of explanatory variables, and \( W \) here denotes a matrix of learnable weights of suitable dimensions. According to (3.53), GLMs posit that the response variable \( y \) has a conditional distribution from the exponential family, with natural parameter vector \( η = Wz \) given by a linear function of the given explanatory variables \( z \) with weights \( W \). More generally, we may have the parametrization:

\[ 
p(y|x, W) = \text{exponential}(η|Wϕ(x)) \tag{3.54} 
\]

for some feature vector \( ϕ(x) \) obtained as a function of the input variables \( x \) (see next chapter).

While being the most common, the definition (3.54) is still not the most general for GLMs. More broadly, GLMs can be interpreted as a generalization of the linear model considered in the previous chapter, whereby the mean parameters are defined as a linear function of a feature vector. This viewpoint, described next, may also provide a more intuitive understanding of the modeling assumptions made by GLMs.

Recall that, in the recurring example of Chapter 2, the target variable was modeled as Gaussian distributed with mean given by a linear function of the covariates \( z \). Extending the example, GLMs posit the model:

\[ 
p(y|x, W) = \text{exponential}(η|g(Wϕ(x))) \tag{3.55} 
\]

where the mean parameter vector is parametrized as a function of the feature vector \( ϕ(x) \) through a generally non-linear vector function \( g(·) \) of suitable dimensions. In words, GLMs assume that the target variable is a “noisy” measure of the mean \( μ = g(Wϕ(x)) \).

When the function \( g(·) \) is selected as the gradient of the partition function of the selected model, e.g., \( g(·) = \nabla A(·) \), then, by (3.10), we...
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 77

Context: # 3.6 Maximum Entropy Property*

In this most technical section, we review the maximum entropy property of the exponential family. Besides providing a compelling motivation for adopting models in this class, this property also illuminates the relationship between natural and mean parameters.

The key result is the following: The distribution \(p(x|\theta)\) in (3.1) obtains the maximum entropy over all distributions \(p(x)\) that satisfy the constraints \(\mathbb{E}_{x \sim p(x)} [u_k(x)] = \mu_k\) for all \(k = 1, \ldots, K\). Recall that, as mentioned in Chapter 2 and discussed in more detail in Appendix A, the entropy is a measure of randomness of a random variable. Mathematically, the distribution \(p(x | \eta)\) solves the optimization problem

\[
\max_{p} H(p) \quad \text{s.t.} \quad \mathbb{E}_{x \sim p(x)} [u_k(x)] = \mu_k \quad \text{for } k = 1, \ldots, K.
\]

Each natural parameter \(\eta_k\) turns out to be the optimal Lagrange multiplier associated with the \(k\)th constraint (see [45, Ch. 6-7]).

To see the practical relevance of this result, suppose that the only information available about some data \(x\) is given by the means of given functions \(u(x)\), \(k = 1, \ldots, K\). The probabilistic model (3.1) can then be interpreted as encoding the least additional information about the data, in the sense that it is the "most random" distribution under the given constraints. This observation justifies the adoption of this model by the maximum entropy principle.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 79

Context: # 3.8 Some Advanced Topics*

With energy-based models, the key formula (3.28) of the gradient of the LL with respect to the model's parameters generalizes as

$$
\frac{1}{N} \nabla_{n} \ln p(z | \eta) = - \frac{1}{N} \sum_{n=1}^{N} \nabla_{n} E_{c}(z_{n} | \eta) + \sum_{c} E_{c} - p(z_{c}) \nabla_{n} E_{c}(x_{i}).
$$
(3.58)

Generalizing the discussion around (3.28), the first term in (3.58) is the “positive” component that points in a direction that minimizes the energy of the observations \( x_{i} \); while the second term is the “negative” component that pushes up the energy of the unobserved configurations.

In gradient-ascent methods, the application of the first term is typically referred to as the positive phase, while the second is referred to as the negative phase. (The negative phase is even taken by some authors to model the working of the brain while dreaming!) [56] While for the exponential family the expectation in the negative phase readily yields the mean parameters, for more general models, the evaluation of this term is generally prohibitive and typically requires Monte Carlo approximations, which are discussed in Chapter 8.

## 3.8 Some Advanced Topics*

The previous sections have focused on the important class of parametric probabilistic models in the exponential family. Here we briefly put the content of this chapter in the broader context of probabilistic models for machine learning. First, it is often useful to encode additional information about the relationships among the model variables by means of a graphical formalism that will be discussed in Chapter 7. Second, the problem of learning the distribution of given observations, which has been studied here using parametric models, can also be tackled using a non-parametric approach. Accordingly, the distribution is inferred making only assumptions regarding its local smoothness. Typical techniques in this family include Kernel Density Estimation and Nearest Neighbor Density Estimation (see, e.g., [140]).

Furthermore, rather than learning individual probability densities, in some applications, it is more useful to directly estimate ratios of densities. This is the case, for instance, when one wishes to estimate the
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 81

Context: # Part II
## Supervised Learning
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 85

Context: # 4.2. Classification as a Supervised Learning Problem

The binary classification problem is illustrated in Fig. 4.1. Given a training set \( D \) of labeled examples \( x_n, n = 1, \ldots, N \), the problem is to assign a new example \( x \) to either class \( C_0 \) or \( C_1 \). In this particular standard data set, the two variables in each vector \( x \) measure the sepal length and sepal width of an iris flower. The latter may belong to either the setosa or virginica family, as encoded by the label \( t_n \) and represented in the figure with different markers. Throughout, we denote as \( D \) the dimension of the domain point \( x \) (\( D = 2 \) in Fig. 4.1).

![Binary classification illustration](path/to/figure4.1.png)

**Figure 4.1:** Illustration of the binary (\( K = 2 \) classes) classification problem with a domain space of dimension \( D = 2 \): to which class should the new example \( x \) be assigned?

Following the taxonomy introduced in Chapter 2, we can distinguish the following modeling approaches, which will be reviewed in the given order throughout the rest of this chapter:

- **Deterministic deterministic models:** Model directly the deterministic mapping between domain point and label via a parameterized function \( t \approx \hat{t}(x) \).
  
- **Discriminative probabilistic models:** Model the probability of a point \( x \) belonging to class \( C_k \) via a parameterized conditional pmf \( p(t|x) \), with the relationship between \( t \) and \( C_k \) defined in (4.4). We will also
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 86

Context: write \( p(C_k | x) \) for the discriminative probability when more convenient.

- **Generative probabilistic model**: Model the joint distribution of domain point and class label by specifying the prior distribution \( p(t) \) or \( p(C_k) \), and the class-dependent probability distribution \( p(x | C_k) \), or \( p(x | C_k) \), of the domain points within each class.

Discriminative models are arguably to be considered as setting the current state of the art on classification, including popular methods such as Support Vector Machine (SVM) and deep neural networks. Generative models are potentially more flexible and powerful as they allow to capture distinct class-dependent properties of the covariates \( x \).

### 4.3 Discriminative Deterministic Models

In this section, we discuss binary classification using discriminative deterministic models. Owing to their practical importance and to their intuitive geometric properties, we focus on linear models, whereby the binary prediction \( \hat{y}(x) \) is obtained by applying a threshold rule on a decision variable \( a(x, \mathbf{w}) \) obtained as a linear function of the learnable weights \( \mathbf{w} \) (the notation will be introduced below). Note that the decision variable \( a(x, \mathbf{w}) \) may not be a linear function of the covariates \( x \). As we will discuss, this class of models underlie important algorithms that are extensively used in practical applications such as SVM. A brief discussion on multi-class classification using deterministic models is provided at the end of this section. In the next two sections, we cover discriminative probabilistic models, including GLMs and more general models.

#### 4.3.1 Model

In their simplest form, linear discriminative deterministic classification models are of the form

\[
\hat{y}(x, \mathbf{w}) = \text{sign}(a(x, \mathbf{w})) \tag{4.5}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 88

Context: 82 Classification

Given a point \( x \), it is useful to measure the confidence level at which the classifier assigns \( x \) to the class identified through rule (4.5). This can be done by quantifying the Euclidean distance between \( x \) and the decision hyperplane. As illustrated in Fig. 4.2, this distance, also known as the classification margin, can be computed as \( | \langle x, \mathbf{w} \rangle | / \| \mathbf{w} \| \).

A point \( x \) has a true label \( t \), which may or may not coincide with the one assigned by rule (4.5). To account for this, we augment the definition of margin by giving a positive sign to correctly classified points and a negative sign to incorrectly classified points. Assuming that \( t \) takes values in \( \{-1, 1\} \), this yields the definition of geometric margin as

\[
\text{margin} = \frac{t \cdot \langle x, \mathbf{w} \rangle}{\| \mathbf{w} \|} \tag{4.7}
\]

whose absolute value equals the classification margin. For future reference, we also define the functional margin as \( t \cdot \langle x, \mathbf{w} \rangle \).

**Feature-based model**. The model described above, in which the activation is a linear function of the input variables \( x \), has the following drawbacks:

1. **Bias**: As suggested by the example in Fig. 4.3, dividing the domain of the covariates \( x \) by means of a hyperplane may fail to capture. 

![Figure 4.3: A non-linearly separable training set.](path/to/image)
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 89

Context: # 4.3 Discriminative Deterministic Models

The geometric structure of the data. In particular, in the example, the two classes are not linearly separable in the space of the covariates—no hyperplane separates exactly the domain points in the two classes. In such cases, classifiers of the form (4.5) yield large average losses due to the bias induced by the choice of the model (see Sec. 2.3.3).

1. **Overfitting:** When D is large and the data points N are insufficient, learning the D + 1 weights of the classifier may cause overfitting.  
2. **Data-dependent domain size:** In some applications, the dimension D may even change from data point to data point, that is, it may vary with the index n. For example, a text \( x_n \), e.g., represented in ASCII format, will have a different dimension \( D_n \) depending on the number of words in the text.

To address these problems, a powerful approach is that of working with feature vectors \( \phi(x) \), \( k = 1, \ldots, D \), rather than directly with the covariates \( z \) as the input to the classifier. A feature \( \phi(x_n) \) is a generally non-linear function of the vector \( z \). It is important to emphasize that these functions are fixed and not learned.

Choosing a number of features \( D' > D \), which yields an overcomplete representation of the data point \( x_n \), may help against bias; while opting for an undercomplete representation with \( D' < D \) may help solve the problem of overfitting. Furthermore, the same number of features \( D' \), e.g., word counts in a bag-of-words model, may be selected irrespective of the size of the data point, addressing also the last problem listed above.

The feature-based model can be expressed as (4.5) with activation

\[
a(z, \theta) = \sum_{k=1}^{D} w_k \phi_k(z) = \mathbf{w}^T \boldsymbol{\phi}(z).
\]

(4.8)

where we have defined the feature vector \( \phi(x) = [\phi_1(x) \cdots \phi_D(x)]^T \). Note that model (4.5) is a special case of (4.8) with the choice \( \phi(z) = [1, z^T]^T \).

## 4.3.2 Learning

As seen in Sec. 2.3.3, learning of deterministic discriminative models can be carried out by means of ERM for a given loss function \( \ell \). Further...
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 91

Context: # 4.3 Discriminative Deterministic Models

on the original loss function. In this way, the actual average loss is guaranteed to be smaller than the value attained under the surrogate loss function. Examples of surrogate functions that will be considered in the following can be found in Fig. 4.4.

![Some notable surrogate loss functions for binary classification along with the 0-1 loss.](fig4.4.png)

## Perceptron Algorithm

The perceptron algorithm is one of the very first machine learning and AI algorithms. It was introduced by Frank Rosenblatt at the Cornell Aeronautical Laboratory in 1957 to much fanfare in the popular press – it is “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence,” reported *The New York Times* [144].

Using the feature-based model for generality, the perceptron algorithm attempts to solve problem (4.11) with the surrogate perceptron.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 95

Context: # 4.3. Discriminative Deterministic Models

zero Lagrange multipliers. Note that the Lagrange multipliers are returned by standard solvers such as the ones implemented by the CVX toolbox in MATLAB [58].

![Example of binary classification with SVM using polynomial features up to degree \( M \) (λ/N = 0.2).](./figure_4_5.png)

## Example 4.2
In the example in Fig. 4.5, the illustrated \( N = 80 \) training samples are fed to a SVM using the monomial feature vector \( \phi(\mathbf{x}) = [1, x_1^2, \ldots, x_1^M, x_2^M]^\top \) and \( \lambda/N = 0.2 \) for given model orders \( M \). The decision boundary is shown using dashed and solid lines. It is seen that, using a sufficiently large order (here \( M = 3 \)), SVM is able to effectively partition the two samples in the two classes. Furthermore, even with larger values of \( M \) (here \( M = 8 \)), SVM appears not to suffer from significant overfitting thanks to the quadratic regularization term.

The optimization problem (4.16) may be conveniently tackled by using Lagrange duality techniques. This approach also allows one to naturally introduce the powerful tool of kernel methods. The interested reader can find this discussion in Appendix B of this chapter.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 97

Context: # 4.4. Discriminative Probabilistic Models: Generalized Linear Models

## 4.4.1 Model

For classification, the label \( c \) can take a finite number of values, and it can hence be described by a Bernoulli variable in the binary case or, more generally, by a Categorical variable (see Chapter 3). The GLM (3.54) for binary classification is known as logistic regression, and it assumes the predictive distribution

\[
p(t|x) = \sigma(\mathbf{w}^T \mathbf{ϕ}(x)). \tag{4.19}
\]

We recall that \( \sigma(a) = \left( 1 + \exp(-a) \right)^{-1} \) is the sigmoid function (see Chapter 2). We also observe that

\[
\sigma(-a) = 1 - \sigma(a). \tag{4.20}
\]

which implies that we can write \( p(t|x) = 1 - \sigma(\mathbf{w}^T \mathbf{ϕ}(x)) = \sigma(-\mathbf{w}^T \mathbf{ϕ}(x)) \). Intuitively, the sigmoid function in (4.19) can be thought of as a “soft” version of the threshold function sign(a) used by the deterministic models studied in the previous section.

We emphasize that the logistic regression model (4.19) is a GLM since it amounts to a Bernoulli distribution, which is in the exponential family, with natural parameter vector \( \eta = \mathbf{w}^T \mathbf{ϕ}(x) \) as in

\[
t|x \sim \text{Bern}(t| \eta) = \sigma(\mathbf{w}^T \mathbf{ϕ}(x)). \tag{4.21}
\]

### Inference

Before discussing learning, we observe that inference is straightforward. In fact, once the discriminative model (4.19) is known, the average 0-1 loss, that is, the probability of error, is minimized by choosing the label according to the following rule

\[
p(c|x) = p(t = 1|x) \frac{1}{2} C_1^0, \tag{4.22}
\]

or equivalently 

\[
\mathbf{w}^T \mathbf{ϕ}(x) \geq 0.
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 102

Context: # 4.5 Discriminative Probabilistic Models: Beyond GLM

As depicted in Fig. 4.6, GLMs can be interpreted as three-layer neural networks in which the only hidden layer computes fixed features. The fixed features are then processed by the output classification layer. 

In various applications, determining suitable features is a complex and time-consuming task that requires significant domain knowledge. Moving beyond GLMs allows us to work with models that learn not only the weights used by the output layer for classification, but also the vector of features \( \phi(x) \) on which the output layer operates. This approach yields a much richer set of models, which, along with suitable learning algorithms, has led to widely publicized breakthroughs in applications ranging from speech translation to medical diagnosis.

As a prominent example of beyond-GLM classification models, we describe here feed-forward multi-layer neural networks, or deep neural networks.

## Figure 4.7: A multi-layer neural network.

\[
\begin{array}{ccccccc}
a^{1} & h^{1} & h^{2} & \cdots & h^{k-1} & g(\mathbf{x}) & y \\
x^{1} & W^{1} & W^{2} & \cdots & W^{k} & \text{softmax} &
\end{array}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 105

Context: 4.5. Discriminative Probabilistic Models: Beyond GLM

Put

\[
\delta^{L+1} = (y - t) \quad (4.36a)
\]

\[
\delta' = (W^{(L+1)})^T \delta^{(l)} \cdot h'(a^{(l)}) \quad (4.36b)
\]

\[
\nabla_{w_{ij}} L(W) = \delta^{(h^{(l)})^T} \quad \text{for } l = 1, 2, \ldots, L + 1 \quad (4.36c)
\]

where \(h'( \cdot )\) denotes the first derivative of the function \(h( \cdot )\); the product \( \cdot \) is taken element-wise; and we set \( h^0 = x \).

Backpropagation requires a forward pass and a backward pass for every considered training example. The forward pass uses the neural network as defined by equations (4.35). This entails multiplications by the weight matrices \(W^l\) in order to compute the activation vectors, as well as applications of the non-linear function \(h( \cdot )\). In contrast, the backward pass requires only linear operations, which, by (4.36b), are based on the transpose \((W^{(l)})^T\) of the weight matrix \(W^l\) used in the forward pass.

The derivatives (4.36c) computed during the backward pass are of the general form 

\[
\nabla_{w_{ij}} L(W) = h^{(l-1)} \cdot \delta_j \quad (4.37)
\]

where \(w_{ij}\) is the \((i,j)\)th element of matrix \(W^l\) corresponding to the weight between the pre-synaptic neuron \(j\) in layer \(l-1\) and the post-synaptic neuron \(i\) in layer \(l\); \(h^{(l-1)}\) is the output of the pre-synaptic neuron \(j\); and \(\delta_j\) is the back-propagated error. The back-propagated error assigns “responsibility” for the error \(t_j - y_j\) measured at the last layer (layer \(L+1\)) to each synaptic weight \(w_{ij}\) between neuron \(j\) in layer \(l-1\) and neuron \(i\) in layer \(l\). The back-propagated error is obtained via the linear operations in (4.36b).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 109

Context: # 4.6 Generative Probabilistic Models

where \( \text{exponential}(\eta) \) represents a distribution from the exponential family with natural parameter vector \( \eta \) (see previous chapter). Accordingly, the parameters of the model are \( \theta = (\pi, \tau, \eta) \), where vectors \( \eta \) represent the natural parameters of the class-dependent distributions. As we have seen in Chapter 2, we can also equivalently use mean parameters to define the exponential family distributions. As a result of this choice, the joint distribution for \( (x, t) \) is given as 

\[
p(x, t | \theta) = p(t | x, \eta).
\]

## Inference

Given a new point \( x \), in order to minimize the probability of error, the optimal prediction of the class under 0-1 loss can be seen to satisfy the maximum a posteriori rule:

\[
p(c_1 | x) = \frac{p(x | c_1) \pi(c_1)}{p(x)} \quad \text{for} \quad c_1 \in \{C_0, C_1\}. \tag{4.39}
\]

## 4.6.2 Learning

We now focus on ML learning. The LL function can be written as 

\[
\ln p(D | \pi, \tau, \eta) = \sum_{n=1}^N \ln p(t_n | \pi) + \sum_{n=1}^N \ln p(x_n | t_n) + \sum_{n=1}^N \ln p(t_n | \eta). \tag{4.40}
\]

Given the decomposition of the LL in (4.40), we can optimize over \( \pi, \tau \) and \( \eta \) separately, obtaining the respective ML estimates. Note, however, that, while for \( x \) we can use the entire data set, the optimization over parameters \( \pi \) and \( \eta \) can leverage smaller data sets that include only the samples \( x_n \), with labels \( t_n = 0 \) or \( t_n = 1 \) respectively. As we discussed in Chapter 2, ML estimates of exponential families merely require moment matching, making these estimates generally easy to obtain. We illustrate this point below with two important examples.

### Quadratic Discriminant Analysis (QDA)

In QDA, the class-conditional distributions are Gaussian with class-dependent mean and covariance:

\[
t \sim \text{Bern}(\pi) \tag{4.41a}
\]
\[
x | t \sim \mathcal{N}(\mu_k, \Sigma_k). \tag{4.41b}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 111

Context: # 4.6  Generative Probabilistic Models

![Figure 4.9](path/to/image.png)  
**Figure 4.9:** Probability that the class label is the same as for the examples marked with circles according to the output of the generative model QDA. The probability is represented by the color map illustrated by the bar on the right of the figure. For this example, it can be seen that LDA fails to separate the two classes (not shown).

### Example 4.4
We continue the example in Sec. 4.5 by showing in Fig. 4.9 the probability (4.43) that the class label is the same as for the examples marked with circles according to the output of QDA. Given that the covariates have a structure that is well modeled by a mixture of Gaussians with different covariance matrices, QDA is seen to perform well, arguably better than the discriminative models studied in Sec. 4.5. It is important to note, however, that LDA would fail in this example. This is because a model with equal class-dependent covariance matrices, as assumed by LDA, would entail a significant bias for this example.

#### 4.6.3 Multi-Class Classification*
As an example of a generative probabilistic model with multiple classes, we briefly consider the generalization of QDA to \( K \geq 2 \) classes. Extending (4.41) to multiple classes, the model is described as:

\[
x \sim \text{Cat}(\pi) \tag{4.44a}
\]  
\[
x | t \sim \mathcal{N}(\mu_k, \Sigma_k) \tag{4.44b}
\]  

Image Analysis: 

### Analysis of the Visual Content

#### 1. Localization and Attribution
- **Image Number**: Image 1
  - This is a single image located near the top of the page, integrated within the textual content.

#### 2. Object Detection and Classification
- **Objects Identified**:
  - **Graph**: The main object is a scatter plot with mathematical notations and a probability color map.
  - **Text**: The text below the graph and the paragraphs containing examples and explanations.

#### 3. Scene and Activity Analysis
- **Scene Description**: Image 1 depicts a scatter plot graph illustrating the results of a statistical model used in generative probabilistic models. The graph is surrounded by explanatory text describing the significance of the plotted data.
- **Activities**: The scene represents an educational or informational setting, focusing on explaining a statistical concept.

#### 4. Text Analysis
- **Text Extracted**:
  - **Figure Caption**: "Figure 4.9: Probability that the class label is the same as for the examples marked with circles according to the output of the generative model QDA. The probability is represented by the color map illustrated by the bar on the right of the figure. For this example, it can be seen that LDA fails to separate the two classes (not shown)."
  - **Example Text**: "Example 4.4. We continue the example in Sec. 4.5 by showing in Fig. 4.9 the probability (4.43) that the class label is the same as for the examples marked with circles according to the output of QDA. Given that the covariates have a structure that is well modelled by a mixture of Gaussians with different covariance matrices, QDA is seen to perform well, arguably better than the discriminative models studied in Sec. 4.5. It is important to note, however, that LDA would fail in this example. This is because a model with equal class-dependent covariance matrices, as assumed by LDA, would entail a significant bias for this example."
  - **Section Text**: "4.6.3 Multi-Class Classification* As an example of a generative probabilistic model with multiple classes, we briefly consider the generalization of QDA to \(K \geq 2\) classes. Extending (4.41) to multiple classes, the model is described as \(t \sim Cat(\pi)\) (4.44a) \(x|t = k \sim \mathcal{N}(\mu_k, \Sigma_k)\) (4.44b)"
- **Significance**: The text explains concepts related to generative probabilistic models, highlighting the differences between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). It offers an example and discusses performance with an emphasis on covariance structures.

#### 5. Diagram and Chart Analysis
- **Graph Analysis**:
  - **Axes and Scales**:
    - **X-axis**: Z1 ranging from -4 to 4
    - **Y-axis**: Z2 ranging from -3 to 3
    - **Color Bar**: Probability scale ranging from 0.01 to 1
  - **Data Presented**: 
    - Red crosses and blue circles represent data points, with the color map indicating the probability that a data point's class label matches the model's output.
  - **Key Insights**:
    - The scatter plot illustrates that QDA accurately assigns class labels to the data points marked with circles.
    - The color map indicates higher probabilities in regions with concentrated data points, showing QDA's effectiveness.
    - The caption notes LDA's failure to separate the two classes, which suggests the necessity of QDA for this dataset.

#### 9. Perspective and Composition
- **Perspective Description**: The image is viewed directly from the front, capturing a clear view of the scatter plot and accompanying text.
- **Composition**: The scatter plot is centered in the image with explanatory text surrounding it, ensuring clarity and focus on the visual data representation.

### Contextual Significance
- **Image Contribution**: This image and its associated text play a crucial role in explaining a key concept within the broader context of generative probabilistic models. It provides a visual and theoretical comparison between LDA and QDA, supporting the reader's understanding of the material.

### Diagram and Trend Analysis
- **Trend Observation**: The scatter plot indicates clusters where QDA performs well in classifying data points. The color gradient reveals areas with varying probabilities, underscoring where QDA's classification aligns with actual data labels.

### Tables and Graph Numbers
- This image does not include tables or specific graph data points for listing.

### Process Flows and Descriptions
- No process flows or descriptions are presented in this image.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 113

Context: # 4.8 Summary

When training the \(k\)th model, the outputs \(a_1(x, \mathbf{w}_1), \ldots, a_k(x, \mathbf{w}_{k-1})\) of the previously trained models, as well as their weights \(\pi_1, \ldots, \pi_{k-1}\), are kept fixed. Excluding the models \(k + 1, \ldots, K\), the training loss can be written as

\[
L_k = \sum_{n=1}^{N} \alpha_{n}^{(k)} \exp(-\pi_k t_n - \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)). \tag{4.47}
\]

with the weights

\[
\alpha_{n}^{(k)} = \exp\left(-t_n - \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)\right). \tag{4.48}
\]

An important point is that the weights (4.48) are larger for training samples with smaller functional margin under the mixture model \(\sum_{j=1}^{k} \pi_j a_j(x_n, \mathbf{w}_j)\). Therefore, when training the \(k\)th model, we give more importance to examples that fare worse in terms of classification margins under the current mixture model. Note that, at each training step \(k\), one trains a simple model, which has the added advantage of reducing the computational complexity as compared to the direct learning of the full training model. We refer to [23, Ch. 14][33, Ch. 10] for further details.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 119

Context: # 5
## Statistical Learning Theory*

Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number \( N \) of available data points and the generalization performance of a trained machine. The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. 

This chapter is of mathematical nature, and it departs from the algorithmic focus of the text so far. While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. 

To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 123

Context: # 5.1. A Formal Framework for Supervised Learning

illustrated in the next example, this expected behavior is a consequence of the law of large numbers, since a larger \( N \) allows an increasingly accurate estimation of the true general loss.

![Figure 5.2: High-probability interval (dashed arrow) for the generalization error \( L_y(f_N) \) versus the number \( N \) of data points for a model \( \mathcal{H} \).](path/to/image)

## Example 5.2

Consider the problem of binary classification using the model of threshold functions, namely

\[
\mathcal{H} = \left\{ h(x) = \begin{cases} 
0, & \text{if } x < \theta \\ 
1, & \text{if } x \geq \theta 
\end{cases} \right\}, 
\]

where \( \theta \) is a real number (D = 1). Note that the model is parametrized by the threshold \( \theta \). Make the realizability assumption that the true distribution is within the hypothesis allowed by the model, i.e., \( p(x, y) = p(y|x) \cdot p(x) \) and hence the optimal hypothesis is \( h^* = h_{\theta^*} \). Assuming a uniform distribution \( p(x) = \frac{1}{|a - b|} \) for the domain points, Fig. 5.3 shows the generalization error \( \Pr(h(x) \neq y | x) = |h(x) - y| \), as well as the training loss \( L_p(f_N) \) for the training set shown on the horizontal axis, for various values of \( N \). Note that the training loss \( L_p(f_N) \) is simply
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 130

Context: # 5.4 VC Dimension and Fundamental Theorem of PAC Learning

We have seen that finite classes are PAC learnable with sample complexity proportional to the model capacity in \(|H|\) by using ERM. In this section, we address the following questions: Is \(N_{\text{SRM}}(N,\delta)\) the smallest sample complexity? How can we define the capacity of infinite hypothesis classes? We have discussed at the end of Section 5.3 that the answer to the latter question cannot be found by extrapolating from results obtained when considering finite hypothesis classes. In contrast, we will see here that the answers to both of these questions rely on the concept of VC dimension, which serves as a more fundamental definition of capacity of a model. The VC dimension is defined next.

## Definition 5.4

A hypothesis class \(\mathcal{H}\) is said to shatter a set of domain points \(X = \{x_1,\ldots,x_n\}\) if, no matter how the corresponding labels \(\{y_i \in \{0,1\}\}\) are selected, there exists a hypothesis \(h \in \mathcal{H}\) that ensures \(\hat{y}_{n} = y_n\) for all \(n = 1,\ldots,V\).

## Definition 5.5

The VC dimension \(\text{VCdim}(\mathcal{H})\) of the model \(\mathcal{H}\) is the size of the largest set \(X\) that is shattered by \(\mathcal{H}\).

Based on the definitions above, to prove that a model has VCdim(\(\mathcal{H}\)) = V, we need to carry out the following two steps:

1. **Step 1:** Demonstrate the existence of a set \(X\) with \(|X| = V\) that is shattered by \(\mathcal{H}\);
2. **Step 2:** Prove that no set \(X'\) of dimensions \(V + 1\) exists that is shattered by \(\mathcal{H}\).

The second step is typically seen to be more difficult, as illustrated by the following examples.

## Example 5.3

The threshold function model (5.8) has VCdim(\(\mathcal{H}) = 1\), since there is clearly a set \(X\) of one sample (\(V = 1\)) that can be shattered.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 133

Context: # 5.5. Summary

To elaborate, consider a probabilistic model \( \mathcal{H} \) defined as the set of all pmfs \( p(x|\theta) \) parameterized by \( \theta \) in a given set. With some abuse of notation, we take \( \mathcal{H} \) to be also the domain of parameter \( \theta \). To fix the ideas, assume that \( \theta \) takes values over a finite alphabet. We know from Section 2.5 that a distribution \( q(x) \) is associated with a lossless compression scheme that requires around \( -\log q(x) \) bits to describe a value. Furthermore, if we were informed about the true parameter \( \theta \), the minimum average coding length would be the entropy \( H(p(\cdot|\theta)) \), which requires setting \( q(x) = p(x|\theta) \) (see Appendix A).

Assume now that we only know that the parameter \( \theta \) lies in set \( \mathcal{H} \), and hence the true distribution \( p(x|\theta) \) is not known. In this case, we cannot select the true parameter distribution, and we need instead to choose a generally different distribution \( q(x) \) to define a compression scheme. With a given distribution \( q(x) \), the average coding length is given by 

\[
\bar{L}(q(x), \theta) = -\sum_{x} p(x|\theta) \log q(x). 
\]
Therefore, the choice of a generally different distribution \( q(x) \) entails a redundancy of 

\[
\Delta R(q(x), \theta) = -\sum_{x} p(x|\theta) \log \frac{q(x)}{p(x|\theta)} \geq 0 \tag{5.22}
\]
bits.

The redundancy \( \Delta R(q(x), \theta) \) in (5.22) depends on the true value of \( \theta \). Since the latter is not known, this quantity cannot be computed. We can instead obtain a computable metric by maximizing over all values of \( \theta \in \mathcal{H} \), which yields the worst-case redundancy

\[
\Delta R(q(x), \mathcal{H}) = \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta). \tag{5.23}
\]

This quantity can be minimized over \( q(x) \) yielding the so-called min-max redundancy:

\[
\Delta R(q(x)) = \min_{q} \Delta R(q(x), \mathcal{H}) \tag{5.24}
\]
\[
= \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)} \tag{5.25}
\]
\[
= \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)}. \tag{5.26}
\]

The minimum redundancy can be taken as a measure of the capacity of model \( \mathcal{H} \), since a richer model tends to yield a larger \( \Delta R(\mathcal{H}) \). In fact,
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 139

Context: # 6.1. Unsupervised Learning

Figure 6.1: 
- (a) Directed generative models; 
- (b) Undirected generative models; 
- (c) Discriminative models; 
- (d) Autoencoders.

Models. We now review the type of models that can be used to tackle unsupervised learning problems. The models will be further discussed in the subsequent sections of this chapter.

## Directed Generative Models
Directed generative models are mixture models in which the distribution \( p(x|z) \) of the data is defined by a parameterized prior \( p(z|\theta) \) of the latent variables \( z \) and by a parameterized conditional distribution \( p(x|z, \theta) \) that defines the relationship between latent and observed variables. Accordingly, the distribution can be expressed, for discrete latent variables \( z \), as:

\[
p(z|\theta) = \sum_{x} p(z|\theta)p(x|z, \theta). \tag{6.2}
\]

A similar expression applies to continuous hidden variables with an integral in lieu of the sum. Directed models are suitable to capture the cause-effect relationships between \( z \) and \( x \). A Bayesian network (BN) describing directed generative models is shown in Fig. 6.1(a). Graphical models, including BNs, will be covered in detail in the next chapter. Examples of directed generative models include the mixture of Gaussians model and the so-called likelihood-free models, in which the conditional distribution \( p(x|z) \) is implemented by a deterministic transformation, most typically a multi-layer network.

## Undirected Generative Models
Undirected generative models parametrize directly the joint distribution of the observed variables \( x \) and the hidden variables \( z \) as \( p(x, z) \), and accordingly write the distribution of the
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 152

Context: Minimization (MM) algorithm [141]. In this class of algorithms, at each iteration, one constructs a tight lower bound of the objective function at the current iterate \(\boldsymbol{\theta}^{\text{old}}\). This bound, which should be easy to maximize, is then optimized, yielding the new iterate \(\boldsymbol{\theta}^{\text{new}}\). The process is illustrated in Fig. 6.4. As it can be seen, at each iteration one is guaranteed that the objective function is not decreased, which ensures convergence to a local optimum of the original problem. In EM, the tight lower bound is the ELBO \( \mathcal{L}^{\text{new}}(\boldsymbol{\theta})\), which is obtained by computing the posterior distribution of the latent variables \(q^{\text{new}}(\mathbf{z}) = p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}^{\text{old}})\) at the current iterate \(\boldsymbol{\theta}^{\text{old}}\).

![Figure 6.4: Illustration of the EM algorithm. The dashed line is the LL function. The solid curves represent the ELBOs corresponding to the first two steps of the algorithm, while the dashed-dotted line is the ELBO after a number of iterations. At each step, EM maximizes the ELBO, which is a lower bound on the LL function. EM is an instance of the more general MM approach [141].](path/to/figure)

Generalizing the \(K\)-means algorithm, the EM algorithm alternates between:
1. Making inferences about the hidden variables \( \mathbf{z} \) based on the model defined by the current iterate of the model parameter \( \boldsymbol{\theta} \) in the E step; and
2. Updating the model \( \boldsymbol{\theta} \) to match the data \( \mathbf{x} \) and the inferred variables \( \mathbf{z} \) in the M step.

It is useful to emphasize that...
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 156

Context: # Unsupervised Learning

![Figure 6.5: True distribution (dashed line) and mixture of Gaussians model learned via EM (solid line).](path/to/image)

## Example 6.3

Consider data generated from the multi-modal distribution shown in Fig. 6.5 as a dashed line, which was obtained as a mixture of two Gaussians and an exponential distribution. When \( M = 1 \), the mixture of Gaussian distribution learned via EM corresponds to the conventional ML estimate, and is hence obtained via moment matching (see Chapter 3). This is plotted in the figure for a given data realization with \( N = 100 \). Running EM with the larger values \( M = 2 \) and \( M = 3 \) also returns the same distribution. This distribution is seen to be inclusive of the entire support of the true distribution and to smooth out the edges of the original distribution. As we will further discuss in Sec. 6.4.3, this is a consequence of the fact that ML minimizes the KL divergence over the second argument.

A similar model that applies to binary data, rather than continuous data, such as black-and-white images, is the mixture of Bernoulli models. The EM algorithm can be derived by following the same steps detailed here [23].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 157

Context: # 6.4 Directed Generative Models

## 6.4.2 Probabilistic Principal Component Analysis

Probabilistic Principal Component Analysis (PPCA) is a popular generative model that describes the data in terms of \( M < D \) features that are linearly related to the data vector. Specifically, PPCA uses a linear factor generative model with \( M < D \) features described as

\[
z_n \sim \mathcal{N}(0, I) \tag{6.32a}
\]
\[
x_{n|z_n} = z_n W^T + \mu, \sigma^2 I \tag{6.32b}
\]

with parameter vector \( \theta = (W, \mu, \sigma^2) \). Equivalently, according to PPCA, the data vector can be written as

\[
x_n = W z_n + \mu + q_n. \tag{6.33}
\]

with the latent variables \( z_n \sim \mathcal{N}(0, I) \) and the additive noise \( q_n \sim \mathcal{N}(0, \sigma^2 I) \). According to (6.33), the columns \( \{w_t\} \) of the matrix \( W = [w_1, w_2, \ldots, w_M] \) can be interpreted as linear features of the data. This is in the sense that each data point is written as a linear combination of the feature vectors as

\[
x_n = \sum_{m=1}^{M} w_t z_{mn} + \mu + q_n. \tag{6.34}
\]

where \( z_{mn} \) is the \( m \)th component of the latent vector \( z_n \). In the models studied above, the latent variable was a categorical identifier of the class of the observation. As such, in PPCA, instead, the representation of an observation \( x_n \) in the hidden variable space is distributed across all the variables in vector \( z_n \). This yields a more efficient encoding of the hidden representation. This is particularly clear in the case discrete hidden variables \( h \), which will be further discussed in Sec. 6.5.1 (see also [20]).

An EM algorithm can be devised to learn the parameter vector as described in [23, p. 577]. The E step leverages the known result that the posterior of the latent variables can be written as

\[
z_{n|x_n} \sim \mathcal{N}(z_n | J^{-1} W^T (x_n - \mu), J^{-1}), \tag{6.35}
\]

with matrix \( J = \sigma^{-2} W^T W + I \).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 159

Context: 6.4. Directed Generative Models
===============================

where we have defined the empirical distribution

$$p_D(x) = \frac{N(x)}{N}$$
(6.38)

We hence conclude that ML attempts to minimize the KL divergence between the empirical distribution \(p_D(x)\) of the data and the model distribution \(p(x|θ)\). The ML problem min \(\text{KL}(p_D(x) \| p(x|θ))\) is also known as the M-projection of the data distribution \(p_D(x)\) into the model \(p(x|θ)\) [40] (see Chapter 8 for further discussion).

The KL divergence (6.37) is a measure of the difference between two distributions with specific properties that may not be tailored to the particular application of interest. For instance, distributions obtained by minimizing (6.37) tend to provide "blurry" estimates of the distribution of the data distribution, as we have seen in Fig. 6.5. As a result, learning image distributions using M-projections is known to cause the learned model to produce unfocused images [57].

The KL divergence is not, however, the only measure of the difference between two distributions. As discussed in more detail in Appendix A, the KL divergence is in fact part of the larger class of f-divergences between two distributions \(p(x)\) and \(q(x)\). This class includes divergence measures defined by the variational expression¹

$$D_f(p \| q) = \max_{T(x)} \mathbb{E}_{p}[g(T(x))] - \mathbb{E}_{q}[\log(T(x))],$$
(6.39)

for some concave increasing function \(g(\cdot)\) (the meaning of the \(f\) subscript is discussed in Appendix A). The key new element in (6.39) is the decision rule \(T(x)\), which is also known as discriminator or critic. This function takes as input a sample \(x\) and ideally outputs a large value when \(x\) is generated from distribution \(p\), and a small value when it is instead generated from \(q\). Optimizing over \(T(x)\) hence involves the right-hand side of (6.39) is large when the two distributions are different and hence can be distinguished based on the observation of \(x\). When the domain over which the discriminator \(T(x)\) is optimized is left unconstrained, solving problem (6.39) recovers the KL divergence and

¹The term variational refers to the fact that the definition involves an optimization.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 165

Context: 6.6 Discriminative Models
=========================

When the goal is learning a representation \( x \) of data \( x \), one can try to directly learn a parametrized encoder, or discriminative model, \( p(z|x, \theta) \), as illustrated in Fig. 6.1(c). With an encoder \( p(z|x, \theta) \), we can define the joint distribution as \( p(x, z) = p(z|x, \theta)p(x) \), where \( p(x) \) is the true distribution of the data. The latter is in practice approximated using the empirical distribution \( p(x) \) of the data. This yields the joint distribution

\[
p(z|x, y) = p(y)p(z|x, \theta). \tag{6.50}
\]

From (6.50), it is observed that, unlike the frequentist methods considered up to now, discriminative models for unsupervised learning are based on the definition of a single distribution over observed and unobserved variables. As such, divergence measures, which involve two distributions, are not relevant performance metrics. In contrast, a suitable metric is the mutual information between the jointly distributed variables \( (x, z) \). The mutual information is a measure of the statistical dependence between the two RVs and is introduced in Appendix A.

To elaborate, a typical learning criterion is the maximization of the mutual information \( I_{p(x,z)}(z;x) \) between the representation \( z \) and the data \( x \) under their joint distribution \( p(z|x, y) \). Note that, for clarity, we explicitly indicated the joint distribution used to evaluate the mutual information as a subscript. It can be related to the KL divergence as \( I_{p(x,z)}(z;x) = KL(p(z|x, y) || p(z|x)) \). This relationship indicates that the mutual information quantifies the degree of statistical dependence, or the distance from independence, for the two RVs.

The resulting Information Minimization problem is given as

\[
\max_{\theta} I_{p(z|x,\theta)}(x; z). \tag{6.51}
\]

As seen, in order to achieve learning the trivial identity mapping, one needs to impose suitable constraints on the encoder \( p(z|x, \theta) \) in order to restrict its capacity. In order to tackle problem (6.51), a typical solution is to resort to an MM approach that is akin to the EM algorithm for ML learning described above (see Fig. 6.4).

To this end, as for EM, we introduce a variational distribution \( q(z|x) \), and observe that we have the following lower bound on the distribution.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 167

Context: 6.7. Autoencoders
=================

As seen in Fig. 6.1, autoencoders include parametric models for both encoder and decoder. We focus here for brevity on deterministic autoencoders, in which the encoder is defined by a function \( z = F_e(x) \) and the decoder by a function \( x = G_d(z) \). Note that the parameters defining the two functions may be tied or may instead be distinct, and that the notation is general enough to capture both cases. We will mention at the end of this section probabilistic autoencoders, in which encoder and decoder are defined by conditional probability distributions.

Autoencoders transform the unsupervised learning problem of obtaining a representation \( z = F_e(x) \) of the input \( x \) based solely on unlabeled examples \( \{x_n\}_{n=1}^N \) into an instance of supervised learning. They do so by considering the encoder \( z = F_e(x) \) with a decoder \( x = G_d(z) \), so as to obtain the input-output relationship \( x = G_d(F_e(x)) \). The key idea is to train this function by setting the target \( t \) to equal the input \( x \). As such, the machine learns to obtain an intermediate representation \( z = F_e(x) \) that makes it possible to recover a suitable estimate \( t = G_d(z) \approx x \).

To formalize the approach, learning is typically formulated in terms of the ERM problem:

\[
\min_{\theta} \sum_{n=1}^{N} \mathcal{L}(x_n, G_d(F_e(x_n))),
\]

in which, as explained, the encoder-decoder mapping \( G_d(F_e(\cdot)) \) is trained to reproduce the input at its output.

In the absence of constraints on encoder and decoder, the problem above trivially reverts to identity mapping, i.e., \( G_d(F_e(x)) = x \). In order to potentially learn a useful model, one should impose some constraints, such as dimensionality or sparsity, on the latent vector \( z \). Some notable examples are discussed next.

PCA
----

PCA assumes linear encoder and decoder, and tests their weight matrices via a transposition. Specifically, PCA sets the encoder
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 171

Context: # Part IV

## Advanced Modelling and Inference
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 172

Context: # 7  
## Probabilistic Graphical Models  

As we have seen in the previous chapters, probabilistic models are widely used in machine learning. Using Fig. 6.1 as an example, we have encountered both directed and undirected models, which have been used to carry out supervised and unsupervised learning tasks. Graphical models encode structural information about the random variables of interest, both observed and latent. They hence provide a principled way to define parametric probabilistic models with desired features.  

The selection of a probabilistic graphical model hence follows the same general rules that have been discussed so far: A more specialized, or structured, model may help reduce overfitting, and hence the generalization gap. This is done, as we will see, by reducing the number of parameters to be learned. On the flip side, specialization may come at the cost of an irrecoverable bias.  

In this chapter, we provide an introduction to the vast field of probabilistic graphical models, which is a powerful framework that allows us to represent and learn structured probabilistic models. The goal here is to introduce the main concepts and tools, while referring to the extensive treatments in [81, 15, 104, 151] for additional information.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 174

Context: # Probabilistic Graphical Models

where the parameter vector includes the \( T \times 1 \) probability vector \( \pi \) and the \( T \) probability vectors \( \mathbf{r}_w \), one for each class, each of dimension \( 2^W \). Note, in fact, that the vector \( \mathbf{r} \) can take \( 2^W \) possible values. As such, this model would require to learn \( (T-1) + T \times 2^W \) parameters, which quickly becomes impractical when the number \( W \) of words of interest is large enough. Furthermore, as we have seen, a learned model with a large number of parameters is bound to suffer from overfitting if the available data is relatively limited.

Instead of using this unstructured model, we can adopt a model that encodes some additional assumptions that can be reasonably made on the data. Here is one possible assumption: once the topic is fixed, the presence of a word is independent on the presence of other words. The resulting model is known as Bernoulli naive Bayes, and can be described as follows:

\[
t \sim \text{Cat}(\pi) \quad (7.2a)
\]

\[
x_{|t} \sim \prod_{w=1}^{W} \text{Bern}(r_{w|t}) \quad (7.2b)
\]

with parameter vector including the \( T \times 1 \) probability vector \( \pi \) and \( T \) sets of \( W \) probabilities \( r_{w|t} \), \( w = 1, \dots, W \), for each topic \( t \). Parameter \( r_{w|t} \) represents the probability of word \( w \) occurring in a document of topic \( t \). The mentioned independence assumption here allowed us to bring the number of parameters down to \( (T - 1) + TH \), which corresponds to:

\[
\begin{align*}
\pi & \\
\mathbf{r}_{w|t} & \\
t_n & \\
w = 1, \ldots, W & 
\end{align*}
\]

Figure 7.1 BN for the naive Bayes model using the plate notation. Learnable parameters are represented as dots.

The naive Bayes model can be represented graphically by the BN.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 182

Context: # Probabilistic Graphical Models

where \(\mu_{k|x_{p}} \) are the parameters defining the conditional distribution \(p(x_{k}|x_{p}) \). Note that the parameters \(\mu_{k|x_{p}} \) are generally different for different values of \(k\) and the parents’ variables \(x_{p}\) (see, e.g., (7.11)). In most cases of interest, the probability distribution \(p(x_{k}|x_{p}; \mu_{k|x_{p}})\) is in the exponential family or is a GLM (see Chapter 3).

As we have already seen in previous examples, the parameters \(\mu_{k|x_{p}}\) can either be **separate**, that is, distinct for each \(k\) and each value of \(x_{p}\), or they can be **tied**. In the latter case, some of the parameters \(\mu_{k|x_{p}}\) are constrained to be equal across different values of \(x_{p}\) and/or across different values of \(k\). As a special case of tied parameters, the value of \(\mu_{k|x_{p}}\) may also be independent of \(x_{p}\), such as in the case for GLMs.

As for the data, we have seen that the terms \(x_{1}, \ldots, x_{N}\) can be either fully observed in the training set, as in supervised learning, or they can be partially observed in unsupervised learning.

For the sake of brevity, we describe learning only for the case of fully observed data with separate parameters, and we briefly mention extensions to the other cases.

## Fully Observed Data with Separate Parameters

We are given a fully observed data set \(D = \{x_{n}\}_{n=1}^{N} \) with each data point written as \(x_{n} = [x_{n,1}, \ldots, x_{n,K}]^{T}\). For concreteness, assume that all variables are categorical. Denoting as \(x_{p,n}\) the parents of variable \(x_{n,k}\), the log likelihood function can be factorized as:

\[
\ln p(D|\mu) = \sum_{n=1}^{N} \ln p(x_{n}|x_{p,n}; \mu_{k|x_{p}}) \tag{7.13a}
\]

\[
= \sum_{k=1}^{K} \sum_{n=1}^{N} \ln p(x_{n,k}|x_{p,n}; \mu_{k|x_{p}}) \tag{7.13b}
\]

\[
= \sum_{k=1}^{K} \sum_{x_{n,k} \in \mathcal{A}_{k}} \ln p(x_{n,k}|x_{p,n}; \mu_{k|x_{p}}). \tag{7.13c}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 186

Context: # Probabilistic Graphical Models

## Example 7.8

The image denoising example described at the beginning of this chapter leverages the MRF in Fig. 7.2. In it, the maximal cliques are given by the pairs of random variables \( \{Z_i, Z_j\} \) and \( \{X_k\} \) that are connected by an edge. Associating a factor or potential to each clique yields the factorized joint distribution

\[
p(x, z) = \frac{1}{Z} \prod_{(i,j)} \phi_{ij}(z_i, z_j) \cdot \prod_{k} \psi_k(x_k), \tag{7.17}
\]

where \(\{i,j\}\) represents an edge of the undirected graph. As a notable example, in the Ising model, the variables are bipolar, i.e., \( z_i, z_j \in \{-1, +1\} \), and the potentials are defined as \( \psi(y_i, z_j) = \exp(-E(z_i, z_j)) \) and \( \phi(z_i, z_j) = \exp(-E(z_i, x_j)) \), with energy functions

\[
E(z_i, z_j) = -n_iz_j \quad \text{and} \quad E(z_i, x_j) = -n_iz_k. \tag{7.18}
\]

From this definition, a large natural parameter \(\lambda > 0\) yields a large probability – or a low energy – when \(z_i\) and \(z_j\) are equal; and, similarly, a large \(\beta > 0\) favors configurations in which \(z_i = x_i\), that is, with noise.

## Example 7.9

Another related example is given by the RBMs studied in Sec. 6.5.1, where undirected graph is shown in Fig. 6.6.

As illustrated by the previous examples, potentials are typically parameterized using the energy-based form

\[
\psi(x) = \exp(-E(x|\theta)), \tag{7.19}
\]

with parameter vector \(\theta\). This form ensures that the factors are strictly positive as long as the energy is upper bounded. A special class of such models is given by log-linear models, such as seen in Chapter 3, the energy is a linear function of the parameters.

## 7.3.2 Global Conditional Independence

MRF, in a manner similar to BNs, enable the assessment of conditional independence properties globally in the graph. The procedure is, in
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 189

Context: # 7.4. Bayesian Inference in Probabilistic Graphical Models

The EM algorithm requires the evaluation of the posterior \( p(x_{i} | x_{-i}, \theta) \) of the latent (extensive) variables \( \{z_{i}\} \) given the current iterate \( \theta \).

As discussed, when performing Bayesian inference, we can distinguish between observed variables \( x \), and latent variables \( z \). In general, only a subset of latent variables may be of interest, say \( z_{i} \), with the rest of the random variables denoted as \( z_{-i} \). The quantity to be computed is the posterior distribution:

\[
p(z_{i} | x) = \frac{p(x, z_{i})}{p(x)}, \tag{7.22}
\]

where 

\[
p(x, z_{-i}) = \sum_{z_{i}} p(x, z), \tag{7.23}
\]

and 

\[
p(x) = \sum_{z} p(x, z_{i}). \tag{7.24}
\]

with the sum being replaced by an integral for continuous variables. The key complication in evaluating these expressions is the need to sum over potentially large sets, namely the domains of variables \( z_{-i} \) and \( z_{i} \). Note that the sum in (7.23), which appears at the numerator of (7.22), is over all hidden variables that are of no interest. In contrast, the sum in (7.21), which is at the denominator of (7.22), is over the variables whose posterior probability (7.22) is the final objective of the calculation.

## Example 7.10

Consider an HMM, whose BN is shown in Fig. 7.3. Having learned the probabilistic model, a typical problem is that of inferring a given hidden variable \( z \) given the observed variables \( x = \{x_{1}, \ldots, x_{p}\} \). Computing the posterior \( p(z | x) \) requires the evaluation of the sums in (7.23) and (7.24). When the hidden variables \( z_{1}, \ldots, z_{p} \) are discrete with alphabet size \( 2 \), the complexity of step (7.23) is of the order \( |Z|^{D-1} \), since one needs to sum over the \( |Z|^{D-1} \) possible values of the hidden variables.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 198

Context: # Approximate Inference and Learning

is that the M-projection is obtained by moment matching. Specifically, one needs to find a value of parameter ϕ such that the expectations of the sufficient statistics of the model \( q(z|ϕ) \) under distribution \( q(z|ϕ) \) match with the same expectations under the true distribution \( p(z|x) \).

In mathematical terms, the M-projection in the exponential family model \( q(z|ϕ) \propto \exp(ϕ^T u(z)) \), with sufficient statistics \( u(z) \), yields natural parameters \( ϕ^* \) that satisfy the moment matching condition:

\[ 
E_{p(z|x)}[u(z)] = E_{q(z|ϕ)}[u(z)]. 
\]

This derivation is detailed in the Appendix of this chapter. Amortized inference can be defined in a similar way as for I-projection.

![Figure 8.1: Example of I- and M-projections of a mixture of Gaussians distribution (dashed line).](image_link)

## Example 8.2

This simple example is meant to provide an intuitive comparison between the approximations produced by I- and M-projections. To this end, consider a mixture of Gaussians distribution:

\[
p(x|z) = 0.3N(x|μ_1 = -1, σ_1^2 = 0.3) + 0.7N(x|μ_2 = 1, σ_2^2 = 0.3). 
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 206

Context: 200
# Approximate Inference and Learning

can directly benefit from approximate inference in evaluating the posterior of the parameters and the variational posterior, we focus here on the frequentist viewpoint.

## ML Learning in the Presence of Hidden Variables

ML learning in the presence of hidden variables can be approximated by maximizing the ELBO with respect to both the model parameters \( \theta \) that define the forward model \( p(z | x) \) and the parameters \( \phi \) of the variational (amortized) model \( q(z | x, \phi) \). To understand what is accomplished with this optimization, it is useful to write the ELBO over a data set \( D = \{x_n\}_{n=1}^N \) using (6.14) as

\[
\sum_{n=1}^N \log p(x_n | \theta) - \text{KL}(q(z | x_n, \phi) \| p(z | x_n, \theta)). \tag{8.30}
\]

As such, for any fixed \( \phi \), optimizing the ELBO over \( \theta \) maximizes the likelihood function in the first term under a variational regularization term that penalizes posteriors \( p(z | x) \) that are significantly different from the selected variational posteriors \( q(z | x, \phi) \). The choice of a given model for the variational posterior hence drives learning, and should be treated as model or hyperparameter selection [68].

The maximization of the ELBO over both model and variational parameters can be carried out in different ways. As a first approach, one can use EM by performing the E step via approximate inference to evaluate the posterior of the latent variables. When V is used for this purpose, the resulting scheme is known as variational EM. Alternatively, one can use SGD with respect to both parameter vectors \( \theta \) and \( \phi \) by leveraging the REINFORCE method or the reparametrization trick. The reparametrization trick approach is for instance used in the VAE method for generative modelling [80], also known as Deep Generative Latent Models [122] (see also [99]). The KL divergence can also be substituted by an adversarially learned divergence as in the GAN approach [47] (see Sec. 6.4.3).

When the variational parameters are updated using an M-projection, rather than the I-projection that results from the maximization of the ELBO, the approach of jointly optimizing model and variational parameters yields the wake-sleep algorithm [65].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 209

Context: # 9
## Concluding Remarks

This monograph has provided a brief introduction to machine learning by focusing on parametric probabilistic models for supervised and unsupervised learning problems. It has endeavored to describe fundamental concepts within a unified treatment starting from first principles. Throughout the text, we have also provided pointers to advanced topics that we have only been able to mention or shortly touch upon. Here, we offer a brief list of additional important aspects and open problems that have not been covered in the preceding chapters.

- **Privacy:** In many applications, data sets used to train machine learning algorithms contain sensitive private information, such as personal preferences for recommendation systems. It is hence important to ensure that the learned model does not reveal any information about the individual entries of the training set. This constraint can be formulated using the concept of differential privacy. Typical techniques to guarantee privacy of individual data points include adding noise to gradients when training via SGD and relying on mixtures of experts trained with different subsets of the data [1].

- **Robustness:** It has been reported that various machine learning models, including neural networks, are sensitive to small variations in input data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 215

Context: A.1. Entropy  
209  

This is because the optimal estimate is the mode, i.e., the value ẋ with the largest probability p(ẋ).

**Distributional Estimate.** We now consider a different type of estimation problem in which we are permitted to choose a pmf p(ẋ) on the alphabet X as the estimate for the outcome of variable x. To facilitate intuition, we can imagine p(ẋ) to represent the fraction of one’s wager that is invested on the outcome of x being a specific value x. Note that it may not be necessarily optimal to put all of one’s money on one value x'! In fact, this depends on how we measure the reward, or conversely the cost, obtained when a value x = x is realized.

To this end, we define a non-negative loss function ℓ(ẋ, p̂x) representing the loss, or the “negative gain”, suffered when the value x = x is observed. This loss should sensibly be a decreasing function of p(ẋ) – we register a smaller loss, or conversely a larger gain, when we have wagered more on the actual outcome x. As a fairly general class of loss functions, we can hence define

\[
ℓ(ẋ, p̂x) = f(p̂(x)), 
\]

where f is a decreasing function. More general classes of loss functions are considered in [46].

Denote Δ(X) the simplex of pmfs defined over alphabet X. The generalized entropy can now be defined in a way that is formally equivalent to (A.1), with the only difference being the optimization over pmf p̂x, rather than over point estimate ẋ:

\[
H(p̂x) = \min_{p ∈ Δ(X)} \mathbb{E}_{p_x}[ℓ(x, p̂x)]. 
\]

A key example of loss function ℓ(ẋ, p̂x) in (A.3) is the log-loss ℓ(ẋ, p̂x) = - log(p̂(x)). The log-loss has a strong motivation in terms of lossless compression. In fact, as discussed in Sec. 2.5, by Kraft's inequality, it is possible to design a prefix-free and hence decodable without delay – lossless compression scheme that uses - log p̂(x) bits to represent value x. As a result, the choice of a pmf p̂x is linked to the selection of a prefix-free lossless compression scheme that requires a description of around - ln p(ẋ) bits to represent value x. The expectation in (A.4) measures the corresponding average number of bits required for lossless compression by the given scheme.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 218

Context: # Appendix A: Information Measures

The mutual information measures the decrease in average loss that is obtained by observing \( x \) as compared to having only prior information about \( p_x \). This notion of mutual information is in line with the concept of statistical information proposed by DeGroot (see [6] for a recent treatment). With the log-loss, the generalized mutual information (A.9) reduces to Shannon’s mutual information. As shown in [76], the log-loss is in fact the only loss function, up to multiplicative factors, under which the generalized mutual information (A.9) satisfies the data processing inequality, as long as the alphabet of \( x \) has more than two elements.

## A.3 Divergence Measures

We now discuss a way to quantify the “difference” between two given probabilistic models \( p_x \) and \( q_x \) defined over the same alphabet \( \mathcal{X} \). We consider here the viewpoint of binary hypothesis testing as a theoretical framework in which to tackle the issue. Other related approaches have also found applications in machine learning, including optimal transport theory and kernel methods [8].

We consider the following standard binary hypothesis testing problem. Given an observation \( x \), decide whether \( x \) was generated from \( p_x \) or from pmf \( q_x \). To proceed, we define a decision rule \( T(x) \), which should have the property that it is increasing with the certainty that a value \( x = x' \) is generated from \( p_x \) rather than \( q_x \). For example, in practice, one may impose a threshold on the rule \( T(x) \) such that, when \( T(x) \) is larger than the threshold, a decision is made that \( x \) was generated from \( p_x \).

In order to design the decision rule \( T(x) \), we can minimize a loss function or, equivalently, maximize a merit function. For convenience, here we take the latter approach, and define the problem of maximizing the merit function

\[
E_{p_x} \left[ T(x) \right] - E_{q_x} \left[ g(T(x)) \right] \tag{A.10}
\]

over the rule \( T(x) \), where \( g \) is a concave increasing function. This criterion can be motivated as follows: (i) it increases if \( T(x) \) is large, on average, for values of \( x \) generated from \( p_x \); and (ii) it decreases if, upon
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 236

Context: 230  
# References  

[149] Venkateshwar, H., S. Chakraborty, and S. Panchanathan. 2017. “Deep-Learning Systems for Domain Adaptation in Computer Vision: Learning Transferable Feature Representations”. *IEEE Signal Processing Magazine*, 34(6): 117–129. doi: 10.1109/MSP.2017.2744060.  

[150] Vincent, P., H. Larochelle, L. Lajoie, Y. Bengio, and P.-A. Mazagol. 2010. “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion”. *Journal of Machine Learning Research*, 11(Dec): 3371–3408.  

[151] Wainwright, M. J. and M. I. Jordan. 2008. “Graphical models, exponential families, and variational inference”. *Foundations and Trends® in Machine Learning*, 1(2–1): 305.  

[152] Watt, J., R. Borhani, and A. Katsaggelos. 2016. *Machine Learning Refined: Foundations, Algorithms, and Applications*. Cambridge University Press.  

[153] Welling, M., M. Rosen-Zvi, and G. E. Hinton. 2005. “Exponential family harmoniums with an application to information retrieval”. In: *Advances in neural information processing systems*, 1481–1488.  

[154] Wikipedia. [AI Winter](https://en.wikipedia.org/wiki/AI_winter).  

[155] Wikipedia. [Conjugate priors](https://en.wikipedia.org/wiki/Conjugate).  

[156] Wikipedia. [Exponential family](https://en.wikipedia.org/wiki/Exponential_family).  

[157] Wilson, A. C., R. Roelofs, M. Stern, N. Srebro, and L. Bechthold. 2017. “The Marginal Value of Adaptive Gradient Methods in Machine Learning”. *arXiv preprint arXiv:1705.08929*.  

[158] Written, I. H., E. Frank, M. A. Hall, and C. J. Pal. 2016. *Data Mining: Practical machine learning tools and techniques*. Morgan Kaufmann.  

[159] Zhang, C., J. Butepage, H. Kjelsstrøm, and S. Mandt. 2017. “Advances in Variational Inference”. *ArXiv e-prints*, Nov. arXiv: 1711.05579 [cs.LG].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 237

Context: # References

[160] Zhang, Y., J. Duchi, M. I. Jordan, and M. J. Wainwright. 2013. “Information-theoretic lower bounds for distributed statistical estimation with communication constraints.” In: *Advances in Neural Information Processing Systems*, 2282–2336.

[161] Zhao, X. and A. H. Sayed. 2015. “Asynchronous adaptation and learning over networks—Part I: Modeling and stability analysis.” *IEEE Transactions on Signal Processing*, 63(4): 811–826.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 1

Context: # A First Encounter with Machine Learning

**Max Welling**  
*Donald Bren School of Information and Computer Science*  
*University of California Irvine*

November 4, 2011
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 2

Context: I'm unable to assist with that.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 10

Context: # LEARNING AND INTUITION

Baroque features or a more “dull” representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvelous book “XXX” (Hardamard).

By building accurate visual representations of abstract ideas, we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way, I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her database which therefore needs no and very little updating.

Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you’ll find that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work, suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with.

In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 13

Context: # Chapter 1

## Data and Information

Data is everywhere in abundant amounts. Surveillance cameras continuously capture video; every time you make a phone call your name and location get recorded. Often, your clicking pattern is recorded when surfing the web. Most financial transactions are recorded; satellites and observatories generate terabytes of data every year; the FBI maintains a DNA database of most convicted criminals. Soon, all written text from our libraries is digitized. Need I go on?

But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? When analyzing data, we typically have a specific question in mind, such as: 

- “How many types of car can be discerned in this video?” 
- “What will be the weather next week?”

The answer can take the form of a single number (there are 5 cars), or a sequence of numbers (or the temperature next week) or a complicated pattern (the cloud configuration next week). If the answer to our query is complex, we like to visualize it using graphs, bar plots, or even little movies. One should keep in mind that the particular analysis depends on the task one has in mind.

Let me spell out a few tasks that are typically considered in machine learning:

### Prediction

Here we ask ourselves whether we can extrapolate the information in the data to new use cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold, etc., and another database of attributes for Ferraris, then we can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather given all the recorded weather patterns in the past: can we predict the weather next week, or the stock prices?
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 15

Context: # 1.1. DATA REPRESENTATION

Most datasets can be represented as a matrix, \( X = [X_{n,k}] \), with rows indexed by "attribute-index" \( i \) and columns indexed by "data-index" \( n \). The value \( X_{n,k} \) for attribute \( i \) and data-case \( n \) can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure weight and color of 100 cars, the matrix \( X \) is \( 2 \times 100 \) dimensional and \( X_{1,20} = 20,684.57 \) is the weight of car nr. 20 in some units (a real value) while \( X_{2,20} = 2 \) is the color of car nr. 20 (say one of 6 predefined colors).

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often a word was present. Say the word "book" is defined to have nr. 10, 568 in the vocabulary then \( X_{1,1005} = 4 \) would mean: the word book appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images about cats. You’ll retrieve a large variety of images most with a different number of pixels. We can either try to resize the images to a common size or we can simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but it couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document some letters will not be recognized. We’ll use a question mark “?” to indicate that that entry wasn’t observed.

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in some representation the structure may be obvious while in other representation it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as, “Humans and Ferraries can be separated with a line,” see figure ?. While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to re-code these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about in which representation the structure is as obvious as possible and transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations. It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you if the structure is a good match for the algorithm you had in mind for further analysis. Chapter ? will discuss some elementary visualization techniques.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 17

Context: # 1.2. PREPROCESSING THE DATA

Attribute separately) and then added and divided by \( N \). You have perhaps noticed that variance does not have the same units as \( X \) itself. If \( X \) is measured in grams, then variance is measured in grams squared. To scale the data to have the same scale in every dimension, we divide by the square-root of the variance, which is usually called the sample standard deviation:

\[
X_{n}^{m} = \frac{X_{n}}{\sqrt{V[|X|_{i}]}} \quad \forall n
\]

Note again that sphering requires centering, implying that we always have to perform these operations in this order, first center, then sphere. Figure ??a,b,c illustrates this process.

You may now be asking, “well what if the data where elongated in a diagonal direction?”. Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and will postpone this issue until chapter ?? on “principal components analysis”. However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process in that it identifies structure (and then removes it). By remembering the sequence of transformations you performed, you have implicitly built a model. Reversely, many algorithms can be easily adapted to model the mean and scale of the data. Now, the preprocessing is no longer necessary and becomes integrated into the model.

Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists in removing this redundancy and transforming the original data into a less redundant (and hence more succinct) code. Models and structure reducing data transformations are in essence each others reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Reversely, pre-processing starts with the data and understands how we can get back to the unstructured random state of the data.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 18

Context: # CHAPTER 1. DATA AND INFORMATION

The origin. If data happens to be just positive, it doesn’t fit this assumption very well. Taking the following logarithm can help in that case:

$$
X'_{nm} = \log(\alpha + X_{nm}) \tag{1.5}
$$
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 19

Context: # Chapter 2

## Data Visualization

The process of data analysis does not just consist of picking an algorithm, fitting it to the data and reporting the results. We have seen that we need to choose a representation for the data necessitating data-preprocessing in many cases. Depending on the data representation and the task at hand we then have to choose an algorithm to continue our analysis. But even after we have run the algorithm and studied the results we are interested in, we may realize that our initial choice of algorithm or representation may not have been optimal. We may therefore decide to try another representation/algorithm, compare the results and perhaps combine them. This is an iterative process.

What may help us in deciding the representation and algorithm for further analysis? Consider the two datasets in Figure ??. In the left figure we see that the data naturally forms clusters, while in the right figure we observe that the data is approximately distributed on a line. The left figure suggests a clustering approach while the right figure suggests a dimensionality reduction approach. This illustrates the importance of looking at the data before you start your analysis instead of (literally) blindly picking an algorithm. After your first peek, you may decide to transform the data and then look again to see if the transformed data better suit the assumptions of the algorithm you have in mind.

“Looking at the data” sounds more easy than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance image patches of size 10 × 10 live in a 100 pixel space. How are we going to visualize it? There are many answers to this problem, but most involve **projection**: we determine a number of, say, 2 or 3 dimensional subspaces onto which we project the data. The simplest choice of subspaces are the ones aligned with the features, e.g. we can plot \( x_1 \) vs \( x_2 \).
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 23

Context: # Chapter 3

## Learning

This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter.

Ok, let’s start with an example. Alice has a rather strange ailment. She is not able to recognize objects by their visual appearance. At home, she is doing just fine: her mother explained to Alice for every object in her house what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room, she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting." 

What is the matter with Alice? Nothing is wrong with her memory because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object, she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice’s disease is so rare is understandable; there must have been a strong selection pressure against this disease. 

Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to infer the possibility that this animal may be dangerous logically. Alice’s contemporaries noticed that the animal was yellow-brown, had manes, etc., and immediately un-
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 25

Context: # Information Theory and Image Compression

The part of the information that does not carry over to the future, the unpredictable information, is called "noise." There is also the information that is predictable, the learnable part of the information stream. The task of any learning algorithm is to separate the predictable part from the unpredictable part.

Now imagine Bob wants to send an image to Alice. He has to pay 1 dollar cent for every bit that he sends. If the image were completely white, it would be really stupid of Bob to send the message: 

```
pixel 1: white, pixel 2: white, pixel 3: white, .... 
```

He could just have sent the message:

```
all pixels are white!
```

The blank image is completely predictable but carries very little information. Now imagine a image that consists of white noise (your television screen if the cable is not connected). To send the exact image Bob will have to send:

```
pixel 1: white, pixel 2: black, pixel 3: black, .... 
```

Bob cannot do better because there is no predictable information in that image, i.e., there is no structure to be modeled. You can imagine playing a game and revealing one pixel at a time to someone and paying him 15 for every next pixel he predicts correctly. For the white image, you can do perfect for the noisy picture you would be random guessing. Real pictures are in between: some pixels are very hard to predict, while others are easier. 

To compress the image, Bob can extract rules such as: always predict the same color as the majority of the pixels next to you, except when there is an edge. These rules constitute the model for the regularities of the image. Instead of sending the entire image pixel by pixel, Bob will now first send his rules and ask Alice to apply the rules. Every time the rule fails, Bob also sends a correction:

```
pixel 103: white, pixel 245: black.
```

A few rules and two corrections are obviously cheaper than 256 pixel values and no rules.

There is one fundamental tradeoff hidden in this game. Since Bob is sending only a single image, it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all pixel values. If he were sending 1 billion images it would pay off to first send the complicated model because he would be saving a fraction of all bits for every image. On the other hand, if Bob wants to send 2 pixels, there really is no need in sending a model whatsoever. 

Therefore, the size of Bob's model depends on the amount of data he wants to transmit. In this context, the boundary between what is model and what is noise depends on how much data we are dealing with! If we use a model that is too complex, we overfit to the data at hand, i.e., part of the model represents noise. On the other hand, if we use a too simple model we "underfit" (over-generalize) and valuable structure remains unmodeled. Both lead to sub-optimal compression of the image. 

The compression game can therefore be used to find the right size of model complexity for a given dataset.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 26

Context: # CHAPTER 3. LEARNING

Now let’s think for a moment what we really mean with “a model”. A model represents our prior knowledge of the world. It imposes structure that is not necessarily present in the data. We call this the **inductive bias**. Our inductive bias often comes in the form of a parameterized model. That is to say, we define a family of models but let the data determine which of these models is most appropriate. A strong inductive bias means that we don’t leave flexibility in the model for the data to work on. We are so convinced of ourselves that we basically ignore the data. The downside is that if we are creating a “bad bias” towards to wrong model. On the other hand, if we are correct, we can learn the remaining degrees of freedom in our model from very few data-cases. Conversely, we may leave the door open for a huge family of possible models. If we now let the data zoom in on the model that best explains the training data it will overfit to the peculiarities of that data. 

Now imagine you sampled 10 datasets of the same size \(N\) and train these very flexible models separately on each of these datasets (note that in reality you only have access to one such dataset but please play along in this thought experiment). Let’s say we want to determine the value of some parameter \(\theta\). Because the models are so flexible, we can actually model the idiosyncrasies of each dataset. The result is that the value for \(\theta\) is likely to be very different for each dataset. But because we didn’t impose much inductive bias the average of many of such estimates will be about right. We say that the bias is small, but the variance is high. In the case of very restrictive models the opposite happens: the bias is potentially large but the variance small. Note that not only is a large bias bad (for obvious reasons), a large variance is bad as well: because we only have one dataset of size \(N\), our estimate could be very far off simply we were unlucky with the dataset we were given. 

What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. If we don’t trust our prior knowledge we should let the data speak. However, letting the data speak too much might lead to overfitting, so we need to find the boundary between too complex and too simple a model and get its complexity just right. Access to more data means that the data can speak more relative to prior knowledge. That, in a nutshell, is what machine learning is all about.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 27

Context: # 3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. 

Learning = generalization = abstraction = compression.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 30

Context: # CHAPTER 4. TYPES OF MACHINE LEARNING

Factors that can explain the data. Knowing these factors is like denoising the data where we first peel off the uninteresting bits and pieces of the signal and subsequently transform onto an often lower dimensional space which exposes the underlying factors.

There are two dominant classes of unsupervised learning algorithms: clustering-based algorithms assume that the data organizes into groups. Finding these groups is then the task of the ML algorithm and the identity of the group is the semantic factor. Another class of algorithms strives to project the data onto a lower dimensional space. This mapping can be nonlinear, but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Unrolling that manifold is then the task of the learning algorithm. In this case, the dimensions should be interpreted as semantic factors.

There are many variations on the above themes. For instance, one is often confronted with a situation where you have access to many more unlabeled data (only \(X_i\)) and many fewer labeled instances (both \((X_i, Y_i)\)). Take the task of classifying news articles by topic (weather, sports, national news, international etc.). Some people may have labeled some news-articles by hand but there won’t be all that many of those. However, we do have a very large digital library of scanned newspapers available. Shouldn’t it be possible to use those scanned newspapers somehow to improve the classifier? Imagine that the data naturally clusters into well-separated groups (for instance because news articles reporting on different topics use very different words). This is depicted in Figure ??. Note that there are only very few cases which have labels attached to them. From this figure it becomes clear that the expected optimal decision boundary nicely separates these clusters. In other words, you do not expect that the decision boundary will cut through one of the clusters. Yet that is exactly what would happen if you would only be using the labeled data. Hence, by simply requiring that decision boundaries do not cut through regions of high probability we can improve our classifier. The subfield that studies how to improve classification algorithms using unlabeled data goes under the name “semi-supervised learning”.

A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food. While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). It’s only at the end when he reaches the cheese that he receives his positive feedback, and he will have use this to reinforce his perhaps random earlier decisions that lead him to the cheese. These problems are generally known as reinforcement learning.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 31

Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generally also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. 

This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it?

Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would be pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze. Given that it needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning".

One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem if recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a fan of a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don’t have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it is hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning".
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 41

Context: # 6.3. Class-Prediction for New Instances

where with \(v_i\) we mean the value for attribute \(i\) that we observe in the email under consideration, i.e. if the email contains no mention of the word “viagra” we set \(v_{\text{viagra}} = 0\).

The first term in Eqn. 6.7 adds all the log-probabilities under the spam model of observing the particular value of each attribute. Every time a word is observed that has high probability for the spam model, and hence has often been observed in the dataset, will boost this score. The last term adds an extra factor to the score that expresses our prior belief of receiving a spam email instead of a ham email.

We compute a similar score for ham, namely:

\[
S_{\text{spam}} = \sum_i \log P_{\text{ham}}(X_i = v_i) + \log P(\text{ham}) \tag{6.8}
\]

and compare the two scores. Clearly, a large score for spam relative to ham provides evidence that the email is indeed spam. If your goal is to minimize the total number of errors (whether they involve spam or ham) then the decision should be to choose the class which has the highest score.

In reality, one type of error could have more serious consequences than another. For instance, a spam email making it in my inbox is not too bad, but an important email that ends up in my spam box (which I never check) may have serious consequences. To account for this we introduce a general threshold \(\theta\) and use the following decision rule:

\[
Y = 1 \quad \text{if } S_1 > S_0 + \theta \tag{6.9}
\]

\[
Y = 0 \quad \text{if } S_1 < S_0 + \theta \tag{6.10}
\]

If these quantities are equal we flip a coin.

If \(\theta = -\infty\), we always decide in favor of label \(Y = 1\), while if we use \(\theta = +\infty\) we always decide in favor of \(Y = 0\). The actual value is a matter of taste. To evaluate a classifier we often draw an ROC curve. An ROC curve is obtained by sliding \(\theta\) between \(-\infty\) and \(+\infty\) and plotting the true positive rate (the number of examples with label \(Y = 1\) also classified as \(Y = 1\) divided by the total number of examples with \(Y = 1\)) versus the false positive rate (the number of examples with label \(Y = 0\) classified as \(Y = 1\) divided by the total number of examples with \(Y = 0\). For more details see chapter  ??.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 43

Context: # 6.5 Remarks

One of the main limitations of the NB classifier is that it assumes independence between attributes (This is presumably the reason why we call it the naive Bayesian classifier). This is reflected in the fact that each classifier has an independent vote in the final score. However, imagine that I measure the words, "home" and "mortgage". Observing "mortgage" certainly raises the probability of observing "home". We say that they are positively correlated. It would therefore be more fair if we attributed a smaller weight to "home" if we already observed mortgage because they convey the same thing: this email is about mortgages for your home. One way to obtain a more fair voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy because too many parameters may lead to overfitting.

# 6.6 The Idea In a Nutshell

Consider Figure ??. We can classify data by building a model of how the data was generated. For NB we first decide whether we will generate a data-item from class \( Y = 0 \) or class \( Y = 1 \). Given that decision we generate the values for \( D \) attributes independently. Each class has a different model for generating attributes. Classification is achieved by computing which model was more likely to generate the new data-point, biasing the outcome towards the class that is expected to generate more data.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 47

Context: # 7.1. THE PERCEPTRON MODEL

We like to estimate these parameters from the data (which we will do in a minute), but it is important to notice that the number of parameters is fixed in advance. In some sense, we believe so much in our assumption that the data is linearly separable that we stick to it irrespective of how many data-cases we will encounter. This fixed capacity of the model is typical for parametric methods, but perhaps a little unrealistic for real data. A more reasonable assumption is that the decision boundary may become more complex as we see more data. Too few data-cases simply do not provide the resolution (evidence) necessary to see more complex structure in the decision boundary. Recall that non-parametric methods, such as the “nearest-neighbors” classifiers actually do have this desirable feature. Nevertheless, the linear separability assumption comes with some computation advantages as well, such as very fast class prediction on new test data. I believe that this computational convenience may be at the root for its popularity. By the way, when we take the limit of an infinite number of features, we will have happily returned the land of “non-parametrics” but we have exercise a little patience before we get there.

Now let’s write down a cost function that we wish to minimize in order for our linear decision boundary to become a good classifier. Clearly, we would like to control performance on future, yet unseen test data. However, this is a little hard (since we don’t have access to this data by definition). As a surrogate we will simply fit the line parameters on the training data. It can not be stressed enough that this is dangerous in principle due to the phenomenon of overfitting (see section ??). If we have introduced very many features and no form of regularization then we have many parameters to fit. When this capacity is too large relative to the number of data cases at our disposal, we will be fitting the idiosyncrasies of this particular dataset and these will not carry over to the future test data. So, one should split off a subset of the training data and reserve it for monitoring performance (one should not use this set in the training procedure). Cycling through multiple splits and averaging the result was the cross-validation procedure discussed in section ??. If we do not use too many features relative to the number of data-cases, the model class is very limited and overfitting is not an issue. (In fact, one may want to worry more about “underfitting” in this case.)

Ok, so now that we agree on writing down a cost on the training data, we need to choose an explicit expression. Consider now the following choice:

\[ C(w, \alpha) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - w^T x_n + \alpha)^2 \quad (7.2) \]
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 51

Context: # Chapter 8

## Support Vector Machines

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form: \(\{(x_i, y_i), i = 1, \ldots, n\}\) and \(x_i \in \mathbb{R}^d\), \(y_i \in \{-1, +1\}\). We call \(\{x_i\}\) the co-variates or input vectors and \(\{y_i\}\) the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e., I can draw a straight line \(f(x) = w^T x - b\) such that all cases with \(y_i = -1\) fall on one side and have \(f(x_i) < 0\) and cases with \(y_i = +1\) fall on the other and have \(f(x_i) > 0\). Given that we have achieved that, we could classify new test cases according to the rule \(\hat{y}_{test} = \text{sign}(f(x_{test}))\).

However, typically there are infinitely many such hyper-planes obtained by small perturbations of a given solution. How do we choose between all these hyper-planes which solve the separation problem for our training data, but may have different performance on the newly arriving test cases? For instance, say \(y = -1\). Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified with \(y = +1\), but we will make very easily mistakes on the cases with \(y = -1\) (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible thing thus seems to choose the separation line as far away from both \(y = -1\) and \(y = +1\) training cases as we can, i.e., right in the middle.

Geometrically, the vector \(w\) is directed orthogonal to the line defined by \(w^T x = 0\). This can be understood as follows. First take \(b = 0\). Now it is clear that all vectors, \(x\), with vanishing inner product with \(w\) satisfy this equation, i.e., all vectors orthogonal to satisfy this equation. Now translate the hyperplane away from the origin over a vector \(a\). The equation for the plane now becomes: \((x - a)^T w = 0\).
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 60

Context: # CHAPTER 9. SUPPORT VECTOR REGRESSION

and from below,

minimize 
$$ -w^T\xi, \xi $$ 
$$ \frac{1}{2} ||w||^2 + \frac{C}{2} \sum_{i} (\xi_i^2 + \xi_i^2) $$ 

subject to 

$$ w^T\phi_b + b - y_i \leq \xi_i \quad \forall i $$  
$$ y_i - w^T\phi_b - b \leq e + \xi_i \quad \forall i $$ (9.1)

The primal Lagrangian becomes,

$$ L_P = \frac{1}{2} ||w||^2 + \frac{C}{2} \sum_{i} (\xi_i^2 + \xi_i^2) + \sum_{i} \alpha_i (w^T\phi_b + b - y_i - e - \xi_i) + \sum_{i} \alpha_i^*(y_i - w^T\phi_b - b - e - \xi_i) $$ (9.2)

**Remark 1:** We could have added the constraints that $\xi_i \geq 0$ and $\xi^*_i \geq 0$. However, it is not hard to see that the final solution will have that requirement automatically and there is no sense in constraining the optimization to the optimal solution as well. To see this, imagine some $\xi_i$ is negative, then, by setting $\xi_i = 0$ the cost is lower and none of the constraints is violated, so it is preferred. We also note due to the above reasoning we will always have at least one of the $\xi_i$ or $\xi^*_i$ inside the tube both are zero, outside the tube one of them is zero. This means that at the solution we have $\xi_i = 0$.

**Remark 2:** Note that we don’t scale $\epsilon = 1$ like in the SVM case. The reason is that $\{y_i\}$ now determines the scale of the problem, i.e. we have not over-parameterized the problem.

We now take the derivatives w.r.t. $w, b$, and $\xi_i$ to find the following KKT conditions (there are more of course),

$$ w = \sum_{i} (\alpha_i - \alpha_i^*)\phi_i $$ (9.3)

$$ \xi_i = \frac{\alpha_i}{C} \quad \xi_i^* = \frac{\alpha_i^*}{C} $$ (9.4)

Plugging this back in and using that now we also have $\alpha_i \alpha_i^* = 0$ we find the dual problem,

maximize 
$$ \alpha_i - \frac{1}{2} \sum_{i} (\alpha_i - \alpha_j)(\alpha_i - \alpha_j) (K_{ij} \frac{1}{C \delta_{ij}}) + \sum_{i} (\alpha_i - \alpha_i^*) (y_i - \sum_{j} (\alpha_j + \alpha_j^*)) $$

subject to 
$$ \sum_{i} (\alpha_i - \alpha_j) = 0 $$  
$$ \alpha_i \geq 0, \ \alpha_i^* \geq 0 \quad \forall i $$ (9.5)
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 79

Context: # 13.2 A CONSTRAINED CONVEX PROGRAMMING FORMULATION OF FDA

The eigenvalue equation. This scales as \( \mathcal{O}(n^3) \) which is certainly expensive for many datasets. More efficient optimization schemes solving a slightly different problem and based on efficient quadratic programs exist in the literature.

Projections of new test points into the solution space can be computed by:

\[
w^T\Phi(x) = \sum_{i} \alpha_i K(x_i, x) \tag{13.19}
\]

as usual. In order to classify the test point, we still need to divide the space into regions which belong to one class. The easiest possibility is to pick the cluster with the smallest Mahalanobis distance: 

\[
d(x, \mu^c) = \frac{(x - \mu^c)^2}{\sigma^2} \tag{13.20}
\]

where \( \mu^c \) and \( \sigma^c \) represent the class mean and standard deviation in the 1-d projected space respectively. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization. Clearly, as it stands the kernel machine will overfit. To regularize, we can add a term to the denominator:

\[
S_W \rightarrow S_W + \beta I \tag{13.21}
\]

By adding a diagonal term to this matrix, we ensure that very small eigenvalues are bounded away from zero, which improves numerical stability in computing the inverse. If we write the Lagrangian formulation where we maximize a constrained quadratic form in \( \alpha \), the extra term appears as a penalty proportional to \( ||\alpha||^2 \) which acts as a weight decay term, favoring smaller values of \( \alpha \) over larger ones. Fortunately, the optimization problem has exactly the same form in the regularized case.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 2

Context: Copyright © 2023 Stephen Davies.

## University of Mary Washington
**Department of Computer Science**  
James Farmer Hall  
1301 College Avenue  
Fredericksburg, VA 22401

---

Permission is granted to copy, distribute, transmit and adapt this work under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).

The accompanying materials at [www.allthemath.org](http://www.allthemath.org) are also under this license.

If you are interested in distributing a commercial version of this work, please contact the author at [stephen@umw.edu](mailto:stephen@umw.edu).

The LaTeX source for this book is available from: [https://github.com/divilian/cool-brisk-walk](https://github.com/divilian/cool-brisk-walk).

---

Cover art copyright © 2014 Elizabeth M. Davies.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 8

Context: I'm unable to view images. Please provide the text in Markdown format for me to assist you.

Image Analysis: 

### Image Analysis:

#### 1. **Localization and Attribution:**
- **Image Number**: Image 1

#### 2. **Object Detection and Classification:**
- **Detected Objects**: The image includes a cup of coffee on a saucer, three coffee beans, and some sparkles implying aroma or steam.
- **Object Classification**:
  - **Cup of Coffee**: A central object, likely made of ceramic and filled with coffee.
  - **Coffee Beans**: Close to the cup, classifiable as roasted beans.
  - **Sparkles/Steam**: Above the cup, representing the aroma of the hot coffee.

#### 3. **Scene and Activity Analysis:**
- **Scene Description**: This image displays a cup of coffee on a saucer with coffee beans around it. Decorative sparkles or steam lines are depicted rising above the coffee, suggesting freshness or warmth.
- **Activities**: No human activities observed; the focus is on the presentation of the coffee.

#### 4. **Text Analysis:**
- No text detected in this image.

#### 5. **Diagram and Chart Analysis:**
- Not applicable; no diagrams or charts are present in the image.

#### 6. **Product Analysis:**
- **Product**: A cup of coffee.
- **Main Features**: 
  - A white/neutral-colored ceramic cup and saucer filled with coffee.
  - Roasted coffee beans placed nearby, suggesting the quality or type of coffee used.
- **Materials**: Ceramic cup and saucer, roasted coffee beans.
- **Colors**: Predominantly brown (coffee and beans) and white (cup and saucer).

#### 7. **Anomaly Detection:**
- No anomalies detected; the elements are consistent with a typical coffee presentation.

#### 8. **Color Analysis**:
- **Dominant Colors**: Brown and white.
  - **Brown**: Evokes warmth and richness, associated with coffee and coffee beans.
  - **White**: Clean and neutral, ensures focus on the coffee.

#### 9. **Perspective and Composition**:
- **Perspective**: Bird’s eye view.
- **Composition**: 
  - Centralized with the cup of coffee as the main focus.
  - Balanced by the placement of coffee beans around the cup.
  - Decorative elements like sparkles/steam give a visual cue towards the coffee's freshness or aroma.

#### 10. **Contextual Significance**:
- The image can be used in contexts such as coffee advertisements, cafe menus, or articles about coffee. It emphasizes the quality and appeal of the coffee through its visual presentation.

#### 11. **Metadata Analysis**:
- No metadata available for analysis.

#### 12. **Graph and Trend Analysis**:
- Not applicable; no graphs or trends are present in the image.

#### 13. **Graph Numbers**:
- Not applicable; no data points or graph numbers are present in the image.

### Additional Aspects:

#### **Ablaufprozesse (Process Flows)**:
- Not applicable; no process flows depicted.

#### **Prozessbeschreibungen (Process Descriptions)**:
- Not applicable; no processes described.

#### **Typen Bezeichnung (Type Designations)**:
- Not applicable; no type designations specified.

#### **Trend and Interpretation**:
- Not applicable; no trends depicted.

#### **Tables**:
- Not applicable; no tables included.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 14

Context: I'm unable to assist with that.

Image Analysis: 

### Analysis of Attached Visual Content

#### Image Identification and Localization
- **Image 1**: Single image provided in the content.

#### Object Detection and Classification
- **Image 1**:
  - **Objects Detected**:
    - Female figurine resembling a prehistoric representation.
    - The figurine appears to be crafted from a terracotta or similar clay-like material.
  - **Classification**: 
    - The object is classified under 'Artifacts' and 'Historical Objects'.
  - **Key Features**: 
    - The sculpture has exaggerated female attributes including a prominent chest and belly, which are indicative of fertility symbols.
    - The face does not have detailed features, implying focus on bodily form rather than facial details.
    - The object seems ancient, slightly weathered, suggesting it to be an archaeological artifact.

#### Scene and Activity Analysis
- **Image 1**:
  - **Scene Description**:
    - The image shows a close-up of a single artifact against a neutral background, possibly for the purpose of highlighting the artifact itself.
  - **Activities**: 
    - No dynamic activity; the object is displayed presumably for appreciation or study.

#### Perspective and Composition
- **Image 1**:
  - **Perspective**:
    - The image is taken from a straight-on, eye-level view, ensuring the object is the primary focus.
    - Close-up perspective to capture detailed features.
  - **Composition**:
    - The object is centrally placed, drawing immediate attention.
    - The background is plain and undistracting, enhancing the focus on the artifact itself.

#### Contextual Significance
- **Image 1**:
  - **Overall Contribution**:
    - The artifact could be used in educational, historical, or museum contexts to study prehistoric cultures, their art, and societal values.
    - As a fertility symbol, it contributes to understanding sociocultural aspects of ancient civilizations.

#### Color Analysis
- **Image 1**:
  - **Dominant Colors**:
    - Shades of brown and beige are dominant, corresponding to the natural materials like terracotta.
  - **Impact on Perception**:
    - The earthy tones evoke a sense of antiquity and authenticity, reinforcing the perception of it being an ancient artifact.

### Conclusion
The provided image is of a terracotta or similar material female figurine, likely of prehistoric origin. It is depicted with a neutral background to focus on its significant features, in particular, the enhanced bodily features indicative of fertility representations. The composition and color tones effectively highlight its historical and cultural importance.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 76

Context: 68  
# CHAPTER 4. PROBABILITY  

does so on the most up-to-date information he/she/it has, always revising beliefs in various hypotheses when confirming or refuting evidence is encountered. Famous Bayesians include Pierre-Simon Laplace, considered "the French Isaac Newton" for his scientific brilliance, and 18th-century theologian Thomas Bayes, for whom the theory is named.

I won't try to conceal that my own thinking on this topic is pretty Bayesian. But I find this whole topic fascinating because it shows how brilliant people, who unanimously agree on the rules and equations, can have such radically different interpretations of what it all means.

## 4.4 Conditional probability  

I mentioned that Bayesians are especially concerned with the idea of revising estimates about probability based on new information that may come to light. This notion can be crystallized in the idea of **conditional probability**. When we talk about the conditional probability of an event A, we mean "what's the probability that A occurs, given that I know some other event K has also occurred?" Think of K as "background knowledge": it’s additional information which, when known, may influence how likely we think A is to have occurred. It can be mathematically computed as follows:

\[  
P(A|K) = \frac{P(A \cap K)}{P(K)}  
\]  

We pronounce \( P(A|K) \) as "the probability of A given K". It is the conditional probability of A, or "the probability of A conditioned on K." We'll sometimes call \( P(A) \) the **prior probability**, or the prior probability if we don't want to sound Latin. The prior is simply the original unadjusted probability, if we aren’t privy to the background information K.

Let's go back to **American Idol**. We know that the probability of an underage winner is only 4%, because \( E (Kelly, Fantasia) \) and we estimate that each of them has a 2% probability of winning. So it seems more likely than not that our winner will be over 21. But
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 201

Context: # 7.5 EXERCISES

6. If I told you that 98,243,917,215 mod 7 was equal to 1, would you call me a liar without even having to think too hard?  
   No, you shouldn’t. It turns out that the answer is 3, not 1, but how would you know that without working hard for it?  

7. If I told you that 273,111,999,214 mod 6 was equal to 6, would you call me a liar without even having to think too hard?  
   Yes, you should. Any number mod 6 will be in the range 0 through 5, never 6 or above. (Think in terms of repeatedly taking out groups of six from the big number. The mod is the number of stones you have left when there are no more whole groups of six to take. If towards the end of this process there are six stones left, that’s not a remainder because you can get another whole group!)

8. Are the numbers 18 and 25 equal?  
   Of course not. Don’t waste my time.

9. Are the numbers 18 and 25 congruent mod 7?  
   Yes. If we take groups of 7 out of 18 stones, we’ll get two such groups (a total of 14 stones) and have 4 left over. And then, if we do that same with 25 stones, we’ll get three such groups (a total of 21 stones) and again have 4 left over. So they’re not congruent mod 7.

10. Are the numbers 18 and 25 congruent mod 6?  
    No. If we take groups of 6 out of 18 stones, we’ll get three such groups with nothing left over. But if we start with 25 stones, we’ll take out 4 such groups (for a total of 24 stones) and have one left over. So they’re not congruent mod 6.

11. Are the numbers 617,418 and 617,424 equal?  
    Of course not. Don’t waste my time.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 204

Context: I'm unable to process images directly. However, if you provide the text you want to format in Markdown, I can certainly help you fix any formatting issues. Please paste the text here.

Image Analysis: 

### Image Analysis Report

**Localization and Attribution:**
- The image appears to be a single diagram located at the center of the page.
- It is referred to as **Image 1**.

**Object Detection and Classification:**
- **Objects Detected:**
  - Boxes/rectangles connected by arrows.
  - Icons inside some boxes.

**Scene and Activity Analysis:**
- **Scene Description:**
  - The image is a flowchart depicting a process. It contains several boxes connected by arrows, representing a step-by-step procedure.
- **Activities:**
  - The flowchart likely represents a sequence of tasks or decisions, as indicated by the directional arrows.

**Text Analysis:**
- **Detected Text:**
  - Box 1: "Start"
  - Box 2: "Input Data"
  - Box 3: "Process Data"
  - Box 4: "Output Data"
  - Box 5: "End"
- **Text Content Analysis:**
  - The text describes a straightforward process flow where data is input, processed, and then output before concluding.

**Diagram and Chart Analysis:**
- **Diagram Analysis:**
  - The flowchart is visually structured to guide the reader from the starting point of a process to its conclusion.
  - Arrows indicate the direction of the flow, ensuring clarity in the process order.
- **Key Insights:**
  - The process is linear and simple, involving data manipulation from input to processing and finally output.

**Perspective and Composition:**
- **Perspective:**
  - The image is presented from a direct, front-facing view to ensure readability and clarity.
- **Composition:**
  - The boxes are symmetrically aligned and uniformly spaced, with arrows guiding the flow from one step to the next.

**Contextual Significance:**
- **Overall Contribution:**
  - The image contributes as a simplified visual representation of a process or procedure, likely serving as a quick reference for understanding the sequence of steps involved.

**Tables:**
- **Tabular Data Analysis:**
  - No tables are included in the image.

### Summary

**Image 1** features a flowchart that describes a step-by-step process involving the input, processing, and output of data, concluding with an end step. The text provides clear labels for each step, and the arrows create a discernible flow from start to end. The perspective and composition facilitate easy understanding of the process being depicted. There are no additional anomalies or metadata information to analyze from the image.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 228

Context: 220  
# CHAPTER 8. LOGIC

21. What’s \( C \rightarrow R \)?  
   True. (The premise is false, so all bets are off and the sentence is true.)  

22. What’s \( C \leftrightarrow B \)?  
   False. (The truth values of the left and right sides are not the same.)  

23. What’s \( C \leftrightarrow B \)?  
   True. (The truth values of the left and right sides are the same.)  

24. I make this assertion:  
   \(\neg X \land \neg (Z \rightarrow Q)\).  
   And since I'm the professor, you can assume I'm correct about this. From this information alone, can you determine a unique set of values for the four variables? Or is there more than one possibility for them?  

25. What if I get rid of \( Q \) and replace it with \( X \), thus making my assertion:  
   \(\neg X \land \neg (Z \rightarrow X)\)?  
   Now what are/jare the solutions?  

26. At the time of this writing, all professors are human, and that’s what I’ll be assuming in these exercises.  

27. True or false: \(\forall X \, \text{Professor}(X)\).  
   False. This says “everyone and everything is a professor,” which is clearly not true. (Consider what you have in front of you as a counterexample.)  

28. True or false: \(\forall X \, \text{Human}(X)\).  
   False. This says “everyone and everything is human,” which is clearly not true. (Consider the book in front of you as a counterexample.)
##########

"""QUERY: what is your model?"""

Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown.Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead!


Final Files Sources: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 1, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 2, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 4, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 5, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 10, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 11, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 13, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 17, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 20, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 22, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 23, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 25, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 28, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 29, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 30, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 35, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 36, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 40, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 41, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 47, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 48, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 49, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 50, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 53, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 56, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 57, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 58, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 60, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 64, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 69, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 70, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 74, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 75, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 76, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 77, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 79, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 81, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 85, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 86, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 88, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 89, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 91, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 95, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 97, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 102, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 105, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 109, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 111, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 113, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 123, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 130, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 133, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 139, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 152, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 156, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 157, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 159, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 165, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 167, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 171, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 172, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 174, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 182, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 186, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 189, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 198, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 206, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 209, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 215, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 218, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 236, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 237, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 1, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 2, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 10, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 13, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 15, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 17, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 18, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 19, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 23, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 25, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 26, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 27, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 30, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 41, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 43, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 47, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 51, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 60, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 79, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 2, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 8, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 14, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 76, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 201, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 204, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 228
==================================================
**Elapsed Time: 0.00 seconds**
==================================================


FINAL ANSWER 
Answer: I am a large language model, trained by Google. 

==================================================
**Elapsed Time: 2.66 seconds**
==================================================