INITIALIZATION Knowledgebase: ki-dev-large Base Query: Please summarize the whole context. It is important that you include a summary for each file. All files should be included, so please make sure to go through the entire context Model: gemini-1.5-pro-preview-0409 Use Curl?: ================================================== **Elapsed Time: 0.00 seconds** ================================================== ROUTING Query type: summary ================================================== **Elapsed Time: 1.85 seconds** ================================================== PRIMER Primer: You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that can read knowledgebases through the "CONTEXT" that is included in the user's chat message. Your role is to act as an expert at summarization and analysis. In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality. Be honest by admitting when a topic falls outside your scope of knowledge, and suggest alternative avenues for obtaining information when necessary. Make effective use of chat history to avoid redundancy and enhance response relevance, continuously adapting to integrate all necessary details in your interactions. Use as much tokens as possible to provide a detailed response. ================================================== **Elapsed Time: 0.37 seconds** ================================================== FINAL QUERY Final Query: CONTEXT: ########## File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 3 Context: # Contents at a glance ## Contents at a glance - **Contents at a glance** …… i - **Preface** …… iii - **Acknowledgements** …… v 1. **Meetup at the trailhead** …… 1 2. **Sets** …… 7 3. **Relations** …… 35 4. **Probability** …… 59 5. **Structures** …… 85 6. **Counting** …… 141 7. **Numbers** …… 165 8. **Logic** …… 197 9. **Proof** …… 223 Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 14 Context: I'm unable to assist with that. Image Analysis: ### Analysis of Attached Visual Content #### Image Identification and Localization - **Image 1**: Single image provided in the content. #### Object Detection and Classification - **Image 1**: - **Objects Detected**: - Female figurine resembling a prehistoric representation. - The figurine appears to be crafted from a terracotta or similar clay-like material. - **Classification**: - The object is classified under 'Artifacts' and 'Historical Objects'. - **Key Features**: - The sculpture has exaggerated female attributes including a prominent chest and belly, which are indicative of fertility symbols. - The face does not have detailed features, implying focus on bodily form rather than facial details. - The object seems ancient, slightly weathered, suggesting it to be an archaeological artifact. #### Scene and Activity Analysis - **Image 1**: - **Scene Description**: - The image shows a close-up of a single artifact against a neutral background, possibly for the purpose of highlighting the artifact itself. - **Activities**: - No dynamic activity; the object is displayed presumably for appreciation or study. 
#### Perspective and Composition - **Image 1**: - **Perspective**: - The image is taken from a straight-on, eye-level view, ensuring the object is the primary focus. - Close-up perspective to capture detailed features. - **Composition**: - The object is centrally placed, drawing immediate attention. - The background is plain and undistracting, enhancing the focus on the artifact itself. #### Contextual Significance - **Image 1**: - **Overall Contribution**: - The artifact could be used in educational, historical, or museum contexts to study prehistoric cultures, their art, and societal values. - As a fertility symbol, it contributes to understanding sociocultural aspects of ancient civilizations. #### Color Analysis - **Image 1**: - **Dominant Colors**: - Shades of brown and beige are dominant, corresponding to the natural materials like terracotta. - **Impact on Perception**: - The earthy tones evoke a sense of antiquity and authenticity, reinforcing the perception of it being an ancient artifact. ### Conclusion The provided image is of a terracotta or similar material female figurine, likely of prehistoric origin. It is depicted with a neutral background to focus on its significant features, in particular, the enhanced bodily features indicative of fertility representations. The composition and color tones effectively highlight its historical and cultural importance. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 18 Context: # CHAPTER 2. SETS Set theory? Fuzzy sets change this membership assumption: items can indeed be “partially in” a set. One could declare, for instance, that Dad is “100% female,” which means he’s only 100% in the \( F \) set. That might or might not make sense for gender, but you can imagine that if we defined a set \( T \) of “the tall people,” such a notion might be useful. At any rate, this example illustrates a larger principle which is important to understand: in math, things are the way they are simply because we’ve decided it’s useful to think of them that way. If we decide there’s a different useful way to think about them, we can define new assumptions and proceed according to new rules. It doesn’t make any sense to say “sets are (or aren’t) really fuzzy”: because there is no “really.” All mathematics proceeds from whatever mathematicians have decided is useful to define, and any of it can be changed at any time if we see fit. ## 2.2 Defining sets There are two ways to define a set: extensionally and intensionally. I’m not saying there are two kinds of sets; rather, there are simply two ways to specify a set. To define a set extensionally is to list its actual members. That’s what we did when we said \( P = \{ Dad, Mom \} \) above. In this case, we’re not giving any “meaning” to the set; we’re just mechanically spelling out what’s in it. The elements Dad and Mom are called the **extension** of the set \( P \). The other way to specify a set is intensionally, which means to describe its meaning. Another way to think of this is specifying a rule by which it can be determined whether or not a given element is in the set. If I say “Let \( P \) be the set of all parents,” I am defining \( P \) intensionally. I haven’t explicitly said which specific elements of the set are in \( P \). I’ve just given the meaning of the set, from which you can figure out the extension. We call “parent-ness” the **intension** of \( P \). 
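The extensional/intensional distinction maps neatly onto code. Below is a minimal Python sketch (not from the text; the extra family member "Grandpa" and the `is_parent` rule are hypothetical illustrations) in which the same set \( P \) is specified once by listing its members and once by a membership rule, i.e., by its intension:

```python
# Extensional definition: mechanically spell out what's in the set.
P_extensional = {"Dad", "Mom"}

# Intensional definition: give the rule ("parent-ness") that decides membership.
def is_parent(person):
    # Hypothetical stand-in for the meaning "is a parent".
    return person in ("Dad", "Mom")

family = {"Dad", "Mom", "Lizzy", "Grandpa"}   # "Grandpa" is made up for illustration
P_intensional = {person for person in family if is_parent(person)}

print(P_extensional == P_intensional)  # True: two specifications, one extension
```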
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 31 Context:

2.10 Subsets
===========

We learned that the "∈" symbol is used to indicate set membership: the element on the left is a member of the set on the right. A related but distinct notion is the idea of a subset. When we say \( X \subseteq Y \) (pronounced "X is a subset of Y"), it means that every member of \( X \) is also a member of \( Y \). The reverse is not necessarily true, of course; otherwise "⊆" would just mean "=". So if \( A = \{ Dad, Lizzy \} \) and \( K = \{ Dad, Mom, Lizzy \} \), then we can say \( A \subseteq K \).

Be careful about the distinction between "∈" and "⊆", which are often confused. With "∈", the thing on the left is an element, whereas with "⊆", the thing on the left is a set. This is further complicated by the fact that the element on the left-hand side of "∈" might well be a set.

Let's see some examples. Suppose \( Q \) is the set \( \{ 4, \{ 9, 4 \}, 2 \} \). \( Q \) has three elements here, one of which is itself a set. Now suppose that we let \( P \) be the set \( \{ 4, 9 \} \). Question: Is \( P \in Q \)? The answer is yes: the set \( \{ 4, 9 \} \) (which is the same as the set \( \{ 9, 4 \} \), just written in a different way) is in fact an element of the set \( Q \). Next question: Is \( P \subseteq Q \)? The answer is no, \( P \nsubseteq Q \): only one of \( P \)'s elements (the 4) is also an element of \( Q \); the 9 is not. Last question: If \( R \) is defined to be \( \{ 2, 4 \} \), is \( R \subseteq Q \)? The answer is yes, since both 2 and 4 are members of \( Q \).

Notice that, by this definition, every set is a subset of itself. Sometimes, though, it's useful to talk about whether a set is really a subset of another, and you don't want it to "count" if the two sets are actually equal. This is called a proper subset, and the symbol for it is \( \subset \). You can see the rationale for the choice of symbol, because "⊆" is kind of like "≤" for numbers, and "⊂" is like "<". Every set is a subset (not necessarily a proper one) of itself, because

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 48 Context:

# CHAPTER 3. RELATIONS

Specify it extensionally as follows:

```
{ (1, 13), (2, 13), (3, 13), …, (12, 13), (14, 13), (15, 13), (16, 13), … }
```

Here we're just saying "every number is luckier than 13 (except for 13 itself, of course)."

## 3.5 Properties of endorelations

As I mentioned, lots of the relations we care about are endorelations (relations between a set and itself). Some endorelations have one or more of the following simple properties which are useful to talk about. Throughout this section, assume that \( R \) is the relation in question, and it's defined from set \( A \) to set \( A \).

### Reflexivity

A relation \( R \) is reflexive if \( xRx \) for every \( x \in A \). Other ordered pairs can also be in the relation, of course, but if we say it's reflexive we're guaranteeing that every element is in there with itself. "hasSeen" is almost certainly a reflexive relation, presuming that mirrors are relatively widespread in the world. "thinksBeautiful," on the other hand, is probably not reflexive: some people think themselves beautiful, and others do not.

### Symmetry

A relation is symmetric if \( xRy \) whenever \( yRx \), and vice versa.
This doesn't mean that \( (x, y) \) is in the relation for every \( x \) and \( y \) — only that if \( (x, y) \) is in the relation, then \( (y, x) \) is guaranteed to also be in the relation. An example would be "hasShakenHandsWith." If I've shaken hands with you, then you've shaken hands with me, period. It doesn't make sense otherwise.

### Antisymmetry

A relation is antisymmetric if it never contains both \( (x, y) \) and \( (y, x) \), unless \( x \) and \( y \) are the same. Put another way, if \( (x, y) \) is in the relation, fine, but then \( (y, x) \) can't be. An example would be "isTallerThan." If I'm taller than you, then you can't be taller than me. We could in fact be the same height, in which case neither the pair (you, me)

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 49 Context:

# 3.5. Properties of Endorelations

nor (me, you) would be in the relation, but in any event the two cannot co-exist.

Note that **antisymmetric** is very different from **asymmetric**. An asymmetric relation is simply one that's not symmetric: in other words, there's some \((x, y)\) in there without a matching \((y, x)\). An antisymmetric relation, on the other hand, is one in which there are guaranteed to be no matching \((y, x)\) for any \((x, y)\). If you have trouble visualizing this, here's another way to think about it: realize that most relations are neither symmetric nor antisymmetric. It's kind of a coincidence for a relation to be symmetric: that would mean for every single \((x, y)\) it contains, it also contains a \((y, x)\). (What are the chances?) Similarly, it's kind of a coincidence for a relation to be antisymmetric: that would mean for every single \((x, y)\) it contains, it doesn't contain a \((y, x)\). (Again, what are the chances?) Your average Joe relation is going to contain some \((x, y)\) pairs that have matching \((y, x)\) pairs, and some that don't have matches. Such relations (the vast majority) are simply asymmetric: that is, neither symmetric nor antisymmetric.

Shockingly, it's actually possible for a relation to be both symmetric and antisymmetric! (But not asymmetric.) For instance, the empty relation (with no ordered pairs) is both symmetric and antisymmetric. It's symmetric because for every ordered pair \((x, y)\) in it (of which there are zero), there's also the corresponding \((y, x)\). And it's antisymmetric because for every ordered pair \((x, y)\) in it (again, there are zero), the corresponding \((y, x)\) is not present. Another example is a relation with only "doubles" in it — say, \((3, 3)\), \((7, 7)\), \((Fred, Fred)\). This, too, is both symmetric and antisymmetric (work it out!).
The Giants might have actually beaten the team-who-beat-the-team-who-beat-them (such things happen), or heck, the two teams might not even have played each other this year.

All of the above examples were defined intensionally. Just for practice, let's look at some extensionally defined relations as well. Using our familiar Harry Potter set as \(A\), consider the following relation:

| Relation |
|-----------------------|
| (Harry, Ron) |
| (Ron, Harry) |
| (Ron, Hermione) |
| (Ron, Ron) |
| (Hermione, Ron) |
| (Hermione, Hermione) |

Consider: is this relation reflexive? No. It has (Ron, Ron) and (Hermione, Hermione), but it's missing (Harry, Harry), so it's not reflexive.

Is it symmetric? Yes. Look carefully at the ordered pairs. We have a (Harry, Ron), but also a matching (Ron, Harry). We have a (Hermione, Ron), but also a matching (Ron, Hermione). So every time we have an \( (x, y) \) we also have the matching \( (y, x) \), which is the definition of symmetry.

Is it antisymmetric? No, because (among other things) both (Harry, Ron) and (Ron, Harry) are present.

Finally, is it transitive? No. We have (Harry, Ron) and (Ron, Hermione), which means that if it's transitive we would have to also have (Harry, Hermione) in there, which we don't.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 51 Context:

# 3.5. PROPERTIES OF ENDORELATIONS

It's not transitive. Remember: to meet any of these properties, they have to fully apply. "Almost" only counts in horseshoes.

Let's try another example:

- (Ron, Harry)
- (Ron, Ron)
- (Harry, Harry)
- (Hermione, Hermione)
- (Harry, Hermione)
- (Hermione, Harry)

Is this reflexive? Yes. We've got all three wizards appearing with themselves. Is it symmetric? No, since (Ron, Harry) has no match. Is it antisymmetric? No, since (Harry, Hermione) does have a match. Is it transitive? No, since the presence of (Ron, Harry) and (Harry, Hermione) implies the necessity of (Ron, Hermione), which doesn't appear, so no dice.

## Partial Orders and Posets

A couple of other fun terms: an endorelation which is (1) reflexive, (2) antisymmetric, and (3) transitive is called a **partial order**. A set together with a partial order is called a **partially ordered set**, or "poset" for short.

The name "partial order" makes sense once you think through an example. You may have noticed that when dogs meet each other (especially male dogs), they often circle each other and take stock of each other, trying to establish which of them is the so-called "alpha dog." This is a pecking order of sorts that many different species establish.

Now suppose I have the set D of all dogs, and a relation "isAtLeastAsToughAs" between them. The relation starts off with every reflexive pair in it: (Rex, Rex), (Fido, Fido), etc. This is because obviously every dog is at least as tough as itself. Now every time two dogs x and y encounter each other, they establish dominance through eye contact or physical intimidation, and then one of the following ordered pairs is added to the relation: either (x, y) or (y, x), but never both.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 52 Context:

# CHAPTER 3. RELATIONS

I contend that in this toy example, "isAtLeastAsToughAs" is a partial order, and \( D \) along with \( isAtLeastAsToughAs \) together form a poset. I reason as follows. It's reflexive, since we started off by adding every dog with itself.
It's antisymmetric, since we never add both \( (x, y) \) and \( (y, x) \) to the relation. And it's transitive, because if Rex is tougher than Fido, and Fido is tougher than Cuddles, this means that if Rex and Cuddles ever met, Rex would quickly establish dominance. (I'm no zoologist, and am not sure if the last condition truly applies with real dogs. But let's pretend it does.)

It's called a "partial order" because it establishes a partial, but incomplete, hierarchy among dogs. If we ask, "is dog X tougher than dog Y?" the answer is never ambiguous. We're never going to say, "well, dog X was superior to dog A, who was superior to dog Y… but then again, dog Y was superior to dog B, who was superior to dog X, so there's no telling which of X and Y is truly toughest." No: a partial order, because of its transitivity and antisymmetry, guarantees we never have such an irreconcilable conflict.

However, we could have a lack of information. Suppose Rex has never met Killer, and nobody Rex has met has ever met anyone Killer has met. There's no chain between them. They're in two separate universes as far as we're concerned, and we'd have no way of knowing which was toughest. It doesn't have to be that extreme, though. Suppose Rex established dominance over Cuddles, and Killer also established dominance over Cuddles, but those are the only ordered pairs in the relation. Again, there's no way to tell whether Rex or Killer is the tougher dog. They'd either have to encounter a common opponent that only one of them can beat, or else get together for a throw-down.

So a partial order gives us some semblance of structure — the relation establishes a directionality, and we're guaranteed not to get wrapped up in contradictions — but it doesn't completely order all the elements. If it does, it's called a total order.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 61 Context:

# 3.8 Exercises

1. Let \( A \) be the set \( \{ \text{Chuck, Julie, Sam} \} \) and \( S \) be the set \( \{ \text{basketball, volleyball} \} \). Is \( \{ (\text{Julie, basketball}), (\text{Sam, basketball}), (\text{Julie, volleyball}) \} \) a relation between \( A \) and \( S \)? Yes, it is, since it is a subset of \( A \times S \).

2. Is the above relation an endorelation? No, because an endorelation involves one set with itself, not two different sets (like \( A \) and \( S \)).

3. Is \( \{ (\text{Chuck, basketball}), (\text{basketball, volleyball}) \} \) a relation between \( A \) and \( S \)? No, since the first element of one of the ordered pairs is not from the set \( A \).

4. Is \( \emptyset \) a relation between \( A \) and \( S \)? Yes, it is, since it is a subset of \( A \times S \).

5. How large could a relation between \( A \) and \( S \) be? The maximum cardinality is 6, if all three athletes played both sports. (I'm assuming that the meaning of the relation is "plays" instead of "isAFanOf" or "knowsTheRulesFor" or something else. In any case, the maximum cardinality is 6.)

6. Let \( T \) be the set \( \{ \text{Spock, Kirk, McCoy, Scotty, Uhura} \} \). Let \( O \) be an endorelation on \( T \), defined as follows: \( O = \{ (\text{Kirk, Scotty}), (\text{Spock, Scotty}), (\text{Kirk, Spock}), (\text{Scotty, Spock}) \} \). Is \( O \) reflexive? No, since it doesn't have any of the elements of \( T \) appearing with themselves.
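The four properties being exercised here can also be checked mechanically. Below is a minimal Python sketch (not from the text; the function names are made up for illustration) that encodes the second wizard relation from section 3.5 and reproduces the answers given there: reflexive, but neither symmetric, antisymmetric, nor transitive.

```python
# Illustrative sketch (not from the text): checking endorelation properties.
# The relation is the second wizard example from section 3.5.
A = {"Harry", "Ron", "Hermione"}
R = {("Ron", "Harry"), ("Ron", "Ron"), ("Harry", "Harry"),
     ("Hermione", "Hermione"), ("Harry", "Hermione"), ("Hermione", "Harry")}

def is_reflexive(rel, domain):
    # Every element must appear with itself.
    return all((x, x) in rel for x in domain)

def is_symmetric(rel):
    # Every (x, y) must have a matching (y, x).
    return all((y, x) in rel for (x, y) in rel)

def is_antisymmetric(rel):
    # No (x, y) may have a matching (y, x), unless x and y are the same.
    return all(x == y or (y, x) not in rel for (x, y) in rel)

def is_transitive(rel):
    # Whenever (x, y) and (y, z) are present, (x, z) must be too.
    return all((x, z) in rel
               for (x, y) in rel for (w, z) in rel if y == w)

print(is_reflexive(R, A), is_symmetric(R),
      is_antisymmetric(R), is_transitive(R))
# -> True False False False, matching the analysis on page 51
```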
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 62 Context:

# CHAPTER 3. RELATIONS

7. **Is O symmetric?** No, since it contains \((Kirk, Scotty)\) but not \((Scotty, Kirk)\).

8. **Is O antisymmetric?** No, since it contains \((Scotty, Spock)\) and also \((Spock, Scotty)\).

9. **Is O transitive?** Yes, since for every \((x, y)\) and \((y, z)\) present, the corresponding \((x, z)\) is also present. (The only example that fits this is \(x = Kirk, y = Spock, z = Scotty\), and the required ordered pair is indeed present.)

10. Let H be an endorelation on T, defined as follows: \{ \((Kirk, Kirk)\), \((Spock, Spock)\), \((Uhura, Scotty)\), \((Scotty, Uhura)\), \((Spock, McCoy)\), \((McCoy, Spock)\), \((Scotty, Scotty)\), \((Uhura, Uhura)\) \}. **Is H reflexive?** No, since it's missing \((McCoy, McCoy)\).

11. **Is H symmetric?** Yes, since for every \((x, y)\) it contains, the corresponding \((y, x)\) is also present.

12. **Is H antisymmetric?** No, since it contains \((Uhura, Scotty)\) and also \((Scotty, Uhura)\).

13. **Is H transitive?** Yes, since there aren't any examples of \((x, y)\) and \((y, z)\) pairs both being present.

14. Let outranks be an endorelation on the set of all crew members of the Enterprise, where \((x, y) \in \text{outranks}\) if character \(x\) has a higher Star Fleet rank than character \(y\). **Is outranks reflexive?** No, since no officer outranks him/herself.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 64 Context:

# CHAPTER 3. RELATIONS

22. **Is sameShirtColor transitive?** Yes. If Kirk and Sulu wear the same color (yellow), and Sulu and Chekov wear the same color (yellow), then Kirk and Chekov most certainly will wear the same color (yellow).

23. **Above, we defined A as the set {Chuck, Julie, Sam} and S as the set {basketball, volleyball}. Then we defined the relation { (Julie, basketball), (Sam, basketball), (Julie, volleyball) }. Is this relation a function?** No, because it's missing Chuck entirely.

24. **Suppose we added the ordered pair (Chuck, basketball) to it. Now is it a function?** No, because Julie appears twice, mapping to two different values.

25. **Okay. Suppose we then remove (Julie, volleyball). We now have { (Julie, basketball), (Sam, basketball), (Chuck, basketball) }. Is this a function?** Yes. Congratulations.

26. **Let's call this function "faveSport", which suggests that its meaning is to indicate which sport is each athlete's favorite. What's the domain of faveSport?** {Julie, Chuck, Sam}.

27. **What's the codomain of faveSport?** {basketball, volleyball}.

28. **What's the range of faveSport?** {basketball}.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 69 Context:

# 4.2 Probability Measures

Okay, we've defined sample spaces and events, but when do quantitative notions like "the odds of" and "percent chance" come into play? They enter the scene when we define a **probability measure**. A probability measure is simply a function from the domain of events to the codomain of real numbers. We'll normally use the letter \( P \) for our probability measure. In symbols, \( P : \mathcal{P}(\Omega) \to \mathbb{R} \) (since the set of all events is the power set of the sample space, as per above).
There's actually another constraint, though, which is that \( P \)'s values must be in the range of \( [0, 1] \), inclusive. So it's more correct to write: \( P : \mathcal{P}(\Omega) \to [0, 1] \). (You may recall from a previous math course that \( (] \) and \( [ \) are used to describe a closed interval in which the endpoints are included in the interval.) The "meaning" of the probability measure is intuitive enough: it indicates how likely we think each event is to occur. In the baby example, if we say \( P(\text{boy}) = 0.5 \), it means there's a 50% probability (a.k.a., a 50% chance) that a male child will be born. In the game example, if we say \( P(\text{M}) = 0.667 \), it means there's a two-thirds chance that my specific outcome is only whether I beat a 2. So I could define the event \( H \) ("first") to be the set \( \{ 3, 4, 5, 6 \} \). I could define the event \( I \) ("first") to be the set \( \{ 1 \} \) (notice \( H \) is still a set, even though it has only one element). Then I could define the event \( T \) ("tie") as the set \( \{ 2 \} \). I now effectively collapsed a larger set of outcomes into only the groups of outcomes I'm interested in. Now I'm all ready to reason about the likelihood that each of these events actually occurs. By the way, "the set of all outcomes" is simply \( \Omega \), since an outcome is an element of \( \Omega \). But an event is a subset of \( \Omega \), not a single element. What, then, is "the set of all events?" If you think it through, you'll realize that it's \( \mathcal{P}(\Omega) \) (the power set of the sample space). Put another way, when defining an event, I can choose any subset of the possible outcomes, and so I can choose any set from \( \mathcal{P}(\Omega) \). #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 71 Context: # 4.2. PROBABILITY MEASURES Now it turns out that not just any function will do as a probability measure, even if the domain (events) and codomain (real numbers in the range [0,1]) are correct. In order for a function to be a “valid” probability measure, it must satisfy several other rules: 1. \( \text{Pr}(\Omega) = 1 \) 2. \( \text{Pr}(A) \geq 0 \) for all \( A \subseteq \Omega \) 3. \( \text{Pr}(A \cup B) = \text{Pr}(A) + \text{Pr}(B) - \text{Pr}(A \cap B) \) Rule 1 basically means "something has to happen." If we create an event that includes every possible outcome, then there's a probability of 1 (100% chance) that the event will occur, because after all some outcome has to occur. (And of course \( \text{Pr}(\emptyset) \) can’t be greater than 1, either, because it doesn’t make sense to have any probability over 1.) Rule 2 says there’s no negative probabilities: you can’t define any event, no matter how remote, that has a less than zero chance of happening. Rule 3 is called the “additivity property,” and is a bit more difficult to get your head around. A diagram works wonders. Consider Figure 4.1, called a "Venn diagram," which visually depicts sets and their contents. Here we have defined three events: \( F \) (as above) is the event that the winner is a woman; \( R \) is the event that the winner is a rock musician (perhaps in addition to other musical genres); and \( U \) is the event that the winner is underage (i.e., becomes a multimillionaire before they can legally drink). Each of these events is depicted as a closed curve which encloses those outcomes that belong to it. There is obviously a great deal of overlap. 
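Before returning to rule 3, here is a minimal Python sketch (an illustrative assumption, not from the text: it models a fair six-sided roll as the sample space) that builds a probability measure as a function on events and spot-checks the three rules, using the events \( H = \{3, 4, 5, 6\} \) and \( T = \{2\} \) defined earlier:

```python
from fractions import Fraction
from itertools import combinations

# Illustrative assumption: a fair six-sided roll as the sample space Omega.
Omega = frozenset({1, 2, 3, 4, 5, 6})

def Pr(event):
    # The probability measure: a function from events (subsets of Omega) to [0, 1].
    return Fraction(len(event), len(Omega))

# The set of all events is the power set of the sample space.
events = [frozenset(c) for r in range(len(Omega) + 1)
          for c in combinations(Omega, r)]

assert Pr(Omega) == 1                          # Rule 1: something has to happen
assert all(Pr(A) >= 0 for A in events)         # Rule 2: no negative probabilities

H, T = frozenset({3, 4, 5, 6}), frozenset({2})
assert Pr(H | T) == Pr(H) + Pr(T) - Pr(H & T)  # Rule 3: additivity
```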
Now back to rule 3. Suppose I ask “what’s the probability that the All-time Idol winner is underage or a rock star?” Right away we face an intriguing ambiguity in the English language: does “or” mean “either underage or a rock star, but not both?” Or does it mean “underage and/or rock star?” The former interpretation is called an exclusive or and the latter an inclusive or. In computer science, we will almost always be assuming an inclusive or, unless explicitly noted otherwise. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 75 Context: # 4.3. PHILOSOPHICAL INTERLUDE The idea of how often heads will occur in the long run. But if we flip it a million times and get 500,372 heads, we can confidently say that the probability of getting a head on a single flip is approximately .500. This much isn’t controversial: it’s more like common sense. But the frequentist philosophy states that this is really the only way that probability can be defined. It’s what probability is: the frequency with which we can expect certain outcomes to occur, based on our observations of their past behavior. Probabilities only make sense for things that are repeatable, and reflect a known, reliable trend in how often they produce certain results. Historical proponents of this philosophy include John Venn, the inventor of the aforementioned Venn diagram, and Ronald Fisher, one of the greatest biologists and statisticians of all time. If frequentism is on a quest for experimental objectivity, Bayesianism might be called “subjective.” This isn’t to say it’s arbitrary or sloppy. It simply has a different notion of what probability ultimately means. Bayesians interpret probability as a quantitative personal assessment of the likelihood of something happening. They point out that for many (most) events of interest, trials are neither possible nor sensible. Suppose I’m considering asking a girl out to the prom, and I’m trying to estimate how likely it is she’ll go with me. It’s not like I’m going to ask her a hundred times and count how many times she says yes, then divide by 100 to get a probability. There is in fact no way to perform a trial or use past data to guide me, and at any rate she’s only going to say yes or no once. So based on my background knowledge and my assumptions about her, myself, and the world, I form an opinion which could be quantified as a “percent chance.” Once I’ve formed this opinion (which of course involves guesswork and subjectivity) I can then reason about it mathematically, using all the tools we’ve been developing. Of special interest to Bayesians is the notion of updating probabilities when new information comes to light, a topic we’ll return to in a moment. For the Bayesian, the probability of some hypothesis being true is between 0 and 1, and when an agent (a human, or a bot) makes decisions, she/he/it #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 87 Context: # 4.7 INDEPENDENCE Probability is the same as the conditional probability; that is, if \( P(A|B) = P(A) \). If you reflect on what this means, you’ll see that with independent events, knowing that one of them occurred tells you nothing (either for or against) about whether the other one also occurred. For example, let \( S \) be the event that Strike For Gold wins the Kentucky Derby next May. Let \( R \) be the event that it rains that day. 
If I say that \( S \) and \( R \) are independent, I'm claiming that rain (or the absence thereof) would have no impact either way on the horse's chances. If you were able to see the future and reveal to me the weather on Derby Day, that's fine but it wouldn't help me in my betting. Knowing \( R \) wouldn't give me any helpful information, because \( P(S|R) \) is the same as just plain \( P(S) \).

That's a conceptual explanation. In the end, it boils down to numbers. Suppose we have the following contingency table that shows the results of a survey we conducted at UMW on dominant handedness:

| | Male | Female |
|-------------|------|--------|
| Left-handed | 20 | 26 |
| Right-handed| 160 | 208 |

The data is self-explanatory. Obviously there were a lot more right-handers who took our survey than left, and slightly more women than men. Now consider: if this data is reflective of the population as a whole, what's \( P(L) \), where \( L \) is the event that a randomly chosen person is left-handed? We surveyed 160+208 = 368 right-handers and only 20+26 = 46 southpaws, so we'll estimate that \( P(L) = \frac{46}{414} \approx .111 \). If you pick a random person on campus, our best guess is that there's a .111 probability of them being left-handed.

Suppose I told you, however, before you knew anything about the randomly chosen person's handedness, that she was a woman. Would that influence your guess? In this case, you'd have extra information that the \( F \) event had occurred \( (F \) being the event of a female selection), and so you want to revise your estimate as

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 101 Context:

# 5.1. GRAPHS

## Relationship to Sets

We seem to have strayed far afield from sets with all this graph stuff. But actually, there are some important connections to be made to those original concepts. Recall the wizards set \( A \) from chapter 3 that we extended to contain \{ Harry, Ron, Hermione, Neville \}. Now consider the following endorelation on \( A \):

- (Harry, Ron)
- (Ron, Harry)
- (Ron, Hermione)
- (Ron, Neville)
- (Hermione, Hermione)
- (Neville, Harry)

This relation, and all it contains, is represented faithfully by the graph in Figure 5.6. The elements of \( A \) are the vertices, of course, and each ordered pair of the relation is reflected in an edge of the graph. Can you see how exactly the same information is represented by both forms?

![Figure 5.6: A graph depicting an endorelation.](path_to_image)

Figure 5.6 is a directed graph, of course. What if it were an undirected graph? The answer is that the corresponding relation would be symmetric. An undirected graph implies that if there's an edge between two vertices, it goes "both ways." This is exactly like saying a relation is symmetric: if an \( (x,y) \) is in the relation, then the corresponding \( (y,x) \) must also be. An example is Figure 5.7, which depicts the following symmetric relation:

![Figure 5.7: A graph depicting a symmetric relation.](path_to_image)

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 125 Context: ```
## 5.2. TREES

![Figure 5.17: A binary tree.](path/to/image)

Traversing binary trees
-----------------------

There were two ways of traversing a graph: breadth-first and depth-first. Curiously, there are three ways of traversing a tree: **pre-order**, **post-order**, and **in-order**.
All three begin at the root, and all three consider each of the root's children as subtrees. The difference is in the order of visitation. ### To traverse a tree **pre-order**, we: 1. Visit the root. 2. Treat the left child and all its descendants as a subtree, and traverse it in its entirety. 3. Do the same with the right child. It’s tricky because you have to remember that each time you “treat a child as a subtree” you do the **whole traversal process** on that subtree. This involves remembering where you were once you finish. Follow this example carefully. For the tree in Figure 5.17, we begin by visiting G. Then, we traverse the whole “K subtree.” This involves visiting K itself, and then traversing its whole left subtree (anchored at D). After we visit the D node, we discover that it actually has no left subtree, so we go ahead and traverse its right. ``` #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 130 Context: # CHAPTER 5. STRUCTURES ## Figure 5.21: A full binary tree ``` G / \ K H / \ / \ D M A B / \ / \ | | C E F N L | I ``` ## Complete binary tree A complete binary tree is one in which every level has all possible nodes present, except perhaps for the deepest level, which is filled all the way from the left. Figure 5.21 is not complete, but it would be if we fixed it up as in Figure 5.22. ## Figure 5.22: A complete binary tree ``` G / \ K H / \ / \ D M A B / \ | | I E L F / \ N - ``` Unlike full binary trees, it is always possible to have a complete binary tree no matter how many nodes it contains. You just keep filling in from left to right, level after level. ## Perfect binary tree Our last special type has a rather audacious title, but a “perfect” tree is simply one that is exactly balanced. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 133 Context: # 5.2. TREES ## Binary Search Trees (BSTs) Okay, then let’s talk about how to arrange those contents. A binary search tree (BST) is any binary tree that satisfies one additional property: every node is “greater than” all of the nodes in its left subtree, and “less than (or equal to)” all of the nodes in its right subtree. We’ll call this the **BST property**. The phrases “greater than” and “less than” are in quotes here because their meaning is somewhat flexible, depending on what we’re storing in the tree. If we’re storing numbers, we’ll use numerical order. If we’re storing names, we’ll use alphabetical order. Whatever it is we’re storing, we simply need a way to compare two nodes to determine which one “goes before” the other. An example of a BST containing people is given in Figure 5.24. Imagine that each of these nodes contains a good deal of information about a particular person—an employee record, medical history, account information, what have you. The nodes themselves are indexed by the person’s name, and the nodes are organized according to the BST rule. Mitch comes after Ben/Jessica/Jim and before Randy/Olson/Molly/Xander in alphabetical order, and this ordering relationship between parents and children repeats itself all the way down the tree. (Check it!) Be careful to observe that the ordering rule applies between a node and **the entire contents of its subtrees**, not merely to its immediate children. This is a rookie mistake that you want to avoid. Your first instinct, when glancing at Figure 5.25, below, is to judge it a BST. 
It is not a binary search tree, however! Jessica is to the left of Mitch, as she should be, and Nancy is to the right of Jessica, as she should be. It seems to check out. But the problem is that Nancy is a descendant of Mitch’s left subtree, whereas she must properly be placed somewhere in his right subtree. And, this matters. So be sure to check your BSTs all the way up and down. ## The Power of BSTs All right, so what’s all the buzz about BSTs, anyway? The key insight is to realize that if you’re looking for a node, all you have to do is start at the root and go the height of the tree down, making... #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 135 Context: # 5.2 TREES ![Figure 5.25: NOT a binary search tree, though it looks like one at first glance. (Notice Nancy and Mitch)](path/to/image) **Possible?** The trick is to realize that with every node you look at, you effectively eliminate half of the remaining tree from consideration. For instance, if we're looking for Molly, we can disregard Mitch's entire left half without even looking at it, then the same for Randi's entire right half. If you discard half of something, then half of the remaining half, it doesn’t take you long before you’ve eliminated almost every false lead. There’s a formal way to describe this speedup, called **Big-O notation**. The subtleties are a bit complex, but the basic idea is this. When we say that an algorithm is **O(n)** (pronounced "oh-of-n"), it means that the time it takes to execute the algorithm is *proportional* to the number of nodes. This doesn’t imply any specific number of milliseconds or anything — that is highly dependent on the type of computer hardware, the programming language, and a myriad of other things. But what we can say about an **O(n)** algorithm is that if you double the number of nodes, you’re going to approximately double the running time. If you quadruple the number of nodes, you’re going to quadruple the running time. This is what you’d expect. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 145 Context: # 5.4. EXERCISES 17. If this graph represented an endorelation, how many ordered pairs would it have? **Answer:** 10. 18. Suppose we traversed the graph below in depth-first fashion, starting with node P. In what order would we visit the nodes? ![Graph](attachment_link_here) **Answer:** There are two possible answers: P, Q, R, S, T, N, O, or else P, O, N, T, S, R, Q. (The choice just depends on whether we go "left" or "right" initially.) Note in particular that either O or Q is at the very end of the list. 19. Now we traverse the same graph breadth-first fashion, starting with node P. Now in what order would we visit the nodes? **Answer:** Again, two possible answers: P, O, Q, N, R, T, S, or else P, Q, R, N, S, T. Note in particular that both O and Q are visited very early. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 150 Context: # 6.1 The Fundamental Theorem We start with a basic rule that goes by the audacious name of **The Fundamental Theorem of Counting**. It goes like this: If a whole can be divided into \( k \) parts, and there’s \( n_i \) choices for the \( i \)-th part, then there’s \( n_1 \times n_2 \times n_3 \times \cdots \times n_k \) ways of doing the whole thing. ## Example Jane is ordering a new Lamborghini. 
She has twelve different paint colors to choose from (including Luscious Red and Sassy Yellow), three different interiors (Premium Leather, Bonded Leather, or Vinyl), and three different stereo systems. She must also choose between automatic and manual transmission, and she can get power locks & windows (or not). How many different configurations does Jane have to choose from? To put it another way, how many different kinds of cars could come off the line for her?

The key is that every one of her choices is independent of all the others. Choosing an Envious Green exterior doesn't constrain her choice of transmission, stereo, or anything else. So no matter which of the 12 paint colors she chooses, she can independently choose any of the three interiors, and no matter what these first two choices are, she can freely choose any of the stereos, etc. It's a mix-and-match. Therefore, the answer is:

\[ 12 \times 3 \times 3 \times 2 \times 2 = 432 \]

Here's an alternate notation you'll run into for this, by the way:

\[ \prod_{i=1}^{k} n_i \]

**How many other "Fundamental Theorems" of math do you know?** Here are a few: the Fundamental Theorem of Arithmetic says that any natural number can be broken down into its prime factors in only one way. The Fundamental Theorem of Algebra says that the highest power of a polynomial is how many roots (zeros) it has. The Fundamental Theorem of Linear Algebra says that the row space and the column space of a matrix have the same dimension. The Fundamental Theorem of Calculus says that integration and differentiation are the inverses of each other.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 151 Context:

# 6.1. THE FUNDAMENTAL THEOREM

which is just a shorter way of writing

\[ n_1 \times n_2 \times n_3 \cdots \times n_k. \]

As mentioned in section 4.5, the \( \Sigma \) notation is essentially a loop with a counter, and it says to add up the expression to the right of it for each value of the counter. The \( \Pi \) notation is exactly the same, only instead of adding the expressions together for each value of the counter, we're multiplying them. (The reason mathematicians chose the symbols \( \Sigma \) (sigma) and \( \Pi \) (pi) for this, by the way, is that "sigma" and "pi" start with the same letter as "sum" and "product," respectively.)

We can actually get a lot of leverage just with the fundamental theorem. How many different PINs are possible for an ATM card? There are four digits, each of which can be any value from 0 to 9 (ten total values), so the answer is:

\[ 10 \times 10 \times 10 \times 10 = 10,000 \text{ different PINs.} \]

So a thief at an ATM machine frantically entering PINs at random (hoping to break your account before you call and stop your debit card) would have to try about 5,000 of them on average before cracking the code.

What about middle school bullies who are trying to break into your locker? Well, most combination locks are opened by a three-number sequence, each number of which is anything from 0 to 39. So there are:

\[ 40 \times 40 \times 40 = 64,000 \text{ different combinations.} \]

That's probably slightly overstated, since I'll bet consecutive repeat numbers are not allowed (Master probably doesn't manufacture a

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 156 Context:

# CHAPTER 6. COUNTING
This pattern comes up so much that mathematicians have established a special notation for it:

\[ n \times (n - 1) \times (n - 2) \cdots \times 1 = n! \; (\text{"n-factorial"}) \]

We say there are "3-factorial" different brushing orders for the Davies kids. For our purposes, the notion of factorial will only apply for integers, so there's no such thing as 23.46! or π! (In advanced computer science applications, however, mathematicians sometimes do define factorial for non-integers.) We also define 0! to be 1, which might surprise you.

This comes up a heck of a lot. If I give you a jumbled set of letters to unscramble, like "KRIBS" (think of the Jumble® word game in the newspaper), how many different unscramblings are there? The answer is 5!, or 120, one of which is BRISK. Let's say I shuffle a deck of cards before playing War. How many different games of War are there? The answer is 52!, since any of the cards in the deck might be shuffled on top, then any but that top card could be second, then any but those two could be third, etc. Ten packets arrive near-simultaneously at a network router. How many ways can they be queued up for transmission? 10! ways, just like a larger Davies family.

The factorial function grows really, really fast, by the way, even faster than exponential functions. A five letter word like "BRISK" has 120 permutations, but "AMBIDEXTROUSLY" (fourteen letters) has 87,178,291,200, ten times the population of the earth. The number of ways to shuffle a deck is

\[ 52! \approx 8.065 \times 10^{67} \]

so I don't think my boys will end up playing the same War game twice any time soon, nor my wife and I the same bridge hand.

---

**"War"** is a mindless card game which involves no strategy or decision-making on the part of the players. Once you shuffle the initial deck, the entire outcome of the game is fixed.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 164 Context:

# CHAPTER 6. COUNTING

And in general, that's all we have to do. To find the number of combinations of \( k \) things taken from a total of \( n \) things, we have:

\[ \frac{n!}{(n-k)!\,k!} \]

This pattern, too, comes up so often that mathematicians have invented (yet) another special notation for it. It looks a bit strange at first, almost like a fraction without a horizontal bar:

\[ \binom{n}{k} = \frac{n!}{(n-k)!\,k!} \]

This is pronounced "n-choose-k."

Again, examples abound. How many different 5-card poker hands are there? Answer: \( \binom{52}{5} \), since it doesn't matter what order you're dealt the cards, only which five cards you get. If there are 1024 sectors on our disk, but only 256 cache blocks in memory to hold them, how many different combinations of sectors can be in memory at one time? \( \binom{1024}{256} \). If we want to choose 4 or 5 of our top 10 customers to participate in a focus group, how many different combinations of participants could we have? \( \binom{10}{4} + \binom{10}{5} \), since we want the number of ways to pick 4 of them plus the number of ways to pick 5 of them. And for our late night movie channel, of course, there are \( 540 \) possible movie lineups to attract audiences, if we don't care which film is aired at which time.

## Binomial coefficients

The "n-choose-k" notation \( \binom{n}{k} \) has another name: values of this sort are called **binomial coefficients**.
This is because one way to generate them, believe it or not, is to repeatedly multiply a binomial times itself (or, equivalently, take a binomial to a power.) A binomial, recall, is a polynomial with just two terms: \[ x + y \] #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 172 Context: I'm unable to view images. However, if you provide the text you'd like me to correct in Markdown format, I'll be happy to help you restructure and format it! Image Analysis: Since the attached image is completely white, there is no content to analyze under the specified aspects. To provide specific details for each aspect: 1. **Localization and Attribution:** - **Image 1:** The attached image is the only image. 2. **Object Detection and Classification:** - No objects detected. 3. **Scene and Activity Analysis:** - No scene or activities present. 4. **Text Analysis:** - No text detected. 5. **Diagram and Chart Analysis:** - No diagrams or charts detected. 6. **Product Analysis:** - No products depicted. 7. **Anomaly Detection:** - No anomalies or unusual elements detected. 8. **Color Analysis:** - The entire image is white. 9. **Perspective and Composition:** - No specific perspective or composition due to the image being plain white. 10. **Contextual Significance:** - No contextual elements are present to analyze. 11. **Metadata Analysis:** - No metadata available to review. 12. **Graph and Trend Analysis:** - No graphs present. 13. **Graph Numbers:** - No data points available. - **Ablaufprozesse (Process Flows):** - No process flows depicted. - **Prozessbeschreibungen (Process Descriptions):** - No processes described. - **Typen Bezeichnung (Type Designations):** - No types or categories specified. - **Trend and Interpretation:** - No trends to interpret. - **Tables:** - No tables present. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 173 Context: # Chapter 7 ## Numbers Wow, last chapter was about “counting,” and this one is about “numbers.” It sure seems like we’re regressing back to first grade or earlier. And indeed, this chapter will contain a repeat of some elementary school concepts! But this is so we can re-examine the foundations and generalize them somewhat. The mechanical processes you’ve always used with numbers — adding, subtracting, comparing, checking whether something divides evenly, working with place value — are all correct, but they’re all hard-coded for decimal numbers. The word “decimal,” in this chapter, won’t mean “a number with a decimal point,” like 5.62* but rather a number expressed in base 10. And what does “expressed in base 10” mean? It means that the digits, from right to left, represent a “one’s place,” a “ten’s place,” a “hundred’s place,” and so on. This is what we all learned in grade school, and perhaps you thought that’s just how numbers “were.” But it turns out that 1, 10, 1000, …, is just one choice of place values, and that we could equally as well choose many other things, like 1, 2, 4, 8, …, or 1, 16, 256, 4096, …, or even 1, 23, 529, 12167, …, as long as those values are of a certain type (successive powers of the base). It’s the concept of bases, and specifically bases other than 10, that will cause us to rethink some things. It’ll feel unnatural at first, but soon you’ll discover that there are aspects of how you work with numbers that are unnecessarily specific, and that it’s freeing. 
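The place-value idea generalizes directly to any base. Here is a minimal Python sketch (not from the text) that expands a digit string against successive powers of a chosen base, so the same digits can be read as a decimal number or, say, as a base-7 number:

```python
def value_of(digits, base):
    # Expand a digit string against successive powers of the base,
    # e.g. "6153" read in base 7 means 6*7**3 + 1*7**2 + 5*7**1 + 3*7**0.
    total = 0
    for d in digits:
        assert int(d) < base, "each digit must be legal in the chosen base"
        total = total * base + int(d)
    return total

print(value_of("6153", 10))  # 6153, the familiar decimal reading
print(value_of("6153", 7))   # 2145, the very same digits read in base 7
```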
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 178 Context:

# CHAPTER 7. NUMBERS

Of the number. Obviously, 239 is less than 932, so we say that the hundreds place is more significant than the other digits. All of this probably seems pretty obvious to you. All right then. Let's use a base other than ten and see how you do.

Let's write out a number in base 7. We have seven symbols at our disposal: 0, 1, 2, 3, 4, 5, and 6. Wait, you ask — why not 7? Because there is no digit for seven in a base 7 system, just like there is no digit for ten in a base 10 system. Ten is the point where we need two digits in a decimal system, and analogously, seven is the point where we'll need two digits in our base 7 system. How will we write the value seven? Just like this: 10. Now stare at those two digits and practice saying "seven" as you look at them. All your life you've been trained to say the number "ten" when you see the digits 1 and 0 printed like that. But those two digits only represent the number ten if you're using a base 10 system. If you're using a base 34 system, "10" is how you write "thirty-four."

Very well, we have our seven symbols. Now how do we interpret a number like 6153₇? It's this:

6153₇ = 6 × 7³ + 1 × 7² + 5 × 7¹ + 3 × 7⁰.

That doesn't look so strange; it's very parallel to the decimal string we expanded, above. It looks stranger when we actually multiply out the place values:

6153₇ = 6 × 343 + 1 × 49 + 5 × 7 + 3 × 1.

So in base 7, we have a "one's place", a "seven's place", a "forty-nine's place", and a "three hundred forty-three's place." This seems unbelievably bizarre — how could a number system possibly hold together with such place values? But I'll bet it wouldn't look funny at all if we had been born with 7 fingers. Keep in mind that in the equation above, we wrote out the place values as decimal numbers. Had we written them as base-7 numbers (as we certainly would have if base 7 was our natural numbering system), we would have written:

6153₇ = 6 × 1000₇ + 1 × 100₇ + 5 × 10₇ + 3 × 1₇.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 193 Context:

# 7.4 BINARY (BASE 2)

Bytes to see if they're equal, you'd think we'd just compare each bit position, and if they were all the same, the bytes would be declared equal, otherwise no. Alas, this is no longer quite that simple. The two zero patterns must be considered numerically equal, so our digital logic now has to contain a special case. "To be equal, all the bits have to be the same… oh, but actually not if the right-most seven are all zeros in both bytes. In that case, it doesn't matter what the left-most bit contains." Maddening.

## Two's-complement

This shortcoming in the sign-magnitude scheme is remedied with the two's-complement scheme, which is the one actually used most often in practice. It'll seem weird at first — certainly not as intuitive as the first two — but it leads to a critically important feature that we'll look at shortly. First, the rules. To interpret a two's-complement number, you:

1. Look at the left-most bit (just like in sign-magnitude). If it's a 0, you have a positive number. If it's a 1, you have a negative number.

2. If it's a positive number, the other 7 bits give you the magnitude (just like in sign-magnitude).

3.
If, however, it’s a negative number, then to discover the magnitude of that negative number you must flip all the bits and add one. This will give you a positive number which is the absolute value of your negative number. **Example**: Take the byte `00100110`. The left-most bit is a 0, which means it’s a positive number, and as we discovered above, the remaining 7 bits give a magnitude of 38. So this is the number 38. **Harder example**: Take the byte `10100110`. The left-most bit is a 1, which means it’s negative. Okay: negative what? How do we find the magnitude? Well, we “flip” all the bits (i.e., invert each one). #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 195 Context: # 7.4 BINARY (BASE 2) It sounds, a two's-complement representation scheme allows us to perform addition and subtraction with a single operation. In first grade (or so), you learned the procedure for adding multi-digit numbers, which we've followed several times in this chapter. It involves adding the digits right-to-left and possibly "carrying." Then in second grade (or so), you learned the procedure for subtracting multi-digit numbers. It involves subtracting the digits right-to-left and possibly "borrowing." If you're like me, you found adding easier than subtracting. It’s easy to just carry the one, but to borrow requires looking at the digit to the left, making sure that you can borrow from left until you actually find an available non-zero value, hoping the number on the bottom is actually less than the one on top (because otherwise you have to switch the order and then add a negative sign to the result), and keeping all of that straight as you march down the line. Even if you didn't find subtracting more difficult than adding, though, you can't argue that it's still a completely different algorithm, with different rules to follow. In computer hardware, we have to implement different circuitry to perform each operation, which is more difficult, costly, error-prone, and power-draining. The wonderful thing about two's-complement, however, is that this scheme we actually never need to use the subtraction algorithm. If we want to subtract two numbers — say, \( 24 - 37 \) — we can instead take the complement of the second number and then add them. Instead of \( 24 - 37 \) we compute \( 24 + (-37) \). Let's see it in action. Using conversion procedures, we can figure out that \( 24_{10} \) is: ``` 00011000 ``` and that positive \( 37_{10} \) is: ``` 00100101 ``` If we wanted to compute \( 24 + 37 \), we’d just add these. But instead we’re looking for \( 24 - 37 \), so we’ll take the complement of \( 37 \) to find: #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 211 Context: # 8.1. PROPOSITIONAL LOGIC ## Truth Tables Several times in this book, we've drawn the distinction between **intension** — the inner, conceptual meaning — and **extension** — the exhaustive list of examples. A set can have both an intension like "the prime numbers less than ten" and an extension like \{2, 3, 5, 7\}. A relation can have an intension like "isDaughterOf" and an extension like \{(Lisa, Homer), (Lisa, Marge), (Maggie, Homer), (Maggie, Marge)\}. So, too, with the logical connectives. When we say that the "∧" operator means "both propositions must be true," we're specifying the conceptual meaning of the "and" operator. 
Another way to describe it, however, would be to just list its value for all the possible inputs. Such an exhaustive list is called a **truth table**. We specify every possible combination of inputs, and list the output for each one of them. Here's the truth table for "∧": | X | Y | X ∧ Y | |---|---|-------| | 0 | 0 | 0 | | 0 | 1 | 0 | | 1 | 0 | 0 | | 1 | 1 | 1 | We use "1" to represent true and "0" for false, just to make the table more compact. The "∧" operator works on two propositions, either of which can have a truth value of 0 or 1. There are therefore, by the Fundamental Theorem of Counting, four different combinations of inputs, and so our truth table has four rows. The rightmost column shows the output for each of these sets of inputs. It indicates that X ∧ Y is 1 only when both inputs are 1, and 0 otherwise. Even if we didn't grasp the simple concept that "∧" is supposed to represent the concept of "and," we could just look up the value of X ∧ Y if we knew the truth values of X and Y. Sometimes we show more than one output in a truth table. For instance, this truth table shows the values for the other five operators: #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 218 Context: # CHAPTER 8. LOGIC This is exactly what a **predicate** is, which forms the basis for predicate logic, or "first-order predicate logic," to be more exact. A predicate is a formula that yields a proposition for each value of its inputs. For instance, I can define a predicate called `HasGovernor` as follows: Let `HasGovernor(x)` be the proposition that \( x \) is a state that has a governor. Then I can assert: `HasGovernor(Virginia)` to state that Virginia has a governor. This mechanism alleviates the need to define fifty nearly-identical propositions. Instead, we define one predicate. If you're a programmer, you can think of a predicate as a function that returns a proposition (which, in turn, can be thought of as a function that returns a boolean value). Whether you're a programmer or not, you can think of a predicate as a function (in the chapter 3 sense) mapping objects to propositions: `HasGovernor: Ω → P` where \( P \) is the set of all propositions. Note that the domain of this function is \( Ω \), the entire domain of discourse. This means that you can give any input at all to the predicate. For instance, we can assert: ¬`HasGovernor(mayonnaise)` 1. If you want to sound really nerdy, you can call it *first-order predicate calculus*, which is a synonym. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 221 Context: # 8.2. PREDICATE LOGIC This asserts that for every \( x \), \( \text{HasGovernor}(x) \) is true. Actually, this isn't quite right, for although Michigan and California have governors, mayonnaise does not. To be precise, we should say: \[ \forall x \in S \, \text{HasGovernor}(x), \] where \( S \) is the set of all fifty states in the U.S. We can use a quantifier for any complex expression, not just a simple predicate. For instance, if \( H \) is the set of all humans, then: \[ \forall h \in H \, \text{Adult}(h) \oplus \text{Child}(h) \] states that every human is either an adult or a child, but not both. (Imagine drawing an arbitrary line at a person's 18th birthday.) Another (more common) way to write this is to dispense with sets and define another predicate \( \text{Human} \).
Then we can say: \[ \forall h \, \text{Human}(h) \Rightarrow \text{Adult}(h) \oplus \text{Child}(h). \] Think this through carefully. We're now asserting that this expression is true for all objects, whether they be Duchess Kate Middleton, little Prince Louis, or a bowl of oatmeal. To see that it's true for all three, let's first let \( h \) be equal to Kate Middleton. We substitute Kate for \( h \) and get: \[ \text{Human}(\text{Kate}) \Rightarrow \text{Adult}(\text{Kate}) \oplus \text{Child}(\text{Kate}) \] \[ \text{true} \Rightarrow \text{true} \oplus \text{false} \] \[ \text{true} \Rightarrow \text{true} \] \[ \text{true} \checkmark \] Remember that "implies" (\( \Rightarrow \)) is true as long as the premise (left-hand side) is false and/or the conclusion (right-hand side) is true. In this case, they're both true, so we have a true result. Something similar happens for Prince Louis: \[ \text{Human}(\text{Louis}) \Rightarrow \text{Adult}(\text{Louis}) \oplus \text{Child}(\text{Louis}) \] \[ \text{true} \Rightarrow \text{false} \oplus \text{true} \] \[ \text{true} \Rightarrow \text{true} \] \[ \text{true} \checkmark \] #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 226 Context: # Chapter 8: Logic Here's a funny one I'll end with. Consider the sentence **"He made her duck."** What is intended here? Did some guy reach out with his hand and forcefully push a woman's head down out of the way of a screaming projectile? Or did he prepare a succulent dish of roasted fowl to celebrate her birthday? Oh, if the computer could only know. If we'd used predicate logic instead of English, it could! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 231 Context: # Chapter 9 ## Proof We've seen a lot of pretty sights on our cool brisk walk. We've caught a glimpse of the simple elegance of sets and relations, the precision of probabilistic reasoning, the recursive structure of trees, the explosive nature of combinatorics, and much more. None of these things have we plumbed to the depths, but we've appreciated their beauty and taken note of where they stood along our blazed trail. You'll remember this hike when you run into such concepts again and again in future computer science and math courses, and in your career beyond academics. Now we have one more step to make before returning to the trailhead, and that deals with the notion of **proof**. As we've studied these various mathematical entities, I've pointed out certain of their properties. A free tree has one more vertex than edge, for example. The cardinality of the union of two sets is at least as big as each of their individual cardinalities. If you flip-all-the-bits-and-add-one in a two's complement scheme, and then perform that flip-and-add operation again, you'll return to the original number. But with a few exceptions, we haven't proven any of these things. I've just stated them, and you've taken them on faith. In order to establish reliable truth, of course, professional mathematicians aren't satisfied with unsubstantiated statements. They need to be convinced that the claims we make do truly hold, and provably so, in all circumstances. What they seek is a **proof** of a... #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 237 Context: # 9.2.
TYPES OF PROOF TRITE → WRITE → WHITE Here, we had to temporarily change our first letter three different times — two of which seemingly brought us no nearer to WHITE — in order to successfully forge a path through the tangled forest. Knowing which direction to set out on is a matter of intuition plus trial and error. Given the axioms of any system (whether algebra, predicate logic, sets, etc.), there are an unfathomable number of different ways to proceed. The vast majority of them are bound to lead to dead ends. This is why a valid proof, when it is finished, is often an elegant and beautiful thing. It's a thin braid of jewels glistening in the midst of a whole lot of mud. ## Indirect proof Also known as a proof by contradiction or reductio ad absurdum, the indirect proof starts in a completely opposite way. It says, "Okay, I'm trying to prove X. Well, suppose for the sake of argument I assume that the opposite — not X — is true. Where would that lead me?" If you follow all the rules and it leads you to a contradiction, this tells you that the original assumption of ¬X must have been false. And this in turn proves that X must be true. We do this all the time in our thinking. Say you're driving down the highway. How do you know that the alternator in your engine is working? A direct proof would require that you open the hood and examine the part, testing to ensure it works properly. An indirect proof simply says, "Well, suppose it weren't working properly. Then my car engine wouldn't operate. But here I am, driving down the road, and the engine obviously does operate, so that tells me that the alternator must be working properly." One of the most famous indirect proofs dates from Euclid's *Elements* in 300 B.C. It proves that the square root of 2 is an irrational number, a great surprise to mathematicians at the time (most of whom doubted the very existence of irrational numbers). Remember... #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 257 Context: 9.4.
FINAL WORD ================= - ordered triples, 15 - org charts, 113 - outcomes, 60, 62 - overflow, 188 - P = NP?, 244 - parent (of a node), 114 - partial orders, 43 - partial permutations, 151, 154 - partitions, 26, 71, 94 - Pascal's triangle, 157 - passwords, 146 - paths (in a graph), 88, 113 - perfect binary tree, 122, 239 - permutations, 147 - PINs, 143 - poker, 160 - pop (off a stack), 99 - posets, 43 - post-order traversal, 118 - postulates, 226 - power sets, 24, 36 - pre-order traversal, 117 - predicates, 210, 211, 232 - predicate logic, 210 - premise (of implication), 200 - Prim's algorithm, 107 - Prim, Robert, 107 - prior probability, 68 - probability measures, 61, 63, 65 - product operator (Π), 142, 160 - proof, 223 - proof by contradiction, 229 - propositional logic, 197, 225 - propositions, 197, 210 - psychology, 70, 86 - push (on a stack), 99 - quantifiers, 212, 215 - queue, 95, 97 - quotient, 173, 174 - range (of a function), 48 - rational numbers (ℚ), 17, 24 - reachable, 89 - real numbers (ℝ), 71, 94 - rebalancing (a tree), 132 - recursion, 116, 120, 149, 231 - red-black trees, 133 - reflexive (relation), 40, 43 - relations, finite, 39 - relations, infinite, 39 - remainder, 173, 174 - right child, 116 - root (of a tree), 112, 114 - rooted trees, 112, 134 - Russell's paradox, 15 - sample space (Ω), 60 - semantic network, 87 - set operators, 18 - set-builder notation, 11 - sets, 8, 93 - sets of sets, 15 - sets, finite, 12 - sets, fuzzy, 10 - sets, infinite, 12 - sibling (of a node), 115 - sign-magnitude binary numbers, 183, 189 - **Sonic the Hedgehog**, 73 - southern states, 72 - spatial positioning, 92, 113 #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 2 Context: # Contents I Basics 5 1 Introduction 6 1.1 What is Machine Learning? 6 1.2 When to Use Machine Learning?
8 1.3 Goals and Outline 11 2 A Gentle Introduction through Linear Regression 15 2.1 Supervised Learning 15 2.2 Inference 17 2.3 Frequentist Approach 19 2.4 Bayesian Approach 36 2.5 Minimum Description Length (MDL) 42 2.6 Information-Theoretic Metrics 44 2.7 Interpretation and Causality 47 2.8 Summary 49 3 Probabilistic Models for Learning 51 3.1 Preliminaries 52 3.2 The Exponential Family 53 3.3 Frequentist Learning 59 #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 9 Context: # Acronyms - **AI:** Artificial Intelligence - **AMP:** Approximate Message Passing - **BN:** Bayesian Network - **DAG:** Directed Acyclic Graph - **ELBO:** Evidence Lower Bound - **EM:** Expectation Maximization - **ERM:** Empirical Risk Minimization - **GAN:** Generative Adversarial Network - **GLM:** Generalized Linear Model - **HMM:** Hidden Markov Model - **i.i.d.:** independent identically distributed - **KL:** Kullback-Leibler - **LASSO:** Least Absolute Shrinkage and Selection Operator - **LBP:** Loopy Belief Propagation - **LL:** Log-Likelihood - **LLR:** Log-Likelihood Ratio - **LS:** Least Squares - **MC:** Monte Carlo - **MCMC:** Markov Chain Monte Carlo - **MDL:** Minimum Description Length - **MFVI:** Mean Field Variational Inference - **ML:** Maximum Likelihood - **MRF:** Markov Random Field - **NLL:** Negative Log-Likelihood - **PAC:** Probably Approximately Correct - **pdf:** probability density function #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 10 Context: # Acronyms | Acronym | Meaning | |---------|------------------------------------------------| | pmf | probability mass function | | PCA | Principal Component Analysis | | PPCA | Probabilistic Principal Component Analysis | | QDA | Quadratic Discriminant Analysis | | RBM | Restricted Boltzmann Machine | | SGD | Stochastic Gradient Descent | | SVM | Support Vector Machine | | rv | random variable or random vector (depending on the context) | | s.t. | subject to | | VAE | Variational AutoEncoder | | VC | Vapnik–Chervonenkis | | VI | Variational Inference | #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 13 Context: # 1.1. What is Machine Learning? This starts with an in-depth analysis of the problem domain, which culminates with the definition of a mathematical model. The mathematical model is meant to capture the key features of the problem under study and is typically the result of the work of a number of experts. The mathematical model is finally leveraged to derive hand-crafted solutions to the problem. For instance, consider the problem of defining a chemical process to produce a given molecule. The conventional flow requires chemists to leverage their knowledge of models that predict the outcome of individual chemical reactions, in order to craft a sequence of suitable steps that synthesize the desired molecule. Another example is the design of speech translation or image/video compression algorithms. Both of these tasks involve the definition of models and algorithms by teams of experts, such as linguists, psychologists, and signal processing practitioners, not infrequently during the course of long standardization meetings.
The engineering design flow outlined above may be too costly and inefficient for problems in which faster or less expensive solutions are desirable. The machine learning alternative is to collect large data sets, e.g., of labeled speech, images, or videos, and to use this information to train general-purpose learning machines to carry out the desired task. While the standard engineering flow relies on domain knowledge and on design optimized for the problem at hand, machine learning lets large amounts of data dictate algorithms and solutions. To this end, rather than requiring a precise model of the set-up under study, machine learning requires the specification of an objective, of a model to be trained, and of an optimization technique. Returning to the first example above, a machine learning approach would proceed by training a general-purpose machine to predict the outcome of known chemical reactions based on a large data set, and then by using the trained algorithm to explore ways to produce more complex molecules. In a similar manner, large data sets of images or videos would be used to train a general-purpose algorithm with the aim of obtaining compressed representations from which the original input can be recovered with some distortion. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 14 Context: # 1.2 When to Use Machine Learning? Based on the discussion above, machine learning can offer an efficient alternative to the conventional engineering flow when development cost and time are the main concerns, or when the problem appears to be too complex to be studied in its full generality. On the flip side, the approach has the key disadvantages of providing generally suboptimal performance, or hindering interpretability of the solution, and to apply only to a limited set of problems. In order to identify tasks for which machine learning methods may be useful, reference [31] suggests the following criteria: 1. The task involves a function that maps well-defined inputs to well-defined outputs. 2. Large data sets exist or can be created containing input-output pairs. 3. The task provides clear feedback with clearly definable goals and metrics. 4. The task does not involve long chains of logic or reasoning that depend on diverse background knowledge or common sense. 5. The task does not require detailed explanations for how the decision was made. 6. The task has a tolerance for error and no need for provably correct or optimal solutions. 7. The phenomenon or function being learned should not change rapidly over time. 8. No specialized dexterity, physical skills, or mobility is required. These criteria are useful guidelines for the decision of whether machine learning methods are suitable for a given task of interest. They also offer a convenient demarcation line between machine learning as is intended today, with its focus on training and computational statistics tools, and more general notions of Artificial Intelligence (AI) based on knowledge and common sense [87] (see [126] for an overview on AI research). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 23 Context: 2.2. Inference ==================== Corresponding to an already observed pair \((x_t, y_t) \in \mathcal{D}\), learning entails the capability to predict the value \(t\) for an unseen domain point \(x\). 
Learning, in other words, converts experience – in the form of \(\mathcal{D}\) – into expertise or knowledge – in the form of a predictive algorithm. This is well captured by the following quote by Jorge Luis Borges: "To think is to forget details, generalize, make abstractions." [138] By and large, the goal of supervised learning is that of identifying a predictive algorithm that minimizes the generalization loss, that is, the error in the prediction of a new label \(t\) for an unobserved explanatory variable \(x\). How exactly to formulate this problem, however, depends on one's viewpoint on the nature of the model that is being learned. This leads to the distinction between the frequentist and the Bayesian approaches, which is central to this chapter. As will be discussed, the MDL philosophy deviates from the mentioned focus on prediction as the goal of learning, by targeting instead a parsimonious description of the data set \(\mathcal{D}\). ### 2.2 Inference Before we start our discussion of learning, it is useful to review some basic concepts concerning statistical inference, as they will be needed throughout this chapter and in the rest of this monograph. We specifically consider the inference problem of predicting a variable \(t\) given the observation of another variable \(x\) under the assumption that their joint distribution \(p(x, t)\) is known. As a matter of terminology, it is noted that here we will use the term "inference" as it is typically intended in the literature on probabilistic graphical models (see, e.g., [81]), thereby diverging from its use in other branches of the machine learning literature (see, e.g., [23]). In order to define the problem of optimal inference, one needs to define a non-negative loss function \(\ell(t, \hat{t})\). This defines the cost, or loss or risk, incurred when the correct value is \(t\) while the estimate is \(\hat{t}\). An important example is the \(\ell_q\) loss: \[ \ell_q(t, \hat{t}) = |t - \hat{t}|^q, \quad (2.1) \] which includes as a special case the quadratic loss \(\ell_2(t, \hat{t}) = (t - \hat{t})^2\). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 26 Context: # A Gentle Introduction through Linear Regression - **Separate learning and (plug-in) inference:** Learn first an approximation, say \( p_\theta(t|x) \), of the conditional distribution \( p(t|x) \) based on the data \( D \), and then plug this approximation in (2.3) to obtain an approximation of the optimal decision as \[ \hat{t}_\theta(x) = \arg \min_{\hat{t}} E_{p_\theta(t|x)}[\ell(t, \hat{t})]. \tag{2.6} \] - **Direct inference via Empirical Risk Minimization (ERM):** Learn directly an approximation \( \hat{t}_\theta(\cdot) \) of the optimal decision rule by minimizing an empirical estimate of the generalization loss (2.2) obtained from the data set as \[ \hat{t}_\theta(\cdot) = \arg \min_{\hat{t}} L_D(\hat{t}), \tag{2.7} \] where the empirical risk, or empirical loss, is \[ L_D(\hat{t}) = \frac{1}{N} \sum_{n=1}^{N} \ell(t_n, \hat{t}(x_n)). \tag{2.8} \] The notation \( L_D(\hat{t}) \) highlights the dependence of the empirical loss on the predictor \( \hat{t} \) and on the training set \( D \). In practice, as we will see, both approaches optimize a set of parameters that define the probabilistic model or the predictor.
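To make the empirical risk (2.8) concrete, here is a minimal Python sketch (not from the monograph); the toy data set, the quadratic loss, and the fixed predictor `t_hat` are illustrative assumptions only.

```python
import numpy as np

# Toy training set D = {(x_n, t_n)}: illustrative values only.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = np.array([0.1, 0.9, 0.0, -1.1, 0.2])

def t_hat(x):
    """An arbitrary fixed predictor, used only to illustrate (2.8)."""
    return np.sin(2 * np.pi * x)

def empirical_risk(x, t, predictor, loss=lambda tn, th: (tn - th) ** 2):
    """Empirical loss L_D(t_hat) = (1/N) * sum_n loss(t_n, t_hat(x_n))."""
    return float(np.mean([loss(tn, predictor(xn)) for xn, tn in zip(x, t)]))

print(empirical_risk(x, t, t_hat))
```

ERM then amounts to searching, over a parameterized family of predictors, for the one that makes this quantity smallest.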
Furthermore, the first approach is generally more flexible, since having an estimate \( p(t|x) \) of the posterior distribution \( p(t|x) \) allows the prediction (2.6) to be computed for any loss function. In contrast, the ERM solution (2.7) is tied to a specific choice of the loss function \( \ell \). In the rest of this section, we will start by taking the first approach, and discuss later how this relates to the ERM formulation. ## Linear regression example. For concreteness, in the following, we will consider the following running example inspired by [23]. In the example, data is generated according to the true distribution \( p(t|x) = p(x)p(t|x) \), where \( x \) is \[ x = z \sim N(\sin(2\pi z), 0.1). \tag{2.9} \] The training set in Fig. 2.1 was generated from this distribution. If this true distribution were known, the optimal predictor under the \( \ell_2 \) loss would be #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 27 Context: 2.3 Frequentist Approach ========================= would be equal to the conditional mean \[\hat{y}(x) = \sin(2\pi x).\] Hence, the minimum generalization loss is \( L_{p}(\hat{y}) = 0.1 \). It is emphasized that, while we consider this running example in order to fix the ideas, all the definitions and ideas reported in this chapter apply more generally to supervised learning problems. This will be further discussed in Chapter 4 and Chapter 5. 2.3.1 Discriminative vs. Generative Probabilistic Models -------------------------------------------------------- In order to learn an approximation \( p(y|x) \) of the predictive distribution \( p(y|x) \) based on the data \( \mathcal{D} \), we will proceed by first selecting a family of parametric probabilistic models, also known as a hypothesis class, and by then learning the parameters of the model \( \theta \) (in a sense to be made precise later) the data \( \mathcal{D} \). Consider as an example the linear regression problem introduced above. We start by modeling the label \( y \) as a polynomial function of the domain point \( x \) added to a Gaussian noise with variance \( \beta^{-1} \). Parameter \( \beta \) is the precision, i.e., the inverse variance of the additive noise. The polynomial function with degree \( M \) can be written as \[ \mu(x, w) = \sum_{j=0}^{M} w_{j} x^{j} = w^{T} \phi(x), \] where we have defined the weight vector \( w = [w_{0}, w_{1}, \ldots, w_{M}]^{T} \) and the feature vector \( \phi(x) = [1, x, x^{2}, \ldots, x^{M}]^{T} \). The vector \( w \) defines the relative weight of the powers in the sum (2.11). This assumption corresponds to adopting a parametric probabilistic model \( p(y|x) \), defined as \[ t | x \sim \mathcal{N}(\mu(x, w), \beta^{-1}), \] with parameters \( \theta = (u, \beta) \). Having fixed this hypothesis class, the parameter \( \theta \) can be then learned from the data \( \mathcal{D} \), as it will be discussed. In the example above, we have parameterized the posterior distribution. Alternatively, we can parametrize, and learn, the full joint distribution \( p(x, t) \). These two alternatives are introduced below. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 36 Context: L(w^*) is the best generalization loss for the given model. Recall that the loss L(w^*) can be achieved when N is large enough. 
Finally, the term (L(w_M) - L(w^*)) is the estimation error or generalization gap[^1] that is incurred due to the fact that N is not large enough and hence we have w_M ≠ w^*. From the decomposition (2.22), a large N allows us to reduce the estimation error, but it has no effect on the bias. This is seen in Fig. 2.4, where the asymptotic achieved by the generalization loss L(w^*) increases equals the minimum generalization loss L(w^*) for the given model. Choosing a small value of M in the regime of large data points provides a floor on the achievable generalization loss that no amount of additional data can overcome. ## Validation and Testing In the discussion above, it was assumed that the generalization loss L(w) can somehow be evaluated. Since this depends on the true unknown distribution p(z,t), this evaluation is, strictly speaking, not possible. How then to estimate the generalization loss in order to enable model order selection using a plot as in Fig. 2.33? The standard solution is to use validation. The most basic form of validation prescribes the division of the available data into two sets: a hold-out, or validation, set and the training set. The validation set is used to evaluate an approximation of the generalization loss L(w) via the empirical average: L(w) = \frac{1}{N_v} \sum_{n=1}^{N_v} \ell(t_n, w) \quad (2.23) where the sum is done over the N_v elements of the validation set. The just described hold-out approach to validation has a clear drawback, as part of the available data needs to be set aside and not used for training. This means that the number of data points that can be used for training is smaller than the number of overall available data points. To partially alleviate this problem, a more sophisticated, and commonly used, approach to validation is k-fold cross-validation. With this method, the available data points are partitioned, typically at random, into k equally sized subsets. The generalization loss is then esti[^1]: This is also defined as generalisation error in some references. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 39 Context: # 2.3. Frequentist Approach ## ML vs. MAP Observing (2.27), it is important to note the following general property of the MAP solution: As the number \( N \) of data points grows large, the MAP estimate tends to the ML estimate, given that the contribution of the prior information term decreases as \( 1/N \). When \( N \) is large enough, any prior credence is hence superseded by the information obtained from the data. Problem (2.27), which is often referred to as ridge regression, modifies the ML criterion by adding the quadratic (or Tikhonov) regularization function \[ R(w) = \|w\|^2 \tag{2.28} \] multiplied by the term \( \lambda/N \). The regularization function forces the norm of the solution to be small, particularly with larger values of the hyperparameter \( \lambda \), or equivalently \( \alpha \). The solution of problem (2.27) can be found by using standard LS analysis, yielding \[ w_{\text{MAP}} = \left( \lambda I + X^T X_d \right)^{-1} X^T d_d. \tag{2.29} \] This expression confirms that, as \( N \) grows large, the term \( \lambda \) becomes negligible and the solution tends to the ML estimate (2.18) [see 86 for a formal treatment]. 
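As a concrete illustration of the ridge/MAP solution, the following numpy sketch (mine, not the monograph's) reads (2.29) as \( w_{MAP} = (\lambda I + X^T X)^{-1} X^T t \), with \( X \) the \( N \times (M+1) \) matrix of polynomial features and \( t \) the label vector; the synthetic data loosely mimics the running example and all numbers are assumptions.

```python
import numpy as np

def poly_features(x, M):
    """Feature matrix with rows phi(x_n) = [1, x_n, ..., x_n^M]."""
    return np.vander(x, M + 1, increasing=True)

def ridge_map(x, t, M, lam):
    """w_MAP = (lambda*I + X^T X)^(-1) X^T t, solved as a linear system."""
    X = poly_features(x, M)
    A = lam * np.eye(X.shape[1]) + X.T @ X
    return np.linalg.solve(A, X.T @ t)

# Synthetic data (assumed values, roughly in the spirit of the running example).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(0.1), size=10)

w = ridge_map(x, t, M=9, lam=1e-3)
print(w)
```

Increasing `lam` shrinks the weights toward zero, which is the regularization effect discussed around Fig. 2.6: larger λ reduces overfitting at the cost of a possibly larger bias.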
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 40 Context: # A Gentle Introduction through Linear Regression ## Figure 2.6 Square root of the generalization loss \( L_{\mathcal{G}}(w_{MAP}) \) and of the training loss \( L_D(w_{MAP}) \) as a function of \( \lambda \) (in logarithmic scale) for the training data set in Fig. 2.1 with \( M = 9 \). ![Figure 2.6](path/to/image) Fig. 2.6 shows the squared root of the generalization loss \( L_{\mathcal{G}}(w_{MAP}) \) and of the training loss \( L_D(w_{MAP}) \) as a function of \( \lambda \) (in logarithmic scale) for the training data set in Fig. 2.1 with \( M = 9 \). The generalization loss is estimated using validation. We observe that increasing \( \lambda \), and hence the relevance of the regularization term, has a similar impact as decreasing the model order \( M \). A larger \( \lambda \) reduces the effective capacity of the model. Stated in different words, increasing \( \lambda \) reduces overfitting but may entail a larger bias. Other standard examples for the prior distribution include the Laplace pdf, which yields the \( l_1 \) norm regularization function \( R(w) = \|w\|_1 = \sum_{j=1}^n |w_j| \). This term promotes the sparsity of the solution, which is useful in many signal recovery algorithms [14] and in non-parametric function estimation [146]. The corresponding optimization problem is \[ \min_{w} L_D(w) + \lambda \|w\|_1 \tag{2.30} \] is known as LASSO (Least Absolute Shrinkage and Selection Operator). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 41 Context: ## 2.3 Frequentist Approach ### 2.3.5 Regularization We have seen above that the MAP learning criterion amounts to the addition of a regularization function \( R(\theta) \) to the ML or ERM learning losses. This function penalizes values of the weight vector \( \mathbf{w} \) that are likely to occur in the presence of overfitting, or, generally, that are improbable on the basis of the available prior information. The net effect of this addition is to effectively decrease the capacity of the model, as the set of values for the parameter vector \( \theta \) that the learning algorithm is likely to choose from is reduced. As a result, seen, regularization can control overfitting and its optimization requires validation. Regularization generally refers to techniques that aid in reducing overfitting during training. The discussion in the previous subsection has focused on a specific form of regularization that is grounded in a probabilistic interpretation in terms of MAP learning. We note that the same techniques, such as ridge regression and LASSO, can also be introduced independently of a probabilistic framework in an ERM formulation. Furthermore, apart from the discussed addition of regularization terms to the empirical risk, there are other ways to perform regularization. One approach is to modify the optimization scheme by using techniques such as early stopping [56]. Another is to augment the training set by generating artificial examples and hence effectively increasing the number \( N \) of training examples. Related to this idea is the technique known as bagging. With bagging, we first create a number \( K \) of bootstrap data sets. 
Bootstrap data sets are obtained by selecting \( N \) data points uniformly with replacement from \( D \) so that the same data point generally appears multiple times in the bootstrap data set. Then, we train the model \( K \) times, each time over a different bootstrap set. Finally, we average the results obtained from the models using equal weights. If the errors accrued by the different models were independent, bagging would yield an estimation error that decreases with \( K \). In practice, significantly smaller gains are obtained, particularly for large \( K \), given that the bootstrap data sets are all drawn from \( D \) and hence the estimation errors are not independent [23]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 44 Context: # Regression Figure 2.7: Bayesian Network (BN) describing the joint distribution (2.32) of the weight vector \( w \) of the labels in the training data \( D \) and in the new example, as used in the Bayesian approach. As mentioned, we are interested in computing the posterior probability \( p(w|D, t) \) of the new label \( t \) given the training data \( D \) and the new domain point \( x = \mathbf{x} \). Dropping again the domain variables to simplify the notation, we apply standard rules of probability to obtain: \[ p(t|D) = \frac{p(t|D, w) p(w|D)}{p(t|D)} = \int p(t|w)p(w|D)dw \tag{2.33} \] where the second equality follows from the marginalization rule \( p(t|D, t) = \int p(t|w, t)dw \) and the last equality from Bayes' theorem. Putting back the dependence on the domain variables, we obtain the predictive distribution as \[ p(t|x, D) = \int p(w|D) p(t|x, w)dw \tag{2.34} \] This is the key equation. Accordingly, the Bayesian approach considers the predictive probability \( p(t|x, w) \) associated with each value of the weight vector \( w \) weighted by the posterior belief \[ p(w|D) = \frac{p(D|w)p(w)}{p(D)} \tag{2.35} \] The posterior belief \( p(w|D) \), which defines the weight of the parameter vector \( w \), is therefore proportional to the prior belief \( p(w) \) multiplied by the correction \( p(D|w) \) due to the observed data. Computing the posterior \( p(w|D) \), and a fortiori also the predictive distribution \( p(t|x, D) \), is generally a difficult task that requires the adoption of approximate inference techniques covered in Chapter 8. For this #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 51 Context: # 2.6. Information-Theoretic Metrics To start, we revisit the ML problem (2.15), which amounts to the minimization of the NLL \(-\frac{1}{N}\sum_{n=1}^{N} \ln p(y_n | x_n; \theta)\), also known as log-loss. According to the frequentist viewpoint, the training set variables are drawn independently according to the true distribution \(p(x, y)\), i.e., \((x_n, y_n) \sim p\) i.i.d. over \(n = 1, \ldots, N\). By the strong law of large numbers, we then have the following limit with probability one: \[ -\frac{1}{N}\sum_{n=1}^{N} \ln p(y_n | x_n, w, \beta) \to \mathbb{E}_{x_{(1)}, \dots, x_{(N)} \sim p}[-\ln p(y | x)] \] (2.40) As we will see next, this limit has a useful interpretation in terms of the KL divergence. 
The KL divergence between two distributions \(p\) and \(q\) is defined as \[ KL(p||q) = \mathbb{E}_{x \sim p} \left[ \ln \frac{p(x)}{q(x)} \right] \] (2.41) The KL divergence is the expectation of the Log-Likelihood Ratio (LLR) \(\ln(p(x) / q(x))\) between the two distributions, where the expectation is taken with respect to the distribution at the numerator. The LLR tends to be larger, on average, if the two distributions differ more significantly, while being uniformly zero only when the two distributions are equal. Therefore, the KL divergence measures the "distance" between two distributions. As an example, with \(p(x) = N(\mu_1, \sigma_1^2)\) and \(q(x) = N(\mu_2, \sigma_2^2)\), we have \[ KL(p||q) = \frac{1}{2}\left(\frac{(\mu_1 - \mu_2)^2}{\sigma_2^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1 + \ln\frac{\sigma_2^2}{\sigma_1^2}\right) \] (2.42) and, in the special case \(\sigma_1^2 = \sigma_2^2 = \sigma^2\), we can write \[ KL(p||q) = \frac{1}{2} \left(\frac{(\mu_1 - \mu_2)^2}{\sigma^2}\right) \] (2.43) which indeed increases as the two distributions become more different. The KL divergence is measured in nats when the natural logarithm is used as in (2.41), while it is measured in bits if a logarithm in base 2 is used. In general, the KL divergence has several different interpretations as a measure of the distance of two distributions [23, pp. 55-58]. The most notable is Gibbs' inequality: \[ KL(p||q) \geq 0 \] (2.44) #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 52 Context: where equality holds if and only if the two distributions \( p \) and \( q \) are identical. Nevertheless, the KL divergence has also some seemingly unfavorable features, such as its non-symmetry, that is, the inequality \( KL(p||q) \neq KL(q||p) \). We will see in Chapter 8 that the absence of symmetry can be leveraged to define different types of approximate inference and learning techniques. Importantly, the KL divergence can be written as \[ KL(p||q) = \mathbb{E}_{x \sim p(x)}\left[-\ln q(x)\right] - \mathbb{E}_{x \sim p(x)}\left[-\ln p(x)\right] \] (2.45) where the first term, \( H(p||q) \), is known as cross-entropy between \( p(x) \) and \( q(x) \) and plays an important role as a learning criterion as discussed below; while the second term, \( H(p) \), is the entropy of distribution \( p(x) \), which is a measure of randomness. We refer to Appendix A for further discussion on the entropy. Based on the decomposition (2.45), we observe that the cross-entropy \( H(p||q) \) can also be taken as a measure of divergence between two distributions when one is interested in optimizing over the distribution \( q(x) \), since the latter does not appear in the entropy term. Note that the cross-entropy, unlike the KL divergence, can be negative. Using the definition (2.41), the expected log-loss on the right-hand side of (2.40) can be expressed as \[ \mathbb{E}_{x \sim p(x)}\left[\mathbb{E}_{y \sim p(y|x)}\left[-\ln p(y|x,w,\beta)\right]\right] = \mathbb{E}_{x \sim p(x)}\left[H(p(y|x)||p(y|x,w,\beta))\right] \] (2.46) which can be easily verified by using the law of iterated expectations. Therefore, the average log-loss is the average over the domain point \( x \) of the cross-entropy between the real predictive distribution \( p(y|x) \) and the predictive distribution \( p(y|x, w) \) dictated by the model.
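As a quick numerical check of (2.41)-(2.42), the sketch below (not part of the monograph) compares the closed-form KL divergence between two univariate Gaussians with a Monte Carlo estimate of the expected LLR; all parameter values are illustrative.

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1,var1) || N(mu2,var2)) in nats, cf. (2.42)."""
    return 0.5 * ((mu1 - mu2) ** 2 / var2 + var1 / var2 - 1.0 + np.log(var2 / var1))

def kl_monte_carlo(mu1, var1, mu2, var2, n=200_000, seed=0):
    """Estimate E_{x~p}[ln p(x) - ln q(x)] by sampling from p, cf. (2.41)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu1, np.sqrt(var1), size=n)
    log_p = -0.5 * (np.log(2 * np.pi * var1) + (x - mu1) ** 2 / var1)
    log_q = -0.5 * (np.log(2 * np.pi * var2) + (x - mu2) ** 2 / var2)
    return float(np.mean(log_p - log_q))

print(kl_gauss(0.0, 1.0, 1.0, 2.0), kl_monte_carlo(0.0, 1.0, 1.0, 2.0))
```

The two printed values should agree up to Monte Carlo noise, and both are non-negative, consistent with Gibbs' inequality (2.44).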
From (2.46), the ML problem (2.15) can be interpreted as an attempt to make the model-based discriminative distribution \( p(y|x, w, \beta) \) as close as possible to the actual posterior \( p(y|x) \). This is done by minimizing the KL divergence, or equivalently the cross-entropy, upon averaging over \( p(x) \). As final remarks, in machine learning, it is common to use the notation \( KL(p||q) \) even when \( q(x) \) is unnormalized, that is, when \( q(x) \) is #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 54 Context: # A Gentle Introduction through Linear Regression Figure 2.10: Illustration of Simpson's paradox [113]. The way out of this conundrum is to leverage prior information we have about the problem domain. In fact, we can explain away this spurious correlation by including another measurable variable in the model, namely age. To see this, consider the same data, now redrawn by highlighting the age of the individual corresponding to each data point. The resulting plot, seen in Fig. 2.10 (bottom), reveals that older people — within the observed bracket — tend to have a higher cholesterol as well as to exercise more. Therefore, age is a common cause of both exercise and cholesterol levels. In order to capture the causal relationship between the latter variables, we hence need to adjust for age. Doing this requires considering the trend within each age separately, recovering the expected conclusion that exercising is useful to lower one's cholesterol. We conclude that in this example the correlation between \( z \) and \( t \), while useful for prediction, should not be acted upon for decision making. When assessing the causal relationship between \( z \) and \( t \), we should first understand which other variables may explain the observations and then discount any spurious correlations. This discussion reveals an important limitation of most existing machine learning algorithms when it comes to identifying causality relationships, or, more generally, to answering counterfactual queries [112]. The study of causality can be carried out within the elegant framework of 7. This example is an instance of the so-called Simpson's paradox: patterns visible in the data disappear, or are even reversed, on the segregated data. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 58 Context: # 3.1 Preliminaries We start with a brief review of some definitions that will be used throughout the chapter and elsewhere in the monograph (see [28] for more details). Readers with a background in convex analysis and calculus may just review the concept of sufficient statistic in the last paragraph. First, we define a **convex set** as a subset of \(\mathbb{R}^D\), for some \(D\), that contains all segments between any two points in the set. Geometrically, convex sets hence cannot have "indentations." Function \(f(x)\) is convex if its domain is a convex set and if it satisfies the inequality \(f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)\) for all \(x_1\) and \(x_2\) in its domain and for all \(0 \leq \lambda \leq 1\). Geometrically, this condition says that the function is "U"-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve.
A function is strictly convex if the inequality above is strict except for \(\lambda = 0\) or \(\lambda = 1\); a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.” The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28]. We also introduce two useful concepts from calculus. The **gradient** of a differentiable function \(f(x)\) with \(x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D\) is defined as the \(D \times 1\) vector \(\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T\) containing all partial derivatives. At any point \(x\) in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian \(\nabla^2 f(x)\) is the \(D \times D\) matrix with \((i,j)\) element given by the second-order derivative \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). It captures the local curvature of the function. 1. A statistic is a function of the data. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 59 Context: # 3.2 The Exponential Family The probability \( p(\theta \mid x) \) of the parameters \( \theta \) depends on \( x \) only through the function \( f(x) \). As an example, for \( x \sim \mathcal{N}(0, \sigma^2) \), the function \( f(x) = x^2 \) can be easily seen to be sufficient for the estimate of the variance \( \sigma^2 \). ## 3.2 The Exponential Family In this section, we introduce the exponential family of parametric probabilistic models. As it will be discussed, this family includes as special cases most of the distributions typically assumed when solving machine learning problems. For example, it includes Gaussian, Laplace, Gamma, Beta and Dirichlet pdfs, as well as Bernoulli, Categorical, multinomial, and Poisson pmfs. An extensive list can be found in [156]. ### 3.2.1 Basic Definitions The exponential family contains probabilistic models of the form: \[ p(x \mid \eta) = \frac{1}{Z(\eta)} \exp\left( \sum_{k=1}^{K} m_k u_k(x) \right) m(z) \] \[ = \frac{1}{Z(\eta)} \exp\left( \eta^T u(x) \right) m(x), \tag{3.1} \] where \( x \) is a discrete or continuous-valued vector; \( \eta = [\eta_1, \ldots, \eta_K]^T \) is the vector of natural parameters; \( u(x) = [u_1(x), \ldots, u_K(x)]^T \) is the vector of sufficient statistics, with each sufficient statistic \( u_k(x) \) being a function of \( x \); \( m(z) \geq 0 \) is the base measure, which is a function of \( x \) that is independent of the natural parameter \( \eta \); and \( Z(\eta) \) is the partition function: \[ Z(\eta) = \int \exp\left( \eta^T u(x) \right) m(x) dx \tag{3.2} \] for continuous rvs and \( Z(\eta) = \sum_{x} \exp\left( \eta^T u(x) \right) m(x) \) for discrete rvs. The sufficient statistic vector \( u(x) \) can be seen to be a sufficient statistic for the estimation of the natural parameter vector \( \eta \) given \( x \). The partition function normalizes the distribution so that it integrates, or sums to one. 
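As a small sanity check on the form (3.1)-(3.2), the sketch below (my own illustration, not from the text) writes the Bernoulli pmf in exponential-family form with natural parameter η, sufficient statistic u(x) = x, and base measure m(x) = 1, and verifies that the partition function Z(η) makes the probabilities sum to one; the Bernoulli case is treated in detail later in the chapter, and the value of η here is arbitrary.

```python
import numpy as np

def bernoulli_expfam(x, eta):
    """p(x|eta) = exp(eta * u(x)) * m(x) / Z(eta), with u(x) = x and m(x) = 1."""
    Z = sum(np.exp(eta * u) for u in (0, 1))  # partition function (3.2), discrete case
    return float(np.exp(eta * x) / Z)

eta = 0.7  # natural parameter (an assumed, illustrative value)
probs = [bernoulli_expfam(x, eta) for x in (0, 1)]
print(probs, sum(probs))            # the probabilities sum to one
print(1.0 / (1.0 + np.exp(-eta)))   # matches p(x=1), i.e. the mean parameter sigma(eta)
```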
It is also often useful to use the unnormalized distribution: \[ \tilde{p}(x) = \exp\left( \eta^T u(x) \right) m(x), \tag{3.3} \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 63 Context: # 3.2 The Exponential Family The Bernoulli distribution for a binary rv \( x \in \{0, 1\} \) is given as \[ \text{Bern}(x|\mu) = \mu^{x} (1 - \mu)^{1 - x}, \tag{3.11} \] with \( \mu = \text{E}_{\text{Bern}(x)}[x] = \Pr(x = 1) \). Since we can write the LL function as \[ \ln(\text{Bern}(x|\mu)) = \ln\left(\frac{\mu}{\mu}\right) x + \ln(1 - \mu), \tag{3.12} \] the sufficient statistic defining the Bernoulli model is \( u(x) = x \) and the measure function is \( m(x) = 1 \). The mapping between the natural parameter \( \eta \) and the mean parameter \( \mu \) is given as \[ \eta = \ln\left(\frac{\mu}{1 - \mu}\right), \tag{3.13} \] that is, the natural parameter is the LLR \( \eta = \ln(\text{Bern}(1|\mu)/\text{Bern}(0|\mu)) \). Function (3.13) is also known as the logit function. The corresponding set of feasible values is hence \( \mathbb{R} \). The inverse mapping is instead given by the logistic sigmoid function \[ \mu = \sigma(\eta) = \frac{1}{1 + e^{-\eta}}, \tag{3.14} \] The sigmoid function converts a real number into the interval \([0, 1]\) via an S-shape that associates values less than 0.5 to negative values of the argument and larger values to positive numbers. Finally, the log-partition function is given by the convex function of the natural parameters \[ A(\eta) = -\ln(1 - \mu) = \ln(1 + e^{\eta}). \tag{3.15} \] Note that the relationship (3.10) is easily verified. ## 3.2.4 Categorical or Multinomial Model For its role in multi-class classification, we introduce here in some detail the Categorical or Multinomial distribution, along with the one-hot encoding of categorical variables and the soft-max function. The Categorical model applies to discrete variables taking \( C \) values, here labeled without loss of generality as \( \{0, 1, \ldots, C - 1\} \). Note that #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 70 Context: # 3.4.1 Beta-Bernoulli Model The Beta-Bernoulli model is suitable to study binary data. Conditioned on the parameter \( \mu = \Pr\{X = 1\} \), the pmf of the \( N \) i.i.d. available observations \( x_p \) with \( x_n \in \{0, 1\} \) is given as \[ p(x | \mu) = \prod_{n=1}^{N} \text{Bern}(x_n | \mu) = \mu^{\sum_{n=1}^{N} x_n}(1 - \mu)^{N - \sum_{n=1}^{N} x_n}. \tag{3.38} \] As seen, a conjugate prior should be such that the posterior (3.37) has the same distribution of the prior but with different parameters. For the likelihood (3.38), the conjugate prior is the Beta distribution, which is defined as \[ p(\alpha | a, b) = \text{Beta}(\alpha | a, b) \propto \alpha^{a - 1}(1 - \alpha)^{b - 1}, \tag{3.39} \] where \( a \) and \( b \) are hyperparameters and the normalization constant is not made explicit in order to simplify the notation. It is worth emphasizing that (3.39) is a probability distribution on a probability \( \alpha \). Plots of the beta pdf for different values of \( a, b \geq 1 \) can be found in Fig. 3.1. 
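The following sketch (not from the text) illustrates the kind of closed-form update that conjugacy buys for the likelihood (3.38): with a Beta(a, b) prior, the posterior is again a Beta distribution. The explicit update rule used here, Beta(a + Σx, b + N − Σx), is the standard Beta-Bernoulli result; the data and hyperparameters are made up.

```python
import numpy as np

def beta_bernoulli_posterior(x, a, b):
    """Standard conjugate update: Beta(a, b) prior + Bernoulli data x_1..x_N
    gives a Beta(a + sum(x), b + N - sum(x)) posterior."""
    x = np.asarray(x)
    return a + int(x.sum()), b + len(x) - int(x.sum())

# Illustrative observations and hyperparameters (assumed values).
x = [1, 0, 1, 1, 0, 1]
a_post, b_post = beta_bernoulli_posterior(x, a=2, b=2)
print(a_post, b_post)               # posterior hyperparameters
print(a_post / (a_post + b_post))   # posterior mean estimate of mu
```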
![Figure 3.1: Beta distribution with different values of the hyperparameters \( a, b \geq 1 \).](#) #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 75 Context: # 3.5 Supervised Learning via Generalized Linear Models (GLM) Distributions in the exponential family are not directly suitable to serve as discriminative probabilistic models to be used in supervised learning tasks. In this section, we introduce Generalized Linear Models (GLMs), which are popular probabilistic discriminative models that build on members of the exponential family. ## Figure 3.3: Gaussian-Gaussian model The prior distribution \( N(\mu | \phi_0 = 1, \sigma^2 = 3) \) (dotted), true distribution \( N(\mu | \theta_0 = 0, \sigma^2 = 1) \) (dashed), \( N \) observations (circles), ML solution (diamond), the MAP estimate (star), and Bayesian predictive distribution (solid line). \[ \mu = 0: \quad (i) \text{ the MAP estimate } \hat{\mu}_{MAP} \text{ tends to the ML estimate } \hat{\mu}_{ML}; \text{ and } (ii) \text{ the Bayesian predictive distribution tends to the ML predictive distribution, which in turn coincides with the true distribution due to (i).} \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 78 Context: Furthermore, the fact that the exponential distribution solves the maximum entropy problem (3.56) illuminates the relationship between mean parameters \(\mu\) and natural parameters \(\eta\), as the natural parameter \(\eta\) is the optimal Lagrange multiplier associated with the constraint \(E_x[\mu(x)] = \mu\). As another note on the exponential family and information-theoretic metrics, in Appendix B we provide discussion about the computation of the Kullback-Leibler divergence between two distributions in the same exponential family but with different parameters. ## 3.7 Energy-based Models* A generalization of the exponential family is given by probabilistic models of the form \[ p(x|\eta) = \frac{1}{Z(\eta)} \exp\left(-\sum_{e} E_e(x| \eta)\right) \tag{3.57} \] where functions \(E_e(x|\eta)\) are referred to as energy functions and \(Z(\eta)\) is the partition function. Each energy function \(E_e(x|\eta)\) generally depends on a subset \(x_e\) of the variables in vector \(x\). If each energy function depends linearly on the parameter vector \(\eta\), we recover the exponential family discussed above. However, the energy functions may have a more general non-linear form. An example is the function \(E_e(x|\eta) = \ln(1 + \eta^T x_e^2)\) corresponding to a Student's t-distribution model. Models in the form (3.57) encode information about the plausibility of different configurations of subsets of \(r\) vs \(x\), using the associated energy value: a large energy entails an implausible configuration, while a small energy identifies likely configurations. For example, a subset of \(r\) vs \(x\) may tend to be equal with high probability, implying that configurations in which this condition is not satisfied should have high energy. Energy-based models are typically represented via the graphical formalism of Markov networks, as will be discussed in Chapter 7. *A Student's t-distribution can be interpreted as an infinite mixture of Gaussians. 
As a result, it has longer tails than a Gaussian pdf [23, Chapter 2].* #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 79 Context: # 3.8 Some Advanced Topics* With energy-based models, the key formula (3.28) of the gradient of the LL with respect to the model's parameters generalizes as $$ \frac{1}{N} \nabla_{n} \ln p(z | \eta) = - \frac{1}{N} \sum_{n=1}^{N} \nabla_{n} E_{c}(z_{n} | \eta) + \sum_{c} E_{c} - p(z_{c}) \nabla_{n} E_{c}(x_{i}). $$ (3.58) Generalizing the discussion around (3.28), the first term in (3.58) is the “positive” component that points in a direction that minimizes the energy of the observations \( x_{i} \); while the second term is the “negative” component that pushes up the energy of the unobserved configurations. In gradient-ascent methods, the application of the first term is typically referred to as the positive phase, while the second is referred to as the negative phase. (The negative phase is even taken by some authors to model the working of the brain while dreaming!) [56] While for the exponential family the expectation in the negative phase readily yields the mean parameters, for more general models, the evaluation of this term is generally prohibitive and typically requires Monte Carlo approximations, which are discussed in Chapter 8. ## 3.8 Some Advanced Topics* The previous sections have focused on the important class of parametric probabilistic models in the exponential family. Here we briefly put the content of this chapter in the broader context of probabilistic models for machine learning. First, it is often useful to encode additional information about the relationships among the model variables by means of a graphical formalism that will be discussed in Chapter 7. Second, the problem of learning the distribution of given observations, which has been studied here using parametric models, can also be tackled using a non-parametric approach. Accordingly, the distribution is inferred making only assumptions regarding its local smoothness. Typical techniques in this family include Kernel Density Estimation and Nearest Neighbor Density Estimation (see, e.g., [140]). Furthermore, rather than learning individual probability densities, in some applications, it is more useful to directly estimate ratios of densities. This is the case, for instance, when one wishes to estimate the #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 80 Context: # Probabilistic Models for Learning ## 3.9 Summary In this chapter, we have reviewed an important class of probabilistic models that are widely used as components in learning algorithms for both supervised and unsupervised learning tasks. Among the key properties of members of this class, known as the exponential family, are the simple form taken by the gradient of the log-likelihood (LL), as well as the availability of conjugate priors in the same family for Bayesian inference. An extensive list of distributions in the exponential family along with corresponding sufficient statistics, measure functions, log-partition functions, and mappings between natural and mean parameters can be found in [156]. More complex examples include the Restricted Boltzmann Machines (RBMs) to be discussed in Chapter 6 and Chapter 8. 
It is worth mentioning that there are also distributions not in the exponential family, such as the uniform distribution parameterized by its support. The chapter also covered the important idea of applying exponential models to supervised learning via Generalized Linear Models (GLMs). Energy-based models were finally discussed as an advanced topic. The next chapter will present various applications of models in the exponential family to classification problems. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 83 Context: # 4.1 Preliminaries: Stochastic Gradient Descent In this section, we review a technique that is extensively used in the solution of optimization problems that define learning problems such as ML and MAP (see Chapter 2). The technique is known as Stochastic Gradient Descent (SGD). SGD is introduced here and applied throughout this monograph to other learning problems, including unsupervised learning and reinforcement learning. Discussions about convergence and about more advanced optimization techniques, which may be skipped at a first reading, can be found in the Appendix A of this chapter. SGD addresses optimization problems of the form: $$ \min_{\theta} \sum_{i=1}^{N} f_i(\theta), $$ where \( \theta \) is the vector of variables to be optimized. The cost function \( f_i(\theta) \) typically depends on the \( n \)th example in the training set \( D \). Following the notation set in Chapter 2, for example, in the case of discriminative deterministic models, the conventional form for the cost functions is: $$ f_i(\theta) = \ell(t_n, \hat{y}(x_n, \theta)), \quad (4.2) $$ where \( \ell \) is a loss function; \( (x_n, t_n) \) is the \( n \)th training example; and \( \hat{y}(\cdot, \theta) \) is a predictor parameterized by vector \( \theta \). SGD requires the differentiability of cost functions \( f_i(\cdot) \). The idea is to move at each iteration in the direction of maximum descent for the cost function in \( (4.1) \), when the latter is evaluated \( S \) of samples from the training set. Given a learning rate schedule \( \eta(t) \) and an initialization \( \theta(0) \) of the parameters, SGD repeats in each iteration until convergence the following two steps: 1. Pick a mini-batch \( S \) of indices from the set \( \{1, \ldots, N\} \) according to some predetermined order or randomly; 2. Update the parameters according to: $$ \theta(t+1) = \theta(t) - \eta(t) \nabla f_i(\theta(t)), $$ where \( \nabla f_i(\theta(t)) \) is the gradient of the cost function evaluated at the current parameters. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 84 Context: - Update the weights in the direction of steepest local descent as \[ \theta^{(t)} \gets \theta^{(t-1)} - \frac{\eta^{(t)}}{S} \sum_{s \in S} \nabla f_s(\theta^{(t-1)}). \] (4.3) The learning rate \(\eta^{(t)}\) as a function of the iteration \(t\) is generally considered to be part of the hyperparameters to be optimized via validation. More discussion on this can be found in Appendix A of this chapter. ## 4.2 Classification as a Supervised Learning Problem Classification is a supervised learning problem in which the label \(t\) can take a discrete finite number of values. We refer to Sec. 2.1 for an introduction to supervised learning. 
In binary classification, each domain point \(x\) is assigned to either one of two classes, which are denoted as \(C_0\) and \(C_1\) and identified by the value of the label \(t\) as follows: - \( x \in C_0 \; \text{if} \; t = -1 \) (4.4a) - \( x \in C_1 \; \text{if} \; t = 1. \) (4.4b) Note that we will find it convenient to use either the label \(t = 0\) or \(t = -1\) to identify class \(C_0\). In the more general case of \(K\) classes \(C_0, C_1, \ldots, C_{K-1}\), we will instead prefer to use one-hot encoding (Sec. 3.2) by labelling a point \(x \in C_k\) with a \(K \times 1\) label that contains all zeros except for a “1” entry at position \(k + 1\). ### Example 4.1. Examples of binary classification include email spam detection and creditworthiness assessment. In the former case, the domain point \(x\) may be encoded using the bag-of-words model, so that each entry represents the count of the number of times that each term in a given set appears in the email. In the latter application, the domain vector \(x\) generally includes valuable information to decide on whether a customer should be granted credit, such as credit score and salary (see, e.g., [2]). *While less useful, the "hot dog" vs. "not hot dog" classifier designed in the "Silicon Valley" HBO show (Season 4) is also a valid example.* #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 85 Context: # 4.2. Classification as a Supervised Learning Problem The binary classification problem is illustrated in Fig. 4.1. Given a training set \( D \) of labeled examples \( x_n, n = 1, \ldots, N \), the problem is to assign a new example \( x \) to either class \( C_0 \) or \( C_1 \). In this particular standard data set, the two variables in each vector \( x \) measure the sepal length and sepal width of an iris flower. The latter may belong to either the setosa or virginica family, as encoded by the label \( t_n \) and represented in the figure with different markers. Throughout, we denote as \( D \) the dimension of the domain point \( x \) (\( D = 2 \) in Fig. 4.1). ![Binary classification illustration](path/to/figure4.1.png) **Figure 4.1:** Illustration of the binary (\( K = 2 \) classes) classification problem with a domain space of dimension \( D = 2 \): to which class should the new example \( x \) be assigned? Following the taxonomy introduced in Chapter 2, we can distinguish the following modeling approaches, which will be reviewed in the given order throughout the rest of this chapter: - **Deterministic deterministic models:** Model directly the deterministic mapping between domain point and label via a parameterized function \( t \approx \hat{t}(x) \). - **Discriminative probabilistic models:** Model the probability of a point \( x \) belonging to class \( C_k \) via a parameterized conditional pmf \( p(t|x) \), with the relationship between \( t \) and \( C_k \) defined in (4.4). We will also #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 86 Context: write \( p(C_k | x) \) for the discriminative probability when more convenient. - **Generative probabilistic model**: Model the joint distribution of domain point and class label by specifying the prior distribution \( p(t) \) or \( p(C_k) \), and the class-dependent probability distribution \( p(x | C_k) \), or \( p(x | C_k) \), of the domain points within each class. 
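As a toy illustration of the three modelling approaches just listed, the sketch below (assumed 1-D Gaussian class-conditionals and hypothetical parameter values, used only for illustration) contrasts a discriminative deterministic predictor, a discriminative probabilistic model \( p(C_1 | x) \), and a generative model \( (p(C_k), p(x | C_k)) \) converted into a posterior via Bayes' rule.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# generative model: prior p(C_k) and class-conditional p(x | C_k)
prior = np.array([0.5, 0.5])
means, var = np.array([-1.0, +1.0]), 1.0

def generative_posterior(x):
    joint = prior * normal_pdf(x, means, var)   # p(C_k) p(x | C_k)
    return joint / joint.sum()                  # Bayes' rule gives p(C_k | x)

def discriminative_prob(x, w=2.0, w0=0.0):
    return 1.0 / (1.0 + np.exp(-(w * x + w0)))  # directly parameterize p(C_1 | x)

def deterministic_predictor(x, w=2.0, w0=0.0):
    return 1 if w * x + w0 > 0 else 0           # map x straight to a label

x = 0.3
print(generative_posterior(x))     # [p(C_0 | x), p(C_1 | x)]
print(discriminative_prob(x))      # p(C_1 | x)
print(deterministic_predictor(x))  # hard decision
```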
Discriminative models are arguably to be considered as setting the current state of the art on classification, including popular methods such as Support Vector Machine (SVM) and deep neural networks. Generative models are potentially more flexible and powerful as they allow to capture distinct class-dependent properties of the covariates \( x \). ### 4.3 Discriminative Deterministic Models In this section, we discuss binary classification using discriminative deterministic models. Owing to their practical importance and to their intuitive geometric properties, we focus on linear models, whereby the binary prediction \( \hat{y}(x) \) is obtained by applying a threshold rule on a decision variable \( a(x, \mathbf{w}) \) obtained as a linear function of the learnable weights \( \mathbf{w} \) (the notation will be introduced below). Note that the decision variable \( a(x, \mathbf{w}) \) may not be a linear function of the covariates \( x \). As we will discuss, this class of models underlie important algorithms that are extensively used in practical applications such as SVM. A brief discussion on multi-class classification using deterministic models is provided at the end of this section. In the next two sections, we cover discriminative probabilistic models, including GLMs and more general models. #### 4.3.1 Model In their simplest form, linear discriminative deterministic classification models are of the form \[ \hat{y}(x, \mathbf{w}) = \text{sign}(a(x, \mathbf{w})) \tag{4.5} \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 87 Context: # 4.3 Discriminative Deterministic Models where the activation, or decision variable, is given as $$ a(x, \tilde{w}) = \sum_{d=1}^{D} w_d x + w_0 = \tilde{w}^T x $$ and we have defined the weight vectors \( w = [w_1, \ldots, w_D]^T \) and \( \tilde{w} = [w_0, w_1, \ldots, w_D]^T \), as well as the extended domain point \( \tilde{x} = [1, x^T]^T \), with \( x = [x_1, \ldots, x_D]^T \). The sigmoid function in decision rule (4.5) outputs 1 if its argument is positive, and 0 if the argument is negative depending on the assumed association rule in (4.4). ## Geometric Interpretation: Classification, Geometric and Functional Margins The decision rule (4.5) defines a hyperplane that separates the domain points classified as belonging to either of the two classes. A hyperplane is a line when \( D = 2 \); a plane when \( D = 3 \); and, more generally, a \( D - 1 \)-dimensional affine subspace in the domain space. The hyperplane is defined by the equation \( a(x, \tilde{w}) = 0 \), with points on either side characterized by either positive or negative values of the activation \( a(x, \tilde{w}) \). The decision hyperplane can be identified as described in Fig. 4.2: the vector \( w \) defines the direction perpendicular to the hyperplane and \( -w_0 / \|w\| \) is the bias of the decision surface in the decision space. --- ![Figure 4.2: Key definitions for a binary linear classifier.](#) $$ a(x, \tilde{w}) > 0 $$ $$ |a(x, \tilde{w})| $$ $$ a(x, \tilde{w}) = 0 $$ $$ a(x, \tilde{w}) < 0 $$ #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 88 Context: 82 Classification Given a point \( x \), it is useful to measure the confidence level at which the classifier assigns \( x \) to the class identified through rule (4.5). 
This can be done by quantifying the Euclidean distance between \( x \) and the decision hyperplane. As illustrated in Fig. 4.2, this distance, also known as the classification margin, can be computed as \( | \langle x, \mathbf{w} \rangle | / \| \mathbf{w} \| \). A point \( x \) has a true label \( t \), which may or may not coincide with the one assigned by rule (4.5). To account for this, we augment the definition of margin by giving a positive sign to correctly classified points and a negative sign to incorrectly classified points. Assuming that \( t \) takes values in \( \{-1, 1\} \), this yields the definition of geometric margin as \[ \text{margin} = \frac{t \cdot \langle x, \mathbf{w} \rangle}{\| \mathbf{w} \|} \tag{4.7} \] whose absolute value equals the classification margin. For future reference, we also define the functional margin as \( t \cdot \langle x, \mathbf{w} \rangle \). **Feature-based model**. The model described above, in which the activation is a linear function of the input variables \( x \), has the following drawbacks: 1. **Bias**: As suggested by the example in Fig. 4.3, dividing the domain of the covariates \( x \) by means of a hyperplane may fail to capture. ![Figure 4.3: A non-linearly separable training set.](path/to/image) #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 89 Context: # 4.3 Discriminative Deterministic Models The geometric structure of the data. In particular, in the example, the two classes are not linearly separable in the space of the covariates—no hyperplane separates exactly the domain points in the two classes. In such cases, classifiers of the form (4.5) yield large average losses due to the bias induced by the choice of the model (see Sec. 2.3.3). 1. **Overfitting:** When D is large and the data points N are insufficient, learning the D + 1 weights of the classifier may cause overfitting. 2. **Data-dependent domain size:** In some applications, the dimension D may even change from data point to data point, that is, it may vary with the index n. For example, a text \( x_n \), e.g., represented in ASCII format, will have a different dimension \( D_n \) depending on the number of words in the text. To address these problems, a powerful approach is that of working with feature vectors \( \phi(x) \), \( k = 1, \ldots, D \), rather than directly with the covariates \( z \) as the input to the classifier. A feature \( \phi(x_n) \) is a generally non-linear function of the vector \( z \). It is important to emphasize that these functions are fixed and not learned. Choosing a number of features \( D' > D \), which yields an overcomplete representation of the data point \( x_n \), may help against bias; while opting for an undercomplete representation with \( D' < D \) may help solve the problem of overfitting. Furthermore, the same number of features \( D' \), e.g., word counts in a bag-of-words model, may be selected irrespective of the size of the data point, addressing also the last problem listed above. The feature-based model can be expressed as (4.5) with activation \[ a(z, \theta) = \sum_{k=1}^{D} w_k \phi_k(z) = \mathbf{w}^T \boldsymbol{\phi}(z). \] (4.8) where we have defined the feature vector \( \phi(x) = [\phi_1(x) \cdots \phi_D(x)]^T \). Note that model (4.5) is a special case of (4.8) with the choice \( \phi(z) = [1, z^T]^T \). ## 4.3.2 Learning As seen in Sec. 
2.3.3, learning of deterministic discriminative models can be carried out by means of ERM for a given loss function \( \ell \). Further... #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 90 Context: # Classification Furthermore, as discussed in Sec. 2.3.5, overfitting can be controlled by introducing a regularization function \( R(\theta) \) on the weight vector \( \theta \). Accordingly, a deterministic predictor \( \hat{y}(\mathbf{x}) \) as defined in (4.5) can be learned by solving the regularized ERM problem \[ \min_{\theta} L_D(\theta) + \lambda N R(\theta). \tag{4.9} \] with the empirical risk \[ L_D(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(t_n, \hat{y}(\mathbf{x}_n)). \tag{4.10} \] In (4.9), the hyperparameter \( \lambda \) should be selected via validation as explained in Sec. 2.3.5. Extending the examples discussed in Sec. 2.3.5, the regularization term is typically convex but possibly not differentiable, e.g., \( R(\theta) = \|\theta\|_1 \). Furthermore, a natural choice for the loss function is the 0-1 loss, which implies that the generalization loss \( L_g \) in (2.2) is the probability of classification error. In the special case of linearly separable data sets, the resulting ERM problem can be converted to a Linear Program (LP) [133]. Since it is in practice impossible to guarantee the separability condition a priori, one needs generally to solve directly the ERM problem (4.9). The function \( \text{sign}() \) has zero derivative almost everywhere, and is not differentiable when the argument is zero. For this reason, it is difficult to tackle problem (4.9) via standard gradient-based optimization algorithms such as SGD. It is instead often useful to consider surrogate loss functions \( \ell(t, a) \) that depend directly on the differentiable (affine) activation \( a(\mathbf{x}) \). The surrogate loss function should preferably be convex in \( a \) and in \( \mathbf{x} \), ensuring that the resulting regularized ERM problem \[ \min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \ell(t_n, \hat{y}(\mathbf{x}_n)) + \lambda N R(\theta) \] is convex. This facilitates optimization [28], and, under suitable additional conditions, guarantees generalization [133]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 92 Context: ``` 86 Classification The loss function is defined as $$ \ell(t(x, \hat{w})) = \max(0, -t \cdot a(x, \hat{w})) $$ (4.12) The perceptron process assigns zero cost to a correctly classified example \( x \), whose functional margin \( t \cdot a(x, \hat{w}) \) is positive, and a cost equal to the absolute value of the functional margin for a misclassified example, whose functional margin is negative. A comparison with the 0-1 loss is shown in Fig. 4.4. The perceptron algorithm tackles problem (4.11) with \( \lambda = 0 \) via SGD with mini-batch size \( S = 1 \). The resulting algorithm works as follows. First, the weights \( \hat{w}^{(0)} \) are initialized. Then, for each iteration \( i = 1, 2, \ldots \): 1. Pick a training example \( (x_n, t_n) \) uniformly with replacement from \( D \): - If the example is correctly classified, i.e., if \( t_n a(x_n, \hat{w}^{(i)}) \geq 0 \), do not update the weights: $$ \hat{w}^{(i)} = \hat{w}^{(i-1)}. 
$$ - If the example is not correctly classified, i.e., if \( t_n a(x_n, \hat{w}^{(i)}) < 0 \), update the weights as: $$ \hat{w}^{(i)} = \hat{w}^{(i-1)} - \nabla_{\hat{w}} \ell(t_n, a(x_n, \hat{w})) \big|_{\hat{w} = \hat{w}^{(i-1)}} = \hat{w}^{(i-1)} + t_n \tilde{x}_n. $$ (4.13) It can be proved that, at each step, the algorithm reduces the term \( \ell(t_n, a(x_n, \hat{w})) \) of the perceptron loss associated with the selected training example, even if the latter is misclassified. It can also be shown that, if the training set is linearly separable, the perceptron algorithm finds a weight vector \( \hat{w} \) that separates the two classes exactly in a finite number of steps [23]. However, convergence can be slow. More importantly, the perceptron fails on training sets that are not linearly separable, such as the XOR training set \( D = \{(0, 0), (0, 1), (1, 0), (1, 1)\} \) [97]. This realization came as a disappointment and contributed to the first so-called AI winter, a period characterized by reduced funding for AI and machine learning research [154]. ## Support Vector Machine (SVM) SVM, introduced in its modern form by Cortes and Vapnik [37] in 1995, was among the main causes for a renewal of interest in machine learning. ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 93 Context: # 4.3 Discriminative Deterministic Models For this section, we will write explicitly (and with a slight abuse of notation) the activation as $$ a(\mathbf{x}, \tilde{\mathbf{w}}) = w_0 + \mathbf{w}^T \phi(\mathbf{x}) \tag{4.14} $$ in order to emphasize the offset \( w_0 \). SVM solves the regularized ERM problem (4.11) with the surrogate hinge loss function $$ \ell(t, a(\mathbf{x}, \tilde{\mathbf{w}})) = \max(0, 1 - t \cdot a(\mathbf{x}, \tilde{\mathbf{w}})) \tag{4.15} $$ and with the regularization function \( R(\mathbf{w}) = \|\mathbf{w}\|^2 \). Note that the latter involves only the vector \( \mathbf{w} \) and not the bias weight \( w_0 \); we will see below why this is a sensible choice. The hinge loss function is also shown in Fig. 4.4. Therefore, unlike the perceptron algorithm, SVM includes a regularization term, which was shown to ensure strong theoretical guarantees in terms of generalization error [39]. Furthermore, rather than relying on SGD, SVM attempts to directly solve the regularized ERM problem using powerful convex optimization techniques [28]. To start, we need to deal with the non-differentiability of the hinge loss (4.15). This can be done by introducing auxiliary variables \( z_n \), one for each training example \( n \). In fact, imposing the inequality \( z_n \geq \ell(t_n, a(\mathbf{x}_n, \tilde{\mathbf{w}})) \) yields the following equivalent problem: \[ \min_{\tilde{\mathbf{w}}, \mathbf{z}} \; \sum_{n=1}^N z_n + \lambda R(\mathbf{w}) \tag{4.16a} \] s.t. \[ t_n \cdot a(\mathbf{x}_n, \tilde{\mathbf{w}}) \geq 1 - z_n \tag{4.16b} \] \[ z_n \geq 0 \quad \text{for } n = 1, \ldots, N, \tag{4.16c} \] where \( \mathbf{z} = [z_1, \ldots, z_N]^T \). The equivalence between the original regularized ERM problem and problem (4.16) follows from the fact that, at any optimal solution, the variables \( z_n \) must satisfy either constraint (4.16b) or (4.16c) with equality. This can be seen by contradiction: a solution for which both constraints are loose for some \( n \) could always be improved by decreasing the value of the corresponding variable \( z_n \) until the most stringent of the two constraints in (4.16) is met.
As a consequence, at an optimal solution, we have the equality \( z_n = \ell(t_n, a(\mathbf{x}_n, \mathbf{w})) \). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 95 Context: # 4.3. Discriminative Deterministic Models zero Lagrange multipliers. Note that the Lagrange multipliers are returned by standard solvers such as the ones implemented by the CVX toolbox in MATLAB [58]. ![Example of binary classification with SVM using polynomial features up to degree \( M \) (λ/N = 0.2).](./figure_4_5.png) ## Example 4.2 In the example in Fig. 4.5, the illustrated \( N = 80 \) training samples are fed to a SVM using the monomial feature vector \( \phi(\mathbf{x}) = [1, x_1^2, \ldots, x_1^M, x_2^M]^\top \) and \( \lambda/N = 0.2 \) for given model orders \( M \). The decision boundary is shown using dashed and solid lines. It is seen that, using a sufficiently large order (here \( M = 3 \)), SVM is able to effectively partition the two samples in the two classes. Furthermore, even with larger values of \( M \) (here \( M = 8 \)), SVM appears not to suffer from significant overfitting thanks to the quadratic regularization term. The optimization problem (4.16) may be conveniently tackled by using Lagrange duality techniques. This approach also allows one to naturally introduce the powerful tool of kernel methods. The interested reader can find this discussion in Appendix B of this chapter. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 98 Context: ``` ## 4.4.2 Learning Consider first ML. The NLL function can be written as \[ -\ln p(t|x, \mathbf{w}) = -\sum_{n=1}^{N} \ln p(t_n|x_n, \mathbf{w}) \tag{4.23} \] \[ = -\sum_{n=1}^{N} \{t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n)\} \tag{4.24} \] where we have defined \(y_n = \sigma(\mathbf{w}^T \phi(x_n))\). The NLL (4.23) is also referred to as the cross-entropy loss criterion, since the term \(-t_n \ln(y_n) - (1 - t_n) \ln(1 - y_n)\) is the cross-entropy \(H(t_n || y_n) = -\sum_{i} t_i \ln(y_i)\) (see Sec. 2.6). We note that the cross-entropy can be used to obtain upper bounds on the probability of error (see, e.g., [50]). The ML problem of minimizing the NLL is convex (see Sec. 3.1), and hence it can be solved either directly using convex optimization tools, or by using iterative methods such as SGD or Newton (the latter yields the iterative reweighted least squares algorithm [23, p. 207]). The development of these methods leverages the expression of the gradient of the LL function (3.28) (used with \(N = 1\) for the exponential family). To elaborate, using the chain rule for differentiation, we can write the gradient \[ \nabla_{\mathbf{w}} p(t | x, \mathbf{w}) = \nabla_{t} \text{Bern}(t_n) |_{t_n=t} \cdot \nabla_{\mathbf{w}} \sigma(\mathbf{w}^T \phi(x_n)) \tag{4.25} \] which, recalling that \(\nabla_{\ln(\text{Bern}(t_n))} = (t - \sigma(\cdot))\) (cf. (3.28)), yields \[ \nabla_{\mathbf{w}} p(t | \mathbf{w}) = (t - y) \sigma \tag{4.26} \] Evaluating the exact posterior distribution for the Bayesian approach turns out to be generally intractable due to the difficulty in normalizing the posterior \[ p(\mathbf{w}) \propto p(t) \prod_{n=1}^{N} p(t_n | x_n, \mathbf{w}) \tag{4.27} \] We refer to [23, pp. 217-220] for an approximate solution based on Laplace approximation. Other useful approximate methods will be discussed in Chapter 8. 
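Before moving on, here is a minimal sketch of ML training for logistic regression via SGD, using the LL gradient (4.26), i.e. \( (t - y)\,\phi(x) \) with \( y = \sigma(\mathbf{w}^T \phi(x)) \); the data set and the simple feature map \( \phi(x) = [1, x^T]^T \) are assumed toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.normal(size=(N, 2))
t = (x[:, 0] - 0.5 * x[:, 1] > 0).astype(float)  # labels in {0, 1}

def phi(x_n):
    return np.concatenate(([1.0], x_n))          # fixed (not learned) feature vector

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
eta = 0.1
for _ in range(2000):
    n = rng.integers(N)                          # mini-batch of size S = 1
    y = sigma(w @ phi(x[n]))
    w += eta * (t[n] - y) * phi(x[n])            # ascend the LL using (4.26)
print(w)
```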
``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 99 Context: # 4.4. Discriminative Probabilistic Models: Generalized Linear Models As a final remark, with bipolar labels, i.e., \( t \in \{-1,+1\} \), the cross-entropy loss function can be written as \[ -\ln p(t | x; \theta) = \sum_{n=1}^{N} \ln(1 + \exp(-t_a f(x_n; \theta))) \tag{4.28} \] This formulation shows that logistic regression can be thought of as an ERM method with loss function \( \ell(t, f(x, \theta)) = \ln(1 + \exp(-t_a f(x, \theta))) \), which is seen in Fig. 4.4 to be a convex surrogate loss of the 0-1 loss. ## Mixture Models As seen, the Bayesian approach obtains the predictive distribution by averaging over multiple models \( p(t | x; \theta) \) with respect to the parameters' posterior \( p(\theta | D) \) (2.34). The resulting model hence makes the predictions returned by multiple discriminative models \( p(t | x; \theta) \) to obtain the predictive distribution \( p(t | x) = \int p(t | x, D) p(D | \theta) d\theta \). As we briefly discuss below, it is also possible to learn mixture models within a frequentist framework. Consider \( K \) probabilistic discriminative models \( p(t | x, w_k) \), \( k = 1, \ldots, K \), such as logistic regression. The mixture model is defined as \[ p(t | x, \theta) = \sum_{k=1}^{K} \pi_k p(t | x, w_k) \tag{4.29} \] In this model, the vector \( \theta \) of learnable parameters includes the probability vector \( \pi \), which defines the relative weight of the \( K \) models, and the vectors \( w_1, \ldots, w_K \) for the \( K \) constituent models. As discussed, in the Bayesian approach, the weights \( \pi_k \) are directly obtained by using the rules of probability, as done in (427). Within a frequentist approach, model training is typically performed by a specialized algorithm, which will be described in Chapter 6, known as Expectation Maximization (EM). Mixture models increase the capacity of discriminative models and hence allow for more complex relationships between covariates and labels. In particular, a mixture model, such as (4.29), has a number of parameters that increases proportionally with the number \( K \) of constituent models. Therefore, the capacity of a mixture model grows larger with \( K \). As an example, this increased capacity may be leveraged by specializing each constituent model \( p(t | x, w_k) \) to a different area of the covariate domain. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 100 Context: # 4.4.3 Multi-Class Classification Given their larger capacity, mixture models may be prone to overfitting. A way to control overfitting will be discussed in Sec. 4.7. In the case of \( K \) classes, the relevant exponential family distribution is Categorical with natural parameters depending linearly on the feature vector. This yields the following discriminative model as a generalization of logistic regression: \[ t | x, W \sim \text{Cat}(t | \mathbf{W}(x)). \] where the label vector \( t \) is defined using one-hot encoding (Chapter 3) and \( W \) is a matrix of weights. We can also equivalently write the vector of probabilities for the \( K \) classes as \[ y = \text{softmax}(\mathbf{W}(x)) = \left( \frac{e^{\mathbf{W}_1(x)}}{\sum_{j=1}^K e^{\mathbf{W}_j(x)}}, \ldots, \frac{e^{\mathbf{W}_K(x)}}{\sum_{j=1}^K e^{\mathbf{W}_j(x)}} \right). 
\] where \( y = [y_1, \ldots, y_K]^T \) with \( y_h = p(c_h | x) \); and \( \mathbf{W}_k \) being the \( k \)-th row of the weight matrix \( W \). Learning follows as for logistic regression. To briefly elaborate on this point, the NLL can be written as the cross-entropy function: \[ -\ln p(t | x; W) = -\sum_{n=1}^{N} \ln p(t_n | x, W) = -\sum_{n=1}^{N} t_n^T \ln(y_n). \] where the logarithm is applied element by element, and we have \( y_n = \text{softmax}(\mathbf{W}(x_n)) \). Note that each term in (4.32) can be expressed as the cross-entropy \( -\mathbf{t}_n \ln(y_n) = H(t_n || y_n) \). The ML problem is again convex and hence efficiently solvable. The gradient of the NLL can be again found using the general formula (3.28) for exponential models and the chain rule for derivatives. We can write \[ \nabla W \ln p(t | x; W) = \nabla \ln p(t | \mathbf{W}(x)) \propto \nabla \mathbf{W}(\mathbf{W}(x)). \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 101 Context: 4.4. Discriminative Probabilistic Models: Generalized Linear Models =================================================================== which yields ∇W ln p(t|x, W) = (t − y)σ(z)ᵀ. (4.34) ### 4.4.4 Relation to Neural Networks GLM models of the form (4.30) or (4.19) in the special case of binary classification can be interpreted in terms of the neural network shown in Fig. 4.6. A neural network consists of a directed graph of computing elements, known as neurons. Each neuron applies a deterministic transformation of its inputs, as defined by the incoming edges, to produce a scalar output on its outgoing edge. In Fig. 4.6, the input vector \( x \) is first processed by a hidden layer of neurons, in which each hidden neuron computes the feature \( \phi_k(x) \). Then, each kth neuron in the output layer applies the kth element of the softmax non-linearity (4.31) to the numbers produced by the hidden neurons in order to compute the probability \( y_k = p(y|x) \). Note that, in the case of binary classification, only one output neuron is sufficient in order to compute the probability (4.19). It is also important to emphasize that only the weights between hidden and output layers are learnable, while the layer is fixed. ![Figure 4.6: GLM as a three-layer neural network with learnable weights only between hidden and output layers.](path/to/figure4_6.png) We remark that one should not confuse a graph such as the one in Fig. 4.6 with the BN representation previously seen in Fig. 2.7, which will be further discussed in Chapter 7. In fact, while BNs represent probability distributions, diagrams of neural networks such as in Fig. 4.6 describe deterministic functional relations among variables. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 105 Context: 4.5. Discriminative Probabilistic Models: Beyond GLM Put \[ \delta^{L+1} = (y - t) \quad (4.36a) \] \[ \delta' = (W^{(L+1)})^T \delta^{(l)} \cdot h'(a^{(l)}) \quad (4.36b) \] \[ \nabla_{w_{ij}} L(W) = \delta^{(h^{(l)})^T} \quad \text{for } l = 1, 2, \ldots, L + 1 \quad (4.36c) \] where \(h'( \cdot )\) denotes the first derivative of the function \(h( \cdot )\); the product \( \cdot \) is taken element-wise; and we set \( h^0 = x \). Backpropagation requires a forward pass and a backward pass for every considered training example. The forward pass uses the neural network as defined by equations (4.35). 
This entails multiplications by the weight matrices \(W^l\) in order to compute the activation vectors, as well as applications of the non-linear function \(h( \cdot )\). In contrast, the backward pass requires only linear operations, which, by (4.36b), are based on the transpose \((W^{(l)})^T\) of the weight matrix \(W^l\) used in the forward pass. The derivatives (4.36c) computed during the backward pass are of the general form \[ \nabla_{w_{ij}} L(W) = h^{(l-1)} \cdot \delta_j \quad (4.37) \] where \(w_{ij}\) is the \((i,j)\)th element of matrix \(W^l\) corresponding to the weight between the pre-synaptic neuron \(j\) in layer \(l-1\) and the post-synaptic neuron \(i\) in layer \(l\); \(h^{(l-1)}\) is the output of the pre-synaptic neuron \(j\); and \(\delta_j\) is the back-propagated error. The back-propagated error assigns “responsibility” for the error \(t_j - y_j\) measured at the last layer (layer \(L+1\)) to each synaptic weight \(w_{ij}\) between neuron \(j\) in layer \(l-1\) and neuron \(i\) in layer \(l\). The back-propagated error is obtained via the linear operations in (4.36b). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 107 Context: # 4.5. Discriminative Probabilistic Models: Beyond GLM of deep neural networks appears to defy one of the principles laid out in Chapter 2, which will be formalized in the next chapter: Highly over-parameterized models trained via ML suffer from overfitting and hence do not generalize well. Recent results suggest that the use of SGD, based on the gradient (4.37), may act as a regularizer following an MDL perspective. In fact, SGD favors the attainment of flat local maxima of the likelihood function. Flat maxima require a smaller number of bits to be specified with respect to sharp maxima; since, in flat maxima, the parameter vector can be described with limited precision as long as it remains within the flat region of the likelihood function [66, 79, 71] (see also [78] for a different perspective). Another important aspect concerns the hardware implementation of backprop. This is becoming extremely relevant given the practical applications of deep neural networks for consumers' devices. In fact, a key aspect of backprop is the need to propagate the error, which is measured at the last layer, to each synapse via the backward pass. This is needed in order to evaluate the gradient (4.37). While a software implementation of this rule does not present any conceptual difficulty, realizing the computations in (4.36) in hardware, or even on a biological neural system, is faced with a number of issues. 1. A first problem is the non-locality of the update (4.37). An update rule is local if it only uses information available at each neuron. In contrast, as seen, rule (4.37) requires back-propagation through the neural network. 2. Another issue is the need for the backward pass to use a different neural path with respect to the forward pass, given that, unlike the backward pass, the forward pass includes also non-linearities. A useful discussion of hardware implementation aspects can be found in [12] (see also references therein). To obviate at least some of these practical issues, a number of variations of the rule (4.37) have been proposed [12]. 
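To fix ideas, the sketch below traces the forward pass and the backprop recursions (4.36) for a network with one hidden layer and a sigmoid output trained with the cross-entropy loss, so that the output error is \( y - t \) as in (4.36a); the shapes, data, and initialization are assumed toy choices. The variations discussed next replace parts of exactly this backward pass.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=3)                     # input h^0 = x
t = 1.0                                    # binary target

W1 = rng.normal(scale=0.5, size=(4, 3))    # weights of layer 1 (hidden)
W2 = rng.normal(scale=0.5, size=(1, 4))    # weights of layer 2 (output)

h = lambda a: np.tanh(a)                   # hidden non-linearity
h_prime = lambda a: 1 - np.tanh(a) ** 2
sigmoid = lambda a: 1 / (1 + np.exp(-a))

# forward pass (4.35): activations and hidden outputs
a1 = W1 @ x
h1 = h(a1)
a2 = W2 @ h1
y = sigmoid(a2)

# backward pass (4.36): propagate the output error through the transposed weights
delta2 = y - t                             # (4.36a)
delta1 = (W2.T @ delta2) * h_prime(a1)     # (4.36b)

grad_W2 = np.outer(delta2, h1)             # (4.36c): back-propagated error times pre-synaptic output
grad_W1 = np.outer(delta1, x)

print(grad_W1.shape, grad_W2.shape)        # gradients ready for an SGD step
```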
For instance, the feedback alignment rule modifies (4.36) by using fixed random matrices in lieu of the current weight matrices \( W' \); while the broadcast alignment rule rewrites the vectors \( \mathbf{y} - \mathbf{t} \) as a linear function with fixed random coefficients of the error ( \( \mathbf{y} - \mathbf{t} \) ), hence removing the need for back-propagation [129]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 112 Context: where \( t \) is encoded using one-hot encoding, so that the label of each example is given by the vector \( t_n = [t_{n,1}, \ldots, t_{n,K}]^T \). Following the discussion above, moment matching yields the ML estimates \[ \pi_k, ML = \frac{N |[k]}{N} \sum_{n=1}^N t_{n,k} \quad (4.45a) \] \[ \mu_{k,ML} = \frac{1}{N |[k]|} \sum_{n=1}^N t_{n,k} x_n \quad (4.45b) \] \[ \Sigma_{k,ML} = \frac{1}{N |[k]|} \sum_{n=1}^N t_{n,k} (x_n - \mu_{k}) (x_n - \mu_{k})^T \quad (4.45c) \] ### 4.7 Boosting* In this section, we return to the mixture models of the form (4.29) and discuss a popular training approach to reduce overfitting. We focus on deterministic discriminative models with activations \( \alpha_k(x, \tilde{\theta}) \), \( k = 1, \ldots, K \), in which the mixture predictor is given as \[ a(x, \tilde{w}) = \sum_{k=1}^K \pi_k \alpha_k(x, \tilde{w}) \quad (4.46) \] with learnable parameters \( \{\pi_k\} \) and \( \{\tilde{w}\} \). The technique, known as boosting, trains one model \( \alpha_k(x, \tilde{w}) \) at a time in a sequential fashion, from \( h = 1 \) to \( h = K \), hence adding one predictor at each training step \( k \). As a result, boosting increases the capacity of the model in a sequential manner, controlling the sum in (4.46) over a growing number of predictors. In this way, one starts by training a model with a large bias, or approximation error; and progressively decreases the bias at the cost of a potentially larger estimation error (see Chapter 2 and the next chapter for further discussion on bias and estimation error). As we will discuss below, each model is trained by solving an ERM problem in which the contribution of a training example is weighted by the rate of the previously trained models. To elaborate, boosting—more specifically the AdaBoost scheme—can be described as solving an ERM problem with the exponential loss function \( l(t, a(x, \tilde{w})) = \exp(-t \cdot a(x, \tilde{w})) \), as plotted in Fig. 4.4. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 115 Context: 4.8. Summary ----------- Chapter 8 for details). Learning rate schedules that satisfy (4.49) include \(\gamma(t) = 1/t\). The intuitive reason for the use of diminishing learning rates is the need to limit the impact of the “noise” associated with the finite-sample estimate of the gradient [22]. The proof of convergence leverages the unbiasedness of the estimate of the gradient obtained by SGD. In practice, a larger mini-batch size \(S\) decreases the variance of the estimate of the gradient, hence improving the accuracy when close to a stationary point. However, choosing a smaller \(S\) can improve the speed of convergence when the current solution is far from the optimum [152, Chapter 8][22]. A smaller mini-batch size \(S\) is also known to improve the generalization performance of learning algorithms by avoiding sharp external points of the training loss function [66, 79] (see also Sec. 
4.5). Furthermore, as an alternative to decreasing the step size, one can also increase the size of the mini-batch along the iterations of the SGD algorithm [136]. Variations and Generalizations ------------------------------- Many variations of the discussed basic SGD algorithm have been proposed and routinely used. General principles motivating these schedule variants include [56, Chapter 8]: 1. **Momentum**, or heavy-ball, memory: correct the direction suggested by the stochastic gradient by considering the “momentum” acquired during the last update; 2. **Adapitivity**: use a different learning rate for different parameters depending on an estimate of the curvature of the loss function with respect to each parameter; 3. **Control variates**: in order to reduce the variance of the SGD updates, and control variates that do not affect the unbiasedness of the stochastic gradient; and 4. **Second-order updates**: include information about the curvature of the cost or objective function in the parameter update. As detailed in [56, Chapter 8][76, 43], to which we refer for further discussions, methods in the first category include Nesterov momentum; in the second category, we find AdaGrad, RMSProp, and Adam; and the third encompasses SVRG and SAGA. Finally, the fourth features Newton's method. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 117 Context: # 4.8. Summary The problem turns out to be quadratic and convex. Importantly, the resulting optimal activation can be expressed as \[ \alpha(x, w) = \sum_{n=1}^{N} \alpha_n k(x, z_n), \tag{4.50} \] where \(\alpha_n\) are the optimal dual variables, and we have defined the kernel function \[ k(x, y) = \phi(x)^\top \phi(y), \tag{4.51} \] where \(x\) and \(y\) are two argument vectors. The kernel function measures the correlation—informally, the similarity—between the two input vectors \(x\) and \(y\). The activation (4.50) has been an intuitive interpretation: the decision about the label of an example \(x\) depends on the support vectors \(z_n\), where \(\alpha_n > 0\), that are the most similar to \(x\). Note that equation (4.50) can also be justified using the representer theorem in [133, Chapter 16], which shows that the optimal weight vector must be a linear combination of the feature vectors \(\{\phi(z_n)\}_{n=1}^N\). Working in the dual domain can have computational advantages when the number of the primal variables, here the size \(D\) of the weight vector \(w\), is larger than the number \(N\) of dual variables. While this seems a priori unlikely to happen in practice, it turns out that this is not the case. The key idea is that one can use (4.50) with any other kernel function, not necessarily one explicitly defined by a feature function \(\phi\). A kernel function is any symmetric function measuring the correlation of two data points, possibly in an infinite-dimensional space. This is known as the kernel trick. As a first example, the polynomial kernel \[ k(x, y) = (x^\top y + r)^d, \tag{4.52} \] where \(r > 0\), corresponds to a correlation \(\phi(x)^\top \phi(y)\) in a high-dimensional space \(D\). For instance, with \(l = 1\) and \(D = 2\) we have \(D' = 6\) and the feature vector \(\phi(x) = [\sqrt{x_1}, x_1^2, x_2^2, \sqrt{x_1x_2}, x_1^3, x_2^3]\) [104]. As another, more extreme example, the conventional Gaussian kernel \[ k(x, y) = e^{-\|x - y\|^2} \tag{4.53} \] corresponds to an inner product in an infinite-dimensional space [104]. 
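As a small numerical check of the kernel trick, the sketch below verifies that the degree-2 polynomial kernel (4.52) with D = 2 equals an inner product \( \phi(x)^T \phi(y) \) in a D' = 6 dimensional feature space (the explicit \( \phi \) used here is one standard choice, written out only for illustration), and evaluates the Gaussian kernel (4.53).

```python
import numpy as np

def poly_kernel(x, y, r=1.0):
    # polynomial kernel (x^T y + r)^2
    return (x @ y + r) ** 2

def phi(x, r=1.0):
    # explicit 6-dimensional feature map matching the degree-2 polynomial kernel for D = 2
    x1, x2 = x
    return np.array([r,
                     np.sqrt(2 * r) * x1, np.sqrt(2 * r) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def gaussian_kernel(x, y):
    # Gaussian kernel (4.53); its feature space is infinite-dimensional
    return np.exp(-np.sum((x - y) ** 2))

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(poly_kernel(x, y), phi(x) @ phi(y))   # identical up to rounding
print(gaussian_kernel(x, y))
```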
An extensive discussion on kernel methods can be found in [104]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 118 Context: Before leaving the subject of kernel methods, it is worth noting that an important class of methods including **k–Nearest Neighbor (k-NN)** uses kernels that are data-dependent. k-NN is also an example of non-parametric learning rules. In contrast to the other schemes studied here, it does not rely on a parametric model of the (probabilistic) relationship between input and output. Instead, k-NN leverages the assumption that the labels of nearby points \( x \) should be similar [81]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 119 Context: # 5 ## Statistical Learning Theory* Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number \( N \) of available data points and the generalization performance of a trained machine. The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. This chapter is of mathematical nature, and it departs from the algorithmic focus of the text so far. While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 120 Context: # 5.1 A Formal Framework for Supervised Learning In this chapter, we concentrate on discriminative deterministic models for binary classification, as it is typically done in statistical learning theory. We also focus on the standard 0-1 loss \( \ell(t, \hat{t}) = \mathbb{I}\{t \neq \hat{t}\} \), for which the generalization loss is the probability of error. The labels \( t \) for the two classes take values in the set \{0, 1\} (cf. (4.4)). The learning problem is formalized as follows. Assume that a model, or hypothesis class, \( \mathcal{H} \) has been selected. This set contains a possibly uncountable number of predictors \( i \) that map each point \( x \) in the domain space to a label \( \ell(x) \in \{0, 1\} \). We would like to choose a specific hypothesis, or predictor, \( \hat{f} \in \mathcal{H} \) that minimizes the generalization error (cf. (2.2)). \[ L_{\mathcal{H}}(\hat{f}) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(t, \hat{f}(x))]. \tag{5.1} \] Solving this inference problem would yield an optimal model within class \( \mathcal{H} \) as \[ \hat{f} \in \arg\min_{f \in \mathcal{H}} L(f). \tag{5.2} \] The notation in (5.2) emphasizes that there may be multiple optimal hypotheses returned by the minimization of the generalization error \( L(\hat{f}) \). Nevertheless, to fix the ideas, it is useful to think of the case in which there is a unique optimal hypothesis. 
This is, for instance, the case when the loss function is strictly convex. Obtaining the optimal predictor (5.2) requires knowledge of the true distribution \( p(x, t) \), which is not available. ## Example 5.1 For the linear (deterministic) methods studied in Chapter 4, the model is defined as \[ \mathcal{H} = \{t_b(x) = \langle w, x \rangle + w_0\} \tag{5.3} \] with \( w \in \mathbb{R}^d \), and similarly for the feature-based version. Identifying a hypothesis within this class requires the selection of the weight vector \( w \in \mathbb{R}^d \). In lieu of the true distribution \( p(x, t) \), what is available is an i.i.d. training set \[ \mathcal{D} = \{(x_n, t_n)\}_{n=1}^N \sim p(x, t). \tag{5.4} \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 121 Context: # 5.1 A Formal Framework for Supervised Learning Distributed according to \( p(x, t) \). A learning algorithm, such as ERM, takes the training set \( D \) as input and returns a predictor \( \hat{t}_D \in \mathcal{H} \) as output. We would like the predictive model \( \hat{t}_D \in \mathcal{H} \) to yield a generalization error \( L_p(\hat{t}_D) \) that is as close as possible to the minimum generalization loss \( L_p(t^*) \). Note that the selected model \( \hat{t}_D \) is random due to randomness of the data set \( D \). In this regard, we recall that the ERM learning rule chooses a hypothesis \( \hat{t}^{ERM}_D \) by following the criterion \( \hat{t}^{ERM}_D = \arg\min_{\hat{t} \in \mathcal{H}} L_p(\hat{t}) \), where the empirical risk is \[ L_D(t) = \frac{1}{N} \sum_{i=1}^N L(t, t(x_i)) \tag{5.5} \] The notation in (5.5) emphasizes the randomness of the training set \( D = \{(x_{i}, t_{i})\}^N_{i=1} \). Since the distribution \( p(x, t) \) is unknown, a learning rule \( \hat{t}_D \), such as ERM, can only minimize the generalization loss \( L_p(\hat{t}) \) approximately based on the observation of the data \( D \). Furthermore, this approximation can only be guaranteed at some probability level due to the randomness of the data set \( D \). This is illustrated in Figure 5.1, in which we have represented a high-probability interval for \( L_p(\hat{t}) \) on the horizontal axis. We would like the approximation to be accurate for all values of \( L_p(\hat{t}) \) within this interval. But there is more: the probabilistic guarantee in terms of accuracy cannot depend on the specific distribution \( p(x, t) \), but it should instead be universal with respect to all distributions \( p(x, t) \). In summary, the best one can hope for is to have a learning rule \( \hat{t}_D \) that is Probably Approximately Correct (PAC). ## Definition 5.1 A learning rule \( \hat{t}_D \) is (N, \(\delta\)) PAC if, when working on data sets \( D \) of \( N \) examples, it satisfies the inequality \[ L_p(\hat{t}_D) \leq L_p(t^*) + \epsilon \tag{5.6} \] with probability no smaller than \( 1 - \delta \), that is, \[ \Pr_{D \sim p_{data}}[L_p(\hat{t}_D) \leq L_p(t^*) + \epsilon] \geq 1 - \delta \tag{5.7} \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 126 Context: # Statistical Learning Theory Let \( x \in \mathbb{R}^D \) yield no information on the value of \( i(x) \) for other values of \( x \). As another, less obvious, example the class \[ \mathcal{H} = \{ h_w(x) = 1(\sin(wx) > 0) \} \] is not PAC learnable despite being parameterized by a single scalar [133]. 
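To ground these definitions, the sketch below (an assumed toy problem) runs ERM over a small finite class of 1-D threshold classifiers under the 0-1 loss, and compares the empirical loss \( L_D \) with an estimate of the generalization loss \( L_p \) obtained from a large held-out sample standing in for the unknown distribution \( p(x, t) \).

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    t = (x > 0.25).astype(int)                       # underlying labelling rule
    t = t ^ (rng.uniform(size=n) < 0.1).astype(int)  # 10% label noise
    return x, t

thresholds = np.linspace(-1, 1, 21)                  # finite class H: t_hat(x) = 1(x > h)

def risk(h, x, t):
    return np.mean((x > h).astype(int) != t)         # 0-1 loss

x_tr, t_tr = sample(50)                              # training set D
x_pop, t_pop = sample(200_000)                       # proxy for p(x, t)

h_erm = thresholds[np.argmin([risk(h, x_tr, t_tr) for h in thresholds])]
print("ERM hypothesis:", h_erm)
print("empirical loss L_D:", risk(h_erm, x_tr, t_tr))
print("generalization loss L_p (approx.):", risk(h_erm, x_pop, t_pop))
```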
## Definition 5.3 The sample complexity \( N_{\mathcal{H}}( \epsilon, \delta ) \) of model \( \mathcal{H} \) is the minimal value of \( N_{\mathcal{H}}( \epsilon, \delta ) \) that satisfies the requirements of PAC learning for \( \mathcal{H} \). We will see next that the sample complexity depends on the capacity of the model \( \mathcal{H} \). Note that the sample complexity of the two examples above is infinite since they are not PAC learnable. We also remark that PAC learnability may be alternatively defined under the additional conditions on the scaling of \( N_{\mathcal{H}}( \epsilon, \delta ) \) as a function of \( \epsilon \) and \( \delta \), as well as on the computational complexity of the learning rule. We will not consider these more refined definitions here, and we refer the reader to [51, 133] for discussion. ## 5.3 PAC Learnability for Finite Hypothesis Classes In this section, we consider models with a finite number of hypotheses. The main result is summarized in the following theorem, which is proved below in Sec. 5.3.1. ### Theorem 5.1 A finite hypothesis class \( \mathcal{H} \) is PAC learnable with sample complexity satisfying the inequality \[ N_{\mathcal{H}}( \epsilon, \delta ) \leq \frac{2 \ln | \mathcal{H} | + 2 \ln(2 / \delta)}{\epsilon^2} \leq N_{\text{ERM}}^{\mathcal{H}}( \epsilon, \delta ). \] Moreover, the ERM algorithm achieves the upper bound \( N_{\text{ERM}}^{\mathcal{H}}( \epsilon, \delta ) \). The previous theorem shows that all finite classes are PAC learnable. Furthermore, for all classes \( \mathcal{H} \), ERM is a PAC learning rule for any desired levels of accuracy and confidence \( (\epsilon, \delta) \), as long as \( N \) is larger than the threshold \( N_{\text{ERM}}^{\mathcal{H}}( \epsilon, \delta ) \). This threshold, which we will refer to, #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 129 Context: ### 5.3. PAC Learnability for Finite Hypothesis Classes We can now write the following sequence of equalities and inequalities, which prove the lemma and hence conclude the proof: \[ \Pr_{D \sim P_{data}} \left[ \exists h \in \mathcal{H} : |L_D(h) - L_D(\phi)| > \frac{\epsilon}{2} \right] \] \[ = \Pr_{D \sim P_{data}} \left[ \bigcup_{h \in \mathcal{H}} \left\{ |L_D(h) - L_D(\phi)| > \frac{\epsilon}{2} \right\} \right] \] \[ \leq \sum_{h \in \mathcal{H}} \Pr_{D \sim P_{data}} \left[ |L_D(h) - L_D(\phi)| > \frac{\epsilon}{2} \right] \] \[ \leq 2 \sum_{h \in \mathcal{H}} \exp \left( -\frac{N \epsilon^2}{2} \right) = 2|\mathcal{H}| \exp \left( -\frac{N \epsilon^2}{2} \right) \leq \delta, \] where the first inequality follows by the union bound; the second by Hoeffding's inequality; and the third can be verified to be true as long as the inequality \( N \geq \frac{2}{\epsilon^2} \log \frac{2|\mathcal{H}|}{\delta} \) is satisfied. ### 5.3.2 Structural Risk Minimization The result proved above is useful also to introduce the Structural Risk Minimization (SRM) learning approach. SRM is a method for joint model selection and hypothesis learning that is based on the minimization of an upper bound on the generalization loss. In principle, the approach avoids the use of validation and has deep theoretical properties in terms of generalization. In practical applications, the approach is rarely used, and validation is often preferable. It is nevertheless conceptually and theoretically a cornerstone of statistical learning theory. 
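As a quick numerical reading of Theorem 5.1 (with illustrative, arbitrarily chosen accuracy and confidence levels), the snippet below shows how the number of examples sufficient for ERM scales logarithmically with the size of the hypothesis class and quadratically with \( 1/\epsilon \).

```python
import numpy as np

def n_erm_bound(H_size, eps, delta):
    # upper bound on the sample complexity of ERM for a finite class (Theorem 5.1)
    return (2 * np.log(H_size) + 2 * np.log(2 / delta)) / eps ** 2

for H_size in (10, 1_000, 1_000_000):
    print(H_size, int(np.ceil(n_erm_bound(H_size, eps=0.1, delta=0.05))))
# roughly 1199, 2120 and 3501 examples for |H| = 10, 10^3 and 10^6
```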
To elaborate, assume that we have a nested set of hypothesis classes \(\mathcal{H}_1 \subset \mathcal{H}_2 \subset \ldots \subset \mathcal{H}_M\). From Lemma 5.2, we can obtain the following bound: \[ L_D(h) \leq L_D(\phi) + \sqrt{ \frac{2\mathcal{H} |\mathcal{W}|}{N}} \quad (5.20) \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 155 Context: ``` 6.4. Directed Generative Models We will see below that the EM algorithm leverages this observation in the M step. Furthermore, it will be observed that EM for mixture of Gaussians generalizes the K-means algorithm. **E step.** In the E step, as per (6.26), we need to solve the inference problem of computing the posterior \( p(z_n | x_n, \theta) \) for every example \( n \) and cluster index \( k = 0, \ldots, K-1 \). In this case, this can be done directly via Bayes’ theorem since the normalization requires to sum only over the \( K \) possible values taken by \( z_n \), yielding \[ p(z_n = k | x_n, \theta) = \frac{\pi_k N(x_n | \mu_k, \Sigma_k)}{\sum_{j=0}^{K-1} \pi_j N(x_n | \mu_j, \Sigma_j)}. \tag{6.29} \] **M step.** In the M step, we need to maximize the negative energy function \( Q(\theta, \phi) \) defined in (6.27). Each term in the sum can be computed directly as \[ E_{\theta, \phi | x_{\text{obs}}, \theta} [ \ln p(z_n | x_n, \theta) ] = \frac{K-1}{\sum_{k=0}^{K-1} \Sigma_k \ln \pi_k + \ln N(x_n | \mu_k, \Sigma_k)}. \tag{6.30} \] with \[ \bar{x}_k = E_{\theta, \phi | x_{\text{obs}}, \theta} [x_n | z_n = k] = p(z_n = k | x_n, \theta). \tag{6.31} \] As it can be easily checked, the function \( Q(\theta, \phi) \) equals the LL function of the QDA supervised problem, in which the variables \( z_n \) are observed, with the following caveat: the sufficient statistics \( z_n \) are replaced by their posterior averages \( \bar{x}_k \). As a result, as opposed to supervised learning, each example \( x_n \) is being part of all clusters, with the “responsibility” of cluster \( k \) being given by \( \bar{x}_k \). Having made this observation, we can now easily optimize the \( Q(\theta, \phi) \) function by using the ML solutions (4.45) for QDA by substituting \( \bar{x}_k \) for the observed variable \( x_k \). Setting \( \Sigma_k = \ell \) as a known parameter and letting \( \epsilon \to 0 \) recovers the K-means algorithm (Zaslavsky et al., p. 443). ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 161 Context: 6.5 Undirected Generative Models --------------------------------- The samples \(x\) are hence modelled as the output of a generator function \(G(z)\), whose input \(z\) is given by i.i.d. Gaussian variables. This generative model can hence be interpreted as a generalization of PPCA to non-linear encoders \(G(z)\). The latter is conventionally modelled again as a multi-layer neural network. Problem (6.41) is typically tackled using SGD (Chapter 4) by iterating between the optimization of the generator parameters \(\theta\) and of the discriminator parameters \(\phi\) (see [8, Algorithm 11]). Learning requires empirical approximations for the evaluation of the gradients that will be discussed in Chapter 8. ### Likelihood ratio estimation viewpoint. 
When no constraints are imposed over the discriminator \(T(x)\) in (6.30) and the function \(g\) is selected as \(g(t) = \exp(t – 1)\), the optimal solution is given as \(T(z) = 1 + \ln(p(x)/p(x|y))\) and the corresponding divergence measure (6.39) is the KL divergence \(KL(p(y) || p(y | \phi))\). Therefore, solving problem (6.30) over a sufficiently general family of functions \(T_\phi(x)\) allows one to obtain an estimate of the log-likelihood ratio \(\ln(p(x)/p(x | y))\). As a result, GANs can be interpreted as carrying out the estimation of likelihood ratios between the data and the model distributions as a step in the learning process. The idea of estimating likelihood ratios is also useful in other learning problems based on variational inference, to be discussed in Chapter 5 (see [70]). ### Some research topics. Among topics of current research, we mention here the problem of regularization of GANs [123]. We also note that GANs can also be extended to supervised learning problems [108]. The GAN approach of formulating the divergence as an optimization over a discriminator function has also recently found to be useful for ICA [29]. 6.5 Undirected Generative Models --------------------------------- Unlike directed models, undirected generative models posit a joint distribution for the hidden and the observed variables that captures affinity, rather than causality, relationships between the two sets of variables, as seen in Fig. 6.1. In this section, we discuss a representative example of undirected generative models that is considered #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 164 Context: Learning is typically done by means of an approximate SGD method that leverages MC techniques. Recalling the general formula (3.28) for the exponential family, the gradient at the current iteration \( \theta = \theta_{\text{gold}} \) can be computed using (6.22) as \[ \nabla_{\theta} \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = E_{x \sim p_{\theta_{\text{gold}}}(x)}[z_i z_j] - E_{x \sim p_{\theta}(x)}[z_i z_j | \theta]. \] which can be further simplified using (6.45). The gradient presents the structure, noted in Chapter 3, given by the difference between a positive component, which depends on the data \( z \), and a negative component, which instead requires an ensemble average over all other possible samples \( x \sim p_{\theta_{\text{gold}}} \). Similar relations can be obtained for the gradients with respect to \( a \) and \( b \), namely \[ \nabla_a \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = E_{x \sim p_{\theta_{\text{gold}}}(x)}[z] - E_{x \sim p_{\theta}(x)}[z], \] and \[ \nabla_b \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = z_j - E_{x \sim p_{\theta}(x)}[x_j]. \] In order to evaluate the expectations in the negative components of the gradients, one typically uses a method known as Markov Chain Monte Carlo (MCMC) to be discussed in Chapter 8. More specifically, a simplified approach known as Contrastive Divergence (CD) has been found to be an effective approximation. Accordingly, one “clamps” the visible variables to the observed variables \( x = x \) and generates the sequence \( x \to x^{(1)} \to x^{(2)} \), using the conditional probability (6.45a) to generate \( z \) from \( x \). The resulting samples are used to approximate the expectations as \[ \nabla_{\theta} \ln p(x|\theta) |_{\theta = \theta_{\text{gold}}} = E_{x^{(0)}}[z_i^{(0)}] - E_{x^{(1)}}[z_i^{(0)}]. 
\] and similar expressions apply for the other gradients. The CD scheme can also be generalized to CD-k by increasing the Markov chain sequence to \( k \) steps, and using the resulting samples \( x^{(k)} \) in lieu of \( x^{(1)} \) and \( x^{(0)} \). Extensions and application of the RBM are discussed in [20, 153]. Generalizations of RBMs with multiple layers of hidden variables are referred to as Deep Boltzmann Machines. We refer to [64] for discussion about training and applications. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 167 Context: 6.7. Autoencoders ================= As seen in Fig. 6.1, autoencoders include parametric models for both encoder and decoder. We focus here for brevity on deterministic autoencoders, in which the encoder is defined by a function \( z = F_e(x) \) and the decoder by a function \( x = G_d(z) \). Note that the parameters defining the two functions may be tied or may instead be distinct, and that the notation is general enough to capture both cases. We will mention at the end of this section probabilistic autoencoders, in which encoder and decoder are defined by conditional probability distributions. Autoencoders transform the unsupervised learning problem of obtaining a representation \( z = F_e(x) \) of the input \( x \) based solely on unlabeled examples \( \{x_n\}_{n=1}^N \) into an instance of supervised learning. They do so by considering the encoder \( z = F_e(x) \) with a decoder \( x = G_d(z) \), so as to obtain the input-output relationship \( x = G_d(F_e(x)) \). The key idea is to train this function by setting the target \( t \) to equal the input \( x \). As such, the machine learns to obtain an intermediate representation \( z = F_e(x) \) that makes it possible to recover a suitable estimate \( t = G_d(z) \approx x \). To formalize the approach, learning is typically formulated in terms of the ERM problem: \[ \min_{\theta} \sum_{n=1}^{N} \mathcal{L}(x_n, G_d(F_e(x_n))), \] in which, as explained, the encoder-decoder mapping \( G_d(F_e(\cdot)) \) is trained to reproduce the input at its output. In the absence of constraints on encoder and decoder, the problem above trivially reverts to identity mapping, i.e., \( G_d(F_e(x)) = x \). In order to potentially learn a useful model, one should impose some constraints, such as dimensionality or sparsity, on the latent vector \( z \). Some notable examples are discussed next. PCA ---- PCA assumes linear encoder and decoder, and tests their weight matrices via a transposition. Specifically, PCA sets the encoder #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 171 Context: # Part IV ## Advanced Modelling and Inference #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 172 Context: # 7 ## Probabilistic Graphical Models As we have seen in the previous chapters, probabilistic models are widely used in machine learning. Using Fig. 6.1 as an example, we have encountered both directed and undirected models, which have been used to carry out supervised and unsupervised learning tasks. Graphical models encode structural information about the random variables of interest, both observed and latent. They hence provide a principled way to define parametric probabilistic models with desired features. 
The selection of a probabilistic graphical model hence follows the same general rules that have been discussed so far: A more specialized, or structured, model may help reduce overfitting, and hence the generalization gap. This is done, as we will see, by reducing the number of parameters to be learned. On the flip side, specialization may come at the cost of an irrecoverable bias. In this chapter, we provide an introduction to the vast field of probabilistic graphical models, which is a powerful framework that allows us to represent and learn structured probabilistic models. The goal here is to introduce the main concepts and tools, while referring to the extensive treatments in [81, 15, 104, 151] for additional information.

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 176 Context: ``` to conditional independence assumptions. For instance, in the naive Bayes model, the word indicators are conditionally independent given the topic. As we will see in the rest of this chapter, conditional independence assumptions translate into factorizations of the joint distributions of the variables under study. Factorizations and associated conditional independence properties can be represented by three different graphical frameworks, namely BNs, MRFs, and factor graphs. For brevity, this chapter will only focus on the first two. ## 7.2 Bayesian Networks This section provides a brief introduction to BNs by focusing on key definitions and on the problem of ML learning, with some notes on MAP and Bayesian viewpoints. ### 7.2.1 Definitions and Basics BNs encode a probability factorization or, equivalently, a set of conditional independence relationships through a directed graph. The starting point is the chain rule for probabilities for a generic set of \( K \) random variables (rvs) \( \{x_1, ..., x_K\} \): \[ p(x_1, ..., x_K) = p(x_1)p(x_2|x_1)...p(x_K|x_1, ..., x_{K-1}) \tag{7.3} \] where the order of the variables is arbitrary. The factorization (7.3) applies for a generic joint distribution, and it does not encode any additional structural information. Note that the notation here is general and not meant to indicate that the variables are necessarily observed. #### Example 7.3 Consider again the naive Bayes model for text classification/clustering. There, we imposed the structural constraint that word indicators \(\{x_w\}_{w=1}^W\) be conditionally independent given the topic \(t\). This conditional independence assumption can be expressed using the “perp” notation: \[ x_w \perp \{x_{w'}: w' \neq w\}|t. \tag{7.4} \] ```

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 189 Context: # 7.4. Bayesian Inference in Probabilistic Graphical Models The EM algorithm requires the evaluation of the posterior \( p(z | x, \theta) \) of the latent (unobserved) variables \( z \) given the current iterate \( \theta \). As discussed, when performing Bayesian inference, we can distinguish between observed variables \( x \), and latent variables \( z \). In general, only a subset of latent variables may be of interest, say \( z_{i} \), with the rest of the random variables denoted as \( z_{-i} \). The quantity to be computed is the posterior distribution: \[ p(z_{i} | x) = \frac{p(x, z_{i})}{p(x)}, \tag{7.22} \] where \[ p(x, z_{i}) = \sum_{z_{-i}} p(x, z), \tag{7.23} \] and \[ p(x) = \sum_{z_{i}} p(x, z_{i}).
\tag{7.24} \] with the sum being replaced by an integral for continuous variables. The key complication in evaluating these expressions is the need to sum over potentially large sets, namely the domains of variables \( z_{-i} \) and \( z_{i} \). Note that the sum in (7.23), which appears at the numerator of (7.22), is over all hidden variables that are of no interest. In contrast, the sum in (7.24), which is at the denominator of (7.22), is over the variables whose posterior probability (7.22) is the final objective of the calculation.

## Example 7.10

Consider an HMM, whose BN is shown in Fig. 7.3. Having learned the probabilistic model, a typical problem is that of inferring a given hidden variable \( z \) given the observed variables \( x = \{x_{1}, \ldots, x_{D}\} \). Computing the posterior \( p(z | x) \) requires the evaluation of the sums in (7.23) and (7.24). When the hidden variables \( z_{1}, \ldots, z_{D} \) are discrete with a common alphabet \( Z \), the complexity of step (7.23) is of the order \( |Z|^{D-1} \), since one needs to sum over the \( |Z|^{D-1} \) possible values of the hidden variables.

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 192 Context: # 8 ## Approximate Inference and Learning In Chapters 6 and 7, we have seen that learning and inference tasks are often made difficult by the need to compute the posterior distribution \( p(z|x) \) of an unobserved variable \( z \) given an observed variable \( x \). This task requires the computation of the normalizing marginal: \[ p(x) = \sum_{z} p(x, z), \tag{8.1} \] where the sum is replaced by an integral for continuous variables.¹ This computation is intractable when the alphabet of the hidden variable \( z \) is large enough. Chapter 7 has shown that the complexity of computing (8.1) can be alleviated in the special case in which the factorized joint distribution \( p(x, z) \) is defined by specific classes of probabilistic graphical models. What to do when the complexity of computing (8.1) is excessive? In this chapter, we provide a brief introduction to two popular approximate inference approaches, namely MC methods and Variational Inference (VI). We also discuss their application to learning. As for the
¹ Note that this task is of the same form as (7.23) after appropriate redefinitions of the variables.

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 197 Context: # 8.2. Variational Inference value \( x = x \). A potentially more efficient solution is to define an inference variational distribution \( q(z | x, \phi) \), which models the posterior distribution of \( z \) for any value of \( x = x \). The inference distribution is parameterized by a vector \( \phi \), and is typically implemented using a multi-layer neural network. In this case, it is typically referred to as an inference network. This approach is referred to as amortized VI. Amortized VI has the key advantage that, once the inference distribution is learned, one does not need to carry out \( I \)-projections for previously unobserved values of \( x = x \). Instead, one can directly apply \( q(z | x, \phi) \) for the learned values of parameter \( \phi \) [80].
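To make the amortized idea concrete before the formal problem statement that follows, here is a small numpy sketch under assumptions that are not in the excerpt: a toy Gaussian model \( z \sim \mathcal{N}(0,1) \), \( x|z \sim \mathcal{N}(z, \sigma^2) \), for which the exact posterior \( p(z|x) \) is available in closed form, and a linear-Gaussian inference family \( q(z|x;\phi) = \mathcal{N}(a x + b, s^2) \) with a single parameter vector \( \phi = (a, b, \log s) \) shared across all data points. Minimizing the KL divergence averaged over the data set recovers the posterior coefficients; afterwards no per-\(x\) optimization is needed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy generative model (an assumption of this sketch, not from the text):
#   z ~ N(0, 1),  x | z ~ N(z, sigma2)
# so the exact posterior is p(z|x) = N(x / (1 + sigma2), sigma2 / (1 + sigma2)).
sigma2 = 0.5
post_var = sigma2 / (1.0 + sigma2)
xs = rng.normal(size=500) * np.sqrt(1.0 + sigma2)   # samples of x from the model

# Amortized inference distribution q(z|x; phi) = N(a*x + b, exp(2*rho)),
# with phi = (a, b, rho) shared across all data points ("amortized").
a, b, rho = 0.0, 0.0, 0.0
lr = 0.05
for step in range(3000):
    m_q, v_q = a * xs + b, np.exp(2 * rho)           # q means and variance
    m_p, v_p = xs / (1.0 + sigma2), post_var          # exact posterior moments
    # Analytic gradients of the average Gaussian KL(q(z|x) || p(z|x)) over the data
    dKL_dm = (m_q - m_p) / v_p
    grad_a = np.mean(dKL_dm * xs)
    grad_b = np.mean(dKL_dm)
    grad_rho = v_q / v_p - 1.0                        # d/d rho of the Gaussian KL
    a, b, rho = a - lr * grad_a, b - lr * grad_b, rho - lr * grad_rho

print("learned a, b, var :", a, b, np.exp(2 * rho))
print("optimal a, b, var :", 1.0 / (1.0 + sigma2), 0.0, post_var)
```

In practice the linear map \( a x + b \) is replaced by a neural network (the inference network), and the expectations are approximated by Monte Carlo rather than computed in closed form, but the structure of the amortized objective is the same.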
The inference distribution \( q(z | x, \phi) \) can be obtained by solving the amortized \( I \)-projection problem \[ \min_{\phi} \mathbb{E}_{x \sim \mathcal{D}}[KL(q(z|x; \phi) \| p(z | x))] \] where the ensemble average is, in practice, replaced by an empirical average over available data points \( \{x_i\} \). The solution of the VI problem (8.12) is hence "amortized" across multiple values of \( x \).

**M-projection.** By Theorem 6.1, the \( I \)-projection maximizes the ELBO, which is a lower bound on the log-distribution of the observed data \( x \). This gives \( I \)-projections a strong theoretical justification grounded in the ML learning principle. Recalling that the KL divergence is not symmetric (see Sec. 2.6 and Appendix A), one could also define the alternative problem \[ \min_{\phi} KL(p(z|x) \| q(z | x; \phi)). \] The solution \( q(z|x; \phi) \) to this problem is known as the M-projection of the distribution \( p(z|x) \) onto the set of distributions \( \{q(z | x; \phi)\} \) defined by the given parameterization. As for the counterpart (8.7), this problem does not appear to be solvable since it requires knowledge of the desired posterior \( p(z|x) \). However, the problem turns out to have an easy solution if \( q(z|x) \) belongs to the exponential family. In fact, the gradient with respect to the natural parameters \( \phi \) of the KL divergence in (8.7) can be computed by following the same steps detailed for ML learning in Sec. 3.3. The upshot of this computation and of the enforcement of the optimality condition

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 203 Context: As discussed in Sec. 7.4 for factor graphs with loops, simplifications of loopy belief propagation include Approximate Message Passing (AMP). When applying M-projections with factors restricted to lie in the exponential family, the approach yields the Expectation Propagation method. We refer to [81, 114] for details.

## 8.3 Monte Carlo-Based Variational Inference*

The VI methods described in the previous section require the evaluation of expectations with respect to the variational posterior \( q(\mathbf{z}|\mathbf{x}) \) (see, e.g., (8.20)). The feasibility of this operation relies on specific assumptions about the variational posterior, such as its factorization properties and membership of the exponential family. It is hence desirable to devise methods that do not require the exact computation of the ensemble averages with respect to the variational posterior \( q(\mathbf{z}|\mathbf{x}) \). As we will see below, this is possible by combining VI methods with MC approximations. The resulting methods can leverage SGD, scale to large data sets, and have found a large number of recent applications [24, 25, 7]. The key idea of MC-based VI is to approximate the KL divergence (8.10) via MC by drawing one or more samples \( z_m \sim q(\mathbf{z}) \), with \( m = 1, \ldots, M \), and by computing the empirical average: \[ KL(q(\mathbf{z}) \| p(\mathbf{x}, \mathbf{z})) \approx \frac{1}{M} \sum_{m=1}^{M} (\ln q(z_m) - \ln p(\mathbf{x}, z_m)). \tag{8.24} \] In order to optimize over the parameters \( \phi \) of the variational posterior \( q(\mathbf{z}|\phi) \), it is important to note that we can instead work to approximate the gradient of the KL divergence, as discussed next.

## REINFORCE Approach

To proceed, assume the following two rather mild conditions on the parameterized variational posterior: 1.
It is easy to draw samples from \( q(\mathbf{z}|\phi) \), and 2. it is possible to compute the gradient \( \nabla_\phi \ln q(\mathbf{z}|\phi) \). We can now develop an SGD-based scheme as follows. The main result is that the gradient of the KL divergence in the I-projection problem (8.11) with respect to the variational parameters \( \phi \) can be written as: \[ \nabla_\phi KL(q(\mathbf{z}|\phi) \| p(\mathbf{x}, \mathbf{z})) = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\phi)}\big[\nabla_\phi \ln q(\mathbf{z}|\phi)\,\big(\ln q(\mathbf{z}|\phi) - \ln p(\mathbf{x}, \mathbf{z})\big)\big]. \tag{8.25} \]

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 209 Context: # 9 ## Concluding Remarks This monograph has provided a brief introduction to machine learning by focusing on parametric probabilistic models for supervised and unsupervised learning problems. It has endeavored to describe fundamental concepts within a unified treatment starting from first principles. Throughout the text, we have also provided pointers to advanced topics that we have only been able to mention or shortly touch upon. Here, we offer a brief list of additional important aspects and open problems that have not been covered in the preceding chapters. - **Privacy:** In many applications, data sets used to train machine learning algorithms contain sensitive private information, such as personal preferences for recommendation systems. It is hence important to ensure that the learned model does not reveal any information about the individual entries of the training set. This constraint can be formulated using the concept of differential privacy. Typical techniques to guarantee privacy of individual data points include adding noise to gradients when training via SGD and relying on mixtures of experts trained with different subsets of the data [1]. - **Robustness:** It has been reported that various machine learning models, including neural networks, are sensitive to small variations in input data.

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 211 Context: - **Domain adaptation:** In many learning problems, the available data has a different distribution from the data on which the algorithm will be tested. For instance, in speech recognition, one has available data for a given user at learning time, but it may be desirable, after learning, to use the same machine for another user. For supervised learning, this is typically modeled by assuming that the distribution of the covariates \( x \) is different during training and test, while the discriminative conditional distribution \( p(y|x) \) is the same for both phases [149]. A generalization of PAC theory analyzes domain adaptation, obtaining bounds on the generalization error under the desired test distribution as a function of the difference between the training and test distributions [26]. - **Communication-efficient learning:** In distributed computing platforms, data is typically partitioned among the processors and communication among the processors entails latency and energy consumption. An important research problem is that of characterizing the best trade-off between learning performance and communication overhead [160]. - **Reinforcement learning:** Reinforcement learning is at the heart of recent successes of machine learning methods in acquiring the skills necessary to play video games or games against human opponents (see, e.g., [99]).
In reinforcement learning, one wishes to learn the optimal mapping, say \( \phi(x, \theta) \), between the observed state \( x \) of the world and an action \( a \). Unlike supervised learning, the optimal action \( a \) is not known, but the machine observes a reward/punishment signal depending on the effect of the action. A popular approach, referred to as deep reinforcement learning, models the mapping \( \phi(x, \theta) \) using a deep neural network. This is trained to maximize the average reward via SGD by using the REINFORCE method (Chapter 8) to estimate the gradient [135, 88, 7, 9].

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 213 Context: # Appendix A: Information Measures In this appendix, we provide a principled and intuitive introduction to information measures that builds on inference, namely estimation and hypothesis testing. We focus on entropy, mutual information, and divergence measures. We concentrate on discrete rvs. In the monograph, we have taken the pragmatic approach of extending the definitions to continuous variables by substituting sums with integrals. It is worth noting that this approach does not come with any practical complications when dealing with mutual information and divergence. Instead, the continuous version of the entropy, known as differential entropy, should be treated with care, as it does not satisfy some key properties of the entropy such as non-negativity. ## A.1 Entropy As proposed by Claude Shannon, the amount of information received from the observation of a discrete random variable \( X \sim p(x) \) defined over a finite alphabet should be measured by the amount of uncertainty about its value prior to its measurement. To this end, we consider the problem of estimating the value of \( x \) when one only knows the following:

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 217 Context: A.2. Conditional Entropy and Mutual Information ----------------------------------------------- defined as the minimum average loss $$ H_\ell(x|y=y) = \min_{\hat{x}} \mathbb{E}_{p_{x|y}}[\ell(x, \hat{x}) \mid y = y]. \tag{A.7} $$ Note that this definition is consistent with (A.1) as applied to the conditional pmf \(p_{x|y}\). Averaging over the distribution of the observation \(y\) yields the generalized conditional entropy $$ H_\ell(x|y) = \mathbb{E}_{p_y}[H_\ell(x|y=y)]. \tag{A.8} $$ It is emphasized that the generalized conditional entropy (A.8) depends on the joint distribution \(p_{xy}\), while (A.7) depends only on the conditional pmf \(p_{x|y}\). For the squared error, the generalized conditional entropy can be easily seen to be the average conditional variance \(H_\ell(x|y) = \mathbb{E}_{p_y}[\mathrm{Var}(x|y)]\), since the a posteriori mean \(\hat{x}(y) = \mathbb{E}_{p_{x|y}}[x | y]\) is the optimal estimate. For the 0-1 loss, the generalized conditional entropy \(H_\ell(x|y)\) is instead equal to the minimum probability of error for the detection of \(x\) given \(y\), which is achieved by the MAP estimate \(\hat{x}(y) = \arg\max_{x} p_{x|y}\). ### Distributional Estimate Assume now that we are allowed to choose a pmf \(\hat{p}_{x|y}\) as the estimate of \(x\) given the observation \(y\) and that we measure the estimation loss via a function \(\ell(x, \hat{p})\) as in (A.3).
The definition of generalized conditional entropy for a given value of \(y = y\) follows directly from the arguments above and is given as \(H_\ell(p_{x|y})\), while the generalized conditional entropy is given by (A.8). With the log-loss function, the definition above can be seen to coincide with the Shannon conditional entropy \(H(x|y) = \mathbb{E}_{p_{xy}}[-\ln p(x|y)]\). If \(x\) and \(y\) are independent, we have the equality \(H_\ell(x|y) = H_\ell(x)\). Furthermore, since in (A.7) we can always choose estimates that are independent of \(y\), we generally have the inequality \(H_\ell(x|y) \leq H_\ell(x)\); observing \(y\) on average can only decrease the entropy. Note, however, that it is not true that \(H_\ell(p_{x|y})\) is necessarily smaller than \(H_\ell(x)\) for every value of \(y\) [18].

### Mutual Information

The inequality \(H_\ell(x|y) \leq H_\ell(x)\) justifies the definition of generalized mutual information with respect to the given loss function as $$ I_\ell(x;y) = H_\ell(x) - H_\ell(x|y). \tag{A.9} $$

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 219 Context: A.3. Divergence Measures ======================== Given the expectation, \( T(x) \) is large for values of \( x \) generated from \( q_x \). The function can be used to define the relative importance of errors made in favor of one distribution or the other. From this discussion, the optimal value of the problem can be taken to be a measure of the distance between the two pmfs. This yields the following definition of divergence between two pmfs: \[ D_f(p_x \| q_x) = \max_{T(\cdot)} \mathbb{E}_{p_x} [T(x)] - \mathbb{E}_{q_x} [g(T(x))], \tag{A.11} \] where the subscript \( f \) will be justified below. Under suitable differentiability assumptions on the function \( g \) (see [107] for generalizations), taking the derivative with respect to \( T(x) \) for all \( x \in \mathcal{X} \) yields the optimality condition \( g'(T(x)) = p(x)/q(x) \). This relationship reveals the connection between the optimal detector \( T(x) \) and the LLR \( p(x)/q(x) \). Plugging this result into (A.11), it can be directly checked that the following equality holds [105]: \[ D_f(p \| q) = \mathbb{E}_{q_x} \left[ f\!\left( \frac{p(x)}{q(x)} \right) \right]. \tag{A.12} \] Here, \( f \) is the convex dual function of \( g \), which is defined as \( f(x) = g^*(x) = \sup_{t} (xt - g(t)) \). Note that the dual function \( f \) is always convex [28]. Under the additional constraint \( f(1) = 0 \), definition (A.12) describes a large class of divergence measures parametrized by the convex function \( f \), which are known as f-divergences or Ali-Silvey distance measures [45]. Note that the constraint \( f(1) = 0 \) ensures that the divergence is zero when the pmfs \( p_x \) and \( q_x \) are identical. Among their key properties, f-divergences satisfy the data processing inequality [45]. For example, the choice \( g(t) = e^t - 1 \) gives the dual convex function \( f(x) = x \log(x) - x + 1 \), and the corresponding divergence measure (A.12) is the standard KL divergence \( D_{KL}(p \| q) \). As another instance of f-divergence, the choice \( g(t) = -\log(2 - e^t) \) yields the optimal detector \( T(x) = \ln \left( \frac{2 p(x)}{p(x) + q(x)} \right) \), and \( D_f(p \| q) \) becomes the Jensen-Shannon divergence.²
² The Jensen-Shannon divergence can also be interpreted as the mutual information \( I(s; x) \).
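As a quick numerical check of the variational characterization (A.11), the following numpy snippet (the alphabet and the two pmfs are arbitrary choices for this sketch) evaluates the objective \( \mathbb{E}_{p}[T(x)] - \mathbb{E}_{q}[g(T(x))] \) with \( g(t) = e^t - 1 \). Plugging in the optimal detector \( T(x) = \ln(p(x)/q(x)) \) recovers the KL divergence exactly, while any perturbed detector gives a strictly smaller value, since the objective is concave in \( T \).

```python
import numpy as np

rng = np.random.default_rng(2)

# Two pmfs p and q on a small alphabet (arbitrary values for this sketch).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

kl_direct = np.sum(p * np.log(p / q))

def bound(T, g=lambda t: np.exp(t) - 1.0):
    """Variational objective E_p[T(x)] - E_q[g(T(x))] from (A.11)."""
    return np.sum(p * T) - np.sum(q * g(T))

T_opt = np.log(p / q)                       # optimal detector for g(t) = e^t - 1
print("KL(p||q) computed directly :", kl_direct)
print("bound at the optimal T     :", bound(T_opt))        # equals KL(p||q)
for _ in range(3):
    T_rand = T_opt + 0.5 * rng.normal(size=p.size)          # any other T is worse
    print("bound at a perturbed T     :", bound(T_rand))
```

This is exactly the mechanism exploited by GANs: when the detector \( T \) is restricted to a parametric family, maximizing the same objective yields an estimate of the likelihood ratio and hence a lower bound on the divergence.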
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 220 Context: The latter can be written as: $$ JS(p_x \| q_x) = KL(p_x \| m_x) + KL(q_x \| m_x), \tag{A.13} $$ where \( m_x(x) = (p_x(x) + q_x(x))/2 \). Another special case, which generalizes the KL divergence and other metrics, is the \( \alpha \)-divergence discussed in Chapter 8 (see (8.16)), which is obtained with \( f(x) = \frac{\alpha(x - 1) - (x^\alpha - 1)}{\alpha(1 - \alpha)} \) for some real-valued parameter \( \alpha \). We refer to [107, 45] for other examples. The discussion above justified the adoption of the loss function (A.11) in a heuristic fashion. It is, however, possible to derive formal relationships between the error probability of binary hypothesis testing and \( f \)-divergences [21]. We also refer to the classical Sanov lemma and Stein lemma as fundamental applications of the KL divergence to large deviations and hypothesis testing [38].
² The Jensen-Shannon divergence, as defined above, is proportional to the mutual information \( I(s; x) \) for the joint distribution \( p_{s,x}(s,x) = \frac{1}{2} p_{x|s}(x|s) \) with binary \( s \), and conditional pmf defined as \( p_{x|s}(x|s=0) = p_x(x) \) and \( p_{x|s}(x|s=1) = q_x(x) \).

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 221 Context: B Appendix B: KL Divergence and Exponential Family In this appendix, we provide a general expression for the KL divergence between two distributions \( p(x|\eta_1) \) and \( p(x|\eta_2) \) from the same regular exponential family with log-partition function \( A(\cdot) \), sufficient statistics \( u(x) \), and moment parameters \( \mu_1 \) and \( \mu_2 \), respectively. We recall from Chapter 3 that the log-partition function is convex and that we have the identity \( \nabla A(\eta) = \mu \). The KL divergence between the two distributions can be translated into a divergence on the space of natural parameters. In particular, the following relationship holds [6]: \[ KL(p(x|\eta_1) \| p(x|\eta_2)) = D_A(\eta_2, \eta_1), \tag{B.1} \] where \( D_A(\eta_2, \eta_1) \) represents the Bregman divergence with generator function given by the log-partition function \( A(\cdot) \), that is: \[ D_A(\eta_2, \eta_1) = A(\eta_2) - A(\eta_1) - (\eta_2 - \eta_1)^\top \nabla A(\eta_1) = A(\eta_2) - A(\eta_1) - (\eta_2 - \eta_1)^\top \mu_1. \tag{B.2} \] The first line of (B.2) is the general definition of the Bregman divergence \( D_A(\cdot) \) with a generator function \( A(\cdot) \), while the second follows from the relationship (3.10). Note that the Bregman divergence can be

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 223 Context: # Acknowledgements Osvaldo Simeone has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 725731).

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 224 Context: # References 1. Abadi, M., Ú. Erlingsson, I. Goodfellow, H. Brendan McMahan, I. Mironov, N. Papernot, K. Talwar, and L. Zhang. 2017. “On the Protection of Private Information in Machine Learning Systems: Two Recent Approaches.” *ArXiv e-prints*. Aug. arXiv: 1708.08032 [eprint]. 2. Abu-Mostafa, Y. S., M. Magdon-Ismail, and H.-T. Lin. 2012.
*Learning from data*. Vol. 4. AMLBook New York, NY, USA. 3. Agakov, F. 2005. *Variational Information Maximization in Stochastic Environments* (PhD thesis). University of Edinburgh. 4. Alemi, A. A., B. Poole, and E. A. Fischer. 2017. “An Information-Theoretic Analysis of Deep Latent-Variable Models.” *ArXiv e-prints*. Nov. arXiv: 1711.00614v1. 5. Amari, S.-i. 1998. “Natural gradient works efficiently in learning.” *Neural computation*, 10(2): 251–276. 6. Amari, S.-i. 2016. *Information geometry and its applications*. Springer. 7. Angelino, E., M. J. Johnson, R. P. Adams, et al. 2016. “Patterns of scalable Bayesian inference.” *Foundations and Trends® in Machine Learning*, 9(2-3): 119–247. 8. Arjovsky, M., S. Chintala, and L. Bottou. 2017. “Wasserstein GAN.” *arXiv preprint arXiv:1701.07875*.

#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 233 Context: # References [112] Pearl, J. (2018). “Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution.” *ArXiv e-prints*. Jan. arXiv:1801.00146 [cs.LG]. [113] Pearl, J., M. Glymour, and N. P. Jewell. (2016). *Causal inference in statistics: a primer*. John Wiley & Sons. [114] Pereyra, M., P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. O. Hero, and S. McLaughlin. (2016). “A survey of stochastic simulation and optimization methods in signal processing.” *IEEE Journal of Selected Topics in Signal Processing*, 10(2): 221–241. [115] Peters, J., D. Janzing, and B. Schölkopf. (2017). *Elements of Causal Inference: Foundations and Learning Algorithms*. MIT Press (available online). [116] Pinker, S. (1997). *How the Mind Works*. Penguin Press Science. [117] Rabiner, L. and B. Juang. (1986). “An introduction to hidden Markov models.” *IEEE ASSP Magazine*, 3(1): 4–16. [118] Raginsky, M. (2011). “Directed Information and Pearl’s causal calculus.” In: *Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on*. IEEE. 958–965. [119] Raginsky, M., A. Rakhlin, M. Tsao, Y. Wu, and A. Xu. (2016). “Information-theoretic analysis of stability and bias of learning algorithms.” In: *Information Theory Workshop (ITW), 2016 IEEE*. IEEE, 26–30. [120] Ranganath, R., S. Gerrish, and D. Blei. (2014). “Black box variational inference.” In: *Artificial Intelligence and Statistics*. 814–822. [121] Ranganath, R., L. Tang, L. Charlin, and D. Blei. (2015). “Deep exponential families.” In: *Artificial Intelligence and Statistics*. 762–771. [122] Rezende, D. J., S. Mohamed, and D. Wierstra. (2014). “Stochastic backpropagation and approximate inference in deep generative models.” *arXiv preprint arXiv:1401.4082*. [123] Roth, K., A. Lucchi, S. Nowozin, and T. Hofmann. (2017). “Stabilizing Training of Generative Adversarial Networks through Regularization.” *arXiv preprint arXiv:1701.09367*.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 21 Context: # Data Visualization Techniques Distributions that have heavy tails relative to Gaussian distributions are important to consider. Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold. ## Scatter Plots Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game.
To illustrate, I will give a few examples from a variety of techniques: 1. **Histogram**: A useful way to represent the distribution of a dataset. 2. **Box Plot**: Provides a visual summary of the median, quartiles, and outliers. 3. **Heatmap**: Displays data values in a matrix format using colors for easy interpretation. 4. **Line Graph**: Ideal for showing trends over time. Feel free to explore different methods to find what best enhances your comprehension of the dataset.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 31 Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generality also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it? Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze. Given that he needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning". One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don't have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it has hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. The trick is not to share too much nor too little; how much we should share depends on how much data and prior knowledge we have access to for each task.
We call this subfield of machine learning: "multi-task learning".

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 33 Context: # Chapter 5 ## Nearest Neighbors Classification Perhaps the simplest algorithm to perform classification is the **k nearest neighbors (kNN) classifier**. As usual, we assume that we have data of the form \(\{X_{in}, Y_n\}\) where \(X_{in}\) is the value of attribute \(i\) for data-case \(n\) and \(Y_n\) is the label for data-case \(n\). We also need a measure of similarity between data-cases, which we will denote with \(K(X_n, X_m)\) where larger values of \(K\) denote more similar data-cases. Given these preliminaries, classification is embarrassingly simple: when you are provided with the attributes \(X_t\) for a new (unseen) test-case, you first find the \(k\) most similar data-cases in the dataset by computing \(K(X_t, X_n)\) for all \(n\). Call this set \(S\). Then, each of these \(k\) most similar neighbors in \(S\) can cast a vote on the label of the test case, where each neighbor predicts that the test case has the same label as itself. Assuming binary labels and an odd number of neighbors, this will always result in a decision. Although kNN algorithms are often associated with this simple voting scheme, more sophisticated ways of combining the information of these neighbors are allowed. For instance, one could weigh each vote by the similarity to the test-case. This results in the following decision rule: \[ Y_t = 1 \quad \text{if} \quad \sum_{n \in S} K(X_t, X_n)(2Y_n - 1) > 0 \tag{5.1} \] \[ Y_t = 0 \quad \text{if} \quad \sum_{n \in S} K(X_t, X_n)(2Y_n - 1) < 0 \tag{5.2} \] and flipping a coin if it is exactly 0 (5.3). Why do we expect this algorithm to work intuitively? The reason is that we expect data-cases with similar labels to cluster together in attribute space.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 34 Context: # CHAPTER 5. NEAREST NEIGHBORS CLASSIFICATION To figure out the label of a test case, we simply look around and see what labels our neighbors have. Asking your closest neighbor is like betting all your money on a single piece of advice and you might get really unlucky if your closest neighbor happens to be an odd one out. It's typically better to ask several opinions before making your decision. However, if you ask around too much you will be forced to ask advice from data cases that are no longer very similar to you. So there is some optimal number of neighbors to ask, which may be different for every problem. Determining this optimal number of neighbors is not easy, but we can again use cross validation (see section ??) to estimate it. ## What is Good and Bad about KNN? First, its simplicity makes it attractive. Very few assumptions about the data are used in the classification process. This property can also be a disadvantage: if you have prior knowledge about how the data was generated, it's better to use it, because less information has to be extracted from the data. A second consideration is computation time and memory efficiency. Assume you have a very large dataset, but you need to make decisions very quickly. As an example, consider surfacing the web pages at Amazon.com. Whenever you search for a book, it likes to suggest 10 others. To do that it could classify books into categories and suggest the top ranked in that category.
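As a brief aside before continuing with the efficiency discussion, the weighted-vote decision rule (5.1)-(5.2) above is easy to state in code. The following sketch uses a Gaussian similarity function and synthetic two-dimensional data; the choice of similarity measure, the value of k, and the data are illustrative assumptions rather than choices made in the book.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-class toy data: class 0 around (-1, -1), class 1 around (+1, +1).
X0 = rng.normal(loc=-1.0, size=(50, 2))
X1 = rng.normal(loc=+1.0, size=(50, 2))
X = np.vstack([X0, X1])
Y = np.array([0] * 50 + [1] * 50)

def similarity(a, b, scale=1.0):
    """Gaussian similarity K(a, b): larger values mean more similar data-cases."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * scale ** 2))

def knn_predict(x_test, X, Y, k=5):
    """Weighted-vote kNN: sign of sum_{n in S} K(x_test, x_n)(2 Y_n - 1)."""
    sims = np.array([similarity(x_test, x_n) for x_n in X])
    S = np.argsort(-sims)[:k]                      # indices of the k most similar cases
    score = np.sum(sims[S] * (2 * Y[S] - 1))       # similarity-weighted vote
    if score > 0:
        return 1
    if score < 0:
        return 0
    return int(rng.integers(2))                    # flip a coin on an exact tie

print(knn_predict(np.array([0.8, 1.2]), X, Y))     # expected: 1
print(knn_predict(np.array([-1.3, -0.4]), X, Y))   # expected: 0
```

Note that the entire training set is carried around and searched at prediction time, which is exactly the memory and compute cost discussed next.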
KNN requires Amazon to store all features of all books at a location that is accessible for fast computation. Moreover, to classify a new case, KNN has to redo the neighborhood search every time. Clearly, there are tricks that can be played with smart indexing, but wouldn't it be much easier if we could have summarized all books by a simple classification function \( f(X) \), that "spits out" a class for any combination of features \( X \)? This distinction between algorithms/models that summarize the data through parameters and those that require remembering every data-item is often called "parametric" versus "non-parametric". It's important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function \( f_\theta(X) \), or whether we retain all the data to do the classification "on the fly". KNN is also known to suffer from the "curse of high dimensions". If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then KNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100 dimensional space.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 41 Context: # 6.3. Class-Prediction for New Instances where with \(v_i\) we mean the value for attribute \(i\) that we observe in the email under consideration, i.e. if the email contains no mention of the word "viagra" we set \(v_{\text{viagra}} = 0\). The first term in Eqn. 6.7 adds all the log-probabilities under the spam model of observing the particular value of each attribute. Every word that is observed and has a high probability under the spam model, and hence has often been observed in the dataset, will boost this score. The last term adds an extra factor to the score that expresses our prior belief of receiving a spam email instead of a ham email. We compute a similar score for ham, namely: \[ S_{\text{ham}} = \sum_i \log P_{\text{ham}}(X_i = v_i) + \log P(\text{ham}) \tag{6.8} \] and compare the two scores. Clearly, a large score for spam relative to ham provides evidence that the email is indeed spam. If your goal is to minimize the total number of errors (whether they involve spam or ham) then the decision should be to choose the class which has the highest score. In reality, one type of error could have more serious consequences than another. For instance, a spam email making it into my inbox is not too bad, but an important email that ends up in my spam box (which I never check) may have serious consequences. To account for this we introduce a general threshold \(\theta\) and use the following decision rule: \[ Y = 1 \quad \text{if } S_1 > S_0 + \theta \tag{6.9} \] \[ Y = 0 \quad \text{if } S_1 < S_0 + \theta \tag{6.10} \] If these quantities are equal we flip a coin. If \(\theta = -\infty\), we always decide in favor of label \(Y = 1\), while if we use \(\theta = +\infty\) we always decide in favor of \(Y = 0\). The actual value is a matter of taste. To evaluate a classifier we often draw an ROC curve.
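Before turning to the ROC construction described next, here is a small numpy sketch of the spam/ham scoring and thresholded decision rule in (6.7)-(6.10). The vocabulary, word probabilities, and class priors below are made-up values for illustration only; sweeping the threshold over a labelled set and recording true and false positive rates at each value is what traces out the ROC curve.

```python
import numpy as np

# Hypothetical per-word Bernoulli probabilities and class priors (made up for this sketch).
vocab = ["viagra", "meeting", "free", "report"]
p_word_spam = np.array([0.60, 0.05, 0.50, 0.10])   # P_spam(X_i = 1)
p_word_ham  = np.array([0.01, 0.40, 0.10, 0.30])   # P_ham(X_i = 1)
prior_spam, prior_ham = 0.4, 0.6

def scores(v):
    """Log-scores S_spam and S_ham for a binary attribute vector v (cf. eqs. 6.7-6.8)."""
    s_spam = np.sum(np.log(np.where(v == 1, p_word_spam, 1 - p_word_spam))) + np.log(prior_spam)
    s_ham  = np.sum(np.log(np.where(v == 1, p_word_ham,  1 - p_word_ham)))  + np.log(prior_ham)
    return s_spam, s_ham

def classify(v, theta=0.0):
    """Decide spam (1) if S_spam > S_ham + theta, ham (0) otherwise (cf. eqs. 6.9-6.10)."""
    s_spam, s_ham = scores(v)
    return 1 if s_spam > s_ham + theta else 0

email = np.array([1, 0, 1, 0])           # mentions "viagra" and "free"
print(scores(email), classify(email))    # expected decision: spam (1)
# Increasing theta makes the classifier more reluctant to flag spam; sweeping theta
# from -inf to +inf and recording (false positive rate, true positive rate) pairs on
# a labelled set produces the ROC curve discussed in the following excerpt.
```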
An ROC curve is obtained by sliding \(\theta\) between \(-\infty\) and \(+\infty\) and plotting the true positive rate (the number of examples with label \(Y = 1\) also classified as \(Y = 1\) divided by the total number of examples with \(Y = 1\)) versus the false positive rate (the number of examples with label \(Y = 0\) classified as \(Y = 1\) divided by the total number of examples with \(Y = 0\)). For more details see chapter ??.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 49 Context: The fact that we are picking data-cases randomly injects noise into the updates, so even close to convergence we are "wiggling around" the solution. If we decrease the stepsize however, the wiggles get smaller. So it seems a sensible strategy would be to slowly decrease the stepsize and wiggle our way to the solution. This stochastic gradient descent is actually very efficient in practice if we can find a good annealing schedule for the stepsize. Why really? It seems that if we use more data-cases in a mini-batch to perform a parameter update we should be able to make larger steps in parameter space by using bigger stepsizes. While this reasoning holds close to the solution, it does not hold far away from the solution. The intuitive reason is that far away from convergence every datapoint will tell you the same story: move in direction X to improve your model. You simply do not need to query many datapoints in order to extract that information. So for a bad model there is a lot of redundancy in the information that data-cases can convey about improving the parameters, and querying a few is sufficient. Closer to convergence you need to either use more data or decrease the stepsize to increase the resolution of your gradients. This type of reasoning clearly makes an effort to include the computational budget as part of the overall objective. This is what we have argued in chapter XX is the distinguishing feature of machine learning. If you are not convinced about how important this is in the face of modern day datasets, imagine the following. Company C organizes a contest where they provide a virtually infinite dataset for some prediction task. You can earn 1 million dollars if you make accurate predictions on some test set by Friday next week. You can choose between a single parameter update based on all the data or many updates on small subsets of the data. Who do you think will win the contest?

## 7.2 A Different Cost Function: Logistic Regression

The cost function of Eq. 7.2 penalizes gross violations of one's predictions rather severely (quadratically). This is sometimes counter-productive because the algorithm might get obsessed with improving the performance of one single data-case at the expense of all the others. The real cost simply counts the number of mislabelled instances, irrespective of how bad the prediction \( w^T x_n + \alpha \) was. So, a different function is often used: \[ C(w, \alpha) = \frac{1}{N} \sum_{n=1}^{N} y_n \tanh(w^T x_n + \alpha) \tag{7.10} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 50 Context: # CHAPTER 7. THE PERCEPTRON The function `tanh(·)` is plotted in Figure ??. It shows that the cost can never be larger than 2, which ensures robustness against outliers. We leave it to the reader to derive the gradients and formulate the gradient descent algorithm.
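Since the excerpt leaves the gradient derivation to the reader, here is a short numpy sketch of the corresponding update for the tanh-based objective in (7.10). Reading the expression as a measure of agreement between labels \( y_n \in \{-1, +1\} \) and \( \tanh(w^T x_n + \alpha) \), the sketch takes gradient ascent steps on it (equivalently, descent on its negative); this reading, the synthetic data, the step size, and the label convention are assumptions of the sketch, not statements from the book.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linearly separable data with labels y in {-1, +1}.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

w = np.zeros(2)
alpha = 0.0
lr = 0.1
for step in range(500):
    s = X @ w + alpha
    t = np.tanh(s)
    # d/dw of (1/N) sum_n y_n tanh(w^T x_n + alpha) is (1/N) sum_n y_n (1 - tanh^2) x_n,
    # and d/dalpha is (1/N) sum_n y_n (1 - tanh^2).  We step uphill on this quantity,
    # since larger y_n * tanh(.) means better agreement with the labels.
    g = y * (1.0 - t ** 2)
    w += lr * (X.T @ g) / len(y)
    alpha += lr * np.mean(g)

pred = np.sign(X @ w + alpha)
print("training accuracy:", np.mean(pred == y))
```

Because the per-case contribution is bounded, a single badly misclassified outlier cannot dominate the update, which is the robustness property the excerpt attributes to this cost.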
## 7.3 The Idea In a Nutshell

Figure ?? tells the story. One assumes that your data can be separated by a line. Any line can be represented by `w^T x = α`. Data cases from one class satisfy `w^T x_n ≤ α` while data cases from the other class satisfy `w^T x_n ≥ α`. To achieve that, you write down a cost function that penalizes data cases falling on the wrong side of the line and minimize it over `(w, α)`. For a test case, you simply compute the sign of `w^T x_{test} - α` to make a prediction as to which class it belongs to.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 58 Context: # CHAPTER 8: SUPPORT VECTOR MACHINES ## Introduction Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. ## Key Concepts 1. **Hyperplane**: A hyperplane is a decision boundary that helps to categorize data points. 2. **Support Vectors**: Support vectors are the data points that are closest to the hyperplane and influence its position and orientation. 3. **Margin**: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin. ## Implementation Steps - **Step 1**: Choose the appropriate kernel (linear, polynomial, RBF). - **Step 2**: Train the SVM model using the training dataset. - **Step 3**: Evaluate the model using a test dataset. ## Advantages of SVM - Effective in high-dimensional spaces. - Robust against overfitting, especially in high-dimensional datasets. ## Disadvantages of SVM - Less effective on very large datasets. - Poor performance with overlapping classes. ## Conclusion Support Vector Machines are powerful tools for classification and regression tasks, offering advantages in high-dimensional spaces while having limitations in very large datasets.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 66 Context: # CHAPTER 10. KERNEL RIDGE REGRESSION One big disadvantage of the ridge-regression is that we don't have sparseness in the \( \alpha \) vector, i.e., there is no concept of support vectors. Sparseness is useful because when we test a new example, we only have to sum over the support vectors, which is much faster than summing over the entire training set. In the SVM, the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that if a constraint was inactive, then the corresponding multiplier \( \alpha_i \) was zero. There is no such effect here.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 82 Context: We want to maximize this objective, because this would maximize the correlation between the univariates u and v. Note that we divided by the standard deviation of the projections to remove scale dependence. This exposition is very similar to the Fisher discriminant analysis story and I encourage you to reread that. For instance, there you can find how to generalize to cases where the data is not centered. We also introduced the following "trick". Since we can rescale a and b without changing the problem, we can constrain the variances of the projections, \(E[u^2]\) and \(E[v^2]\), to be equal to 1.
This then allows us to write the problem as:
```markdown
maximize_{a,b} \quad \rho = E[uv]
subject to \quad E[u^2] = 1
           \quad E[v^2] = 1                                   (14.2)
```
Or, if we construct a Lagrangian and write out the expectations we find,
```markdown
min_{a,b} \max_{\lambda_1,\lambda_2} \sum_{i} a^T x_i y_i^T b
  - \frac{1}{2} \lambda_1 \left( \sum_{i} a^T x_i x_i^T a - N \right)
  - \frac{1}{2} \lambda_2 \left( \sum_{i} b^T y_i y_i^T b - N \right)   (14.3)
```
where we have multiplied by N. Let's take derivatives wrt a and b to see what the KKT equations tell us,
```markdown
\sum_{i} x_i y_i^T b - \lambda_1 \sum_{i} x_i x_i^T a = 0   (14.4)
\sum_{i} y_i x_i^T a - \lambda_2 \sum_{i} y_i y_i^T b = 0   (14.5)
```
First notice that if we multiply the first equation with \( a^T \) and the second with \( b^T \) and subtract the two, while using the constraints, we arrive at \( \lambda_1 = \lambda_2 = \lambda \). Next, rename \( S_{uv} = \sum_{i} x_i y_i^T, S_{x} = \sum_{i} x_i x_i^T, S_{y} = \sum_{i} y_i y_i^T \). We define the following larger matrices: \( S_{D} \) is the block diagonal matrix with \( S_{x} \) and \( S_{y} \) on the diagonal and zeros on the off-diagonal blocks. Also, we define \( S_{C} \) to be the off-diagonal matrix with \( S_{uv} \) on the off diagonal. Finally we define \( c' = [a, b] \). The two equations can then be written jointly as,
```markdown
S_{C} c' = \lambda S_{D} c' \Rightarrow S_{D}^{-1} S_{C} c' = \lambda c'   (14.6)
```
which is again a regular eigenvalue equation for \( c' \), now with the matrix \( S_{D}^{-1} S_{C} \).

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 84 Context: Analogous to the primal problem, we will define big matrices, \( K_D \) which contains \( (K_{1}^{2} + \eta I) \) and \( (K_{2}^{2} + \eta I) \) as blocks on the diagonal and zeros at the blocks off the diagonal, and the matrix \( K_0 \) which has the matrices \( K_1 \) on the right-upper off-diagonal block and \( K_2 \) at the left-lower off-diagonal block. Also, we define \( \gamma = [\alpha, \beta] \). This leads to the equation, \[ K_0 \gamma = \lambda K_D \gamma \Rightarrow K_D^{-1} K_0 \gamma = \lambda \gamma \tag{14.15} \] which is again a regular eigenvalue equation. Note that the regularization also moved the smallest eigenvalue away from zero, and hence made the inverse more numerically stable. The value for \( \eta \) needs to be chosen using cross-validation or some other measure. Solving the equations using this larger eigenvalue problem is actually not quite necessary, and more efficient methods exist (see book). The solutions are not expected to be sparse because eigenvectors are not expected to be sparse. One would have to replace \( L_2 \) norm penalties with \( L_1 \) norm penalties to obtain sparsity.

########## """QUERY: Please summarize the whole context. It is important that you include a summary for each file. All files should be included, so please make sure to go through the entire context""" Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown.
Final Files Sources: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 3, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 14, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 18, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 31, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 48, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 49, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 50, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 51, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 52, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 61, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 62, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 64, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 69, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 71, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 75, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 87, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 101, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 125, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 130, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 133, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 135, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 145, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 150, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 151, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 156, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 164, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 172, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 173, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 178, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 193, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 195, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 211, 
A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 218, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 221, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 226, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 231, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 237, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 257, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 2, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 9, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 10, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 13, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 14, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 23, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 26, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 27, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 36, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 39, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 40, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 41, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 44, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 51, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 52, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 54, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 58, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 59, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 63, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 70, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 75, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 78, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 79, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 80, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 83, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 84, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 85, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 86, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 87, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 88, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 89, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 90, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 92, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 93, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 95, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 98, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 99, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 100, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 101, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 105, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 107, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 112, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 115, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 117, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 118, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 120, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 121, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 126, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 129, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 155, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 161, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 164, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 167, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 171, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 172, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 176, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 189, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 192, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 197, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 203, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 209, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 211, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 213, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 217, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 219, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 220, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 221, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 223, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 224, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 233, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 21, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 33, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 34, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 41, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 49, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 50, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 58, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 66, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 82, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 84 ================================================== **Elapsed Time: 39.52 seconds** ================================================== FINAL ANSWER Answer: ================================================== **Elapsed Time: 0.00 seconds** ==================================================