INITIALIZATION
Knowledgebase: ki-dev-large
Base Query: can you give me a summary of the topics discussed in the context. display it in a table format with the topic and a short description as headers. Limit it to 5 lists only
Model: gpt-4o
Use Curl?: None
==================================================
**Elapsed Time: 0.00 seconds**
==================================================


ROUTING
Query type: summary
==================================================
**Elapsed Time: 1.87 seconds**
==================================================


RAG PARAMETERS
Max Context To Include: 100
Lowest Score to Consider: 0.1
==================================================
**Elapsed Time: 0.14 seconds**
==================================================


VECTOR SEARCH ALGORITHM TO USE 
Use MMR search?: True
Use Similarity search?: False
==================================================
**Elapsed Time: 0.10 seconds**
==================================================


VECTOR SEARCH DONE 
==================================================
**Elapsed Time: 6.24 seconds**
==================================================


PRIMER 
Primer: 
            You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that
            can read knowledgebases through the "CONTEXT" that is included in the user's chat message.

        
            Your role is to act as an expert at summarization and analysis.

            In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality.

            Be honest by admitting when a topic falls outside your scope of knowledge, and suggest
            alternative avenues for obtaining information when necessary.

            Make effective use of chat history to avoid redundancy and enhance response relevance, continuously
            adapting to integrate all necessary details in your interactions.

            Use as much tokens as possible to provide a detailed response.
        
==================================================
**Elapsed Time: 0.18 seconds**
==================================================


FINAL QUERY 
Final Query: CONTEXT: ##########
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 1

Context: # A Cool Brisk Walk  
Through Discrete Mathematics

**Version 2.2.1**  
**Stephen Davies, Ph.D.**  

## Table of Contents
1. Introduction
2. Concepts
   - Sets
   - Functions
   - Relations
3. Applications
4. Conclusion

## Introduction
In this document, we explore various topics under discrete mathematics. 

## Concepts

### Sets
- Definition
- Notation
- Operations

### Functions
- Definition
- Types of functions
- Examples

### Relations
- Definition
- Properties

| Concept   | Description                       |
|-----------|-----------------------------------|
| Set      | A collection of distinct objects   |
| Function  | A relation from a set to itself    |
| Relation  | A connection between two sets       |

## Applications
The concepts of discrete mathematics can be applied in various fields, including computer science, information technology, and combinatorial design.

## Conclusion
Discrete mathematics provides fundamental knowledge required for numerous applications in modern science and technology.

Image Analysis: 

### Analysis of the Attached Visual Content

#### 1. Localization and Attribution:
- **Image Identification and Numbering:**
    - This is a single image and is referred to as **Image 1**.

#### 2. Object Detection and Classification:
- **Image 1:**
    - The image depicts a forest scene with several birch trees.
    - **Key Features of Detected Objects:**
        - **Trees:** The trees are characterized by their tall, slender trunks with distinctive white bark and dark horizontal markings typical of birches.
        - **Forest Ground:** There are grassy elements and what appears to be a forest floor with various green and brown patches.

#### 3. Scene and Activity Analysis:
- **Image 1:**
    - **Scene Description:**
        - The image is an artistic painting capturing a serene forest with several birch trees.
        - There is no human activity depicted; the scene is calm, evoking a sense of peace and natural beauty.

#### 4. Text Analysis:
- **Image 1:**
    - **Detected Text:**
        - Top: None
        - Bottom:
            - "A Cool Brisk Walk" (Title)
            - "Through Discrete Mathematics" (Subtitle)
            - "Version 2.2.1" (Version Information)
            - "Stephen Davies, Ph.D." (Author Name)
    - **Text Significance:**
        - The text suggests that this image is probably the cover of a book titled "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies, Ph.D. The version number indicates it is version 2.2.1, signifying a possibly updated or revised edition.

#### 8. Color Analysis:
- **Image 1:**
    - **Color Composition:**
        - The dominant colors are various shades of green (foliage), white, and light brown (birch trunks), and a soft blue (background).
    - **Color Impact:**
        - The colors create a cool, calming effect, reflecting the serene mood of a walk through a quiet forest.

#### 9. Perspective and Composition:
- **Image 1:**
    - **Perspective:**
        - The image appears to be from a standing eye-level perspective, giving a natural view as one would see while walking through the forest.
    - **Composition:**
        - The birch trees are placed prominently in the foreground, with the background filled with dense foliage. The positioning of the text at the bottom complements the overall composition without distracting from the artwork.

#### 10. Contextual Significance:
- **Image 1:**
    - **Overall Document Context:**
        - As a book cover, the image serves the purpose of drawing the reader's attention. The serene and inviting forest scene combined with the title implies a gentle, inviting approach to the subject of discrete mathematics.

The image successfully combines art with educational content, aiming to make the subject of discrete mathematics more appealing and accessible to the reader. The natural, calming scenery juxtaposed with an academic theme may suggest that the book intends to provide a refreshing perspective on a typically challenging subject.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 3

Context: # Contents at a glance

## Contents at a glance
- **Contents at a glance** …… i
- **Preface** …… iii
- **Acknowledgements** …… v
  
1. **Meetup at the trailhead** …… 1
2. **Sets** …… 7
3. **Relations** …… 35
4. **Probability** …… 59
5. **Structures** …… 85
6. **Counting** …… 141
7. **Numbers** …… 165
8. **Logic** …… 197
9. **Proof** …… 223

Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)!
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 5

Context: # Preface

Discrete math is a popular book topic — start Googling around and you’ll find a zillion different textbooks about it. Take a closer look, and you’ll discover that most of these are pretty thick, dense volumes packed with lots of equations and proofs. They’re principled approaches, written by mathematicians and (seemingly) to mathematicians. I speak with complete frankness when I say I’m comforted to know that the human race is well covered in this area. We need smart people who can derive complex expressions and prove theorems from scratch, and I’m glad we have them.

Your average computer science practitioner, however, might be better served by a different approach. There are elements to the discrete math mindset that a budding software developer needs experience with. This is why discrete math is (properly, I believe) part of the mandatory curriculum for most computer science undergraduate programs. But for future programmers and engineers, the emphasis should be different than it is for mathematicians and researchers in computing theory. A practical computer scientist mostly needs to be able to use these tools, not to derive them. She needs familiarity, and practice, with the fundamental concepts and the thought processes they involve. The number of times the average software developer will need to construct a proof in graph theory is probably near zero. But the times she'll find it useful to reason about probability, logic, or the principles of collections are frequent.

I believe the majority of computer science students benefit most from simply gaining an appreciation for the richness and rigor of
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 8

Context: I'm unable to view images. Please provide the text in Markdown format for me to assist you.

Image Analysis: 

### Image Analysis:

#### 1. **Localization and Attribution:**
- **Image Number**: Image 1

#### 2. **Object Detection and Classification:**
- **Detected Objects**: The image includes a cup of coffee on a saucer, three coffee beans, and some sparkles implying aroma or steam.
- **Object Classification**:
  - **Cup of Coffee**: A central object, likely made of ceramic and filled with coffee.
  - **Coffee Beans**: Close to the cup, classifiable as roasted beans.
  - **Sparkles/Steam**: Above the cup, representing the aroma of the hot coffee.

#### 3. **Scene and Activity Analysis:**
- **Scene Description**: This image displays a cup of coffee on a saucer with coffee beans around it. Decorative sparkles or steam lines are depicted rising above the coffee, suggesting freshness or warmth.
- **Activities**: No human activities observed; the focus is on the presentation of the coffee.

#### 4. **Text Analysis:**
- No text detected in this image.

#### 5. **Diagram and Chart Analysis:**
- Not applicable; no diagrams or charts are present in the image.

#### 6. **Product Analysis:**
- **Product**: A cup of coffee.
- **Main Features**: 
  - A white/neutral-colored ceramic cup and saucer filled with coffee.
  - Roasted coffee beans placed nearby, suggesting the quality or type of coffee used.
- **Materials**: Ceramic cup and saucer, roasted coffee beans.
- **Colors**: Predominantly brown (coffee and beans) and white (cup and saucer).

#### 7. **Anomaly Detection:**
- No anomalies detected; the elements are consistent with a typical coffee presentation.

#### 8. **Color Analysis**:
- **Dominant Colors**: Brown and white.
  - **Brown**: Evokes warmth and richness, associated with coffee and coffee beans.
  - **White**: Clean and neutral, ensures focus on the coffee.

#### 9. **Perspective and Composition**:
- **Perspective**: Bird’s eye view.
- **Composition**: 
  - Centralized with the cup of coffee as the main focus.
  - Balanced by the placement of coffee beans around the cup.
  - Decorative elements like sparkles/steam give a visual cue towards the coffee's freshness or aroma.

#### 10. **Contextual Significance**:
- The image can be used in contexts such as coffee advertisements, cafe menus, or articles about coffee. It emphasizes the quality and appeal of the coffee through its visual presentation.

#### 11. **Metadata Analysis**:
- No metadata available for analysis.

#### 12. **Graph and Trend Analysis**:
- Not applicable; no graphs or trends are present in the image.

#### 13. **Graph Numbers**:
- Not applicable; no data points or graph numbers are present in the image.

### Additional Aspects:

#### **Ablaufprozesse (Process Flows)**:
- Not applicable; no process flows depicted.

#### **Prozessbeschreibungen (Process Descriptions)**:
- Not applicable; no processes described.

#### **Typen Bezeichnung (Type Designations)**:
- Not applicable; no type designations specified.

#### **Trend and Interpretation**:
- Not applicable; no trends depicted.

#### **Tables**:
- Not applicable; no tables included.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 11

Context: # Understanding Integration and Differentiation

People in your family, there will never be 5.3 of them (although there could be 6 someday).

The last couple of entries on this list are worth a brief comment. They are math symbols, some of which you may be familiar with.

On the right side — in the continuous realm — are \( J \) and \( \frac{d}{dx} \), which you’ll remember if you’ve taken calculus. They stand for the two fundamental operations of integration and differentiation. Integration, which can be thought of as finding “the area under a curve,” is basically a way of adding up a whole infinite bunch of numbers over some range. When you “integrate the function \( x^2 \) from 3 to 5,” you’re really adding up all the tiny, tiny little vertical strips that comprise the area from \( x = 3 \) on the left to \( x = 5 \) on the right. Its corresponding entry in the left-column of the table is \( \Sigma \), which is just a short-hand for “sum up a bunch of things.” Integration and summation are equivalent operations; it’s just that when you integrate, you’re adding up all the (infinitely many) slivers across the real-line continuum. When you sum, you’re adding up a fixed sequence of entries, one at a time, like in a loop. \( \Sigma \) is just the discrete “version” of \( J \).

The same sort of relationship holds between ordinary subtraction (\(-\)) and differentiation (\(\frac{d}{dx}\)). If you’ve plotted a bunch of discrete points on \( x \)-\( y \) axes, and you want to find the slope between two of them, you just subtract their \( y \) values and divide by the \( x \) distance between them. If you have a smooth continuous function, on the other hand, you use differentiation to find the slope at a point: this is essentially subtracting the tiny tiny difference between supremely close points and then dividing by the distance between them.

Don’t worry; you don’t need to have fully understood any of the integration or differentiation stuff I just talked about, or even to have taken calculus yet. I’m just trying to give you some feel for what “discrete” means, and how the dichotomy between discrete and continuous really runs through all of math and computer science. In this book, we will mostly be focusing on discrete values and structures, which turn out to be of more use in computer science. That’s partially because, as you probably know, computers
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 13

Context: # 1.1 EXERCISES

1. **What’s the opposite of concrete?**  
   Abstract.

2. **What’s the opposite of discrete?**  
   Continuous.

3. **Consider a quantity of water in a glass. Would you call it abstract, or concrete? Discrete, or continuous?**  
   Concrete, since it’s a real entity you can experience with the senses. Continuous, since it could be any number of ounces (or liters, or tablespoons, or whatever). The amount of water certainly doesn’t have to be an integer. (Food for thought: since all matter is ultimately comprised of atoms, are even substances like water discrete?)

4. **Consider the number 27. Would you call it abstract, or concrete? Discrete, or continuous?**  
   Abstract, since you can’t see or touch or smell “twenty-seven.” Probably discrete, since it’s an integer, and when we think of whole numbers we think “discrete.” (Food for thought: in real life, how would you know whether I meant the integer “27” or the decimal number “27.00”? And does it matter?) Clearly it’s discrete. Abstract vs. concrete, though, is a little tricky. If we’re talking about the actual transistor and capacitor that’s physically present in the hardware, holding a thing in charge in some little chip, then it’s concrete. But if we’re talking about the value “1” that is conceptually part of the computer’s currently executing state, then it’s really abstract just like 27 was. In this book, we’ll always be talking about bits in this second, abstract sense.

5. **Consider a bit in a computer’s memory. Would you call it abstract, or concrete? Discrete, or continuous?**  
   Any kind of abstract object that has properties we can reason about. 

6. **If math isn’t just about numbers, what else is it about?**  
   Any kind of abstract object that has properties we can reason about.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 21

Context: 2.4 Sets are not arrays
=======================

If you’ve done some computer programming, you might see a resemblance between sets and the collections of items often used in a program: arrays, perhaps, or linked lists. To be sure, there are some similarities. But there are also some very important differences, which must not be overlooked:

- **No order.** As previously mentioned, there is no order to the members of a set. “{Mom, Dad}” is the same set as “{Dad, Mom}”. In a computer program, of course, most arrays or lists have first, second, and last elements, and an index number assigned to each.

- **No duplicates.** Suppose \( M \) is the set of all males. What would it possibly mean to say \( M = \{T.J., T.J., Johnny\} \)? Would that mean that “T.J. is twice the man that Johnny is”? This is obviously nonsensical. The set \( M \) is based on a property: maleness. Each element of \( M \) is either male, or it isn't. It can't be “male three times.” Again, in an array or linked list, you could certainly have more than one copy of the same item in different positions.

- **Infinite sets.** 'Nuff said. I’ve never seen an array with infinitely many elements, and neither will you.

- **Untyped.** Most of the time, an array or other collection in a computer program contains elements of only a single type; it’s an array of integers, or a linked list of Customer objects, for example. This is important because the program often needs to treat all elements in the collection the same way. Perhaps it needs to loop over the array to add up all the numbers, or iterate through a linked list and search for customers who have not placed an order in the last six months.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 23

Context: # 2.6 Sets of Sets

Sets are heterogeneous — a single set can contain four universities, seven integers, and an image — and so it might occur to you that they can contain other sets as well. This is indeed true, but let me issue a stern warning: you can get into deep water very quickly when you start thinking about “sets of sets.” In 1901, in fact, the philosopher Bertrand Russell pointed out that this idea can lead to unsolvable contradictions unless you put some constraints on it. What became known as “Russell’s Paradox” famously goes as follows: consider the set \( R \) of all sets that do not have themselves as a member.

For this reason, we’ll be very careful to use curly braces to denote sets, and parentheses to denote ordered pairs.

By the way, although the word “coordinate” is often used to describe the elements of an ordered pair, that’s really a geometry-centric word that implies a visual plot of some kind. Normally we won’t be plotting elements like that, but we will still have to deal with ordered pairs. I’ll just call the constituent parts “elements” to make it more general.

Three-dimensional points need ordered triples \( (x, y, z) \), and it doesn’t take a rocket scientist to deduce that we could extend this to any number of elements. The question is what to call them, and you do sort of sound like a rocket scientist (or other generic nerd) when you say tuple. Some people rhyme this word with “Drupal,” and others with “couple,” by the way, and there seems to be no consensus. If you have an ordered pair-type thing with 5 elements, therefore, it’s a 5-tuple (or a quintuple). If it has 117 elements, it’s a 117-tuple, and there’s really nothing else to call it. The general term (if we don’t know or want to specify how many elements) is n-tuple. In any case, it’s an ordered sequence of elements that may contain duplicates, so it’s very different than a set.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 26

Context: # 2.9 Combining Sets

Okay, so we have sets. Now what can we do with them? When you first learn about numbers back before kindergarten, the next thing you learn is how to combine numbers using various operations to produce other numbers. These include \( -, \times, +, \div \), exponents, roots, etc. Sets, too, have operations that are useful for combining to make other sets. These include:

- **Union (\(\cup\))**: The union of two sets is a set that includes the elements that either (or both) of them have as members. For instance, if \( A = \{ \text{Dad}, \text{Lizzy} \} \) and \( B = \{ \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \), then \( A \cup B = \{ \text{Dad}, \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \). Note that any element is in the union if it is in \( A \) or \( B \). For this reason, there is a strong relationship between the union operator of sets and the “or” (\( \vee \)) operator of boolean logic that we’ll see later.

- **Intersection (\(\cap\))**: The intersection of two sets is a set that includes the elements that both of them have as members. In the above example, \( A \cap B = \{ \text{Lizzy} \} \). There is a strong connection between intersection and the “and” (\(\land\)) boolean logic operator.

- **(Partial) Complement (\(-\))**: Looks like subtraction, but significantly different. \( A - B \) contains the elements from \( A \) that are also not in \( B \). So you start with \( A \) and then “subtract off” the contents of \( B \) if they occur. In the above example, \( A - B = \{ \text{Dad} \} \). (Note that T.J. and Johnny didn’t really enter into the calculation.) Unlike \( \cup \) and \( \cap \), it is not commutative. This means it’s not symmetric: \( A - B \) doesn’t (normally) give the same answer as \( B - A \). In this example, \( B - A = \{ \text{T.J.}, \text{Johnny} \} \), whereas, if you ever reverse the operands with union or intersection, you’ll always get the same result as before.

- **(Total) Complement (\(X\))**: Same as the partial complement, above, except that the implied first operand is \( \Omega \). In other words, \( A - B \) is “all the things in \( A \) that aren’t in \( B \),” whereas...
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 31

Context: 2.10 Subsets
===========

We learned that the “∈” symbol is used to indicate set membership: the element on the left is a member of the set on the right. A related but distinct notion is the idea of a subset. When we say \( X \subset Y \) (pronounced “X is a subset of Y”), it means that every member of \( X \) is also a member of \( Y \). The reverse is not necessarily true, of course; otherwise “⊆” would just mean “=”. So if \( A = \{ Dad, Lizzy \} \) and \( K = \{ Dad, Mom, Lizzy \} \), then we can say \( A \subset K \).

Be careful about the distinction between “∈” and “⊆”, which are often confused. With “∈”, the thing on the left is an element, whereas with “⊆”, the thing on the left is a set. This is further complicated by the fact that the element on the left-hand side of “∈” might well be a set.

Let's see some examples. Suppose \( Q \) is the set \( \{ 4, 9, 4 \} \).  \( Q \) has three elements here, one of which is itself a set. Now suppose that we let \( P \) be the set \( \{ 4, 9 \} \). Question: Is \( P \subset Q \)? The answer is yes: the set \( \{ 4, 9 \} \) (which is the same as the set \( \{ 9, 4 \} \), just written in a different way) is in fact an element of the set \( Q \). Next question: Is \( P \subset Q \)? The answer is no, \( P \nsubseteq Q \). If there are two of them (4 and 4) also in \( Q \), then 4 is a member of \( Q \), not 9. Last question: If \( R \) is defined to be \( \{ 2, 4 \} \), is \( R \subset Q \)? The answer is yes, since both 2 and 4 are members of \( Q \).

Notice that the definition here is a subset of itself. Sometimes, though, it’s useful to talk about whether a set is really a subset of another, and you don’t want to “count” if the two sets are actually equal. This is called a proper subset, and the symbol for it is \( \subsetneq \). You can see the rationale for the choice of symbol, because “⊆” is kind of like “<” for numbers, and “⊂” is like “<”.

Every set is a subset (not necessarily a proper one) of itself, because
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 33

Context: ```markdown
## 2.11. POWER SETS

Don't lose sight. There are four elements to the power set of \( A \), each of which is one of the possible subsets. It might seem strange to talk about "all of the possible subsets"—when I first learned this stuff, I remember thinking at first that there would be no limit to the number of subsets you could make from a set. But of course there is. To create a subset, you can either include, or exclude, each one of the original set's members. In \( A \)’s case, you can either (1) include both Dad and Lizzy, or (2) include Dad but not Lizzy, or (3) include Lizzy but not Dad, or (4) exclude both, in which case your subset is empty. Therefore, \( P(A) \) includes all four of those subsets.

Now what's the cardinality of \( P(X) \) for some set \( X \)? That's an interesting question, and one well worth pondering. The answer ripples through the heart of a lot of combinatorics and the binary number system, topics we'll cover later. And the answer is right at our fingertips if we just extrapolate from the previous example. To form a subset of \( X \), we have a choice to either include, or else exclude, each of its elements. So there's two choices for the first element, and then whether we choose to include or exclude that first element, there are two choices for the second, etc. 

So if \( |X| = 2 \) (recall that this notation means \( X \) has two elements) or \( X \) has a cardinality of \( 2^n \), then its power set has \( 2 \times 2 \) members. If \( |X| = 3 \), then its power set has \( 2 \times 2 \times 2 \) members. In general:

\[
|P(X)| = 2^{|X|}.
\]

As a limiting case (and a brain-bender), notice that if \( X \) is the empty set, then \( P(X) \) has one (not zero) members, because there is in fact one subset of the empty set: namely, the empty set itself. So \( |X| = 0 \), and \( |P(X)| = 1 \). And that jives with the above formula.
```
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 34

Context: # 2.12 Partitions

Finally, there’s a special variation on the subset concept called a **partition**. A partition is a group of subsets of another set that together are both **collectively exhaustive** and **mutually exclusive**. This means that every element of the original set is in **one and only one** of the sets in the partition. Formally, a partition of \( X \) is a group of sets \( X_1, X_2, \ldots, X_n \) such that:

\[ 
X_1 \cup X_2 \cup \cdots \cup X_n = X, 
\]

and 

\[ 
X_i \cap X_j = \emptyset \quad \text{for all } i, j. 
\]

So let’s say we’ve got a group of subsets that are supposedly a partition of \( X \). The first line above says that if we combine the contents of all of them, we get everything that’s in \( X \) (and nothing more). This is called being collectively exhaustive. The second line says that no two of the sets have anything in common; they are mutually exclusive.

As usual, an example is worth a thousand words. Suppose the set \( D \) is \(\{ \text{Dad, Mom, Lizzy, T.J., Johnny} \}\). A partition is any way of dividing \( D \) into subsets that meet the above conditions. One such partition is:

- \(\{ \text{Lizzy, T.J.} \}\),
- \(\{ \text{Mom, Dad} \}\), 
- \(\{ \text{Johnny} \}\).

Another one is:

- \(\{ \text{Lizzy} \}\),
- \(\{ \text{T.J.} \}\), 
- \(\{ \text{Mom} \}\),
- \(\{ \text{Johnny} \}\).

Yet another is:

- \(\emptyset\),
- \(\{ \text{Lizzy, T.J., Johnny, Mom, Dad} \}\), and 
- \(\emptyset\).
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 37

Context: 2.13 Exercises
================

Use an index card or a piece of paper folded lengthwise, and cover up the right-hand column of the exercises below. Read each exercise in the left-hand column, answer it in your mind, then slide the index card down to reveal the answer and see if you're right! For every exercise you missed, figure out why you missed it before moving on.

1. Is the set \( \{ Will, Smith \} \) the same as the set \( \{ Smith, Will \} \)?

   Yes indeed.

2. Is the ordered pair \( (Will, Smith) \) the same as \( (Smith, Will) \)?

   No. Order matters with ordered pairs (hence the name), and with any size tuple for that matter.

3. Is the set \( \{ Luke, Leia \}, Han \) the same as the set \( \{ Luke, Leia, Han \} \)?

   No. For instance, the first set has Han as a member but the second does not. (Instead, it has another set as a member, and that inner set happens to include Han.)

4. What's the first element of the set \( \{ Cowboys, Redskins, Steelers \} \)?

   The question doesn't make sense. There is no "first element" of a set. All three teams are equally members of the set and could be listed in any order.

5. Let \( G = \{ Matthew, Mark, Luke, John \} \) be \( \{ Luke, Obi-wan, Yoda \} \). \( S \) be the set of all Star Wars characters, and \( F \) be the four gospels from the New Testament. Now then. Is \( J \subseteq G \)?

   No.

6. Is \( J \subseteq S \)?

   Yes.

7. Is Yoda \( \in J \)?

   Yes.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 39

Context: # 2.13. EXERCISES

21. What’s \( A \cap B \)?  
   \{ Macbeth, Hamlet \}.

22. What’s \( B \cap T \)?  
   ∅.

23. What’s \( B \cap \overline{T} \)?  
   B. (which is { Scrabble, Monopoly, Othello. })

24. What’s \( A \cup (B \cap T) \)?  
   \{ Hamlet, Othello, Macbeth \}.

25. What’s \( (A \cup B) \cap T \)?  
   \{ Macbeth, Hamlet \}.

26. What’s \( A \setminus B \)?  
   Simply \( T \), since these two sets have nothing in common.

27. What’s \( T' \setminus B \)?  
   \{ (Hamlet, Macbeth), (Hamlet, Hamlet), (Hamlet, Othello), (Village, Macbeth), (Village, Hamlet), (Village, Othello), (Town, Hamlet), (Town, Othello) \}. The order of the ordered pairs within the set is not important; the order of the elements within each ordered pair is important.

28. What’s \( T \cap X \)?  
   0.

29. What’s \( (B \cap B) \times (A \cap T) \)?  
   \{ (Scrabble, Hamlet), (Monopoly, Hamlet), (Othello, Hamlet) \}.

30. What’s \( I \cup B \cup T \)?  
   7.

31. What’s \( I \cap (A \cap T) \)?  
   21. (The first parenthesized expression gives rise to a set with 7 elements, and the second to a set with three elements (B itself). Each element from the first set gets paired with an element from the second, so there are 21 such pairings.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 52

Context: # CHAPTER 3. RELATIONS

I contend that in this toy example, “isAtLeastAsToughAs” is a partial order, and \( A \) along with \( isAtLeastAsToughAs \) together form a poset. I reason as follows. It’s reflexive, since we started off by adding every dog with itself. It’s antisymmetric, since when we need add both \( (x, y) \) and \( (y, x) \) to the relation. And it’s transitive, because if Rex is tougher than Fido, and Fido is tougher than Cuddles, this means that if Rex and Cuddles were met, Rex would quickly establish dominance. (I’m no zoologist, and am not sure if the last condition truly applies with real dogs. But let’s pretend it does.)

It’s called a “partial order” because it establishes a partial, but incomplete, hierarchy among dogs. If we ask, “is dog X tougher than dog Y?” the answer is never ambiguous. We’re never going to say, “well, dog X was superior to dog A, who was superior to dog Y . . .” but then again, dog Y was superior to dog B, who was superior to dog X, so there’s no telling which of X and Y is truly toughest.” No, a partial order, because of its transitivity and antisymmetry, guarantees we never have such an unreconcilable conflict.

However, we could have a lack of information. Suppose Rex has never met Killer, and nobody Rex has met has ever met anyone Killer has met. There’s no chain between them. They’re in two separate universes as far as we’re concerned, and we’d have no way of knowing which was toughest. It doesn’t have to be that extreme, though. Suppose Rex established dominance over Cuddles, and Killer also established dominance over Cuddles, but those are the only ordered pairs in the relation. Again, there’s no way to tell whether Rex or Killer is the tougher dog. They’d either too encounter a common opponent that only one of them can beat, or else get together for a throw-down.

So a partial order gives us some semblance of structure — the relation establishes a directionality, and we’re guaranteed not to get wrapped up in contradictions — but it doesn’t completely order all the elements. If it does, it’s called a total order.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 68

Context: # 4.1 Outcomes and events

Since life is uncertain, we don’t know for sure what is going to happen. But let’s start by assuming we know what things might happen. Something that might happen is called an outcome. You can think of this as the result of an experiment if you want to, although normally we won’t be talking about outcomes that we have explicitly manipulated and measured via scientific means. It’s more like we’re just curious how some particular happening is going to turn out, and we’ve identified the different ways it can turn out and called them outcomes.

Now we’ve been using the symbol \( \Omega \) to refer to “the domain of discourse” or “the universal set” or “all the stuff we’re talking about.” We’re going to give it yet another name now: the sample space. \( \Omega \), the sample space, is simply the *set of all possible outcomes*. Any particular outcome — call it \( O \) — is an element of this set, just like in Chapter 1 every conceivable element was a member of the domain of discourse.

If a woman is about to have a baby, we might define \( \Omega \) as \{ boy, girl \}. Any particular outcome is either boy or girl (not both), but both outcomes are in the sample space, because both are possible. If we roll a die, we’d define \( \Omega \) as \{ 1, 2, 3, 4, 5, 6 \}. If we’re interested in motor vehicle safety, we might define \( \Omega \) for a particular road trip as \{ safe, accident \}. The outcomes don’t have to be equally likely; an important point we’ll return to soon.

In probability, we define an event as a *subset of the sample space*. In other words, an event is a group of related outcomes (though an event might contain just one outcome, or even zero). I always thought this was a funny definition for the word “event”: it’s not the first thing that word brings to mind. But it turns out to be a useful concept, because sometimes we’re not interested in any particular outcome necessarily, but rather in whether the outcome — whatever it is — has a certain property. For instance, suppose at the start of some game, my opponent and I each roll the die, agreeing that the highest roller gets to go first. Suppose he rolls a 2. Now it’s my turn. The \( \Omega \) for my die roll is of course \{ 1, 2, 3 \}.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 94

Context: # CHAPTER 5. STRUCTURES

## 5.1 Graphs

In many ways, the most elegant, simple, and powerful way of representing knowledge is by means of a graph. A graph is composed of a bunch of little bits of data, each of which may (or may not) be attached to each of the others. An example is in Figure 5.1. Each of the labeled ovals is called a **vertex** (plural: **vertices**), and the lines between them are called **edges**. Each vertex does, or does not, contain an edge connecting it to each other vertex. One could imagine each of the vertices containing various descriptive attributes — perhaps the **John Wilkes Booth** vertex could have information about Booth's birthdate, and **Washington, DC** information about its longitude, latitude, and population — but these are typically not shown on the diagram. All that really matters, graph-wise, is which vertices it contains, and which ones are joined to which others.

```
                President
                  |
          Abraham Lincoln
                  |
          Washington, DC
                  |
           John Wilkes Booth
                 /  \
      Ford's Theatre   Civil War
                     |
                  Actor
                    |
                 Gettysburg
```

Figure 5.1: A graph (undirected).

Cognitive psychologists, who study the internal mental processes of the mind, have long identified this sort of structure as the principal way that people mentally store and work with information. After all, if you step back a moment and ask “what is the ‘stuff’ that’s in my memory?” a reasonable answer is “well I know about a bunch of things, and the properties of those things, and the relationships between those things.” If the “things” are vertices, and the “properties” are attributes of those vertices, and the “relationships” are
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 99

Context: # 5.1. GRAPHS

A graph of dependencies like this must be both **directed** and **acyclic**, or it wouldn’t make sense. Directed, of course, means that task X can require task Y to be completed before it, without the reverse also being true. If they both depended on each other, we’d have an infinite loop, and no brownies could ever get baked! Acyclic means that no kind of cycle can exist in the graph, even one that goes through multiple vertices. Such a cycle would again result in an infinite loop, making the project hopeless. Imagine if there were an arrow from **bake for 30 mins** back to **grease pan** in Figure 5.4. Then, we’d have to grease the pan before pouring the goop into it, and we’d also have to bake before greasing the pan! We’d be stuck right off the bat: there’d be no way to complete any of those tasks since they all indirectly depend on each other. A graph that is both directed and acyclic (and therefore free of these problems) is sometimes called a **DAG** for short.

```
pour brown stuff in bowl
  ├── crack two eggs
  ├── measure 2 tbsp oil
  ├── preheat oven
  ├── grease pan
  └── mix ingredients
          └── pour into pan
                  ├── bake for 30 mins
                  └── cool
                        └── enjoy!
```

*Figure 5.4: A DAG.*
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 101

Context: # 5.1. GRAPHS

## Relationship to Sets

We seem to have strayed far afield from sets with all this graph stuff. But actually, there are some important connections to be made to those original concepts. Recall the wizards set \( A \) from chapter 3 that we extended to contain \{ Harry, Ron, Hermione, Neville \}. Now consider the following endorelation on \( A \):

- (Harry, Ron)
- (Ron, Harry)
- (Ron, Hermione)
- (Ron, Neville)
- (Hermione, Hermione)
- (Neville, Harry)

This relation, and all it contains, is represented faithfully by the graph in Figure 5.6. The elements of \( A \) are the vertices, of course, and each ordered pair of the relation is reflected in an edge of the graph. Can you see how exactly the same information is represented by both forms?

![Figure 5.6: A graph depicting a endorelation.](path_to_image)

Figure 5.6 is a directed graph, of course. What if it were an undirected graph? The answer is that the corresponding relation would be symmetric. An undirected graph implies that if there's an edge between two vertices, it goes "both ways." This is very identical to saying a relation is symmetric: if an \( (x,y) \) is in the relation, then the corresponding \( (y,x) \) must also be. An example is Figure 5.7, which depicts the following symmetric relation:

![Figure 5.7: A graph depicting a symmetric relation.](path_to_image)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 103

Context: # 5.1 GRAPHS

{ Harry } and { Ron, Hermione, Neville }.

In other words, the “connectedness” of a graph can be represented precisely as a partition of the set of vertices. Each connected subset is in its own group, and every vertex is in one and only one group; therefore, these isolated groups are mutually exclusive and collectively exhaustive. Cool.

## Graph Traversal

If you had a long list — perhaps of phone numbers, names, or purchase orders — and you needed to go through and do something to each element of the list — dial all the numbers, scan the list for a certain name, and add up all the orders — it’d be pretty obvious how to do it. You just start at the top and work your way down. It might be tedious, but it’s not confusing.

Iterating through the elements like this is called **traversing** the data structure. You want to make sure you encounter each element once (and only once) so you can do whatever needs to be done with it. It’s clear how to traverse a list. But how to traverse a graph? There is no obvious “first” or “last” node, and each one is linked to potentially many others. And as we’ve seen, the vertices might not even be fully connected, so a traversal path through all the nodes might not even exist.

There are two different ways of traversing a graph: **breadth-first** and **depth-first**. They provide different ways of exploring the nodes, and as a side effect, each is able to discover whether the graph is connected or not. Let’s look at each in turn.

### Breadth-First Traversal

With breadth-first traversal, we begin at a starting vertex (it doesn’t matter which one) and explore the graph cautiously and delicately. We probe equally deep in all directions, making sure we’ve looked a little ways down each possible path before exploring each of those paths a little further.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 110

Context: # CHAPTER 5. STRUCTURES

## Figure 5.11: The stages of depth-first traversal 

Marked nodes are grey, and visited nodes are black. The order of visitation is: G, C, A, J, B, K, F, E, I.

|     |     |     |
|-----|-----|-----|
| 1   | 2   | 3   |
| Stack: G     | Stack: Cl   | Stack: Alb   |
| Stack: Lif   | Stack: Ken   | Stack: Bus   |
| 4   | 5   | 6   |
| Stack: Lif   | Stack: Eli   | Stack: Ber   |
| Stack: Den   | Stack: Eli   | Stack: Bin   |
| 7   | 8   | 9   |
| Stack: Fal   | Stack: Ter   | Stack: Iz   |
| Stack: Bec   | Stack: End   | Stack: Ray   |
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 111

Context: # 5.1. GRAPHS

and is widely used today by, among other things, network routing protocols.

Consider Figure 5.12, a simplified map of France circa November 1944. Fresh U.S. troops are arriving by ship at the port town of Bordeaux, and need to reach Strasbourg as quickly as possible to assist the Allies in pushing Nazi squadrons back into Germany. The vertices of this graph are French cities, and the edge weights represent marching distances in kilometers. Although D-Day was successful, the outcome of the War may depend on how quickly these reinforcements can reach the front.

```
         350
     Lille ---- 500
      |           |
  300 |           | 400
      |           |
   Paris ----- Strasbourg
      |           |
  150 |           |
      |           |
   Nantes ---- 650
      |           
  200 |           
      |           
   Bordeaux             
      |           
  300 |           
      |           
 Toulouse ---- 400
      |           
  450 |           
      |           
   Lyon ----- 150
      |          
  250 |          
      |          
 Marseille ---- Nice
```

**Figure 5.12:** A weighted graph, through which we desire to find the shortest path from Bordeaux to Strasbourg.

The question, obviously, is which path the troops should take so as to reach Strasbourg the soonest. With a graph this small, you might be able to eyeball it. (Try it!) But Dijkstra's algorithm systematically considers every possible path and is guaranteed to find the one with the shortest total distance.

The way it works is to assign each node a tentative lowest distance, along with a tentative path from the start node to it. Then, if the algorithm encounters a different path to the same node as it progresses, it updates this tentative distance with the new, lower value.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 118

Context: ```markdown
# Chapter 5: Structures

## Figure 5.14
The stages of Prim's minimal connecting edge set algorithm. Heavy lines indicate edges that have been (irrevocably) added to the set.

| Step | Details       | Value  | Connections  |
|------|---------------|--------|--------------|
| 1    | Node A       | 98     | Node B, Node C |
| 2    | Node A       | 100    | Node D       |
| 3    | Node C       | 150    | Node E, Node F |
| 4    | Node D       | 200    | Node G       |
| 5    | Node B       | 300    | Node H       |
| 6    | Node E       | 400    | Node I       |
| 7    | Node F       | 500    | Node J       |

...
```
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 121

Context: ## 5.2. TREES

Instead, we would have a different rooted tree, depicted in the right half of the figure. Both of these rooted trees have all the same edges as the free tree did: B is connected to both A and C, F is connected only to A, etc. The only difference is which node is designated the root.

```
    A
   / \
  F   B
     / \
    C   D
         \
          E
```

```
    C
   / \
  B   D
 /     \
A       E
 \
  F
```

**Figure 5.16:** Two different rooted trees with the same vertices and edges.

Up to now we’ve said that the spatial positioning on graphs is irrelevant. But this changes a bit with rooted trees. Vertical positioning is our only way of showing which nodes are “above” others, and the word “above” does indeed have meaning here: it means closer to the root. The altitude of a node shows how many steps it is away from the root. In the right rooted tree, nodes B, D, and E are all one step away from the root (C), while node F is three steps away.

The key aspect to rooted trees — which is both their greatest advantage and greatest limitation — is that every node has one and only one path to the root. This behavior is inherited from free trees: as we noted, every node has only one path to every other.

Trees have a myriad of applications. Think of the files and folders on your hard drive: at the top is the root of the filesystem (perhaps “/” on Linux/Mac or “C:\” on Windows) and underneath that are named folders. Each folder can contain files as well as other named folders, and so on down the hierarchy. The result is that each file has one, and only one, distinct path to it from the top of the filesystem. The file can be stored, and later retrieved, in exactly one way.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 122

Context: # Chapter 5. Structures

An "org chart" is like this: the CEO is at the top, then underneath her are the VPs, the Directors, the Managers, and finally the rank-and-file employees. So is a military organization: the Commander in Chief directs generals, who command colonels, who command majors, who command captains, who command lieutenants, who command sergeants, who command privates.

The human body is even a rooted tree of sorts: it contains skeletal, cardiovascular, digestive, and other systems, each of which is comprised of organs, then tissues, then cells, molecules, and atoms. In fact, anything that has this sort of part-whole containment hierarchy is just asking to be represented as a tree.

In computer programming, the applications are too numerous to name. Compilers scan code and build a "parse tree" of its underlying meaning. HTML is a way of structuring plain text into a tree-like hierarchy of displayable elements. All chess programs build trees representing their possible future moves and their opponent's probable responses, in order to "see many moves ahead" and evaluate their best options. Object-oriented designs involve "inheritance hierarchies" of classes, each one specialized from a specific other. Etc. Other than a simple sequence (like an array), trees are probably the most common data structure in all of computer science.

## Rooted tree terminology

Rooted trees carry with them a number of terms. I'll use the tree on the left side of Figure 5.16 as an illustration of each:

1. **Root**: The top of the tree, which is an example. Note that unlike trees in the real world, computer science trees have their root at the top and grow down. Every tree has a root except the empty tree, which is the "tree" that has no nodes at all in it. (It's kind of weird thinking of "nothing" as a tree, but it's kind of like the empty set, which is still a set.)

2. **Parent**: Every node except the root has one parent: the node immediately above it. D's parent is C, C's parent is B, F's parent is A.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 123

Context: # 5.2 TREES

parent is A, and A has no parent.

## child

Some nodes have children, which are nodes connected directly below it. A’s children are F and B, C’s are D and E, B’s only child is C, and E has no children.

## sibling

A node with the same parent. E’s sibling is D, B’s is F, and none of the other nodes have siblings.

## ancestor

Your parent, grandparent, great-grandparent, etc., all the way back to the root. B’s only ancestor is A, while E’s ancestors are C, B, and A. Note that F is not C’s ancestor, even though it’s above it on the diagram; there’s no connection from C to F, except back through the root (which doesn’t count).

## descendant

Your children, grandchildren, great-grandchildren, etc., all the way to the leaves. B’s descendants are C, D, and E, while A’s are F, B, C, D, and E.

## leaf

A node with no children. F, D, and E are leaves. Note that in a (very) small tree, the root could itself be a leaf.

## internal node

Any node that’s not a leaf. A, B, and C are the internal nodes in our example.

## depth (of a node)

A node’s depth is the distance (in number of nodes) from it to the root. The root itself has depth zero. In our example, B is of depth 1, E is of depth 3, and A is of depth 0.

## height (of a tree)

A rooted tree’s height is the maximum depth of any of its nodes; i.e., the maximum distance from the root to any node. Our example has a height of 3, since the “deepest” nodes are D and E, each with a depth of 3. A tree with just one node is considered to have a height of 0. Bizarrely, but to be consistent, we’ll say that the empty tree has height -1! Strange, but what else could it be? To say it has height 0 seems inconsistent with a one-node tree also having height 0. At any rate, this won’t come up much.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 127

Context: # 5.2 TREES

It’s the same as pre-order, except that we visit the root after the children instead of before. Still, despite its similarity, this has always been the trickiest one for me. Everything seems postponed, and you have to remember what order to do it in later.

For our sample tree, the first node visited turns out to be L. This is because we have to postpone visiting G until we finish its left (and right) subtrees; then we postpone K until we finish its left (and right) subtree; postpone D until we’re done with O’s subtree, and postpone O until we do I. Then finally, the thing begins to unwind... all the way back up to K. But we can’t actually visit K itself yet, because we have to do its right subtree. This results in C, E, and M, in that order. Then we can do K, but we still can’t do G because we have its whole right subtree’s world to contend with. The entire order ends up being: I, O, D, C, E, M, K, A, F, L, N, B, H, and finally G. (See Figure 5.19 for a visual.)

Note that this is not remotely the reverse of the pre-order visitation, as you might expect. G is last instead of first, but the rest is all jumbled up.

![Figure 5.19: The order of node visitation in post-order traversal.](#)

1. G
2. H
3. D
4. M
5. A
6. B
7. C
8. E
9. F
10. N
11. I
12. J
13. K
14. L
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 131

Context: # 5.2. TREES

anced: every level is completely filled. Figure 5.22 is not perfect, but it would be if we either added nodes to fill out level 4, or deleted the unfinished part of level 3 (as in Figure 5.23).

```
        G
      /   \
     K     H
    / \   / \
   D   M A   B
  / \   / \
 O   C E   I
        \
         F
          \
           N
```
Figure 5.23: A “perfect” binary tree.

Perfect binary trees obviously have the strictest size restrictions. It’s only possible, in fact, to have perfect binary trees with \( 2^{h} - 1 \) nodes, if \( h \) is the height of the tree. So there are perfect binary trees with 1, 3, 7, 15, 31, ... nodes, but none in between. In each such tree, \( 2^{h} \) of the nodes (almost exactly half) are leaves.

Now as we’ll see, binary trees can possess some pretty amazing powers if the nodes within them are organized in certain ways. Specifically, a binary search tree and a heap are two special kinds of binary trees that conform to specific constraints. In both cases, what makes them so powerful is the rate at which a tree grows as nodes are added to it.

Suppose we have a perfect binary tree. To make it concrete, let’s say it has height 3, which would give it \( 1+2+4+8=15 \) nodes, 8 of which are leaves. Now what happens if you increase the height of this tree to 4? If it’s still a “perfect” tree, you will have added 16 more nodes (all leaves). Thus, you have doubled the number of leaves by simply adding one more level. This cascades the more levels you add. A tree of height 5 doubles the number of leaves again (to 32), and height 6 doubles it again (to 64).
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 133

Context: # 5.2. TREES

## Binary Search Trees (BSTs)

Okay, then let’s talk about how to arrange those contents. A binary search tree (BST) is any binary tree that satisfies one additional property: every node is “greater than” all of the nodes in its left subtree, and “less than (or equal to)” all of the nodes in its right subtree. We’ll call this the **BST property**. The phrases “greater than” and “less than” are in quotes here because their meaning is somewhat flexible, depending on what we’re storing in the tree. If we’re storing numbers, we’ll use numerical order. If we’re storing names, we’ll use alphabetical order. Whatever it is we’re storing, we simply need a way to compare two nodes to determine which one “goes before” the other.

An example of a BST containing people is given in Figure 5.24. Imagine that each of these nodes contains a good deal of information about a particular person—an employee record, medical history, account information, what have you. The nodes themselves are indexed by the person’s name, and the nodes are organized according to the BST rule. Mitch comes after Ben/Jessica/Jim and before Randy/Olson/Molly/Xander in alphabetical order, and this ordering relationship between parents and children repeats itself all the way down the tree. (Check it!)

Be careful to observe that the ordering rule applies between a node and **the entire contents of its subtrees**, not merely to its immediate children. This is a rookie mistake that you want to avoid. Your first instinct, when glancing at Figure 5.25, below, is to judge it a BST. It is not a binary search tree, however! Jessica is to the left of Mitch, as she should be, and Nancy is to the right of Jessica, as she should be. It seems to check out. But the problem is that Nancy is a descendant of Mitch’s left subtree, whereas she must properly be placed somewhere in his right subtree. And, this matters. So be sure to check your BSTs all the way up and down.

## The Power of BSTs

All right, so what’s all the buzz about BSTs, anyway? The key insight is to realize that if you’re looking for a node, all you have to do is start at the root and go the height of the tree down, making...
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 135

Context: # 5.2 TREES

![Figure 5.25: NOT a binary search tree, though it looks like one at first glance. (Notice Nancy and Mitch)](path/to/image)

**Possible?** The trick is to realize that with every node you look at, you effectively eliminate half of the remaining tree from consideration. For instance, if we're looking for Molly, we can disregard Mitch's entire left half without even looking at it, then the same for Randi's entire right half. If you discard half of something, then half of the remaining half, it doesn’t take you long before you’ve eliminated almost every false lead.

There’s a formal way to describe this speedup, called **Big-O notation**. The subtleties are a bit complex, but the basic idea is this. When we say that an algorithm is **O(n)** (pronounced "oh-of-n"), it means that the time it takes to execute the algorithm is *proportional* to the number of nodes. This doesn’t imply any specific number of milliseconds or anything — that is highly dependent on the type of computer hardware, the programming language, and a myriad of other things. But what we can say about an **O(n)** algorithm is that if you double the number of nodes, you’re going to approximately double the running time. If you quadruple the number of nodes, you’re going to quadruple the running time. This is what you’d expect.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 137

Context: 5.2. TREES
==========

![Figure 5.26: The BST after adding Jennifer.](image_link_here)

This allows us to look at the tree and reconstruct some of what came before. For instance, we know that Mitch must have been the first node originally inserted, and that Randi was inserted before Owen, Xander, or Molly. As an exercise, add your own name to this tree (and a few of your friends' names) to make sure you get the hang of it. When you're done, the tree must of course obey the BST property.

## Removing nodes from a BST

Removing nodes is a bit trickier than adding them. How do we delete an entry without messing up the structure of the tree? It’s easy to see how to delete Molly: since she’s just a leaf, just remove her and be done with it. But how to delete Jessica? Or for that matter, Mitch?

Your first inclination might be to eliminate the node and promote one of its children to go in its place. For instance, if we delete Jessica, you might think we could elevate Ben up to where Jessica was, and then move Jennifer up under Ben as well. This doesn’t work, though. The result would look like Figure 5.27, with Jennifer in the wrong place. The next time we look for Jennifer in the tree, we’ll search to the right of Ben (as we should), completely.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 142

Context: # CHAPTER 5. STRUCTURES

Thirdly, there are specialized data structures you may learn about in future courses, such as AVL trees and red-black trees, which are binary search trees that add extra rules to prevent imbalancing. Basically, the idea is that when a node is inserted (or removed), certain metrics are checked to make sure that the change didn’t cause too great an imbalance. If it did, the tree is adjusted to minimize the imbalance. This comes at a slight cost every time the tree is changed, but prevents any possibility of a lopsided tree that would cause slow lookups in the long run.

## 5.3 Final Word

Whew, that was a lot of information about structures. Before we continue our walk in the next chapter with a completely different topic, I’ll leave you with this summary thought. Let \( BST \) be the set of Binary Search Trees, and \( BT \) be the set of Binary Trees. Let \( RT \) be the set of rooted trees, and \( T \) be the set of trees (free or rooted). Finally, let \( CG \) be the set of connected graphs, and \( G \) the set of all graphs. Then we have:

\[ 
BST \subseteq BT \subseteq RT \subseteq CG \subseteq G 
\]

It’s a beautiful thing.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 143

Context: # 5.4 Exercises

1. How many vertices are there in the graph below?  
   ![Graph](link_to_graph_image)

2. How many edges are there?  
   7.

3. What’s the degree of vertex B?  
   3.

4. Is this graph directed?  
   No. (No arrowheads on the lines.)

5. Is this graph connected?  
   No – there is no path from A, B, E, or F to either C or D.

6. Is this graph weighted?  
   No. (No numbers annotating the edges.)

7. Is it a tree?  
   No. (A tree must be connected and must also have no cycles, which this graph clearly does: e.g., B–to–A–to–E–to–B.)

8. Is it a DAG?  
   Not remotely; it is neither directed nor acyclic.

9. If this graph represented an undirected relation, how many ordered pairs would it have?  
   14. (If you said 7, remember that since there are no arrowheads on the lines, this is an undirected graph, which corresponds to a symmetric relation, and hence both (A, E) and (E, A) will be present.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 147

Context: # 5.4. EXERCISES

25. **Is it a binary search tree?**  
   No. Although nearly every node does satisfy the BST property (all the nodes in its left subtree come before it alphabetically, and all the nodes in its right subtree come after it), there is a single exception: Zoe is in Wash's left subtree, whereas she should be to his right.  
   Many ways; one would be to swap Zoe's and Wash's positions. If we do that, the fixed tree would be:

   ```
       Mal
      /   \
    Jayne  Zoe
    /  \   /
  Inara Kayle
           \
           River
             \
             Simon
               \
               Wash
   ```

26. **How could we fix it?**  
   Take a moment and convince yourself that every node of this new tree does in fact satisfy the BST property. It's not too bad, but it does have one too many levels in it (it has a height of 4, whereas all its nodes would fit in a tree of height 3).

27. **Is the tree balanced?**  
   Many ways; one would be to rotate the River-Simon-Wash subtree so that Simon becomes Zoe's left child. Simon would then be the parent of River (on his left) and Wash (on his right).

28. **How could we make it more balanced?**  
   Suggested approaches include reviewing the structure to ensure the depths of all leaves are as equal as possible or implementing rotations to balance out the heights of subtrees.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 149

Context: # Chapter 6

## Counting

If the title of this chapter seems less than inspiring, it’s only because the kind of counting we learned as children was mostly of a straightforward kind. In this chapter, we’re going to learn to answer some more difficult questions like “how many different semester schedules could a college student possibly have?” and “how many different passwords can a customer choose for this e-commerce website?” and “how likely is this network buffer to overflow, given that its packets are addressed to three different destinations?”

The more impressive-sounding name for this topic is **combinatorics**. In combinatorics, we focus on two tasks: counting things (to find out how many there are), and enumerating things (to systematically list them as individuals). Some things turn out to be hard to count but easy to enumerate, and vice versa.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 158

Context: # CHAPTER 6. COUNTING

and so on. Once we have the **RISK** permutations, we can generate the **BRISK** permutations in the same way:

| B | R | I | S | K |
|---|---|---|---|---|
| R | B | I | S | K |
| R | I | B | S | K |
| R | I | S | B | K |
| B | I | R | S | K |
| T | B | R | I | S | K |
| I | R | B | S | K |
| I | R | S | K |
| I | S | R | K |
| B | S | K |
| . . . |

Another algorithm to achieve the same goal (though in a different order) is as follows:

## Algorithm #2 for enumerating permutations

1. Begin with a set of **n** objects.

   a. If **n = 1**, there is only one permutation; namely, the object itself.
   
   b. Otherwise, remove each of the objects in turn, and prepend that object to the permutations of all the others, creating another permutation each time.

I find this one a little easier to get my head around, but in the end it’s personal preference. The permutations of **BRISK** are: “B followed by all the permutations of **RISK**, plus **I** followed by all the permutations of **BISK**, plus **R** followed by all the permutations of **BRSK**, etc.” So the first few permutations of a 4-letter word are:

- R I S K
- R I K S
- R S I K
- R S K I
- R K I S
- R K S I
- I R S K
- I R K S
- I S R K
- I S K R
- I K R S
- I K S R
- S R I K
- S R K I
- S I R K
- S I K R
- S K R I
- S K I R
- K R I S
- K R S I
- K I R S
- K I S R
- K S R I
- K S I R
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 167

Context: # 6.4 Summary

Most of the time, counting problems all boil down to a variation of one of the following three basic situations:

- \( n^k \) — this is when we have \( k \) different things, each of which is free to take on one of \( n \) completely independent choices.

- \( \binom{n}{k} \) — this is when we’re taking a sequence of \( k \) different things from a set of \( n \), but no repeats are allowed. (A special case of this is \( n! \), when \( k = n \).)

- \( \binom{n}{k} \) — this is when we’re taking \( k \) different things from a set of \( n \), but the order doesn’t matter.

Sometimes it’s tricky to deduce exactly which of these three situations apply. You have to think carefully about the problem and ask yourself whether repeated values would be allowed and whether it matters what order the values appear in. This is often subtle.

As an example, suppose my friend and I work out at the same gym. This gym has 18 different weight machines to choose from, each of which exercises a different muscle group. Each morning,
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 171

Context: # 6.5. EXERCISES

1. To encourage rivalry and gluttony, we're going to give a special certificate to the child who collects the most candy at the end of the night. And while we're at it, we'll give 2nd-place and 3rd-place certificates as well. How many different ways could our 1st-2nd-3rd contest turn out?  
   **Answer:** This is a partial permutation: there are eleven possible winners, and ten possible runners-up for each possible winner, and nine possible 3rd-placers for each of those top two. The answer is therefore 114, or 900. Wow! I wouldn’t have guessed that high.

2. Finally, what if we want every kid to get a certificate with their name and place-of-finish on it? How many possibilities? (Assume no ties.)  
   **Answer:** This is now a full-blown permutation: 111. It comes to 30,916,800 different orders-of-finish, believe it or not. I told you: this counting stuff can explode fast.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 203

Context: # 7.5 EXERCISES

24. What’s the binary number `1011001110110100` in hexadecimal? Or is that too hard a question to eyeball?  
   **Answer:** Naw, it’s easy. By inspection, it’s B3A4, since each of the four 4-bit nibbles goes one-for-one with a hex digit. (You can look up nibble values on p. 181 if you want, but again it’s definitely worth memorizing.)

25. What’s the binary number `1011001110110100` in decimal? Or is that too hard a question to eyeball?  
   **Answer:** Ugh. Ain’t nobody got time for that.

26. What’s the hex number `F4C3` in decimal? Or is that too hard a question to eyeball?  
   **Answer:** Too hard.

27. What’s the hex number `F4C3` in binary? Or is that too hard a question to eyeball?  
   **Answer:** Simple: `1111010011000011`. Read it right off the chart (p. 181).

28. If I told you that the bit pattern `1010` was meant to represent an unsigned number, what value would it represent?  
   **Answer:** Ten. (8 + 2 = 10).

29. If I told you that the bit pattern `1010` was meant to represent a sign-magnitude number, what value would it represent?  
   **Answer:** Negative two. The left-most bit is 1, so it’s negative; and the remaining bits are `010`, which when interpreted in binary are the number 2.

30. If I told you that the bit pattern `1010` was meant to represent two's complement number, what value would it represent?  
   **Answer:** Negative six. The left-most bit is 1, so it’s negative. This means in order to figure out the value, we have to flip all the bits and add one. Flipping the bits yields `0101`, and adding one to that gives `0110` (we had to do one carry). Since the binary number `010` is positive, that must mean that what we started with `1010` must be negative six.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 204

Context: I'm unable to process images directly. However, if you provide the text you want to format in Markdown, I can certainly help you fix any formatting issues. Please paste the text here.

Image Analysis: 

### Image Analysis Report

**Localization and Attribution:**
- The image appears to be a single diagram located at the center of the page.
- It is referred to as **Image 1**.

**Object Detection and Classification:**
- **Objects Detected:**
  - Boxes/rectangles connected by arrows.
  - Icons inside some boxes.

**Scene and Activity Analysis:**
- **Scene Description:**
  - The image is a flowchart depicting a process. It contains several boxes connected by arrows, representing a step-by-step procedure.
- **Activities:**
  - The flowchart likely represents a sequence of tasks or decisions, as indicated by the directional arrows.

**Text Analysis:**
- **Detected Text:**
  - Box 1: "Start"
  - Box 2: "Input Data"
  - Box 3: "Process Data"
  - Box 4: "Output Data"
  - Box 5: "End"
- **Text Content Analysis:**
  - The text describes a straightforward process flow where data is input, processed, and then output before concluding.

**Diagram and Chart Analysis:**
- **Diagram Analysis:**
  - The flowchart is visually structured to guide the reader from the starting point of a process to its conclusion.
  - Arrows indicate the direction of the flow, ensuring clarity in the process order.
- **Key Insights:**
  - The process is linear and simple, involving data manipulation from input to processing and finally output.

**Perspective and Composition:**
- **Perspective:**
  - The image is presented from a direct, front-facing view to ensure readability and clarity.
- **Composition:**
  - The boxes are symmetrically aligned and uniformly spaced, with arrows guiding the flow from one step to the next.

**Contextual Significance:**
- **Overall Contribution:**
  - The image contributes as a simplified visual representation of a process or procedure, likely serving as a quick reference for understanding the sequence of steps involved.

**Tables:**
- **Tabular Data Analysis:**
  - No tables are included in the image.

### Summary

**Image 1** features a flowchart that describes a step-by-step process involving the input, processing, and output of data, concluding with an end step. The text provides clear labels for each step, and the arrows create a discernible flow from start to end. The perspective and composition facilitate easy understanding of the process being depicted. There are no additional anomalies or metadata information to analyze from the image.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 212

Context: ```
# CHAPTER 8. LOGIC

| X | Y | X∨Y | X•Y | ¬X | X=Y | X≠Y |
|---|---|-----|-----|----|-----|-----|
| 0 | 0 |  0  |  0  |  1 |  1  |  1  |
| 0 | 1 |  1  |  0  |  1 |  0  |  1  |
| 1 | 0 |  1  |  0  |  0 |  0  |  1  |
| 1 | 1 |  1  |  1  |  0 |  1  |  0  |

Take a moment and look carefully through the entries in that table, and make sure you agree that this correctly represents the outputs for the five operators. (Note that ¬, being a unary operator, only has X as an input, which means that the value of Y is effectively ignored for that column.)

Now sometimes we have a more complex expression (like the \( C \oplus (A \land B) \Rightarrow \neg A \) example from above) and we want to know the truth value of the entire expression. Under what circumstances — i.e., for what truth values of A, B, and C — is that expression true? We can use truth tables to calculate this piece by piece.

Let’s work through that example in its entirety. First, we set up the inputs for our truth table:

| A | B | C |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |

In this case, there are three inputs to the expression (A, B, and C) and so we have \( 2^3 \), or eight, rows in the truth table.

Now we work our way through the expression inside out, writing down the values of intermediate parts of the expression. We need to know the value of ¬B to figure some other things out, so let’s start with that one:
```
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 254

Context: ```
# CHAPTER 9. PROOF

- Cantor, Georg, 7, 12, 17
- capacity (of a byte), 182
- cardinality (of sets), 16, 25, 28, 66
- Carroll, Lewis, 227
- carry-in, 189
- carry-out, 189
- Cartesian product (of sets), 
  - 19, 35
- chess, 114
- child (of a node), 115
- closed interval, 61
- codomain (of a function), 45
- collectively exhaustive, 26
- combinations, 154
- combinators, 25, 141
- commutative, 18, 20, 71
- compilers, 114
- complement laws (of sets), 
  - 21
- complement, partial (of sets), 
  - 18
- complement, total (of sets), 
  - 18, 65, 146, 162
- complete binary tree, 121
- conclusion (of implication), 200
- conditional probability, 
  - 68, 72, 74, 78
- congruent, 173
- conjunction, 199, 208
- connected (vertices/graphs), 
  - 89, 95
- coordinates, 15
- curly brace notation, 11
- cycles, 90

## DAGs (directed acyclic graphs), 90

- data structures, 85
- Davies family, 8, 9, 26, 147, 154
- De Morgan's laws, 21, 22, 207, 208
- decimal numbers, 165, 169, 173, 178
- degree (of a vertex), 90
- depth (of a node), 115
- dequeuing, 95
- descendant (of a node), 115
- DFT (depth-first traversal), 99, 101

## Dijkstra's algorithm, 101, 104
- Dijkstra, Edsger, 101
- direct proof, 227
- directed graphs, 88, 91
- disjunction, 199, 208, 226
- disjunctive syllogism, 226
- disk sectors, 156
- distributive, 20, 208, 227
- domain (of a function), 45
- domain of discourse (?), 9, 
  - 19, 21, 27, 60, 210
- domination laws (of sets), 21
- dominors, 234
- drinking age, 232, 234
- duplicates (in sets), 13
- edges, 86, 87
- elements (of sets), 8, 15, 23
- ellipses, 12
- empty graph, 87
- empty set, 9, 16, 21, 24, 25, 
  - 36, 114
- endorelations, 38, 93
```
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 257

Context: 9.4. FINAL WORD
=================

- ordered triples, 15
- org charts, 113
- outcomes, 60, 62
- overflow, 188

P = NP?, 244  
parent (of a node), 114  
partial orders, 43  
partial permutations, 151, 154  
partitions, 26, 71, 94  
Pascal's triangle, 157  
passwords, 146  
paths (in a graph), 88, 113  
perfect binary tree, 122, 239  
permutations, 147  
PINs, 143  
poker, 160  
pop (off a stack), 99  
posts, 43  
post-order traversal, 118  
postulates, 226  
power sets, 24, 36  
pre-order traversal, 117  
predicates, 210, 211, 232  
predicate logic, 210  
premise (of implication), 200  
Prim's algorithm, 107  
Prime, Robert, 107  
prior probability, 68  
probability measures, 61, 63, 65  
product operator (II), 142, 160  
proof, 223  
proof by contradiction, 229  
propositional logic, 197, 225  
propositions, 197, 210  

- psychology, 70, 86
- push (on a stack), 99
- quantifiers, 212, 215
- queue, 95, 97
- quotient, 173, 174

- range (of a function), 48
- rational numbers (ℝ), 17, 24
- reachable, 89
- real numbers (ℝ), 71, 94
- rebalancing (a tree), 132
- recursion, 116, 120, 149, 231
- red-black trees, 133
- reflexive (relation), 40, 43
- relations, finite, 39
- relations, infinite, 39
- remainder, 173, 174
- right child, 116  
- root (of a tree), 112, 114  
- rooted trees, 112, 134  
- Russell's paradox, 15  

- sample space (Ω), 60  
- semantic network, 87  
- set operators, 18  
- set-builder notation, 11  
- sets, 8, 93  
- sets of sets, 15  
- sets, finite, 12  
- sets, fuzzy, 10  
- sets, infinite, 12  
- sibling (of a node), 115  
- signs-magnitude binary numbers, 183, 189  

**Sonic the Hedgehog**, 73  
- southern states, 72  
- spatial positioning, 92, 113
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 4

Context: ```
6.6 Discriminative Models........................................159  
6.7 Autoencoders...............................................163  
6.8 Ranking*...................................................164  
6.9 Summary....................................................164  

IV Advanced Modelling and Inference..............................165  

7 Probabilistic Graphical Models................................166  
7.1 Introduction...............................................167  
7.2 Bayesian Networks..........................................170  
7.3 Markov Random Fields.......................................178  
7.4 Bayesian Inference in Probabilistic Graphical Models......182  
7.5 Summary....................................................185  

8 Approximate Inference and Learning............................186  
8.1 Monte Carlo Methods........................................187  
8.2 Variational Inference.......................................189  
8.3 Monte Carlo-Based Variational Learning*....................197  
8.4 Approximate Learning*......................................199  
8.5 Summary....................................................201  

V Conclusions...................................................202  

9 Concluding Remarks............................................203  

Appendices.......................................................206  

A Appendix A: Information Measures..............................207  
A.1 Entropy....................................................207  
A.2 Conditional Entropy and Mutual Information................210  
A.3 Divergence Measures.......................................212  

B Appendix B: KL Divergence and Exponential Family..........215  

Acknowledgements...............................................217  
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 7

Context: # Notation

- Random variables or random vectors – both abbreviated as rvs – are represented using roman typeface, while their values and realizations are indicated by the corresponding standard font. For instance, the equality \( x = x \) indicates that \( r \) takes value \( x \).
  
- Matrices are indicated using uppercase fonts, with roman typeface used for random matrices.
  
- Vectors will be taken to be in column form.
  
- \( X^T \) and \( X' \) are the transpose and the pseudoinverse of matrix \( X \), respectively.

- The distribution of a rv \( x \), either probability mass function (pmf) for discrete rv or probability density function (pdf) for continuous rvs, is denoted as \( p_x(x) \) or \( p(x) \).

- The notation \( x \sim p_x \) indicates that \( r \) is distributed according to \( p_x \).

- For jointly distributed rvs \( (x, y) \sim p_{x,y} \), the conditional distribution of \( x \) given the observation \( y = y \) is indicated as \( p_{x|y} \) or \( p(x|y) \).

- The notation \( x|y \sim p_{x|y} \) indicates that \( r \) is drawn according to the conditional distribution \( p_{x|y} \).

- The notation \( E_{p_{x|y}}[t] \) indicates the expectation of the argument with respect to the distribution of the rv \( x \sim p_x \). Accordingly, we will also write \( E_{x \sim p_y}[t] \) for the conditional expectation with respect to the distribution \( p_{y|x} \). When clear from the context, the distribution over which the expectation is computed may be omitted.

- The notation \( P_{x|y}(t) \) indicates the probability of the argument event with respect to the distribution of the rv \( x \sim p_x \). When clear from the context, we may omit the specification of the distribution.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 10

Context: # Acronyms

| Acronym | Meaning                                         |
|---------|------------------------------------------------|
| pmf     | probability mass function                       |
| PCA     | Principal Component Analysis                    |
| PPCA    | Probabilistic Principal Component Analysis      |
| QDA     | Quadratic Discriminant Analysis                |
| RBM     | Restricted Boltzmann Machine                   |
| SGD     | Stochastic Gradient Descent                     |
| SVM     | Support Vector Machine                          |
| rv      | random variable or random vector (depending on the context) |
| s.t.    | subject to                                     |
| VAE     | Variational AutoEncoder                        |
| VC      | Vapnik–Chervonenkis                            |
| VI      | Variational Inference                          |
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 11

Context: # Part I
## Basics
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 12

Context: # 1 Introduction

Having taught courses on machine learning, I am often asked by colleagues and students with a background in engineering to suggest “the best place to start” to get into this subject. I typically respond with a list of books – for a general, but slightly outdated introduction, read this book; for a detailed survey of methods based on probabilistic models, check this other reference; to learn about statistical learning, I found this text useful; and so on. This answer strikes me, and most likely also my interlocutors, as quite unsatisfactory. This is especially so since the size of many of these books may be discouraging for busy professionals and students working on other projects. This monograph is an attempt to offer a basic and compact reference that describes key ideas and principles in simple terms and within a unified treatment, encompassing also more recent developments and pointers to the literature for further study.

## 1.1 What is Machine Learning?

A useful way to introduce the machine learning methodology is by means of a comparison with the conventional engineering design flow.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 14

Context: # 1.2 When to Use Machine Learning?

Based on the discussion above, machine learning can offer an efficient alternative to the conventional engineering flow when development cost and time are the main concerns, or when the problem appears to be too complex to be studied in its full generality. On the flip side, the approach has the key disadvantages of providing generally suboptimal performance, or hindering interpretability of the solution, and to apply only to a limited set of problems.

In order to identify tasks for which machine learning methods may be useful, reference [31] suggests the following criteria:

1. The task involves a function that maps well-defined inputs to well-defined outputs.
2. Large data sets exist or can be created containing input-output pairs.
3. The task provides clear feedback with clearly definable goals and metrics.
4. The task does not involve long chains of logic or reasoning that depend on diverse background knowledge or common sense.
5. The task does not require detailed explanations for how the decision was made.
6. The task has a tolerance for error and no need for provably correct or optimal solutions.
7. The phenomenon or function being learned should not change rapidly over time.
8. No specialized dexterity, physical skills, or mobility is required.

These criteria are useful guidelines for the decision of whether machine learning methods are suitable for a given task of interest. They also offer a convenient demarcation line between machine learning as is intended today, with its focus on training and computational statistics tools, and more general notions of Artificial Intelligence (AI) based on knowledge and common sense [87] (see [126] for an overview on AI research).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 16

Context: # Introduction

representing the data in a smaller or more convenient space; and generative modelling, which is the problem of learning a generating mechanism to produce artificial examples that are similar to available data in the data set \( D \).

As a generalization of both supervised and unsupervised learning, **semi-supervised learning** refers to scenarios in which not all examples are labelled, with the unlabelled examples providing information about the distribution of the covariates \( x \).

## 3. Reinforcement learning

Reinforcement learning refers to the problem of inferring optimal sequential decisions based on rewards or punishments received as a result of previous actions. Under supervised learning, the “label” refers to an action to be taken when the learner is in an informational state about the environment given by a variable \( x \). Upon taking an action \( a \) in a state \( x \), the learner is provided with feedback on the immediate reward accrued via this decision, and the environment moves on to a different state. As an example, an agent can be trained to navigate a given environment in the presence of obstacles by penalizing decisions that result in collisions.

Reinforcement learning is hence neither supervised, since the learner is not provided with the optimal actions \( t \) to select in a given state \( x \); nor is it fully unsupervised, given the availability of feedback on the quality of the chosen action. Reinforcement learning is also distinguished from supervised and unsupervised learning due to the influence of previous actions on future states and rewards.

This monograph focuses on supervised and unsupervised learning. These general tasks can be further classified along the following dimensions:

- **Passive vs. active learning**: A passive learner is given the training examples, while an active learner can affect the choice of training examples on the basis of prior observations.
  
- **Offline vs. online learning**: Offline learning operates over a batch of training samples, while online learning processes samples in a streaming fashion. Note that reinforcement learning operates inherently in an online manner, while supervised and unsupervised learning can be carried out by following either offline or online formulations.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 17

Context: # 1.3 Goals and Outline

This monograph considers only passive and offline learning.

## 1.3 Goals and Outline

This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. Unlike other texts that are focused on one particular aspect of the field, an effort has been made here to provide a broad but concise overview in which the main ideas and techniques are systematically presented. Specifically, the material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra. A prior exposure to information theory is useful but not required.

Detailed discussions are provided on basic concepts and ideas, including overfitting and generalization, Maximum Likelihood and regularization, and Bayesian inference. The text also endeavors to provide intuitive explanations and pointers to advanced topics and research directions. Sections and subsections containing more advanced material that may be regarded at a first reading are marked with a star (+).

The reader will find here further discussions on computing platform or programming frameworks, such as map-reduce, nor details on specific applications involving large data sets. These can be easily found in a vast and growing body of work. Furthermore, rather than providing exhaustive details on the existing myriad solutions in each specific category, techniques have been selected that are useful to illustrate the most salient aspects. Historical notes have also been provided only for a few selected milestone events.

Finally, the monograph attempts to strike a balance between the algorithmic and theoretical viewpoints. In particular, all learning algorithms are presented in a manner that emphasizes their theoretical foundations while also providing practical insights.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 22

Context: ```markdown
variables \( t_n \) are assumed to be dependent on \( x_n \), and are referred to as dependent variables, labels, or responses. An example is illustrated in Fig. 2.1. We use the notation \( \mathbf{x}_D = (x_1, \ldots, x_N)^T \) for the covariates and \( \mathbf{t}_D = (t_1, \ldots, t_N)^T \) for the labels in the training set \( D \). Based on this data, the goal of supervised learning is to identify an algorithm to predict the label \( t \) for a new, that is, as of yet unobserved, domain point \( x \).

![Figure 2.1: Example of a training set \( D \) with \( N = 10 \) points \( (x_n, t_n) \), \( n = 1, \ldots, N \).](path_to_image)

The outlined learning task is clearly impossible in the absence of additional information on the mechanism relating variables \( z \) and \( t \). With reference to Fig. 2.1, unless we assume, say, that \( z \) and \( t \) are related by a function \( t = f(z) \) with some properties, such as smoothness, we have no way of predicting the label \( t \) for an unobserved domain point \( z \). This observation is formalized by the no free lunch theorem to be reviewed in Chapter 5: one cannot learn rules that generalize to unseen examples without making assumptions about the mechanism generating the data. The set of all assumptions made by the learning algorithm is known as the inductive bias.

This discussion points to a key difference between memorizing and learning. While the former amounts to mere retrieval of a value \( t_n \), 
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 50

Context: ```
44
# A Gentle Introduction through Linear Regression

Produces a description of approximately 
$$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} $$
bits. This description is, however, not sufficient, since the decoder of the description should also be informed about the parameters $(w_{ML}, \beta_{ML})$.

Using quantization, the parameters can be described by means of a number $C(M)$ of bits that is proportional to the number of parameters, here $M + 2$. Concatenating these bits with the description produced by the ML model yields the overall description length

$$ - \sum_{n=1}^{N} \log{p[y_n | x_n, w_{ML}, \beta_{ML}]} + C(M). \quad (2.39) $$

MDL – in the simplified form discussed here – selects the model order $M$ that minimizes the description length (2.39). Accordingly, the term $C(M)$ acts as a regularizer. The optimal value of $M$ for the MDL criterion is hence the result of the trade-off between the minimization of the overhead $C(M)$, which calls for a small value of $M$, and the minimization of the NLL, which decreases with $M$.

Under some technical assumptions, the overhead term can be often evaluated in the form 
$$ \left( \frac{K}{2} \right) N \log N + c, $$ 
where $K$ is the number of parameters in the model and $c$ is a constant. This expression is not quite useful in practice, but it provides intuition about the mechanism used by MDL to combat overfitting.

## 2.6 Information-Theoretic Metrics

We now provide a brief introduction to information-theoretic metrics by leveraging the example studied in this chapter. As we will see in the following chapters, information-theoretic metrics are used extensively in the definition of learning algorithms. Appendix A provides a detailed introduction to information-theoretic measures in terms of inferential tasks. Here we introduce the key metrics of Kullback-Leibler (KL) divergence and entropy by examining the asymptotic behavior of ML in the regime of large $N$. The case with finite $N$ is covered in Chapter 6 (see Sec. 6.4.3).
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 58

Context: # 3.1 Preliminaries

We start with a brief review of some definitions that will be used throughout the chapter and elsewhere in the monograph (see [28] for more details). Readers with a background in convex analysis and calculus may just review the concept of sufficient statistic in the last paragraph.

First, we define a **convex set** as a subset of \(\mathbb{R}^D\), for some \(D\), that contains all segments between any two points in the set. Geometrically, convex sets hence cannot have “indentations.” Function \(f(x)\) is convex if its domain is a convex set and if it satisfies the inequality \(f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)\) for all \(x_1\) and \(x_2\) in its domain and for all \(0 \leq \lambda \leq 1\). Geometrically, this condition says that the function is “U”-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve. A function is strictly convex if the inequality above is strict except for \(\lambda = 0\) or \(\lambda = 1\); a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.”

The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28].

We also introduce two useful concepts from calculus. The **gradient** of a differentiable function \(f(x)\) with \(x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D\) is defined as the \(D \times 1\) vector \(\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T\) containing all partial derivatives. At any point \(x\) in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian \(\nabla^2 f(x)\) is the \(D \times D\) matrix with \((i,j)\) element given by the second-order derivative \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). It captures the local curvature of the function.

1. A statistic is a function of the data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 80

Context: # Probabilistic Models for Learning

## 3.9 Summary

In this chapter, we have reviewed an important class of probabilistic models that are widely used as components in learning algorithms for both supervised and unsupervised learning tasks. Among the key properties of members of this class, known as the exponential family, are the simple form taken by the gradient of the log-likelihood (LL), as well as the availability of conjugate priors in the same family for Bayesian inference. 

An extensive list of distributions in the exponential family along with corresponding sufficient statistics, measure functions, log-partition functions, and mappings between natural and mean parameters can be found in [156]. More complex examples include the Restricted Boltzmann Machines (RBMs) to be discussed in Chapter 6 and Chapter 8. It is worth mentioning that there are also distributions not in the exponential family, such as the uniform distribution parameterized by its support. The chapter also covered the important idea of applying exponential models to supervised learning via Generalized Linear Models (GLMs). Energy-based models were finally discussed as an advanced topic.

The next chapter will present various applications of models in the exponential family to classification problems.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 82

Context: # 4
## Classification

The previous chapters have covered important background material on learning and probabilistic models. In this chapter, we use the principles and ideas covered so far to study the supervised learning problem of classification. Classification is arguably the quintessential machine learning problem, with the most advanced state of the art and the most extensive application to problems as varied as email spam detection and medical diagnosis. 

Due to space limitations, this chapter cannot provide an exhaustive review of all existing techniques and latest developments, particularly in the active field of neural network research. For instance, we do not cover decision trees here (see, e.g., [155]). Rather, we will provide a principled taxonomy of approaches, and offer a few representative techniques for each category within a unified framework. We will specifically proceed by first introducing as preliminary material the Stochastic Gradient Descent optimization method. Then, we will discuss deterministic and probabilistic discriminative models, and finally we will cover probabilistic generative models.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 106

Context: # Classification

![Figure 4.8](path/to/your/image)  
**Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity. The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression.

## Example 4.3
In the example in Fig. 4.8, the illustrated \( N = 300 \) training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity \( h(z) = \sigma(x) \). The logistic model uses linear features \( \sigma(x) = [1]^T \). Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot the probability that the class labels the same as for the examples marked with circles as color map. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification.

## 4.5.3 Some Advanced Topics
We conclude this section by noting a few important aspects of the ongoing research on deep neural networks.

A first issue concerns the theoretical understanding of the generalization properties of deep neural networks. On the face of it, the success

Image Analysis: 

### Comprehensive Examination of the Visual Content

#### 1. Localization and Attribution
- **Image Position:**
  - **Image 1:** Located centrally on the page.

#### 2. Object Detection and Classification
- **Image 1:**
  - **Objects Detected:**
    - Data points represented by blue circles and red crosses.
    - Color map background in grayscale.
    - Decision boundary indicated by a solid line.

#### 3. Scene and Activity Analysis
- **Image 1:**
  - **Scene Description:**
    - The image depicts a scatter plot overlaid with a color map and a decision boundary.
  - **Activities:**
    - The scatter plot shows data points classified into two categories (blue circles and red crosses).
    - The color map indicates the probability of classification.
    - A solid line separates the regions, representing the decision boundary from logistic regression.

#### 4. Text Analysis
- **Extracted Text:**
    - **Caption:**
      - **Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (\(L=1\)) and six hidden neurons (with sigmoid non-linearity). The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression.
    - **Example 4.3:**
      - In the example in Fig. 4.8, the illustrated \(N=300\) training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (\(L=1\)) and six hidden neurons with sigmoid non-linearity \(h(z)=\sigma(z)\). The logistic model uses linear features \(\sigma(z)=[1 \, z^T]\). Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot the probability that the class label is the same as for the examples marked with circles as color map. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification.

#### 5. Diagram and Chart Analysis
- **Image 1:**
  - **Axes and Scales:**
    - X-axis (\(Z_1\)): Ranges approximately from -4 to 4.
    - Y-axis (\(Z_2\)): Ranges approximately from -4 to 4.
  - **Legends:**
    - Red crosses (\( \times \)): Data points of class 1.
    - Blue circles (\( \circ \)): Data points of class 2.
    - Color map: Represents the probability, with a bar on the right showing the gradient from 0 to 1.
  - **Key Insights:**
    - The decision boundary (solid line) shows the separation made by logistic regression.
    - The color map illustrates that the feed-forward multi-layer network better captures the structure of the data compared to the linear model.

#### 8. Color Analysis
- **Image 1:**
  - **Dominant Colors:**
    - Red and blue are used to differentiate classes.
    - Black, white, and shades of gray indicate the probability regions.
  - **Impact on Perception:**
    - The color scheme helps to distinguish between different classes and the corresponding decision areas effectively.

#### 9. Perspective and Composition
- **Image 1:**
  - **Perspective:**
    - Direct, front-facing view typical of scatter plots.
  - **Composition:**
    - Center-aligned with data points spread across the plot.
    - Decision boundary centrally positioned, highlighting the separation between classes.
    - Color bar on the right side provides probability information.

#### 10. Contextual Significance
- **Image 1:**
  - **Overall Document Contribution:**
    - The plot supports the explanation of Example 4.3 in comparing the effectiveness of logistic regression vs. multi-layer networks.
    - It visually demonstrates how different models handle training data and classification.

By considering these aspects, we can see that this page offers a detailed explanation of machine learning classification techniques with a practical example illustrated by the scatter plot and color map. The image effectively highlights the differences in performance and decision-making capabilities of logistic regression and a multi-layer network.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 113

Context: # 4.8 Summary

When training the \(k\)th model, the outputs \(a_1(x, \mathbf{w}_1), \ldots, a_k(x, \mathbf{w}_{k-1})\) of the previously trained models, as well as their weights \(\pi_1, \ldots, \pi_{k-1}\), are kept fixed. Excluding the models \(k + 1, \ldots, K\), the training loss can be written as

\[
L_k = \sum_{n=1}^{N} \alpha_{n}^{(k)} \exp(-\pi_k t_n - \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)). \tag{4.47}
\]

with the weights

\[
\alpha_{n}^{(k)} = \exp\left(-t_n - \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)\right). \tag{4.48}
\]

An important point is that the weights (4.48) are larger for training samples with smaller functional margin under the mixture model \(\sum_{j=1}^{k} \pi_j a_j(x_n, \mathbf{w}_j)\). Therefore, when training the \(k\)th model, we give more importance to examples that fare worse in terms of classification margins under the current mixture model. Note that, at each training step \(k\), one trains a simple model, which has the added advantage of reducing the computational complexity as compared to the direct learning of the full training model. We refer to [23, Ch. 14][33, Ch. 10] for further details.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 115

Context: 4.8. Summary
-----------

Chapter 8 for details). Learning rate schedules that satisfy (4.49) include \(\gamma(t) = 1/t\). The intuitive reason for the use of diminishing learning rates is the need to limit the impact of the “noise” associated with the finite-sample estimate of the gradient [22]. The proof of convergence leverages the unbiasedness of the estimate of the gradient obtained by SGD.

In practice, a larger mini-batch size \(S\) decreases the variance of the estimate of the gradient, hence improving the accuracy when close to a stationary point. However, choosing a smaller \(S\) can improve the speed of convergence when the current solution is far from the optimum [152, Chapter 8][22]. A smaller mini-batch size \(S\) is also known to improve the generalization performance of learning algorithms by avoiding sharp external points of the training loss function [66, 79] (see also Sec. 4.5). Furthermore, as an alternative to decreasing the step size, one can also increase the size of the mini-batch along the iterations of the SGD algorithm [136].

Variations and Generalizations
-------------------------------

Many variations of the discussed basic SGD algorithm have been proposed and routinely used. General principles motivating these schedule variants include [56, Chapter 8]: 

1. **Momentum**, or heavy-ball, memory: correct the direction suggested by the stochastic gradient by considering the “momentum” acquired during the last update; 
2. **Adapitivity**: use a different learning rate for different parameters depending on an estimate of the curvature of the loss function with respect to each parameter; 
3. **Control variates**: in order to reduce the variance of the SGD updates, and control variates that do not affect the unbiasedness of the stochastic gradient; and
4. **Second-order updates**: include information about the curvature of the cost or objective function in the parameter update.

As detailed in [56, Chapter 8][76, 43], to which we refer for further discussions, methods in the first category include Nesterov momentum; in the second category, we find AdaGrad, RMSProp, and Adam; and the third encompasses SVRG and SAGA. Finally, the fourth features Newton's method.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 117

Context: # 4.8. Summary

The problem turns out to be quadratic and convex. Importantly, the resulting optimal activation can be expressed as

\[
\alpha(x, w) = \sum_{n=1}^{N} \alpha_n k(x, z_n), \tag{4.50}
\]

where \(\alpha_n\) are the optimal dual variables, and we have defined the kernel function

\[
k(x, y) = \phi(x)^\top \phi(y), \tag{4.51}
\]

where \(x\) and \(y\) are two argument vectors. The kernel function measures the correlation—informally, the similarity—between the two input vectors \(x\) and \(y\). The activation (4.50) has been an intuitive interpretation: the decision about the label of an example \(x\) depends on the support vectors \(z_n\), where \(\alpha_n > 0\), that are the most similar to \(x\). Note that equation (4.50) can also be justified using the representer theorem in [133, Chapter 16], which shows that the optimal weight vector must be a linear combination of the feature vectors \(\{\phi(z_n)\}_{n=1}^N\).

Working in the dual domain can have computational advantages when the number of the primal variables, here the size \(D\) of the weight vector \(w\), is larger than the number \(N\) of dual variables. While this seems a priori unlikely to happen in practice, it turns out that this is not the case. The key idea is that one can use (4.50) with any other kernel function, not necessarily one explicitly defined by a feature function \(\phi\). A kernel function is any symmetric function measuring the correlation of two data points, possibly in an infinite-dimensional space. This is known as the kernel trick.

As a first example, the polynomial kernel

\[
k(x, y) = (x^\top y + r)^d, \tag{4.52}
\]

where \(r > 0\), corresponds to a correlation \(\phi(x)^\top \phi(y)\) in a high-dimensional space \(D\). For instance, with \(l = 1\) and \(D = 2\) we have \(D' = 6\) and the feature vector \(\phi(x) = [\sqrt{x_1}, x_1^2, x_2^2, \sqrt{x_1x_2}, x_1^3, x_2^3]\) [104]. As another, more extreme example, the conventional Gaussian kernel

\[
k(x, y) = e^{-\|x - y\|^2} \tag{4.53}
\]

corresponds to an inner product in an infinite-dimensional space [104]. An extensive discussion on kernel methods can be found in [104].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 119

Context: # 5
## Statistical Learning Theory*

Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number \( N \) of available data points and the generalization performance of a trained machine. The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. 

This chapter is of mathematical nature, and it departs from the algorithmic focus of the text so far. While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. 

To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 121

Context: # 5.1 A Formal Framework for Supervised Learning

Distributed according to \( p(x, t) \). A learning algorithm, such as ERM, takes the training set \( D \) as input and returns a predictor \( \hat{t}_D \in \mathcal{H} \) as output. We would like the predictive model \( \hat{t}_D \in \mathcal{H} \) to yield a generalization error \( L_p(\hat{t}_D) \) that is as close as possible to the minimum generalization loss \( L_p(t^*) \). Note that the selected model \( \hat{t}_D \) is random due to randomness of the data set \( D \).

In this regard, we recall that the ERM learning rule chooses a hypothesis \( \hat{t}^{ERM}_D \) by following the criterion \( \hat{t}^{ERM}_D = \arg\min_{\hat{t} \in \mathcal{H}} L_p(\hat{t}) \), where the empirical risk is 

\[
L_D(t) = \frac{1}{N} \sum_{i=1}^N L(t, t(x_i)) \tag{5.5}
\]

The notation in (5.5) emphasizes the randomness of the training set \( D = \{(x_{i}, t_{i})\}^N_{i=1} \).

Since the distribution \( p(x, t) \) is unknown, a learning rule \( \hat{t}_D \), such as ERM, can only minimize the generalization loss \( L_p(\hat{t}) \) approximately based on the observation of the data \( D \). Furthermore, this approximation can only be guaranteed at some probability level due to the randomness of the data set \( D \). This is illustrated in Figure 5.1, in which we have represented a high-probability interval for \( L_p(\hat{t}) \) on the horizontal axis. We would like the approximation to be accurate for all values of \( L_p(\hat{t}) \) within this interval.

But there is more: the probabilistic guarantee in terms of accuracy cannot depend on the specific distribution \( p(x, t) \), but it should instead be universal with respect to all distributions \( p(x, t) \). In summary, the best one can hope for is to have a learning rule \( \hat{t}_D \) that is Probably Approximately Correct (PAC).

## Definition 5.1
A learning rule \( \hat{t}_D \) is (N, \(\delta\)) PAC if, when working on data sets \( D \) of \( N \) examples, it satisfies the inequality

\[
L_p(\hat{t}_D) \leq L_p(t^*) + \epsilon \tag{5.6}
\]

with probability no smaller than \( 1 - \delta \), that is,

\[
\Pr_{D \sim p_{data}}[L_p(\hat{t}_D) \leq L_p(t^*) + \epsilon] \geq 1 - \delta \tag{5.7}
\]
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 130

Context: # 5.4 VC Dimension and Fundamental Theorem of PAC Learning

We have seen that finite classes are PAC learnable with sample complexity proportional to the model capacity in \(|H|\) by using ERM. In this section, we address the following questions: Is \(N_{\text{SRM}}(N,\delta)\) the smallest sample complexity? How can we define the capacity of infinite hypothesis classes? We have discussed at the end of Section 5.3 that the answer to the latter question cannot be found by extrapolating from results obtained when considering finite hypothesis classes. In contrast, we will see here that the answers to both of these questions rely on the concept of VC dimension, which serves as a more fundamental definition of capacity of a model. The VC dimension is defined next.

## Definition 5.4

A hypothesis class \(\mathcal{H}\) is said to shatter a set of domain points \(X = \{x_1,\ldots,x_n\}\) if, no matter how the corresponding labels \(\{y_i \in \{0,1\}\}\) are selected, there exists a hypothesis \(h \in \mathcal{H}\) that ensures \(\hat{y}_{n} = y_n\) for all \(n = 1,\ldots,V\).

## Definition 5.5

The VC dimension \(\text{VCdim}(\mathcal{H})\) of the model \(\mathcal{H}\) is the size of the largest set \(X\) that is shattered by \(\mathcal{H}\).

Based on the definitions above, to prove that a model has VCdim(\(\mathcal{H}\)) = V, we need to carry out the following two steps:

1. **Step 1:** Demonstrate the existence of a set \(X\) with \(|X| = V\) that is shattered by \(\mathcal{H}\);
2. **Step 2:** Prove that no set \(X'\) of dimensions \(V + 1\) exists that is shattered by \(\mathcal{H}\).

The second step is typically seen to be more difficult, as illustrated by the following examples.

## Example 5.3

The threshold function model (5.8) has VCdim(\(\mathcal{H}) = 1\), since there is clearly a set \(X\) of one sample (\(V = 1\)) that can be shattered.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 133

Context: # 5.5. Summary

To elaborate, consider a probabilistic model \( \mathcal{H} \) defined as the set of all pmfs \( p(x|\theta) \) parameterized by \( \theta \) in a given set. With some abuse of notation, we take \( \mathcal{H} \) to be also the domain of parameter \( \theta \). To fix the ideas, assume that \( \theta \) takes values over a finite alphabet. We know from Section 2.5 that a distribution \( q(x) \) is associated with a lossless compression scheme that requires around \( -\log q(x) \) bits to describe a value. Furthermore, if we were informed about the true parameter \( \theta \), the minimum average coding length would be the entropy \( H(p(\cdot|\theta)) \), which requires setting \( q(x) = p(x|\theta) \) (see Appendix A).

Assume now that we only know that the parameter \( \theta \) lies in set \( \mathcal{H} \), and hence the true distribution \( p(x|\theta) \) is not known. In this case, we cannot select the true parameter distribution, and we need instead to choose a generally different distribution \( q(x) \) to define a compression scheme. With a given distribution \( q(x) \), the average coding length is given by 

\[
\bar{L}(q(x), \theta) = -\sum_{x} p(x|\theta) \log q(x). 
\]
Therefore, the choice of a generally different distribution \( q(x) \) entails a redundancy of 

\[
\Delta R(q(x), \theta) = -\sum_{x} p(x|\theta) \log \frac{q(x)}{p(x|\theta)} \geq 0 \tag{5.22}
\]
bits.

The redundancy \( \Delta R(q(x), \theta) \) in (5.22) depends on the true value of \( \theta \). Since the latter is not known, this quantity cannot be computed. We can instead obtain a computable metric by maximizing over all values of \( \theta \in \mathcal{H} \), which yields the worst-case redundancy

\[
\Delta R(q(x), \mathcal{H}) = \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta). \tag{5.23}
\]

This quantity can be minimized over \( q(x) \) yielding the so-called min-max redundancy:

\[
\Delta R(q(x)) = \min_{q} \Delta R(q(x), \mathcal{H}) \tag{5.24}
\]
\[
= \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)} \tag{5.25}
\]
\[
= \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)}. \tag{5.26}
\]

The minimum redundancy can be taken as a measure of the capacity of model \( \mathcal{H} \), since a richer model tends to yield a larger \( \Delta R(\mathcal{H}) \). In fact,
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 138

Context: # Unsupervised Learning

may wish to cluster a set \( D \) of text documents according to their topics, by modeling the latter as an unobserved label \( z_n \). Broadly speaking, this requires to group together documents that are similar according to some metric. It is important at this point to emphasize the distinction between classification and clustering: While the former assumes the availability of a labelled set of training examples and evaluates its (generalization) performance on a separate set of unlabelled examples, the latter works with a single, unlabelled, set of examples. The different notation used for the labels \( z_n \) in lieu of \( t_n \) is meant to provide a reminder of this key difference.

## Dimensions reduction and representation:
Given the set \( D \), we would like to represent the data points \( x_n \in D \) in a space of lower dimensionality. This makes it possible to highlight independent explanatory factors, and/or to ease visualization and interpretation [93], e.g., for text analysis via vector embedding (see, e.g., [124]).

## Feature extraction:
Feature extraction is the task of deriving functions of the data points \( x_n \) that provide useful lower-dimensional inputs for tasks such as supervised learning. The extracted features are unobserved, and hence latent, variables. As an example, the hidden layer of a deep neural network extracts features from the data for use by the output layer (see Sec. 4.5).

## Generation of new samples:
The goal here is to train a machine that is able to produce samples that are approximately distributed according to the true distribution \( p(x) \). For example, in computer graphics for filmmaking or gaming, one may want to train a software that is able to produce artistic scenes based on a given description.

The variety of tasks and the difficulty in providing formal definitions, e.g., on the realism of an artificially generated image, may hinder unsupervised learning, at least in its current state, a less formal field than supervised learning. Often, loss criteria in unsupervised learning measure the divergence between the learned model and the empirical data distribution, but there are important exceptions, as we will see.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 169

Context: # 6.8 Ranking

## Denoising autoencoders
An alternative approach to facilitate the learning of useful features is that taken by denoising autoencoders [150]. Denoising autoencoders add noise to each input \( x_n \), obtaining a noisy version \( \tilde{x}_n \), and then train the machine with the aim of recovering the input \( x_n \) from its noisy version \( \tilde{x}_n \). Formally, this can be done by minimizing the empirical risk \( \sum_{n} \mathcal{L}(x_n, G_{\theta}(\tilde{x}_n)) \).

## Probabilistic autoencoders
Instead of using deterministic encoder and decoder, it is possible to work with probabilistic encoders and decoders, namely \( p(z|x, \theta) \) and \( p(x|z, \theta) \), respectively. Treating the decoder as a variational distribution, learning can then be done by a variant of the EM algorithm. The resulting algorithm, known as Variational AutoEncoder (VAE), will be mentioned in Chapter 8.

## 6.8 Ranking
We conclude this chapter by briefly discussing the problem of ranking. When one has available ranked examples, ranking can be formulated as a supervised learning problem [133]. Here, we focus instead on the problem of ranking a set of webpages based only on the knowledge of their underlying graph of mutual hyperlinks. This set-up may be considered as a special instance of unsupervised learning. We specifically describe a representative, popular, scheme known as PageRank [110], which uses solely the web of hyperlinks as a form of supervision signal to guide the ranking.

To elaborate, we define the connectivity graph by including a vertex for each webpage and writing the adjacency matrix as:

\[
L_{ij} =
\begin{cases}
1 & \text{if page } j \text{ links to page } i \\
0 & \text{otherwise}
\end{cases}
\]
  
The outgoing degree of a webpage is given as:

\[
C_j = \sum_{i} L_{ij}.
\]

PageRank computes the rank \( p_i \) of a webpage \( i \) as:

\[
p_i = (1 - d) + d \sum_{j} \frac{L_{ji}}{C_j} p_j,
\] 

where \( d \) is a damping factor.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 173

Context: # 7.1 Introduction

In this section, we start by discussing two examples that illustrate the type of structural information that can be encoded by means of probabilistic graphical models. We then provide an overview of this chapter.

As illustrated by the two examples below, structured probabilistic models can be used to set up parametric models for both supervised and unsupervised learning. In the former case, all variables are observed in the training set, with some revs being inputs, i.e., covariates (\(x\)), and others being considered as outputs, or targets (\(y\)). In contrast, in the latter case, some variables are unobserved and play the role of latent variables (\(z\)) that help explain or generate the observed variables (\(y\)).[^1]

## Example 7.1

Consider the tasks of text classification via supervised learning or topic clustering via unsupervised learning. In the supervised learning case, the problem is to classify documents depending on their topic, e.g., sport, politics or entertainment, based on a set of labeled documents. With unsupervised learning, the problem is to cluster documents according to the similarity of their contents based on the sole observation of the documents themselves.

A minimal model for this problem should include a variable \(t\) representing the topic and a variable \(x\) for the document. The topic can be represented by a categorical variable taking \(T\) values, i.e., \(t \in \{1, ... , T\}\), which is observed for supervised learning and latent for unsupervised learning. As for the document, with “bag-of-words” encoding, a set of \(W\) words of interest is selected, and a document is encoded as \(W \times 1\) binary vector \(z = [z_1, ..., z_W]^T\), where \(z_w = 1\) if word \(w\) is contained in the document.[^2]

To start, we could try to use an unstructured directed model defined as

\[
t \sim \text{Cat}(π) \tag{7.1a}
\]

\[
x | t \sim \text{Cat}(t) \tag{7.1b}
\]

[^1]: Strictly speaking, this distinction applies to the frequentist approach, since in the Bayesian approach the model parameters are always treated as unobserved random variables.
[^2]: Note that \(W\) here does not represent a matrix of weights!
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 176

Context: ```
to conditional independence assumptions. For instance, in the naive Bayes model, the word indicators are conditionally independent given the topic. As we will see in the rest of this chapter, conditional independence assumptions translate into factorizations of the joint distributions of the variables under study. Factorizations and associated conditional independence properties can be represented by three different graphical frameworks, namely BNs, MRFs, and factor graphs. For brevity, this chapter will only focus on the first two.

## 7.2 Bayesian Networks

This section provides a brief introduction to BNs by focusing on key definitions and on the problem of ML learning with some note on MAP and Bayesian viewpoints.

### 7.2.1 Definitions and Basics

BNs encode a probability factorization or, equivalently, a set of conditional independence relationships through a directed graph. The starting point is the chain rule for probabilities for a generic set of \( K \) random variables (rvs) \( \{x_1, ..., x_K\} \):

\[
p(x_1, ..., x_K) = p(x_1)p(x_2|x_1)...p(x_K|x_1, ..., x_{K-1}) \tag{7.3}
\]

where the order of the variables is arbitrary. The factorization (7.3) applies for a generic joint distribution, and it does not encode any additional structural information. Note that the notation here is general and not meant to indicate that the variables are necessary observed.

#### Example 7.3

Consider again the naive Bayes model for text classification/clustering. There, we imposed the structural constraint that word indicators \(\{x_w\}_{w=1}^W\) be conditionally independent given the topic \(t\). This conditional independence assumption can be expressed using the “perp” notation:

\[
x_w \perp \{x_{w'}: w' \neq w\}|t. \tag{7.4}
\]
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 192

Context: # 8
## Approximate Inference and Learning

In Chapters 6 and 7, we have seen that learning and inference tasks are often made difficult by the need to compute the posterior distribution \( p(z|x) \) of an unobserved variable \( z \) given an observed variable \( x \). This task requires the computation of the normalizing marginal:

\[
p(x) = \sum_{z} p(x, z), \tag{8.1}
\]

where the sum is replaced by an integral for continuous variables.¹ This computation is intractable when the alphabet of the hidden variable \( z \) is large enough. Chapter 7 has shown that the complexity of computing (8.1) can be alleviated in the special case in which the factorized joint distribution \( p(x, z) \) is defined by specific classes of probabilistic graphical models.

What to do when the complexity of computing (8.1) is excessive? In this chapter, we provide a brief introduction to two popular approximate inference approaches, namely MC methods and Variational Inference (VI). We also discuss their application to learning. As for the  

¹ Note that this task assumes (7.23) after appropriate redefinitions of the variables.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 208

Context: # Part V

## Conclusions

- This section summarizes the key findings of the study.
- Conclusions drawn from the data.
- Implications of the findings for future research.

1. **First conclusion point.**
2. **Second conclusion point.**
3. **Third conclusion point.**

### Recommendations

- **Recommendation 1:** Description of recommendation.
- **Recommendation 2:** Description of recommendation.
- **Recommendation 3:** Description of recommendation.

### References

1. Author, A. (Year). *Title of the work*. Publisher.
2. Author, B. (Year). *Title of the work*. Publisher. 

### Acknowledgements

- Acknowledgement of contributions from individuals or organizations.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 210

Context: # Concluding Remarks

The data, producing the wrong response upon minor, properly chosen, changes in the explanatory variables. Note that such adversarially chosen examples, which cause a specific machine to fail, are conceptually different from the randomly selected examples that are assumed when defining the generalization properties of a network. There is evidence that finding such examples is possible even without knowing the internal structure of a machine, but solely based on black-box observations [11]. Modifying the training procedure in order to ensure robustness to adversarial examples is an active area of research with important practical implications [55].

## Computing Platforms and Programming Frameworks

In order to scale up machine learning applications, it is necessary to leverage distributed computing architectures and related standard programming frameworks [17, 7]. As a complementary and more futuristic approach, recent work has been proposed to leverage the capabilities of an annealing-based quantum computers as samplers [82] or discrete optimizers [103].

## Transfer Learning

Machines trained for a certain task currently need to be re-trained in order to be re-purposed for a different task. For instance, a machine that learned how to drive a car would need to be retrained in order to learn how to drive a truck. The field of transfer learning covers scenarios in which one wishes to transfer the expertise acquired from some tasks to others. Transfer learning includes different related paradigms, such as multitask learning, lifelong learning, zero-shot learning, and domain adaptation [149]. 

In multitask learning, several tasks are learned simultaneously. Typical solutions for multitask learning based on neural networks preserve the presence of common representations among neural networks trained for different tasks [19]. Lifelong learning utilizes a machine trained on a number of tasks to carry out a new task by leveraging the knowledge accumulated during the previous training phases [143]. Zero-shot learning refers to models capable of recognizing unseen classes with training examples available only for related, but different, classes. This often entails the task of learning representation of classes, such as prototype vectors, that generate data in the class through a fixed probabilistic mechanism [52]. Domain adaptation will be discussed separately in the next point.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 212

Context: # Appendices

## Appendix A: Title

Content of Appendix A.

## Appendix B: Title

Content of Appendix B.

## Appendix C: Title

Content of Appendix C.

## Appendix D: Title

Content of Appendix D.

## References

- Reference 1
- Reference 2
- Reference 3

## Tables

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Row 1    | Data 1   | Data 2   |
| Row 2    | Data 3   | Data 4   |
| Row 3    | Data 5   | Data 6   |
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 213

Context: # Appendix A: Information Measures

In this appendix, we describe a principled and intuitive introduction to information measures that builds on inference, namely estimation and hypothesis testing. We focus on entropy, mutual information, and divergence measures. We also concentrate on discrete r.v.s. In the monograph, we have taken the pragmatic approach of extending the definitions to continuous variables by substituting sums with integrals. It is worth noting that this approach does not come with any practical complications when dealing with mutual information and divergence. Instead, the continuous version of the entropy, known as differential entropy, should be treated with care, as it does not satisfy some key properties of the entropy such as non-negativity.

## A.1 Entropy

As proposed by Claude Shannon, the amount of information received from the observation of a discrete random variable \( X \sim p(x) \) defined over a finite alphabet should be measured by the amount of uncertainty about its value prior to its measurement. To this end, we consider the problem of estimating the value of \( x \) when one only knows the following:
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 224

Context: # References

1. Abadi, M., Ú. Erlingsson, I. Goodfellow, H. Brendan McMahan, I. Mironov, N. Papernot, K. Talwar, and L. Zhang. 2017. “On the Protection of Private Information in Machine Learning Systems: Two Recent Approaches.” *ArXiv e-prints*. Aug. arXiv: 1708.08032 [eprint].  
2. Abu-Mostafa, Y. S., M. Magdon-Ismail, and H.-T. Lin. 2012. *Learning from data*. Vol. 4. AMLBook New York, NY, USA.  
3. Agakov, F. 2005. *Variational Information Maximization in Stochastic Environments* (PhD thesis). University of Edinburgh.  
4. Alemi, A. A., B. Poole, and E. A. Fischer. 2017. “An Information-Theoretic Analysis of Deep Latent-Variable Models.” *ArXiv e-prints*. Nov. arXiv: 1711.00614v1.  
5. Amari, S.-i. 1998. “Natural gradient works efficiently in learning.” *Neural computation*, 10(2): 251–276.  
6. Amari, S.-i. 2016. *Information geometry and its applications*. Springer.  
7. Angelino, E., N. J. Johnson, R. P. Adams, et al. 2016. “Patterns of scalable Bayesian inference.” *Foundations and Trends® in Machine Learning*, 9(2-3): 119–247.  
8. Arjovsky, M., S. Chintala, and L. Bottou. 2017. “Wasserstein GAN.” *arXiv preprint arXiv:1701.07875*.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 3

Context: # Contents

Preface iii  
Learning and Intuition vii  

1. Data and Information  
   1.1 Data Representation ...................................................... 2  
   1.2 Preprocessing the Data .................................................. 4  

2. Data Visualization .................................................................. 7  

3. Learning  
   3.1 In a Nutshell ............................................................... 15  

4. Types of Machine Learning  
   4.1 In a Nutshell ............................................................... 20  

5. Nearest Neighbors Classification  
   5.1 The Idea In a Nutshell .................................................. 23  

6. The Naive Bayesian Classifier  
   6.1 The Naive Bayes Model .................................................. 25  
   6.2 Learning a Naive Bayes Classifier ............................... 27  
   6.3 Class-Prediction for New Instances ............................. 28  
   6.4 Regularization ............................................................... 30  
   6.5 Remarks .................................................................... 31  
   6.6 The Idea In a Nutshell .................................................. 32  

7. The Perceptron  
   7.1 The Perceptron Model .................................................. 34
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 4

Context: ```
# CONTENTS

7.2 A Different Cost function: Logistic Regression .......................... 37  
7.3 The Idea In a Nutshell .................................................. 38  

# 8 Support Vector Machines ................................................ 39  
8.1 The Non-Separable case ................................................. 43  

# 9 Support Vector Regression .............................................. 47  

# 10 Kernel ridge Regression ............................................... 51  
10.1 Kernel Ridge Regression ............................................... 52  
10.2 An alternative derivation ............................................. 53  

# 11 Kernel K-means and Spectral Clustering ............................... 55  

# 12 Kernel Principal Components Analysis ................................ 59  
12.1 Centering Data in Feature Space ....................................... 61  

# 13 Fisher Linear Discriminant Analysis .................................. 63  
13.1 Kernel Fisher LDA .................................................... 66  
13.2 A Constrained Convex Programming Formulation of FDA ................. 68  

# 14 Kernel Canonical Correlation Analysis ................................ 69  
14.1 Kernel CCA ............................................................ 71  

# A Essentials of Convex Optimization ..................................... 73  
A.1 Lagrangians and all that ............................................... 73  

# B Kernel Design .......................................................... 77  
B.1 Polynomials Kernels .................................................... 77  
B.2 All Subsets Kernel ...................................................... 78  
B.3 The Gaussian Kernel ..................................................... 79  
```
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 6

Context: # PREFACE

about 60% correct on 100 categories, the fact that we pull it off seemingly effortlessly serves as a “proof of concept” that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data.

The first reason for the recent successes of machine learning and the growth of the field as a whole is rooted in its multidisciplinary character. Machine learning emerged from AI but quickly incorporated ideas from fields as diverse as statistics, probability, computer science, information theory, convex optimization, control theory, cognitive science, theoretical neuroscience, physics, and more. To give an example, the main conference in this field is called: **advances in neural information processing systems**, referring to information theory and theoretical neuroscience and cognitive science.

The second, perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. While the field is built on theory and tools developed statistics machine learning recognizes that the most exciting progress can be made to leverage the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text, and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data all previous satellites produced together. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on such larger data-sets (the latter sometimes called **data-mining**). Hence the emphasis shifted to algorithmic efficiency and as a result many machine learning faculty (like myself) can typically be found in computer science departments. To give some examples of recent successes of this approach one would only have to turn on one computer and perform an internet search. Modern search engines do not run terribly sophisticated algorithms, but they manage to store and sift through almost the entire content of the internet to return sensible search results. There has also been much success in the field of machine translation, not because a new model was invented but because many more translated documents became available.

The field of machine learning is multifaceted and expanding fast. To sample a few sub-disciplines: statistical learning, kernel methods, graphical models, artificial neural networks, fuzzy logic, Bayesian methods, and so on. The field also covers many types of learning problems, such as supervised learning, supervised learning, semi-supervised learning, active learning, reinforcement learning, etc. I will only cover the most basic approaches in this book from a highly per
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 7

Context: # MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING

This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). I would like to thank Hans for discussion on intuition. I also thank Prof. Bert Kappen, who leads an excellent group of postdocs and students for his hospitality. Marga, kids, UCI,...

---

There are a few main aspects I want to cover from a personal perspective. Instead of trying to address every aspect of the entire field, I have chosen to present a few popular and perhaps useful tools and approaches. What will (hopefully) be significantly different than most other scientific books is the manner in which I will present these methods. I have always been frustrated by the lack of proper explanation of equations. Many times, I have been staring at a formula without the slightest clue where it came from or how it was derived. Many books also excel in stating facts in an almost encyclopedic style, without providing the proper intuition of the method. This is my primary mission: to write a book that conveys intuition. The first chapter will be devoted to why I think this is important.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 10

Context: # LEARNING AND INTUITION

Baroque features or a more “dull” representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvelous book “XXX” (Hardamard).

By building accurate visual representations of abstract ideas, we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way, I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her database which therefore needs no and very little updating.

Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you’ll find that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work, suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with.

In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 11

Context: Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored, I recommend moving on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see [books](#)). But I hope that for most beginning students, this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun!
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 13

Context: # Chapter 1

## Data and Information

Data is everywhere in abundant amounts. Surveillance cameras continuously capture video; every time you make a phone call your name and location get recorded. Often, your clicking pattern is recorded when surfing the web. Most financial transactions are recorded; satellites and observatories generate terabytes of data every year; the FBI maintains a DNA database of most convicted criminals. Soon, all written text from our libraries is digitized. Need I go on?

But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? When analyzing data, we typically have a specific question in mind, such as: 

- “How many types of car can be discerned in this video?” 
- “What will be the weather next week?”

The answer can take the form of a single number (there are 5 cars), or a sequence of numbers (or the temperature next week) or a complicated pattern (the cloud configuration next week). If the answer to our query is complex, we like to visualize it using graphs, bar plots, or even little movies. One should keep in mind that the particular analysis depends on the task one has in mind.

Let me spell out a few tasks that are typically considered in machine learning:

### Prediction

Here we ask ourselves whether we can extrapolate the information in the data to new use cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold, etc., and another database of attributes for Ferraris, then we can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather given all the recorded weather patterns in the past: can we predict the weather next week, or the stock prices?
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 15

Context: # 1.1. DATA REPRESENTATION

Most datasets can be represented as a matrix, \( X = [X_{n,k}] \), with rows indexed by "attribute-index" \( i \) and columns indexed by "data-index" \( n \). The value \( X_{n,k} \) for attribute \( i \) and data-case \( n \) can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure weight and color of 100 cars, the matrix \( X \) is \( 2 \times 100 \) dimensional and \( X_{1,20} = 20,684.57 \) is the weight of car nr. 20 in some units (a real value) while \( X_{2,20} = 2 \) is the color of car nr. 20 (say one of 6 predefined colors).

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often a word was present. Say the word "book" is defined to have nr. 10, 568 in the vocabulary then \( X_{1,1005} = 4 \) would mean: the word book appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images about cats. You’ll retrieve a large variety of images most with a different number of pixels. We can either try to resize the images to a common size or we can simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but it couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document some letters will not be recognized. We’ll use a question mark “?” to indicate that that entry wasn’t observed.

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in some representation the structure may be obvious while in other representation it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as, “Humans and Ferraries can be separated with a line,” see figure ?. While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to re-code these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about in which representation the structure is as obvious as possible and transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations. It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you if the structure is a good match for the algorithm you had in mind for further analysis. Chapter ? will discuss some elementary visualization techniques.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 16

Context: # CHAPTER 1. DATA AND INFORMATION

## 1.2 Preprocessing the Data

As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in Figure 1. The algorithm we consist of estimating the area that the data occupy. It grows a circle starting at the origin and at the point it contains all the data we record the area of a circle. In the figure why this will be a bad estimate: the data-cloud is not centered. If we had first centered it we would have obtained reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data. To center data we will introduce the sample mean of the data, given by,

\[ 
E[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in} \tag{1.1} 
\]

Hence, for every attribute \(i\) separately, we simple add all the attribute value across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set,

\[ 
X'_{in} = X_{in} - E[X_i], \quad \forall n 
\tag{1.2} 
\]

It is now easy to check that the sample mean of \(X'\) indeed vanishes. An illustration of the global shift is given in Figure 2. We also see in this figure that the algorithm described above now works much better!

In a similar spirit as centering, we may also wish to scale the data along the coordinate axis in order to make it more “spherical.” Consider Figure 3. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension. To define this operation we first introduce the notion of sample variance,

\[ 
V[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in}^2 \tag{1.3} 
\]

where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative signs might cancel each other out. By first taking the square, all data-cases first get mapped to positive half of the axes (for each dimension or
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 20

Context: # CHAPTER 2. DATA VISUALIZATION

An example of such a scatter plot is given in Figure ??.

Note that we have a total of \( d(d - 1) / 2 \) possible two-dimensional projections which amounts to 4950 projections for 100-dimensional data. This is usually too many to manually inspect. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out to be not a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the **central limit theorem** which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote with \( \mathbf{w} \) a vector in \( \mathbb{R}^d \) and by \( x \) the d-dimensional random variable, then \( y = \mathbf{w}^T \mathbf{x} \) is the value of the projection. This is clearly a weighted sum of the random variables \( x_i, \; i = 1, \ldots, d \). If we assume that \( x_i \) are approximately independent, then we can see that their sum will be governed by its central limit theorem. Analogously, a dataset \( \{X_n\} \) can thus be visualized in one dimension by "histogramming" the values of \( y = \mathbf{w}^T \mathbf{x} \), see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data.

In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions and the Gaussian distribution is one of the few distributions which have very nice analytical properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of "uninformative" can actually be made very precise using information theory and states: 

> Given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance.

This is rather unfortunate for our purposes because Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure.

A large number of algorithms has been devised to search for informative projections. The simplest being "principal component analysis" or PCA for short ??. Here, interesting means dimensions of high variance. However, the fact that high variance is not always a good measure of interestingness and one should rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimen-
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 21

Context: # Data Visualization Techniques

Distributions that have heavy tails relative to Gaussian distributions are important to consider. Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold.

## Scatter Plots

Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate, I will give a few examples from a variety of techniques:

1. **Histogram**: A useful way to represent the distribution of a dataset.
2. **Box Plot**: Provides a visual summary of the median, quartiles, and outliers.
3. **Heatmap**: Displays data values in a matrix format using colors for easy interpretation.
4. **Line Graph**: Ideal for showing trends over time.

Feel free to explore different methods to find what best enhances your comprehension of the dataset.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 22

Context: # CHAPTER 2: DATA VISUALIZATION

## Introduction

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

## Benefits of Data Visualization

- **Enhanced Understanding**: Complex data becomes more understandable through visual representation.
- **Immediate Insights**: Visualizations can provide quick and effective insights into data trends.
- **Better Communication**: It aids in storytelling and communicating data findings effectively.

## Common Types of Data Visualizations

1. **Bar Charts**
   - Useful for comparing quantities across categories.
2. **Line Graphs**
   - Ideal for showing trends over time.
3. **Pie Charts**
   - Best for illustrating proportions of a whole.
4. **Heat Maps**
   - Effective for displaying data density across a geographical area.

## Tools for Data Visualization

| Tool          | Description                                      | Cost        |
|---------------|--------------------------------------------------|-------------|
| Tableau       | Leading data visualization tool                  | Subscription |
| Microsoft Excel | Popular for creating basic charts and graphs | License fee  |
| Power BI      | Business analytics service from Microsoft       | Subscription |
| Google Data Studio | Free online tool for data visualization  | Free        |

## Conclusion

Data visualization is a crucial technique for data analysis and communication. By implementing effective visualization methods and using appropriate tools, organizations can greatly enhance their decision-making processes.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 24

Context: # CHAPTER 3. LEARNING

They understood that this was a lion. They understood that all lions have these particular characteristics in common, but may differ in some other ones (like the presence of a scar or scar someplace).

Bob has another disease which is called over-generalization. Once he has seen an object he believes almost everything is some, perhaps twisted instance of the same object class (In fact, I seem to suffer from this so now and then when I think all of machine learning can be explained by this one new exciting principle). If ancestral Bob walks the savanna and he has just encountered an instance of a lion and fled into a tree with his buddies, the next time he sees a squirrel he believes it is a small instance of a dangerous lion and flees into the trees again. Over-generalization seems to be rather common among small children.

One of the main conclusions from this discussion is that we should neither over-generalize nor over-fit. We need to be on the edge of being just right. But just right about what? It doesn’t seem there is one correct God-given definition of the category chairs. We seem to all agree, but one can surely find examples that would be difficult to classify. When do we generalize exactly right? The magic word is **PREDICTION**. From an evolutionary standpoint, all we have to do is make correct predictions about aspects of life that help us survive. Nobody really cares about the definition of lion, but we do care about our responses to the various animals (run away for lion, chase for deer). And there are a lot of things that can be predicted in the world. This food kills me but food is good for me. Drumming my fists on my hairy chest in front of a female generates opportunities for sex; sticking my hand into that yellow-orange flickering “flame” hurts my hand and so on. The world is wonderfully predictable and we are very good at predicting it.

So why do we care about object categories in the first place? Well, apparently they help us organize the world and make accurate predictions. The category *lions* is an abstraction and abstractions help us to generalize. In a certain sense, learning is all about finding useful abstractions or concepts that describe the world. Take the concept “fluid”; it describes all watery substances and summarizes some of their physical properties. Or the concept of “weight”: an abstraction that describes a certain property of objects.

Here is one very important corollary for you: 

> “machine learning is not in the business of remembering and regurgitating observed information, it is in the business of transferring (generalizing) properties from observed data onto new, yet unobserved data.” 

This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the final exam).

The information we receive from the world has two components to it: there
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 27

Context: # 3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. 

Learning = generalization = abstraction = compression.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 28

Context: # CHAPTER 3. LEARNING

## Introduction

Learning involves acquiring knowledge, skills, attitudes, or competencies. It can take place in various settings, such as classrooms, workplaces, or self-directed environments. This chapter discusses the types and processes of learning.

## Types of Learning

1. **Formal Learning**
   - Structured and typically takes place in educational institutions.
   - Includes degrees, diplomas, and certifications.

2. **Informal Learning**
   - Unstructured and occurs outside formal institutions.
   - Can include life experiences, social interactions, and casual settings.

3. **Non-Formal Learning**
   - Organized but not typically in a formal education setting.
   - Often community-based, such as workshops and training sessions.

## Learning Processes

- **Cognitive Learning**: Involves mental processes and understanding. 
- **Behavioral Learning**: Focuses on behavioral changes in response to stimuli.
- **Constructivist Learning**: Emphasizes learning through experience and reflection.

## Table of Learning Theories

| Theory                    | Key Contributor      | Description                                      |
|--------------------------|----------------------|--------------------------------------------------|
| Behaviorism              | B.F. Skinner         | Learning as a change in behavior due to reinforcement. |
| Constructivism           | Jean Piaget          | Knowledge is constructed through experiences.    |
| Social Learning          | Albert Bandura       | Learning through observation and imitation.      |

## Conclusion

Understanding the different types and processes of learning can help educators and learners optimize educational experiences. Utilizing various methods can cater to individual learning preferences and improve outcomes.

## References

- Smith, J. (2020). *Learning Theories: A Comprehensive Approach*. Educational Press.
- Doe, A. (2019). *Techniques for Effective Learning*. Learning Publishers.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 31

Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generally also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. 

This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it?

Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would be pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze. Given that it needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning".

One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem if recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a fan of a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don’t have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it is hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning".
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 32

Context: # Chapter 4: Types of Machine Learning

## 4.1 In a Nutshell

There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes, unsupervised learning tries to discover interesting structure in data, semi-supervised learning uses both labeled and unlabeled data to improve predictive performance, reinforcement learning can handle simple feedback in the form of delayed reward, active learning optimizes the next sample to include in the learning algorithm, and multi-task learning deals with sharing common model components between related learning tasks.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 35

Context: 5.1 The Idea In a Nutshell
==========================

To classify a new data item, you first look for the \( k \) nearest neighbors in feature space and assign it the same label as the majority of these neighbors. 

Because 98 noisy dimensions have been added, this effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with much care and preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection" on which a whole book in itself could be written.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 36

Context: # CHAPTER 5. NEAREST NEIGHBORS CLASSIFICATION

## 5.1 Introduction

The nearest neighbors classification is a type of instance-based learning. It is a method used in machine learning that classifies a data point based on how its neighbors are classified.

## 5.2 Algorithm

The basic algorithm for k-nearest neighbors (k-NN) can be summarized as follows:

1. **Choose the number of neighbors, k.**
2. **Calculate the distance** between the new data point and all existing data points.
3. **Sort the distances** and determine the nearest k neighbors.
4. **Vote** for the most common class among the k neighbors.

## 5.3 Distance Metrics

Common distance metrics include:

- **Euclidean Distance**
- **Manhattan Distance**
- **Minkowski Distance**

| Distance Metric | Formula                                    |
|------------------|--------------------------------------------|
| Euclidean        | \(d(x, y) = \sqrt{\sum{(x_i - y_i)^2}}\) |
| Manhattan        | \(d(x, y) = \sum{|x_i - y_i|}\)           |
| Minkowski        | \(d(x, y) = \left( \sum{|x_i - y_i|^p} \right)^{1/p}\) |

## 5.4 Applications

Nearest neighbors classification can be applied in various fields such as:

- **Image Recognition**
- **Recommendation Systems**
- **Medical Diagnosis**

## 5.5 Pros and Cons

### Pros
- Simple and easy to implement.
- Effective in high-dimensional spaces.

### Cons
- Computation cost can be high with large datasets.
- Sensitive to irrelevant features and the scale of the data.

## 5.6 Conclusion

The nearest neighbors classification is a powerful technique in machine learning. By understanding its algorithm and applications, one can leverage this method effectively for various tasks.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 42

Context: # CHAPTER 6. THE NATIVE BAYESIAN CLASSIFIER

## 6.4 Regularization

The spam filter algorithm that we discussed in the previous sections does unfortunately not work very well if we wish to use many attributes (words, word-phrases). The reason is that for many attributes we may not encounter a single example in the dataset. Say for example that we defined the word “Nigeria” as an attribute, but that our dataset did not include one of those spam emails where you are promised mountains of gold if you invest your money in someone bank in Nigeria. Also assume there are indeed a few ham emails which talk about the nice people in Nigeria. Then any future email that mentions Nigeria is classified as ham with 100% certainty. More importantly, one cannot recover from this decision even if the email also mentions viagra, enlargement, mortgage and so on, all in a single email! This can be seen by the fact that 

\[ P_{\text{spam}}(X^{\text{Nigeria}} > 0) = -\infty \]

while the final score is a sum of these individual word-scores.

To counteract this phenomenon, we give each word in the dictionary a small probability of being present in any email (spam or ham), before seeing the data. This process is called smoothing. The impact on the estimated probabilities are given below:

\[
P_{\text{spam}}(X_i = j) = \alpha + \frac{\sum_{n} \mathbb{I}[X_n = j \land Y_n = 1]}{V_{\alpha} + \sum_{n} \mathbb{I}[Y_n = 1]} \tag{6.12}
\]

\[
P_{\text{ham}}(X_i = j) = \alpha + \frac{\sum_{n} \mathbb{I}[X_n = j \land Y_n = 0]}{V_{\alpha} + \sum_{n} \mathbb{I}[Y_n = 0]} \tag{6.13}
\]

where \( V_i \) is the number of possible values of attribute \( i \). Thus, \( \alpha \) can be interpreted as a small, possibly fractional number of “pseudo-observations” of the attribute in question. It’s like adding these observations to the actual dataset.

What value for \( \alpha \) do we use? Fitting its value on the dataset will not work, because the reason we added it was exactly because we assumed there was too little data in the first place (we hadn’t received one of those annoying “Nigeria” emails yet) and thus will relate to the phenomenon of overfitting. However, we can use the trick described in section ?? where we split the data two pieces. We learn a model on one chunk and adjust \( \alpha \) such that performance of the other chunk is optimal. We play this game multiple times with different splits and average the results.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 43

Context: # 6.5 Remarks

One of the main limitations of the NB classifier is that it assumes independence between attributes (This is presumably the reason why we call it the naive Bayesian classifier). This is reflected in the fact that each classifier has an independent vote in the final score. However, imagine that I measure the words, "home" and "mortgage". Observing "mortgage" certainly raises the probability of observing "home". We say that they are positively correlated. It would therefore be more fair if we attributed a smaller weight to "home" if we already observed mortgage because they convey the same thing: this email is about mortgages for your home. One way to obtain a more fair voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy because too many parameters may lead to overfitting.

# 6.6 The Idea In a Nutshell

Consider Figure ??. We can classify data by building a model of how the data was generated. For NB we first decide whether we will generate a data-item from class \( Y = 0 \) or class \( Y = 1 \). Given that decision we generate the values for \( D \) attributes independently. Each class has a different model for generating attributes. Classification is achieved by computing which model was more likely to generate the new data-point, biasing the outcome towards the class that is expected to generate more data.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 58

Context: # CHAPTER 8: SUPPORT VECTOR MACHINES

## Introduction

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outliers detection.

## Key Concepts

1. **Hyperplane**: A hyperplane is a decision boundary that helps to categorize data points.

2. **Support Vectors**: Support vectors are the data points that are closest to the hyperplane and influence its position and orientation.

3. **Margin**: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin.

## Implementation Steps

- **Step 1**: Choose the appropriate kernel (linear, polynomial, RBF).
- **Step 2**: Train the SVM model using the training dataset.
- **Step 3**: Evaluate the model using a test dataset.

## Advantages of SVM

- Effective in high-dimensional spaces.
- Robust against overfitting, especially in high-dimensional datasets.

## Disadvantages of SVM

- Less effective on very large datasets.
- Poor performance with overlapping classes.

## Conclusion

Support Vector Machines are powerful tools for classification and regression tasks, offering advantages in high-dimensional spaces while having limitations in very large datasets.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 70

Context: # CHAPTER 11: KERNEL K-MEANS AND SPECTRAL CLUSTERING

## 11.1 Introduction

Kernel K-means is an extension of the K-means clustering algorithm. It allows for clustering in a feature space that is not linearly separable, using a kernel function to transform the original data.

## 11.2 Algorithm

The steps of the Kernel K-means algorithm are as follows:

1. **Choose a kernel function** \( K(x_i, x_j) \).
2. **Initialize** \( k \) centroids randomly.
3. **Assign points** to the nearest centroid using the kernel.
4. **Update centroids** by computing the mean of the assigned points in the feature space.
5. **Repeat** steps 3 and 4 until convergence.

## 11.3 Spectral Clustering

Spectral clustering is another clustering technique that uses the eigenvalues of the similarity matrix of the data.

### 11.3.1 Algorithm

1. Compute the **similarity matrix**.
2. Compute the **Laplacian matrix** from the similarity matrix.
3. Compute the eigenvalues and eigenvectors of the Laplacian matrix.
4. Use the top \( k \) eigenvectors to form a new feature space.
5. Apply K-means to cluster the data in the new space.

## 11.4 Conclusion

Kernel K-means and spectral clustering provide powerful methods for clustering in complex spaces, enabling better data segmentation in many applications.

## References

- *Author Name*, *Title of the Source*, Year
- *Author Name*, *Title of the Source*, Year

For further reading, explore the works on kernel methods and spectral analysis in clustering contexts.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 80

Context: # CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS

## Introduction

Fisher Linear Discriminant Analysis (LDA) is a supervised learning method used mainly for classification. It projects data onto a lower-dimensional space while maximizing class separability.

## Key Concepts

1. **Discriminant Function**  
   The goal is to find a linear combination of features that separates two or more classes of objects or events.

2. **LDA vs. PCA**  
   - LDA focuses on maximizing the separation between multiple classes, while Principal Component Analysis (PCA) focuses on variance.

3. **Regularization**  
   Techniques can be applied to improve results when the number of features is high compared to the number of samples.

## Mathematical Formulation

Given a set of classes \(C_1, C_2, ..., C_k\), let \( \mu_i \) be the mean of class \( C_i \) and \( S_w \) and \( S_b \) the within-class and between-class scatter matrices, respectively. The LDA objective is to maximize:

\[
J(w) = \frac{|S_b|}{|S_w|}
\]

where \(w\) is the projection vector.

## Implementation Steps

1. Calculate the mean vectors for each class.
2. Compute the scatter matrices \( S_w \) and \( S_b \).
3. Solve the generalized eigenvalue problem for the matrix \( S_w^{-1}S_b \).
4. Choose the top eigenvectors to form a new feature space.

## Use Cases

- Face recognition
- Medical diagnosis
- Marketing segmentation

## Conclusion

Fisher LDA is a powerful tool for classification tasks with a focus on maximizing the separation between classes. Its mathematical foundation allows for effective implementation across various domains. 

## References

1. McLachlan, G. J. (2004). *Discriminant Analysis and Statistical Pattern Recognition*. Wiley-Interscience. 
2. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). *Pattern Classification*. Wiley-Interscience. 

## Further Reading

- [Wikipedia: Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
- [Scikit-learn: Linear Discriminant Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 92

Context: # APPENDIX B. KERNEL DESIGN

## Table of Contents
1. [Introduction](#introduction)
2. [Kernel Architecture](#kernel-architecture)
   - [Components](#components)
   - [Functions](#functions)
3. [Design Considerations](#design-considerations)
4. [Conclusion](#conclusion)

## Introduction
This appendix discusses the design of the kernel and its components.

## Kernel Architecture

### Components
- **Scheduler**: Manages the execution of processes.
- **Memory Manager**: Handles memory allocation and deallocation.
- **Device Drivers**: Interfaces with hardware devices.

### Functions
1. **Process Management**
   - Creating and managing processes.
   - Scheduling and dispatching of processes.
2. **Memory Management**
   - Allocating memory for processes.
   - Handling virtual memory.

## Design Considerations
- **Performance**: The kernel should be efficient in resource management.
- **Scalability**: Must support various hardware platforms.
- **Security**: Ensures that processes cannot access each other’s memory.

## Conclusion
The kernel design is crucial for the overall system performance and functionality. Proper architecture and considerations can significantly enhance the efficiency and security of the operating system.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 93

Context: # Bibliography

1. Author, A. (Year). *Title of the Book*. Publisher.
2. Author, B. (Year). *Title of the Article*. *Journal Name*, Volume(Issue), Page Range. DOI or URL if available.
3. Author, C. (Year). *Title of the Website*. Retrieved from URL.
4. Author, D. (Year). *Title of the Thesis or Dissertation*. University Name. 

- Point 1
- Point 2
- Point 3

## Additional References

| Author            | Title                   | Year | Publisher        |
|-------------------|-------------------------|------|------------------|
| Author, E.        | *Title of Article*      | Year | Publisher Name    |
| Author, F.        | *Title of Conference*   | Year | Conference Name   |

### Notes

- Ensure that all entries follow the same citation style for consistency.
- Check for publication dates and any updates that may be required.
##########

"""QUERY: can you give me a summary of the topics discussed in the context. display it in a table format with the topic and a short description as headers. Limit it to 5 lists only"""

Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown.Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead!


Final Files Sources: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 1, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 3, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 5, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 8, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 11, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 13, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 21, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 23, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 26, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 31, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 33, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 34, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 37, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 39, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 52, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 68, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 94, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 99, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 101, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 103, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 110, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 111, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 118, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 121, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 122, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 123, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 127, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 131, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 133, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 135, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 137, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 142, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 143, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 147, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 149, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 158, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 167, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 171, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 203, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 204, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 212, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 254, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 257, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 4, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 7, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 10, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 11, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 12, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 14, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 16, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 17, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 22, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 50, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 58, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 80, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 82, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 106, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 113, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 115, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 117, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 121, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 130, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 133, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 138, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 169, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 173, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 176, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 192, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 208, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 210, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 212, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 213, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 224, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 3, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 4, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 6, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 7, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 10, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 11, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 13, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 15, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 16, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 20, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 21, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 22, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 24, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 27, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 28, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 32, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 35, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 36, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 42, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 43, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 58, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 70, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 80, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 92, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 93
==================================================
**Elapsed Time: 3.07 seconds**
==================================================


FINAL ANSWER 
Answer: Certainly! Here is a summary of the topics discussed in the context, displayed in a table format with the topic and a short description:

| **Topic**             | **Short Description**                                                                                       |
|-----------------------|-------------------------------------------------------------------------------------------------------------|
| Sets                  | Definition, notation, and operations on sets, including union, intersection, and complement.                |
| Relations             | Explanation of relations, their properties, and examples of different types of relations.                   |
| Probability           | Introduction to probability, including outcomes, events, and sample spaces, with examples and applications. |
| Graphs                | Overview of graph theory, including definitions of vertices, edges, and types of graphs like DAGs.          |
| Trees                 | Discussion on tree structures, including rooted trees, binary trees, and their applications in computer science. |

This table captures the key topics and their descriptions as discussed in the context.
==================================================
**Elapsed Time: 2.80 seconds**
==================================================