```json
{
  "query": "Please summarize the whole context. It is important that you include a summary for each file. All files should be included, so please make sure to go through the entire context",
  "namespace": "e791d701-717d-4c06-9602-2878acdbb6ad",
  "messages": [],
  "stream": false,
  "language_level": "",
  "chat_channel": "",
  "language": "German",
  "tone": "neutral",
  "writing_style": "standard",
  "model": "gemini-1.5-flash",
  "knowledgebase": "ki-dev-large",
  "seed": 0,
  "client_id": 0,
  "all_context": true,
  "follow_up_for": null,
  "knowledgebase_files_count": 0,
  "override_command": "",
  "disable_clarity_check": true,
  "custom_primer": "",
  "logging": true,
  "query_route": ""
}
```

INITIALIZATION
Knowledgebase: ki-dev-large
Base Query: Please summarize the whole context. It is important that you include a summary for each file. All files should be included, so please make sure to go through the entire context
Model: gemini-1.5-flash
**Elapsed Time: 0.00 seconds**

ROUTING
Query type: summary
**Elapsed Time: 1.72 seconds**

RAG PARAMETERS
Max Context To Include: 120
Lowest Score to Consider: 0
**Elapsed Time: 0.00 seconds**

VECTOR SEARCH ALGORITHM TO USE
Use MMR search?: True
Use Similarity search?: False
**Elapsed Time: 0.11 seconds**

VECTOR SEARCH DONE
**Elapsed Time: 9.84 seconds**

PRIMER
Primer: You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that can read knowledgebases through the "CONTEXT" that is included in the user's chat message. Your role is to act as an expert at summarization and analysis. In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality.
Be honest by admitting when a topic falls outside your scope of knowledge, and suggest alternative avenues for obtaining information when necessary. Make effective use of chat history to avoid redundancy and enhance response relevance, continuously adapting to integrate all necessary details in your interactions. Use as many tokens as possible to provide a detailed response.
**Elapsed Time: 0.20 seconds**

FINAL QUERY
Final Query: CONTEXT:

##########
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 1
Context:

# A Cool Brisk Walk Through Discrete Mathematics

**Version 2.2.1**
**Stephen Davies, Ph.D.**

## Table of Contents

1. Introduction
2. Concepts
   - Sets
   - Functions
   - Relations
3. Applications
4. Conclusion

## Introduction

In this document, we explore various topics under discrete mathematics.

## Concepts

### Sets
- Definition
- Notation
- Operations

### Functions
- Definition
- Types of functions
- Examples

### Relations
- Definition
- Properties

| Concept  | Description                                   |
|----------|-----------------------------------------------|
| Set      | A collection of distinct objects              |
| Function | A relation mapping each element of one set to exactly one element of another |
| Relation | A connection between two sets                 |

## Applications

The concepts of discrete mathematics can be applied in various fields, including computer science, information technology, and combinatorial design.

## Conclusion

Discrete mathematics provides fundamental knowledge required for numerous applications in modern science and technology.

Image Analysis:

### Analysis of the Attached Visual Content

#### 1. Localization and Attribution:
- **Image Identification and Numbering:**
  - This is a single image and is referred to as **Image 1**.

#### 2. Object Detection and Classification:
- **Image 1:**
  - The image depicts a forest scene with several birch trees.
- **Key Features of Detected Objects:** - **Trees:** The trees are characterized by their tall, slender trunks with distinctive white bark and dark horizontal markings typical of birches. - **Forest Ground:** There are grassy elements and what appears to be a forest floor with various green and brown patches. #### 3. Scene and Activity Analysis: - **Image 1:** - **Scene Description:** - The image is an artistic painting capturing a serene forest with several birch trees. - There is no human activity depicted; the scene is calm, evoking a sense of peace and natural beauty. #### 4. Text Analysis: - **Image 1:** - **Detected Text:** - Top: None - Bottom: - "A Cool Brisk Walk" (Title) - "Through Discrete Mathematics" (Subtitle) - "Version 2.2.1" (Version Information) - "Stephen Davies, Ph.D." (Author Name) - **Text Significance:** - The text suggests that this image is probably the cover of a book titled "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies, Ph.D. The version number indicates it is version 2.2.1, signifying a possibly updated or revised edition. #### 8. Color Analysis: - **Image 1:** - **Color Composition:** - The dominant colors are various shades of green (foliage), white, and light brown (birch trunks), and a soft blue (background). - **Color Impact:** - The colors create a cool, calming effect, reflecting the serene mood of a walk through a quiet forest. #### 9. Perspective and Composition: - **Image 1:** - **Perspective:** - The image appears to be from a standing eye-level perspective, giving a natural view as one would see while walking through the forest. - **Composition:** - The birch trees are placed prominently in the foreground, with the background filled with dense foliage. The positioning of the text at the bottom complements the overall composition without distracting from the artwork. #### 10. 
Contextual Significance: - **Image 1:** - **Overall Document Context:** - As a book cover, the image serves the purpose of drawing the reader's attention. The serene and inviting forest scene combined with the title implies a gentle, inviting approach to the subject of discrete mathematics. The image successfully combines art with educational content, aiming to make the subject of discrete mathematics more appealing and accessible to the reader. The natural, calming scenery juxtaposed with an academic theme may suggest that the book intends to provide a refreshing perspective on a typically challenging subject. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 2 Context: Copyright © 2023 Stephen Davies. ## University of Mary Washington **Department of Computer Science** James Farmer Hall 1301 College Avenue Fredericksburg, VA 22401 --- Permission is granted to copy, distribute, transmit and adapt this work under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). The accompanying materials at [www.allthemath.org](http://www.allthemath.org) are also under this license. If you are interested in distributing a commercial version of this work, please contact the author at [stephen@umw.edu](mailto:stephen@umw.edu). The LaTeX source for this book is available from: [https://github.com/divilian/cool-brisk-walk](https://github.com/divilian/cool-brisk-walk). --- Cover art copyright © 2014 Elizabeth M. Davies. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 3 Context: # Contents at a glance - **Contents at a glance** …… i - **Preface** …… iii - **Acknowledgements** …… v 1. **Meetup at the trailhead** …… 1 2. **Sets** …… 7 3. **Relations** …… 35 4. **Probability** …… 59 5. **Structures** …… 85 6. **Counting** …… 141 7.
**Numbers** …… 165 8. **Logic** …… 197 9. **Proof** …… 223 Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 4 Context: I'm unable to process images. Please provide the text you'd like formatted in Markdown, and I can assist you with that. Image Analysis: Sure, let's analyze the provided visual content in detail according to the specified aspects: **Image 1:** ### 1. Localization and Attribution - **Image 1:** The only image on the page. ### 2. Object Detection and Classification - **Objects Detected:** - **Building:** A large building that appears to be a hotel or a residential complex. - **Street:** A surrounding street with sidewalks. - **Trees and Vegetation:** Trees and landscaping to the left and right of the building. - **Cars:** Multiple vehicles parked in front of the building. ### 3. Scene and Activity Analysis - **Scene Description:** The scene captures a large building with a classic architectural style, possibly a hotel or residential building, on a sunny day. There are several cars parked in front, and the surroundings include trees and well-maintained vegetation. - **Main Actors and Actions:** The building is the central actor, with no significant human presence visible. ### 4. Text Analysis - **Text Detected:** None within the image. ### 9. Perspective and Composition - **Perspective:** The image appears to be taken at eye level from a distance, offering a clear frontal view of the building. - **Composition:** The building is centrally placed, occupying most of the frame. The trees and cars on either side balance the composition and provide context. ### 8. Color Analysis - **Dominant Colors:** - **White/Cream:** The building's exterior. - **Green:** The trees and vegetation. - **Gray:** The pavement and cars. 
- **Impact on Perception:** The predominant white/cream color suggests cleanliness and elegance, while the green vegetation adds a sense of nature and calm. ### 6. Product Analysis - **Products Depicted:** The primary element is the building itself. - **Features:** Multi-story, numerous windows and balconies, classic architecture. - **Materials:** Likely concrete or brick with a stucco exterior. - **Colors:** Mainly white/cream with darker-colored windows and roofing. ### 10. Contextual Significance - **Overall Message:** The image likely aims to showcase the building, possibly for marketing or informational purposes, emphasizing its grandeur and well-maintained surroundings. - **Contribution to Theme:** It contributes to a theme of elegance, quality housing, or hospitality services. ### 7. Anomaly Detection - **Possible Anomalies:** None detected. The scene appears typical for this type of location. ### 11. Metadata Analysis - **Information Not Available:** No metadata is available to evaluate. --- This analysis covers all requested aspects based on the visual content provided. If there were specific elements or additional images, they could be analyzed similarly. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 8 Context: I'm unable to view images. Please provide the text in Markdown format for me to assist you. Image Analysis: ### Image Analysis: #### 1. **Localization and Attribution:** - **Image Number**: Image 1 #### 2. **Object Detection and Classification:** - **Detected Objects**: The image includes a cup of coffee on a saucer, three coffee beans, and some sparkles implying aroma or steam. - **Object Classification**: - **Cup of Coffee**: A central object, likely made of ceramic and filled with coffee. - **Coffee Beans**: Close to the cup, classifiable as roasted beans. - **Sparkles/Steam**: Above the cup, representing the aroma of the hot coffee. #### 3. 
**Scene and Activity Analysis:** - **Scene Description**: This image displays a cup of coffee on a saucer with coffee beans around it. Decorative sparkles or steam lines are depicted rising above the coffee, suggesting freshness or warmth. - **Activities**: No human activities observed; the focus is on the presentation of the coffee. #### 4. **Text Analysis:** - No text detected in this image. #### 5. **Diagram and Chart Analysis:** - Not applicable; no diagrams or charts are present in the image. #### 6. **Product Analysis:** - **Product**: A cup of coffee. - **Main Features**: - A white/neutral-colored ceramic cup and saucer filled with coffee. - Roasted coffee beans placed nearby, suggesting the quality or type of coffee used. - **Materials**: Ceramic cup and saucer, roasted coffee beans. - **Colors**: Predominantly brown (coffee and beans) and white (cup and saucer). #### 7. **Anomaly Detection:** - No anomalies detected; the elements are consistent with a typical coffee presentation. #### 8. **Color Analysis**: - **Dominant Colors**: Brown and white. - **Brown**: Evokes warmth and richness, associated with coffee and coffee beans. - **White**: Clean and neutral, ensures focus on the coffee. #### 9. **Perspective and Composition**: - **Perspective**: Bird’s eye view. - **Composition**: - Centralized with the cup of coffee as the main focus. - Balanced by the placement of coffee beans around the cup. - Decorative elements like sparkles/steam give a visual cue towards the coffee's freshness or aroma. #### 10. **Contextual Significance**: - The image can be used in contexts such as coffee advertisements, cafe menus, or articles about coffee. It emphasizes the quality and appeal of the coffee through its visual presentation. #### 11. **Metadata Analysis**: - No metadata available for analysis. #### 12. **Graph and Trend Analysis**: - Not applicable; no graphs or trends are present in the image. #### 13. 
**Graph Numbers**: - Not applicable; no data points or graph numbers are present in the image. ### Additional Aspects: #### **Ablaufprozesse (Process Flows)**: - Not applicable; no process flows depicted. #### **Prozessbeschreibungen (Process Descriptions)**: - Not applicable; no processes described. #### **Typen Bezeichnung (Type Designations)**: - Not applicable; no type designations specified. #### **Trend and Interpretation**: - Not applicable; no trends depicted. #### **Tables**: - Not applicable; no tables included. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 11 Context: # Understanding Integration and Differentiation People in your family, there will never be 5.3 of them (although there could be 6 someday). The last couple of entries on this list are worth a brief comment. They are math symbols, some of which you may be familiar with. On the right side — in the continuous realm — are \( \int \) and \( \frac{d}{dx} \), which you’ll remember if you’ve taken calculus. They stand for the two fundamental operations of integration and differentiation. Integration, which can be thought of as finding “the area under a curve,” is basically a way of adding up a whole infinite bunch of numbers over some range. When you “integrate the function \( x^2 \) from 3 to 5,” you’re really adding up all the tiny, tiny little vertical strips that comprise the area from \( x = 3 \) on the left to \( x = 5 \) on the right. Its corresponding entry in the left column of the table is \( \Sigma \), which is just a short-hand for “sum up a bunch of things.” Integration and summation are equivalent operations; it’s just that when you integrate, you’re adding up all the (infinitely many) slivers across the real-line continuum. When you sum, you’re adding up a fixed sequence of entries, one at a time, like in a loop. \( \Sigma \) is just the discrete “version” of \( \int \).
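The Σ-versus-∫ correspondence can be checked numerically: summing thin vertical strips of \( x^2 \) from 3 to 5 approaches the exact integral. A minimal Python sketch (the function and range come from the passage; the strip count and helper name are illustrative choices):

```python
# Approximate the integral of x^2 from 3 to 5 by summing thin vertical strips.
def discrete_integral(f, a, b, strips):
    width = (b - a) / strips  # width of each thin strip
    # Sigma: add up f(x) * width over each strip's left edge
    return sum(f(a + i * width) * width for i in range(strips))

f = lambda x: x ** 2
exact = (5 ** 3 - 3 ** 3) / 3  # antiderivative x^3/3 evaluated from 3 to 5
approx = discrete_integral(f, 3, 5, 100_000)
print(exact, approx)  # the discrete sum converges toward the integral
```

As the number of strips grows, the discrete sum and the continuous integral become indistinguishable, which is exactly the relationship the passage describes.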
The same sort of relationship holds between ordinary subtraction (\(-\)) and differentiation (\(\frac{d}{dx}\)). If you’ve plotted a bunch of discrete points on \( x \)-\( y \) axes, and you want to find the slope between two of them, you just subtract their \( y \) values and divide by the \( x \) distance between them. If you have a smooth continuous function, on the other hand, you use differentiation to find the slope at a point: this is essentially subtracting the tiny tiny difference between supremely close points and then dividing by the distance between them. Don’t worry; you don’t need to have fully understood any of the integration or differentiation stuff I just talked about, or even to have taken calculus yet. I’m just trying to give you some feel for what “discrete” means, and how the dichotomy between discrete and continuous really runs through all of math and computer science. In this book, we will mostly be focusing on discrete values and structures, which turn out to be of more use in computer science. That’s partially because, as you probably know, computers #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 13 Context: # 1.1 EXERCISES 1. **What’s the opposite of concrete?** Abstract. 2. **What’s the opposite of discrete?** Continuous. 3. **Consider a quantity of water in a glass. Would you call it abstract, or concrete? Discrete, or continuous?** Concrete, since it’s a real entity you can experience with the senses. Continuous, since it could be any number of ounces (or liters, or tablespoons, or whatever). The amount of water certainly doesn’t have to be an integer. (Food for thought: since all matter is ultimately comprised of atoms, are even substances like water discrete?) 4. **Consider the number 27. Would you call it abstract, or concrete? 
Discrete, or continuous?** Abstract, since you can’t see or touch or smell “twenty-seven.” Probably discrete, since it’s an integer, and when we think of whole numbers we think “discrete.” (Food for thought: in real life, how would you know whether I meant the integer “27” or the decimal number “27.00”? And does it matter?) 5. **Consider a bit in a computer’s memory. Would you call it abstract, or concrete? Discrete, or continuous?** Clearly it’s discrete. Abstract vs. concrete, though, is a little tricky. If we’re talking about the actual transistor and capacitor that’s physically present in the hardware, holding a tiny charge in some little chip, then it’s concrete. But if we’re talking about the value “1” that is conceptually part of the computer’s currently executing state, then it’s really abstract, just like 27 was. In this book, we’ll always be talking about bits in this second, abstract sense. 6. **If math isn’t just about numbers, what else is it about?** Any kind of abstract object that has properties we can reason about. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 14 Context: I'm unable to assist with that. Image Analysis: ### Analysis of Attached Visual Content #### Image Identification and Localization - **Image 1**: Single image provided in the content. #### Object Detection and Classification - **Image 1**: - **Objects Detected**: - Female figurine resembling a prehistoric representation. - The figurine appears to be crafted from a terracotta or similar clay-like material. - **Classification**: - The object is classified under 'Artifacts' and 'Historical Objects'. - **Key Features**: - The sculpture has exaggerated female attributes including a prominent chest and belly, which are indicative of fertility symbols.
- The face does not have detailed features, implying focus on bodily form rather than facial details. - The object seems ancient, slightly weathered, suggesting it to be an archaeological artifact. #### Scene and Activity Analysis - **Image 1**: - **Scene Description**: - The image shows a close-up of a single artifact against a neutral background, possibly for the purpose of highlighting the artifact itself. - **Activities**: - No dynamic activity; the object is displayed presumably for appreciation or study. #### Perspective and Composition - **Image 1**: - **Perspective**: - The image is taken from a straight-on, eye-level view, ensuring the object is the primary focus. - Close-up perspective to capture detailed features. - **Composition**: - The object is centrally placed, drawing immediate attention. - The background is plain and undistracting, enhancing the focus on the artifact itself. #### Contextual Significance - **Image 1**: - **Overall Contribution**: - The artifact could be used in educational, historical, or museum contexts to study prehistoric cultures, their art, and societal values. - As a fertility symbol, it contributes to understanding sociocultural aspects of ancient civilizations. #### Color Analysis - **Image 1**: - **Dominant Colors**: - Shades of brown and beige are dominant, corresponding to the natural materials like terracotta. - **Impact on Perception**: - The earthy tones evoke a sense of antiquity and authenticity, reinforcing the perception of it being an ancient artifact. ### Conclusion The provided image is of a terracotta or similar material female figurine, likely of prehistoric origin. It is depicted with a neutral background to focus on its significant features, in particular, the enhanced bodily features indicative of fertility representations. The composition and color tones effectively highlight its historical and cultural importance. 
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 29 Context: # 2.9. COMBINING SETS Look at the right-hand side. The first pair of parentheses encloses only female computer science majors. The right pair encloses female math majors. Then we take the union of the two, to get a group which contains only females, and specifically only the females who are computer science majors or math majors (or both). Clearly, the two sides of the equals sign have the same extension. The distributive property in basic algebra doesn’t work if you flip the times and plus signs (normally \( a + (b \cdot c) \neq (a + b) \cdot (a + c) \)), but remarkably it does here: \( X \cup (Y \cap Z) = (X \cup Y) \cap (X \cup Z) \). Using the same definitions of \( X, Y, \) and \( Z \), work out the meaning of this one and convince yourself it’s always true. ## Identity Laws Simplest thing you’ve learned all day: - \( X \cup \emptyset = X \) - \( X \cap \Omega = X \) You don’t change \( X \) by adding nothing to it, or taking nothing away from it. ## Dominance Laws The flip side of the above is that \( X \cup \Omega = \Omega \) and \( X \cap \emptyset = \emptyset \). If you take \( X \) and add everything and the kitchen sink to it, you get everything and the kitchen sink. And if you restrict \( X \) to having nothing, it of course has nothing. ## Complement Laws \( X \cup \overline{X} = \Omega \). This is another way of saying “everything (in the domain of discourse) is either in, or not in, a set.” So if I take \( X \), and then I take everything not in \( X \), and smoosh the two together, I get everything. In a similar vein, \( X \cap \overline{X} = \emptyset \), because there can’t be any element that’s both in \( X \) and not in \( X \): that would be a contradiction. Interestingly, the first of these two laws has become controversial in modern philosophy.
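The identity, dominance, and complement laws above are easy to sanity-check mechanically on small finite sets. A brief Python sketch (the particular universe Ω and set X below are illustrative assumptions, not taken from the text):

```python
# Sanity-check the identity, dominance, and complement laws on small finite sets.
# The universe Omega and the set X here are illustrative choices.
Omega = {"Dad", "Mom", "Lizzy", "T.J.", "Johnny"}  # domain of discourse
X = {"Lizzy", "T.J."}
empty = set()
complement_X = Omega - X  # "X-bar": everything in Omega that is not in X

assert X | empty == X             # identity: X ∪ ∅ = X
assert X & Omega == X             # identity: X ∩ Ω = X
assert X | Omega == Omega         # dominance: X ∪ Ω = Ω
assert X & empty == empty         # dominance: X ∩ ∅ = ∅
assert X | complement_X == Omega  # complement: X ∪ X̄ = Ω
assert X & complement_X == empty  # complement: X ∩ X̄ = ∅
print("all set laws hold")
```

Python's `set` operators (`|`, `&`, `-`) map directly onto union, intersection, and complement-within-a-universe, so each law becomes a one-line assertion.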
This first complement law is called “the law of the excluded middle,” and is explicitly repudiated in many modern logic systems. ## De Morgan's Laws Now these are worth memorizing, if only because (1) they’re incredibly important, and (2) they #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 34 Context: # 2.12 Partitions Finally, there’s a special variation on the subset concept called a **partition**. A partition is a group of subsets of another set that together are both **collectively exhaustive** and **mutually exclusive**. This means that every element of the original set is in **one and only one** of the sets in the partition. Formally, a partition of \( X \) is a group of sets \( X_1, X_2, \ldots, X_n \) such that: \[ X_1 \cup X_2 \cup \cdots \cup X_n = X, \] and \[ X_i \cap X_j = \emptyset \quad \text{for all } i \neq j. \] So let’s say we’ve got a group of subsets that are supposedly a partition of \( X \). The first line above says that if we combine the contents of all of them, we get everything that’s in \( X \) (and nothing more). This is called being collectively exhaustive. The second line says that no two of the sets have anything in common; they are mutually exclusive. As usual, an example is worth a thousand words. Suppose the set \( D \) is \(\{ \text{Dad, Mom, Lizzy, T.J., Johnny} \}\). A partition is any way of dividing \( D \) into subsets that meet the above conditions. One such partition is: - \(\{ \text{Lizzy, T.J.} \}\), - \(\{ \text{Mom, Dad} \}\), - \(\{ \text{Johnny} \}\). Another one is: - \(\{ \text{Lizzy} \}\), - \(\{ \text{T.J.} \}\), - \(\{ \text{Mom} \}\), - \(\{ \text{Dad} \}\), - \(\{ \text{Johnny} \}\). Yet another is: - \(\emptyset\), - \(\{ \text{Lizzy, T.J., Johnny, Mom, Dad} \}\), and - \(\emptyset\). #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 38 Context: ``` # CHAPTER 2. SETS 1.
**Is Yoda ⊂ J?** No. Yoda isn't even a set, so it can't be a subset of anything. 2. **Is { Yoda } ⊆ J?** Yes. The { Yoda } set that contains only Yoda is in fact a subset of J. 3. **Is { Yoda } ∈ J?** No. Yoda is one of the elements of J, but { Yoda } is not. In other words, J contains Yoda, but J does not contain a set which contains Yoda (nor does it contain any sets at all, in fact). 4. **Is S ⊆ J?** No. 5. **Is G ⊆ F?** Yes, since the two sets are equal. 6. **Is G ⊂ F?** No, since the two sets are equal, so neither is a proper subset of the other. 7. **Is ∅ ⊆ C?** Yes, since the empty set is a subset of every set. 8. **Is ∅ ⊂ C?** Yes, since the empty set is a subset of every set. 9. **Is F ⊆ Ω?** Yes, since every set is a subset of Ω. 10. **Is F ⊂ Ω?** Yes, since every set is a subset of Ω, and F is certainly not equal to Ω. 11. **Suppose X = { Q, α, { Z }, ∅ }. Is α ∈ X? Is ∅ ∈ X? Is ∅ ⊆ X?** Yes, yes, and yes. The empty set is an element of X because it's one of the elements, and it's also a subset of X because it's a subset of every set. Hmmm... 12. **Let A = { Macbeth, Hamlet, Othello }. B = { Scrabble, Monopoly, Othello }, and T = { Hamlet, Village, Town }. What's A ∩ B?** { Othello }. 13. **What's A ∪ B?** { Macbeth, Hamlet, Othello, Scrabble, Monopoly } (The elements can be listed in any order). ``` #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 42 Context: I'm unable to view images or attachments. Please provide the text directly, and I'll help you format it in Markdown. Image Analysis: ### Image Analysis #### 1. Localization and Attribution: - **Image**: There is only one image provided. #### 2. Object Detection and Classification: - **Objects**: - Circular objects on a surface - Rectangular object in the middle - Grid or lines on the surface - Text in the center of the image #### 3. Scene and Activity Analysis: - **Scene Description**: - The scene appears to be a diagram or an illustrative layout on a surface.
It depicts connections or pathways between various nodes or points. - **Activities**: - It resembles a chart or a flow diagram used for visualizing a process, hierarchy, or series of steps. #### 4. Text Analysis: - **Extracted Text**: - The text in the center reads: "Prototyping". - **Analysis**: - The term "Prototyping" indicates that the diagram is likely illustrating the steps, components, or elements involved in the process of creating prototypes. This could encompass anything from design thinking steps to development phases. #### 5. Diagram and Chart Analysis: - **Diagram Description**: - The diagram is a visually organized depiction with lines connecting different points. - These points could represent steps, features, or components in a process. - There is no additional text around the objects, making it a high-level illustration. - **Key Insights**: - The primary focus is on the process of "Prototyping." The connections indicate interactions or dependencies between different elements. #### 8. Color Analysis: - **Color Composition**: - The background is white. - The circular objects and lines are black. - The central rectangle has black text and a white background. - **Impact**: - The monochromatic color scheme with white background and black lines/objects makes it straightforward and easy to focus on the structure and relationships within the diagram. #### 9. Perspective and Composition: - **Perspective**: - The image appears to be taken from a bird’s eye view directly above the diagram. - **Composition**: - The central text "Prototyping" is the focal point. - Lines emanate from or lead to this central point, creating a balanced and symmetrical layout. ### Conclusion: The image is a visually minimalist and symmetrical diagram focusing on the concept of "Prototyping". It uses a central text element with connected nodes or points to represent a process or series of interactions. 
The monochrome color palette ensures clarity and directs attention towards the structure of the illustration without any distractions. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 57 Context: # 3.7. PROPERTIES OF FUNCTIONS would be “phoneExtension(Sally) = 1317”, indicating that Sally can be reached at x1317. Some of the available extensions may be currently unused, but every employee does have one (and only one) which makes it a function. But since no two employees have the same extension, it is also an injective function. Injective functions are sometimes called **one-to-one** functions. (One-to-one and injective are exact synonyms.) - **Surjective.** A surjective function is one that reaches all the elements of its codomain: some x does in fact reach every y. Another way of saying this is: for a surjective function, the range equals the entire codomain. You can see that this is impossible if the domain is smaller than the codomain, since there wouldn’t be enough x values to reach all the y values. If we added Pepsi and Barq’s Root Beer to our Y set, we would thereby eliminate the possibility of any surjective functions from X to Y (unless we also added wizards, of course). The function `worksIn` — with employees as the domain and departments as the codomain — is an example of a surjective function. One mapping of this function would be `worksIn(Sid) = Marketing`, indicating that Sid works in the Marketing department. Each employee works for one department, which makes it a function. But at least one employee works in every department (i.e., there are no empty departments with no people in them) which makes it surjective. Surjective functions are sometimes called **“onto”** functions. (Onto and surjective are exact synonyms.) - **Bijective.** Finally, a bijective function is simply one that is both injective and surjective.
With an injective function, every y is mapped to by at most one x; with a surjective function, every y is mapped to by at least one x; so with a bijective function, every y is mapped to by exactly one x. Needless to say, the domain and the codomain must have the same cardinality for this to be possible. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 62 Context: # CHAPTER 3. RELATIONS 7. **Is T symmetric?** No, since it contains \((Kirk, Scotty)\) but not \((Scotty, Kirk)\). 8. **Is T antisymmetric?** No, since it contains \((Scotty, Spock)\) and also \((Spock, Scotty)\). 9. **Is T transitive?** Yes, since for every \((x, y)\) and \((y, z)\) present, the corresponding \((x, z)\) is also present. (The only example that fits this is \(x = Kirk, y = Spock, z = Scotty\), and the required ordered pair is indeed present.) 10. Let H be an endorelation on the same set of characters, defined as follows: \{ \((Kirk, Kirk)\), \((Spock, Spock)\), \((Uhura, Scotty)\), \((Scotty, Uhura)\), \((Spock, McCoy)\), \((McCoy, Spock)\), \((Scotty, Scotty)\), \((Uhura, Uhura)\) \}. **Is H reflexive?** No, since it’s missing \((McCoy, McCoy)\). 11. **Is H symmetric?** Yes, since for every \((x, y)\) it contains, the corresponding \((y, x)\) is also present. 12. **Is H antisymmetric?** No, since it contains \((Uhura, Scotty)\) and also \((Scotty, Uhura)\). 13. **Is H transitive?** No, since it contains \((McCoy, Spock)\) and \((Spock, McCoy)\), but not the required \((McCoy, McCoy)\). 14. Let outranks be an endorelation on the set of all crew members of the Enterprise, where \((x, y) \in\) outranks if character \(x\) has a higher Star Fleet rank than \(y\). **Is outranks reflexive?** No, since no officer outranks him/herself. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 66 Context: I'm unable to assist with this request.
Image Analysis: **Image Analysis** 1. **Localization and Attribution:** - There is only one image on the page, referred to as **Image 1**. - **Image 1** is located centrally on the page. 2. **Object Detection and Classification:** - **Image 1** contains several objects, including: - **Text Elements:** The text "Critical Path is the Longest of all Paths" is prominently displayed. - **Graphical Elements:** A flowchart or process diagram is visible, with various nodes and connecting arrows. - **Nodes and Connections:** Several boxes (nodes) with labeled text and directed arrows indicating the flow or sequence. 3. **Scene and Activity Analysis:** - **Image 1** depicts a flowchart or a process diagram. This generally represents a sequence of steps in a process or workflow. The primary activity is the visualization of the critical path in a process. - The nodes likely represent different tasks or milestones. - The connecting arrows show dependencies and the sequential order of tasks. - The critical path, being the longest, indicates the sequence of tasks that determine the total duration of the process. 4. **Text Analysis:** - Extracted Text: "Critical Path is the Longest of all Paths" - Significance: This text emphasizes the importance of the critical path in project management or process flow. It signifies that the critical path dictates the overall time required to complete all tasks in the process. 5. **Diagram and Chart Analysis:** - **Image 1** features a diagram depicting a critical path: - **Axes and Scales:** No conventional axes as it's a flowchart. - **Nodes:** Each node likely represents a step or task in the process, with labeled connectors showing task dependencies. - **Key Insight:** Identifying the critical path helps in managing project timelines effectively, ensuring project milestones are met as scheduled. 6. **Anomaly Detection:** - There are no apparent anomalies or unusual elements in **Image 1**. The flowchart appears structured and coherent. 7. 
**Perspective and Composition:** - The image is created in a straightforward, top-down perspective, typical for process diagrams and flowcharts. - Composition centers around the critical path, with nodes and connectors drawing attention to the sequence and dependencies. 8. **Contextual Significance:** - The image likely serves an educational or informative purpose within documents or presentations related to project management or process optimization. - The depiction of the critical path forms a crucial element in understanding how task sequences impact project timelines. 9. **Trend and Interpretation:** - Trend: The longest path in the sequence (critical path) directly impacts the overall duration for project completion. - Interpretation: Any delay in tasks on the critical path will delay the entire project, underlining the need to monitor and manage these tasks closely. 10. **Process Description:** - The image describes a process where the longest sequence of tasks (critical path) dictates the project length, requiring focus and management to ensure timely completion. 11. **Graph Numbers:** - Data points and specific numbers for each node are not labeled in the image, but each node represents a step that adds up to form the critical path. By examining **Image 1**, it becomes clear that it plays a crucial role in illustrating process flows and emphasizing the importance of the critical path in project management.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 78 Context:

# CHAPTER 4. PROBABILITY

Again, let’s say we had information leaked to us that the winner, whoever she may be, is female. We can now update our estimate:

$$
Pr(R|F) = \frac{Pr(R \cap F)}{Pr(F)} = \frac{Pr(\{Kelly\})}{Pr(\{Kelly, Fantasia, Carrie\})} = \frac{.2}{.5} = .4
$$

You see, once we find out that David is no longer a possibility, our only remaining hope for a rock star is Kelly.
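The restricted-world arithmetic in this example can be replayed in code. The sketch below is illustrative only: the individual probabilities for Fantasia, Carrie, and David are assumed values, chosen so that Pr({Kelly}) = .2 and Pr(F) = .5 as in the text:

```python
# Hypothetical point probabilities for the winner; only Kelly's weight
# (.2) and the female total (.5) are taken from the text, the rest are
# assumptions for illustration.
pr = {"Kelly": 0.2, "Fantasia": 0.2, "Carrie": 0.1, "David": 0.5}

F = {"Kelly", "Fantasia", "Carrie"}   # event: the winner is female
R = {"Kelly", "David"}                # event: the winner is a rock star

def prob(event):
    """Probability of an event: sum the weights of its outcomes."""
    return sum(pr[w] for w in event)

def cond(a, b):
    """Conditional probability Pr(A|B) = Pr(A ∩ B) / Pr(B)."""
    return prob(a & b) / prob(b)

print(cond(R, F))  # 0.2 / 0.5 = 0.4
```

Conditioning on F simply renormalizes: the division by `prob(b)` is exactly the "block out everything outside F" step described in the text.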
And she has only 40% of the probability that’s left over. Note that this is a higher chance for her personally—she’s got to be excited by the press leak—but it’s lower for rock stars, of which she is only one (and evidently, not the predicted strongest). Background knowledge can even peg our probability estimate to an extreme: all the way to 0, or to 1. What’s \(Pr(U|C)\), the probability of an underage winner, given that he/she is a country singer? The intersection of \(U\) and \(C\) is zero, so this makes \(Pr(U|C) = 0\). In words: a country winner eliminates any possibility of an underage winner. And what’s \(Pr(F|U)\), the probability that a woman wins, given that we know the winner to be underage? Well, \(F \cap U\) and \(U\) are the same (check me), so \(Pr(F|U) = \frac{Pr(F \cap U)}{Pr(U)} = \frac{Pr(U)}{Pr(U)} = 1\). Therefore, an underage winner guarantees a female winner. The way I think about conditional probability is this: look at the diagram, consider the events known to have occurred, and then mentally block out everything except that. Once we know the background fact(s), we’re essentially dealing with a restricted world. Take the example of the known female winner. Once we know that event \(F\) in fact occurred, we can visually filter out David, and look at the F blob as though that were our entire world. In this restricted female-only view, the underage elements comprise a greater percentage of the total than they did before. And half the rock-star elements have now been obscured, leaving only Kelly as the one rock star among the remaining three.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 81 Context:

# 4.5 TOTAL PROBABILITY

$$
Pr(A) = Pr(A \cap C_1) + Pr(A \cap C_2) + \cdots + Pr(A \cap C_N)
$$
$$
= Pr(A|C_1)Pr(C_1) + Pr(A|C_2)Pr(C_2) + \cdots + Pr(A|C_N)Pr(C_N)
$$
$$
= \sum_{k=1}^{N} Pr(A|C_k)Pr(C_k)
$$

is the formula.¹ Let's take an example of this approach.
Suppose that as part of a promotion for Muvico Cinemas movie theatre, we're planning to give a door prize to the 1000th customer this Saturday afternoon. We want to know, though, the probability that this person will be a minor. Figuring out how many patrons overall will be under 18 might be difficult. But suppose we're showing these three films on Saturday: *Spider-Man: No Way Home*, *Here Before*, and *Sonic the Hedgehog 2*. We can estimate the fraction of each movie's viewers that will be minors: .6, .01, and .95, respectively. We can also predict how many tickets will be sold for each film: 2,000 for *Spider-Man*, 500 for *Here Before*, and 1,000 for *Sonic*. Applying frequentist principles, we can compute the probability that a particular visitor will be seeing each of the movies:

- \(Pr(Spider\text{-}Man) = \frac{2000}{2000 + 500 + 1000} \approx 0.571\)
- \(Pr(Here \ Before) = \frac{500}{2000 + 500 + 1000} \approx 0.143\)
- \(Pr(Sonic) = \frac{1000}{2000 + 500 + 1000} \approx 0.286\)

¹ If you're not familiar with the notation in that last line, realize that \(Σ\) (a capital Greek "sigma") just represents a sort of loop with a counter. The \(k = 1\) below the sign means that the counter is named \(k\) and starts at 1. The \(N\) above the sign means the counter goes up to \(N\), which is its last value. And what does the loop do? It adds up a cumulative sum. The thing being added to the total each time through the loop is the expression to the right of the sign. The last line with the \(Σ\) is just a more compact way of expressing the preceding line.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 85 Context:

# 4.6. BAYES' THEOREM

See how that works? If I do have the disease (and there's a 1 in 1,000 chance of that), there's a .99 probability of me testing positive. On the other hand, if I don't have the disease (a 999 in 1,000 chance of that), there's a .01 probability of me testing positive anyway.
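The two branches just described combine by total probability, and Bayes' Theorem then divides one branch by their sum. A quick sketch of that arithmetic (plain Python, no libraries):

```python
pr_disease = 1 / 1000      # Pr(D): base rate of the disease
pr_pos_given_d = 0.99      # Pr(T|D): chance a sick person tests positive
pr_pos_given_not_d = 0.01  # Pr(T|¬D): chance a healthy person tests positive

# Total probability: two mutually exclusive ways to test positive.
pr_pos = (pr_pos_given_d * pr_disease
          + pr_pos_given_not_d * (1 - pr_disease))
print(round(pr_pos, 5))          # 0.01098

# Bayes' Theorem: Pr(D|T) = Pr(T|D) Pr(D) / Pr(T)
pr_d_given_pos = pr_pos_given_d * pr_disease / pr_pos
print(round(pr_d_given_pos, 4))  # 0.0902
```

The second branch (healthy but falsely positive) dominates the sum, which is exactly why the posterior comes out so low.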
The sum of those two mutually exclusive probabilities is 0.01098. Now we can use our Bayes’ Theorem formula to deduce:

\[ P(D|T) = \frac{P(T|D) P(D)}{P(T)} \]

\[ P(D|T) = \frac{.99 \cdot \frac{1}{1000}}{0.01098} \approx .0902 \]

Wow. We tested positive on a 99% accurate medical exam, yet we only have about a 9% chance of actually having the disease! Great news for the patient, but a head-scratcher for the math student. How can we understand this? Well, the key is to look back at that Total Probability calculation in equation 4.1. Remember that there were two ways to test positive: one where you had the disease, and one where you didn't. Look at the contribution to the whole that each of those two probabilities produced. The first was 0.00099, and the second was 0.00999, over ten times higher. Why? Simply because the disease is so rare. Think about it: the test fails once every hundred times, but a random person only has the disease once every thousand times. If you test positive, it’s far more likely that the test screwed up than that you actually have the disease, which is rarer than blue moons. Anyway, all of this about diseases and tests is a side note. The main point is that Bayes' Theorem allows us to recast a search for \(P(X|Y)\) into a search for \(P(Y|X)\), which is often easier to find numbers for. One of many computer science applications of Bayes' Theorem is in text mining. In this field, we computationally analyze the words in documents in order to automatically classify them or form summaries or conclusions about their contents. One goal might be to identify the true author of a document, given samples of the writing of various suspected authors. Consider the Federalist Papers,

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 99 Context:

# 5.1. GRAPHS

A graph of dependencies like this must be both **directed** and **acyclic**, or it wouldn’t make sense.
Directed, of course, means that task X can require task Y to be completed before it, without the reverse also being true. If they both depended on each other, we’d have an infinite loop, and no brownies could ever get baked! Acyclic means that no kind of cycle can exist in the graph, even one that goes through multiple vertices. Such a cycle would again result in an infinite loop, making the project hopeless. Imagine if there were an arrow from **bake for 30 mins** back to **grease pan** in Figure 5.4. Then, we’d have to grease the pan before pouring the goop into it, and we’d also have to bake before greasing the pan! We’d be stuck right off the bat: there’d be no way to complete any of those tasks since they all indirectly depend on each other. A graph that is both directed and acyclic (and therefore free of these problems) is sometimes called a **DAG** for short.

[Figure 5.4 shows the brownie-baking tasks (pour brown stuff in bowl, crack two eggs, measure 2 tbsp oil, preheat oven, grease pan, mix ingredients, pour into pan, bake for 30 mins, cool, enjoy!) with arrows indicating which tasks depend on which.]

*Figure 5.4: A DAG.*

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 100 Context:

# CHAPTER 5. STRUCTURES

## Spatial Positioning

One important thing to understand about graphs is which aspects of a diagram are relevant. Specifically, the spatial positioning of the vertices doesn’t matter. In Figure 5.2 we view Muhammad Ali in the mid-upper left, and Sonny Liston in the extreme upper right. But this was an arbitrary choice, and irrelevant. More specifically, this isn’t part of the information the diagram claims to represent. We could have positioned the vertices differently, as in Figure 5.5, and had the same graph. In both diagrams, there are the same vertices, and the same edges between them (check me). Therefore, these are mathematically the same graph.
[Figure 5.5 repositions the four vertices (George Foreman, Sonny Liston, Muhammad Ali, Joe Frazier); the edges are the same as in Figure 5.2.]

Figure 5.5: A different look to the same graph as Figure 5.2.

This might not seem surprising for the prize fighter graph, but for graphs like the MapQuest graph, which actually represent physical locations, it can seem jarring. In Figure 5.3 we could have drawn Richmond north of Fredericksburg, and Virginia Beach on the far west side of the diagram, and still had the same graph, provided that all the nodes and links were the same. Just remember that the spatial positioning is designed for human convenience, and isn’t part of the mathematical information. It’s similar to how there’s no order to the elements of a set, even though when we specify a set extensionally, we have to list them in some order to avoid writing all the element names on top of each other. On a graph diagram, we have to draw each vertex somewhere, but where we put it is simply aesthetic.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 103 Context:

# 5.1 GRAPHS

{ Harry } and { Ron, Hermione, Neville }. In other words, the “connectedness” of a graph can be represented precisely as a partition of the set of vertices. Each connected subset is in its own group, and every vertex is in one and only one group; therefore, these isolated groups are mutually exclusive and collectively exhaustive. Cool.

## Graph Traversal

If you had a long list — perhaps of phone numbers, names, or purchase orders — and you needed to go through and do something to each element of the list — dial all the numbers, scan the list for a certain name, and add up all the orders — it’d be pretty obvious how to do it. You just start at the top and work your way down. It might be tedious, but it’s not confusing. Iterating through the elements like this is called **traversing** the data structure.
You want to make sure you encounter each element once (and only once) so you can do whatever needs to be done with it. It’s clear how to traverse a list. But how to traverse a graph? There is no obvious “first” or “last” node, and each one is linked to potentially many others. And as we’ve seen, the vertices might not even be fully connected, so a traversal path through all the nodes might not even exist. There are two different ways of traversing a graph: **breadth-first** and **depth-first**. They provide different ways of exploring the nodes, and as a side effect, each is able to discover whether the graph is connected or not. Let’s look at each in turn.

### Breadth-First Traversal

With breadth-first traversal, we begin at a starting vertex (it doesn’t matter which one) and explore the graph cautiously and delicately. We probe equally deep in all directions, making sure we’ve looked a little ways down each possible path before exploring each of those paths a little further.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 118 Context:

# CHAPTER 5. STRUCTURES

*Figure 5.14: The stages of Prim's minimal connecting edge set algorithm. Heavy lines indicate edges that have been (irrevocably) added to the set.*

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 121 Context:

## 5.2. TREES

Instead, we would have a different rooted tree, depicted in the right half of the figure.
Both of these rooted trees have all the same edges as the free tree did: B is connected to both A and C, F is connected only to A, etc. The only difference is which node is designated the root.

```
      A
     / \
    B   F
    |
    C
   / \
  D   E
```

```
      C
    / | \
   B  D  E
   |
   A
   |
   F
```

**Figure 5.16:** Two different rooted trees with the same vertices and edges.

Up to now we’ve said that the spatial positioning on graphs is irrelevant. But this changes a bit with rooted trees. Vertical positioning is our only way of showing which nodes are “above” others, and the word “above” does indeed have meaning here: it means closer to the root. The depth of a node shows how many steps it is away from the root. In the right rooted tree, nodes B, D, and E are all one step away from the root (C), while node F is three steps away. The key aspect to rooted trees — which is both their greatest advantage and greatest limitation — is that every node has one and only one path to the root. This behavior is inherited from free trees: as we noted, every node has only one path to every other. Trees have a myriad of applications. Think of the files and folders on your hard drive: at the top is the root of the filesystem (perhaps “/” on Linux/Mac or “C:\” on Windows) and underneath that are named folders. Each folder can contain files as well as other named folders, and so on down the hierarchy. The result is that each file has one, and only one, distinct path to it from the top of the filesystem. The file can be stored, and later retrieved, in exactly one way.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 128 Context:

# CHAPTER 5. STRUCTURES

Finally, to traverse a tree in-order, we:

1. Treat the left child and all its descendants as a subtree, and traverse it in its entirety.
2. Visit the root.
3. Traverse the right subtree in its entirety.
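The three steps above translate almost directly into a recursive function. This is a minimal sketch, not the book's code; the `Node` class and the tiny sample tree are invented for illustration:

```python
class Node:
    """A binary tree node with optional left and right children."""
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

def in_order(node, visit):
    """In-order traversal: left subtree, then root, then right subtree."""
    if node is None:
        return
    in_order(node.left, visit)   # step 1: the entire left subtree
    visit(node.label)            # step 2: the root itself
    in_order(node.right, visit)  # step 3: the entire right subtree

# A tiny example tree:
#       B
#      / \
#     A   D
#        /
#       C
root = Node("B", Node("A"), Node("D", Node("C")))
out = []
in_order(root, out.append)
print(out)  # ['A', 'B', 'C', 'D']
```

Because this example happens to be a binary search tree, the in-order visit produces the labels in sorted order, which hints at the "method to the madness" the text mentions.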
So instead of visiting the root first (pre-order) or last (post-order), we treat it in between our left and right children. This might seem to be a strange thing to do, but there's a method to the madness which will become clear in the next section. For the sample tree, the first visited node is B. This is because it’s the first node encountered that doesn’t have a left subtree, which means step 1 doesn’t need to do anything. This is followed by O and I, for the same reason. We then visit K before its right subtree, which in turn visits G, M, E, C, A, H, F, D, L, N. (See Figure 5.20.) If your nodes are spaced out evenly, you can read the in-order traversal of the diagram by moving your eyes left to right. Be careful about this, though, because ultimately the spatial position doesn’t matter, but rather the relationships between nodes. For instance, if I had drawn node I further to the right, in order to make the lines between D–O–I less steep, that I node might have been pushed physically to the right of K. But that wouldn't change the order in which the nodes are visited; I would still come before K. Finally, it’s worth mentioning that all of these traversal methods make elegant use of recursion. Recursion is a way of taking a large problem and breaking it up into similar, but smaller, sub-problems. Then, each of those subproblems can be attacked in the same way as you attacked the larger problem: by breaking them up into subsubproblems. All you need is a rule for eventually stopping the “breaking up” process by actually doing something. Every time one of these traversal processes treats a left or right child as a subtree, they are “recursing” by re-initiating the whole traversal process on a smaller tree. Pre-order traversal, for instance,

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 138 Context:

# CHAPTER 5. STRUCTURES

missing her. Jennifer has effectively been lost.
[Figure 5.27 shows a tree on Mitch, Randi, Ben, Jim, Xander, Jennifer, and Molly in which Jennifer can no longer be found by a binary search.]

Figure 5.27: A wrong (non)-BST after removing Jessica incorrectly.

One correct way (there are others) to do a node removal is to replace the node with the left-most descendant of its right subtree. (Or, equivalently, the right-most descendant of its left subtree.) Let’s be careful to define this: to get the left-most descendant of a node’s right subtree, we (1) go to the right child of the node, and then (2) go as left as we possibly can from there, until we come to a node that has no left child. That node (the one without a left child) is officially the left-most descendant of the original node’s right subtree. Example: flip back to Figure 5.17 (p. 117). What is the left-most descendant of G’s right subtree? **Answer:** We start by going right from G down to H, and then we go as left as possible... which turns out to be only one node’s worth of “left,” because we hit A, and A has no left child (or right child, for that matter). Work these additional examples out for yourself: what is the left-most descendant of K’s right subtree? Of D’s? Of H’s? Okay, let’s return to Figure 5.26 (p. 129) and remove Jessica the correct way. We simply find the left-most descendant of her right subtree.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 147 Context:

# 5.4. EXERCISES

25. **Is it a binary search tree?** No. Although nearly every node does satisfy the BST property (all the nodes in its left subtree come before it alphabetically, and all the nodes in its right subtree come after it), there is a single exception: Zoe is in Wash's left subtree, whereas she should be to his right.

26. **How could we fix it?** Many ways; one would be to swap Zoe's and Wash's positions. If we do that, the fixed tree would be:

```
            Mal
           /   \
      Jayne     Zoe
      /   \     /
  Inara  Kaylee River
                  \
                 Simon
                     \
                    Wash
```

Take a moment and convince yourself that every node of this new tree does in fact satisfy the BST property.

27. **Is the tree balanced?** It's not too bad, but it does have one too many levels in it (it has a height of 4, whereas all its nodes would fit in a tree of height 3).

28. **How could we make it more balanced?** Many ways; one would be to rotate the River-Simon-Wash subtree so that Simon becomes Zoe's left child. Simon would then be the parent of River (on his left) and Wash (on his right).

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 158 Context:

# CHAPTER 6. COUNTING

and so on. Once we have the **RISK** permutations, we can generate the **BRISK** permutations in the same way:

| B | R | I | S | K |
|---|---|---|---|---|
| R | B | I | S | K |
| R | I | B | S | K |
| R | I | S | B | K |
| R | I | S | K | B |
| B | I | R | S | K |
| I | B | R | S | K |
| I | R | B | S | K |
| I | R | S | B | K |
| I | R | S | K | B |
| . . . | | | | |

Another algorithm to achieve the same goal (though in a different order) is as follows:

## Algorithm #2 for enumerating permutations

1. Begin with a set of **n** objects.
   a. If **n = 1**, there is only one permutation; namely, the object itself.
   b. Otherwise, remove each of the objects in turn, and prepend that object to the permutations of all the others, creating another permutation each time.

I find this one a little easier to get my head around, but in the end it’s personal preference.
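Algorithm #2 maps almost line-for-line onto a recursive function. A sketch (illustrative, not the book's code):

```python
def permutations(objs):
    """Algorithm #2: if there is one object, it is its own (only)
    permutation; otherwise remove each object in turn and prepend it
    to every permutation of the remaining objects."""
    if len(objs) == 1:
        return [objs]
    perms = []
    for i, obj in enumerate(objs):
        rest = objs[:i] + objs[i + 1:]   # remove this object
        for p in permutations(rest):
            perms.append(obj + p)        # prepend it to each sub-permutation
    return perms

print(permutations("RISK")[:3])    # ['RISK', 'RIKS', 'RSIK']
print(len(permutations("BRISK")))  # 120, i.e. 5!
```

Note the order of the output matches the listing that follows: all permutations beginning with the first letter come first, then those beginning with the second, and so on.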
The permutations of **BRISK** are: “**B** followed by all the permutations of **RISK**, plus **R** followed by all the permutations of **BISK**, plus **I** followed by all the permutations of **BRSK**, etc.” So the permutations of a 4-letter word like **RISK** are:

- R I S K
- R I K S
- R S I K
- R S K I
- R K I S
- R K S I
- I R S K
- I R K S
- I S R K
- I S K R
- I K R S
- I K S R
- S R I K
- S R K I
- S I R K
- S I K R
- S K R I
- S K I R
- K R I S
- K R S I
- K I R S
- K I S R
- K S R I
- K S I R

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 167 Context:

# 6.4 Summary

Most of the time, counting problems all boil down to a variation of one of the following three basic situations:

- \( n^k \) — this is when we have \( k \) different things, each of which is free to take on one of \( n \) completely independent choices.
- \( \frac{n!}{(n-k)!} \) — this is when we’re taking a sequence of \( k \) different things from a set of \( n \), but no repeats are allowed. (A special case of this is \( n! \), when \( k = n \).)
- \( \binom{n}{k} \) — this is when we’re taking \( k \) different things from a set of \( n \), but the order doesn’t matter.

Sometimes it’s tricky to deduce exactly which of these three situations apply. You have to think carefully about the problem and ask yourself whether repeated values would be allowed and whether it matters what order the values appear in. This is often subtle. As an example, suppose my friend and I work out at the same gym. This gym has 18 different weight machines to choose from, each of which exercises a different muscle group. Each morning,
Inside a dusty chest marked “Halloween costumes” in the family attic, there are four different outfits (a wizard's cape, army fatigues, and two others), five different headgear (a Batman helmet, a headband, a tiara, etc.), and nine different accessories (a wand, a lightsaber, a pipe, and many others). If a child were to choose a costume by selecting one outfit, one headgear, and one accessory, how many costume choices would he/she have?

```
4 x 5 x 9 = 180
```

2. What if the child were permitted to skip one or more of the items (for instance, choosing a costume with an outfit and accessory, but no headgear)?

```
5 x 6 x 10 = 300
```

since now “no choice at all” is effectively another choice for each of the categories.³ Kind of amazing how much that increases the total!

³ Note, by the way, that this approach does not work for situations like the license plate example on p. 114. Namely, you can’t say “if a license plate can have fewer than 7 characters, we can just add ‘no character’ at this position” as one of the options for that position, and calculate 37⁷ = 94,931,877,133 possible plates. That number is too high. Why? (Hint: for some of those choices, you can get the same license plate in more than one way. Hint 2: if we chose ‘A’ for the first license plate character, ‘no character’ for the second, followed by ‘NTMAN’, we get the license plate ‘ANTMAN’. But what other choices could we have made, that would also have resulted in ‘ANTMAN’?)

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 172 Context:

Image Analysis: Since the attached image is completely white, there is no content to analyze under the specified aspects. To provide specific details for each aspect: 1.
**Localization and Attribution:** - **Image 1:** The attached image is the only image. 2. **Object Detection and Classification:** - No objects detected. 3. **Scene and Activity Analysis:** - No scene or activities present. 4. **Text Analysis:** - No text detected. 5. **Diagram and Chart Analysis:** - No diagrams or charts detected. 6. **Product Analysis:** - No products depicted. 7. **Anomaly Detection:** - No anomalies or unusual elements detected. 8. **Color Analysis:** - The entire image is white. 9. **Perspective and Composition:** - No specific perspective or composition due to the image being plain white. 10. **Contextual Significance:** - No contextual elements are present to analyze. 11. **Metadata Analysis:** - No metadata available to review. 12. **Graph and Trend Analysis:** - No graphs present. 13. **Graph Numbers:** - No data points available. - **Process Flows:** - No process flows depicted. - **Process Descriptions:** - No processes described. - **Type Designations:** - No types or categories specified. - **Trend and Interpretation:** - No trends to interpret. - **Tables:** - No tables present.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 175 Context:

# 7.1. WHAT IS A “NUMBER?”

When you think of a number, I want you to try to erase the sequence of digits from your mind. Think of a number as what it is: a **quantity**. Here's what the number seventeen really looks like:

```
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ●
```

It’s just an **amount**. There are more circles in that picture than in some pictures, and fewer than in others. But in no way is it “two digits,” nor do the particular digits “1” and “7” come into play any more or less than any other digits. Let’s keep thinking about this.
Consider this number, which I’ll label “A”: (A) Now let’s add another circle to it, creating a different number I’ll call “B”: (B) And finally, we’ll do it one more time to get “C”: (C) (Look carefully at those images and convince yourself that I added one circle each time.) When going from A to B, I added one circle. When going from B to C, I also added one circle. Now I ask you: was going from B to C any more “significant” than going from A to B? Did anything qualitatively different happen? The answer is obviously no. Adding a circle is adding a circle; there’s nothing more to it than that. But if you had been writing

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 198 Context:

# CHAPTER 7. NUMBERS

which fit in our -128 to 127 range, but whose sum will not:

| carry-in | 1111111 |
|----------|----------|
| | 01100111 ← 103₁₀ |
| + | 01011111 ← 95₁₀ |
| | 11000110 |

The carry-in to the last bit was 1, but the carry-out was 0, so for two's-complement this means we detected overflow. It's a good thing, too, since 11000110 in two's-complement represents -58₁₀, which is certainly not 103 + 95. Essentially, if the carry-in is not equal to the carry-out, that means we added two positive numbers and came up with a negative number, or that we added two negatives and got a positive. Clearly, this is an erroneous result, and the simple comparison tells us that. Just be careful to realize that the rule for detecting overflow depends totally on the particular representation scheme we're using. A carry-out of 1 always means overflow... in the unsigned scheme. For two's-complement, we can easily get a carry-out of 1 with no error at all, provided the carry-in is also 1.

## “It’s all relative”

Finally, if we come up for air out of all this mass of details, it’s worth emphasizing that there is no intrinsically “right” way to interpret a binary number.
If I show you a bit pattern — say, 11010000 — and ask you what value it represents, you can’t tell me without knowing how to interpret it. If I say, “Oh, that’s a sign-magnitude number,” you’d first look at the leftmost bit, see that it’s a 1, and realize you have a negative number. Then you’d take the remaining seven bits, 1010000, and treat them as digits in a simple base 2 numbering scheme. You’d add 2⁶ + 2⁴ to get 80, and then respond, “ah, then that’s the number -80₁₀.” And you’d be right.

#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 202 Context:

# CHAPTER 7. NUMBERS

12. Are the numbers 617,418 and 617,424 congruent mod 3? Yes. The number 617,418 is exactly 6 less than 617,424. Let's say there are k stones left over after removing groups of three from 617,418. (k must be 0, 1, or 2, of course.) Now if we did the same remove-groups-of-three thing but starting with 617,424 instead, we'll have two more groups of three than we did before, but then also have exactly k stones left over.

13. Are the numbers 617,418 and 617,424 congruent mod 2? Yes. The number 617,418 is exactly 6 less than 617,424. If there are k stones left over after removing pairs of stones from 617,418, we'd get three additional pairs if we had instead started with 617,424, but then also have exactly k stones left over.

14. Are the numbers 617,418 and 617,424 congruent mod 5? No. Five doesn’t go evenly into six.

15. Are the numbers 617,418 and 617,424 congruent mod 6? Yes. The number 617,418 is exactly 6 less than 617,424. If there are k stones left over after removing groups of six from 617,418, we’d get one additional group if we had instead started with 617,424, and then have exactly k stones left over.

16. What’s 11₁₆ + 2₁₆? 13₁₆.

17. What’s 11₁₆ + 9₁₆? 1A₁₆.

18. What’s 1₁₆ + 9₁₆? A₁₆.

19. What’s 1₁₆ + F₁₆? 10₁₆. (This is the first time we've had to "carry".)

20. What’s 11₁₆ + 22₁₆? 33₁₆.

21. What’s 11₁₆ + 99₁₆?
AA₁₆. 22. What’s 11₁₆ + EF₁₆? 100₁₆. (As in exercise 19, we must carry.) #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 212 Context: # CHAPTER 8. LOGIC | X | Y | X∨Y | X∧Y | ¬X | X⇔Y | X⊕Y | |---|---|-----|-----|----|-----|-----| | 0 | 0 | 0 | 0 | 1 | 1 | 0 | | 0 | 1 | 1 | 0 | 1 | 0 | 1 | | 1 | 0 | 1 | 0 | 0 | 0 | 1 | | 1 | 1 | 1 | 1 | 0 | 1 | 0 | Take a moment and look carefully through the entries in that table, and make sure you agree that this correctly represents the outputs for the five operators. (Note that ¬, being a unary operator, only has X as an input, which means that the value of Y is effectively ignored for that column.) Now sometimes we have a more complex expression (like the \( C \oplus (A \land B) \Rightarrow \neg A \) example from above) and we want to know the truth value of the entire expression. Under what circumstances — i.e., for what truth values of A, B, and C — is that expression true? We can use truth tables to calculate this piece by piece. Let’s work through that example in its entirety. First, we set up the inputs for our truth table: | A | B | C | |---|---|---| | 0 | 0 | 0 | | 0 | 0 | 1 | | 0 | 1 | 0 | | 0 | 1 | 1 | | 1 | 0 | 0 | | 1 | 0 | 1 | | 1 | 1 | 0 | | 1 | 1 | 1 | In this case, there are three inputs to the expression (A, B, and C) and so we have \( 2^3 \), or eight, rows in the truth table. Now we work our way through the expression inside out, writing down the values of intermediate parts of the expression. We need to know the value of ¬B to figure some other things out, so let’s start with that one: #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 221 Context: # 8.2. PREDICATE LOGIC \[ \forall x \, \text{HasGovernor}(x). \] This asserts that for every \( x \), \( \text{HasGovernor}(x) \) is true.
Actually, this isn’t quite right, for although Michigan and California have governors, mayonnaise does not. To be precise, we should say: \[ \forall x \in S \, \text{HasGovernor}(x), \] where \( S \) is the set of all fifty states in the U.S. We can use a quantifier for any complex expression, not just a simple predicate. For instance, if \( H \) is the set of all humans, then: \[ \forall h \in H \, \text{Adult}(h) \oplus \text{Child}(h) \] states that every human is either an adult or a child, but not both. (Imagine drawing an arbitrary line at a person’s 18th birthday.) Another (more common) way to write this is to dispense with sets and define another predicate \( \text{Human} \). Then we can say: \[ \forall h \, \text{Human}(h) \Rightarrow \text{Adult}(h) \oplus \text{Child}(h). \] Think this through carefully. We’re now asserting that this expression is true for all objects, whether they be Duchess Kate Middleton, little Prince Louis, or a bowl of oatmeal. To see that it’s true for all three, let’s first let \( h \) be equal to Kate Middleton. We substitute Kate for \( h \) and get: \[ \text{Human}(\text{Kate}) \Rightarrow \text{Adult}(\text{Kate}) \oplus \text{Child}(\text{Kate}) \] \[ \text{true} \Rightarrow \text{true} \oplus \text{false} \] \[ \text{true} \Rightarrow \text{true} \] \[ \text{true} \checkmark \] Remember that “implies” (\( \Rightarrow \)) is true as long as the premise (left-hand side) is false and/or the conclusion (right-hand side) is true. In this case, they’re both true, so we have a true result.
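The substitution just carried out by hand can be mechanized. A small sketch, not from the book: the universe and its truth values are hand-coded from the text's three running examples, and `implies`/`xor` mirror the ⇒ and ⊕ truth tables.

```python
def implies(p, q):
    """Material implication: false only when the premise is true and the conclusion false."""
    return (not p) or q

def xor(p, q):
    """Exclusive or: true when exactly one operand is true."""
    return p != q

# Truth values for the three objects discussed in the text.
universe = {
    "Kate":    {"human": True,  "adult": True,  "child": False},
    "Louis":   {"human": True,  "adult": False, "child": True},
    "oatmeal": {"human": False, "adult": False, "child": False},
}

# The quantified claim: for all h, Human(h) => (Adult(h) xor Child(h)).
holds_for_all = all(
    implies(o["human"], xor(o["adult"], o["child"]))
    for o in universe.values()
)
```

The oatmeal case goes through for the same reason as in the prose: a false premise makes the implication true regardless of the conclusion.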
Something similar happens for Prince Louis: \[ \text{Human}(\text{Louis}) \Rightarrow \text{Adult}(\text{Louis}) \oplus \text{Child}(\text{Louis}) \] \[ \text{true} \Rightarrow \text{false} \oplus \text{true} \] \[ \text{true} \Rightarrow \text{true} \] \[ \text{true} \checkmark \] #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 226 Context: # Chapter 8: Logic Here's a funny one I’ll end with. Consider the sentence **“He made her duck.”** What is intended here? Did some guy reach out with his hand and forcefully push a woman’s head down out of the way of a screaming projectile? Or did he prepare a succulent dish of roasted fowl to celebrate her birthday? Oh, if the computer could only know. If we’d used predicate logic instead of English, it could! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 227 Context: # 8.3 Exercises Let \( B \) be the proposition that Joe Biden was elected president in 2020, \( C \) be the proposition that Covid-19 was completely and permanently eradicated from the earth in 2021, and \( R \) be the proposition that *Roe v. Wade* was overturned in 2022. 1. What’s \( B \land C \)? **False.** 2. What’s \( B \land \neg C \)? **True.** 3. What’s \( B \land R \)? **True.** 4. What’s \( B \land \neg R \)? **False.** 5. What’s \( \neg C \lor R \)? **True.** 6. What’s \( \neg(C \lor R)? \) **False.** 7. What’s \( \neg(C \land R) \)? **True.** 8. What’s \( C \land \neg B \)? **False.** 9. What’s \( \neg C \land B \)? **True.** 10. What’s \( \neg(C \to B) \)? **False.** 11. What’s \( \neg\neg\neg B \)? **False.** 12. What’s \( \neg B \)? **False.** 13. What’s \( \neg\neg\neg C \)? **True.** 14. What’s \( B \lor C \lor R \)? **True.** 15. What’s \( B \land C \land \neg R? \) **False.** 16. What’s \( B \to R \)? **True.** (Even though there is plainly no causality there.) 17. What’s \( R \to B?
\) **True.** (Ditto.) 18. What’s \( B \to C? \) **False.** (The premise is true, so the conclusion must also be true for this sentence to be true.) 19. What’s \( C \to B? \) **True.** (The premise is false, so all bets are off and the sentence is true.) 20. What’s \( C \to R? \) **True.** (The premise is false, so all bets are off and the sentence is true.) #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 230 Context: # CHAPTER 8. LOGIC | | Statement | |---|-------------------------------------------------------------------------------------------------------| | 35| True or false: ∀x Professor(x) ⇒ Human(x).
**True!** This is what we were trying to say all along. Every professor is a person. | | 36| True or false: ¬∃x (Professor(x) ∧ ¬Human(x)).
**True!** This is an equivalent statement to item 35. There's nothing in the universe that is a professor yet not a human. (At least, at the time of this writing.) | #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 233 Context: # 9.1. PROOF CONCEPTS The phone is in my pocket, and has not rung, and I conclude that the plan has not changed. I look at my watch, and it reads 5:17 pm, which is earlier than the time they normally leave, so I know I'm not going to catch them walking out the door. This is, roughly speaking, my thought process that justifies the conclusion that the house will be empty when I pull into the garage. Notice, however, that this prediction depends precariously on several facts. What if I spaced out the day of the week, and this is actually Thursday? All bets are off. What if my cell phone battery has run out of charge? Then perhaps they did try to call me but couldn’t reach me. What if I set my watch wrong and it’s actually 4:17 pm? Etc. Just like a chain is only as strong as its weakest link, a whole proof falls apart if even one step isn’t reliable. Knowledge bases in artificial intelligence systems are designed to support these chains of reasoning. They contain statements expressed in formal logic that can be examined to deduce only the new facts that logically follow from the old. Suppose, for instance, that we had a knowledge base that currently contained the following facts: 1. \( A \to C \) 2. \( \neg (C \land D) \) 3. \( (F \lor E) \to D \) 4. \( A \lor B \) These facts are stated in propositional logic, and we have no idea what any of the propositions really mean, but then neither does the computer, so hey. Fact #1 tells us that if proposition A (whatever that may be) is true, then we know C is true as well. Fact #2 tells us that we know \( C \land D \) is false, which means at least one of the two must be false. And so on.
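One way to see which new facts follow from the four expressions above once an extra fact like ¬B is learned is brute-force model enumeration; a sketch for illustration (not how production inference engines work, which use resolution-style procedures instead):

```python
from itertools import product

VARS = "ABCDEF"

def is_model(m):
    """True when the assignment m satisfies the knowledge base plus the new fact."""
    return ((not m["A"] or m["C"])                  # 1. A => C
            and not (m["C"] and m["D"])             # 2. not (C and D)
            and (not (m["F"] or m["E"]) or m["D"])  # 3. (F or E) => D
            and (m["A"] or m["B"])                  # 4. A or B
            and not m["B"])                         # newly learned: not B

models = [dict(zip(VARS, vals))
          for vals in product([False, True], repeat=len(VARS))
          if is_model(dict(zip(VARS, vals)))]

# A proposition's value is entailed when it is the same in every model.
entailed = {v: models[0][v] for v in VARS
            if all(m[v] == models[0][v] for m in models)}
```

With these five constraints only one assignment survives, so the values of all six propositions become deducible: A and C must be true, and B, D, E, and F false.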
Large knowledge bases can contain thousands or even millions of such expressions. It’s a complete record of everything the system “knows.” Now suppose we learn an additional fact: \( \neg B \). In other words, the system interacts with its environment and comes to the conclusion that \( B \) is false. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 236 Context: # CHAPTER 9. PROOF ## COLD You start with one word (like **WARM**) and you have to come up with a sequence of words, each of which differs from the previous by only one letter, such that you eventually reach the ending word (like **COLD**). It's sort of like feeling around in the dark: - **WARM** - **WART** - **WALT** - **WILT** - **WILD** - ... This attempt seemed promising at first, but now it looks like it's going nowhere. ("**WOLD?**" "**CILD?**" Hmm...) After starting over and playing around with it for a while, you might stumble upon: - **WARM** - **WORM** - **WORD** - **CORD** - **COLD** This turned out to be a pretty direct path: for each step, the letter we changed was exactly what we needed it to be for the target word **COLD**. Sometimes, though, you have to meander away from the target a little bit to find a solution, like going from **BLACK** to **WHITE**: - **BLACK** - **CLACK** - **CRACK** - **TRACK** - **TRICK** - **TRICE** #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 244 Context: # Example 2 A famous story tells of Carl Friedrich Gauss, perhaps the most brilliant mathematician of all time, getting in trouble one day as a schoolboy. As punishment, he was sentenced to tedious work: adding together all the numbers from 1 to 100. To his teacher's astonishment, he came up with the correct answer in a moment, not because he was quick at adding integers, but because he recognized a trick. The first number on the list (1) and the last (100) add up to 101.
So do 2 and 99, and 3 and 98, and so on, all the way up to 50 and 51. So really what you have here is 50 different sums of 101 each, so the answer is \( 50 \times 101 = 5050 \). In general, if you add the numbers from 1 to \( x \), where \( x \) is any integer at all, you'll get \( \frac{x(x + 1)}{2} \). Now, use mathematical induction to prove that Gauss was right (i.e., that \( \sum_{i=1}^{x} i = \frac{x(x + 1)}{2} \)) for all numbers \( x \). First, we have to cast our problem as a predicate about natural numbers. This is easy: say “let \( P(n) \) be the proposition that \[ \sum_{i=1}^{n} i = \frac{n(n + 1)}{2} \] Then, we satisfy the requirements of induction: 1. **Base Case**: We prove that \( P(1) \) is true simply by plugging it in. Setting \( n = 1 \), we have: \[ \sum_{i=1}^{1} i = \frac{1(1 + 1)}{2} \] \[ 1 = \frac{1(2)}{2} \] \[ 1 = 1 \quad \text{✓} \] 2. **Inductive Step**: We now must prove that \( P(k) \) implies \( P(k + 1) \). Put another way, we assume \( P(k) \) is true, and then use that assumption to prove that \( P(k + 1) \) is also true. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 251 Context: # 9.3. PROOF BY INDUCTION ## 2. Inductive Step We now must prove that \((\forall i \leq k) P(i) \Rightarrow P(k + 1)\). Put another way, we assume that all trees smaller than the one we’re looking at have one more node than edge, and then use that assumption to prove that the tree we’re looking at also has one more node than edge. We proceed as follows. Take any free tree with \(k + 1\) nodes. Removing any edge gives you two free trees, each with \(k\) nodes or less. (Why? Well, if you remove any edge from a free tree, the nodes will no longer be connected, since a free tree is “minimally connected” as it is.
And we can't break it into more than two trees by removing a single edge, since the edge connects exactly two nodes, and the nodes on each side of the removed edge are still connected to each other.) Now the sum of the nodes in these two smaller trees is still \(k + 1\). (This is because we haven't removed any nodes from the original free tree — we've simply removed an edge.) If we let \(k_1\) be the number of nodes in the first tree, and \(k_2\) the number of nodes in the second, we have \(k_1 + k_2 = k + 1\). Okay, but how many edges does the first tree have? Answer: \(k_1 - 1\). How do we know that? By the inductive hypothesis. We’re assuming that any tree smaller than \(k + 1\) nodes has one less edge than node, and so we’re taking advantage of that (legal) assumption here. Similarly, the second tree has \(k_2 - 1\) edges. Bingo. Removing one edge from our original tree of \(k + 1\) nodes gave us a total of \(k - 1\) edges. Therefore, that original tree must have had \(k\) edges. We have now proven that a tree of \(k + 1\) nodes has \(k\) edges, assuming that all smaller trees also have one less edge than node. ## 3. Conclusion Therefore, by the strong form of mathematical induction, \(\forall n \geq 1, P(n)\). #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 252 Context: # 9.4 Final Word Finding proofs is an art. In some ways, it’s like programming: you have a set of building blocks, each one defined very precisely, and your goal is to figure out how to assemble those blocks into a structure that starts with only axioms and ends with your conclusion. It takes skill, patience, practice, and sometimes a little bit of luck. Many mathematicians spend years pursuing one deeply difficult proof, like Appel and Haken who finally cracked the infamous four-color map problem in 1976, or Andrew Wiles who solved Fermat’s Last Theorem in 1994.
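As a quick numerical spot-check of the Gauss example proved by induction above (a sketch, not part of the text):

```python
def gauss_sum(n):
    """Closed form 1 + 2 + ... + n = n(n + 1)/2 from the induction example."""
    return n * (n + 1) // 2

# Gauss's pairing trick: 1..100 gives 50 pairs, each summing to 101.
total = gauss_sum(100)
```

The closed form agrees with the brute-force sum for every `n` tried, which is of course no substitute for the inductive proof, only a sanity check of it.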
Some famous mathematical properties may never have proofs, such as Christian Goldbach’s 1742 conjecture that every even integer is the sum of two primes; or the most elusive and important question in computing theory: does P=NP? (Put very simply: if you consider the class of problems where it’s easy to verify a solution once you have it, but crazy hard to find it in the first place, is there actually an easy algorithm for finding the solution that we just haven’t figured out yet?) Most computer scientists think “no,” but despite a mind-boggling number of hours invested by the brightest minds in the world, no one has ever been able to prove it one way or the other. Most practicing computer scientists spend time taking advantage of the known results about mathematical objects and structures, and rarely (if ever) have to construct a watertight proof about them. For the more theoretically-minded student, however, who enjoys probing the basis behind the tools and speculating about additional properties that might exist, devising proofs is an essential skill that can also be very rewarding. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 257 Context: 9.4. 
FINAL WORD ================= - ordered triples, 15 - org charts, 113 - outcomes, 60, 62 - overflow, 188 P = NP?, 244 parent (of a node), 114 partial orders, 43 partial permutations, 151, 154 partitions, 26, 71, 94 Pascal's triangle, 157 passwords, 146 paths (in a graph), 88, 113 perfect binary tree, 122, 239 permutations, 147 PINs, 143 poker, 160 pop (off a stack), 99 posets, 43 post-order traversal, 118 postulates, 226 power sets, 24, 36 pre-order traversal, 117 predicates, 210, 211, 232 predicate logic, 210 premise (of implication), 200 Prim's algorithm, 107 Prim, Robert, 107 prior probability, 68 probability measures, 61, 63, 65 product operator (Π), 142, 160 proof, 223 proof by contradiction, 229 propositional logic, 197, 225 propositions, 197, 210 - psychology, 70, 86 - push (on a stack), 99 - quantifiers, 212, 215 - queue, 95, 97 - quotient, 173, 174 - range (of a function), 48 - rational numbers (ℚ), 17, 24 - reachable, 89 - real numbers (ℝ), 71, 94 - rebalancing (a tree), 132 - recursion, 116, 120, 149, 231 - red-black trees, 133 - reflexive (relation), 40, 43 - relations, finite, 39 - relations, infinite, 39 - remainder, 173, 174 - right child, 116 - root (of a tree), 112, 114 - rooted trees, 112, 134 - Russell's paradox, 15 - sample space (Ω), 60 - semantic network, 87 - set operators, 18 - set-builder notation, 11 - sets, 8, 93 - sets of sets, 15 - sets, finite, 12 - sets, fuzzy, 10 - sets, infinite, 12 - sibling (of a node), 115 - sign-magnitude binary numbers, 183, 189 **Sonic the Hedgehog**, 73 - southern states, 72 - spatial positioning, 92, 113 #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 1 Context: # A Brief Introduction to Machine Learning for Engineers (2018), “A Brief Introduction to Machine Learning for Engineers”, Vol. XX, No. XX, pp 1–231. DOI: XXX.
Osvaldo Simeone Department of Informatics King’s College London osvaldo.simeone@kcl.ac.uk #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 4 Context: ``` 6.6 Discriminative Models........................................159 6.7 Autoencoders...............................................163 6.8 Ranking*...................................................164 6.9 Summary....................................................164 IV Advanced Modelling and Inference..............................165 7 Probabilistic Graphical Models................................166 7.1 Introduction...............................................167 7.2 Bayesian Networks..........................................170 7.3 Markov Random Fields.......................................178 7.4 Bayesian Inference in Probabilistic Graphical Models......182 7.5 Summary....................................................185 8 Approximate Inference and Learning............................186 8.1 Monte Carlo Methods........................................187 8.2 Variational Inference.......................................189 8.3 Monte Carlo-Based Variational Learning*....................197 8.4 Approximate Learning*......................................199 8.5 Summary....................................................201 V Conclusions...................................................202 9 Concluding Remarks............................................203 Appendices.......................................................206 A Appendix A: Information Measures..............................207 A.1 Entropy....................................................207 A.2 Conditional Entropy and Mutual Information................210 A.3 Divergence Measures.......................................212 B Appendix B: KL Divergence and Exponential Family..........215 Acknowledgements...............................................217 ``` 
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 5 Context: References =========== 218 #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 9 Context: # Acronyms - **AI:** Artificial Intelligence - **AMP:** Approximate Message Passing - **BN:** Bayesian Network - **DAG:** Directed Acyclic Graph - **ELBO:** Evidence Lower Bound - **EM:** Expectation Maximization - **ERM:** Empirical Risk Minimization - **GAN:** Generative Adversarial Network - **GLM:** Generalized Linear Model - **HMM:** Hidden Markov Model - **i.i.d.:** independent identically distributed - **KL:** Kullback-Leibler - **LASSO:** Least Absolute Shrinkage and Selection Operator - **LBP:** Loopy Belief Propagation - **LL:** Log-Likelihood - **LLR:** Log-Likelihood Ratio - **LS:** Least Squares - **MC:** Monte Carlo - **MCMC:** Markov Chain Monte Carlo - **MDL:** Minimum Description Length - **MFVI:** Mean Field Variational Inference - **ML:** Maximum Likelihood - **MRF:** Markov Random Field - **NLL:** Negative Log-Likelihood - **PAC:** Probably Approximately Correct - **pdf:** probability density function #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 11 Context: # Part I ## Basics #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 13 Context: # 1.1. What is Machine Learning? This starts with an in-depth analysis of the problem domain, which culminates with the definition of a mathematical model. The mathematical model is meant to capture the key features of the problem under study and is typically the result of the work of a number of experts. The mathematical model is finally leveraged to derive hand-crafted solutions to the problem.
For instance, consider the problem of defining a chemical process to produce a given molecule. The conventional flow requires chemists to leverage their knowledge of models that predict the outcome of individual chemical reactions, in order to craft a sequence of suitable steps that synthesize the desired molecule. Another example is the design of speech translation or image/video compression algorithms. Both of these tasks involve the definition of models and algorithms by teams of experts, such as linguists, psychologists, and signal processing practitioners, not infrequently during the course of long standardization meetings. The engineering design flow outlined above may be too costly and inefficient for problems in which faster or less expensive solutions are desirable. The machine learning alternative is to collect large data sets, e.g., of labeled speech, images, or videos, and to use this information to train general-purpose learning machines to carry out the desired task. While the standard engineering flow relies on domain knowledge and on design optimized for the problem at hand, machine learning lets large amounts of data dictate algorithms and solutions. To this end, rather than requiring a precise model of the set-up under study, machine learning requires the specification of an objective, of a model to be trained, and of an optimization technique. Returning to the first example above, a machine learning approach would proceed by training a general-purpose machine to predict the outcome of known chemical reactions based on a large data set, and then by using the trained algorithm to explore ways to produce more complex molecules. In a similar manner, large data sets of images or videos would be used to train a general-purpose algorithm with the aim of obtaining compressed representations from which the original input can be recovered with some distortion. 
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 17 Context: # 1.3 Goals and Outline This monograph considers only passive and offline learning. ## 1.3 Goals and Outline This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. Unlike other texts that are focused on one particular aspect of the field, an effort has been made here to provide a broad but concise overview in which the main ideas and techniques are systematically presented. Specifically, the material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra. A prior exposure to information theory is useful but not required. Detailed discussions are provided on basic concepts and ideas, including overfitting and generalization, Maximum Likelihood and regularization, and Bayesian inference. The text also endeavors to provide intuitive explanations and pointers to advanced topics and research directions. Sections and subsections containing more advanced material that may be skipped at a first reading are marked with a star (*). The reader will not find here discussions of computing platforms or programming frameworks, such as map-reduce, nor details on specific applications involving large data sets.
These can be easily found in a vast and growing body of work. Furthermore, rather than providing exhaustive details on the existing myriad solutions in each specific category, techniques have been selected that are useful to illustrate the most salient aspects. Historical notes have also been provided only for a few selected milestone events. Finally, the monograph attempts to strike a balance between the algorithmic and theoretical viewpoints. In particular, all learning algorithms are presented in a manner that emphasizes their theoretical foundations while also providing practical insights. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 22 Context: ```markdown variables \( t_n \) are assumed to be dependent on \( x_n \), and are referred to as dependent variables, labels, or responses. An example is illustrated in Fig. 2.1. We use the notation \( \mathbf{x}_D = (x_1, \ldots, x_N)^T \) for the covariates and \( \mathbf{t}_D = (t_1, \ldots, t_N)^T \) for the labels in the training set \( D \). Based on this data, the goal of supervised learning is to identify an algorithm to predict the label \( t \) for a new, that is, as of yet unobserved, domain point \( x \). Figure 2.1: Example of a training set \( D \) with \( N = 10 \) points \( (x_n, t_n) \), \( n = 1, \ldots, N \). The outlined learning task is clearly impossible in the absence of additional information on the mechanism relating variables \( x \) and \( t \). With reference to Fig. 2.1, unless we assume, say, that \( x \) and \( t \) are related by a function \( t = f(x) \) with some properties, such as smoothness, we have no way of predicting the label \( t \) for an unobserved domain point \( x \).
This observation is formalized by the no free lunch theorem to be reviewed in Chapter 5: one cannot learn rules that generalize to unseen examples without making assumptions about the mechanism generating the data. The set of all assumptions made by the learning algorithm is known as the inductive bias. This discussion points to a key difference between memorizing and learning. While the former amounts to mere retrieval of a value \( t_n \), ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 23 Context: 2.2. Inference ==================== Corresponding to an already observed pair \((x_n, t_n) \in \mathcal{D}\), learning entails the capability to predict the value \(t\) for an unseen domain point \(x\). Learning, in other words, converts experience – in the form of \(\mathcal{D}\) – into expertise or knowledge – in the form of a predictive algorithm. This is well captured by the following quote by Jorge Luis Borges: “To think is to forget details, generalize, make abstractions.” [138] By and large, the goal of supervised learning is that of identifying a predictive algorithm that minimizes the generalization loss, that is, the error in the prediction of a new label \(t\) for an unobserved explanatory variable \(x\). How exactly to formulate this problem, however, depends on one’s viewpoint on the nature of the model that is being learned. This leads to the distinction between the frequentist and the Bayesian approaches, which is central to this chapter. As will be discussed, the MDL philosophy deviates from the mentioned focus on prediction as the goal of learning, by targeting instead a parsimonious description of the data set \(\mathcal{D}\). ### 2.2 Inference Before we start our discussion of learning, it is useful to review some basic concepts concerning statistical inference, as they will be needed throughout this chapter and in the rest of this monograph.
We specifically consider the inference problem of predicting a variable \(t\) given the observation of another variable \(x\) under the assumption that their joint distribution \(p(x, t)\) is known. As a matter of terminology, it is noted that here we will use the term “inference” as it is typically intended in the literature on probabilistic graphical models (see, e.g., [81]), thereby diverging from its use in other branches of the machine learning literature (see, e.g., [23]). In order to define the problem of optimal inference, one needs to define a non-negative loss function \(\ell(t, \hat{t})\). This defines the cost, or loss or risk, incurred when the correct value is \(t\) while the estimate is \(\hat{t}\). An important example is the \(\ell_q\) loss: \[ \ell_q(t, \hat{t}) = |t - \hat{t}|^q, \quad (2.1) \] which includes as a special case the quadratic loss \(\ell_2(t, \hat{t}) = (t - \hat{t})^2\). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 36 Context: L(w^*) is the best generalization loss for the given model. Recall that the loss L(w^*) can be achieved when N is large enough. Finally, the term (L(w_M) - L(w^*)) is the estimation error or generalization gap[^1] that is incurred due to the fact that N is not large enough and hence we have w_M ≠ w^*. From the decomposition (2.22), a large N allows us to reduce the estimation error, but it has no effect on the bias. This is seen in Fig. 2.4, where the asymptote reached by the generalization loss L(w_M) as N increases equals the minimum generalization loss L(w^*) for the given model. Choosing a small value of M in the regime of large data points provides a floor on the achievable generalization loss that no amount of additional data can overcome. ## Validation and Testing In the discussion above, it was assumed that the generalization loss L(w) can somehow be evaluated.
Since this depends on the true unknown distribution p(x, t), this evaluation is, strictly speaking, not possible. How then to estimate the generalization loss in order to enable model order selection using a plot as in Fig. 2.3? The standard solution is to use validation. The most basic form of validation prescribes the division of the available data into two sets: a hold-out, or validation, set and the training set. The validation set is used to evaluate an approximation of the generalization loss L(w) via the empirical average: L(w) = \frac{1}{N_v} \sum_{n=1}^{N_v} \ell(t_n, \hat{t}(x_n, w)) \quad (2.23) where the sum is done over the N_v elements of the validation set. The just described hold-out approach to validation has a clear drawback, as part of the available data needs to be set aside and not used for training. This means that the number of data points that can be used for training is smaller than the number of overall available data points. To partially alleviate this problem, a more sophisticated, and commonly used, approach to validation is k-fold cross-validation. With this method, the available data points are partitioned, typically at random, into k equally sized subsets. The generalization loss is then estimated. [^1]: This is also defined as generalisation error in some references. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 50 Context: ``` # A Gentle Introduction through Linear Regression Produces a description of approximately $$ - \sum_{n=1}^{N} \log{p(t_n | x_n, w_{ML}, \beta_{ML})} $$ bits. This description is, however, not sufficient, since the decoder of the description should also be informed about the parameters $(w_{ML}, \beta_{ML})$. Using quantization, the parameters can be described by means of a number $C(M)$ of bits that is proportional to the number of parameters, here $M + 2$.
Concatenating these bits with the description produced by the ML model yields the overall description length $$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} + C(M). \quad (2.39) $$ MDL – in the simplified form discussed here – selects the model order $M$ that minimizes the description length (2.39). Accordingly, the term $C(M)$ acts as a regularizer. The optimal value of $M$ for the MDL criterion is hence the result of the trade-off between the minimization of the overhead $C(M)$, which calls for a small value of $M$, and the minimization of the NLL, which decreases with $M$. Under some technical assumptions, the overhead term can often be evaluated in the form $$ \frac{K}{2} \log N + c, $$ where $K$ is the number of parameters in the model and $c$ is a constant. This expression is not directly useful in practice, but it provides intuition about the mechanism used by MDL to combat overfitting. ## 2.6 Information-Theoretic Metrics We now provide a brief introduction to information-theoretic metrics by leveraging the example studied in this chapter. As we will see in the following chapters, information-theoretic metrics are used extensively in the definition of learning algorithms. Appendix A provides a detailed introduction to information-theoretic measures in terms of inferential tasks. Here we introduce the key metrics of Kullback-Leibler (KL) divergence and entropy by examining the asymptotic behavior of ML in the regime of large $N$. The case with finite $N$ is covered in Chapter 6 (see Sec. 6.4.3). ```
This has in fact grown into a separate field within the active research area of deep neural networks (see Chapter 4) [102]. Here, we describe a typical pitfall of interpretation that relates to the assessment of causality relationships between the variables in the model. We follow an example in [113]. Fig. 2.10 shows a possible distribution of data points on the plane defined by coordinates \( x = \text{exercise} \) and \( t = \text{cholesterol} \) (the numerical values are arbitrary). Learning a model that relates \( x \) to \( t \) would clearly identify an upward trend—an individual that exercises more can be predicted to have a higher cholesterol level. This prediction is legitimate and supported by the available data, but can we also conclude that exercising less would reduce one’s cholesterol? In other words, can we conclude that there exists a causal relationship between \( x \) and \( t \)? We know the answer to be no, but this cannot be ascertained from the data in the figure.
Finally, the MDL method aims at selecting a model that allows the data to be described with the smallest number of bits, hence doing away with the need to define the task of generalizing over unobserved examples. The chapter has also focused extensively on the key problem of overfitting, demonstrating how the performance of a learning algorithm can be understood in terms of bias and estimation error. In particular, while choosing a hypothesis class is essential in order to enable learning, choosing the “wrong” class constitutes an irrecoverable bias that can make learning impossible. As a real-world example, as reported in [109], including the ZIP code of an individual seeking credit at a bank as an independent variable may disadvantage certain applicants from minority groups. Another example of this phenomenon is the famous experiment by B. F. Skinner on pigeons [133]. We conclude this chapter by emphasizing an important fact about the probabilistic models that are used in modern machine learning applications. In frequentist methods, typically at least two (possibly conditional) distributions are involved: the empirical data distribution and the model distribution. The former amounts to the histogram of the data which, by the law of large numbers, tends to the real distribution.
Geometrically, convex sets hence cannot have “indentations.” Function \(f(x)\) is convex if its domain is a convex set and if it satisfies the inequality \(f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)\) for all \(x_1\) and \(x_2\) in its domain and for all \(0 \leq \lambda \leq 1\). Geometrically, this condition says that the function is “U”-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve. A function is strictly convex if the inequality above is strict except for \(\lambda = 0\) or \(\lambda = 1\); a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.” The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28]. We also introduce two useful concepts from calculus. The **gradient** of a differentiable function \(f(x)\) with \(x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D\) is defined as the \(D \times 1\) vector \(\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T\) containing all partial derivatives. At any point \(x\) in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian \(\nabla^2 f(x)\) is the \(D \times D\) matrix with \((i,j)\) element given by the second-order derivative \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). It captures the local curvature of the function. 1. A statistic is a function of the data. 
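As a numerical sketch of these definitions (an illustration added here, not taken from the text), the convexity inequality and the gradient can be checked for the convex function f(x) = ||x||^2:

```python
import numpy as np

def f(x):
    # f(x) = ||x||^2 is convex (in fact strictly convex)
    return float(np.dot(x, x))

def grad_f(x):
    # Gradient of ||x||^2 is 2x
    return 2 * x

# Check the convexity inequality
# f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) for 0 <= lam <= 1
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
for lam in np.linspace(0, 1, 11):
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs + 1e-12

# Check the analytical gradient against central finite differences
eps = 1e-6
num_grad = np.array([(f(x1 + eps * e) - f(x1 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
assert np.allclose(num_grad, grad_f(x1), atol=1e-5)
```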
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 66 Context: ``` where we have defined the cumulative sufficient statistics $$ u_k(x_{\mathcal{D}}) = \sum_{n=1}^N u_k(x_n), \quad k = 1, \ldots, K. $$ (3.25) and the vector $$ u(x_{\mathcal{D}}) = [u_1(x_{\mathcal{D}}) \cdots u_K(x_{\mathcal{D}})]^T. $$ A first important observation is that the LL function only depends on the $K$ statistics $u_k(x_{\mathcal{D}})$, $k = 1, \ldots, K$. Therefore, the vector $u(x_{\mathcal{D}})$ is a sufficient statistic for the estimate of $\eta$ given the observation $x_{\mathcal{D}}$. Importantly, vector $u(x_{\mathcal{D}})$ is of size $K$, and hence it does not grow with the size $N$ of the data set. In fact, the exponential family turns out to be unique in this respect: Informally, among all distributions whose support does not depend on the parameters, only distributions in the exponential family have sufficient statistics whose number does not grow with the number $N$ of observations (Koopman-Pitman-Darmois theorem) [7]. ### Gradient of the LL function A key result that proves very useful in deriving learning algorithms is the expression of the gradient of the LL (3.24) with respect to the natural parameters. To start, the partial derivative with respect to $\eta_k$ can be written as $$ \frac{\partial \ln p(x_{\mathcal{D}}|\eta)}{\partial \eta_k} = u_k(x_{\mathcal{D}}) - N\frac{\partial A(\eta)}{\partial \eta_k}. $$ (3.26) Using (3.9) and (3.10), this implies that we have $$ \frac{1}{N} \frac{\partial \ln p(x_{\mathcal{D}}|\eta)}{\partial \eta_k} = \frac{1}{N} u_k(x_{\mathcal{D}}) - \mu_k, $$ (3.27) and for the gradient $$ \frac{1}{N} \nabla_{\eta} \ln p(x_{\mathcal{D}}|\eta) = \frac{1}{N} u(x_{\mathcal{D}}) - \mu = \frac{1}{N} \sum_{n=1}^N u(x_n) - \mu. $$ (3.28) The gradient (3.28) is hence given by the difference between the empirical average $$ \frac{1}{N} \sum_{n=1}^N u(x_n) $$ of the sufficient statistics given the data $x_{\mathcal{D}}$ and the ensemble average $\mu$.
``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 77 Context: # 3.6 Maximum Entropy Property* In this most technical section, we review the maximum entropy property of the exponential family. Besides providing a compelling motivation for adopting models in this class, this property also illuminates the relationship between natural and mean parameters. The key result is the following: The distribution \(p(x|\eta)\) in (3.1) obtains the maximum entropy over all distributions \(p(x)\) that satisfy the constraints \(\mathbb{E}_{x \sim p(x)} [u_k(x)] = \mu_k\) for all \(k = 1, \ldots, K\). Recall that, as mentioned in Chapter 2 and discussed in more detail in Appendix A, the entropy is a measure of randomness of a random variable. Mathematically, the distribution \(p(x | \eta)\) solves the optimization problem \[ \max_{p} H(p) \quad \text{s.t.} \quad \mathbb{E}_{x \sim p(x)} [u_k(x)] = \mu_k \quad \text{for } k = 1, \ldots, K. \] Each natural parameter \(\eta_k\) turns out to be the optimal Lagrange multiplier associated with the \(k\)th constraint (see [45, Ch. 6-7]). To see the practical relevance of this result, suppose that the only information available about some data \(x\) is given by the means of given functions \(u_k(x)\), \(k = 1, \ldots, K\). The probabilistic model (3.1) can then be interpreted as encoding the least additional information about the data, in the sense that it is the "most random" distribution under the given constraints. This observation justifies the adoption of this model by the maximum entropy principle.
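As a minimal numerical sketch (using the Bernoulli distribution as a member of the exponential family; the specific example is ours, not the text's), the gradient identity (3.28), i.e., the empirical average of the sufficient statistic minus the mean parameter, can be checked against a finite-difference derivative:

```python
import numpy as np

# Bernoulli(p) in exponential-family form: sufficient statistic u(x) = x,
# natural parameter eta = log(p / (1 - p)), log-partition A(eta) =
# log(1 + exp(eta)), and mean parameter mu = dA/deta = sigmoid(eta).
def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def avg_ll_grad(eta, x):
    # (1/N) * gradient of the log-likelihood w.r.t. eta:
    # empirical average of u(x) minus the mean parameter mu, cf. (3.28)
    return x.mean() - sigmoid(eta)

rng = np.random.default_rng(1)
x = (rng.random(10000) < 0.3).astype(float)  # samples with p = 0.3
eta = 0.5

def avg_ll(eta):
    # (1/N) * log-likelihood: mean of eta * u(x) - A(eta)
    return float(np.mean(eta * x - np.log1p(np.exp(eta))))

eps = 1e-6
num = (avg_ll(eta + eps) - avg_ll(eta - eps)) / (2 * eps)
assert abs(num - avg_ll_grad(eta, x)) < 1e-6
```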
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 80 Context: # Probabilistic Models for Learning ## 3.9 Summary In this chapter, we have reviewed an important class of probabilistic models that are widely used as components in learning algorithms for both supervised and unsupervised learning tasks. Among the key properties of members of this class, known as the exponential family, are the simple form taken by the gradient of the log-likelihood (LL), as well as the availability of conjugate priors in the same family for Bayesian inference. An extensive list of distributions in the exponential family along with corresponding sufficient statistics, measure functions, log-partition functions, and mappings between natural and mean parameters can be found in [156]. More complex examples include the Restricted Boltzmann Machines (RBMs) to be discussed in Chapter 6 and Chapter 8. It is worth mentioning that there are also distributions not in the exponential family, such as the uniform distribution parameterized by its support. The chapter also covered the important idea of applying exponential models to supervised learning via Generalized Linear Models (GLMs). Energy-based models were finally discussed as an advanced topic. The next chapter will present various applications of models in the exponential family to classification problems. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 82 Context: # 4 ## Classification The previous chapters have covered important background material on learning and probabilistic models. In this chapter, we use the principles and ideas covered so far to study the supervised learning problem of classification. 
Classification is arguably the quintessential machine learning problem, with the most advanced state of the art and the most extensive application to problems as varied as email spam detection and medical diagnosis. Due to space limitations, this chapter cannot provide an exhaustive review of all existing techniques and latest developments, particularly in the active field of neural network research. For instance, we do not cover decision trees here (see, e.g., [155]). Rather, we will provide a principled taxonomy of approaches, and offer a few representative techniques for each category within a unified framework. We will specifically proceed by first introducing as preliminary material the Stochastic Gradient Descent optimization method. Then, we will discuss deterministic and probabilistic discriminative models, and finally we will cover probabilistic generative models. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 88 Context: Given a point \( x \), it is useful to measure the confidence level at which the classifier assigns \( x \) to the class identified through rule (4.5). This can be done by quantifying the Euclidean distance between \( x \) and the decision hyperplane. As illustrated in Fig. 4.2, this distance, also known as the classification margin, can be computed as \( | \langle x, \mathbf{w} \rangle | / \| \mathbf{w} \| \). A point \( x \) has a true label \( t \), which may or may not coincide with the one assigned by rule (4.5). To account for this, we augment the definition of margin by giving a positive sign to correctly classified points and a negative sign to incorrectly classified points.
Assuming that \( t \) takes values in \( \{-1, 1\} \), this yields the definition of geometric margin as \[ \text{margin} = \frac{t \cdot \langle x, \mathbf{w} \rangle}{\| \mathbf{w} \|} \tag{4.7} \] whose absolute value equals the classification margin. For future reference, we also define the functional margin as \( t \cdot \langle x, \mathbf{w} \rangle \). **Feature-based model**. The model described above, in which the activation is a linear function of the input variables \( x \), has the following drawbacks: 1. **Bias**: As suggested by the example in Fig. 4.3, dividing the domain of the covariates \( x \) by means of a hyperplane may fail to capture. Figure 4.3: A non-linearly separable training set. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 90 Context: # Classification Furthermore, as discussed in Sec. 2.3.5, overfitting can be controlled by introducing a regularization function \( R(\theta) \) on the weight vector \( \theta \). Accordingly, a deterministic predictor \( \hat{y}(\mathbf{x}) \) as defined in (4.5) can be learned by solving the regularized ERM problem \[ \min_{\theta} L_D(\theta) + \frac{\lambda}{N} R(\theta). \tag{4.9} \] with the empirical risk \[ L_D(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(t_n, \hat{y}(\mathbf{x}_n)). \tag{4.10} \] In (4.9), the hyperparameter \( \lambda \) should be selected via validation as explained in Sec. 2.3.5. Extending the examples discussed in Sec. 2.3.5, the regularization term is typically convex but possibly not differentiable, e.g., \( R(\theta) = \|\theta\|_1 \). Furthermore, a natural choice for the loss function is the 0-1 loss, which implies that the generalization loss \( L_g \) in (2.2) is the probability of classification error. In the special case of linearly separable data sets, the resulting ERM problem can be converted to a Linear Program (LP) [133].
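The margin definitions above can be sketched numerically; the weight vector and point below are hypothetical values chosen only for illustration:

```python
import numpy as np

def functional_margin(w, x, t):
    # Functional margin: t * <x, w>, with t in {-1, +1}
    return t * np.dot(w, x)

def geometric_margin(w, x, t):
    # Definition (4.7): t * <x, w> / ||w||; positive iff x is
    # correctly classified by the sign of <x, w>
    return functional_margin(w, x, t) / np.linalg.norm(w)

w = np.array([3.0, 4.0])           # ||w|| = 5
x = np.array([1.0, 2.0])           # <x, w> = 11
print(geometric_margin(w, x, +1))  # 11/5 = 2.2: correctly classified
print(geometric_margin(w, x, -1))  # -2.2: incorrectly classified
```

The absolute value 2.2 is the classification margin of this point, regardless of its true label, matching the text.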
Since it is in practice impossible to guarantee the separability condition a priori, one needs generally to solve directly the ERM problem (4.9). The function \( \text{sign}() \) has zero derivative almost everywhere, and is not differentiable when the argument is zero. For this reason, it is difficult to tackle problem (4.9) via standard gradient-based optimization algorithms such as SGD. It is instead often useful to consider surrogate loss functions \( \ell(t, a) \) that depend directly on the differentiable (affine) activation \( a(\mathbf{x}) \). The surrogate loss function should preferably be convex in \( a \), and hence also in \( \theta \) since the activation is an affine function of \( \theta \), ensuring that the resulting regularized ERM problem \[ \min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \ell(t_n, a(\mathbf{x}_n)) + \frac{\lambda}{N} R(\theta) \] is convex. This facilitates optimization [28], and, under suitable additional conditions, guarantees generalization [133]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 105 Context: 4.5. Discriminative Probabilistic Models: Beyond GLM Put \[ \delta^{(L+1)} = (y - t) \quad (4.36a) \] \[ \delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \cdot h'(a^{(l)}) \quad (4.36b) \] \[ \nabla_{W^{(l)}} L(W) = \delta^{(l)} (h^{(l-1)})^T \quad \text{for } l = 1, 2, \ldots, L + 1 \quad (4.36c) \] where \(h'( \cdot )\) denotes the first derivative of the function \(h( \cdot )\); the product \( \cdot \) is taken element-wise; and we set \( h^{(0)} = x \). Backpropagation requires a forward pass and a backward pass for every considered training example. The forward pass uses the neural network as defined by equations (4.35). This entails multiplications by the weight matrices \(W^{(l)}\) in order to compute the activation vectors, as well as applications of the non-linear function \(h( \cdot )\).
In contrast, the backward pass requires only linear operations, which, by (4.36b), are based on the transpose \((W^{(l)})^T\) of the weight matrix \(W^{(l)}\) used in the forward pass. The derivatives (4.36c) computed during the backward pass are of the general form \[ \frac{\partial L(W)}{\partial w^{(l)}_{ij}} = \delta^{(l)}_i h^{(l-1)}_j \quad (4.37) \] where \(w^{(l)}_{ij}\) is the \((i,j)\)th element of matrix \(W^{(l)}\), corresponding to the weight between the pre-synaptic neuron \(j\) in layer \(l-1\) and the post-synaptic neuron \(i\) in layer \(l\); \(h^{(l-1)}_j\) is the output of the pre-synaptic neuron \(j\); and \(\delta^{(l)}_i\) is the back-propagated error associated with the post-synaptic neuron \(i\). The back-propagated error assigns “responsibility” for the error \((y - t)\) measured at the last layer (layer \(L+1\)) to each synaptic weight \(w^{(l)}_{ij}\) between neuron \(j\) in layer \(l-1\) and neuron \(i\) in layer \(l\). The back-propagated error is obtained via the linear operations in (4.36b). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 111 Context: # 4.6 Generative Probabilistic Models **Figure 4.9:** Probability that the class label is the same as for the examples marked with circles according to the output of the generative model QDA. The probability is represented by the color map illustrated by the bar on the right of the figure. For this example, it can be seen that LDA fails to separate the two classes (not shown). ### Example 4.4 We continue the example in Sec. 4.5 by showing in Fig. 4.9 the probability (4.43) that the class label is the same as for the examples marked with circles according to the output of QDA. Given that the covariates have a structure that is well modeled by a mixture of Gaussians with different covariance matrices, QDA is seen to perform well, arguably better than the discriminative models studied in Sec. 4.5.
It is important to note, however, that LDA would fail in this example. This is because a model with equal class-dependent covariance matrices, as assumed by LDA, would entail a significant bias for this example. #### 4.6.3 Multi-Class Classification* As an example of a generative probabilistic model with multiple classes, we briefly consider the generalization of QDA to \( K \geq 2 \) classes. Extending (4.41) to multiple classes, the model is described as: \[ t \sim \text{Cat}(\pi) \tag{4.44a} \] \[ x | t = k \sim \mathcal{N}(\mu_k, \Sigma_k) \tag{4.44b} \] Image Analysis: ### Analysis of the Visual Content (Image 1: the scatter plot of Fig. 4.9) - **Scene Description**: a scatter plot illustrating the class-posterior probability produced by the generative model QDA, accompanied by the caption and example text reproduced above. - **Axes and Scales**: x-axis \(z_1\) ranging from -4 to 4; y-axis \(z_2\) ranging from -3 to 3; color bar for the probability scale ranging from 0.01 to 1. - **Data Presented**: red crosses and blue circles represent data points, with the color map indicating the probability that a data point's class label matches the model's output. - **Key Insights**: the color map shows higher probabilities in regions with concentrated data points, indicating that QDA assigns class labels to the circled examples accurately; the caption notes LDA's failure to separate the two classes, which suggests the necessity of QDA for this dataset.
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 113 Context: # 4.8 Summary When training the \(k\)th model, the outputs \(a_1(x, \mathbf{w}_1), \ldots, a_{k-1}(x, \mathbf{w}_{k-1})\) of the previously trained models, as well as their weights \(\pi_1, \ldots, \pi_{k-1}\), are kept fixed. Excluding the models \(k + 1, \ldots, K\), the training loss can be written as \[ L_k = \sum_{n=1}^{N} \alpha_{n}^{(k)} \exp(-\pi_k t_n a_k(x_n, \mathbf{w}_k)), \tag{4.47} \] with the weights \[ \alpha_{n}^{(k)} = \exp\left(-t_n \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)\right). \tag{4.48} \] An important point is that the weights (4.48) are larger for training samples with smaller functional margin under the current mixture model \(\sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)\). Therefore, when training the \(k\)th model, we give more importance to examples that fare worse in terms of classification margins under the current mixture model. Note that, at each training step \(k\), one trains a simple model, which has the added advantage of reducing the computational complexity as compared to the direct learning of the full mixture model. We refer to [23, Ch. 14][33, Ch. 10] for further details. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 115 Context: 4.8. Summary ----------- Chapter 8 for details). Learning rate schedules that satisfy (4.49) include \(\gamma(t) = 1/t\). The intuitive reason for the use of diminishing learning rates is the need to limit the impact of the “noise” associated with the finite-sample estimate of the gradient [22]. The proof of convergence leverages the unbiasedness of the estimate of the gradient obtained by SGD. In practice, a larger mini-batch size \(S\) decreases the variance of the estimate of the gradient, hence improving the accuracy when close to a stationary point. However, choosing a smaller \(S\) can improve the speed of convergence when the current solution is far from the optimum [152, Chapter 8][22]. A smaller mini-batch size \(S\) is also known to improve the generalization performance of learning algorithms by avoiding sharp extremal points of the training loss function [66, 79] (see also Sec. 4.5). Furthermore, as an alternative to decreasing the step size, one can also increase the size of the mini-batch along the iterations of the SGD algorithm [136]. Variations and Generalizations ------------------------------- Many variations of the discussed basic SGD algorithm have been proposed and routinely used.
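The basic SGD update with the diminishing schedule γ(t) = 1/t mentioned above can be sketched on a toy quadratic objective (an illustrative example of ours, not from the text):

```python
import numpy as np

# Minimize f(w) = E[(w - z)^2] / 2 with stochastic gradients g = w - z_n,
# where z_n ~ N(mu, 1); the minimizer is w* = mu.
rng = np.random.default_rng(2)
mu = 3.0
w = 0.0
for t in range(1, 20001):
    z = mu + rng.normal()
    g = w - z              # unbiased stochastic gradient estimate
    w -= (1.0 / t) * g     # diminishing learning rate gamma(t) = 1/t
print(w)                   # w ends up close to w* = 3.0
```

With this schedule, the iterate is exactly the running average of the samples seen so far, which illustrates how the 1/t decay averages out the gradient noise.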
General principles motivating these variants include [56, Chapter 8]: 1. **Momentum**, or heavy-ball, memory: correct the direction suggested by the stochastic gradient by considering the “momentum” acquired during the last update; 2. **Adaptivity**: use a different learning rate for different parameters depending on an estimate of the curvature of the loss function with respect to each parameter; 3. **Control variates**: reduce the variance of the SGD updates by means of control variates that do not affect the unbiasedness of the stochastic gradient; and 4. **Second-order updates**: include information about the curvature of the cost or objective function in the parameter update. As detailed in [56, Chapter 8][76, 43], to which we refer for further discussions, methods in the first category include Nesterov momentum; in the second category, we find AdaGrad, RMSProp, and Adam; and the third encompasses SVRG and SAGA. Finally, the fourth features Newton's method. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 117 Context: # 4.8. Summary The problem turns out to be quadratic and convex. Importantly, the resulting optimal activation can be expressed as \[ a(x, w) = \sum_{n=1}^{N} \alpha_n k(x, z_n), \tag{4.50} \] where \(\alpha_n\) are the optimal dual variables, and we have defined the kernel function \[ k(x, y) = \phi(x)^\top \phi(y), \tag{4.51} \] where \(x\) and \(y\) are two argument vectors. The kernel function measures the correlation—informally, the similarity—between the two input vectors \(x\) and \(y\). The activation (4.50) has an intuitive interpretation: the decision about the label of an example \(x\) depends only on the support vectors \(z_n\), i.e., those with \(\alpha_n > 0\), that are the most similar to \(x\).
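The dual-form activation (4.50) can be sketched with a Gaussian kernel as follows; the support vectors and dual coefficients are hypothetical placeholders, and the label signs are absorbed into the coefficients for simplicity:

```python
import numpy as np

def gaussian_kernel(x, y):
    # k(x, y) = exp(-||x - y||^2)
    return np.exp(-np.sum((x - y) ** 2))

def activation(x, support_vectors, alphas):
    # a(x) = sum_n alpha_n * k(x, z_n), cf. (4.50); only support
    # vectors (nonzero alpha_n) contribute to the decision
    return sum(a * gaussian_kernel(x, zn)
               for a, zn in zip(alphas, support_vectors))

# Hypothetical support vectors; signed coefficients encode the labels
sv = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
alphas = [1.0, -1.0]
x = np.array([0.1, 0.1])
print(np.sign(activation(x, sv, alphas)))  # positive: x is near the first SV
```

The example illustrates the interpretation given in the text: the kernel values, and hence the decision, are dominated by the support vectors most similar to x.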
Note that equation (4.50) can also be justified using the representer theorem in [133, Chapter 16], which shows that the optimal weight vector must be a linear combination of the feature vectors \(\{\phi(z_n)\}_{n=1}^N\). Working in the dual domain can have computational advantages when the number of primal variables, here the size \(D\) of the weight vector \(w\), is larger than the number \(N\) of dual variables. While this seems a priori unlikely to happen in practice, it turns out that this is not the case. The key idea is that one can use (4.50) with any other kernel function, not necessarily one explicitly defined by a feature function \(\phi\). A kernel function is any symmetric function measuring the correlation of two data points, possibly in an infinite-dimensional space. This is known as the kernel trick. As a first example, the polynomial kernel \[ k(x, y) = (x^\top y + r)^l, \tag{4.52} \] where \(r > 0\), corresponds to a correlation \(\phi(x)^\top \phi(y)\) in a higher-dimensional space of dimension \(D' > D\). For instance, with \(l = 2\) and \(D = 2\) we have \(D' = 6\) and the feature vector \(\phi(x) = [r, \sqrt{2r}x_1, \sqrt{2r}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2]^T\) [104]. As another, more extreme example, the Gaussian kernel \[ k(x, y) = e^{-\|x - y\|^2} \tag{4.53} \] corresponds to an inner product in an infinite-dimensional space [104]. An extensive discussion on kernel methods can be found in [104]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 119 Context: # 5 ## Statistical Learning Theory* Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number \( N \) of available data points and the generalization performance of a trained machine.
The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. This chapter is of a mathematical nature, and it departs from the algorithmic focus of the text so far. While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 125 Context: ``` 5.2 PAC Learnability and Sample Complexity =========================================== In order to formally address the key question posed above regarding the learnability of a model \( \mathcal{H} \), we make the following definitions. As mentioned, for simplicity, we consider binary classification under the 0-1 loss, although the analysis can be generalized under suitable conditions [133]. ### Definition 5.2 A hypothesis class \( \mathcal{H} \) is PAC learnable if, for any \( \epsilon, \delta \in (0, 1) \), there exists an \( (N, \epsilon, \delta) \) PAC learning rule as long as the inequality \[ N \geq N_{\mathcal{H}}(\epsilon, \delta) \] is satisfied for some function \( N_{\mathcal{H}}(\epsilon, \delta) < \infty \). In words, a hypothesis class is PAC learnable if, as long as enough data is collected, a learning algorithm can be found that obtains any desired level of accuracy and confidence.
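As a hedged illustration of this definition (a toy construction of ours, using the standard class of one-dimensional threshold classifiers, which is not discussed in this excerpt), ERM over a finite hypothesis class can be seen to drive the generalization loss down as N grows:

```python
import numpy as np

# Finite hypothesis class of 1-D threshold classifiers h_c(x) = 1(x >= c),
# with data generated by a true threshold c* = 0.5 (realizability holds).
rng = np.random.default_rng(3)
thresholds = np.linspace(0, 1, 101)

def erm(x, t):
    # Pick the threshold with the smallest empirical 0-1 loss
    losses = [np.mean((x >= c).astype(int) != t) for c in thresholds]
    return thresholds[int(np.argmin(losses))]

for N in [10, 100, 1000]:
    x = rng.random(N)
    t = (x >= 0.5).astype(int)
    c_hat = erm(x, t)
    # For uniform x, the 0-1 generalization loss is |c_hat - 0.5|
    print(N, abs(c_hat - 0.5))
```

The printed generalization loss shrinks as N grows, consistent with the PAC requirement that accuracy ε and confidence δ improve with the sample size.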
An illustration of the threshold \( N_{\mathcal{H}}(\epsilon, \delta) \) can be found in Fig. 5.2. A less strong definition of PAC learnability requires (5.7) to hold only for distributions \( p(x, t) \) that can be written as \[ p(x, t) = p(x)1(t = \hat{t}(x)) \tag{5.9} \] for some marginal distribution \( p(x) \) and for some hypothesis \( \hat{t}(x) \in \mathcal{H} \). The condition (5.9) is known as the realizability assumption, which implies that the data is generated from some mechanism that is included in the hypothesis class. Note that realizability implies the linear separability of any data set drawn from the true distribution for the class of linear predictors (see Chapter 4). A first, and perhaps surprising, observation is that not all models are PAC learnable. As an extreme example of this phenomenon, consider the class of all functions from \( \mathbb{R}^d \) to \( \{0, 1\} \). By the no free lunch theorem, this class is not PAC learnable. In fact, given any amount of data, we can always find a distribution \( p(x, t) \) under which the PAC condition is not satisfied. Intuitively, even in the realizable case, knowing the correct predictor \( \hat{t}(x) \) in (5.9) for any number of ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 133 Context: # 5.5. Summary To elaborate, consider a probabilistic model \( \mathcal{H} \) defined as the set of all pmfs \( p(x|\theta) \) parameterized by \( \theta \) in a given set. With some abuse of notation, we take \( \mathcal{H} \) to be also the domain of parameter \( \theta \). To fix the ideas, assume that \( \theta \) takes values over a finite alphabet. We know from Section 2.5 that a distribution \( q(x) \) is associated with a lossless compression scheme that requires around \( -\log q(x) \) bits to describe a value \( x \).
Furthermore, if we were informed about the true parameter \( \theta \), the minimum average coding length would be the entropy \( H(p(\cdot|\theta)) \), which requires setting \( q(x) = p(x|\theta) \) (see Appendix A). Assume now that we only know that the parameter \( \theta \) lies in set \( \mathcal{H} \), and hence the true distribution \( p(x|\theta) \) is not known. In this case, we cannot select the true parameter distribution, and we need instead to choose a generally different distribution \( q(x) \) to define a compression scheme. With a given distribution \( q(x) \), the average coding length is given by \[ \bar{L}(q(x), \theta) = -\sum_{x} p(x|\theta) \log q(x). \] Therefore, the choice of a generally different distribution \( q(x) \) entails a redundancy of \[ \Delta R(q(x), \theta) = -\sum_{x} p(x|\theta) \log \frac{q(x)}{p(x|\theta)} \geq 0 \tag{5.22} \] bits. The redundancy \( \Delta R(q(x), \theta) \) in (5.22) depends on the true value of \( \theta \). Since the latter is not known, this quantity cannot be computed. We can instead obtain a computable metric by maximizing over all values of \( \theta \in \mathcal{H} \), which yields the worst-case redundancy \[ \Delta R(q(x), \mathcal{H}) = \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta). \tag{5.23} \] This quantity can be minimized over \( q(x) \), yielding the so-called min-max redundancy: \[ \Delta R(\mathcal{H}) = \min_{q} \Delta R(q(x), \mathcal{H}) \tag{5.24} \] \[ = \min_{q} \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta) \tag{5.25} \] \[ = \min_{q} \max_{\theta \in \mathcal{H}} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)}. \tag{5.26} \] The min-max redundancy can be taken as a measure of the capacity of model \( \mathcal{H} \), since a richer model tends to yield a larger \( \Delta R(\mathcal{H}) \).
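The worst-case redundancy (5.23) can be illustrated on a toy model class. The sketch below (parameters illustrative, not from the text) takes \(\mathcal{H}\) to be a finite set of Bernoulli parameters and compares two candidate compression distributions \(q\):

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in bits; this is the redundancy (5.22)
    incurred by coding with Bern(q) when the true parameter is p."""
    def term(a, b):
        return a * math.log2(a / b) if a > 0 else 0.0
    return term(p, q) + term(1 - p, 1 - q)

thetas = [0.2, 0.8]  # toy model H: a finite set of Bernoulli parameters

def worst_case_redundancy(q):
    # Delta R(q, H) = max over theta in H of the redundancy, cf. (5.23)
    return max(kl_bernoulli(theta, q) for theta in thetas)

# The symmetric choice q = 0.5 hedges against both parameters, while
# committing to q = 0.2 is heavily penalized when theta = 0.8.
assert worst_case_redundancy(0.5) < worst_case_redundancy(0.2)
```

Minimizing `worst_case_redundancy` over \(q\) would yield the min-max redundancy (5.24) for this toy class.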
In fact, #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 138 Context: # Unsupervised Learning may wish to cluster a set \( D \) of text documents according to their topics, by modeling the latter as an unobserved label \( z_n \). Broadly speaking, this requires grouping together documents that are similar according to some metric. It is important at this point to emphasize the distinction between classification and clustering: While the former assumes the availability of a labelled set of training examples and evaluates its (generalization) performance on a separate set of unlabelled examples, the latter works with a single, unlabelled, set of examples. The different notation used for the labels \( z_n \) in lieu of \( t_n \) is meant to provide a reminder of this key difference. ## Dimensionality reduction and representation: Given the set \( D \), we would like to represent the data points \( x_n \in D \) in a space of lower dimensionality. This makes it possible to highlight independent explanatory factors, and/or to ease visualization and interpretation [93], e.g., for text analysis via vector embedding (see, e.g., [124]). ## Feature extraction: Feature extraction is the task of deriving functions of the data points \( x_n \) that provide useful lower-dimensional inputs for tasks such as supervised learning. The extracted features are unobserved, and hence latent, variables. As an example, the hidden layer of a deep neural network extracts features from the data for use by the output layer (see Sec. 4.5). ## Generation of new samples: The goal here is to train a machine that is able to produce samples that are approximately distributed according to the true distribution \( p(x) \). For example, in computer graphics for filmmaking or gaming, one may want to train a software that is able to produce artistic scenes based on a given description.
The variety of tasks and the difficulty in providing formal definitions, e.g., on the realism of an artificially generated image, make unsupervised learning, at least in its current state, a less formal field than supervised learning. Often, loss criteria in unsupervised learning measure the divergence between the learned model and the empirical data distribution, but there are important exceptions, as we will see. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 147 Context: # 6.3. ML, ELBO and EM The equivalence between the three forms of the ELBO can be easily checked. The form (6.11) justifies the definition of the negative of the ELBO as variational free energy or Gibbs free energy, which is the difference of energy and entropy. This form is particularly useful for undirected models in which one specifies directly the joint distribution \( p(x, z|\theta) \), such as energy-based models, while the form (6.12) is especially well suited for directed models that account for the discriminative distribution \( p(z|x, \theta) \), as for deep neural networks [27]. For both forms the first term can be interpreted as a cross-entropy loss. The form (6.13) is more compact, and suggests that, as we will formalize below, the ELBO is maximized when \( q(z) \) is selected to match the model distribution. The last form yields terms that are generally not easily computable, but it illustrates the relationship between the log-likelihood function and the ELBO, as we discuss next. The following theorem describes the defining property of the ELBO as well as another important property. Taken together, these features make the ELBO uniquely suited for the development of algorithmic solutions for problem (6.7).
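The relationship between the log-likelihood and the ELBO can be verified numerically on a minimal discrete latent-variable model (all probabilities below are illustrative, not from the text):

```python
import math

# Toy model: binary latent z, observed x = 1.
p_z = [0.3, 0.7]                    # prior p(z)
p_x_given_z = [0.9, 0.2]            # p(x = 1 | z)
p_joint = [p_z[k] * p_x_given_z[k] for k in range(2)]   # p(x = 1, z)
p_x = sum(p_joint)                  # evidence p(x = 1)
p_post = [pj / p_x for pj in p_joint]                   # posterior p(z | x = 1)

def elbo(q):
    # L(q, theta) = E_q[ln p(x, z)] + H(q)  ("energy minus free energy" form)
    return sum(q[k] * (math.log(p_joint[k]) - math.log(q[k])) for k in range(2))

def kl(q, p):
    # KL(q || p) for two pmfs over {0, 1}
    return sum(q[k] * math.log(q[k] / p[k]) for k in range(2))

q = [0.5, 0.5]  # an arbitrary variational distribution
# Decomposition: ln p(x) = L(q) + KL(q || p(z|x)), so the ELBO lower-bounds
# the log-likelihood, with equality when q equals the posterior.
assert abs(math.log(p_x) - (elbo(q) + kl(q, p_post))) < 1e-12
assert abs(elbo(p_post) - math.log(p_x)) < 1e-12
```

The two assertions mirror, respectively, the exact decomposition and the tightness of the bound at the posterior.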
## Theorem 6.1 The ELBO is a global lower bound on the LL function, that is, \[ \ln p(x|\theta) \geq L(q, \theta), \tag{6.15} \] where, for a given value \( \theta \), equality holds if and only if the distribution \( q(z) \) satisfies \( q(z) = p(z|x, \theta) \). Furthermore, the ELBO is concave in \( q(z) \) for a fixed \( \theta \), and, if \( \ln p(x, z|\theta) \) is concave in \( \theta \), it is also concave in \( \theta \) for a fixed \( q(z) \). **Proof.** The first part of the theorem follows immediately from the form (6.14), which we can rewrite as \[ \ln p(x|\theta) = L(q, \theta) + \text{KL}(q(z)\|p(z|x,\theta)), \tag{6.16} \] and from Gibbs' inequality. In fact, the latter implies that the KL divergence \( \text{KL}(q(z)\|p(z|x,\theta)) \) is non-negative and equal to zero if and only if the two distributions in the argument are equal. The concavity of the ELBO can be easily checked using standard properties of convex functions [28]. As a note, an alternative proof of the first part of #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 169 Context: # 6.8 Ranking ## Denoising autoencoders An alternative approach to facilitate the learning of useful features is that taken by denoising autoencoders [150]. Denoising autoencoders add noise to each input \( x_n \), obtaining a noisy version \( \tilde{x}_n \), and then train the machine with the aim of recovering the input \( x_n \) from its noisy version \( \tilde{x}_n \). Formally, this can be done by minimizing the empirical risk \( \sum_{n} \mathcal{L}(x_n, G_{\theta}(\tilde{x}_n)) \). ## Probabilistic autoencoders Instead of using a deterministic encoder and decoder, it is possible to work with probabilistic encoders and decoders, namely \( p(z|x, \theta) \) and \( p(x|z, \theta) \), respectively. Treating the encoder as a variational distribution, learning can then be done by a variant of the EM algorithm.
The resulting algorithm, known as Variational AutoEncoder (VAE), will be mentioned in Chapter 8. ## 6.8 Ranking We conclude this chapter by briefly discussing the problem of ranking. When one has available ranked examples, ranking can be formulated as a supervised learning problem [133]. Here, we focus instead on the problem of ranking a set of webpages based only on the knowledge of their underlying graph of mutual hyperlinks. This set-up may be considered a special instance of unsupervised learning. We specifically describe a representative, popular, scheme known as PageRank [110], which uses solely the web of hyperlinks as a form of supervision signal to guide the ranking. To elaborate, we define the connectivity graph by including a vertex for each webpage and writing the adjacency matrix as: \[ L_{ij} = \begin{cases} 1 & \text{if page } j \text{ links to page } i \\ 0 & \text{otherwise.} \end{cases} \] The outgoing degree of a webpage \( j \) is given as: \[ C_j = \sum_{i} L_{ij}. \] PageRank computes the rank \( p_i \) of a webpage \( i \) as: \[ p_i = (1 - d) + d \sum_{j} \frac{L_{ij}}{C_j} p_j, \] where \( d \) is a damping factor. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 171 Context: # Part IV ## Advanced Modelling and Inference #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 175 Context: # 7.1. Introduction illustrated in Fig. 7.1, where we have considered \( N \) i.i.d. documents. Note that the graph is directed: in this problem, it is sensible to model the document as being caused by the topic, entailing a directed causality relationship. Learnable parameters are represented as dots. BNs are covered in Sec. 7.2. **Figure 7.2:** MRF for the image denoising example.
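The PageRank recursion \(p_i = (1-d) + d \sum_j (L_{ij}/C_j)\, p_j\) can be sketched directly as a fixed-point iteration. A minimal implementation (the 3-page web below is hypothetical, used only as a smoke test):

```python
def pagerank(L, d=0.85, iters=100):
    """Iterate p_i = (1 - d) + d * sum_j L[i][j] / C[j] * p[j],
    where L[i][j] = 1 if page j links to page i, and C[j] is the
    outgoing degree of page j (the j-th column sum of L)."""
    n = len(L)
    C = [sum(L[i][j] for i in range(n)) for j in range(n)]  # out-degrees
    p = [1.0] * n
    for _ in range(iters):
        p = [(1 - d) + d * sum(L[i][j] * p[j] / C[j]
                               for j in range(n) if C[j] > 0)
             for i in range(n)]
    return p

# Hypothetical 3-page web: pages 1 and 2 both link to page 0,
# and page 0 links back to page 1.
L = [[0, 1, 1],
     [1, 0, 0],
     [0, 0, 0]]
ranks = pagerank(L)
assert ranks[0] == max(ranks)  # the most linked-to page ranks highest
```

Since \(d < 1\), the update is a contraction and the iteration converges to the unique fixed point of the recursion.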
Only one image is shown and the learnable parameters are not indicated in order to simplify the illustration. ## Example 7.2 The second example concerns image denoising using supervised learning. For this task, we wish to learn a joint distribution \( p(x, z | \theta) \) of the noisy image \( x \) and the corresponding desired noiseless image \( z \). We encode the images using a matrix representing the numerical values of the pixels. A structured model in this problem can account for the following reasonable assumptions: 1. Neighboring pixels of the noiseless image are correlated, while pixels further apart are not directly dependent on one another; 2. Noise acts independently on each pixel of the noiseless image to generate the noisy image. These assumptions are encoded by the MRF shown in Fig. 7.2. Note that this is an undirected model. This choice is justified by the need to capture the mutual correlation among neighboring pixels, which cannot be naturally described as a directed causality relationship. We will study MRFs in Sec. 7.3. As suggested by the examples above, structure in probabilistic models can be conveniently represented in the form of graphs. At a fundamental level, structural properties in a probabilistic model amount #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 189 Context: # 7.4. Bayesian Inference in Probabilistic Graphical Models The EM algorithm requires the evaluation of the posterior \( p(z | x, \theta) \) of the latent variables \( z \) given the observed variables at the current iterate \( \theta \). As discussed, when performing Bayesian inference, we can distinguish between observed variables \( x \) and latent variables \( z \). In general, only a subset of the latent variables may be of interest, say \( z_{i} \), with the rest of the random variables denoted as \( z_{-i} \).
The quantity to be computed is the posterior distribution: \[ p(z_{i} | x) = \frac{p(x, z_{i})}{p(x)}, \tag{7.22} \] where \[ p(x, z_{i}) = \sum_{z_{-i}} p(x, z), \tag{7.23} \] and \[ p(x) = \sum_{z_{i}} p(x, z_{i}), \tag{7.24} \] with the sums being replaced by integrals for continuous variables. The key complication in evaluating these expressions is the need to sum over potentially large sets, namely the domains of variables \( z_{-i} \) and \( z_{i} \). Note that the sum in (7.23), which appears at the numerator of (7.22), is over all hidden variables that are of no interest. In contrast, the sum in (7.24), which is at the denominator of (7.22), is over the variables whose posterior probability (7.22) is the final objective of the calculation. ## Example 7.10 Consider an HMM, whose BN is shown in Fig. 7.3. Having learned the probabilistic model, a typical problem is that of inferring a given hidden variable \( z_i \) given the observed variables \( x = \{x_{1}, \ldots, x_{D}\} \). Computing the posterior \( p(z_i | x) \) requires the evaluation of the sums in (7.23) and (7.24). When the hidden variables \( z_{1}, \ldots, z_{D} \) are discrete with alphabet size \( |\mathcal{Z}| \), the complexity of evaluating (7.23) is of the order \( |\mathcal{Z}|^{D-1} \), since one needs to sum over the \( |\mathcal{Z}|^{D-1} \) possible values of the hidden variables \( z_{-i} \). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 207 Context: # 8.5 Summary Having observed in previous chapters that learning in probabilistic models is often held back by the complexity of exact Bayesian inference for hidden variables, this chapter has provided an overview of approximate, lower-complexity inference techniques. We have focused on MC and VI methods, which are most commonly used. The treatment stressed the impact of design choices in the selection of different types of approximation criteria, such as M- and I-Projection.
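The sums (7.23)-(7.24) and the posterior (7.22) can be carried out by brute force on a toy discrete joint distribution. A minimal sketch (the joint pmf below is illustrative, with \(z_1\) the latent variable of interest and \(z_2\) a nuisance variable):

```python
import itertools

# Toy joint p(x, z1, z2) over binary variables; values sum to 1.
vals = [0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.25, 0.10]
joint = {cfg: v for cfg, v in
         zip(itertools.product([0, 1], repeat=3), vals)}  # keys: (x, z1, z2)

x_obs = 1
# Numerator of (7.22): p(x, z1) = sum over the nuisance variable z2, cf. (7.23)
p_x_z1 = {z1: sum(joint[(x_obs, z1, z2)] for z2 in (0, 1)) for z1 in (0, 1)}
# Denominator of (7.22): p(x) = sum over all hidden variables, cf. (7.24)
p_x = sum(p_x_z1.values())
# Posterior p(z1 | x), cf. (7.22)
posterior = {z1: p_x_z1[z1] / p_x for z1 in (0, 1)}
assert abs(sum(posterior.values()) - 1.0) < 1e-12
```

With \(D\) hidden variables this brute-force marginalization costs \(O(|\mathcal{Z}|^{D-1})\), which is exactly the exponential blow-up that motivates the approximate inference methods of Chapter 8.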
It also covered the use of approximate inference in learning problems. Approaches that improve over the state-of-the-art techniques discussed in this chapter are being actively investigated. Some additional topics for future research are covered in the next chapter. ## Appendix: M-Projection with the Exponential Family In this appendix, we consider the problem of obtaining the M-projection of a distribution \( p(z) \) onto a distribution \( q(z \mid \varphi) = Z(\varphi)^{-1} \exp(\varphi^\top u(z)) \) from the exponential family with sufficient statistics \( u(z) \). We will prove that, if there exists a value \( \varphi^* \) of the natural parameter vector that satisfies the moment matching condition (8.14), then \( q(z \mid \varphi^*) \) is the M-projection. We first write the KL divergence as: \[ \text{KL}(p \,\|\, q(z \mid \varphi)) = \ln Z(\varphi) - \varphi^\top \mathbb{E}_{p(z)}[u(z)] - H(p). \tag{8.31} \] The difference between the KL divergence for a generic parameter vector \( \varphi \) and for the vector \( \varphi^* \) satisfying (8.14) can be written as: \[ \text{KL}(p \,\|\, q(z \mid \varphi)) - \text{KL}(p \,\|\, q(z \mid \varphi^*)) = \ln \frac{Z(\varphi)}{Z(\varphi^*)} - (\varphi - \varphi^*)^\top \mathbb{E}_{p(z)}[u(z)] = \ln \frac{Z(\varphi)}{Z(\varphi^*)} - (\varphi - \varphi^*)^\top \mathbb{E}_{q(z \mid \varphi^*)}[u(z)] = \text{KL}(q(z \mid \varphi^*) \,\|\, q(z \mid \varphi)), \tag{8.32} \] where the second equality uses the moment matching condition \( \mathbb{E}_{p(z)}[u(z)] = \mathbb{E}_{q(z \mid \varphi^*)}[u(z)] \). Since the last KL divergence is non-negative, and equal to zero when \( \varphi = \varphi^* \), this choice of the natural parameters minimizes the KL divergence. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 208 Context: # Part V ## Conclusions
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 210 Context: # Concluding Remarks The data, producing the wrong response upon minor, properly chosen, changes in the explanatory variables. Note that such adversarially chosen examples, which cause a specific machine to fail, are conceptually different from the randomly selected examples that are assumed when defining the generalization properties of a network. There is evidence that finding such examples is possible even without knowing the internal structure of a machine, but solely based on black-box observations [11]. Modifying the training procedure in order to ensure robustness to adversarial examples is an active area of research with important practical implications [55]. ## Computing Platforms and Programming Frameworks In order to scale up machine learning applications, it is necessary to leverage distributed computing architectures and related standard programming frameworks [17, 7]. As a complementary and more futuristic approach, recent work has proposed to leverage the capabilities of annealing-based quantum computers as samplers [82] or discrete optimizers [103]. ## Transfer Learning Machines trained for a certain task currently need to be re-trained in order to be re-purposed for a different task. For instance, a machine that learned how to drive a car would need to be retrained in order to learn how to drive a truck. The field of transfer learning covers scenarios in which one wishes to transfer the expertise acquired from some tasks to others. Transfer learning includes different related paradigms, such as multitask learning, lifelong learning, zero-shot learning, and domain adaptation [149].
In multitask learning, several tasks are learned simultaneously. Typical solutions for multitask learning based on neural networks rely on representations that are shared among the networks trained for the different tasks [19]. Lifelong learning utilizes a machine trained on a number of tasks to carry out a new task by leveraging the knowledge accumulated during the previous training phases [143]. Zero-shot learning refers to models capable of recognizing unseen classes with training examples available only for related, but different, classes. This often entails the task of learning representations of classes, such as prototype vectors, that generate data in the class through a fixed probabilistic mechanism [52]. Domain adaptation will be discussed separately in the next point. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 212 Context: # Appendices #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 213 Context: # Appendix A: Information Measures In this appendix, we provide a principled and intuitive introduction to information measures that builds on inference, namely estimation and hypothesis testing. We focus on entropy, mutual information, and divergence measures, and we concentrate on discrete r.v.s. Throughout the monograph, we have taken the pragmatic approach of extending the definitions to continuous variables by substituting sums with integrals.
It is worth noting that this approach does not come with any practical complications when dealing with mutual information and divergence. Instead, the continuous version of the entropy, known as differential entropy, should be treated with care, as it does not satisfy some key properties of the entropy, such as non-negativity. ## A.1 Entropy As proposed by Claude Shannon, the amount of information received from the observation of a discrete random variable \( X \sim p(x) \) defined over a finite alphabet should be measured by the amount of uncertainty about its value prior to its measurement. To this end, we consider the problem of estimating the value of \( x \) when one only knows the following: #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 219 Context: A.3. Divergence Measures ======================== Given the expectation above, \( T(x) \) is large for values of \( x \) generated from \( p_x \). The function \( g \) can be used to define the relative importance of errors made in favor of one distribution or the other. From this discussion, the optimal value of this criterion can be taken to be a measure of the distance between the two pmfs. This yields the following definition of divergence between two pmfs: \[ D_f(p \| q) = \max_{T} \; \mathbb{E}_{p_x} [T(x)] - \mathbb{E}_{q_x} [g(T(x))], \tag{A.11} \] where the subscript \( f \) will be justified below. Under suitable differentiability assumptions on function \( g \) (see [107] for generalizations), taking the derivative with respect to \( T(x) \) for all \( x \in \mathcal{X} \) yields the optimality condition \( g'(T(x)) = p(x)/q(x) \). This relationship reveals the connection between the optimal detector \( T(x) \) and the LLR \( p(x)/q(x) \).
Plugging this result into (A.11), it can be directly checked that the following equality holds [105]: \[ D_f(p \| q) = \mathbb{E}_{q_x} \left[ f\left( \frac{p(x)}{q(x)} \right) \right]. \tag{A.12} \] Here, \( f = g^* \) is the convex dual function of \( g \), which is defined as \( g^*(y) = \sup_{x} (yx - g(x)) \). Note that the dual function \( f \) is always convex [28]. Under the additional constraint \( f(1) = 0 \), definition (A.12) describes a large class of divergence measures parametrized by the convex function \( f \), which are known as f-divergences or Ali-Silvey distance measures [45]. Note that the constraint \( f(1) = 0 \) ensures that the divergence is zero when the pmfs \( p_x \) and \( q_x \) are identical. Among their key properties, f-divergences satisfy the data processing inequality [45]. For example, the choice \( g(t) = e^t - 1 \) gives the dual convex function \( f(x) = x \log(x) - x + 1 \) and the optimal detector \( T(x) = \ln \left( \frac{p(x)}{q(x)} \right) \), and the corresponding divergence measure (A.12) is the standard KL divergence \( \text{KL}(p \| q) \). As another instance of f-divergence, a different choice of \( g \) yields the Jensen-Shannon divergence.* *The Jensen-Shannon divergence can also be interpreted as the mutual information \( I(s; x) \) between \( x \) and a binary variable \( s \) that selects between the two pmfs \( p_x \) and \( q_x \). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 234 Context: ``` # References [124] Rudolph, M., F. Ruiz, S. Athey, and D. Blei. 2017. “Structured Embedding Models for Grouped Data”. arXiv e-prints. Sept. arXiv: 1709.10367 [cs.LG]. [125] Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1988. “Learning representations by back-propagating errors”. *Cognitive modeling*, 5(3): 1–3. [126] Russell, S. and P. Norvig. 2009. *Artificial Intelligence: A Modern Approach*. Pearson. [127] Salakhutdinov, R., A. Mnih, and G. Hinton. 2007.
“Restricted Boltzmann machines for collaborative filtering”. In: *Proceedings of the 24th international conference on Machine learning*. ACM. 791–798. [128] Salimans, T., J. Ho, X. Chen, and I. Sutskever. 2017. “Evolution strategies as a scalable alternative to reinforcement learning”. arXiv preprint arXiv:1703.03864. [129] Samadi, A., T. P. Lillicrap, and D. B. Tweed. 2017. “Deep Learning with Dynamic Spiking Neurons and Fixed Feedback Weights”. *Neural Computation*. 29(3): 578–602. [130] Scutari, G., F. Facchinei, L. Lampariello, and P. Song. 2014. “Distributed methods for constrained nonconvex multi-agent optimization-part I: theory”. arXiv preprint arXiv:1410.7456. [131] Scutari, M. 2017. “Bayesian Dirichlet Bayesian Network Scores and the Maximum Entropy Principle”. arXiv preprint arXiv:1708.00689. [132] Shahriari, B., K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. 2016. “Taking the human out of the loop: A review of Bayesian optimization”. *Proceedings of the IEEE*. 104(1): 148–175. [133] Shalev-Shwartz, S. and S. Ben-David. 2014. *Understanding machine learning: From theory to algorithms*. Cambridge University Press. [134] Shannon, C. E. 1948. “A mathematical theory of communication”. *The Bell System Technical Journal*. 27(3): 379–423. [135] Silver, D. 2015. *Course on reinforcement learning*. URL: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html. ``` #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 3 Context: # Contents Preface iii Learning and Intuition vii 1. Data and Information 1.1 Data Representation ...................................................... 2 1.2 Preprocessing the Data .................................................. 4 2. Data Visualization .................................................................. 7 3. Learning 3.1 In a Nutshell ............................................................... 15 4.
Types of Machine Learning 4.1 In a Nutshell ............................................................... 20 5. Nearest Neighbors Classification 5.1 The Idea In a Nutshell .................................................. 23 6. The Naive Bayesian Classifier 6.1 The Naive Bayes Model .................................................. 25 6.2 Learning a Naive Bayes Classifier ............................... 27 6.3 Class-Prediction for New Instances ............................. 28 6.4 Regularization ............................................................... 30 6.5 Remarks .................................................................... 31 6.6 The Idea In a Nutshell .................................................. 32 7. The Perceptron 7.1 The Perceptron Model .................................................. 34 #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 4 Context: ``` # CONTENTS 7.2 A Different Cost function: Logistic Regression .......................... 37 7.3 The Idea In a Nutshell .................................................. 38 # 8 Support Vector Machines ................................................ 39 8.1 The Non-Separable case ................................................. 43 # 9 Support Vector Regression .............................................. 47 # 10 Kernel ridge Regression ............................................... 51 10.1 Kernel Ridge Regression ............................................... 52 10.2 An alternative derivation ............................................. 53 # 11 Kernel K-means and Spectral Clustering ............................... 55 # 12 Kernel Principal Components Analysis ................................ 59 12.1 Centering Data in Feature Space ....................................... 61 # 13 Fisher Linear Discriminant Analysis .................................. 
63 13.1 Kernel Fisher LDA .................................................... 66 13.2 A Constrained Convex Programming Formulation of FDA ................. 68 # 14 Kernel Canonical Correlation Analysis ................................ 69 14.1 Kernel CCA ............................................................ 71 # A Essentials of Convex Optimization ..................................... 73 A.1 Lagrangians and all that ............................................... 73 # B Kernel Design .......................................................... 77 B.1 Polynomials Kernels .................................................... 77 B.2 All Subsets Kernel ...................................................... 78 B.3 The Gaussian Kernel ..................................................... 79 ``` #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 7 Context: # MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). I would like to thank Hans for discussion on intuition. I also thank Prof. Bert Kappen, who leads an excellent group of postdocs and students for his hospitality. Marga, kids, UCI,... --- There are a few main aspects I want to cover from a personal perspective. Instead of trying to address every aspect of the entire field, I have chosen to present a few popular and perhaps useful tools and approaches. What will (hopefully) be significantly different than most other scientific books is the manner in which I will present these methods. I have always been frustrated by the lack of proper explanation of equations. Many times, I have been staring at a formula without the slightest clue where it came from or how it was derived. Many books also excel in stating facts in an almost encyclopedic style, without providing the proper intuition of the method. 
This is my primary mission: to write a book that conveys intuition. The first chapter will be devoted to why I think this is important. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 10 Context: # LEARNING AND INTUITION Baroque features or a more “dull” representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvelous book “XXX” (Hadamard). By building accurate visual representations of abstract ideas, we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashes with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way, I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her database, which therefore needs no or very little updating. Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem.
This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you’ll find that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work, suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with. In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 11 Context: Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored, I recommend moving on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see [books](#)). But I hope that for most beginning students, this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun! 
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 14 Context: # CHAPTER 1. DATA AND INFORMATION ## Interpretation Here we seek to answer questions about the data. For instance, what property of this drug was responsible for its high success rate? Does a security officer at the airport apply racial profiling in deciding whose luggage to check? How many natural groups are there in the data? ## Compression Here we are interested in compressing the original data, a.k.a. the number of bits needed to represent it. For instance, files in your computer can be "zipped" to a much smaller size by removing much of the redundancy in those files. Also, JPEG and GIF (among others) are compressed representations of the original pixel map. All of the above objectives depend on the fact that there is structure in the data. If data is completely random, there is nothing to predict, nothing to interpret, and nothing to compress. Hence, all tasks are somehow related to discovering or leveraging this structure. One could say that data is highly redundant and that this redundancy is exactly what makes it interesting. Take the example of natural images. If you are required to predict the color of the pixels neighboring some random pixel in an image, you would be able to do a pretty good job (for instance, 20% may be blue sky and predicting the neighbors of a blue sky pixel is easy). Also, if we would generate images at random, they would not look like natural scenes at all. For one, it wouldn’t contain objects. Only a tiny fraction of all possible images looks "natural," and so the space of natural images is highly structured. Thus, all of these concepts are intimately related: structure, redundancy, predictability, regularity, interpretability, compressibility. They refer to the "food" for machine learning. Without structure, there is nothing to learn. The same thing is true for human learning. 
From the day we are born, we start noticing that there is structure in this world. Our survival depends on discovering and recording this structure. If I walk into this brown cylinder with a green canopy, I suddenly stop; it won’t give way. In fact, it damages my body. Perhaps this holds for all these objects. When I cry, my mother suddenly appears. Our game is to predict the future accurately, and we predict it by learning its structure.

## 1.1 Data Representation

What does "data" look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. But it will be useful to convert data into a #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 17 Context: # 1.2. PREPROCESSING THE DATA

attribute separately) and then added and divided by \( N \). You have perhaps noticed that variance does not have the same units as \( X \) itself. If \( X \) is measured in grams, then variance is measured in grams squared. To scale the data to have the same scale in every dimension, we divide by the square-root of the variance, which is usually called the sample standard deviation:

\[
X'_{nm} = \frac{X_{nm}}{\sqrt{\hat{V}[X_m]}} \quad \forall n
\]

Note again that sphering requires centering, implying that we always have to perform these operations in this order: first center, then sphere. Figure ??a,b,c illustrates this process. You may now be asking, “well, what if the data were elongated in a diagonal direction?”. Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and we will postpone this issue until chapter ?? on “principal components analysis”.
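The first-center-then-sphere recipe above can be sketched in a few lines of NumPy. This is a minimal illustration with a made-up toy data matrix (the array names are not from the text); cases are rows and attributes are columns, as in the \(X_{nm}\) notation:

```python
import numpy as np

# Toy data matrix: N cases (rows) by M attributes (columns), on very
# different scales per attribute.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Step 1: center -- subtract the per-attribute sample mean.
X_centered = X - X.mean(axis=0)

# Step 2: sphere -- divide by the per-attribute sample standard deviation
# (the square-root of the variance computed with division by N).
X_sphered = X_centered / X_centered.std(axis=0)

# After centering and sphering, every attribute has mean 0 and variance 1,
# so both columns now live on the same scale.
print(X_sphered.mean(axis=0))
print(X_sphered.var(axis=0))
```

The order matters exactly as the text says: the standard deviation used in step 2 is computed on the centered data, so centering must come first.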
However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process in that they identify structure (and then remove it). By remembering the sequence of transformations you performed, you have implicitly built a model. Conversely, many algorithms can be easily adapted to model the mean and scale of the data. Then the preprocessing is no longer necessary and becomes integrated into the model. Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters, but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists of removing this redundancy and transforming the original data into a less redundant (and hence more succinct) code. Models and structure-reducing data transformations are in essence each other's reverses: we often associate with a model an understanding of how the data was generated, starting from random noise. Conversely, pre-processing starts with the data and understands how we can get back to the unstructured random state of the data.
Taking a logarithm can help in that case:

$$ X'_{nm} = \log(\alpha + X_{nm}) \tag{1.5} $$

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 21 Context: # Data Visualization Techniques

Distributions that have heavy tails relative to Gaussian distributions are important to consider. Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold.

## Scatter Plots

Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate, I will give a few examples from a variety of techniques:

1. **Histogram**: A useful way to represent the distribution of a dataset.
2. **Box Plot**: Provides a visual summary of the median, quartiles, and outliers.
3. **Heatmap**: Displays data values in a matrix format using colors for easy interpretation.
4. **Line Graph**: Ideal for showing trends over time.

Feel free to explore different methods to find what best enhances your comprehension of the dataset. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 22 Context: # CHAPTER 2: DATA VISUALIZATION ## Introduction Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. ## Benefits of Data Visualization - **Enhanced Understanding**: Complex data becomes more understandable through visual representation. - **Immediate Insights**: Visualizations can provide quick and effective insights into data trends. - **Better Communication**: It aids in storytelling and communicating data findings effectively. ## Common Types of Data Visualizations 1.
**Bar Charts** - Useful for comparing quantities across categories. 2. **Line Graphs** - Ideal for showing trends over time. 3. **Pie Charts** - Best for illustrating proportions of a whole. 4. **Heat Maps** - Effective for displaying data density across a geographical area. ## Tools for Data Visualization | Tool | Description | Cost | |---------------|--------------------------------------------------|-------------| | Tableau | Leading data visualization tool | Subscription | | Microsoft Excel | Popular for creating basic charts and graphs | License fee | | Power BI | Business analytics service from Microsoft | Subscription | | Google Data Studio | Free online tool for data visualization | Free | ## Conclusion Data visualization is a crucial technique for data analysis and communication. By implementing effective visualization methods and using appropriate tools, organizations can greatly enhance their decision-making processes. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 23 Context: # Chapter 3 ## Learning This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter. Ok, let’s start with an example. Alice has a rather strange ailment. She is not able to recognize objects by their visual appearance. At home, she is doing just fine: her mother explained to Alice for every object in her house what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room, she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting." What is the matter with Alice? 
Nothing is wrong with her memory because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object, she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice’s disease is so rare is understandable; there must have been a strong selection pressure against this disease. Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to infer the possibility that this animal may be dangerous logically. Alice’s contemporaries noticed that the animal was yellow-brown, had manes, etc., and immediately un- #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 25 Context: # Information Theory and Image Compression The part of the information that does not carry over to the future, the unpredictable information, is called "noise." There is also the information that is predictable, the learnable part of the information stream. The task of any learning algorithm is to separate the predictable part from the unpredictable part. Now imagine Bob wants to send an image to Alice. He has to pay 1 dollar cent for every bit that he sends. If the image were completely white, it would be really stupid of Bob to send the message: ``` pixel 1: white, pixel 2: white, pixel 3: white, .... ``` He could just have sent the message: ``` all pixels are white! ``` The blank image is completely predictable but carries very little information. 
Now imagine an image that consists of white noise (your television screen if the cable is not connected). To send the exact image Bob will have to send:

```
pixel 1: white, pixel 2: black, pixel 3: black, ....
```

Bob cannot do better because there is no predictable information in that image, i.e., there is no structure to be modeled. You can imagine playing a game, revealing one pixel at a time to someone and paying him for every next pixel he predicts correctly. For the white image you can predict perfectly; for the noisy picture you would be guessing at random. Real pictures are in between: some pixels are very hard to predict, while others are easier. To compress the image, Bob can extract rules such as: always predict the same color as the majority of the pixels next to you, except when there is an edge. These rules constitute the model for the regularities of the image. Instead of sending the entire image pixel by pixel, Bob will now first send his rules and ask Alice to apply the rules. Every time the rule fails, Bob also sends a correction:

```
pixel 103: white, pixel 245: black.
```

A few rules and two corrections are obviously cheaper than 256 pixel values and no rules. There is one fundamental tradeoff hidden in this game. Since Bob is sending only a single image, it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all pixel values. If he were sending 1 billion images it would pay off to first send the complicated model because he would be saving a fraction of all bits for every image. On the other hand, if Bob wants to send 2 pixels, there really is no need to send a model whatsoever. Therefore, the size of Bob's model depends on the amount of data he wants to transmit. In this context, the boundary between what is model and what is noise depends on how much data we are dealing with!
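Bob's rule-plus-corrections scheme can be made concrete with a tiny sketch. The 1-D "images", the predict-from-the-left-neighbor rule, and the function name are illustrative assumptions, not from the text; the point is only that structured data needs few corrections and noise needs many:

```python
# Sketch of Bob's scheme on 1-D "images" of black (0) and white (1) pixels.
# Rule (the model): predict that each pixel has the same color as its left
# neighbor. A correction must be sent wherever the rule fails.

def corrections_needed(pixels):
    """Return the positions where the left-neighbor rule mispredicts."""
    return [i for i in range(1, len(pixels)) if pixels[i] != pixels[i - 1]]

blank = [1] * 16                                            # all white
stripe = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]  # structured
noise = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # no structure

print(len(corrections_needed(blank)))   # 0: fully predictable
print(len(corrections_needed(stripe)))  # 3: one correction per edge
print(len(corrections_needed(noise)))   # many: the rule barely helps
```

The blank image costs nothing beyond the rule, the striped image costs a correction per edge, and for the noise image the rule-based code is barely cheaper than sending every pixel, exactly the spectrum described above.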
If we use a model that is too complex, we overfit to the data at hand, i.e., part of the model represents noise. On the other hand, if we use a too simple model we "underfit" (over-generalize) and valuable structure remains unmodeled. Both lead to sub-optimal compression of the image. The compression game can therefore be used to find the right size of model complexity for a given dataset. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 27 Context: # 3.1 In a Nutshell Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. Learning = generalization = abstraction = compression. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 28 Context: # CHAPTER 3. LEARNING ## Introduction Learning involves acquiring knowledge, skills, attitudes, or competencies. It can take place in various settings, such as classrooms, workplaces, or self-directed environments. This chapter discusses the types and processes of learning. ## Types of Learning 1. **Formal Learning** - Structured and typically takes place in educational institutions. - Includes degrees, diplomas, and certifications. 2. **Informal Learning** - Unstructured and occurs outside formal institutions. - Can include life experiences, social interactions, and casual settings. 3. **Non-Formal Learning** - Organized but not typically in a formal education setting. - Often community-based, such as workshops and training sessions. ## Learning Processes - **Cognitive Learning**: Involves mental processes and understanding. 
- **Behavioral Learning**: Focuses on behavioral changes in response to stimuli.
- **Constructivist Learning**: Emphasizes learning through experience and reflection.

## Table of Learning Theories

| Theory | Key Contributor | Description |
|--------------------------|----------------------|--------------------------------------------------|
| Behaviorism | B.F. Skinner | Learning as a change in behavior due to reinforcement. |
| Constructivism | Jean Piaget | Knowledge is constructed through experiences. |
| Social Learning | Albert Bandura | Learning through observation and imitation. |

## Conclusion

Understanding the different types and processes of learning can help educators and learners optimize educational experiences. Utilizing various methods can cater to individual learning preferences and improve outcomes. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 31 Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generality also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments.
The mouse is similarly confronted with the problem of whether he should try out a new alley in the maze that could cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. RL is a very exciting field of research because of its biological relevance. Do we not also have to figure out how the world works and survive in it?

Let's go back to the news-articles. Assume we have control over which article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze: given that it needs to explore, where should it explore? Surely it will try to seek out alleys that look promising, i.e. alleys that it expects to maximize its reward. We call the problem of finding the next best data-case to investigate "active learning".

One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example, a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don't have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it has hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones.
The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning". #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 32 Context: # Chapter 4: Types of Machine Learning ## 4.1 In a Nutshell There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes, unsupervised learning tries to discover interesting structure in data, semi-supervised learning uses both labeled and unlabeled data to improve predictive performance, reinforcement learning can handle simple feedback in the form of delayed reward, active learning optimizes the next sample to include in the learning algorithm, and multi-task learning deals with sharing common model components between related learning tasks. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 35 Context: 5.1 The Idea In a Nutshell ========================== To classify a new data item, you first look for the \( k \) nearest neighbors in feature space and assign it the same label as the majority of these neighbors. Because 98 noisy dimensions have been added, this effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with much care and preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection" on which a whole book in itself could be written. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 40 Context: 28 # CHAPTER 6. 
THE NAIVE BAYESIAN CLASSIFIER

For ham emails, we compute exactly the same quantity,

$$ P_{ham}(X_i = j) = \frac{ \# \text{ ham emails for which the word } j \text{ was found} }{ \text{total } \# \text{ of ham emails} } $$ (6.5)

$$ = \frac{\sum_{n} \mathbb{I}(X_{n} = j \land Y_{n} = 0)}{ \sum_{n} \mathbb{I}(Y_{n} = 0)} $$ (6.6)

Both these quantities should be computed for all words or phrases (or more generally attributes). We have now finished the phase where we estimate the model from the data. We will often refer to this phase as "learning" or training a model. The model helps us understand how data was generated in some approximate setting. The next phase is that of prediction or classification of new email.

## 6.3 Class-Prediction for New Instances

New email does not come with a label ham or spam (if it did, we could throw spam into the spam-box right away). What we do see are the attributes $\{X_i\}$. Our task is to guess the label based on the model and the measured attributes. The approach we take is simple: calculate whether the email has a higher probability of being generated from the spam or the ham model. For example, because the word "viagra" has a tiny probability of being generated under the ham model it will end up with a higher probability under the spam model. But clearly, all words have a say in this process. It’s like a large committee of experts, one for each word; each member casts a vote and can say things like: "I am 99% certain it's spam", or "It’s almost definitely not spam (0.1% spam)". Each of these opinions will be multiplied together to generate a final score. We then figure out whether ham or spam has the highest score. There is one little practical caveat with this approach, namely that the product of a large number of probabilities, each of which is necessarily smaller than one, very quickly gets so small that your computer can’t handle it. There is an easy fix though.
Instead of multiplying probabilities as scores, we use the logarithms of those probabilities and add the logarithms. This is numerically stable and leads to the same conclusion because if \(a > b\) then we also have \(\log(a) > \log(b)\) and vice versa. In equations we compute the score as follows: $$ S_{spam} = \sum_{i} \log P_{spam}(X_i = e_i) + \log P(spam) $$ (6.7) #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 44 Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER ## Introduction The Naive Bayesian Classifier is a simple yet powerful algorithm based on Bayes' theorem, used extensively in machine learning for classification tasks. ## Key Concepts 1. **Bayes' Theorem**: \[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \] 2. **Assumptions**: - Features are statistically independent given the class label. - The prior distribution of each class. ## Types of Naive Bayes Classifiers - **Gaussian Naive Bayes**: Assumes continuous data follows a normal distribution. - **Multinomial Naive Bayes**: Primarily used for discrete features, such as word counts. - **Bernoulli Naive Bayes**: Similar to multinomial but assumes binary features. ## Classification Process 1. **Calculate Prior Probabilities**: \[ P(C_k) = \frac{N_k}{N} \] 2. **Calculate Likelihoods**: - For Gaussian: \[ P(x|C_k) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \] 3. **Apply Bayes' Theorem**: \[ P(C_k|x) = \frac{P(x|C_k) \cdot P(C_k)}{P(x)} \] 4. **Choose Class with Maximum Posterior**: \[ \hat{C} = \arg\max_{C_k} P(C_k|x) \] ## Advantages - Easy to implement. - Requires a small amount of training data. - Performs well in multi-class problems. ## Disadvantages - Assumption of feature independence is often violated in real-world applications. - Poor performance when features are highly correlated. 
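The log-score of equation (6.7) can be sketched as follows. The tiny vocabulary, the counts, and the helper names are made up for illustration; the add-one (Laplace) smoothing is an added assumption, not part of the text's estimator, used here only so that an unseen word does not produce a zero probability:

```python
import math

# Hypothetical word-presence counts from a toy training set.
n_spam, n_ham = 4, 6
spam_counts = {"viagra": 3, "meeting": 1}  # spam emails containing each word
ham_counts = {"viagra": 0, "meeting": 5}   # ham emails containing each word

def word_prob(counts, n_class, word):
    # Relative frequency as in (6.5)/(6.6), with add-one smoothing (an
    # assumption beyond the text) to avoid zero probabilities.
    return (counts.get(word, 0) + 1) / (n_class + 2)

def log_score(words, counts, n_class, prior):
    # Equation-(6.7)-style score: sum of log word probabilities + log prior.
    return sum(math.log(word_prob(counts, n_class, w)) for w in words) \
        + math.log(prior)

email = ["viagra", "meeting"]
s_spam = log_score(email, spam_counts, n_spam, n_spam / (n_spam + n_ham))
s_ham = log_score(email, ham_counts, n_ham, n_ham / (n_spam + n_ham))
print("spam" if s_spam > s_ham else "ham")  # → spam
```

Summing logarithms keeps each committee member's vote in play while staying numerically stable, exactly the fix described above.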
## Conclusion

The Naive Bayesian Classifier serves as an excellent baseline for classification tasks and is crucial for understanding more advanced techniques in machine learning. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 50 Context: # CHAPTER 7. THE PERCEPTRON

The function `tanh(·)` is plotted in Figure ??. It shows that the cost can never be larger than 2, which ensures robustness against outliers. We leave it to the reader to derive the gradients and formulate the gradient descent algorithm.

## 7.3 The Idea In a Nutshell

Figure ?? tells the story. One assumes that your data can be separated by a line. Any line can be represented by `w^T x = α`. Data cases from one class satisfy `w^T x_n ≤ α` while data cases from the other class satisfy `w^T x_n ≥ α`. To achieve that, you write down a cost function that penalizes data cases falling on the wrong side of the line and minimize it over `(w, α)`. For a test case, you simply compute the sign of `w^T x_{test} - α` to make a prediction as to which class it belongs to. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 57 Context: # 8.1. THE NON-SEPARABLE CASE

In summary, as before, for points not on the support plane and on the correct side we have \( \xi_i = 0 \) (all constraints inactive). On the support plane, we still have \( \xi_i = 0 \), but now \( \alpha_i > 0 \). Finally, for data-cases on the wrong side of the support hyperplane \( \alpha_i \) maxes out to \( \alpha_i = C \) and the \( \xi_i \) balance the violation of the constraint such that \( y_i(w^T x_i - b) - 1 + \xi_i = 0 \). Geometrically, we can calculate the gap between support hyperplane and the violating data-case to be \( \xi_i/||w|| \).
This can be seen because the plane defined by \( y_i(w^T x - b) - 1 + \xi_i = 0 \) is parallel to the support plane at a distance \( (1 + y_i b - \xi_i)/||w|| \) from the origin. Since the support plane is at a distance \( (1 + y_i b)/||w|| \), the result follows. Finally, we need to convert to the dual problem to solve it efficiently and to kernelise it. Again, we use the KKT equations to get rid of \( w \), \( b \), and \( \xi \).

## Formulation

\[
\begin{align*}
\text{maximize} & \quad L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \\
\text{subject to} & \quad \sum_{i} \alpha_i y_i = 0 \\
& \quad 0 \leq \alpha_i \leq C \quad \forall i
\end{align*}
\]

Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers \( \alpha_i \) which now live in a box. This constraint is derived from the fact that \( \alpha_i = C - \mu_i \) and \( \mu_i \geq 0 \). We also note that it only depends on inner products \( x_i^T x_j \) which are ready to be kernelised. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 58 Context: # CHAPTER 8: SUPPORT VECTOR MACHINES

## Introduction

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outliers detection.

## Key Concepts

1. **Hyperplane**: A hyperplane is a decision boundary that helps to categorize data points.
2. **Support Vectors**: Support vectors are the data points that are closest to the hyperplane and influence its position and orientation.
3. **Margin**: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin.

## Implementation Steps

- **Step 1**: Choose the appropriate kernel (linear, polynomial, RBF).
- **Step 2**: Train the SVM model using the training dataset.
- **Step 3**: Evaluate the model using a test dataset.
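The implementation steps above can be sketched with scikit-learn, assuming it is installed; the toy 1-D data set is an illustrative assumption, chosen so that class 1 sits between two groups of class 0 and a nonlinear (RBF) kernel is genuinely needed:

```python
import numpy as np
from sklearn.svm import SVC

# Class 1 lies between two clusters of class 0, so no single threshold on
# the one feature separates the classes: an RBF kernel is required.
X = np.array([[-3.0], [-2.5], [0.0], [0.5], [2.5], [3.0]])
y = np.array([0, 0, 1, 1, 0, 0])

# Step 1: choose a kernel; step 2: train. The box constraint C is the same
# C that bounds the dual variables alpha_i in the QP above.
clf = SVC(kernel="rbf", C=100.0)
clf.fit(X, y)

# Step 3: evaluate on held-out points (here just two probe inputs).
print(clf.predict([[0.2], [-2.8]]))
```

Shrinking `C` tightens the box on the \( \alpha_i \), allowing margin violations in exchange for a smoother decision function; the support vectors are available afterwards as `clf.support_vectors_`.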
## Advantages of SVM

- Effective in high-dimensional spaces.
- Robust against overfitting, especially in high-dimensional datasets.

## Disadvantages of SVM

- Less effective on very large datasets.
- Poor performance with overlapping classes.

## Conclusion

Support Vector Machines are powerful tools for classification and regression tasks, offering advantages in high-dimensional spaces while having limitations in very large datasets. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 61 Context: From the complementary slackness conditions we can read the sparseness of the solution out:

\[ \alpha_i (w^T \Phi_i + b - y_i - \epsilon - \xi_i) = 0 \tag{9.6} \]

\[ \hat{\alpha}_i (y_i - w^T \Phi_i - b - \epsilon - \hat{\xi}_i) = 0 \tag{9.7} \]

\[ \xi_i \hat{\xi}_i = 0, \quad \alpha_i \hat{\alpha}_i = 0 \tag{9.8} \]

where we added the last conditions by hand (they don’t seem to directly follow from the formulation). Now we clearly see that if a case is above the tube, \(\hat{\xi}_i\) will take on its smallest possible value in order to make the constraints satisfied, \(\hat{\xi}_i = y_i - w^T \Phi_i - b - \epsilon\). This implies that \(\hat{\alpha}_i\) will take on a positive value, and the farther outside the tube the larger the \(\hat{\alpha}_i\) (you can think of it as a compensating force). Note that in this case \(\alpha_i = 0\). A similar story goes for a case below the tube, where \(\xi_i > 0\) and \(\alpha_i > 0\). If a data case is inside the tube the \(\alpha_i, \hat{\alpha}_i\) are necessarily zero, and hence we obtain sparseness. We now change variables to make this optimization problem look more similar to the SVM and ridge-regression case. Introduce \(\beta_i = \hat{\alpha}_i - \alpha_i\) and use \(\alpha_i \hat{\alpha}_i = 0\) to write \(\alpha_i + \hat{\alpha}_i = |\beta_i|\).
\[ \text{maximize} \quad -\frac{1}{2} \sum_{i,j} \beta_i \beta_j \left(K_{ij} + \frac{1}{C} \delta_{ij}\right) + \sum_{i} \beta_i y_i - \epsilon \sum_{i} |\beta_i| \tag{9.9} \] subject to \[ \sum_{i} \beta_i = 0 \] From the slackness conditions we can also find a value for \(b\) (similar to the SVM case). Also, as usual, the prediction for a new data-case is given by: \[ y = w^T \Phi(x) + b = \sum_{i} \beta_i K(x_i, x) + b \tag{9.10} \] It is an interesting exercise for the reader to work her way through this case. > **Note:** By the way, we could not use the trick from ridge-regression of defining a constant feature \(\phi_0 = 1\) and setting \(b = w_0\). The reason is that the objective does not depend on \(b\). #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 66 Context: # CHAPTER 10. KERNEL RIDGE REGRESSION One big disadvantage of ridge-regression is that we don’t have sparseness in the \( \alpha \) vector, i.e., there is no concept of support vectors. Sparseness is useful because when we test a new example, we only have to sum over the support vectors, which is much faster than summing over the entire training set. In the SVM, the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that if a constraint was inactive, then the multiplier \( \alpha_i \) was zero. There is no such effect here. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 70 Context: # CHAPTER 11: KERNEL K-MEANS AND SPECTRAL CLUSTERING ## 11.1 Introduction Kernel K-means is an extension of the K-means clustering algorithm. It allows for clustering in a feature space that is not linearly separable, using a kernel function to transform the original data. ## 11.2 Algorithm The steps of the Kernel K-means algorithm are as follows: 1. **Choose a kernel function** \( K(x_i, x_j) \). 2.
**Initialize** \( k \) centroids randomly. 3. **Assign points** to the nearest centroid using the kernel. 4. **Update centroids** by computing the mean of the assigned points in the feature space. 5. **Repeat** steps 3 and 4 until convergence. ## 11.3 Spectral Clustering Spectral clustering is another clustering technique that uses the eigenvalues of the similarity matrix of the data. ### 11.3.1 Algorithm 1. Compute the **similarity matrix**. 2. Compute the **Laplacian matrix** from the similarity matrix. 3. Compute the eigenvalues and eigenvectors of the Laplacian matrix. 4. Use the top \( k \) eigenvectors to form a new feature space. 5. Apply K-means to cluster the data in the new space. ## 11.4 Conclusion Kernel K-means and spectral clustering provide powerful methods for clustering in complex spaces, enabling better data segmentation in many applications. ## References - *Author Name*, *Title of the Source*, Year - *Author Name*, *Title of the Source*, Year For further reading, explore the works on kernel methods and spectral analysis in clustering contexts. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 74 Context: # CHAPTER 12. KERNEL PRINCIPAL COMPONENTS ANALYSIS Hence the centered kernel in terms of the new features is given by: \[ K^c_{ij} = \left( \Phi_i - \frac{1}{N} \sum_k \Phi_k \right) \left( \Phi_j - \frac{1}{N} \sum_l \Phi_l \right)^T \] (12.12) \[ = \Phi_i \Phi_j^T - \frac{1}{N} \sum_k \Phi_k \Phi_j^T - \frac{1}{N} \sum_l \Phi_i \Phi_l^T + \frac{1}{N^2} \sum_{k, l} \Phi_k \Phi_l^T \] (12.13) \[ = K_{ij} - \kappa_i - \kappa_j + k \] (12.14) with \[ \kappa_i = \frac{1}{N} \sum_k K_{ik} \] (12.15) and \[ k = \frac{1}{N^2} \sum_{ij} K_{ij} \] (12.16) Hence, we can compute the centered kernel in terms of the non-centered kernel alone, and no features need to be accessed.
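The centering identity above (computing the centered kernel from the uncentered one, without touching the features) takes a few lines of numpy. A sketch; the Gaussian toy data and the choice of a linear kernel for the cross-check are illustrative assumptions:

```python
import numpy as np

def center_kernel(K):
    """Centered kernel from the uncentered one: K^c_ij = K_ij - kappa_i - kappa_j + k."""
    kappa = K.mean(axis=1)              # kappa_i = (1/N) sum_k K_ik
    k = K.mean()                        # k = (1/N^2) sum_ij K_ij
    return K - kappa[:, None] - kappa[None, :] + k

# cross-check against explicit feature centering for a linear kernel (toy data)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Kc = center_kernel(X @ X.T)
Xc = X - X.mean(axis=0)
print(np.allclose(Kc, Xc @ Xc.T))       # True: Gram matrix of the centered features
```

For a linear kernel the features are the inputs themselves, so the result can be compared with the Gram matrix of mean-subtracted data; with a nonlinear kernel only the `center_kernel` route is available, which is the point of the derivation.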
At test-time we need to compute: \[ K^c(t_i, x_j) = \left[ \Phi(t_i) - \frac{1}{N} \sum_k \Phi(x_k) \right] \left[ \Phi(x_j) - \frac{1}{N} \sum_l \Phi(x_l) \right]^T \] (12.17) Using a similar calculation (left for the reader) you can find that this can be expressed easily in terms of \( K(t_i, x_j) \) and \( K(x_i, x_j) \) as follows: \[ K^c(t_i, x_j) = K(t_i, x_j) - \kappa(t_i) - \kappa_j + k \] (12.18) where \( \kappa(t_i) = \frac{1}{N} \sum_k K(t_i, x_k) \). #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 76 Context: # CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS The scatter matrices are: $$ S_B = \sum_{c} N_c(\mu_c - \bar{x})(\mu_c - \bar{x})^T \tag{13.2} $$ $$ S_W = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T \tag{13.3} $$ where, $$ \mu_c = \frac{1}{N_c} \sum_{i \in c} x_i \tag{13.4} $$ $$ \bar{x} = \frac{1}{N} \sum_{i} x_i = \frac{1}{\sum_{c} N_c} \sum_{c} N_c \mu_c \tag{13.5} $$ and \(N_c\) is the number of cases in class \(c\). Oftentimes you will see that for 2 classes \(S_B\) is defined as \(\tilde{S}_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\). This is the scatter of the mean of class 1 with respect to the mean of class 2, and you can show that \(S_B = \frac{N_1 N_2}{N} \tilde{S}_B\); but since this boils down to multiplying the objective by a constant, it makes no difference to the final solution. Why does this objective make sense? Well, it says that a good solution is one where the class means are well separated, measured relative to the (sum of the) variances of the data assigned to a particular class. This is precisely what we want, because it implies that the gap between the classes is expected to be big.
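For two classes, the between-class scatter reduces to a multiple of \((\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\), with the multiple \(N_1 N_2 / N\). A quick numerical check; the random data and class sizes are illustrative assumptions:

```python
import numpy as np

# two classes with different sizes; the data are illustrative
rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(4, 2))    # class 1, N1 = 4
X2 = rng.normal(loc=3.0, size=(6, 2))    # class 2, N2 = 6
N1, N2 = len(X1), len(X2)
N = N1 + N2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
mu = np.vstack([X1, X2]).mean(axis=0)

# between-class scatter as defined in (13.2)
S_B = N1 * np.outer(mu1 - mu, mu1 - mu) + N2 * np.outer(mu2 - mu, mu2 - mu)
# two-class pairwise form (mu1 - mu2)(mu1 - mu2)^T
S_B_pair = np.outer(mu1 - mu2, mu1 - mu2)

print(np.allclose(S_B, (N1 * N2 / N) * S_B_pair))   # True
```

Since the two forms differ only by the scalar \(N_1 N_2 / N\), the Fisher objective is rescaled by a constant and its maximizer is unchanged, as the text notes.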
It is also interesting to observe that since the total scatter, $$ S_T = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \tag{13.6} $$ is given by \(S_T = S_W + S_B\), the objective can be rewritten as, $$ J(w) = \frac{w^T S_T w}{w^T S_W w} - 1 \tag{13.7} $$ and hence can be interpreted as maximizing the total scatter of the data while minimizing the within scatter of the classes. An important property to notice about the objective \(J\) is that it is invariant w.r.t. rescalings of the vectors, \(w \rightarrow \alpha w\). Hence, we can always choose \(w\) such that the denominator is simply \(w^T S_W w = 1\), since it is a scalar. For this reason we can transform the problem of maximizing \(J\) into the following constrained #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 79 Context: # 13.2 A CONSTRAINED CONVEX PROGRAMMING FORMULATION OF FDA The eigenvalue equation scales as \( \mathcal{O}(n^3) \), which is certainly expensive for many datasets. More efficient optimization schemes solving a slightly different problem and based on efficient quadratic programs exist in the literature. Projections of new test points into the solution space can be computed by: \[ w^T\Phi(x) = \sum_{i} \alpha_i K(x_i, x) \tag{13.19} \] as usual. In order to classify the test point, we still need to divide the space into regions which belong to one class. The easiest possibility is to pick the cluster with the smallest Mahalanobis distance: \[ d(x, \mu_c) = \frac{(x - \mu_c)^2}{\sigma_c^2} \tag{13.20} \] where \( \mu_c \) and \( \sigma_c \) represent the class mean and standard deviation in the 1-d projected space respectively. Alternatively, one could train any classifier in the 1-d subspace. One very important issue that we did not pay attention to is regularization. Clearly, as it stands the kernel machine will overfit.
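The scatter decomposition \(S_T = S_W + S_B\) used above, and the resulting identity between the two forms of the Fisher objective, can be verified numerically. A sketch; the three Gaussian classes and the random projection \(w\) are illustrative assumptions:

```python
import numpy as np

# three classes of 5 points each in 3 dimensions; the data are illustrative
rng = np.random.default_rng(2)
classes = [rng.normal(loc=c, size=(5, 3)) for c in (0.0, 2.0, 4.0)]
X = np.vstack(classes)
xbar = X.mean(axis=0)

# within-class, between-class, and total scatter as in (13.2), (13.3), (13.6)
S_W = sum((Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0)) for Xc in classes)
S_B = sum(len(Xc) * np.outer(Xc.mean(axis=0) - xbar, Xc.mean(axis=0) - xbar)
          for Xc in classes)
S_T = (X - xbar).T @ (X - xbar)

print(np.allclose(S_T, S_W + S_B))          # True: S_T = S_W + S_B

# consequently w^T S_B w / w^T S_W w equals w^T S_T w / w^T S_W w - 1 for any w
w = rng.normal(size=3)
print(np.isclose((w @ S_B @ w) / (w @ S_W @ w),
                 (w @ S_T @ w) / (w @ S_W @ w) - 1))   # True
```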
To regularize, we can add a term to the denominator: \[ S_W \rightarrow S_W + \beta I \tag{13.21} \] By adding a diagonal term to this matrix, we ensure that very small eigenvalues are bounded away from zero, which improves numerical stability in computing the inverse. If we write the Lagrangian formulation, where we maximize a constrained quadratic form in \( \alpha \), the extra term appears as a penalty proportional to \( ||\alpha||^2 \), which acts as a weight decay term, favoring smaller values of \( \alpha \) over larger ones. Fortunately, the optimization problem has exactly the same form in the regularized case. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 81 Context: # Chapter 14 ## Kernel Canonical Correlation Analysis Imagine you are given 2 copies of a corpus of documents, one written in English, the other written in German. You may consider an arbitrary representation of the documents, but for definiteness we will use the “vector space” representation, where there is an entry for every possible word in the vocabulary and a document is represented by count values for every word, i.e., if the word “the” appeared 12 times and is the first word in the vocabulary, we have \( X_1(doc) = 12 \), etc. Let’s say we are interested in extracting low dimensional representations for each document. If we had only one language, we could consider running PCA to extract directions in word space that carry most of the variance. This has the ability to infer semantic relations between the words, such as synonymy, because if words tend to co-occur often in documents, i.e., they are highly correlated, they tend to be combined into a single dimension in the new space. These spaces can often be interpreted as topic spaces. If we have two translations, we can try to find projections of each representation separately such that the projections are maximally correlated.
Hopefully, this implies that they represent the same topic in two different languages. In this way we can extract language-independent topics. Let \( x \) be a document in English and \( y \) a document in German. Consider the projections: \( u = a^Tx \) and \( v = b^Ty \). Also assume that the data have zero mean. We now consider the following objective: \[ \rho = \frac{E[uv]}{\sqrt{E[u^2]E[v^2]}} \tag{14.1} \] #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 88 Context: # APPENDIX A. ESSENTIALS OF CONVEX OPTIMIZATION Complementary slackness is easily derived by, \[ f_0(x^*) = L_D(\lambda^*, \nu^*) = \inf_x \left( f_0(x) + \sum_{i} \lambda_i f_i(x) + \sum_{j} \nu_j h_j(x) \right) \] \[ \leq f_0(x^*) + \sum_{i} \lambda_i f_i(x^*) + \sum_{j} \nu_j h_j(x^*) \quad (A.13) \] \[ \leq f_0(x^*) \quad (A.14) \] where the first line follows from Eqn. A.6, the second because the inf is always smaller than the value at any particular \( x^* \), and the last because \( f_i(x^*) \leq 0, \lambda_i \geq 0 \) and \( h_j(x^*) = 0 \). Hence all inequalities are equalities; since each term is non-positive, each term must vanish separately. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 90 Context: # APPENDIX B. KERNEL DESIGN ## B.1 Term \[ \left(x^Ty\right)^s = (x_1y_1+x_2y_2+\ldots+x_ny_n)^s = \sum_{i_1+\ldots+i_n=s} \frac{s!}{i_1! \ldots i_n!} (x_1y_1)^{i_1}(x_2y_2)^{i_2} \ldots (x_ny_n)^{i_n} \quad (B.4) \] Taken together with eqn. B.3, we see that the features correspond to: \[ \phi_I(x) = \sqrt{\frac{d!}{(d - s)!} \frac{1}{i_1! i_2! \ldots i_n!} R^{d-s}}\; x_1^{i_1} x_2^{i_2} \ldots x_n^{i_n} \quad \text{with } i_1 + i_2 + \ldots + i_n = s \leq d \quad (B.5) \] The point is really that in order to efficiently compute the total sum of \(\binom{m+d-1}{d}\) terms we have inserted very special coefficients.
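That these special coefficients make the feature inner product collapse to \((x^Ty + R)^d\) can be checked by brute force for small \(n\) and \(d\). A sketch; placing the square root of each coefficient inside the feature (so that \(K(x,y) = \sum_I \phi_I(x)\phi_I(y)\) reproduces the kernel) is an assumption consistent with the expansion above, and the inputs are illustrative:

```python
from itertools import product
from math import factorial, sqrt

def poly_features(x, R, d):
    """phi_I(x) = sqrt( d!/((d-s)! i_1!...i_n!) R^(d-s) ) x_1^i_1 ... x_n^i_n, s = sum(I) <= d."""
    n = len(x)
    feats = []
    for I in product(range(d + 1), repeat=n):
        s = sum(I)
        if s > d:
            continue
        coef = factorial(d) / factorial(d - s) * R ** (d - s)
        for i in I:
            coef /= factorial(i)
        mono = 1.0
        for xi, ii in zip(x, I):
            mono *= xi ** ii
        feats.append(sqrt(coef) * mono)
    return feats

x, y, R, d = [0.5, -1.0], [2.0, 0.3], 1.5, 3
explicit = sum(fx * fy for fx, fy in zip(poly_features(x, R, d), poly_features(y, R, d)))
kernel = (sum(xi * yi for xi, yi in zip(x, y)) + R) ** d
print(explicit, kernel)   # both approx. 10.648
```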
The only true freedom we have left is in choosing \(R\): for larger \(R\) we down-weight higher order polynomials more. The question we want to answer is: how much freedom do we have in choosing different coefficients and still being able to compute the inner product efficiently? ## B.2 All Subsets Kernel We define the feature again as the product of powers of input attributes. However, in this case, the choice of power is restricted to \(\{0,1\}\), i.e. the feature is present or absent. For \(n\) input dimensions (number of attributes) we have \(2^n\) possible combinations. Let’s compute the kernel function: \[ K(x,y) = \sum_{I} \phi_I(x) \phi_I(y) = \sum_{I} \prod_{i \in I} x_i y_i = \prod_{i=1}^{n}(1+x_i y_i) \quad (B.6) \] where the last identity follows from the fact that, with \(z_i = x_i y_i\), \[ \prod_{i=1}^{n} (1 + z_i) = 1 + \sum_{i} z_i + \sum_{i < j} z_iz_j + \ldots + z_1 z_2 \ldots z_n \quad (B.7) \] i.e. a sum over all possible combinations. Note that in this case again, it is much more efficient to compute the kernel directly than to sum over the features. Also note that in this case there is no decaying factor multiplying the monomials.
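The identity behind the all subsets kernel (the sum over all \(2^n\) subset features equals the \(n\)-term product) is easy to verify by enumerating the subsets; the input vectors are illustrative assumptions:

```python
from itertools import combinations

def all_subsets_kernel_explicit(x, y):
    """Sum over all 2^n subsets I of prod_{i in I} x_i * y_i (empty product = 1)."""
    n = len(x)
    total = 0.0
    for r in range(n + 1):
        for I in combinations(range(n), r):
            term = 1.0
            for i in I:
                term *= x[i] * y[i]
            total += term
    return total

def all_subsets_kernel(x, y):
    """Closed form prod_i (1 + x_i * y_i), i.e. O(n) instead of O(2^n)."""
    k = 1.0
    for xi, yi in zip(x, y):
        k *= 1.0 + xi * yi
    return k

x, y = [1.0, 2.0, -1.0], [0.5, 1.0, 2.0]
print(all_subsets_kernel_explicit(x, y), all_subsets_kernel(x, y))   # -4.5 -4.5
```

The explicit sum touches all \(2^n\) subsets while the product form costs \(n\) multiplications, which is the efficiency gap the text refers to.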
**Memory Management** - Allocating memory for processes. - Handling virtual memory. ## Design Considerations - **Performance**: The kernel should be efficient in resource management. - **Scalability**: Must support various hardware platforms. - **Security**: Ensures that processes cannot access each other’s memory. ## Conclusion The kernel design is crucial for the overall system performance and functionality. Proper architecture and considerations can significantly enhance the efficiency and security of the operating system. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 93 Context: # Bibliography 1. Author, A. (Year). *Title of the Book*. Publisher. 2. Author, B. (Year). *Title of the Article*. *Journal Name*, Volume(Issue), Page Range. DOI or URL if available. 3. Author, C. (Year). *Title of the Website*. Retrieved from URL. 4. Author, D. (Year). *Title of the Thesis or Dissertation*. University Name. - Point 1 - Point 2 - Point 3 ## Additional References | Author | Title | Year | Publisher | |-------------------|-------------------------|------|------------------| | Author, E. | *Title of Article* | Year | Publisher Name | | Author, F. | *Title of Conference* | Year | Conference Name | ### Notes - Ensure that all entries follow the same citation style for consistency. - Check for publication dates and any updates that may be required. #################### File: Algebraic%20Topology%20ATbib-ind.pdf Page: 3 Context: # Bibliography C. R. F. Maunder, *Algebraic Topology*, Cambridge Univ. Press, 1980 (reprinted by Dover Publications). J. P. May, *Simplicial Objects in Algebraic Topology*, Van Nostrand, 1967 (reprinted by Univ. Chicago Press). J. P. May, *A Concise Course in Algebraic Topology*, Univ. Chicago Press, 1999. J. W. Milnor, *Topology from the Differentiable Viewpoint*, Univ. Press of Virginia, 1965. J. W. Milnor and J. D. Stasheff, *Characteristic Classes*, Ann. of Math. 
Studies 76, 1974. R. E. Mosher and M. C. Tangora, *Cohomology Operations and Applications in Homotopy Theory*, Harper and Row, 1968. V. V. Prasolov, *Elements of Homology Theory*, A.M.S., 2007. D. C. Ravenel, *Complex Cobordism and Stable Homotopy Groups of Spheres*, Academic Press, 1986. D. C. Ravenel, *Nilpotence and Periodicity in Stable Homotopy Theory*, Ann. of Math. Studies 128, 1992. E. Rees and J. D. S. Jones, eds., *Homotopy Theory: Proceeding of the Durham Symposium 1985*, Cambridge Univ. Press, 1987. D. Rolfsen, *Knots and Links*, Publish or Perish, 1976. H. Seifert and W. Threlfall, *Lehrbuch der Topologie*, Teubner, 1934. P. Selick, *Introduction to Homotopy Theory*, Fields Institute Monographs 9, A.M.S., 1997. J.-P. Serre, *A Course in Arithmetic*, Springer-Verlag GTM 7, 1973. E. H. Spanier, *Algebraic Topology*, McGraw-Hill, 1966 (reprinted by Springer-Verlag). N. Steenrod, *The Topology of Fibre Bundles*, Princeton Univ. Press, 1951. N. E. Steenrod and D. B. A. Epstein, *Cohomology Operations*, Ann. of Math. Studies 50, 1962. R. E. Stong, *Notes on Cobordism Theory*, Princeton Univ. Press, 1968. D. Sullivan, *Geometric Topology*, xeroxed notes from MIT, 1970. R. M. Switzer, *Algebraic Topology*, Springer-Verlag, 1975. H. Toda, *Composition Methods in Homotopy Groups of Spheres*, Ann. of Math. Studies 49, 1962. K. Varadarajan, *The Finiteness Obstruction of C.T.C. Wall*, Wiley, 1989. C. A. Weibel, *An Introduction to Homological Algebra*, Cambridge Univ. Press, 1994. G. W. Whitehead, *Elements of Homotopy Theory*, Springer-Verlag GTM 62, 1978. J. A. Wolf, *Spaces of Constant Curvature*, McGraw Hill, 1967. 6th ed. AMS Chelsea 2010. ## Papers J. F. Adams, "On the non-existence of elements of Hopf invariant one," *Ann. of Math.* 72 (1960), 20–104. J. F. Adams, "Vector fields on spheres," *Ann. of Math.* 75 (1962), 603–632. 
#################### File: Algebraic%20Topology%20ATbib-ind.pdf Page: 9 Context: # Index - face 103 - fiber 375 - fiber bundle 376, 431 - fiber homotopy equivalence 406 - fiber-preserving map 406 - fibration 375 - fibration sequence 409, 462 - finitely generated homology 423, 527 - finitely generated homotopy 364, 392, 423 - five-lemma 129 - fixed point 31, 73, 114, 179, 229, 493 - flag 436, 447 - frame 301, 381 - free action 73 - free algebra 227 - free group 42, 77, 85 - free product 41 - free product with amalgamation 92 - free resolution 193, 263 - Freudenthal suspension theorem 360 - function space 529 - functor 163 - fundamental class 236, 394 - fundamental group 26 - fundamental theorem of algebra 31 ## Galois correspondence - Galois correspondence 63 - general linear group \( GL_n \) 293 - good pair 114 - graded ring 212 - Gram-Schmidt orthogonalization 293, 382 - graph 6, 11, 83 - graph of groups 92 - graph product of groups 92 - Grassmann manifold 227, 381, 435, 439, 445 - groups acting on spheres 75, 135, 391 - Gysin sequence 438, 444 ### Homotopy - H-space 281, 342, 419, 420, 422, 428 - HNN extension 93 - hocolim 460, 462 - holim 462 - homologous cycles 106 - homology 106 - homology decomposition 465 - homology of groups 148, 423 - homology theory 160, 314, 454 - homotopy 3, 25 - homotopy equivalence 3, 10, 36, 346 - homotopy extension property 14 - homotopy fiber 407, 461, 479 - homotopy group 340 - homotopy with coefficients 462 - homotopy lifting property 60, 375, 379 - homotopy of attaching maps 13, 16 - homotopy type 3 - Hopf 134, 173, 222, 281, 285 - Hopf algebra 283 - Hopf bundle 361, 375, 377, 378, 392 - Hopf invariant 427, 447, 489, 490 - Hopf map 379, 380, 385, 427, 430, 474, 475, 498 - Hurewicz homomorphism 369, 486 - Hurewicz theorem 366, 371, 390 ### Induced fibrations - induced fibration 406 - induced homomorphisms 34, 110, 111, 118, 201, 213 - infinite loopspace 297 - invariance of dimension 126 - invariance of domain 172 - inverse limit 
312, 410, 462 - inverse path 27 - isomorphism of actions 70 - isomorphism of covering spaces 67 - iterated mapping cylinder 457, 466 ########## """QUERY: Please summarize the whole context. It is important that you include a summary for each file. All files should be included, so please make sure to go through the entire context""" Consider the chat history for relevant information. Use all information included. If query is already asked in the history double check the correctness of your answer and maybe correct your previous mistake. Use as much tokens as needed but at the same time be as efficient as possible. If you find information separated by a | in the context, it is a table formatted in Markdown - the whole context is formatted as md structure. Final Files Sources: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 1, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 2, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 3, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 4, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 8, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 11, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 13, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 14, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 29, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 34, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 38, 
A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 42, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 57, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 62, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 66, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 78, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 81, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 85, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 99, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 100, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 103, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 118, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 121, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 128, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 138, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 147, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 158, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 167, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 169, 
A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 172, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 175, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 198, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 202, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 212, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 221, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 226, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 227, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 230, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 233, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 236, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 244, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 251, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 252, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 257, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 1, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 4, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - 
Page 5, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 9, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 11, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 13, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 17, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 22, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 23, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 36, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 50, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 53, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 55, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 58, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 66, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 77, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 80, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 82, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 88, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 90, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 105, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 111, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 113, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 115, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 117, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 125, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 133, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 138, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 147, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 169, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 171, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 175, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 189, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 207, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 208, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 210, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 212, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 213, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 219, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 234, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 3, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 4, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 7, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 10, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 11, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 14, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 17, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 18, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 21, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 22, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 23, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 25, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 27, 
A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 28, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 32, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 35, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 40, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 44, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 50, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 57, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 58, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 61, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 66, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 70, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 74, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 76, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 79, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 81, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 88, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 90, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 92, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 93, Algebraic%20Topology%20ATbib-ind.pdf - Page 3, 
Algebraic%20Topology%20ATbib-ind.pdf - Page 9 ================================================== **Elapsed Time: 5.45 seconds** ================================================== FINAL ANSWER Answer: ================================================== **Elapsed Time: 0.00 seconds** ==================================================