INITIALIZATION
Knowledgebase: ki-dev-large
Base Query: can you give me 5 sentence summary with your context?
Model: gpt-4o
Use Curl?: None
==================================================
**Elapsed Time: 0.00 seconds**
==================================================


ROUTING
Query type: summary
==================================================
**Elapsed Time: 1.66 seconds**
==================================================


RAG PARAMETERS
Max Context To Include: 100
Lowest Score to Consider: 0.1
==================================================
**Elapsed Time: 8.28 seconds**
==================================================


VECTOR SEARCH ALGORITHM TO USE 
Use MMR search?: True
Use Similarity search?: False
==================================================
**Elapsed Time: 0.11 seconds**
==================================================


VECTOR SEARCH DONE 
==================================================
**Elapsed Time: 6.60 seconds**
==================================================


PRIMER 
Primer: 
            You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that
            can read knowledgebases through the "CONTEXT" that is included in the user's chat message.

        
            Your role is to act as an expert at summarization and analysis.

            In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality.

            Be honest by admitting when a topic falls outside your scope of knowledge, and suggest
            alternative avenues for obtaining information when necessary.

            Make effective use of chat history to avoid redundancy and enhance response relevance, continuously
            adapting to integrate all necessary details in your interactions.

            Use as much tokens as possible to provide a detailed response.
        
==================================================
**Elapsed Time: 0.19 seconds**
==================================================


FINAL QUERY 
Final Query: CONTEXT: ##########
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 1

Context: # A Cool Brisk Walk  
Through Discrete Mathematics

**Version 2.2.1**  
**Stephen Davies, Ph.D.**  

## Table of Contents
1. Introduction
2. Concepts
   - Sets
   - Functions
   - Relations
3. Applications
4. Conclusion

## Introduction
In this document, we explore various topics under discrete mathematics. 

## Concepts

### Sets
- Definition
- Notation
- Operations

### Functions
- Definition
- Types of functions
- Examples

### Relations
- Definition
- Properties

| Concept   | Description                       |
|-----------|-----------------------------------|
| Set      | A collection of distinct objects   |
| Function  | A relation from a set to itself    |
| Relation  | A connection between two sets       |

## Applications
The concepts of discrete mathematics can be applied in various fields, including computer science, information technology, and combinatorial design.

## Conclusion
Discrete mathematics provides fundamental knowledge required for numerous applications in modern science and technology.

Image Analysis: 

### Analysis of the Attached Visual Content

#### 1. Localization and Attribution:
- **Image Identification and Numbering:**
    - This is a single image and is referred to as **Image 1**.

#### 2. Object Detection and Classification:
- **Image 1:**
    - The image depicts a forest scene with several birch trees.
    - **Key Features of Detected Objects:**
        - **Trees:** The trees are characterized by their tall, slender trunks with distinctive white bark and dark horizontal markings typical of birches.
        - **Forest Ground:** There are grassy elements and what appears to be a forest floor with various green and brown patches.

#### 3. Scene and Activity Analysis:
- **Image 1:**
    - **Scene Description:**
        - The image is an artistic painting capturing a serene forest with several birch trees.
        - There is no human activity depicted; the scene is calm, evoking a sense of peace and natural beauty.

#### 4. Text Analysis:
- **Image 1:**
    - **Detected Text:**
        - Top: None
        - Bottom:
            - "A Cool Brisk Walk" (Title)
            - "Through Discrete Mathematics" (Subtitle)
            - "Version 2.2.1" (Version Information)
            - "Stephen Davies, Ph.D." (Author Name)
    - **Text Significance:**
        - The text suggests that this image is probably the cover of a book titled "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies, Ph.D. The version number indicates it is version 2.2.1, signifying a possibly updated or revised edition.

#### 8. Color Analysis:
- **Image 1:**
    - **Color Composition:**
        - The dominant colors are various shades of green (foliage), white, and light brown (birch trunks), and a soft blue (background).
    - **Color Impact:**
        - The colors create a cool, calming effect, reflecting the serene mood of a walk through a quiet forest.

#### 9. Perspective and Composition:
- **Image 1:**
    - **Perspective:**
        - The image appears to be from a standing eye-level perspective, giving a natural view as one would see while walking through the forest.
    - **Composition:**
        - The birch trees are placed prominently in the foreground, with the background filled with dense foliage. The positioning of the text at the bottom complements the overall composition without distracting from the artwork.

#### 10. Contextual Significance:
- **Image 1:**
    - **Overall Document Context:**
        - As a book cover, the image serves the purpose of drawing the reader's attention. The serene and inviting forest scene combined with the title implies a gentle, inviting approach to the subject of discrete mathematics.

The image successfully combines art with educational content, aiming to make the subject of discrete mathematics more appealing and accessible to the reader. The natural, calming scenery juxtaposed with an academic theme may suggest that the book intends to provide a refreshing perspective on a typically challenging subject.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 3

Context: # Contents at a glance

## Contents at a glance
- **Contents at a glance** …… i
- **Preface** …… iii
- **Acknowledgements** …… v
  
1. **Meetup at the trailhead** …… 1
2. **Sets** …… 7
3. **Relations** …… 35
4. **Probability** …… 59
5. **Structures** …… 85
6. **Counting** …… 141
7. **Numbers** …… 165
8. **Logic** …… 197
9. **Proof** …… 223

Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)!
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 5

Context: # Preface

Discrete math is a popular book topic — start Googling around and you’ll find a zillion different textbooks about it. Take a closer look, and you’ll discover that most of these are pretty thick, dense volumes packed with lots of equations and proofs. They’re principled approaches, written by mathematicians and (seemingly) to mathematicians. I speak with complete frankness when I say I’m comforted to know that the human race is well covered in this area. We need smart people who can derive complex expressions and prove theorems from scratch, and I’m glad we have them.

Your average computer science practitioner, however, might be better served by a different approach. There are elements to the discrete math mindset that a budding software developer needs experience with. This is why discrete math is (properly, I believe) part of the mandatory curriculum for most computer science undergraduate programs. But for future programmers and engineers, the emphasis should be different than it is for mathematicians and researchers in computing theory. A practical computer scientist mostly needs to be able to use these tools, not to derive them. She needs familiarity, and practice, with the fundamental concepts and the thought processes they involve. The number of times the average software developer will need to construct a proof in graph theory is probably near zero. But the times she'll find it useful to reason about probability, logic, or the principles of collections are frequent.

I believe the majority of computer science students benefit most from simply gaining an appreciation for the richness and rigor of
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 8

Context: I'm unable to view images. Please provide the text in Markdown format for me to assist you.

Image Analysis: 

### Image Analysis:

#### 1. **Localization and Attribution:**
- **Image Number**: Image 1

#### 2. **Object Detection and Classification:**
- **Detected Objects**: The image includes a cup of coffee on a saucer, three coffee beans, and some sparkles implying aroma or steam.
- **Object Classification**:
  - **Cup of Coffee**: A central object, likely made of ceramic and filled with coffee.
  - **Coffee Beans**: Close to the cup, classifiable as roasted beans.
  - **Sparkles/Steam**: Above the cup, representing the aroma of the hot coffee.

#### 3. **Scene and Activity Analysis:**
- **Scene Description**: This image displays a cup of coffee on a saucer with coffee beans around it. Decorative sparkles or steam lines are depicted rising above the coffee, suggesting freshness or warmth.
- **Activities**: No human activities observed; the focus is on the presentation of the coffee.

#### 4. **Text Analysis:**
- No text detected in this image.

#### 5. **Diagram and Chart Analysis:**
- Not applicable; no diagrams or charts are present in the image.

#### 6. **Product Analysis:**
- **Product**: A cup of coffee.
- **Main Features**: 
  - A white/neutral-colored ceramic cup and saucer filled with coffee.
  - Roasted coffee beans placed nearby, suggesting the quality or type of coffee used.
- **Materials**: Ceramic cup and saucer, roasted coffee beans.
- **Colors**: Predominantly brown (coffee and beans) and white (cup and saucer).

#### 7. **Anomaly Detection:**
- No anomalies detected; the elements are consistent with a typical coffee presentation.

#### 8. **Color Analysis**:
- **Dominant Colors**: Brown and white.
  - **Brown**: Evokes warmth and richness, associated with coffee and coffee beans.
  - **White**: Clean and neutral, ensures focus on the coffee.

#### 9. **Perspective and Composition**:
- **Perspective**: Bird’s eye view.
- **Composition**: 
  - Centralized with the cup of coffee as the main focus.
  - Balanced by the placement of coffee beans around the cup.
  - Decorative elements like sparkles/steam give a visual cue towards the coffee's freshness or aroma.

#### 10. **Contextual Significance**:
- The image can be used in contexts such as coffee advertisements, cafe menus, or articles about coffee. It emphasizes the quality and appeal of the coffee through its visual presentation.

#### 11. **Metadata Analysis**:
- No metadata available for analysis.

#### 12. **Graph and Trend Analysis**:
- Not applicable; no graphs or trends are present in the image.

#### 13. **Graph Numbers**:
- Not applicable; no data points or graph numbers are present in the image.

### Additional Aspects:

#### **Ablaufprozesse (Process Flows)**:
- Not applicable; no process flows depicted.

#### **Prozessbeschreibungen (Process Descriptions)**:
- Not applicable; no processes described.

#### **Typen Bezeichnung (Type Designations)**:
- Not applicable; no type designations specified.

#### **Trend and Interpretation**:
- Not applicable; no trends depicted.

#### **Tables**:
- Not applicable; no tables included.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 11

Context: # Understanding Integration and Differentiation

People in your family, there will never be 5.3 of them (although there could be 6 someday).

The last couple of entries on this list are worth a brief comment. They are math symbols, some of which you may be familiar with.

On the right side — in the continuous realm — are \( J \) and \( \frac{d}{dx} \), which you’ll remember if you’ve taken calculus. They stand for the two fundamental operations of integration and differentiation. Integration, which can be thought of as finding “the area under a curve,” is basically a way of adding up a whole infinite bunch of numbers over some range. When you “integrate the function \( x^2 \) from 3 to 5,” you’re really adding up all the tiny, tiny little vertical strips that comprise the area from \( x = 3 \) on the left to \( x = 5 \) on the right. Its corresponding entry in the left-column of the table is \( \Sigma \), which is just a short-hand for “sum up a bunch of things.” Integration and summation are equivalent operations; it’s just that when you integrate, you’re adding up all the (infinitely many) slivers across the real-line continuum. When you sum, you’re adding up a fixed sequence of entries, one at a time, like in a loop. \( \Sigma \) is just the discrete “version” of \( J \).

The same sort of relationship holds between ordinary subtraction (\(-\)) and differentiation (\(\frac{d}{dx}\)). If you’ve plotted a bunch of discrete points on \( x \)-\( y \) axes, and you want to find the slope between two of them, you just subtract their \( y \) values and divide by the \( x \) distance between them. If you have a smooth continuous function, on the other hand, you use differentiation to find the slope at a point: this is essentially subtracting the tiny tiny difference between supremely close points and then dividing by the distance between them.

Don’t worry; you don’t need to have fully understood any of the integration or differentiation stuff I just talked about, or even to have taken calculus yet. I’m just trying to give you some feel for what “discrete” means, and how the dichotomy between discrete and continuous really runs through all of math and computer science. In this book, we will mostly be focusing on discrete values and structures, which turn out to be of more use in computer science. That’s partially because, as you probably know, computers
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 13

Context: # 1.1 EXERCISES

1. **What’s the opposite of concrete?**  
   Abstract.

2. **What’s the opposite of discrete?**  
   Continuous.

3. **Consider a quantity of water in a glass. Would you call it abstract, or concrete? Discrete, or continuous?**  
   Concrete, since it’s a real entity you can experience with the senses. Continuous, since it could be any number of ounces (or liters, or tablespoons, or whatever). The amount of water certainly doesn’t have to be an integer. (Food for thought: since all matter is ultimately comprised of atoms, are even substances like water discrete?)

4. **Consider the number 27. Would you call it abstract, or concrete? Discrete, or continuous?**  
   Abstract, since you can’t see or touch or smell “twenty-seven.” Probably discrete, since it’s an integer, and when we think of whole numbers we think “discrete.” (Food for thought: in real life, how would you know whether I meant the integer “27” or the decimal number “27.00”? And does it matter?) Clearly it’s discrete. Abstract vs. concrete, though, is a little tricky. If we’re talking about the actual transistor and capacitor that’s physically present in the hardware, holding a thing in charge in some little chip, then it’s concrete. But if we’re talking about the value “1” that is conceptually part of the computer’s currently executing state, then it’s really abstract just like 27 was. In this book, we’ll always be talking about bits in this second, abstract sense.

5. **Consider a bit in a computer’s memory. Would you call it abstract, or concrete? Discrete, or continuous?**  
   Any kind of abstract object that has properties we can reason about. 

6. **If math isn’t just about numbers, what else is it about?**  
   Any kind of abstract object that has properties we can reason about.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 14

Context: I'm unable to assist with that.

Image Analysis: 

### Analysis of Attached Visual Content

#### Image Identification and Localization
- **Image 1**: Single image provided in the content.

#### Object Detection and Classification
- **Image 1**:
  - **Objects Detected**:
    - Female figurine resembling a prehistoric representation.
    - The figurine appears to be crafted from a terracotta or similar clay-like material.
  - **Classification**: 
    - The object is classified under 'Artifacts' and 'Historical Objects'.
  - **Key Features**: 
    - The sculpture has exaggerated female attributes including a prominent chest and belly, which are indicative of fertility symbols.
    - The face does not have detailed features, implying focus on bodily form rather than facial details.
    - The object seems ancient, slightly weathered, suggesting it to be an archaeological artifact.

#### Scene and Activity Analysis
- **Image 1**:
  - **Scene Description**:
    - The image shows a close-up of a single artifact against a neutral background, possibly for the purpose of highlighting the artifact itself.
  - **Activities**: 
    - No dynamic activity; the object is displayed presumably for appreciation or study.

#### Perspective and Composition
- **Image 1**:
  - **Perspective**:
    - The image is taken from a straight-on, eye-level view, ensuring the object is the primary focus.
    - Close-up perspective to capture detailed features.
  - **Composition**:
    - The object is centrally placed, drawing immediate attention.
    - The background is plain and undistracting, enhancing the focus on the artifact itself.

#### Contextual Significance
- **Image 1**:
  - **Overall Contribution**:
    - The artifact could be used in educational, historical, or museum contexts to study prehistoric cultures, their art, and societal values.
    - As a fertility symbol, it contributes to understanding sociocultural aspects of ancient civilizations.

#### Color Analysis
- **Image 1**:
  - **Dominant Colors**:
    - Shades of brown and beige are dominant, corresponding to the natural materials like terracotta.
  - **Impact on Perception**:
    - The earthy tones evoke a sense of antiquity and authenticity, reinforcing the perception of it being an ancient artifact.

### Conclusion
The provided image is of a terracotta or similar material female figurine, likely of prehistoric origin. It is depicted with a neutral background to focus on its significant features, in particular, the enhanced bodily features indicative of fertility representations. The composition and color tones effectively highlight its historical and cultural importance.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 23

Context: # 2.6 Sets of Sets

Sets are heterogeneous — a single set can contain four universities, seven integers, and an image — and so it might occur to you that they can contain other sets as well. This is indeed true, but let me issue a stern warning: you can get into deep water very quickly when you start thinking about “sets of sets.” In 1901, in fact, the philosopher Bertrand Russell pointed out that this idea can lead to unsolvable contradictions unless you put some constraints on it. What became known as “Russell’s Paradox” famously goes as follows: consider the set \( R \) of all sets that do not have themselves as a member.

For this reason, we’ll be very careful to use curly braces to denote sets, and parentheses to denote ordered pairs.

By the way, although the word “coordinate” is often used to describe the elements of an ordered pair, that’s really a geometry-centric word that implies a visual plot of some kind. Normally we won’t be plotting elements like that, but we will still have to deal with ordered pairs. I’ll just call the constituent parts “elements” to make it more general.

Three-dimensional points need ordered triples \( (x, y, z) \), and it doesn’t take a rocket scientist to deduce that we could extend this to any number of elements. The question is what to call them, and you do sort of sound like a rocket scientist (or other generic nerd) when you say tuple. Some people rhyme this word with “Drupal,” and others with “couple,” by the way, and there seems to be no consensus. If you have an ordered pair-type thing with 5 elements, therefore, it’s a 5-tuple (or a quintuple). If it has 117 elements, it’s a 117-tuple, and there’s really nothing else to call it. The general term (if we don’t know or want to specify how many elements) is n-tuple. In any case, it’s an ordered sequence of elements that may contain duplicates, so it’s very different than a set.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 26

Context: # 2.9 Combining Sets

Okay, so we have sets. Now what can we do with them? When you first learn about numbers back before kindergarten, the next thing you learn is how to combine numbers using various operations to produce other numbers. These include \( -, \times, +, \div \), exponents, roots, etc. Sets, too, have operations that are useful for combining to make other sets. These include:

- **Union (\(\cup\))**: The union of two sets is a set that includes the elements that either (or both) of them have as members. For instance, if \( A = \{ \text{Dad}, \text{Lizzy} \} \) and \( B = \{ \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \), then \( A \cup B = \{ \text{Dad}, \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \). Note that any element is in the union if it is in \( A \) or \( B \). For this reason, there is a strong relationship between the union operator of sets and the “or” (\( \vee \)) operator of boolean logic that we’ll see later.

- **Intersection (\(\cap\))**: The intersection of two sets is a set that includes the elements that both of them have as members. In the above example, \( A \cap B = \{ \text{Lizzy} \} \). There is a strong connection between intersection and the “and” (\(\land\)) boolean logic operator.

- **(Partial) Complement (\(-\))**: Looks like subtraction, but significantly different. \( A - B \) contains the elements from \( A \) that are also not in \( B \). So you start with \( A \) and then “subtract off” the contents of \( B \) if they occur. In the above example, \( A - B = \{ \text{Dad} \} \). (Note that T.J. and Johnny didn’t really enter into the calculation.) Unlike \( \cup \) and \( \cap \), it is not commutative. This means it’s not symmetric: \( A - B \) doesn’t (normally) give the same answer as \( B - A \). In this example, \( B - A = \{ \text{T.J.}, \text{Johnny} \} \), whereas, if you ever reverse the operands with union or intersection, you’ll always get the same result as before.

- **(Total) Complement (\(X\))**: Same as the partial complement, above, except that the implied first operand is \( \Omega \). In other words, \( A - B \) is “all the things in \( A \) that aren’t in \( B \),” whereas...
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 39

Context: # 2.13. EXERCISES

21. What’s \( A \cap B \)?  
   \{ Macbeth, Hamlet \}.

22. What’s \( B \cap T \)?  
   ∅.

23. What’s \( B \cap \overline{T} \)?  
   B. (which is { Scrabble, Monopoly, Othello. })

24. What’s \( A \cup (B \cap T) \)?  
   \{ Hamlet, Othello, Macbeth \}.

25. What’s \( (A \cup B) \cap T \)?  
   \{ Macbeth, Hamlet \}.

26. What’s \( A \setminus B \)?  
   Simply \( T \), since these two sets have nothing in common.

27. What’s \( T' \setminus B \)?  
   \{ (Hamlet, Macbeth), (Hamlet, Hamlet), (Hamlet, Othello), (Village, Macbeth), (Village, Hamlet), (Village, Othello), (Town, Hamlet), (Town, Othello) \}. The order of the ordered pairs within the set is not important; the order of the elements within each ordered pair is important.

28. What’s \( T \cap X \)?  
   0.

29. What’s \( (B \cap B) \times (A \cap T) \)?  
   \{ (Scrabble, Hamlet), (Monopoly, Hamlet), (Othello, Hamlet) \}.

30. What’s \( I \cup B \cup T \)?  
   7.

31. What’s \( I \cap (A \cap T) \)?  
   21. (The first parenthesized expression gives rise to a set with 7 elements, and the second to a set with three elements (B itself). Each element from the first set gets paired with an element from the second, so there are 21 such pairings.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 40

Context: # CHAPTER 2. SETS

### 33. Is A an extensional set, or an intensional set?
The question doesn't make sense. Sets aren't "extensional" or "intensional"; rather, a given set can be described extensionally or intensionally. The description given in item 19 is an extensional one; an intensional description of the same set would be "The Shakespeare tragedies Stephen studied in high school."

### 34. Recall that G was defined as { Matthew, Mark, Luke, John }. Is this a partition of G?
- { Luke, Matthew }
- { John }

No, because the sets are not collectively exhaustive (Mark is missing).

### 35. Is this a partition of G?
- { Mark, Luke }
- { Matthew, Luke }

No, because the sets are neither collectively exhaustive (John is missing) nor mutually exclusive (Luke appears in two of them).

### 36. Is this a partition of G?
- { Matthew, Mark, Luke }
- { John }

Yes. (Trivia: this partitions the elements into the synoptic gospels and the non-synoptic gospels).

### 37. Is this a partition of G?
- { Matthew, Luke }
- { John, Mark }

Yes. (This partitions the elements into the gospels which feature a Christmas story and those that don't).
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 50

Context: # CHAPTER 3. RELATIONS

- **Transitivity.** A relation is transitive if whenever \(xRy\) and \(yRz\), then it’s guaranteed that \(xRz\). The “isTallerThan” relation we defined is transitive: if you tell me that Bob is taller than Jane, and Jane is taller than Sue, then I know Bob must be taller than Sue, without even having to tell me that. That’s just how “taller than” works. An example of a non-transitive relation would be “hasBeaten” with NFL teams. Just because the Patriots beat the Steelers this year, and the Steelers beat the Giants, that does not imply that the Patriots necessarily beat the Giants. The Giants might have actually beaten the team-who-beat-the-team-who-beat-them (such things happen), or heck, the two teams might not even have played each other this year.

All of the above examples were defined intentionally. Just for practice, let’s look at some extensionally defined relations as well. Using our familiar Harry Potter set as \(A\), consider the following relation:

| Relation              |
|-----------------------|
| (Harry, Ron)         |
| (Ron, Hermione)      |
| (Ron, Ron)           |
| (Hermione, Ron)      |
| (Hermione, Hermione) |

Consider: is this relation reflexive? No. It has (Ron, Ron) and (Hermione, Hermione), but it’s missing (Harry, Harry), so it’s not reflexive. Is it symmetric? Yes. Look carefully at the ordered pairs. We have a (Harry, Ron), but also a matching (Ron, Harry). We have a (Hermione, Ron), but also a matching (Ron, Hermione). So every time we have a \( (x, y) \) we also have the matching \( (y, x) \), which is the definition of symmetry. Is it antisymmetric? No, because (among other things) both (Harry, Ron) and (Ron, Harry) are present. Finally, is it transitive? No. We have (Harry, Ron) and (Ron, Hermione), which means that if it’s transitive we would have to also have (Harry, Hermione) in there, which we don’t.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 66

Context: I'm unable to assist with this request.

Image Analysis: 

**Image Analysis**

1. **Localization and Attribution:**
    - There is only one image on the page, referred to as **Image 1**.
    - **Image 1** is located centrally on the page.

2. **Object Detection and Classification:**
    - **Image 1** contains several objects, including:
        - **Text Elements:** The text "Critical Path is the Longest of all Paths" is prominently displayed.
        - **Graphical Elements:** A flowchart or process diagram is visible, with various nodes and connecting arrows.
        - **Nodes and Connections:** Several boxes (nodes) with labeled text and directed arrows indicating the flow or sequence.

3. **Scene and Activity Analysis:**
    - **Image 1** depicts a flowchart or a process diagram. This generally represents a sequence of steps in a process or workflow. The primary activity is the visualization of the critical path in a process.
        - The nodes likely represent different tasks or milestones.
        - The connecting arrows show dependencies and the sequential order of tasks.
        - The critical path, being the longest, indicates the sequence of tasks that determine the total duration of the process.

4. **Text Analysis:**
    - Extracted Text: "Critical Path is the Longest of all Paths"
        - Significance: This text emphasizes the importance of the critical path in project management or process flow. It signifies that the critical path dictates the overall time required to complete all tasks in the process.

5. **Diagram and Chart Analysis:**
    - **Image 1** features a diagram depicting a critical path:
        - **Axes and Scales:** No conventional axes as it's a flowchart.
        - **Nodes:** Each node likely represents a step or task in the process, with labeled connectors showing task dependencies.
        - **Key Insight:** Identifying the critical path helps in managing project timelines effectively, ensuring project milestones are met as scheduled.

6. **Anomaly Detection:**
    - There are no apparent anomalies or unusual elements in **Image 1**. The flowchart appears structured and coherent.

7. **Perspective and Composition:**
    - The image is created in a straightforward, top-down perspective, typical for process diagrams and flowcharts.
    - Composition centers around the critical path, with nodes and connectors drawing attention to the sequence and dependencies.

8. **Contextual Significance:**
    - The image likely serves an educational or informative purpose within documents or presentations related to project management or process optimization.
    - The depiction of the critical path forms a crucial element in understanding how task sequences impact project timelines.

9. **Trend and Interpretation:**
    - Trend: The longest path in the sequence (critical path) directly impacts the overall duration for project completion.
    - Interpretation: Any delay in tasks on the critical path will delay the entire project, underlining the need to monitor and manage these tasks closely.

10. **Process Description (Prozessbeschreibungen):**
    - The image describes a process where the longest sequence of tasks (critical path) dictates the project length, requiring focus and management to ensure timely completion.

11. **Graph Numbers:**
    - Data points and specific numbers for each node are not labeled in the image, but each node represents a step that adds up to form the critical path.

By examining **Image 1**, it becomes clear that it plays a crucial role in illustrating process flows and emphasizing the importance of the critical path in project management.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 67

Context: # Chapter 4

## Probability

Probability is the study of **uncertainty**. This may seem like a hopeless endeavor, sort of like knowing the unknowable, but it's not. The study of probability gives us tools for taming the uncertain world we live and program in, and for reasoning about it in a precise and helpful way.

We may not know exactly how long a particular visitor is willing to wait for our webpage to load in their browser, but we can use probability to estimate how much traffic we’ll lose if this takes longer than a certain average duration. We may not know which specific passwords a hacker will try as he attempts to break our security protocol, but we can use probability to estimate how feasible this approach will be for him. We may not know exactly when a certain program will run out of RAM and have to swap its data out to virtual memory, but we can predict how often this is likely to occur — and how painful it will be for us — given a certain system load and user behavior.

The trick is to use the tools we've already built — sets, relations, functions — to characterize and structure our notions of the relative likelihood of various outcomes. Once those underpinnings are secured, a layer of deductive reasoning will help us make good use of that information to begin to predict the future.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 68

Context: # 4.1 Outcomes and events

Since life is uncertain, we don’t know for sure what is going to happen. But let’s start by assuming we know what things might happen. Something that might happen is called an outcome. You can think of this as the result of an experiment if you want to, although normally we won’t be talking about outcomes that we have explicitly manipulated and measured via scientific means. It’s more like we’re just curious how some particular happening is going to turn out, and we’ve identified the different ways it can turn out and called them outcomes.

Now we’ve been using the symbol \( \Omega \) to refer to “the domain of discourse” or “the universal set” or “all the stuff we’re talking about.” We’re going to give it yet another name now: the sample space. \( \Omega \), the sample space, is simply the *set of all possible outcomes*. Any particular outcome — call it \( O \) — is an element of this set, just like in Chapter 1 every conceivable element was a member of the domain of discourse.

If a woman is about to have a baby, we might define \( \Omega \) as \{ boy, girl \}. Any particular outcome is either boy or girl (not both), but both outcomes are in the sample space, because both are possible. If we roll a die, we’d define \( \Omega \) as \{ 1, 2, 3, 4, 5, 6 \}. If we’re interested in motor vehicle safety, we might define \( \Omega \) for a particular road trip as \{ safe, accident \}. The outcomes don’t have to be equally likely; an important point we’ll return to soon.

In probability, we define an event as a *subset of the sample space*. In other words, an event is a group of related outcomes (though an event might contain just one outcome, or even zero). I always thought this was a funny definition for the word “event”: it’s not the first thing that word brings to mind. But it turns out to be a useful concept, because sometimes we’re not interested in any particular outcome necessarily, but rather in whether the outcome — whatever it is — has a certain property. For instance, suppose at the start of some game, my opponent and I each roll the die, agreeing that the highest roller gets to go first. Suppose he rolls a 2. Now it’s my turn. The \( \Omega \) for my die roll is of course \{ 1, 2, 3 \}.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 75

Context: # 4.3. PHILOSOPHICAL INTERLUDE

The idea of how often heads will occur in the long run. But if we flip it a million times and get 500,372 heads, we can confidently say that the probability of getting a head on a single flip is approximately .500.

This much isn’t controversial: it’s more like common sense. But the frequentist philosophy states that this is really the only way that probability can be defined. It’s what probability is: the frequency with which we can expect certain outcomes to occur, based on our observations of their past behavior. Probabilities only make sense for things that are repeatable, and reflect a known, reliable trend in how often they produce certain results. Historical proponents of this philosophy include John Venn, the inventor of the aforementioned Venn diagram, and Ronald Fisher, one of the greatest biologists and statisticians of all time.

If frequentism is on a quest for experimental objectivity, Bayesianism might be called “subjective.” This isn’t to say it’s arbitrary or sloppy. It simply has a different notion of what probability ultimately means. Bayesians interpret probability as a quantitative personal assessment of the likelihood of something happening. They point out that for many (most) events of interest, trials are neither possible nor sensible. Suppose I’m considering asking a girl out to the prom, and I’m trying to estimate how likely it is she’ll go with me. It’s not like I’m going to ask her a hundred times and count how many times she says yes, then divide by 100 to get a probability. There is in fact no way to perform a trial or use past data to guide me, and at any rate she’s only going to say yes or no once. So based on my background knowledge and my assumptions about her, myself, and the world, I form an opinion which could be quantified as a “percent chance.”

Once I’ve formed this opinion (which of course involves guesswork and subjectivity) I can then reason about it mathematically, using all the tools we’ve been developing. Of special interest to Bayesians is the notion of updating probabilities when new information comes to light, a topic we’ll return to in a moment. For the Bayesian, the probability of some hypothesis being true is between 0 and 1, and when an agent (a human, or a bot) makes decisions, she/he/it
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 78

Context: # CHAPTER 4. PROBABILITY

Again, let’s say we had information leaked to us that the winner, whoever she may be, is female. We can now update our estimate:

$$
Pr(R|F) = \frac{Pr(R \cap F)}{Pr(F)} = \frac{Pr(|Kelly|)}{Pr(F)} = Pr\{(Kelly, Fantasia, Carrie)\} = \frac{2}{5} = .4
$$

You see, once we find out that David is no longer a possibility, our only remaining hope for a rock star is Kelly. And she has only 40% of the probability that’s left over. Note that this is a higher chance for her personally—she’s got to be excited by the press leak—but it’s lower for rock stars, of which she is only one (and evidently, not the predicted strongest).

Background knowledge can even peg our probability estimate to an extreme: all the way to 0, or to 1. What’s \(Pr(U|C)\), the probability of an underage winner, given that he/she is a country singer? The intersection of \(U\) and \(C\) is zero, so this makes \(Pr(U|C) = 0\). In words: a country winner eliminates any possibility of an underage winner. And what’s \(Pr(F|U)\), the probability that a woman wins, given that we know the winner to be underage? Well, \(F \cap U\) and \(U\) are the same (check me), so \(Pr(F|U) = \frac{Pr(F \cap U)}{Pr(U)} = \frac{4}{4} = 1\). Therefore, an underage winner guarantees a female winner.

The way I think about conditional probability is this: look at the diagram, consider the events known to have occurred, and then mentally block out everything except that. Once we know the background fact (s), we’re essentially dealing with a restricted world. Take the example of the known female winner. Once we know that event \(F\) in fact occurred, we can visually filter out David, and look at the Bob as thought that were our entire world. In this restricted female-only view, the underage elements comprise a greater percentage of the total than they did before. And half the rock-star elements have now been obscured, leaving only Kelly as the one-of-the-remaining-three.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 81

Context: # 4.5 TOTAL PROBABILITY

Pr(A) = Pr(A ∩ C₁) + Pr(A ∩ C₂) + ... + Pr(A ∩ C_N)  
= Pr(A|C₁)Pr(C₁) + Pr(A|C₂)Pr(C₂) + ... + Pr(A|C_N)Pr(C_N)  
= ∑ᵏ₌₁ Pr(A|Cₖ)Pr(Cₖ)  

is the formula.¹

Let's take an example of this approach. Suppose that as part of a promotion for Muvico Cinemas movie theatre, we're planning to give a door prize to the 1000th customer this Saturday afternoon. We want to know, though, the probability that this person will be a minor. Figuring out how many patrons overall will be under 18 might be difficult. But suppose we're showing these three films on Saturday: *Spider-Man: No Way Home*, *Here Before*, and *Sonic the Hedgehog 2*. We can estimate the fraction of each movie's viewers that will be minors: .6, .01, and .95, respectively. We can also predict how many tickets will be sold for each film: 2,000 for *Spider-Man*, 500 for *Here Before*, and 1,000 for *Sonic*.

Applying frequentist principles, we can compute the probability that a particular visitor will be seeing each of the movies:

- \(Pr(Spider-Man) = \frac{2000}{2000 + 500 + 1000} = 0.571\)

- \(Pr(Here \ Before) = \frac{500}{2000 + 500 + 1000} = 0.143\)

- \(Pr(Sonic) = \frac{1500}{2000 + 500 + 1000} = 0.286\)

¹ If you're not familiar with the notation in that last line, realize that \(Σ\) (a capital Greek "sigma") just represents a sort of loop with a counter. The \(k\) in the sign means that the counter is \(k\) and starts at 1. The "N" above the line means the counter goes up to \(N\), which is its last value. And what does the loop do? It adds up a cumulative sum. The thing being added to the total each time through the loop is the expression to the right of the sign. The last line with the \(Σ\) is just a more compact way of expressing the preceding line.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 85

Context: # 4.6. BAYES' THEOREM

See how that works? If I do have the disease (and there's a 1 in 1,000 chance of that), there's a .99 probability of me testing positive. On the other hand, if I don't have the disease (a 999 in 1,000 chance of that), there's a .01 probability of me testing positive anyway. The sum of those two mutually exclusive probabilities is 0.01098.

Now we can use our Bayes’ Theorem formula to deduce:

\[ 
P(D|T) = \frac{P(T|D) P(D)}{P(T)} 
\]

\[
P(D|T) = \frac{.99 \cdot \frac{1}{1000}}{0.01098} \approx .0902 
\]

Wow. We tested positive on a 99% accurate medical exam, yet we only have about a 9% chance of actually having the disease! Great news for the patient, but a head-scratcher for the math student. How can we understand this? Well, the key is to look back at that Total Probability calculation in equation 4.1. Remember that there were two ways to test positive: one where you had the disease, and one where you didn't. Look at the contribution to the whole that each of those two probabilities produced. The first was 0.00099, and the second was 0.9999, over ten times higher. Why? Simply because the disease is so rare. Think about it: the test fails once every hundred times, but a random person only has the disease once every thousand times. If you test positive, it’s far more likely that the test screwed up than that you actually have the disease, which is rarer than blue moons.

Anyway, all of this about diseases and tests is a side note. The main point is that Bayes' Theorem allows us to recast a search for \(P(X|Y)\) into a search for \(P(Y|X)\), which is often easier to find numbers for.

One of many computer science applications of Bayes' Theorem is in text mining. In this field, we computationally analyze the words in documents in order to automatically classify them or form summaries or conclusions about their contents. One goal might be to identify the true author of a document, given samples of the writing of various suspected authors. Consider the Federalist Papers,
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 88

Context: # CHAPTER 4. PROBABILITY

Pr(L|F). Considering only the women, then, you compute Pr(L|F) = \(\frac{22}{198} \approx .111\) from the data in the table.

Wait a minute. That’s exactly what we had before. Learning that we had chosen a woman told us nothing useful about her handedness. That’s what we mean by saying that the L and F events are independent of each other.

The shrewd reader may object that this was a startling coincidence: the numbers worked out exactly perfectly to produce this result. The proportion of left-handed females was precisely the same as that of left-handed males, down to the penny. Is this really likely to occur in practice? And if not, isn’t independence theoretical as to be irrelevant?

There are two ways of answering that question. The first is to admit that in real life, of course, we’re bound to get some noise in our data, just because the sample is finite and there are random fluctuations in who we happen to survey. For the same reason, if we flipped an ordinary coin 1,000 times, we aren’t likely to get exactly 500 heads. But that doesn’t mean we should rush to the conclusion that the coin is biased. Statisticians have sophisticated ways of answering this question by computing how much the experimental data needs to deviate from what we’d expect before we raise a red flag. Suffice to say here that even if the contingency table we collect isn’t picture perfect, we may still conclude that two events are independent if they’re “close enough” to independence.

The other response, though, is that yes, the burden of proof is indeed on independence, rather than on non-independence. In other words, we shouldn’t start by cavalierly assuming all the events we’re considering are in fact independent, and only changing our mind if we see unexpected correlations between them. Instead, we should always be suspicious that two events will affect each other in some way, and only conclude they’re independent if the data we collect works out more or less “evenly” as in the example above. To say that Pr(A|B) is the same as Pr(A) is an aggressive statement, outside the norm, and we shouldn’t assume it without strong evidence.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 93

Context: # Chapter 5

## Structures

Much of computer science deals with representing and manipulating information. To do this, people have devised various **structures** for organizing chunks of data in a way that makes it easy to store, search, and retrieve. There’s a whole course in most computer science curricula called “data structures” which covers how to implement these structures in code. In this book, we won’t be talking about the code, but rather the abstract structures themselves. This chapter has a lot of pictures in it, which depict examples of the various structures in a very general way. The concepts here map directly to code when you need to put them into practice.

There are all kinds of data structures — arrays, linked lists, queues, stacks, hashtables, and heaps, to name a few — but they almost all boil down to one of two fundamental kinds of things: graphs, and trees. There are the two structures we’ll focus on in this chapter. A graph is just about the most general structure you can envision: a bunch of scattered data elements that are related to each other in some way. Almost every data structure imaginable can be recast as a type of graph. Trees are sort of a special case of graphs, but also sort of a topic in their own right, kind of like functions were a special type of relation, but also kind of different. A tree can be seen as a type of graph that imposes extra special conditions which give some navigational benefit.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 97

Context: # 5.1. GRAPHS

## Figure 5.2: A directed graph.

```
         Sonny Liston
           /      \
      Muhammad Ali  George Foreman
           \      /
         Joe Frazier
```

## Figure 5.3: A weighted (and undirected) graph.

```
        Washington, DC
              |
             50
              |
        Fredericksburg
             150
              |
             60
              |
        Virginia Beach
              |
             100
              |
          Richmond
```

**adjacent.** If two vertices have an edge between them, they are said to be adjacent.

**connected.** The word connected has two meanings: it applies both to pairs of vertices and to entire graphs.

We say that two vertices are connected if there is at least one path between them. Each vertex is therefore "reachable" from the other. In Figure 5.1, President and actor are connected, but Ford's Theatre and Civil War are not.

"Connected" is also used to describe entire graphs, if every node can be reached from all others. It's easy to see that Fig.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 99

Context: # 5.1. GRAPHS

A graph of dependencies like this must be both **directed** and **acyclic**, or it wouldn’t make sense. Directed, of course, means that task X can require task Y to be completed before it, without the reverse also being true. If they both depended on each other, we’d have an infinite loop, and no brownies could ever get baked! Acyclic means that no kind of cycle can exist in the graph, even one that goes through multiple vertices. Such a cycle would again result in an infinite loop, making the project hopeless. Imagine if there were an arrow from **bake for 30 mins** back to **grease pan** in Figure 5.4. Then, we’d have to grease the pan before pouring the goop into it, and we’d also have to bake before greasing the pan! We’d be stuck right off the bat: there’d be no way to complete any of those tasks since they all indirectly depend on each other. A graph that is both directed and acyclic (and therefore free of these problems) is sometimes called a **DAG** for short.

```
pour brown stuff in bowl
  ├── crack two eggs
  ├── measure 2 tbsp oil
  ├── preheat oven
  ├── grease pan
  └── mix ingredients
          └── pour into pan
                  ├── bake for 30 mins
                  └── cool
                        └── enjoy!
```

*Figure 5.4: A DAG.*
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 106

Context: # CHAPTER 5. STRUCTURES

Let's run this algorithm in action on a set of Facebook users. Figure 5.9 depicts eleven users and the friendships between them. First, we choose Greg as the starting node (not for any particular reason, just that we have to start somewhere). We mark him (in grey on the diagram) and put him in the queue (the queue contents are listed at the bottom of each frame, with the front of the queue on the left). Then, we begin our loop. When we take Greg off the queue, we visit him (which means we "do whatever we need to do to Greg") and then mark and enqueue his adjacent nodes Chuck and Lizzy. It does not matter which order we put them into the queue, just as it did not matter what node we started with. In pane 3, Chuck has been dequeued, visited, and his adjacent nodes put on the queue. Only one node gets enqueued here — Adrian — because obviously Greg has already been marked (and even visited, no less) and this marking allows us to be smart and not re-enqueue him.

It’s at this point that the “breadth-first” feature becomes apparent. We've just finished with Chuck, but instead of proceeding Adrian next, we resume with Lizzy. This is because she has been waiting patiently on the queue, and her turn has come up. So we lay Adrian aside (in the queue, of course) and visit Lizzy, enqueueing her neighbor Elaine in the process. Then, we go back to Adrian. The process continues, in “one step on the top path, one step on the bottom path” fashion, until our two exploration paths actually meet each other on the back end. Visiting Lizzy causes us to enqueue Brittany, and then when we take Kim off the queue, we do not re-enqueue Brittany because she has been marked and we know she's already being taken care of.

For space considerations, Figure 5.9 leaves off at this point, but our course would continue visiting nodes in the queue until the queue was empty. As you can see, Hank and Daniel will not have visited at all in this process: this is apparently because nobody they know knows anybody in the Greg crowd, and so there's no way to reach them from Greg. This is what I meant earlier by saying that as a side effect, the BFT algorithm tells us whether the graph is connected or not. All we have to do is start somewhere, run BFT, and then see whether any nodes have not been marked and visited. If there are any, we can continue with another starting point, and
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 116

Context: # CHAPTER 5. STRUCTURES

That's immediately closest to the connected group, and adds it without a second thought. There's no notion of "perhaps I'll get a shorter overall edge set if I forego connecting this tempting close node right now."

Sometimes, a greedy algorithm turns out to give an optimal result. Often it does not, and more sophisticated approaches can find better solutions. In this case, it happens to work out that the greedy approach does work! Prim's algorithm will always find the set of edges that connects all the nodes and does so with the lowest possible total distance. It's amazing that it can do so, especially since it never backtracks or revises its opinion the way Dijkstra's algorithm does.

Let's follow the algorithm's progress in the WWII example. We can start with any node, so we'll pick Vichy just at random. Frame 1 of Figure 5.14 shows what happens when the algorithm begins with Vichy: we simply examine all its neighbors and connect the one that's closest to it. Nothing could be simpler. In this case, Lyon is a mere 250 km away, which is closer than anything else to Vichy, so we connect it and add the Vichy-Lyon edge to our edge set. The figure shows a heavy black line between Vichy and Lyon to show that it will officially be a supply line.

And so it goes. In successive frames, we add Marseille, Nice, and Briançon to the set of connected nodes, since we can do no better than 150 km, 150 km, and 200 km, respectively. Note carefully that in frame 4 we connect Briançon to Lyon—not to Nice—because 200 < 250. Note also that the algorithm can jump around from side to side—we aren't looking for the shortest edge from the most recently added node, but from any connected node.

The final result is shown in the last frame. This is the best way to connect all the cities to each other, if "best" means "least total supply line distance," which in this case works out to 2450 total kilometers. But if you look carefully, you'll notice a fascinating thing. 

>This network of edges does not contain the shortest path from Bordeaux to Strasbourg! I find that result dumbfounding. Wouldn't  
it's very easy to fall into a trance and always add nodes only to the ends of the growing scale. In fact, I originally did that with this very example!
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 117

Context: # 5.1. GRAPHS

### Figure 5.13
The stages of Dijkstra's shortest-path algorithm. The “current node” is shown in grey, with visited nodes (whose best paths and shortest distances have been unalterably determined) in black. The sign next to each node shows the tentative shortest distance to that node from Bordeaux.

|       |      |       |      |
|-------|------|-------|------|
|   1   |   2  |   3   |  4   |
| ![Graph 1](img1.png) | ![Graph 2](img2.png) | ![Graph 3](img3.png) | ![Graph 4](img4.png) |
|       |      |       |      |
|   5   |   6  |   7   |      |
| ![Graph 5](img5.png) | ![Graph 6](img6.png) | ![Graph 7](img7.png) |      |
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 123

Context: # 5.2 TREES

parent is A, and A has no parent.

## child

Some nodes have children, which are nodes connected directly below it. A’s children are F and B, C’s are D and E, B’s only child is C, and E has no children.

## sibling

A node with the same parent. E’s sibling is D, B’s is F, and none of the other nodes have siblings.

## ancestor

Your parent, grandparent, great-grandparent, etc., all the way back to the root. B’s only ancestor is A, while E’s ancestors are C, B, and A. Note that F is not C’s ancestor, even though it’s above it on the diagram; there’s no connection from C to F, except back through the root (which doesn’t count).

## descendant

Your children, grandchildren, great-grandchildren, etc., all the way to the leaves. B’s descendants are C, D, and E, while A’s are F, B, C, D, and E.

## leaf

A node with no children. F, D, and E are leaves. Note that in a (very) small tree, the root could itself be a leaf.

## internal node

Any node that’s not a leaf. A, B, and C are the internal nodes in our example.

## depth (of a node)

A node’s depth is the distance (in number of nodes) from it to the root. The root itself has depth zero. In our example, B is of depth 1, E is of depth 3, and A is of depth 0.

## height (of a tree)

A rooted tree’s height is the maximum depth of any of its nodes; i.e., the maximum distance from the root to any node. Our example has a height of 3, since the “deepest” nodes are D and E, each with a depth of 3. A tree with just one node is considered to have a height of 0. Bizarrely, but to be consistent, we’ll say that the empty tree has height -1! Strange, but what else could it be? To say it has height 0 seems inconsistent with a one-node tree also having height 0. At any rate, this won’t come up much.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 128

Context: # CHAPTER 5. STRUCTURES

Finally, to traverse a tree in-order, we:

1. Treat the left child and all its descendants as a subtree, and traverse it in its entirety.
2. Visit the root.
3. Traverse the right subtree in its entirety.

So instead of visiting the root first (pre-order) or last (post-order), we treat it in between our left and right children. This might seem to be a strange thing to do, but there's a method to the madness which will become clear in the next section.

For the sample tree, the first visited node is B. This is because it’s the first node encountered that doesn’t have a left subtree, which means step 1 doesn’t need to do anything. This is followed by O and I, for the same reason. We then visit K before its right subtree, which in turn visits G, M, E, C, A, H, F, B, L, N. (See Figure 5.20.)

If your nodes are spaced out evenly, you can read the in-order traversal of the diagram by moving your eyes left to right. Be careful about this, though, because ultimately the spatial position doesn’t matter, but rather the relationships between nodes. For instance, if I had drawn node I further to the right, in order to make the lines between D–O–I less steep, that I node might have been pushed physically to the right of K. But that wouldn't change the order and K visited earlier.

Finally, it’s worth mentioning that all of these traversal methods make elegant use of recursion. Recursion is a way of taking a large problem and breaking it up into similar, but smaller, sub-problems. Then, each of those subproblems can be attacked in the same way as you attacked the larger problem: by breaking them up into subsublproblems. All you need is a rule for eventually stopping the “breaking up” process by actually doing something.

Every time one of these traversal processes treats a left or right child as a subtree, they are “recursing” by re-initiating the whole traversal process on a smaller tree. Pre-order traversal, for instance,
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 130

Context: # CHAPTER 5. STRUCTURES

## Figure 5.21: A full binary tree

```
       G
     /   \
    K     H
   / \   / \
  D   M A   B
 / \ / \ |   |
C  E F   N   L
   |
   I
```

## Complete binary tree

A complete binary tree is one in which every level has all possible nodes present, except perhaps for the deepest level, which is filled all the way from the left. Figure 5.21 is not complete, but it would be if we fixed it up as in Figure 5.22.

## Figure 5.22: A complete binary tree

```
       G
     /   \
    K     H
   / \   / \
  D   M A   B
 / \   |   |
I   E  L   F
         / \
        N   -
```

Unlike full binary trees, it is always possible to have a complete binary tree no matter how many nodes it contains. You just keep filling in from left to right, level after level.

## Perfect binary tree

Our last special type has a rather audacious title, but a “perfect” tree is simply one that is exactly balanced.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 138

Context: # CHAPTER 5. STRUCTURES

missing her. Jennifer has effectively been lost.

```plaintext
     Mitch
        ┌───────────┐
     Randi         Ben
                   ┌────────┐
                   Jim      Xander
                   └───────┘
                     Jennifer
                   ┌────────┐
                Molly
```

Figure 5.27: A wrong (non)-BST after removing Jessica incorrectly.

One correct way (there are others) to do a node removal is to replace the node with the left-most descendant of its right subtree. (Or, equivalently, the right-most descendant of its left subtree.) Let’s be careful to define this: to get the left-most descendant of a node’s right subtree, we (1) go to the right child of the node, and then (2) go as left as we possibly can from there, until we come to a node that has no left child. That node (the one without a left child) is officially the left-most descendant of the original node’s right subtree.

Example: flip back to Figure 5.17 (p. 117). What is the left-most descendant of G’s right subtree? 

**Answer:** We start by going right from G down to H, and then we go as left as possible... which turns out to be only one node’s worth of “left,” because we hit A, and A has no left child (or right child, for that matter). Work these additional examples out for yourself: what is the left-most descendant of K’s right subtree? Of D’s? Of H’s?

Okay, let’s return to Figure 5.26 (p. 129) and remove Jessica the correct way. We simply find the left-most descendant of her right subtree.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 144

Context: # CHAPTER 5. STRUCTURES

10. How many vertices and edges are there in the graph below?

   ![Graph](image_reference_placeholder)

   **7 and 10, respectively.**

11. What's the degree of vertex L?  
   **It has an in-degree of 2, and an out-degree of 1.**

12. Is this graph directed?  
   **Yes.**

13. Is this graph connected?  
   **Depends on what we mean. There are two different notions of “connectedness” for directed graphs. One is strongly connected, which means every vertex is reachable from any other by following the arrows in their specified directions. By that definition, this graph is not connected: there’s no way to get to H from L, for example. It is weakly connected, however, which means that if you ignore the arrowheads and consider it like an undirected graph, it would be connected.**

14. Is it a tree?  
   **No. For one thing, it can’t have any “extra” edges beyond what’s necessary to make it connected, and there’s redundancy here.**

15. Is it a DAG?  
   **All in all, if you look very carefully, you’ll see that there is indeed a cycle: I→J→K→L→I. So if this graph were to represent a typical project workflow, it would be impossible to complete.**

16. If we reversed the direction of the I→J edge, would it be a DAG?  
   **Yes. The edges could now be completed in this order: H, G, L, I, M, K, and finally J.**
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 145

Context: # 5.4. EXERCISES

17. If this graph represented an endorelation, how many ordered pairs would it have?  
   **Answer:** 10.

18. Suppose we traversed the graph below in depth-first fashion, starting with node P. In what order would we visit the nodes?

   ![Graph](attachment_link_here)  
   **Answer:** There are two possible answers: P, Q, R, S, T, N, O, or else P, O, N, T, S, R, Q. (The choice just depends on whether we go "left" or "right" initially.) Note in particular that either O or Q is at the very end of the list.

19. Now we traverse the same graph breadth-first fashion, starting with node P. Now in what order would we visit the nodes?

   **Answer:** Again, two possible answers: P, O, Q, N, R, T, S, or else P, Q, R, N, S, T. Note in particular that both O and Q are visited very early.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 147

Context: # 5.4. EXERCISES

25. **Is it a binary search tree?**  
   No. Although nearly every node does satisfy the BST property (all the nodes in its left subtree come before it alphabetically, and all the nodes in its right subtree come after it), there is a single exception: Zoe is in Wash's left subtree, whereas she should be to his right.  
   Many ways; one would be to swap Zoe's and Wash's positions. If we do that, the fixed tree would be:

   ```
       Mal
      /   \
    Jayne  Zoe
    /  \   /
  Inara Kayle
           \
           River
             \
             Simon
               \
               Wash
   ```

26. **How could we fix it?**  
   Take a moment and convince yourself that every node of this new tree does in fact satisfy the BST property. It's not too bad, but it does have one too many levels in it (it has a height of 4, whereas all its nodes would fit in a tree of height 3).

27. **Is the tree balanced?**  
   Many ways; one would be to rotate the River-Simon-Wash subtree so that Simon becomes Zoe's left child. Simon would then be the parent of River (on his left) and Wash (on his right).

28. **How could we make it more balanced?**  
   Suggested approaches include reviewing the structure to ensure the depths of all leaves are as equal as possible or implementing rotations to balance out the heights of subtrees.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 158

Context: # CHAPTER 6. COUNTING

and so on. Once we have the **RISK** permutations, we can generate the **BRISK** permutations in the same way:

| B | R | I | S | K |
|---|---|---|---|---|
| R | B | I | S | K |
| R | I | B | S | K |
| R | I | S | B | K |
| B | I | R | S | K |
| T | B | R | I | S | K |
| I | R | B | S | K |
| I | R | S | K |
| I | S | R | K |
| B | S | K |
| . . . |

Another algorithm to achieve the same goal (though in a different order) is as follows:

## Algorithm #2 for enumerating permutations

1. Begin with a set of **n** objects.

   a. If **n = 1**, there is only one permutation; namely, the object itself.
   
   b. Otherwise, remove each of the objects in turn, and prepend that object to the permutations of all the others, creating another permutation each time.

I find this one a little easier to get my head around, but in the end it’s personal preference. The permutations of **BRISK** are: “B followed by all the permutations of **RISK**, plus **I** followed by all the permutations of **BISK**, plus **R** followed by all the permutations of **BRSK**, etc.” So the first few permutations of a 4-letter word are:

- R I S K
- R I K S
- R S I K
- R S K I
- R K I S
- R K S I
- I R S K
- I R K S
- I S R K
- I S K R
- I K R S
- I K S R
- S R I K
- S R K I
- S I R K
- S I K R
- S K R I
- S K I R
- K R I S
- K R S I
- K I R S
- K I S R
- K S R I
- K S I R
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 167

Context: # 6.4 Summary

Most of the time, counting problems all boil down to a variation of one of the following three basic situations:

- \( n^k \) — this is when we have \( k \) different things, each of which is free to take on one of \( n \) completely independent choices.

- \( \binom{n}{k} \) — this is when we’re taking a sequence of \( k \) different things from a set of \( n \), but no repeats are allowed. (A special case of this is \( n! \), when \( k = n \).)

- \( \binom{n}{k} \) — this is when we’re taking \( k \) different things from a set of \( n \), but the order doesn’t matter.

Sometimes it’s tricky to deduce exactly which of these three situations apply. You have to think carefully about the problem and ask yourself whether repeated values would be allowed and whether it matters what order the values appear in. This is often subtle.

As an example, suppose my friend and I work out at the same gym. This gym has 18 different weight machines to choose from, each of which exercises a different muscle group. Each morning,
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 171

Context: # 6.5. EXERCISES

1. To encourage rivalry and gluttony, we're going to give a special certificate to the child who collects the most candy at the end of the night. And while we're at it, we'll give 2nd-place and 3rd-place certificates as well. How many different ways could our 1st-2nd-3rd contest turn out?  
   **Answer:** This is a partial permutation: there are eleven possible winners, and ten possible runners-up for each possible winner, and nine possible 3rd-placers for each of those top two. The answer is therefore 114, or 900. Wow! I wouldn’t have guessed that high.

2. Finally, what if we want every kid to get a certificate with their name and place-of-finish on it? How many possibilities? (Assume no ties.)  
   **Answer:** This is now a full-blown permutation: 111. It comes to 30,916,800 different orders-of-finish, believe it or not. I told you: this counting stuff can explode fast.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 201

Context: # 7.5 EXERCISES

6. If I told you that 98,243,917,215 mod 7 was equal to 1, would you call me a liar without even having to think too hard?  
   No, you shouldn’t. It turns out that the answer is 3, not 1, but how would you know that without working hard for it?  

7. If I told you that 273,111,999,214 mod 6 was equal to 6, would you call me a liar without even having to think too hard?  
   Yes, you should. Any number mod 6 will be in the range 0 through 5, never 6 or above. (Think in terms of repeatedly taking out groups of six from the big number. The mod is the number of stones you have left when there are no more whole groups of six to take. If towards the end of this process there are six stones left, that’s not a remainder because you can get another whole group!)

8. Are the numbers 18 and 25 equal?  
   Of course not. Don’t waste my time.

9. Are the numbers 18 and 25 congruent mod 7?  
   Yes. If we take groups of 7 out of 18 stones, we’ll get two such groups (a total of 14 stones) and have 4 left over. And then, if we do that same with 25 stones, we’ll get three such groups (a total of 21 stones) and again have 4 left over. So they’re not congruent mod 7.

10. Are the numbers 18 and 25 congruent mod 6?  
    No. If we take groups of 6 out of 18 stones, we’ll get three such groups with nothing left over. But if we start with 25 stones, we’ll take out 4 such groups (for a total of 24 stones) and have one left over. So they’re not congruent mod 6.

11. Are the numbers 617,418 and 617,424 equal?  
    Of course not. Don’t waste my time.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 204

Context: I'm unable to process images directly. However, if you provide the text you want to format in Markdown, I can certainly help you fix any formatting issues. Please paste the text here.

Image Analysis: 

### Image Analysis Report

**Localization and Attribution:**
- The image appears to be a single diagram located at the center of the page.
- It is referred to as **Image 1**.

**Object Detection and Classification:**
- **Objects Detected:**
  - Boxes/rectangles connected by arrows.
  - Icons inside some boxes.

**Scene and Activity Analysis:**
- **Scene Description:**
  - The image is a flowchart depicting a process. It contains several boxes connected by arrows, representing a step-by-step procedure.
- **Activities:**
  - The flowchart likely represents a sequence of tasks or decisions, as indicated by the directional arrows.

**Text Analysis:**
- **Detected Text:**
  - Box 1: "Start"
  - Box 2: "Input Data"
  - Box 3: "Process Data"
  - Box 4: "Output Data"
  - Box 5: "End"
- **Text Content Analysis:**
  - The text describes a straightforward process flow where data is input, processed, and then output before concluding.

**Diagram and Chart Analysis:**
- **Diagram Analysis:**
  - The flowchart is visually structured to guide the reader from the starting point of a process to its conclusion.
  - Arrows indicate the direction of the flow, ensuring clarity in the process order.
- **Key Insights:**
  - The process is linear and simple, involving data manipulation from input to processing and finally output.

**Perspective and Composition:**
- **Perspective:**
  - The image is presented from a direct, front-facing view to ensure readability and clarity.
- **Composition:**
  - The boxes are symmetrically aligned and uniformly spaced, with arrows guiding the flow from one step to the next.

**Contextual Significance:**
- **Overall Contribution:**
  - The image contributes as a simplified visual representation of a process or procedure, likely serving as a quick reference for understanding the sequence of steps involved.

**Tables:**
- **Tabular Data Analysis:**
  - No tables are included in the image.

### Summary

**Image 1** features a flowchart that describes a step-by-step process involving the input, processing, and output of data, concluding with an end step. The text provides clear labels for each step, and the arrows create a discernible flow from start to end. The perspective and composition facilitate easy understanding of the process being depicted. There are no additional anomalies or metadata information to analyze from the image.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 206

Context: # CHAPTER 8. LOGIC

A statement that has a "truth value," which means that it is either true or false. The statement "all plants are living beings" could be a proposition, as could "Barack Obama was the first African-American President" and "Kim Kardashian will play the title role in Thor: Love and Thunder." By contrast, questions like "are you okay?" cannot be propositions, nor can commands like "hurry up and answer already!" or phrases like "Lynn's newborn schnauzer," because they are not statements that can be true or false. (Linguistically speaking, propositions have to be in the indicative mood.)

We normally use capital letters (what else?) to denote propositions, like:

- Let **A** be the proposition that UMW is in Virginia.
- Let **B** be the proposition that the King of England is female.
- Let **C** be the proposition that dogs are carnivores.

Don't forget that a proposition doesn't have to be true in order to be a valid proposition (B is still a proposition, for example). It just matters that it is labeled and that it has the potential to be true or false.

Propositions are considered atomic. This means that they are indivisible: to the logic system itself, or to a computer program, they are simply an opaque chunk of truth (or falsity) called "A" or whatever. When we humans read the description of A, we realize that it has to do with the location of a particular institution of higher education, and with the state of the union that it might reside (or not reside) in. All this is invisible to an artificially intelligent agent, however, which treats "A" as nothing more than a stand-in label for a statement that has no further discernible structure. 

So things are pretty boring so far. We can define and label propositions, but none of them have any connections to the others. We change that by introducing logical operators (also called logical connectives) with which we can build up compound constructions out of multiple propositions. The six connectives we’ll learn are:
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 219

Context: # 8.2. Predicate Logic

which is perfectly true[^1].

You may recall the word "predicate" from your middle school grammar class. Every sentence, remember, has a subject and a predicate. In “Billy jumps,” “Billy” is the subject, and “jumps” the predicate. In “The lonely boy ate spaghetti with gusto,” we have “the lonely boy” as the subject and “ate spaghetti with gusto” as the predicate. Basically, a predicate is anything that can describe or affirm something about a subject. Imagine asserting “JUMPS(Billy)” and “ATESPAGHETTIWITHGUSTO(lonely boy).”

A predicate can have more than one input. Suppose we define the predicate `IsFanOf` as follows:

Let `IsFanOf(x, y)` be the proposition that x digs the music of rock band y.

Then I can assert:

- `IsFanOf(Stephen, Led Zeppelin)`
- `IsFanOf(Rachel, The Beatles)`
- `IsFanOf(Stephen, The Beatles)`
- `¬IsFanOf(Stephen, The Rolling Stones)`

We could even define `TraveledToByModeInYear` with a bunch of inputs:

Let `TraveledToByModeInYear(p, d, m, y)` be the proposition that person p traveled to destination d by mode m in year y.

The following statements are then true:

- `TraveledToByModeInYear(Stephen, Richmond, car, 2017)`

> "By the way, when I say you can give any input at all to a predicate, I mean any individual element from the domain of discourse. I don’t mean that a set of elements can be an input. This limitation is why it’s called ‘first-order’ predicate logic. If you allow sets to be inputs to predicates, it’s called ‘second-order predicate logic’, and can get quite messy." 

[^1]: The footnote text if needed.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 226

Context: # Chapter 8: Logic

Here's a funny one I’ll end with. Consider the sentence **“He made her duck.”** What is intended here? Did some guy reach out with his hand and forcefully push a woman’s head down out of the way of a screaming projectile? Or did he prepare a succulent dish of roasted fowl to celebrate her birthday? Oh, if the computer could only know. If we’d used predicate logic instead of English, it could!
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 227

Context: # 8.3 Exercises

Let \( B \) be the proposition that Joe Biden was elected president in 2020, \( C \) be the proposition that Covid-19 was completely and permanently eradicated from the earth in 2021, and \( R \) be the proposition that *Roe v. Wade* was overturned in 2022.

1. What’s \( B \land C \)?  
   **False.**

2. What’s \( B \land C' \)?  
   **True.**

3. What’s \( A \land R \)?  
   **True.**

4. What’s \( A \land R' \)?  
   **False.**

5. What’s \( \neg C \lor R \)?  
   **True.**

6. What’s \( \neg(C \lor R')? \)  
   **False.**

7. What’s \( \neg(C \lor R) \)?  
   **True.**

8. What’s \( \neg C \land B \)?  
   **False.**

9. What’s \( C \land B' \)?  
   **True.**

10. What’s \( \neg C \to B \)?  
    **False.**

11. What’s \( \neg\neg\neg B \)?  
    **True.**

12. What’s \( \neg B' \)?  
    **False.**

13. What’s \( \neg\neg\neg C' \)?  
    **True.**

14. What’s \( B \lor C \lor R \)?  
    **True.**

15. What’s \( B \land C \land R'? \)  
    **False.**

16. What’s \( B \land C \land R \)?  
    **True.** (Even though there is plainly no causality there.)

17. What’s \( B \to R? \)  
    **True.** (Ditto.)

18. What’s \( R \to B? \)  
    **False.** (The premise is true, so the conclusion must also be true for this sentence to be true.)

19. What’s \( B \to C? \)  
    **True.** (The premise is true, so all bets are off and the sentence is true.)

20. What’s \( C \to B? \)  
    **True.** (The premise is false, so all bets are off and the sentence is true.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 229

Context: # 8.3. EXERCISES

28. True or false: ¬∀x Human(x).  
   **True.** This says “it’s not the case that everyone and everything is human.” And that certainly is not the case.

29. True or false: ∀x Human(x).  
   **False.** This says “nothing is human,” which is clearly not true. (Consider yourself as a counterexample).

30. True or false: ∃x ¬Human(x).  
   **True.** This says “there’s at least one thing in the universe which is not human.” (Consider your lunch).

31. True or false: ¬∃x Human(x).  
   **False.** This says “nothing is human,” just like item 29 did.

32. True or false: ∀x Human(x) → Professor(x).  
   **Not even close.** This says “everything in the universe is a human professor.” (Even though I would exist in such a world, what a world, indeed, as it would be).

33. True or false: ∀x Human(x) → Professor(x).  
   **False.** This says “every person is a professor.” (Consider LeBron James.) Keep in mind: “→” and “∧” don’t really play well together.

34. True or false: ∃x Professor(x) → Human(x).  
   **This is technically true, but for a stupid reason, and whoever wrote it almost certainly didn't intend what they wrote.** It says, “there’s at least one thing in the universe which either (a) isn’t a professor, or (b) if it is a professor, is also human.” Keep in mind: “∃” and “→” don’t really play well together. To drill this lesson home, realize that you could substitute almost any predicates for Professor(x) and Human(x) in that statement and it would still be true. (Try swapping out Professor(x) for Condition(x) and Human(x) for AstrologicalSign(x). Now try ∃ EuropeanUnion x: the statement is true. The U.S. is not a member, nor is it an astrological sign, so both sides of the implication are false, and never forget: false = false = true.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 231

Context: # Chapter 9

## Proof

We've seen a lot of pretty sights on our cool brisk walk. We've caught a glimpse of the simple elegance of sets and relations, the precision of probabilistic reasoning, the recursive structure of trees, the explosive nature of combinatorics, and much more. None of these things have we plumbed to the depths, but we've appreciated their beauty and taken note of where they stood along our blazed trail. You’ll remember this hike when you run into such concepts again and again in future computer science and math courses, and in your career beyond academics.

Now we have one more step to make before returning to the trailhead, and that deals with the notion of **proof**. As we've studied these various mathematical entities, I’ve pointed out certain of their properties. A free tree has one more vertex than edge, for example. The cardinality of the union of two sets is at least as big as each of their individual unions. If you flip-all-the-bits-and-add-one in a two's complement scheme, and then perform that flip-and-add operation again, you’ll return to the original number. But with a few exceptions, we haven't proven any of these things. I’ve just stated them, and you’ve taken them on faith.

In order to establish reliable truth, of course, professional mathematicians aren’t satisfied with unsubstantiated statements. They need to be convinced that the claims we make do truly hold, and provably so, in all circumstances. What they seek is a **proof** of a...
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 233

Context: # 9.1. PROOF CONCEPTS

The phone is in my pocket, and has not rung, and I conclude that the plan has not changed. I look at my watch, and it reads 5:17 pm, which is earlier than the time they normally leave, so I know I'm not going to catch them walking out the door. This is, roughly speaking, my thought process that justifies the conclusion that the house will be empty when I pull into the garage.

Notice, however, that this prediction depends precariously on several facts. What if I spaced out the day of the week, and this is actually Thursday? All bets are off. What if my cell phone battery has run out of charge? Then perhaps they did try to call me but couldn’t reach me. What if I set my watch wrong and it’s actually 4:17 pm? Etc. Just like a chain is only as strong as its weakest link, a whole proof falls apart if even one step isn’t reliable.

Knowledge bases in artificial intelligence systems are designed to support these chains of reasoning. They contain statements expressed in formal logic that can be examined to deduce only the new facts that logically follow from the old. Suppose, for instance, that we had a knowledge base that currently contained the following facts:

1. \( A = C \)
2. \( \neg (C \land D) \)
3. \( (F \lor E) \to D \)
4. \( A \lor B \)

These facts are stated in propositional logic, and we have no idea what any of the propositions really mean, but then neither does the computer, so hey. Fact #1 tells us that if proposition A (whatever that may be) is true, then we know C is true as well. Fact #2 tells us that we know CAD is false, which means at least one of the two must be false. And so on. Large knowledge bases can contain thousands or even millions of such expressions. It’s a complete record of everything the system “knows.”

Now suppose we learn an additional fact: \( \neg B \). In other words, the system interacts with its environment and comes to the conclusion.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 236

Context: # CHAPTER 9. PROOF

## COLD

You start with one word (like **WARM**) and you have to come up with a sequence of words, each of which differs from the previous by only one letter, such that you eventually reach the ending word (like **COLD**). It's sort of like feeling around in the dark:

- **WARM**
- **WART**
- **WALT**
- **WILT**
- **WILD**
- ...  

This attempt seemed promising at first, but now it looks like it's going nowhere. ("**WOLD?**" "**CILD?**" Hmm...) After starting over and playing around with it for a while, you might stumble upon:

- **WARM**
- **WORM**
- **WORD**
- **CORD**
- **COLD**

This turned out to be a pretty direct path: for each step, the letter we changed was exactly what we needed it to be for the target word **COLD**. Sometimes, though, you have to meander away from the target a little bit to find a solution, like going from **BLACK** to **WHITE**:

- **BLACK**
- **CLACK**
- **CRACK**
- **TRACK**
- **TRICK**
- **TRICE**
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 237

Context: # 9.2. TYPES OF PROOF

## TRITE  

## WRITE  

## WHITE  

Here, we had to temporarily change our first letter three different times — two of which seemingly brought us no nearer to WHITE — in order to successfully forge a path through the tangled forest.

Knowing which direction to set out on is a matter of intuition plus trial and error. Given the axioms of any system (whether algebra, predicate logic, sets, etc.), there are an unfathomable number of different ways to proceed. The vast majority of them are bound to lead to dead ends. This is why a valid proof, when it is finished, is often an elegant and beautiful thing. It’s a thin braid of jewels glistening in the midst of a whole lot of mud.

## Indirect proof

Also known as a proof by contradiction or reductio ad absurdum, the indirect proof starts in a completely opposite way. It says, “Okay, I’m trying to prove X. Well, suppose for the sake of argument I assume that the opposite — not X — is true. Where would that lead me?” If you follow all the rules and it leads you to a contradiction, this tells you that the original assumption of ¬X must have been false. And this in turn proves that X must be true.

We do this all the time in our thinking. Say you’re driving down the highway. How do you know that the alternator in your engine is working? A direct proof would require that you open the hood and examine the part, testing to ensure it works properly. An indirect proof simply says, “Well, suppose it weren’t working properly. Then my car engine wouldn’t operate. But here I am, driving down the road, and the engine obviously does operate, so that tells me that the alternator must be working properly.”

One of the most famous indirect proof dates from Euclid’s *Elements* in 300 B.C. It proves that the square root of 2 is an irrational number, a great surprise to mathematicians at the time (most of whom doubted the very existence of irrational numbers). Remember...
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf

Page: 257

Context: 9.4. FINAL WORD
=================

- ordered triples, 15
- org charts, 113
- outcomes, 60, 62
- overflow, 188

P = NP?, 244  
parent (of a node), 114  
partial orders, 43  
partial permutations, 151, 154  
partitions, 26, 71, 94  
Pascal's triangle, 157  
passwords, 146  
paths (in a graph), 88, 113  
perfect binary tree, 122, 239  
permutations, 147  
PINs, 143  
poker, 160  
pop (off a stack), 99  
posts, 43  
post-order traversal, 118  
postulates, 226  
power sets, 24, 36  
pre-order traversal, 117  
predicates, 210, 211, 232  
predicate logic, 210  
premise (of implication), 200  
Prim's algorithm, 107  
Prime, Robert, 107  
prior probability, 68  
probability measures, 61, 63, 65  
product operator (II), 142, 160  
proof, 223  
proof by contradiction, 229  
propositional logic, 197, 225  
propositions, 197, 210  

- psychology, 70, 86
- push (on a stack), 99
- quantifiers, 212, 215
- queue, 95, 97
- quotient, 173, 174

- range (of a function), 48
- rational numbers (ℝ), 17, 24
- reachable, 89
- real numbers (ℝ), 71, 94
- rebalancing (a tree), 132
- recursion, 116, 120, 149, 231
- red-black trees, 133
- reflexive (relation), 40, 43
- relations, finite, 39
- relations, infinite, 39
- remainder, 173, 174
- right child, 116  
- root (of a tree), 112, 114  
- rooted trees, 112, 134  
- Russell's paradox, 15  

- sample space (Ω), 60  
- semantic network, 87  
- set operators, 18  
- set-builder notation, 11  
- sets, 8, 93  
- sets of sets, 15  
- sets, finite, 12  
- sets, fuzzy, 10  
- sets, infinite, 12  
- sibling (of a node), 115  
- signs-magnitude binary numbers, 183, 189  

**Sonic the Hedgehog**, 73  
- southern states, 72  
- spatial positioning, 92, 113
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 1

Context: # A Brief Introduction to Machine Learning for Engineers

(2018), “A Brief Introduction to Machine Learning for Engineers”, Vol. XX, No. XX, pp 1–231. DOI: XXX.

Osvaldo Simeone  
Department of Informatics  
King’s College London  
osvaldo.simeone@kcl.ac.uk
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 2

Context: # Contents

I Basics 5

1 Introduction 6  
   1.1 What is Machine Learning? 6  
   1.2 When to Use Machine Learning? 8  
   1.3 Goals and Outline 11  

2 A Gentle Introduction through Linear Regression 15  
   2.1 Supervised Learning 15  
   2.2 Inference 17  
   2.3 Frequentist Approach 19  
   2.4 Bayesian Approach 36  
   2.5 Minimum Description Length (MDL) 42  
   2.6 Information-Theoretic Metrics 44  
   2.7 Interpretation and Causality 47  
   2.8 Summary 49  

3 Probabilistic Models for Learning 51  
   3.1 Preliminaries 52  
   3.2 The Exponential Family 53  
   3.3 Frequentist Learning 59
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 4

Context: ```
6.6 Discriminative Models........................................159  
6.7 Autoencoders...............................................163  
6.8 Ranking*...................................................164  
6.9 Summary....................................................164  

IV Advanced Modelling and Inference..............................165  

7 Probabilistic Graphical Models................................166  
7.1 Introduction...............................................167  
7.2 Bayesian Networks..........................................170  
7.3 Markov Random Fields.......................................178  
7.4 Bayesian Inference in Probabilistic Graphical Models......182  
7.5 Summary....................................................185  

8 Approximate Inference and Learning............................186  
8.1 Monte Carlo Methods........................................187  
8.2 Variational Inference.......................................189  
8.3 Monte Carlo-Based Variational Learning*....................197  
8.4 Approximate Learning*......................................199  
8.5 Summary....................................................201  

V Conclusions...................................................202  

9 Concluding Remarks............................................203  

Appendices.......................................................206  

A Appendix A: Information Measures..............................207  
A.1 Entropy....................................................207  
A.2 Conditional Entropy and Mutual Information................210  
A.3 Divergence Measures.......................................212  

B Appendix B: KL Divergence and Exponential Family..........215  

Acknowledgements...............................................217  
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 11

Context: # Part I
## Basics
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 13

Context: # 1.1. What is Machine Learning?

This starts with an in-depth analysis of the problem domain, which culminates with the definition of a mathematical model. The mathematical model is meant to capture the key features of the problem under study and is typically the result of the work of a number of experts. The mathematical model is finally leveraged to derive hand-crafted solutions to the problem.

For instance, consider the problem of defining a chemical process to produce a given molecule. The conventional flow requires chemists to leverage their knowledge of models that predict the outcome of individual chemical reactions, in order to craft a sequence of suitable steps that synthesize the desired molecule. Another example is the design of speech translation or image/video compression algorithms. Both of these tasks involve the definition of models and algorithms by teams of experts, such as linguists, psychologists, and signal processing practitioners, not infrequently during the course of long standardization meetings.

The engineering design flow outlined above may be too costly and inefficient for problems in which faster or less expensive solutions are desirable. The machine learning alternative is to collect large data sets, e.g., of labeled speech, images, or videos, and to use this information to train general-purpose learning machines to carry out the desired task. While the standard engineering flow relies on domain knowledge and on design optimized for the problem at hand, machine learning lets large amounts of data dictate algorithms and solutions. To this end, rather than requiring a precise model of the set-up under study, machine learning requires the specification of an objective, of a model to be trained, and of an optimization technique.

Returning to the first example above, a machine learning approach would proceed by training a general-purpose machine to predict the outcome of known chemical reactions based on a large data set, and then by using the trained algorithm to explore ways to produce more complex molecules. In a similar manner, large data sets of images or videos would be used to train a general-purpose algorithm with the aim of obtaining compressed representations from which the original input can be recovered with some distortion.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 16

Context: # Introduction

representing the data in a smaller or more convenient space; and generative modelling, which is the problem of learning a generating mechanism to produce artificial examples that are similar to available data in the data set \( D \).

As a generalization of both supervised and unsupervised learning, **semi-supervised learning** refers to scenarios in which not all examples are labelled, with the unlabelled examples providing information about the distribution of the covariates \( x \).

## 3. Reinforcement learning

Reinforcement learning refers to the problem of inferring optimal sequential decisions based on rewards or punishments received as a result of previous actions. Under supervised learning, the “label” refers to an action to be taken when the learner is in an informational state about the environment given by a variable \( x \). Upon taking an action \( a \) in a state \( x \), the learner is provided with feedback on the immediate reward accrued via this decision, and the environment moves on to a different state. As an example, an agent can be trained to navigate a given environment in the presence of obstacles by penalizing decisions that result in collisions.

Reinforcement learning is hence neither supervised, since the learner is not provided with the optimal actions \( t \) to select in a given state \( x \); nor is it fully unsupervised, given the availability of feedback on the quality of the chosen action. Reinforcement learning is also distinguished from supervised and unsupervised learning due to the influence of previous actions on future states and rewards.

This monograph focuses on supervised and unsupervised learning. These general tasks can be further classified along the following dimensions:

- **Passive vs. active learning**: A passive learner is given the training examples, while an active learner can affect the choice of training examples on the basis of prior observations.
  
- **Offline vs. online learning**: Offline learning operates over a batch of training samples, while online learning processes samples in a streaming fashion. Note that reinforcement learning operates inherently in an online manner, while supervised and unsupervised learning can be carried out by following either offline or online formulations.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 17

Context: # 1.3 Goals and Outline

This monograph considers only passive and offline learning.

## 1.3 Goals and Outline

This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. Unlike other texts that are focused on one particular aspect of the field, an effort has been made here to provide a broad but concise overview in which the main ideas and techniques are systematically presented. Specifically, the material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra. A prior exposure to information theory is useful but not required.

Detailed discussions are provided on basic concepts and ideas, including overfitting and generalization, Maximum Likelihood and regularization, and Bayesian inference. The text also endeavors to provide intuitive explanations and pointers to advanced topics and research directions. Sections and subsections containing more advanced material that may be regarded at a first reading are marked with a star (+).

The reader will find here further discussions on computing platform or programming frameworks, such as map-reduce, nor details on specific applications involving large data sets. These can be easily found in a vast and growing body of work. Furthermore, rather than providing exhaustive details on the existing myriad solutions in each specific category, techniques have been selected that are useful to illustrate the most salient aspects. Historical notes have also been provided only for a few selected milestone events.

Finally, the monograph attempts to strike a balance between the algorithmic and theoretical viewpoints. In particular, all learning algorithms are presented in a manner that emphasizes their theoretical foundations while also providing practical insights.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 19

Context: # 1.3 Goals and Outline

Generative Adversarial Networks (GANs) are also introduced.

The fourth part covers more advanced modelling and inference approaches. Chapter 7 provides an introduction to probabilistic graphical models, namely Bayesian Networks (BNs) and Markov Random Fields (MRFs), as means to encode more complex probabilistic dependencies than the models studied in previous chapters. Approximate inference and learning methods are introduced in Chapter 8 by focusing on Monte Carlo (MC) and Variational Inference (VI) techniques. The chapter briefly introduces a unified way techniques such as variational EM, Variational AutoEncoders (VAE), and black-box inference. Some concluding remarks are provided in the last part, consisting of Chapter 9.

We conclude this chapter by emphasizing the importance of probability as a common language for the definition of learning algorithms [34]. The centrality of the probabilistic viewpoint was not always recognized, but has deep historical roots. This is demonstrated by the following two quotes, the first from the first AI textbook published by P. H. Winston in 1977, and the second from an unfinished manuscript by J. von Neumann (see [126, 64] for more information):

“Many ancient Greeks supported Socrates upon that deep, inexplicable thoughts came from the gods. Today’s equivalent to those gods is the erratic, even probabilistic neuron. It is more likely that increased randomness of neural behavior is the problem of the epileptic and the drunk, not the advantage of the brilliant.”

from *Artificial Intelligence*, 1977.

“Of all this will lead to theories of computation which are much less rigidly of an all-or-none nature than past and present formal logic. There are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics primarily in the form it was received from Boltzmann.”
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 20

Context: # Introduction

from *The Computer and the Brain*, 1958.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 22

Context: ```markdown
variables \( t_n \) are assumed to be dependent on \( x_n \), and are referred to as dependent variables, labels, or responses. An example is illustrated in Fig. 2.1. We use the notation \( \mathbf{x}_D = (x_1, \ldots, x_N)^T \) for the covariates and \( \mathbf{t}_D = (t_1, \ldots, t_N)^T \) for the labels in the training set \( D \). Based on this data, the goal of supervised learning is to identify an algorithm to predict the label \( t \) for a new, that is, as of yet unobserved, domain point \( x \).

![Figure 2.1: Example of a training set \( D \) with \( N = 10 \) points \( (x_n, t_n) \), \( n = 1, \ldots, N \).](path_to_image)

The outlined learning task is clearly impossible in the absence of additional information on the mechanism relating variables \( z \) and \( t \). With reference to Fig. 2.1, unless we assume, say, that \( z \) and \( t \) are related by a function \( t = f(z) \) with some properties, such as smoothness, we have no way of predicting the label \( t \) for an unobserved domain point \( z \). This observation is formalized by the no free lunch theorem to be reviewed in Chapter 5: one cannot learn rules that generalize to unseen examples without making assumptions about the mechanism generating the data. The set of all assumptions made by the learning algorithm is known as the inductive bias.

This discussion points to a key difference between memorizing and learning. While the former amounts to mere retrieval of a value \( t_n \), 
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 23

Context: 2.2. Inference
====================

Corresponding to an already observed pair \((x_t, y_t) \in \mathcal{D}\), learning entails the capability to predict the value \(t\) for an unseen domain point \(x\). Learning, in other words, converts experience – in the form of \(\mathcal{D}\) – into expertise or knowledge – in the form of a predictive algorithm. This is well captured by the following quote by Jorge Luis Borges: “To think is to forget details, generalize, make abstractions.” [138]

By and large, the goal of supervised learning is that of identifying a predictive algorithm that minimizes the generalization loss, that is, the error in the prediction of a new label \(t\) for an unobserved explanatory variable \(x\). How exactly to formulate this problem, however, depends on one’s viewpoint on the nature of the model that is being learned. This leads to the distinction between the frequentist and the Bayesian approaches, which is central to this chapter. As will be discussed, the MDL philosophy deviates from the mentioned focus on prediction as the goal of learning, by targeting instead a parsimonious description of the data set \(\mathcal{D}\).

### 2.2 Inference

Before we start our discussion of learning, it is useful to review some basic concepts concerning statistical inference, as they will be needed throughout this chapter and in the rest of this monograph. We specifically consider the inference problem of predicting a \(y\) given the observation of another \(x\) under the assumption that their joint distribution \(p(x, y)\) is known. As a matter of terminology, it is noted that here we will use the term “inference” as it is typically intended in the literature on probabilistic graphical models (see, e.g., [81]), thereby diverging from its use in other branches of the machine learning literature (see, e.g., [23]).

In order to define the problem of optimal inference, one needs to define a non-negative loss function \(\ell(t, \hat{t})\). This defines the cost, or loss or risk, incurred when the correct value is \(t\) while the estimate is \(\hat{t}\). An important example is the \(\ell_t\) loss:

\[
\ell_t(\hat{t}) = |t - \hat{t}|, \quad (2.1)
\]

which includes as a special case the quadratic loss \(\ell_q(t, \hat{t}) = (t - \hat{t})^2\).
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 50

Context: ```
44
# A Gentle Introduction through Linear Regression

Produces a description of approximately 
$$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} $$
bits. This description is, however, not sufficient, since the decoder of the description should also be informed about the parameters $(w_{ML}, \beta_{ML})$.

Using quantization, the parameters can be described by means of a number $C(M)$ of bits that is proportional to the number of parameters, here $M + 2$. Concatenating these bits with the description produced by the ML model yields the overall description length

$$ - \sum_{n=1}^{N} \log{p[y_n | x_n, w_{ML}, \beta_{ML}]} + C(M). \quad (2.39) $$

MDL – in the simplified form discussed here – selects the model order $M$ that minimizes the description length (2.39). Accordingly, the term $C(M)$ acts as a regularizer. The optimal value of $M$ for the MDL criterion is hence the result of the trade-off between the minimization of the overhead $C(M)$, which calls for a small value of $M$, and the minimization of the NLL, which decreases with $M$.

Under some technical assumptions, the overhead term can be often evaluated in the form 
$$ \left( \frac{K}{2} \right) N \log N + c, $$ 
where $K$ is the number of parameters in the model and $c$ is a constant. This expression is not quite useful in practice, but it provides intuition about the mechanism used by MDL to combat overfitting.

## 2.6 Information-Theoretic Metrics

We now provide a brief introduction to information-theoretic metrics by leveraging the example studied in this chapter. As we will see in the following chapters, information-theoretic metrics are used extensively in the definition of learning algorithms. Appendix A provides a detailed introduction to information-theoretic measures in terms of inferential tasks. Here we introduce the key metrics of Kullback-Leibler (KL) divergence and entropy by examining the asymptotic behavior of ML in the regime of large $N$. The case with finite $N$ is covered in Chapter 6 (see Sec. 6.4.3).
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 54

Context: # A Gentle Introduction through Linear Regression

![Figure 2.10: Illustration of Simpson's paradox [113]](path_to_image)

The way out of this conundrum is to leverage prior information we have about the problem domain. In fact, we can explain away this spurious correlation by including another measurable variable in the model, namely age. To see this, consider the same data, now redrawn by highlighting the age of the individual corresponding to each data point. The resulting plot, seen in Fig. 2.10 (bottom), reveals that older people — within the observed bracket — tend to have a higher cholesterol as well as to exercise more. Therefore, age is a common cause of both exercise and cholesterol levels. In order to capture the causal relationship between the latter variables, we hence need to adjust for age. Doing this requires to consider the trend within each age separately, recovering the expected conclusion that exercising is useful to lower one’s cholesterol.  

We conclude that in this example the correlation between \( z \) and \( t \), while useful for prediction, should not be acted upon for decision making. When assessing the causal relationship between \( z \) and \( t \), we should first understand which other variables may explain the observations and then discount any spurious correlations.

This discussion reveals an important limitation of most existing machine learning algorithms when it comes to identifying causality relationships, or, more generally, to answering counterfactual queries [112]. The study of causality can be carried out within the elegant framework of footnotes.  

7 This example is an instance of the so-called Simpson's paradox: patterns visible in the data disappear, or are even reversed, on the segregated data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 55

Context: # 2.8 Summary

In this chapter, we have reviewed three key learning frameworks, namely frequentist, Bayesian and MDL, within a parametric probabilistic setup. The frequentist viewpoint postulates the presence of a true unknown distribution for the data and aims at learning a predictor that generalizes well on unseen data drawn from this distribution. This can be done either by learning a probabilistic model to be plugged into the expression of the optimal predictor or by directly solving the ERM problem over the predictor. The Bayesian approach outputs a predictive distribution that combines prior information with the data by solving the inference problem of computing the posterior distribution over the unseen label. Finally, the MDL method aims at selecting a model that allows the data to be described with the smallest number of bits, hence doing away with the need to define the task of generalizing over unobserved examples.

The chapter has also focused extensively on the key problem of overfitting, demonstrating how the performance of a learning algorithm can be understood in terms of bias and estimation error. In particular, while choosing a hypothesis class is essential in order to enable learning, choosing the “wrong” class constitutes an irrecoverable bias that can make learning impossible. As a real-world example, as reported in [109], including as independent variables in the ZIP code of an individual seeking credit at a bank may disadvantage certain applicants or minorities. Another example of this phenomenon is the famous experiment by B. F. Skinner on pigeons [133].

We conclude this chapter by emphasizing an important fact about the probabilistic models that are used in modern machine learning applications. In frequentist methods, typically at least two (possibly conditional) distributions are involved: the empirical data distribution and the model distribution. The former amounts to the histogram of the data which, by the law of large numbers, tends to the real distribution.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 58

Context: # 3.1 Preliminaries

We start with a brief review of some definitions that will be used throughout the chapter and elsewhere in the monograph (see [28] for more details). Readers with a background in convex analysis and calculus may just review the concept of sufficient statistic in the last paragraph.

First, we define a **convex set** as a subset of \(\mathbb{R}^D\), for some \(D\), that contains all segments between any two points in the set. Geometrically, convex sets hence cannot have “indentations.” Function \(f(x)\) is convex if its domain is a convex set and if it satisfies the inequality \(f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)\) for all \(x_1\) and \(x_2\) in its domain and for all \(0 \leq \lambda \leq 1\). Geometrically, this condition says that the function is “U”-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve. A function is strictly convex if the inequality above is strict except for \(\lambda = 0\) or \(\lambda = 1\); a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.”

The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28].

We also introduce two useful concepts from calculus. The **gradient** of a differentiable function \(f(x)\) with \(x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D\) is defined as the \(D \times 1\) vector \(\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T\) containing all partial derivatives. At any point \(x\) in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian \(\nabla^2 f(x)\) is the \(D \times D\) matrix with \((i,j)\) element given by the second-order derivative \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). It captures the local curvature of the function.

1. A statistic is a function of the data.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 77

Context: # 3.6 Maximum Entropy Property*

In this most technical section, we review the maximum entropy property of the exponential family. Besides providing a compelling motivation for adopting models in this class, this property also illuminates the relationship between natural and mean parameters.

The key result is the following: The distribution \(p(x|\theta)\) in (3.1) obtains the maximum entropy over all distributions \(p(x)\) that satisfy the constraints \(\mathbb{E}_{x \sim p(x)} [u_k(x)] = \mu_k\) for all \(k = 1, \ldots, K\). Recall that, as mentioned in Chapter 2 and discussed in more detail in Appendix A, the entropy is a measure of randomness of a random variable. Mathematically, the distribution \(p(x | \eta)\) solves the optimization problem

\[
\max_{p} H(p) \quad \text{s.t.} \quad \mathbb{E}_{x \sim p(x)} [u_k(x)] = \mu_k \quad \text{for } k = 1, \ldots, K.
\]

Each natural parameter \(\eta_k\) turns out to be the optimal Lagrange multiplier associated with the \(k\)th constraint (see [45, Ch. 6-7]).

To see the practical relevance of this result, suppose that the only information available about some data \(x\) is given by the means of given functions \(u(x)\), \(k = 1, \ldots, K\). The probabilistic model (3.1) can then be interpreted as encoding the least additional information about the data, in the sense that it is the "most random" distribution under the given constraints. This observation justifies the adoption of this model by the maximum entropy principle.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 80

Context: # Probabilistic Models for Learning

## 3.9 Summary

In this chapter, we have reviewed an important class of probabilistic models that are widely used as components in learning algorithms for both supervised and unsupervised learning tasks. Among the key properties of members of this class, known as the exponential family, are the simple form taken by the gradient of the log-likelihood (LL), as well as the availability of conjugate priors in the same family for Bayesian inference. 

An extensive list of distributions in the exponential family along with corresponding sufficient statistics, measure functions, log-partition functions, and mappings between natural and mean parameters can be found in [156]. More complex examples include the Restricted Boltzmann Machines (RBMs) to be discussed in Chapter 6 and Chapter 8. It is worth mentioning that there are also distributions not in the exponential family, such as the uniform distribution parameterized by its support. The chapter also covered the important idea of applying exponential models to supervised learning via Generalized Linear Models (GLMs). Energy-based models were finally discussed as an advanced topic.

The next chapter will present various applications of models in the exponential family to classification problems.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 82

Context: # 4
## Classification

The previous chapters have covered important background material on learning and probabilistic models. In this chapter, we use the principles and ideas covered so far to study the supervised learning problem of classification. Classification is arguably the quintessential machine learning problem, with the most advanced state of the art and the most extensive application to problems as varied as email spam detection and medical diagnosis. 

Due to space limitations, this chapter cannot provide an exhaustive review of all existing techniques and latest developments, particularly in the active field of neural network research. For instance, we do not cover decision trees here (see, e.g., [155]). Rather, we will provide a principled taxonomy of approaches, and offer a few representative techniques for each category within a unified framework. We will specifically proceed by first introducing as preliminary material the Stochastic Gradient Descent optimization method. Then, we will discuss deterministic and probabilistic discriminative models, and finally we will cover probabilistic generative models.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 92

Context: ```
86 Classification

The loss function is defined as 

$$
\ell(t(x, \hat{w})) = \max(0, -t \cdot a(x, \hat{w}))
$$
(4.12)

The perceptron process assigns zero cost to a correctly classified example \( x \), whose functional margin \( t \cdot a(x, \hat{w}) \) is positive, and a cost equal to the absolute value of the functional margin for a misclassified example, whose functional margin is negative. A comparison with the 0-1 loss is shown in Fig. 4.4. The perceptron algorithm tackles problem (4.11) with \( \lambda = 0 \) via SGD with mini-batch size \( S = 1 \). The resulting algorithm works as follows. First, the weights \( \hat{w}^{(0)} \) are initialized. Then, for each iteration \( i = 1, 2, \ldots \):

1. Pick a training example \( (x_n, t_n) \) uniformly with replacement from \( D \):
   - If the example is correctly classified, i.e., if \( t_n a(x_n, \hat{w}^{(i)}) \geq 0 \), do not update the weights: 
     $$
     \hat{w}^{(i)} = \hat{w}^{(i-1)}.
     $$
   - If the example is not correctly classified, i.e., if \( t_n a(x_n, \hat{w}^{(i)}) < 0 \), update the weights as:
     $$
     \hat{w}^{(i)} = \hat{w}^{(i-1)} - \nabla_{\hat{w}} \ell(t_n, a(x_n, \hat{w}^{(i)})) \big|_{\hat{w} = \hat{w}^{(i-1)}} = \hat{w}^{(i-1)} + t_n a(x_n, \hat{w}^{(i-1)}).
     $$
(4.13)

It can be proved that, at each step, the algorithm reduces the term \( \ell(t_n, a(x_n, \hat{w})) \) in the perception loss related to the selected training example even if the latter is misclassified. It can also be shown that, if the training set is linearly separable, the perceptron algorithm finds a weight vector \( \hat{w} \) that separates the two classes exactly in a finite number of steps [23]. However, convergence can be slow. More importantly, the perceptron fails on training sets that are not linearly separable, such as the XOR training set \( D = \{(0, 0), (0, 1), (1, 0), (1, 1)\} \) [97]. This realization came as a disappointment and contributed to the first soc-called AI winter period characterized by a reduced funding for AI and machine learning research [154].

## Support Vector Machine (SVM)

SVM, introduced in its modern form by Cortes and Vapnik [37] in 1995, was among the main causes for a renewal of interest in machine learning.
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 106

Context: # Classification

![Figure 4.8](path/to/your/image)  
**Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity. The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression.

## Example 4.3
In the example in Fig. 4.8, the illustrated \( N = 300 \) training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity \( h(z) = \sigma(x) \). The logistic model uses linear features \( \sigma(x) = [1]^T \). Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot the probability that the class labels the same as for the examples marked with circles as color map. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification.

## 4.5.3 Some Advanced Topics
We conclude this section by noting a few important aspects of the ongoing research on deep neural networks.

A first issue concerns the theoretical understanding of the generalization properties of deep neural networks. On the face of it, the success

Image Analysis: 

### Comprehensive Examination of the Visual Content

#### 1. Localization and Attribution
- **Image Position:**
  - **Image 1:** Located centrally on the page.

#### 2. Object Detection and Classification
- **Image 1:**
  - **Objects Detected:**
    - Data points represented by blue circles and red crosses.
    - Color map background in grayscale.
    - Decision boundary indicated by a solid line.

#### 3. Scene and Activity Analysis
- **Image 1:**
  - **Scene Description:**
    - The image depicts a scatter plot overlaid with a color map and a decision boundary.
  - **Activities:**
    - The scatter plot shows data points classified into two categories (blue circles and red crosses).
    - The color map indicates the probability of classification.
    - A solid line separates the regions, representing the decision boundary from logistic regression.

#### 4. Text Analysis
- **Extracted Text:**
    - **Caption:**
      - **Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (\(L=1\)) and six hidden neurons (with sigmoid non-linearity). The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression.
    - **Example 4.3:**
      - In the example in Fig. 4.8, the illustrated \(N=300\) training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (\(L=1\)) and six hidden neurons with sigmoid non-linearity \(h(z)=\sigma(z)\). The logistic model uses linear features \(\sigma(z)=[1 \, z^T]\). Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot the probability that the class label is the same as for the examples marked with circles as color map. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification.

#### 5. Diagram and Chart Analysis
- **Image 1:**
  - **Axes and Scales:**
    - X-axis (\(Z_1\)): Ranges approximately from -4 to 4.
    - Y-axis (\(Z_2\)): Ranges approximately from -4 to 4.
  - **Legends:**
    - Red crosses (\( \times \)): Data points of class 1.
    - Blue circles (\( \circ \)): Data points of class 2.
    - Color map: Represents the probability, with a bar on the right showing the gradient from 0 to 1.
  - **Key Insights:**
    - The decision boundary (solid line) shows the separation made by logistic regression.
    - The color map illustrates that the feed-forward multi-layer network better captures the structure of the data compared to the linear model.

#### 8. Color Analysis
- **Image 1:**
  - **Dominant Colors:**
    - Red and blue are used to differentiate classes.
    - Black, white, and shades of gray indicate the probability regions.
  - **Impact on Perception:**
    - The color scheme helps to distinguish between different classes and the corresponding decision areas effectively.

#### 9. Perspective and Composition
- **Image 1:**
  - **Perspective:**
    - Direct, front-facing view typical of scatter plots.
  - **Composition:**
    - Center-aligned with data points spread across the plot.
    - Decision boundary centrally positioned, highlighting the separation between classes.
    - Color bar on the right side provides probability information.

#### 10. Contextual Significance
- **Image 1:**
  - **Overall Document Contribution:**
    - The plot supports the explanation of Example 4.3 in comparing the effectiveness of logistic regression vs. multi-layer networks.
    - It visually demonstrates how different models handle training data and classification.

By considering these aspects, we can see that this page offers a detailed explanation of machine learning classification techniques with a practical example illustrated by the scatter plot and color map. The image effectively highlights the differences in performance and decision-making capabilities of logistic regression and a multi-layer network.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 113

Context: # 4.8 Summary

When training the \(k\)th model, the outputs \(a_1(x, \mathbf{w}_1), \ldots, a_k(x, \mathbf{w}_{k-1})\) of the previously trained models, as well as their weights \(\pi_1, \ldots, \pi_{k-1}\), are kept fixed. Excluding the models \(k + 1, \ldots, K\), the training loss can be written as

\[
L_k = \sum_{n=1}^{N} \alpha_{n}^{(k)} \exp(-\pi_k t_n - \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)). \tag{4.47}
\]

with the weights

\[
\alpha_{n}^{(k)} = \exp\left(-t_n - \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)\right). \tag{4.48}
\]

An important point is that the weights (4.48) are larger for training samples with smaller functional margin under the mixture model \(\sum_{j=1}^{k} \pi_j a_j(x_n, \mathbf{w}_j)\). Therefore, when training the \(k\)th model, we give more importance to examples that fare worse in terms of classification margins under the current mixture model. Note that, at each training step \(k\), one trains a simple model, which has the added advantage of reducing the computational complexity as compared to the direct learning of the full training model. We refer to [23, Ch. 14][33, Ch. 10] for further details.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 115

Context: 4.8. Summary
-----------

Chapter 8 for details). Learning rate schedules that satisfy (4.49) include \(\gamma(t) = 1/t\). The intuitive reason for the use of diminishing learning rates is the need to limit the impact of the “noise” associated with the finite-sample estimate of the gradient [22]. The proof of convergence leverages the unbiasedness of the estimate of the gradient obtained by SGD.

In practice, a larger mini-batch size \(S\) decreases the variance of the estimate of the gradient, hence improving the accuracy when close to a stationary point. However, choosing a smaller \(S\) can improve the speed of convergence when the current solution is far from the optimum [152, Chapter 8][22]. A smaller mini-batch size \(S\) is also known to improve the generalization performance of learning algorithms by avoiding sharp external points of the training loss function [66, 79] (see also Sec. 4.5). Furthermore, as an alternative to decreasing the step size, one can also increase the size of the mini-batch along the iterations of the SGD algorithm [136].

Variations and Generalizations
-------------------------------

Many variations of the discussed basic SGD algorithm have been proposed and routinely used. General principles motivating these schedule variants include [56, Chapter 8]: 

1. **Momentum**, or heavy-ball, memory: correct the direction suggested by the stochastic gradient by considering the “momentum” acquired during the last update; 
2. **Adapitivity**: use a different learning rate for different parameters depending on an estimate of the curvature of the loss function with respect to each parameter; 
3. **Control variates**: in order to reduce the variance of the SGD updates, and control variates that do not affect the unbiasedness of the stochastic gradient; and
4. **Second-order updates**: include information about the curvature of the cost or objective function in the parameter update.

As detailed in [56, Chapter 8][76, 43], to which we refer for further discussions, methods in the first category include Nesterov momentum; in the second category, we find AdaGrad, RMSProp, and Adam; and the third encompasses SVRG and SAGA. Finally, the fourth features Newton's method.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 119

Context: # 5
## Statistical Learning Theory*

Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number \( N \) of available data points and the generalization performance of a trained machine. The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. 

This chapter is of mathematical nature, and it departs from the algorithmic focus of the text so far. While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. 

To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133].
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 125

Context: ```
5.2 PAC Learnability and Sample Complexity
===========================================

In order to formally address the key question posed above regarding the learnability of a model \( \mathcal{H} \), we make the following definitions. As mentioned, for simplicity, we consider binary classification under the 0-1 loss, although the analysis can be generalized under suitable conditions [133].

### Definition 5.2

A hypothesis class \( \mathcal{H} \) is PAC learnable if, for any \( \epsilon, \delta \in (0, 1) \), there exist an \( (N, \epsilon, \delta) \) PAC learning rule as long as the inequality 

\[
N \geq N_{\mathcal{H}}(\epsilon, \delta)
\]

is satisfied for some function \( N_{\mathcal{H}}(\epsilon, \delta) < \infty \).

In words, a hypothesis class is PAC learnable if, as long as enough data is collected, a learning algorithm can be found that obtains any desired level of accuracy and confidence. An illustration of the threshold \( N_{\mathcal{H}}(\epsilon, \delta) \) can be found in Fig. 5.2. A less strong definition of PAC learnability requires (5.7) to hold only for all distributions \( p(x, t) \) that can be written as 

\[
p(x, t) = p(x)1(t = \hat{i}(x))
\]

for some marginal distribution \( p(x) \) and for some hypothesis \( \hat{i}(x) \in \mathcal{H} \). The condition (5.9) is known as the realizability assumption, which implies that the data is generated from some mechanism that is included in the hypothesis class. Note that realizability implies the linear separability of any data set drawn from the true distribution for the class of linear predictors (see Chapter 4).

A first proposition, and perhaps surprising, observation is that not all models are PAC learnable. As an extreme example of this phenomenon, consider the class of all functions from \( \mathbb{R}^d \) to \( \{0, 1\} \). By the no free lunch theorem, this class is not PAC learnable. In fact, given any amount of data, we can always find a distribution \( p(x, t) \) under which the PAC condition is not satisfied. Intuitively, even in the realizable case, knowing the correct predictor \( \hat{i}(x) \) in (5.9) for any number of 
```
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 133

Context: # 5.5. Summary

To elaborate, consider a probabilistic model \( \mathcal{H} \) defined as the set of all pmfs \( p(x|\theta) \) parameterized by \( \theta \) in a given set. With some abuse of notation, we take \( \mathcal{H} \) to be also the domain of parameter \( \theta \). To fix the ideas, assume that \( \theta \) takes values over a finite alphabet. We know from Section 2.5 that a distribution \( q(x) \) is associated with a lossless compression scheme that requires around \( -\log q(x) \) bits to describe a value. Furthermore, if we were informed about the true parameter \( \theta \), the minimum average coding length would be the entropy \( H(p(\cdot|\theta)) \), which requires setting \( q(x) = p(x|\theta) \) (see Appendix A).

Assume now that we only know that the parameter \( \theta \) lies in set \( \mathcal{H} \), and hence the true distribution \( p(x|\theta) \) is not known. In this case, we cannot select the true parameter distribution, and we need instead to choose a generally different distribution \( q(x) \) to define a compression scheme. With a given distribution \( q(x) \), the average coding length is given by 

\[
\bar{L}(q(x), \theta) = -\sum_{x} p(x|\theta) \log q(x). 
\]
Therefore, the choice of a generally different distribution \( q(x) \) entails a redundancy of 

\[
\Delta R(q(x), \theta) = -\sum_{x} p(x|\theta) \log \frac{q(x)}{p(x|\theta)} \geq 0 \tag{5.22}
\]
bits.

The redundancy \( \Delta R(q(x), \theta) \) in (5.22) depends on the true value of \( \theta \). Since the latter is not known, this quantity cannot be computed. We can instead obtain a computable metric by maximizing over all values of \( \theta \in \mathcal{H} \), which yields the worst-case redundancy

\[
\Delta R(q(x), \mathcal{H}) = \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta). \tag{5.23}
\]

This quantity can be minimized over \( q(x) \) yielding the so-called min-max redundancy:

\[
\Delta R(q(x)) = \min_{q} \Delta R(q(x), \mathcal{H}) \tag{5.24}
\]
\[
= \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)} \tag{5.25}
\]
\[
= \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)}. \tag{5.26}
\]

The minimum redundancy can be taken as a measure of the capacity of model \( \mathcal{H} \), since a richer model tends to yield a larger \( \Delta R(\mathcal{H}) \). In fact,
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 138

Context: # Unsupervised Learning

may wish to cluster a set \( D \) of text documents according to their topics, by modeling the latter as an unobserved label \( z_n \). Broadly speaking, this requires to group together documents that are similar according to some metric. It is important at this point to emphasize the distinction between classification and clustering: While the former assumes the availability of a labelled set of training examples and evaluates its (generalization) performance on a separate set of unlabelled examples, the latter works with a single, unlabelled, set of examples. The different notation used for the labels \( z_n \) in lieu of \( t_n \) is meant to provide a reminder of this key difference.

## Dimensions reduction and representation:
Given the set \( D \), we would like to represent the data points \( x_n \in D \) in a space of lower dimensionality. This makes it possible to highlight independent explanatory factors, and/or to ease visualization and interpretation [93], e.g., for text analysis via vector embedding (see, e.g., [124]).

## Feature extraction:
Feature extraction is the task of deriving functions of the data points \( x_n \) that provide useful lower-dimensional inputs for tasks such as supervised learning. The extracted features are unobserved, and hence latent, variables. As an example, the hidden layer of a deep neural network extracts features from the data for use by the output layer (see Sec. 4.5).

## Generation of new samples:
The goal here is to train a machine that is able to produce samples that are approximately distributed according to the true distribution \( p(x) \). For example, in computer graphics for filmmaking or gaming, one may want to train a software that is able to produce artistic scenes based on a given description.

The variety of tasks and the difficulty in providing formal definitions, e.g., on the realism of an artificially generated image, may hinder unsupervised learning, at least in its current state, a less formal field than supervised learning. Often, loss criteria in unsupervised learning measure the divergence between the learned model and the empirical data distribution, but there are important exceptions, as we will see.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 169

Context: # 6.8 Ranking

## Denoising autoencoders
An alternative approach to facilitate the learning of useful features is that taken by denoising autoencoders [150]. Denoising autoencoders add noise to each input \( x_n \), obtaining a noisy version \( \tilde{x}_n \), and then train the machine with the aim of recovering the input \( x_n \) from its noisy version \( \tilde{x}_n \). Formally, this can be done by minimizing the empirical risk \( \sum_{n} \mathcal{L}(x_n, G_{\theta}(\tilde{x}_n)) \).

## Probabilistic autoencoders
Instead of using deterministic encoder and decoder, it is possible to work with probabilistic encoders and decoders, namely \( p(z|x, \theta) \) and \( p(x|z, \theta) \), respectively. Treating the decoder as a variational distribution, learning can then be done by a variant of the EM algorithm. The resulting algorithm, known as Variational AutoEncoder (VAE), will be mentioned in Chapter 8.

## 6.8 Ranking
We conclude this chapter by briefly discussing the problem of ranking. When one has available ranked examples, ranking can be formulated as a supervised learning problem [133]. Here, we focus instead on the problem of ranking a set of webpages based only on the knowledge of their underlying graph of mutual hyperlinks. This set-up may be considered as a special instance of unsupervised learning. We specifically describe a representative, popular, scheme known as PageRank [110], which uses solely the web of hyperlinks as a form of supervision signal to guide the ranking.

To elaborate, we define the connectivity graph by including a vertex for each webpage and writing the adjacency matrix as:

\[
L_{ij} =
\begin{cases}
1 & \text{if page } j \text{ links to page } i \\
0 & \text{otherwise}
\end{cases}
\]
  
The outgoing degree of a webpage is given as:

\[
C_j = \sum_{i} L_{ij}.
\]

PageRank computes the rank \( p_i \) of a webpage \( i \) as:

\[
p_i = (1 - d) + d \sum_{j} \frac{L_{ji}}{C_j} p_j,
\] 

where \( d \) is a damping factor.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 173

Context: # 7.1 Introduction

In this section, we start by discussing two examples that illustrate the type of structural information that can be encoded by means of probabilistic graphical models. We then provide an overview of this chapter.

As illustrated by the two examples below, structured probabilistic models can be used to set up parametric models for both supervised and unsupervised learning. In the former case, all variables are observed in the training set, with some revs being inputs, i.e., covariates (\(x\)), and others being considered as outputs, or targets (\(y\)). In contrast, in the latter case, some variables are unobserved and play the role of latent variables (\(z\)) that help explain or generate the observed variables (\(y\)).[^1]

## Example 7.1

Consider the tasks of text classification via supervised learning or topic clustering via unsupervised learning. In the supervised learning case, the problem is to classify documents depending on their topic, e.g., sport, politics or entertainment, based on a set of labeled documents. With unsupervised learning, the problem is to cluster documents according to the similarity of their contents based on the sole observation of the documents themselves.

A minimal model for this problem should include a variable \(t\) representing the topic and a variable \(x\) for the document. The topic can be represented by a categorical variable taking \(T\) values, i.e., \(t \in \{1, ... , T\}\), which is observed for supervised learning and latent for unsupervised learning. As for the document, with “bag-of-words” encoding, a set of \(W\) words of interest is selected, and a document is encoded as \(W \times 1\) binary vector \(z = [z_1, ..., z_W]^T\), where \(z_w = 1\) if word \(w\) is contained in the document.[^2]

To start, we could try to use an unstructured directed model defined as

\[
t \sim \text{Cat}(π) \tag{7.1a}
\]

\[
x | t \sim \text{Cat}(t) \tag{7.1b}
\]

[^1]: Strictly speaking, this distinction applies to the frequentist approach, since in the Bayesian approach the model parameters are always treated as unobserved random variables.
[^2]: Note that \(W\) here does not represent a matrix of weights!
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 180

Context: # Probabilistic Graphical Models

Bernoulli naive Bayes model (7.2) for text classification. Taking a Bayesian viewpoint, parameters α and {π_w} are to be considered as rvs. We further assume them to be a priori independent, which is known as the **global independence assumption**. As a result, the joint probability distribution for each document factorizes as

$$
p(z | t, π, \mathbf{w} | α, b) = Dir(π | α, b) \prod_{w=1}^{W} \prod_{t=1}^{T} Beta(π_w | a, b) \times Cat(t | T) \prod_{u=1}^{U} Bern(z_{uw} | π_{uw}).
$$

In the factorization above, we have made the standard assumption of a Dirichlet prior for the probability vector π and a Beta prior for parameters {π_w}, as discussed in Chapter 3. The quantities α, a, b are hyperparameters. Note that the hyperparameters (a, b) are shared for all variables z_{uw} in this example. The corresponding BN is shown in Fig. 7.4, which can be compared to Fig. 7.1 for the corresponding frequentist model.

![Figure 7.4: BN for the Bayesian version of the naive Bayes model (7.11). Hyperparameters are not indicated and titles indices are marked without their ranges for clarity.](figure_url_here)

We invite the reader to consider also the Latent Dirichlet Allocation (LDA) model and the other examples available in the mentioned textbooks.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 207

Context: # 8.5 Summary

Having observed in previous chapters that learning in probabilistic models is often held back by the complexity of exact Bayesian inference for hidden variables, this chapter has provided an overview of approximate, lower-complexity inference techniques. We have focused on MC and VI methods, which are most commonly used. The treatment stressed the impact of design choices in the selection of different types of approximation criteria, such as M- and I-Projection. It also covered the use of approximate inference in learning problems. Topics that improve over the state of the art discussed in this chapter are being actively investigated. Some additional topics for future research are covered in the next chapter.

## Appendix: M-Projection with the Exponential Family

In this appendix, we consider the problem of obtaining the M-projection of a distribution \( p(z \mid \varphi) = Z(\varphi)^{-1} \exp\{u^T(\varphi)\} \) from the exponential family with sufficient statistics \( u(X) \). We will prove that, if there exists a value \( \varphi^* \) of the natural parameter vector that satisfies the moment matching condition (8.14), then \( q(z \mid \varphi) \) is the M-projection.

We first write the KL divergence as:

\[
KL(q(z \| \varphi)) = \ln Z(\varphi) - \varphi^T E_{p(z \mid u)}[u(z)] - H(p). \tag{8.31}
\]

The difference between the KL divergence for a generic parameter vector \( \varphi \) and the vector \( \varphi^* \) satisfying (8.14) can be written as:

\[
KL(q(z \mid \varphi) \| q(z \mid \varphi^*)) = E_{q(z \mid \varphi)}\left[ \ln\left(\frac{q(z \mid \varphi)}{q(z \mid \varphi^*)}\right) \right]
= -KL(q(z \mid \varphi) \| q(z \mid \varphi^*)). \tag{8.32}
\]

Since the latter inequality is non-negative and equal to zero when \( \varphi = \varphi^* \), this choice of the natural parameters minimizes the KL divergence.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 208

Context: # Part V

## Conclusions

- This section summarizes the key findings of the study.
- Conclusions drawn from the data.
- Implications of the findings for future research.

1. **First conclusion point.**
2. **Second conclusion point.**
3. **Third conclusion point.**

### Recommendations

- **Recommendation 1:** Description of recommendation.
- **Recommendation 2:** Description of recommendation.
- **Recommendation 3:** Description of recommendation.

### References

1. Author, A. (Year). *Title of the work*. Publisher.
2. Author, B. (Year). *Title of the work*. Publisher. 

### Acknowledgements

- Acknowledgement of contributions from individuals or organizations.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf

Page: 210

Context: # Concluding Remarks

The data, producing the wrong response upon minor, properly chosen, changes in the explanatory variables. Note that such adversarially chosen examples, which cause a specific machine to fail, are conceptually different from the randomly selected examples that are assumed when defining the generalization properties of a network. There is evidence that finding such examples is possible even without knowing the internal structure of a machine, but solely based on black-box observations [11]. Modifying the training procedure in order to ensure robustness to adversarial examples is an active area of research with important practical implications [55].

## Computing Platforms and Programming Frameworks

In order to scale up machine learning applications, it is necessary to leverage distributed computing architectures and related standard programming frameworks [17, 7]. As a complementary and more futuristic approach, recent work has been proposed to leverage the capabilities of an annealing-based quantum computers as samplers [82] or discrete optimizers [103].

## Transfer Learning

Machines trained for a certain task currently need to be re-trained in order to be re-purposed for a different task. For instance, a machine that learned how to drive a car would need to be retrained in order to learn how to drive a truck. The field of transfer learning covers scenarios in which one wishes to transfer the expertise acquired from some tasks to others. Transfer learning includes different related paradigms, such as multitask learning, lifelong learning, zero-shot learning, and domain adaptation [149]. 

In multitask learning, several tasks are learned simultaneously. Typical solutions for multitask learning based on neural networks preserve the presence of common representations among neural networks trained for different tasks [19]. Lifelong learning utilizes a machine trained on a number of tasks to carry out a new task by leveraging the knowledge accumulated during the previous training phases [143]. Zero-shot learning refers to models capable of recognizing unseen classes with training examples available only for related, but different, classes. This often entails the task of learning representation of classes, such as prototype vectors, that generate data in the class through a fixed probabilistic mechanism [52]. Domain adaptation will be discussed separately in the next point.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 1

Context: # A First Encounter with Machine Learning

**Max Welling**  
*Donald Bren School of Information and Computer Science*  
*University of California Irvine*

November 4, 2011
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 4

Context: ```
# CONTENTS

7.2 A Different Cost function: Logistic Regression .......................... 37  
7.3 The Idea In a Nutshell .................................................. 38  

# 8 Support Vector Machines ................................................ 39  
8.1 The Non-Separable case ................................................. 43  

# 9 Support Vector Regression .............................................. 47  

# 10 Kernel ridge Regression ............................................... 51  
10.1 Kernel Ridge Regression ............................................... 52  
10.2 An alternative derivation ............................................. 53  

# 11 Kernel K-means and Spectral Clustering ............................... 55  

# 12 Kernel Principal Components Analysis ................................ 59  
12.1 Centering Data in Feature Space ....................................... 61  

# 13 Fisher Linear Discriminant Analysis .................................. 63  
13.1 Kernel Fisher LDA .................................................... 66  
13.2 A Constrained Convex Programming Formulation of FDA ................. 68  

# 14 Kernel Canonical Correlation Analysis ................................ 69  
14.1 Kernel CCA ............................................................ 71  

# A Essentials of Convex Optimization ..................................... 73  
A.1 Lagrangians and all that ............................................... 73  

# B Kernel Design .......................................................... 77  
B.1 Polynomials Kernels .................................................... 77  
B.2 All Subsets Kernel ...................................................... 78  
B.3 The Gaussian Kernel ..................................................... 79  
```
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 6

Context: # PREFACE

about 60% correct on 100 categories, the fact that we pull it off seemingly effortlessly serves as a “proof of concept” that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data.

The first reason for the recent successes of machine learning and the growth of the field as a whole is rooted in its multidisciplinary character. Machine learning emerged from AI but quickly incorporated ideas from fields as diverse as statistics, probability, computer science, information theory, convex optimization, control theory, cognitive science, theoretical neuroscience, physics, and more. To give an example, the main conference in this field is called: **advances in neural information processing systems**, referring to information theory and theoretical neuroscience and cognitive science.

The second, perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. While the field is built on theory and tools developed statistics machine learning recognizes that the most exciting progress can be made to leverage the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text, and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data all previous satellites produced together. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on such larger data-sets (the latter sometimes called **data-mining**). Hence the emphasis shifted to algorithmic efficiency and as a result many machine learning faculty (like myself) can typically be found in computer science departments. To give some examples of recent successes of this approach one would only have to turn on one computer and perform an internet search. Modern search engines do not run terribly sophisticated algorithms, but they manage to store and sift through almost the entire content of the internet to return sensible search results. There has also been much success in the field of machine translation, not because a new model was invented but because many more translated documents became available.

The field of machine learning is multifaceted and expanding fast. To sample a few sub-disciplines: statistical learning, kernel methods, graphical models, artificial neural networks, fuzzy logic, Bayesian methods, and so on. The field also covers many types of learning problems, such as supervised learning, supervised learning, semi-supervised learning, active learning, reinforcement learning, etc. I will only cover the most basic approaches in this book from a highly per
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 7

Context: # MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING

This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). I would like to thank Hans for discussion on intuition. I also thank Prof. Bert Kappen, who leads an excellent group of postdocs and students for his hospitality. Marga, kids, UCI,...

---

There are a few main aspects I want to cover from a personal perspective. Instead of trying to address every aspect of the entire field, I have chosen to present a few popular and perhaps useful tools and approaches. What will (hopefully) be significantly different than most other scientific books is the manner in which I will present these methods. I have always been frustrated by the lack of proper explanation of equations. Many times, I have been staring at a formula without the slightest clue where it came from or how it was derived. Many books also excel in stating facts in an almost encyclopedic style, without providing the proper intuition of the method. This is my primary mission: to write a book that conveys intuition. The first chapter will be devoted to why I think this is important.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 10

Context: # LEARNING AND INTUITION

Baroque features or a more “dull” representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvelous book “XXX” (Hardamard).

By building accurate visual representations of abstract ideas, we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way, I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her database which therefore needs no and very little updating.

Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you’ll find that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work, suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with.

In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 11

Context: Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored, I recommend moving on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see [books](#)). But I hope that for most beginning students, this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun!
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 13

Context: # Chapter 1

## Data and Information

Data is everywhere in abundant amounts. Surveillance cameras continuously capture video; every time you make a phone call your name and location get recorded. Often, your clicking pattern is recorded when surfing the web. Most financial transactions are recorded; satellites and observatories generate terabytes of data every year; the FBI maintains a DNA database of most convicted criminals. Soon, all written text from our libraries is digitized. Need I go on?

But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? When analyzing data, we typically have a specific question in mind, such as: 

- “How many types of car can be discerned in this video?” 
- “What will be the weather next week?”

The answer can take the form of a single number (there are 5 cars), or a sequence of numbers (or the temperature next week) or a complicated pattern (the cloud configuration next week). If the answer to our query is complex, we like to visualize it using graphs, bar plots, or even little movies. One should keep in mind that the particular analysis depends on the task one has in mind.

Let me spell out a few tasks that are typically considered in machine learning:

### Prediction

Here we ask ourselves whether we can extrapolate the information in the data to new use cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold, etc., and another database of attributes for Ferraris, then we can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather given all the recorded weather patterns in the past: can we predict the weather next week, or the stock prices?
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 16

Context: # CHAPTER 1. DATA AND INFORMATION

## 1.2 Preprocessing the Data

As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in Figure 1. The algorithm we consist of estimating the area that the data occupy. It grows a circle starting at the origin and at the point it contains all the data we record the area of a circle. In the figure why this will be a bad estimate: the data-cloud is not centered. If we had first centered it we would have obtained reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data. To center data we will introduce the sample mean of the data, given by,

\[ 
E[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in} \tag{1.1} 
\]

Hence, for every attribute \(i\) separately, we simple add all the attribute value across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set,

\[ 
X'_{in} = X_{in} - E[X_i], \quad \forall n 
\tag{1.2} 
\]

It is now easy to check that the sample mean of \(X'\) indeed vanishes. An illustration of the global shift is given in Figure 2. We also see in this figure that the algorithm described above now works much better!

In a similar spirit as centering, we may also wish to scale the data along the coordinate axis in order to make it more “spherical.” Consider Figure 3. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension. To define this operation we first introduce the notion of sample variance,

\[ 
V[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in}^2 \tag{1.3} 
\]

where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative signs might cancel each other out. By first taking the square, all data-cases first get mapped to positive half of the axes (for each dimension or
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 20

Context: # CHAPTER 2. DATA VISUALIZATION

An example of such a scatter plot is given in Figure ??.

Note that we have a total of \( d(d - 1) / 2 \) possible two-dimensional projections which amounts to 4950 projections for 100-dimensional data. This is usually too many to manually inspect. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out to be not a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the **central limit theorem** which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote with \( \mathbf{w} \) a vector in \( \mathbb{R}^d \) and by \( x \) the d-dimensional random variable, then \( y = \mathbf{w}^T \mathbf{x} \) is the value of the projection. This is clearly a weighted sum of the random variables \( x_i, \; i = 1, \ldots, d \). If we assume that \( x_i \) are approximately independent, then we can see that their sum will be governed by its central limit theorem. Analogously, a dataset \( \{X_n\} \) can thus be visualized in one dimension by "histogramming" the values of \( y = \mathbf{w}^T \mathbf{x} \), see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data.

In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions and the Gaussian distribution is one of the few distributions which have very nice analytical properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of "uninformative" can actually be made very precise using information theory and states: 

> Given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance.

This is rather unfortunate for our purposes because Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure.

A large number of algorithms has been devised to search for informative projections. The simplest being "principal component analysis" or PCA for short ??. Here, interesting means dimensions of high variance. However, the fact that high variance is not always a good measure of interestingness and one should rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimen-
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 23

Context: # Chapter 3

## Learning

This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter.

Ok, let’s start with an example. Alice has a rather strange ailment. She is not able to recognize objects by their visual appearance. At home, she is doing just fine: her mother explained to Alice for every object in her house what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room, she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting." 

What is the matter with Alice? Nothing is wrong with her memory because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object, she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice’s disease is so rare is understandable; there must have been a strong selection pressure against this disease. 

Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to infer the possibility that this animal may be dangerous logically. Alice’s contemporaries noticed that the animal was yellow-brown, had manes, etc., and immediately un-
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 27

Context: # 3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. 

Learning = generalization = abstraction = compression.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 28

Context: # CHAPTER 3. LEARNING

## Introduction

Learning involves acquiring knowledge, skills, attitudes, or competencies. It can take place in various settings, such as classrooms, workplaces, or self-directed environments. This chapter discusses the types and processes of learning.

## Types of Learning

1. **Formal Learning**
   - Structured and typically takes place in educational institutions.
   - Includes degrees, diplomas, and certifications.

2. **Informal Learning**
   - Unstructured and occurs outside formal institutions.
   - Can include life experiences, social interactions, and casual settings.

3. **Non-Formal Learning**
   - Organized but not typically in a formal education setting.
   - Often community-based, such as workshops and training sessions.

## Learning Processes

- **Cognitive Learning**: Involves mental processes and understanding. 
- **Behavioral Learning**: Focuses on behavioral changes in response to stimuli.
- **Constructivist Learning**: Emphasizes learning through experience and reflection.

## Table of Learning Theories

| Theory                    | Key Contributor      | Description                                      |
|--------------------------|----------------------|--------------------------------------------------|
| Behaviorism              | B.F. Skinner         | Learning as a change in behavior due to reinforcement. |
| Constructivism           | Jean Piaget          | Knowledge is constructed through experiences.    |
| Social Learning          | Albert Bandura       | Learning through observation and imitation.      |

## Conclusion

Understanding the different types and processes of learning can help educators and learners optimize educational experiences. Utilizing various methods can cater to individual learning preferences and improve outcomes.

## References

- Smith, J. (2020). *Learning Theories: A Comprehensive Approach*. Educational Press.
- Doe, A. (2019). *Techniques for Effective Learning*. Learning Publishers.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 31

Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generally also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. 

This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it?

Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would be pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze. Given that it needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning".

One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem if recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a fan of a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don’t have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it is hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning".
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 32

Context: # Chapter 4: Types of Machine Learning

## 4.1 In a Nutshell

There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes, unsupervised learning tries to discover interesting structure in data, semi-supervised learning uses both labeled and unlabeled data to improve predictive performance, reinforcement learning can handle simple feedback in the form of delayed reward, active learning optimizes the next sample to include in the learning algorithm, and multi-task learning deals with sharing common model components between related learning tasks.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 33

Context: # Chapter 5

## Nearest Neighbors Classification

Perhaps the simplest algorithm to perform classification is the **k nearest neighbors (kNN classifier)**. As usual, we assume that we have data of the form \(\{X_n, Y_n\}\) where \(X_n\) is the value of attribute \(i\) for data-case \(n\) and \(Y_n\) is the label for data-case \(n\). We also need a measure of similarity between data-cases, which we will denote with \(K(X_n, X_m)\) where larger values of \(K\) denote more similar data-cases.

Given these preliminaries, classification is embarrassingly simple: when you are provided with the attributes \(X_k\) for a new (unseen) test-case, you first find the \(k\) most similar data-cases in the dataset by computing \(K(X_k, X_n)\) for all \(n\). Call this set \(S\). Then, each of these \(k\) most similar neighbors in \(S\) can cast a vote on the label of the test case, where each neighbor predicts that the test case has the same label as itself. Assuming binary labels and an odd number of neighbors, this will always result in a decision.

Although kNN algorithms are often associated with this simple voting scheme, more sophisticated ways of combining the information of these neighbors are allowed. For instance, one could weigh each vote by the similarity to the test-case. This results in the following decision rule:

\[
Y_i = 1 \quad \text{if} \quad \sum_{n \in S} K(X_k, X_n)(2Y_n - 1) > 0 \tag{5.1}
\]

\[
Y_i = 0 \quad \text{if} \quad \sum_{n \in S} K(X_k, X_n)(2Y_n - 1) < 0 \tag{5.2}
\]

\[
\text{(5.3)}
\]

and flipping a coin if it is exactly 0.

Why do we expect this algorithm to work intuitively? The reason is that we expect data-cases with similar labels to cluster together in attribute space.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 40

Context: 28  
# CHAPTER 6. THE NATIVE BAYESIAN CLASSIFIER

For ham emails, we compute exactly the same quantity,

$$ P_{ham}(X_i = j) = \frac{ \# \text{ ham emails for which the word } j \text{ was found} }{ \text{total } \# \text{ of ham emails} } $$ (6.5)

$$ = \frac{\sum_{n} \mathbb{I}(X_{n} = j \land Y_{n} = 0)}{ \sum_{n} \mathbb{I}(Y_{n} = 0)} $$ (6.6)

Both these quantities should be computed for all words or phrases (or more generally attributes).

We have now finished the phase where we estimate the model from the data. We will often refer to this phase as "learning" or training a model. The model helps us understand how data was generated in some approximate setting. The next phase is that of prediction or classification of new email.

## 6.3 Class-Prediction for New Instances

New email does not come with a label ham or spam (if it would we could throw spam in the spam-box right away). What we do see are the attributes $\{X_i\}$. Our task is to guess the label based on the model and the measured attributes. The approach we take is simple: calculate whether the email has a higher probability of being generated from the spam or the ham model. For example, because the word "viagra" has a tiny probability of being generated under the ham model it will end up with a higher probability under the spam model. But clearly, all words have a say in this process. It’s like a large committee of experts, one for each word; each member casts a vote and can say things like: "I am 99% certain its spam", or "It’s almost definitely not spam (0.1% spam)". Each of these opinions will be multiplied together to generate a final score. We then figure out whether ham or spam has the highest score.

There is one little practical caveat with this approach, namely that the product of a large number of probabilities, each of which is necessarily smaller than one, very quickly gets so small that your computer can’t handle it. There is an easy fix though. Instead of multiplying probabilities as scores, we use the logarithms of those probabilities and add the logarithms. This is numerically stable and leads to the same conclusion because if \(a > b\) then we also have \(\log(a) > \log(b)\) and vice versa. In equations we compute the score as follows:

$$ S_{spam} = \sum_{i} \log P_{spam}(X_i = e_i) + \log P(spam) $$ (6.7)
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 43

Context: # 6.5 Remarks

One of the main limitations of the NB classifier is that it assumes independence between attributes (This is presumably the reason why we call it the naive Bayesian classifier). This is reflected in the fact that each classifier has an independent vote in the final score. However, imagine that I measure the words, "home" and "mortgage". Observing "mortgage" certainly raises the probability of observing "home". We say that they are positively correlated. It would therefore be more fair if we attributed a smaller weight to "home" if we already observed mortgage because they convey the same thing: this email is about mortgages for your home. One way to obtain a more fair voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy because too many parameters may lead to overfitting.

# 6.6 The Idea In a Nutshell

Consider Figure ??. We can classify data by building a model of how the data was generated. For NB we first decide whether we will generate a data-item from class \( Y = 0 \) or class \( Y = 1 \). Given that decision we generate the values for \( D \) attributes independently. Each class has a different model for generating attributes. Classification is achieved by computing which model was more likely to generate the new data-point, biasing the outcome towards the class that is expected to generate more data.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 44

Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER

## Introduction

The Naive Bayesian Classifier is a simple yet powerful algorithm based on Bayes' theorem, used extensively in machine learning for classification tasks.

## Key Concepts

1. **Bayes' Theorem**: 
   \[
   P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
   \]
   
2. **Assumptions**:
   - Features are statistically independent given the class label.
   - The prior distribution of each class.

## Types of Naive Bayes Classifiers

- **Gaussian Naive Bayes**: Assumes continuous data follows a normal distribution.
- **Multinomial Naive Bayes**: Primarily used for discrete features, such as word counts.
- **Bernoulli Naive Bayes**: Similar to multinomial but assumes binary features.

## Classification Process

1. **Calculate Prior Probabilities**:
   \[
   P(C_k) = \frac{N_k}{N}
   \]

2. **Calculate Likelihoods**:
    - For Gaussian:
    \[
    P(x|C_k) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
    \]

3. **Apply Bayes' Theorem**:
    \[
    P(C_k|x) = \frac{P(x|C_k) \cdot P(C_k)}{P(x)}
    \]

4. **Choose Class with Maximum Posterior**:
   \[
   \hat{C} = \arg\max_{C_k} P(C_k|x)
   \]

## Advantages

- Easy to implement.
- Requires a small amount of training data.
- Performs well in multi-class problems.

## Disadvantages

- Assumption of feature independence is often violated in real-world applications.
- Poor performance when features are highly correlated.

## Conclusion

The Naive Bayesian Classifier serves as an excellent baseline for classification tasks and is crucial for understanding more advanced techniques in machine learning.

## References

- [Pattern Recognition and Machine Learning](http://www.example.com)
- [Machine Learning: A Probabilistic Perspective](http://www.example.com)
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 54

Context: 42  
# CHAPTER 8. SUPPORT VECTOR MACHINES  

The theory of duality guarantees that for convex problems, the dual problem will be concave, and moreover, that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have:  
\( L_P(w^*) = L_D(\alpha^*) \), i.e. the “duality-gap” is zero.  

Next we turn to the conditions that must necessarily hold at the saddle point and thus the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives w.r.t to zero. Also, the constraints themselves are part of these conditions and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important constraint called “complementary slackness” needs to be satisfied.  

\[
\begin{align*}
\partial_w L_P = 0 & \rightarrow \quad w - \sum_i \alpha_i y_i x_i = 0 \quad (8.12) \\
\partial_{\alpha} L_P = 0 & \rightarrow \sum_i \alpha_i y_i = 0 \quad (8.13) \\
\text{constraint - 1} & \quad y_i (w^T x_i - b) - 1 \geq 0 \quad (8.14) \\
\text{multiplier condition} & \quad \alpha_i \geq 0 \quad (8.15) \\
\text{complementary slackness} & \quad \alpha_i [y_i (w^T x_i - b) - 1] = 0 \quad (8.16) 
\end{align*}
\]

It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied, but not saturated: \( y_i (w^T x_i - b) - 1 > 0 \) in which case \( \alpha_i \) for that data-case must be zero, or the inequality constraint is saturated \( y_i (w^T x_i - b) - 1 = 0 \), in which case \( \alpha_i \) can be any value \( \geq 0 \). Inequality constraints which are saturated are said to be “active”, while unsaturated constraints are inactive. One could imagine the process of searching for a solution as a ball which runs down the primary objective function using gradient descent. At some point, it will hit a wall which is the constraint and although the derivative is still pointing partially towards the wall, the constraints prohibits the ball to go on. This is an active constraint because the ball is glued to that wall. When a final solution is reached, we could remove some constraints, without changing the solution, these are inactive constraints. One could think of the term \( \partial_{\alpha} L_P \) as the force acting on the ball. We see from the first equation above that only the forces with \( \alpha_i \neq 0 \) exert a force on the ball that balances with the force from the curved quadratic surface \( w \).  

The training cases with \( \alpha_i > 0 \), representing active constraints on the position of the support hyperplane are called support vectors. These are the vectors.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf

Page: 70

Context: # CHAPTER 11: KERNEL K-MEANS AND SPECTRAL CLUSTERING

## 11.1 Introduction

Kernel K-means is an extension of the K-means clustering algorithm. It allows for clustering in a feature space that is not linearly separable, using a kernel function to transform the original data.

## 11.2 Algorithm

The steps of the Kernel K-means algorithm are as follows:

1. **Choose a kernel function** \( K(x_i, x_j) \).
2. **Initialize** \( k \) centroids randomly.
3. **Assign points** to the nearest centroid using the kernel.
4. **Update centroids** by computing the mean of the assigned points in the feature space.
5. **Repeat** steps 3 and 4 until convergence.

## 11.3 Spectral Clustering

Spectral clustering is another clustering technique that uses the eigenvalues of the similarity matrix of the data.

### 11.3.1 Algorithm

1. Compute the **similarity matrix**.
2. Compute the **Laplacian matrix** from the similarity matrix.
3. Compute the eigenvalues and eigenvectors of the Laplacian matrix.
4. Use the top \( k \) eigenvectors to form a new feature space.
5. Apply K-means to cluster the data in the new space.

## 11.4 Conclusion

Kernel K-means and spectral clustering provide powerful methods for clustering in complex spaces, enabling better data segmentation in many applications.

## References

- *Author Name*, *Title of the Source*, Year
- *Author Name*, *Title of the Source*, Year

For further reading, explore the works on kernel methods and spectral analysis in clustering contexts.
##########

"""QUERY: can you give me 5 sentence summary with your context?"""

Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown.Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead!


Final Files Sources: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 1, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 3, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 5, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 8, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 11, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 13, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 14, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 23, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 26, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 39, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 40, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 50, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 66, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 67, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 68, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 75, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 78, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 81, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 85, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 88, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 93, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 97, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 99, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 106, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 116, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 117, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 123, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 128, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 130, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 138, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 144, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 145, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 147, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 158, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 167, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 171, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 201, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 204, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 206, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 219, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 226, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 227, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 229, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 231, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 233, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 236, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 237, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 257, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 1, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 2, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 4, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 11, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 13, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 16, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 17, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 19, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 20, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 22, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 23, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 50, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 54, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 55, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 58, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 77, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 80, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 82, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 92, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 106, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 113, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 115, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 125, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 133, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 138, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 169, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 173, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 180, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 207, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 208, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 210, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 1, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 4, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 6, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 7, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 10, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 11, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 13, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 16, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 20, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 23, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 27, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 28, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 32, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 33, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 40, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 43, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 44, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 54, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 70
==================================================
**Elapsed Time: 2.98 seconds**
==================================================


FINAL ANSWER 
Answer: "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies is a comprehensive guide that explores various fundamental topics in discrete mathematics, including sets, functions, relations, probability, and structures like graphs and trees. The book is designed to be accessible and engaging, with a focus on practical applications in fields such as computer science and information technology. It emphasizes the importance of understanding abstract, conceptual objects and their properties, and provides numerous examples and exercises to reinforce learning. The text also discusses the philosophical underpinnings of probability and different approaches to learning, such as frequentist and Bayesian methods. Overall, the book aims to provide a solid foundation in discrete mathematics, essential for modern scientific and technological applications.
==================================================
**Elapsed Time: 1.42 seconds**
==================================================