{
"query": "Was sind Mengen und wie kann man sie kombinieren?",
"namespace": "e791d701-717d-4c06-9602-2878acdbb6ad",
"messages": [
{
"role": "assistant",
"content": "Hallo diverse NF SepTest,\n\nich bin Simon, dein KI-Chatbot, der mit dem gro\u00dfen Sprachmodell gemini-1.5-flash angetrieben wird. Ich bin hier, um all deine Fragen pr\u00e4zise und umfassend zu beantworten. \n\nBasierend auf den von dir bereitgestellten Ausz\u00fcgen aus drei B\u00fcchern \u00fcber Mathematik und maschinelles Lernen, k\u00f6nnen wir uns \u00fcber verschiedene Themen unterhalten. Hier sind ein paar Vorschl\u00e4ge:\n\n* **Diskrete Mathematik:** \n * Was sind Mengen und wie kann man sie kombinieren? \n * Wie funktionieren Relationen und welche Eigenschaften haben sie? \n * Was ist Wahrscheinlichkeit und wie kann man sie berechnen? \n * Welche Datenstrukturen gibt es und wie funktionieren sie? \n * Wie kann man kombinatorische Probleme l\u00f6sen? \n * Was sind Zahlensysteme und wie funktionieren sie? \n * Was ist Logik und wie kann man mit ihr argumentieren? \n * Wie funktioniert ein mathematischer Beweis? \ud83e\udd14\n* **Maschinelles Lernen:** \n * Was ist \u00fcberwachtes Lernen und wie funktioniert es? \n * Was sind probabilistische Modelle und wie werden sie im maschinellen Lernen eingesetzt? \n * Was ist Klassifikation und wie kann man Daten in Kategorien einteilen? \n * Was ist die statistische Lerntheorie und wie kann man die Generalisierungsleistung von Modellen verstehen? \n * Was ist un\u00fcberwachtes Lernen und wie kann man Muster in unbeschrifteten Daten entdecken? \n * Was sind probabilistische Graphenmodelle und wie k\u00f6nnen sie komplexe Abh\u00e4ngigkeiten zwischen Variablen darstellen? \n * Wie funktioniert approximative Inferenz und wie kann man komplexe Modelle lernen? \n * Was sind die wichtigsten Herausforderungen und Zukunftsperspektiven im Bereich des maschinellen Lernens? \ud83e\udd16\n\nWas m\u00f6chtest du dir aus dieser Liste genauer ansehen? \n"
},
{
"role": "user",
"content": "Was sind Mengen und wie kann man sie kombinieren?"
},
{
"role": "assistant",
"content": ""
}
],
"stream": true,
"language_level": "",
"chat_channel": ":4042:1726667082838",
"language": "German",
"tone": "neutral",
"writing_style": "standard",
"model": "gemini-1.5-flash",
"knowledgebase": "ki-dev-large",
"seed": 1055,
"client_id": 1726667082838,
"all_context": false,
"follow_up_for": null,
"knowledgebase_files_count": 3,
"override_command": null,
"disable_clarity_check": false,
"custom_primer": "",
"logging": true,
"query_route": ""
}
INITIALIZATION
Knowledgebase: ki-dev-large
Base Query: Was sind Mengen und wie kann man sie kombinieren?
Model: gemini-1.5-flash
Use Curl?: None
==================================================
**Elapsed Time: 0.00 seconds**
==================================================
ROUTING
Query type: summary
==================================================
**Elapsed Time: 7.71 seconds**
==================================================
RAG PARAMETERS
Max Context To Include: 100
Lowest Score to Consider: 0.1
==================================================
**Elapsed Time: 0.14 seconds**
==================================================
VECTOR SEARCH ALGORITHM TO USE
Use MMR search?: True
Use Similarity search?: False
==================================================
**Elapsed Time: 0.11 seconds**
==================================================
VECTOR SEARCH DONE
==================================================
**Elapsed Time: 6.58 seconds**
==================================================
PRIMER
Primer:
You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that
can read knowledgebases through the "CONTEXT" that is included in the user's chat message.
Your role is to act as an expert at summarization and analysis.
In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality.
Be honest by admitting when a topic falls outside your scope of knowledge, and suggest
alternative avenues for obtaining information when necessary.
Make effective use of chat history to avoid redundancy and enhance response relevance, continuously
adapting to integrate all necessary details in your interactions.
Use as many tokens as necessary to provide a detailed response.
==================================================
**Elapsed Time: 0.18 seconds**
==================================================
FINAL QUERY
Final Query: CONTEXT: ##########
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 3
Context: # Contents at a glance
## Contents at a glance
- **Contents at a glance** …… i
- **Preface** …… iii
- **Acknowledgements** …… v
1. **Meetup at the trailhead** …… 1
2. **Sets** …… 7
3. **Relations** …… 35
4. **Probability** …… 59
5. **Structures** …… 85
6. **Counting** …… 141
7. **Numbers** …… 165
8. **Logic** …… 197
9. **Proof** …… 223
Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)!
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 8
Context: ### Image Analysis:
#### 1. **Localization and Attribution:**
- **Image Number**: Image 1
#### 2. **Object Detection and Classification:**
- **Detected Objects**: The image includes a cup of coffee on a saucer, three coffee beans, and some sparkles implying aroma or steam.
- **Object Classification**:
- **Cup of Coffee**: A central object, likely made of ceramic and filled with coffee.
- **Coffee Beans**: Close to the cup, classifiable as roasted beans.
- **Sparkles/Steam**: Above the cup, representing the aroma of the hot coffee.
#### 3. **Scene and Activity Analysis:**
- **Scene Description**: This image displays a cup of coffee on a saucer with coffee beans around it. Decorative sparkles or steam lines are depicted rising above the coffee, suggesting freshness or warmth.
- **Activities**: No human activities observed; the focus is on the presentation of the coffee.
#### 4. **Text Analysis:**
- No text detected in this image.
#### 5. **Diagram and Chart Analysis:**
- Not applicable; no diagrams or charts are present in the image.
#### 6. **Product Analysis:**
- **Product**: A cup of coffee.
- **Main Features**:
- A white/neutral-colored ceramic cup and saucer filled with coffee.
- Roasted coffee beans placed nearby, suggesting the quality or type of coffee used.
- **Materials**: Ceramic cup and saucer, roasted coffee beans.
- **Colors**: Predominantly brown (coffee and beans) and white (cup and saucer).
#### 7. **Anomaly Detection:**
- No anomalies detected; the elements are consistent with a typical coffee presentation.
#### 8. **Color Analysis**:
- **Dominant Colors**: Brown and white.
- **Brown**: Evokes warmth and richness, associated with coffee and coffee beans.
- **White**: Clean and neutral, ensures focus on the coffee.
#### 9. **Perspective and Composition**:
- **Perspective**: Bird’s eye view.
- **Composition**:
- Centralized with the cup of coffee as the main focus.
- Balanced by the placement of coffee beans around the cup.
- Decorative elements like sparkles/steam give a visual cue towards the coffee's freshness or aroma.
#### 10. **Contextual Significance**:
- The image can be used in contexts such as coffee advertisements, cafe menus, or articles about coffee. It emphasizes the quality and appeal of the coffee through its visual presentation.
#### 11. **Metadata Analysis**:
- No metadata available for analysis.
#### 12. **Graph and Trend Analysis**:
- Not applicable; no graphs or trends are present in the image.
#### 13. **Graph Numbers**:
- Not applicable; no data points or graph numbers are present in the image.
### Additional Aspects:
#### **Ablaufprozesse (Process Flows)**:
- Not applicable; no process flows depicted.
#### **Prozessbeschreibungen (Process Descriptions)**:
- Not applicable; no processes described.
#### **Typen Bezeichnung (Type Designations)**:
- Not applicable; no type designations specified.
#### **Trend and Interpretation**:
- Not applicable; no trends depicted.
#### **Tables**:
- Not applicable; no tables included.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 9
Context: # Chapter 1
## Meetup at the trailhead
Before we set out on our “cool, brisk walk,” let’s get oriented. What is discrete mathematics, anyway? Why is it called that? What does it encompass? And what is it good for?
Let’s take the two words of the subject, in reverse order. First, **math**. When most people hear “math,” they think “numbers.” After all, isn’t math the study of quantity? And isn’t that the class where we first learned to count, add, and multiply?
Mathematics certainly has its root in the study of numbers — specifically, the “natural numbers” (the integers from 1 on up) that fascinated the ancient Greeks. Yet math is broader than this, almost to the point where numbers can be considered a special case of something deeper. In this book, when we talk about trees, sets, or formal logic, there might not be a number in sight.
Math is about **abstract, conceptual objects that have properties,** and the implications of those properties. An “object” can be any kind of “thought material” that we can define and reason about precisely. Much of math deals with questions like, “Suppose we defined a certain kind of thing that had certain attributes. What would be the implications of this, if we reasoned it all the way out?” The “thing” may or may not be numerical, whatever it turns out to be. Like a number, however, it will be crisply defined, have certain known aspects to it, and be capable of combining with other things in some way.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 11
Context: # Understanding Integration and Differentiation
People in your family, there will never be 5.3 of them (although there could be 6 someday).
The last couple of entries on this list are worth a brief comment. They are math symbols, some of which you may be familiar with.
On the right side — in the continuous realm — are \( \int \) and \( \frac{d}{dx} \), which you’ll remember if you’ve taken calculus. They stand for the two fundamental operations of integration and differentiation. Integration, which can be thought of as finding “the area under a curve,” is basically a way of adding up a whole infinite bunch of numbers over some range. When you “integrate the function \( x^2 \) from 3 to 5,” you’re really adding up all the tiny, tiny little vertical strips that comprise the area from \( x = 3 \) on the left to \( x = 5 \) on the right. Its corresponding entry in the left column of the table is \( \Sigma \), which is just a short-hand for “sum up a bunch of things.” Integration and summation are equivalent operations; it’s just that when you integrate, you’re adding up all the (infinitely many) slivers across the real-line continuum. When you sum, you’re adding up a fixed sequence of entries, one at a time, like in a loop. \( \Sigma \) is just the discrete “version” of \( \int \).
The same sort of relationship holds between ordinary subtraction (\(-\)) and differentiation (\(\frac{d}{dx}\)). If you’ve plotted a bunch of discrete points on \( x \)-\( y \) axes, and you want to find the slope between two of them, you just subtract their \( y \) values and divide by the \( x \) distance between them. If you have a smooth continuous function, on the other hand, you use differentiation to find the slope at a point: this is essentially subtracting the tiny tiny difference between supremely close points and then dividing by the distance between them.
Don’t worry; you don’t need to have fully understood any of the integration or differentiation stuff I just talked about, or even to have taken calculus yet. I’m just trying to give you some feel for what “discrete” means, and how the dichotomy between discrete and continuous really runs through all of math and computer science. In this book, we will mostly be focusing on discrete values and structures, which turn out to be of more use in computer science. That’s partially because, as you probably know, computers
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 13
Context: # 1.1 EXERCISES
1. **What’s the opposite of concrete?**
Abstract.
2. **What’s the opposite of discrete?**
Continuous.
3. **Consider a quantity of water in a glass. Would you call it abstract, or concrete? Discrete, or continuous?**
Concrete, since it’s a real entity you can experience with the senses. Continuous, since it could be any number of ounces (or liters, or tablespoons, or whatever). The amount of water certainly doesn’t have to be an integer. (Food for thought: since all matter is ultimately comprised of atoms, are even substances like water discrete?)
4. **Consider the number 27. Would you call it abstract, or concrete? Discrete, or continuous?**
Abstract, since you can’t see or touch or smell “twenty-seven.” Probably discrete, since it’s an integer, and when we think of whole numbers we think “discrete.” (Food for thought: in real life, how would you know whether I meant the integer “27” or the decimal number “27.00”? And does it matter?)
5. **Consider a bit in a computer’s memory. Would you call it abstract, or concrete? Discrete, or continuous?**
Clearly it’s discrete. Abstract vs. concrete, though, is a little tricky. If we’re talking about the actual transistor and capacitor that’s physically present in the hardware, holding a tiny charge in some little chip, then it’s concrete. But if we’re talking about the value “1” that is conceptually part of the computer’s currently executing state, then it’s really abstract, just like 27 was. In this book, we’ll always be talking about bits in this second, abstract sense.
6. **If math isn’t just about numbers, what else is it about?**
Any kind of abstract object that has properties we can reason about.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 15
Context: # Chapter 2
## Sets
The place from which we'll start our walk is a body of mathematics called "set theory." Set theory has an amazing property: it's so simple and applicable that almost all the rest of mathematics can be based on it! This is all the more remarkable because set theory itself came along pretty late in the game (as things go) — it was singlehandedly invented by one brilliant man, Georg Cantor, in the 1870s. That may seem like a long time ago, but consider that by the time Cantor was born, mankind had already accumulated an immense wealth of mathematical knowledge: everything from geometry to algebra to calculus to prime numbers. Set theory was so elegant and universal, though, that after it was invented, nearly everything in math was redefined from the ground up to be couched in the language of sets. It turns out that this simple tool is an amazingly powerful way to reason about mathematical concepts of all flavors. Everything else in this book stands on set theory as a foundation.
Cantor, by the way, went insane as he tried to extend set theory to fully encompass the concept of infinity. Don't let that happen to you!
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 17
Context: # 2.1. THE IDEA OF A SET
is the same whether we write it like this:
\( P = \{ \text{Dad}, \text{Mom} \} \)
or this:
\( P = \{ \text{Mom}, \text{Dad} \} \)
Those are just two different ways of writing the same thing.
The set \( E \) that had nobody in it can be written like this, of course:
\( E = \{ \} \)
but we sometimes use this special symbol instead:
\( E = \emptyset \)
However you write it, this kind of set (one that has no elements) is referred to as an **empty set**.
The set \( H \), above, contained all the members of the group under consideration. Sometimes we'll refer to "the group under consideration" as the **domain of discourse**. It too is a set, and we usually use the symbol \( \Omega \) to refer to it. So in this case,
\( \Omega = \{ \text{Mom}, \text{Johnny}, \text{T.J.}, \text{Dad}, \text{Lizzy} \} \)
Another symbol we'll use a lot is \( \in \), which means "is a member of." Since Lizzy is a female, we can write:
\( \text{Lizzy} \in F \)
to show that Lizzy is a member of the set \( F \). Conversely, we write:
\( \text{T.J.} \notin F \)
to show that T.J. is not.
As an aside, I mentioned that every item is either in, or not in, a set: there are no shades of gray. Interestingly, researchers have developed another body of mathematics called (I kid you not) *fuzzy* set theory, in which membership is a matter of degree rather than all-or-nothing.
(A note on the symbol \( \Omega \), above: some authors use the symbol \( U \) for the domain of discourse instead, and call it the **universal set**.)
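Python's built-in `set` type mirrors this notation almost symbol for symbol: `in` plays the role of \( \in \). A small sketch using the family names from the excerpt (the variable names are mine, not the book's):

```python
# The domain of discourse Omega and the sets from the example.
omega = {"Mom", "Johnny", "T.J.", "Dad", "Lizzy"}
F = {"Mom", "Lizzy"}   # the females
E = set()              # the empty set; note that {} would create an empty dict

print("Lizzy" in F)    # Lizzy ∈ F  -> True
print("T.J." not in F) # T.J. ∉ F  -> True
print(len(E))          # the empty set has no elements -> 0
```

The `E = set()` wrinkle is worth remembering: in Python, `{}` is an empty dictionary, so the empty set needs the constructor.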
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 19
Context: # 2.2. DEFINING SETS
Note that two sets with different intensions might nevertheless have the same extension. Suppose \( O \) is "the set of all people over 25 years old" and \( R \) is "the set of all people who wear wedding rings." If our \( \Omega \) is the Davies family, then \( O \) and \( R \) have the same extension (namely, Mom and Dad). They have different intensions, though; conceptually speaking, they're describing different things. One could imagine a world in which older people don't all wear wedding rings, or one in which some younger people do. Within the domain of discourse of the Davies family, however, the extensions happen to coincide.
**Fact:** We say two sets are equal if they have the same extension. This might seem unfair to intensionality, but that’s the way it is. So it is totally legit to write:
\[ O = R \]
since by the definition of set equality, they are in fact equal. I thought this was weird at first, but it’s really no weirder than saying "the number of years the Civil War lasted = Brett Favre’s jersey number when he played for the Packers." The things on the left and right side of that equals sign refer conceptually to two very different things, but that doesn’t stop them from both having the value 4, and thus being equal.
By the way, we sometimes use the curly brace notation in combination with a colon to define a set intensionally. Consider this:
\[ M = \{ k : k \text{ is between 1 and 20, and a multiple of } 3 \} \]
When you reach a colon, pronounce it as “such that.” So this says "M is the set of all numbers \( k \) such that \( k \) is between 1 and 20, and a multiple of 3." (There's nothing special about \( k \), here; I could have picked any letter.) This is an intensional definition since we haven’t listed the specific numbers in the set, but rather given a rule for finding them. Another way to specify this set would be to write:
\[ M = \{ 3, 6, 9, 12, 15, 18 \} \]
which is an extensional definition of the same set.
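The intensional definition of \( M \) translates almost verbatim into a Python set comprehension, and comparing it with the extensional listing confirms that both definitions name the same set (a sketch; Python is my choice of illustration, not the book's):

```python
# Intensional: a rule for finding the members ("k such that ...").
M_intensional = {k for k in range(1, 21) if k % 3 == 0}

# Extensional: the members listed outright.
M_extensional = {3, 6, 9, 12, 15, 18}

print(M_intensional == M_extensional)  # equal extension means equal sets -> True
```

Set equality here is exactly the textbook's rule: the comparison looks only at which elements are present, not at how the sets were described.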
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 21
Context: 2.4 Sets are not arrays
=======================
If you’ve done some computer programming, you might see a resemblance between sets and the collections of items often used in a program: arrays, perhaps, or linked lists. To be sure, there are some similarities. But there are also some very important differences, which must not be overlooked:
- **No order.** As previously mentioned, there is no order to the members of a set. “{Mom, Dad}” is the same set as “{Dad, Mom}”. In a computer program, of course, most arrays or lists have first, second, and last elements, and an index number assigned to each.
- **No duplicates.** Suppose \( M \) is the set of all males. What would it possibly mean to say \( M = \{T.J., T.J., Johnny\} \)? Would that mean that “T.J. is twice the man that Johnny is”? This is obviously nonsensical. The set \( M \) is based on a property: maleness. Each element of \( M \) is either male, or it isn't. It can't be “male three times.” Again, in an array or linked list, you could certainly have more than one copy of the same item in different positions.
- **Infinite sets.** 'Nuff said. I’ve never seen an array with infinitely many elements, and neither will you.
- **Untyped.** Most of the time, an array or other collection in a computer program contains elements of only a single type; it’s an array of integers, or a linked list of Customer objects, for example. This is important because the program often needs to treat all elements in the collection the same way. Perhaps it needs to loop over the array to add up all the numbers, or iterate through a linked list and search for customers who have not placed an order in the last six months.
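The first two differences, no order and no duplicates, are easy to see by converting a Python list (the array-like collection) into a set. A minimal sketch with the excerpt's own example:

```python
# A list keeps order and duplicates; a set discards both.
roll_call = ["T.J.", "T.J.", "Johnny"]
M = set(roll_call)

print(len(roll_call))  # -> 3: the list has three entries
print(len(M))          # -> 2: T.J. can't be "male twice"

# Order is irrelevant to set identity.
print({"Mom", "Dad"} == {"Dad", "Mom"})  # -> True
```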
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 22
Context: ```
# CHAPTER 2. SETS
A program would run into problems if it tried to add a string of text to its cumulative total, or encountered a Product object in the middle of its list of Customers. Sets, though, can be heterogeneous, meaning they can contain different kinds of things. The Davies family example had all human beings, but nothing stops me from creating a set \( S = \{ \text{Jack Nicholson}, \text{Kim Kardashian}, \text{Universal Studios}, 5756, \star \} \).
I don’t press this point too hard for a couple of reasons. First, most programming languages do allow heterogeneous collections of some sort, even if they’re not the most natural thing to express. In Java, you can define an `ArrayList` as a non-generic one that simply holds items of class “Object.” In C, you can have an array of void pointers to some unspecified type—which allows your array to point to different kinds of things. Unless it’s a loosely-typed language, though (like Perl or JavaScript), it sort of feels like you’re bending over backwards to do this. The other reason I make this distinction lightly is that when we’re dealing with sets, we often find it useful to deal with things of only one type, and so our \( S \) ends up being homogeneous anyway.
## 2.5 Sets are not ordered pairs (or tuples)
You’ll remember from high school algebra the notion of an ordered pair \( (x, y) \). We dealt with those when we wanted to specify a point to plot on a graph: the first coordinate gave the distance from the origin on the x-axis, and the second coordinate on the y-axis. Clearly an ordered pair is not a set, because as the name implies it is
```
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 23
Context: # 2.6 Sets of Sets
Sets are heterogeneous — a single set can contain four universities, seven integers, and an image — and so it might occur to you that they can contain other sets as well. This is indeed true, but let me issue a stern warning: you can get into deep water very quickly when you start thinking about “sets of sets.” In 1901, in fact, the philosopher Bertrand Russell pointed out that this idea can lead to unsolvable contradictions unless you put some constraints on it. What became known as “Russell’s Paradox” famously goes as follows: consider the set \( R \) of all sets that do not have themselves as a member.
For this reason, we’ll be very careful to use curly braces to denote sets, and parentheses to denote ordered pairs.
By the way, although the word “coordinate” is often used to describe the elements of an ordered pair, that’s really a geometry-centric word that implies a visual plot of some kind. Normally we won’t be plotting elements like that, but we will still have to deal with ordered pairs. I’ll just call the constituent parts “elements” to make it more general.
Three-dimensional points need ordered triples \( (x, y, z) \), and it doesn’t take a rocket scientist to deduce that we could extend this to any number of elements. The question is what to call them, and you do sort of sound like a rocket scientist (or other generic nerd) when you say tuple. Some people rhyme this word with “Drupal,” and others with “couple,” by the way, and there seems to be no consensus. If you have an ordered pair-type thing with 5 elements, therefore, it’s a 5-tuple (or a quintuple). If it has 117 elements, it’s a 117-tuple, and there’s really nothing else to call it. The general term (if we don’t know or want to specify how many elements) is n-tuple. In any case, it’s an ordered sequence of elements that may contain duplicates, so it’s very different than a set.
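Python happens to use the same vocabulary: its `tuple` is an ordered sequence that may contain duplicates, in contrast to its `set`. A small sketch of the distinction:

```python
# Ordered pairs/tuples: order matters and duplicates are allowed.
pair = ("Dad", "Mom")
print(pair == ("Mom", "Dad"))  # -> False: (x, y) is not (y, x)
print((5, 5, 4) == (5, 4))     # -> False: duplicates count in a tuple

# Sets: order and duplicates are both ignored.
print({5, 5, 4} == {5, 4})     # -> True
```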
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 24
Context: # CHAPTER 2. SETS
as members³. Now is \( R \) a member of itself, or isn’t it? Either way you answer turns out to be wrong (try it!) which means that this whole setup must be flawed at some level.
The good news is that as long as you don’t deal with this kind of self-referential loop (“containing yourself as a member”) then it’s pretty safe to try at home. Consider this set:
\[
V = \{ 3, 5, \{ 5, 4 \}, 2 \}
\]
This set has four (not five) members. Three of \( V \)’s members are integers: 2, 3, and 5. The other one is a set (with no name given). That other set, by the way, has two members of its own: 4 and 5. If you were asked, “is \( 4 \in V \)?” the answer would be no.
As a corollary to this, there’s a difference between
\[
\emptyset
\]
and
\[
\{ \emptyset \}
\]
The former is a set with no elements. The latter is a set with one element: and that element just happens to be a set with nothing in it.
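Both the \( V \) example and the \( \emptyset \) vs. \( \{\emptyset\} \) distinction carry over directly to Python, with one wrinkle: elements of a Python set must be hashable, so an inner set has to be an immutable `frozenset` (an illustrative sketch):

```python
# V = {3, 5, {5, 4}, 2}: a frozenset stands in for the unnamed inner set.
V = {3, 5, frozenset({5, 4}), 2}

print(len(V))  # -> 4: V has four members, not five
print(4 in V)  # -> False: 4 lives inside the inner set, not in V itself

empty = frozenset()          # the empty set
nested = frozenset({empty})  # a set whose single element is the empty set
print(len(empty), len(nested))  # -> 0 1: they are different sets
```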
## 2.7 Cardinality
When we talk about the number of elements in a set, we use the word cardinality. You’d think we could just call it the “size” of the set, but mathematicians sometimes like words that sound cool. The cardinality of \( M \) (the set of males, where the Davies family is the domain of discourse) is 3, because there are three elements in it. The cardinality of the empty set \( \emptyset \) is 0. The cardinality of the set of all integers is \( \aleph_0 \). Simple as that.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 25
Context: 2.8 Some special sets
=====================
In addition to the empty set, there are symbols for some other common sets, including:
- **Z** — the integers (positive, negative, and zero)
- **N** — the natural numbers (positive integers and zero)
- **Q** — the rational numbers (all numbers that can be expressed as an integer divided by another integer)
- **R** — the real numbers (all numbers that aren't imaginary, even decimal numbers that aren't rational)
The cardinality of all these sets is infinity, although, as I alluded to previously, \(\mathbb{R}\) is in some sense "greater than" \(\mathbb{N}\). For the curious, we say that \(\mathbb{N}\) is a countably infinite set, whereas \(\mathbb{R}\) is uncountably infinite. Speaking very loosely, this can be thought of this way: if we start counting up all the natural numbers, 0, 1, 2, 3, 4, ..., we will never get to the end of them. But at least we can start counting. With the real numbers, we can't even get off the ground. Where do you begin? Starting with 0 is fine, but then what's the "next" real number? Choosing anything for your second number inevitably skips a lot in between. Once you've digested this, I'll spring another shocking truth on you: \(|\mathbb{Q}|\) is actually equal to \(|\mathbb{N}|\), not greater than it as \(|\mathbb{R}|\) is. Cantor came up with an ingenious numbering scheme whereby all the rational numbers, including 3, \(-9\), and \(\frac{1}{17}\), can be listed off regularly, in order, just like the integers can. And so \(|\mathbb{Q}| = |\mathbb{N}| \neq |\mathbb{R}|\). This kind of stuff can blow your mind.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 26
Context: # 2.9 Combining Sets
Okay, so we have sets. Now what can we do with them? When you first learn about numbers back before kindergarten, the next thing you learn is how to combine numbers using various operations to produce other numbers. These include \( -, \times, +, \div \), exponents, roots, etc. Sets, too, have operations that are useful for combining to make other sets. These include:
- **Union (\(\cup\))**: The union of two sets is a set that includes the elements that either (or both) of them have as members. For instance, if \( A = \{ \text{Dad}, \text{Lizzy} \} \) and \( B = \{ \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \), then \( A \cup B = \{ \text{Dad}, \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \). Note that any element is in the union if it is in \( A \) or \( B \). For this reason, there is a strong relationship between the union operator of sets and the “or” (\( \vee \)) operator of boolean logic that we’ll see later.
- **Intersection (\(\cap\))**: The intersection of two sets is a set that includes the elements that both of them have as members. In the above example, \( A \cap B = \{ \text{Lizzy} \} \). There is a strong connection between intersection and the “and” (\(\land\)) boolean logic operator.
- **(Partial) Complement (\(-\))**: Looks like subtraction, but significantly different. \( A - B \) contains the elements from \( A \) that are not also in \( B \). So you start with \( A \) and then “subtract off” the contents of \( B \) if they occur. In the above example, \( A - B = \{ \text{Dad} \} \). (Note that T.J. and Johnny didn’t really enter into the calculation.) Unlike \( \cup \) and \( \cap \), it is not commutative. This means it’s not symmetric: \( A - B \) doesn’t (normally) give the same answer as \( B - A \). In this example, \( B - A = \{ \text{T.J.}, \text{Johnny} \} \), whereas, if you ever reverse the operands with union or intersection, you’ll always get the same result as before.
- **(Total) Complement (\( \overline{X} \))**: Same as the partial complement, above, except that the implied first operand is \( \Omega \). In other words, \( A - B \) is “all the things in \( A \) that aren’t in \( B \),” whereas...
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 27
Context: # 2.9. COMBINING SETS
\( \overline{B} \) is “all the things *period* that aren’t in \( B \).” Of course, “all the things period” means “all the things that we’re currently talking about.” The domain of discourse \( \Omega \) is very important here. If we’re talking about the Davies family, we would say that \( \overline{M} = \{ \text{Mom}, \text{Lizzy} \} \), because those are all the Davies who aren’t male. If, on the other hand, \( \Omega \) is “the grand set of absolutely everything,” then not only is Mom a member of \( \overline{M} \), but so is the number 12, the French Revolution, and my nightmare last Tuesday about a rabid platypus.
- **Cartesian product** (\( \times \)). Looks like multiplication, but very different. When you take the Cartesian product of two sets \( A \) and \( B \), you don’t even get the elements from the sets in the result. Instead, you get ordered pairs of elements. These ordered pairs represent each combination of an element from \( A \) and an element from \( B \). For instance, suppose \( A = \{ \text{Bob}, \text{Dave} \} \) and \( B = \{ \text{Jenny}, \text{Gabrielle}, \text{Tiffany} \} \). Then:
\[
A \times B = \{ (\text{Bob}, \text{Jenny}), (\text{Bob}, \text{Gabrielle}), (\text{Bob}, \text{Tiffany}), (\text{Dave}, \text{Jenny}), (\text{Dave}, \text{Gabrielle}), (\text{Dave}, \text{Tiffany}) \}
\]
Study that list. The first thing to realize is that it consists of neither guys nor girls, but of ordered pairs. (Clearly, for example, \( \text{Jenny} \notin A \times B \).) Every guy appears exactly once with every girl, and the guy is always the first element of the ordered pair. Since we have two guys and three girls, there are six elements in the result, which is an easy way to remember the \( \times \) sign that represents Cartesian product. (Do not, however, make the common mistake of thinking that \( A \times B \) *is* the number six. \( A \times B \) is a set, not a number. The cardinality of that set, of course, is 6, so it's appropriate to write \( |A \times B| = 6 \).)
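This behavior is easy to check mechanically. Below is a quick sketch (not from the book) using Python's `itertools.product`, with the sets from the example above:

```python
from itertools import product

A = {"Bob", "Dave"}
B = {"Jenny", "Gabrielle", "Tiffany"}

# The Cartesian product: every ordered pair (a, b) with a from A and b from B.
cartesian = set(product(A, B))

print(len(cartesian))                  # 6, i.e. |A| * |B|
print(("Bob", "Jenny") in cartesian)   # True: the ordered pair is a member
print("Jenny" in cartesian)            # False: a bare element is not
```

The membership tests confirm the point in the text: the elements of \( A \times B \) are ordered pairs, not people.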
## Laws of Combining Sets
There are a bunch of handy facts that arise when combining sets using the above operators. The important thing is that these are all easily seen just by thinking about them for a moment. Put another way, these aren’t facts to memorize; they’re facts to look at and see.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 28
Context: # CHAPTER 2. SETS
for yourself. They’re just a few natural consequences of the way we’ve defined sets and operations, and there are many others.
- **Union and intersection are commutative.** As noted above, it’s easy to see that \( A \cup B \) will always give the same result as \( B \cup A \). The same goes for \( \cap \).
- **Union and intersection are associative.** “Associative” means that if you have an operator repeated several times, left to right, it doesn’t matter which order you evaluate them in. \( (A \cup B) \cup C \) will give the same result as \( A \cup (B \cup C) \). This means we can freely write expressions like \( X \cup Y \cup Z \) and no one can accuse us of being ambiguous. This is also true if you have three (or more) intersections in a row. Be careful, though: associativity does not hold if you have unions and intersections mixed together. If I write \( A \cup B \cap C \), it matters very much whether I do the union first or the intersection first. This is just how it works with numbers: \( 4 + 2 \times 3 \) gives either \( 10 \) or \( 18 \) depending on the order of operations. In algebra, we learned that \( \times \) has precedence over \( + \), so you’ll always do that one first in the absence of parentheses. We could establish a similar order for set operations, but we won’t: we’ll always make it explicit with parentheses.
- **Union and intersection are distributive.** You’ll recall from basic algebra that \( a \cdot (b + c) = ab + ac \). Similarly with sets,
\[ X \cap (Y \cup Z) = (X \cap Y) \cup (X \cap Z) \]
It's important to work this out for yourself rather than just memorize it as a rule. Why does it work? Well, take a concrete example. Suppose \( X \) is the set of all female students, \( Y \) is the set of all computer science majors, and \( Z \) is the set of all math majors. (Some students, of course, double-major in both.) The left-hand side of the equals sign says “first take all the math and computer science majors and put them in a group. Then, intersect that group with the women’s group.” The result is “women who are either computer science majors or math majors (or both).” Now
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 29
Context: # 2.9. COMBINING SETS
Look at the right-hand side. The first pair of parentheses encloses only female computer science majors. The right pair encloses female math majors. Then we take the union of the two, to get a group which contains only females, and specifically only the females who are computer science majors or math majors (or both). Clearly, the two sides of the equals sign have the same extension.
The distributive property in basic algebra doesn’t work if you flip the times and plus signs (normally \( a + (b \cdot c) \neq (a + b) \cdot (a + c) \)), but remarkably it does here:
\( X \cup (Y \cap Z) = (X \cup Y) \cap (X \cup Z) \).
Using the same definitions of \( X, Y, \) and \( Z \), work out the meaning of this one and convince yourself it’s always true.
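Both distributive laws are easy to spot-check with Python's set operators (`&` is \( \cap \), `|` is \( \cup \)); the student sets below are hypothetical stand-ins for the example in the text:

```python
X = {"Alice", "Beth", "Carol"}   # female students (made-up names)
Y = {"Alice", "Dan"}             # computer science majors
Z = {"Beth", "Dan", "Ed"}        # math majors

# Intersection distributes over union...
assert X & (Y | Z) == (X & Y) | (X & Z)
# ...and, remarkably, union also distributes over intersection.
assert X | (Y & Z) == (X | Y) & (X | Z)
```

A couple of concrete cases don't prove the laws, of course, but they make a handy sanity check.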
## Identity Laws
Simplest thing you’ve learned all day:
- \( X \cup \emptyset = X \)
- \( X \cap \Omega = X \)
You don’t change \( X \) by adding nothing to it, or by restricting it to everything there is.
## Dominance Laws
The flip side of the above is that \( X \cup \Omega = \Omega \) and \( X \cap \emptyset = \emptyset \). If you take \( X \) and add everything and the kitchen sink to it, you get everything and the kitchen sink. And if you restrict \( X \) to having nothing, it of course has nothing.
## Complement Laws
\( X \cup \overline{X} = \Omega \). This is another way of saying “everything (in the domain of discourse) is either in, or not in, a set.” So if I take \( X \), and then I take everything not in \( X \), and smoosh the two together, I get everything. In a similar vein, \( X \cap \overline{X} = \emptyset \), because there can’t be any element that’s both in \( X \) and not in \( X \): that would be a contradiction. Interestingly, the first of these two laws has become controversial in modern philosophy. It’s called “the law of the excluded middle,” and is explicitly repudiated in many modern logic systems.
## De Morgan's Laws
De Morgan’s laws. Now these are worth memorizing, if only because (1) they’re incredibly important, and (2) they
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 30
Context: # CHAPTER 2. SETS
Again, it's best understood with a specific example. Let’s say you’re renting a house and want to make sure you don’t have any shady characters under the roof. Let \( X \) be the set of all known murderers. Let \( Y \) be the set of all known thieves.
Now, as a landlord, you don’t want any thieves or murderers renting your property. So who are you willing to rent to? Answer: if \( \Omega \) is the set of all people, you are willing to rent to \( \overline{X \cup Y} \). The first of De Morgan’s laws says that this is the same set as \( \overline{X} \cap \overline{Y} \).
Why that? Because if you take \( X \cup Y \), that gives you all the undesirables: people who are either murderers or thieves (or both). You don’t want to rent to any of them. In fact, you want to rent to the complement of that set; namely, “anybody else.” Putting an overbar on that expression gives you all the non-thieves and non-murderers.
Very well. But now look at the right-hand side of the equation. \( \overline{X} \) gives you the non-murderers. \( \overline{Y} \) gives you the non-thieves. Now in order to get acceptable people, you want to rent only to someone who’s in both groups. Put another way, they have to be both a non-murderer and a non-thief in order for you to rent to them. Therefore, they must be in the intersection of the non-murderer group and the non-thief group. Therefore, the two sides of the equation are the same.
The other form of De Morgan’s law is stated by flipping the intersections and unions:
\[
\overline{X \cap Y} = \overline{X} \cup \overline{Y}
\]
Work this one out for yourself using a similar example, and convince yourself it’s always true.
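Here's a sketch of both laws in Python, following the landlord example; the names and the tiny domain of discourse are made up for illustration:

```python
omega = {"Ann", "Bob", "Cid", "Dee"}   # domain of discourse: all people (hypothetical)
X = {"Bob"}                            # known murderers
Y = {"Bob", "Cid"}                     # known thieves

def complement(s):
    # Total complement, taken relative to the domain of discourse.
    return omega - s

# First form: acceptable tenants are non-murderers AND non-thieves.
assert complement(X | Y) == complement(X) & complement(Y)
# Second form: flip the union and intersection.
assert complement(X & Y) == complement(X) | complement(Y)
```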
Augustus De Morgan, by the way, was a brilliant 19th century mathematician with a wide range of interests. His name
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 31
Context: 2.10 Subsets
===========
We learned that the “∈” symbol is used to indicate set membership: the element on the left is a member of the set on the right. A related but distinct notion is the idea of a subset. When we say \( X \subseteq Y \) (pronounced “X is a subset of Y”), it means that every member of \( X \) is also a member of \( Y \). The reverse is not necessarily true, of course; otherwise “⊆” would just mean “=”. So if \( A = \{ \text{Dad}, \text{Lizzy} \} \) and \( K = \{ \text{Dad}, \text{Mom}, \text{Lizzy} \} \), then we can say \( A \subseteq K \).
Be careful about the distinction between “∈” and “⊆”, which are often confused. With “∈”, the thing on the left is an element, whereas with “⊆”, the thing on the left is a set. This is further complicated by the fact that the element on the left-hand side of “∈” might well be a set.
Let's see some examples. Suppose \( Q \) is the set \( \{ 4, 2, \{ 9, 4 \} \} \). \( Q \) has three elements here, one of which is itself a set. Now suppose that we let \( P \) be the set \( \{ 4, 9 \} \). Question: Is \( P \in Q \)? The answer is yes: the set \( \{ 4, 9 \} \) (which is the same as the set \( \{ 9, 4 \} \), just written in a different way) is in fact an element of the set \( Q \). Next question: Is \( P \subseteq Q \)? The answer is no, \( P \nsubseteq Q \): although 4 is a member of \( Q \), 9 is not. Last question: If \( R \) is defined to be \( \{ 2, 4 \} \), is \( R \subseteq Q \)? The answer is yes, since both 2 and 4 are members of \( Q \).
Notice that by this definition, every set is a subset of itself. Sometimes, though, it’s useful to talk about whether a set is *really* a subset of another, and you don’t want it to “count” if the two sets are actually equal. This is called a proper subset, and the symbol for it is \( \subset \). You can see the rationale for the choice of symbol, because “⊆” is kind of like “≤” for numbers, and “⊂” is like “<”.
Every set is a subset (not necessarily a proper one) of itself, because
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 33
Context: ## 2.11. POWER SETS
Don't lose sight of what's going on here: there are four elements to the power set of \( A \), each of which is one of the possible subsets. It might seem strange to talk about "all of the possible subsets"—when I first learned this stuff, I remember thinking at first that there would be no limit to the number of subsets you could make from a set. But of course there is. To create a subset, you can either include, or exclude, each one of the original set's members. In \( A \)’s case, you can either (1) include both Dad and Lizzy, or (2) include Dad but not Lizzy, or (3) include Lizzy but not Dad, or (4) exclude both, in which case your subset is empty. Therefore, \( P(A) \) includes all four of those subsets.
Now what's the cardinality of \( P(X) \) for some set \( X \)? That's an interesting question, and one well worth pondering. The answer ripples through the heart of a lot of combinatorics and the binary number system, topics we'll cover later. And the answer is right at our fingertips if we just extrapolate from the previous example. To form a subset of \( X \), we have a choice to either include, or else exclude, each of its elements. So there's two choices for the first element, and then whether we choose to include or exclude that first element, there are two choices for the second, etc.
So if \( |X| = 2 \) (recall that this notation means \( X \) has two elements, or a cardinality of 2), then its power set has \( 2 \times 2 \) members. If \( |X| = 3 \), then its power set has \( 2 \times 2 \times 2 \) members. In general:
\[
|P(X)| = 2^{|X|}.
\]
As a limiting case (and a brain-bender), notice that if \( X \) is the empty set, then \( P(X) \) has one (not zero) members, because there is in fact one subset of the empty set: namely, the empty set itself. So \( |X| = 0 \), and \( |P(X)| = 1 \). And that jibes with the above formula.
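The \( |P(X)| = 2^{|X|} \) formula can be checked with a short Python sketch (illustrative, not from the book), building each subset from `itertools.combinations`:

```python
from itertools import combinations

def power_set(s):
    # All subsets of s: for each size k, take every k-element combination.
    items = list(s)
    return [set(c) for k in range(len(items) + 1)
                   for c in combinations(items, k)]

A = {"Dad", "Lizzy"}
print(len(power_set(A)))       # 4 == 2 ** |A|
print(len(power_set(set())))   # 1: the empty set's only subset is itself
```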
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 34
Context: # 2.12 Partitions
Finally, there’s a special variation on the subset concept called a **partition**. A partition is a group of subsets of another set that together are both **collectively exhaustive** and **mutually exclusive**. This means that every element of the original set is in **one and only one** of the sets in the partition. Formally, a partition of \( X \) is a group of sets \( X_1, X_2, \ldots, X_n \) such that:
\[
X_1 \cup X_2 \cup \cdots \cup X_n = X,
\]
and
\[
X_i \cap X_j = \emptyset \quad \text{for all } i \neq j.
\]
So let’s say we’ve got a group of subsets that are supposedly a partition of \( X \). The first line above says that if we combine the contents of all of them, we get everything that’s in \( X \) (and nothing more). This is called being collectively exhaustive. The second line says that no two of the sets have anything in common; they are mutually exclusive.
As usual, an example is worth a thousand words. Suppose the set \( D \) is \(\{ \text{Dad, Mom, Lizzy, T.J., Johnny} \}\). A partition is any way of dividing \( D \) into subsets that meet the above conditions. One such partition is:
- \(\{ \text{Lizzy, T.J.} \}\),
- \(\{ \text{Mom, Dad} \}\),
- \(\{ \text{Johnny} \}\).
Another one is:
- \(\{ \text{Lizzy} \}\),
- \(\{ \text{T.J.} \}\),
- \(\{ \text{Mom} \}\),
- \(\{ \text{Johnny} \}\).
Yet another is:
- \(\emptyset\),
- \(\{ \text{Lizzy, T.J., Johnny, Mom, Dad} \}\), and
- \(\emptyset\).
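The two partition conditions can be written as a small checker in Python (a sketch; the text doesn't give code). Duplicate subsets in the input list are treated as separate entries:

```python
def is_partition(subsets, X):
    # Collectively exhaustive: the union of all the subsets is exactly X.
    exhaustive = set().union(*subsets) == X
    # Mutually exclusive: no two subsets share an element.
    exclusive = all(a.isdisjoint(b)
                    for i, a in enumerate(subsets)
                    for b in subsets[i + 1:])
    return exhaustive and exclusive

D = {"Dad", "Mom", "Lizzy", "T.J.", "Johnny"}
print(is_partition([{"Lizzy", "T.J."}, {"Mom", "Dad"}, {"Johnny"}], D))  # True
print(is_partition([{"Lizzy"}, {"Lizzy", "T.J."}], D))                   # False
```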
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 36
Context: # CHAPTER 2: SETS
might be double-majoring in both. Hence, this group of subsets is neither mutually exclusive nor collectively exhaustive. It’s interesting to think about gender and partitions: when I grew up, I was taught that males and females were a partition of the human race. But now I’ve come to realize that there are non-binary persons who do not identify with either of those genders, and so it’s not a partition after all.
## Question:
Is the number of students \( |S| \) equal to the number of off-campus students plus the number of on-campus students? Obviously yes. But why? The answer: because the off-campus and on-campus students form a partition. If we added up the number of freshmen, sophomores, juniors, and seniors, we would also get \( |S| \). But adding up the number of computer science majors and English majors would almost certainly not be equal to \( |S| \), because some students would be double-counted and others counted not at all. This is an example of the kind of beautiful simplicity that partitions provide.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 37
Context: 2.13 Exercises
================
Use an index card or a piece of paper folded lengthwise, and cover up the right-hand column of the exercises below. Read each exercise in the left-hand column, answer it in your mind, then slide the index card down to reveal the answer and see if you're right! For every exercise you missed, figure out why you missed it before moving on.
1. Is the set \( \{ Will, Smith \} \) the same as the set \( \{ Smith, Will \} \)?
Yes indeed.
2. Is the ordered pair \( (Will, Smith) \) the same as \( (Smith, Will) \)?
No. Order matters with ordered pairs (hence the name), and with any size tuple for that matter.
3. Is the set \( \{ \{ Luke, Leia \}, Han \} \) the same as the set \( \{ Luke, Leia, Han \} \)?
No. For instance, the first set has Han as a member but the second does not. (Instead, it has another set as a member, and that inner set happens to include Han.)
4. What's the first element of the set \( \{ Cowboys, Redskins, Steelers \} \)?
The question doesn't make sense. There is no "first element" of a set. All three teams are equally members of the set and could be listed in any order.
5. Let \( G \) be \( \{ Matthew, Mark, Luke, John \} \), the four gospels from the New Testament. Let \( J \) be \( \{ Luke, Obi\text{-}wan, Yoda \} \), and let \( S \) be the set of all Star Wars characters. Now then. Is \( J \subseteq G \)?
No.
6. Is \( J \subseteq S \)?
Yes.
7. Is Yoda \( \in J \)?
Yes.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 39
Context: # 2.13. EXERCISES
21. What’s \( A \cap B \)?
\{ Macbeth, Hamlet \}.
22. What’s \( B \cap T \)?
∅.
23. What’s \( B \cap \overline{T} \)?
B. (Which is \{ Scrabble, Monopoly, Othello \}.)
24. What’s \( A \cup (B \cap T) \)?
\{ Hamlet, Othello, Macbeth \}.
25. What’s \( (A \cup B) \cap T \)?
\{ Macbeth, Hamlet \}.
26. What’s \( T \setminus B \)?
Simply \( T \), since these two sets have nothing in common.
27. What’s \( T \times A \)?
\{ (Hamlet, Macbeth), (Hamlet, Hamlet), (Hamlet, Othello), (Village, Macbeth), (Village, Hamlet), (Village, Othello), (Town, Macbeth), (Town, Hamlet), (Town, Othello) \}. The order of the ordered pairs within the set is not important; the order of the elements within each ordered pair is important.
28. What’s \( T \cap X \)?
∅.
29. What’s \( (B \cap B) \times (A \cap T) \)?
\{ (Scrabble, Hamlet), (Monopoly, Hamlet), (Othello, Hamlet) \}.
30. What’s \( |A \cup B \cup T| \)?
7.
31. What’s \( |(A \cup B \cup T) \times (B \cap B)| \)?
21. (The first parenthesized expression gives rise to a set with 7 elements, and the second to a set with three elements (\( B \) itself). Each element from the first set gets paired with an element from the second, so there are 21 such pairings.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 40
Context: # CHAPTER 2. SETS
### 33. Is A an extensional set, or an intensional set?
The question doesn't make sense. Sets aren't "extensional" or "intensional"; rather, a given set can be described extensionally or intensionally. The description given in item 19 is an extensional one; an intensional description of the same set would be "The Shakespeare tragedies Stephen studied in high school."
### 34. Recall that G was defined as { Matthew, Mark, Luke, John }. Is this a partition of G?
- { Luke, Matthew }
- { John }
No, because the sets are not collectively exhaustive (Mark is missing).
### 35. Is this a partition of G?
- { Mark, Luke }
- { Matthew, Luke }
No, because the sets are neither collectively exhaustive (John is missing) nor mutually exclusive (Luke appears in two of them).
### 36. Is this a partition of G?
- { Matthew, Mark, Luke }
- { John }
Yes. (Trivia: this partitions the elements into the synoptic gospels and the non-synoptic gospels).
### 37. Is this a partition of G?
- { Matthew, Luke }
- { John, Mark }
Yes. (This partitions the elements into the gospels which feature a Christmas story and those that don't).
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 43
Context: # Chapter 3
## Relations
Sets are fundamental to discrete math, both for what they represent in themselves and for how they can be combined to produce other sets. In this chapter, we’re going to learn a new way of combining sets, called relations.
### 3.1 The Idea of a Relation
A relation between a set \( X \) and \( Y \) is a subset of the Cartesian product. That one sentence packs in a whole heck of a lot, so spend a moment thinking deeply about it. Recall that \( X \times Y \) yields a set of ordered pairs, one for each combination of an element from \( X \) and an element from \( Y \). If \( X \) has 5 elements and \( Y \) has 4, then \( X \times Y \) is a set of 20 ordered pairs. To make it concrete, if \( X \) is the set \{ Harry, Ron, Hermione \} and \( Y \) is the set \{ Dr. Pepper, Mt. Dew \}, then \( X \times Y \) is:
- \( (Harry, Dr. Pepper) \)
- \( (Harry, Mt. Dew) \)
- \( (Ron, Dr. Pepper) \)
- \( (Ron, Mt. Dew) \)
- \( (Hermione, Dr. Pepper) \)
- \( (Hermione, Mt. Dew) \)
Convince yourself that every possible combination is in there. I listed them out methodically to make sure I didn’t miss any (all the Harry’s first, with each drink in order, then all the Ron’s, etc.), but of course there’s no order to the members of a set, so I could have listed them in any order.
Now if I define a relation between \( X \) and \( Y \), I’m simply specifying that certain of these ordered pairs are in the relation, and certain
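One way to see this concretely is a Python sketch (not from the book): build the full Cartesian product, then pick out a subset of its pairs as a relation. The `likes` relation below is hypothetical:

```python
from itertools import product

X = {"Harry", "Ron", "Hermione"}
Y = {"Dr. Pepper", "Mt. Dew"}

all_pairs = set(product(X, Y))   # all six possible ordered pairs

# A relation between X and Y is just some subset of those pairs.
likes = {("Ron", "Mt. Dew"), ("Hermione", "Dr. Pepper")}

print(len(all_pairs))       # 6
print(likes <= all_pairs)   # True: <= on Python sets means "is a subset of"
```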
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 47
Context: # 3.4 FINITE AND INFINITE RELATIONS
life \( x \) is romantically attracted to \( y \). The extension is probably only \( \{ (Ron, Hermione), (Hermione, Ron) \} \), although who knows what goes through teenagers' minds.
Another example would be the relation "hasMoreCaloriesThan" between \( X \) and \( Y \): this relation's extension is \( \{ (Mt. Dew, Dr. Pepper) \} \). (Fun fact: Dr. Pepper has only 150 calories per can, whereas Mt. Dew has 170.)
Note that just because a relation's two sets are the same, that doesn't necessarily imply that the two elements are the same for any of its ordered pairs. Harry clearly doesn't have a crush on himself, nor does anyone else have a self-crush. And no soda has more calories than itself, either — that's impossible. That being said, though, an ordered pair can have the same two elements. Consider the relation "hasSeen" between \( X \) and \( X \). Surely all three wizards have looked in a mirror at some point in their lives, so in addition to ordered pairs like \( (Ron, Harry) \), the hasSeen relation also contains ordered pairs like \( (Ron, Ron) \) and \( (Hermione, Hermione) \).
## 3.4 Finite and Infinite Relations
Sets can be infinite, and relations can be too. An **infinite relation** is simply a relation with infinitely many ordered pairs in it. This might seem strange at first, since how could we ever hope to specify all the ordered pairs? But it’s really no different than with sets: we either have to describe it intensionally, or else give a way of systematically computing the extension.
As an example of the first, consider the relation "isGreaterThan" between \( Z \) and \( R \). (Recall that \( Z \) is just a way of writing "the set of integers".) This relation contains ordered pairs like \( (5, 2) \) and \( (17, -13) \), since \( 5 \) isGreaterThan \( 2 \) and \( 17 \) isGreaterThan \( -13 \), but not \( (7, 9) \) or \( (11, 11) \). Clearly it’s an infinite relation. We couldn’t list all the pairs, but we don’t need to, since the name implies the underlying meaning of the relation.
As an example of the second, consider the relation "isLuckierThan" between \( N \) and \( N \). (The \( N \) means "the natural numbers.") We
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 52
Context: # CHAPTER 3. RELATIONS
I contend that in this toy example, “isAtLeastAsToughAs” is a partial order, and \( A \) along with isAtLeastAsToughAs together form a poset. I reason as follows. It’s reflexive, since we started off by adding every dog with itself. It’s antisymmetric, since we never added both \( (x, y) \) and \( (y, x) \) to the relation for two different dogs \( x \) and \( y \). And it’s transitive, because if Rex is tougher than Fido, and Fido is tougher than Cuddles, this means that if Rex and Cuddles were to meet, Rex would quickly establish dominance. (I’m no zoologist, and am not sure if the last condition truly applies with real dogs. But let’s pretend it does.)
It’s called a “partial order” because it establishes a partial, but possibly incomplete, hierarchy among dogs. If we ask, “is dog X tougher than dog Y?” the answer is never contradictory. We’re never going to say, “well, dog X was superior to dog A, who was superior to dog Y; but then again, dog Y was superior to dog B, who was superior to dog X, so there’s no telling which of X and Y is truly toughest.” No: a partial order, because of its transitivity and antisymmetry, guarantees we never have such an irreconcilable conflict.
However, we could have a lack of information. Suppose Rex has never met Killer, and nobody Rex has met has ever met anyone Killer has met. There’s no chain between them. They’re in two separate universes as far as we’re concerned, and we’d have no way of knowing which was tougher. It doesn’t have to be that extreme, though. Suppose Rex established dominance over Cuddles, and Killer also established dominance over Cuddles, but those are the only such ordered pairs in the relation. Again, there’s no way to tell whether Rex or Killer is the tougher dog. They’d have to either encounter a common opponent that only one of them can beat, or else get together for a throw-down.
So a partial order gives us some semblance of structure — the relation establishes a directionality, and we’re guaranteed not to get wrapped up in contradictions — but it doesn’t completely order all the elements. If it does, it’s called a total order.
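The three partial-order properties can be checked mechanically. Here's an illustrative Python sketch with a tiny made-up "isAtLeastAsToughAs" relation over three dogs:

```python
def is_reflexive(rel, elems):
    return all((x, x) in rel for x in elems)

def is_antisymmetric(rel):
    # For distinct x and y, never both (x, y) and (y, x).
    return all((y, x) not in rel for (x, y) in rel if x != y)

def is_transitive(rel):
    # Whenever (x, y) and (y, z) are present, (x, z) must be too.
    return all((x, z) in rel
               for (x, y) in rel
               for (w, z) in rel if w == y)

dogs = {"Rex", "Fido", "Cuddles"}
tough = ({(d, d) for d in dogs}
         | {("Rex", "Fido"), ("Fido", "Cuddles"), ("Rex", "Cuddles")})

print(is_reflexive(tough, dogs)
      and is_antisymmetric(tough)
      and is_transitive(tough))   # True: a partial order
```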
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 55
Context: # 3.6 FUNCTIONS
Functions aren’t necessarily numeric. Sometimes we do draw “plots” of sorts, though, like this one:
```
[diagram: two blobs connected by arrows — the domain { Harry, Ron, Hermione }
on the left, the codomain { Mt. Dew, Dr. Pepper } on the right, with one
arrow from each wizard to a drink]
```
*Figure 3.1: A function represented graphically.*
This simply shows which elements of the domain map to which elements of the codomain. The left blob is the domain, the right blob is the codomain, and there’s an arrow representing each mapping.
Now as with relations, functions normally have “meaning.” We could define a function called `firstTasted` that associates each wizard with the soft drink he or she first sampled as a child. We could define another called `faveDrink` that maps each wizard to his or her favorite — presuming that every wizard has a favorite drink in the set (Hermione will have to overlook her iced tea and choose among the options provided). A third function called `wouldChooseWithMexicanFood` records which drink each wizard would choose to accompany that type of cuisine. Here are Ron’s values for each of the three functions:
- `firstTasted(Ron) = Mt. Dew`
- `faveDrink(Ron) = Mt. Dew`
- `wouldChooseWithMexicanFood(Ron) = Dr. Pepper`
These values indicate that Mt. Dew was the soda pop that Ron first sipped, and it has been his favorite ever since, although at La Estrellita he prefers a Pepper.
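Since each wizard maps to exactly one drink, a Python dict whose keys are exactly the domain is a natural sketch of such a function. Ron's values come from the text; Harry's and Hermione's are made up for illustration:

```python
domain = {"Harry", "Ron", "Hermione"}

# One dict per function; every domain element gets exactly one value.
firstTasted = {"Harry": "Dr. Pepper", "Ron": "Mt. Dew", "Hermione": "Dr. Pepper"}
faveDrink = {"Harry": "Mt. Dew", "Ron": "Mt. Dew", "Hermione": "Dr. Pepper"}
wouldChooseWithMexicanFood = {"Harry": "Dr. Pepper", "Ron": "Dr. Pepper",
                              "Hermione": "Mt. Dew"}

print(faveDrink["Ron"])                        # Mt. Dew
print(firstTasted.keys() == faveDrink.keys())  # True: both share the same domain
```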
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 60
Context: # CHAPTER 3. RELATIONS
own wizard, and no soft drinks (or wizards) are left out. How exciting! This is a perfectly bijective function, also called a **bijection**. Again, the only way to get a bijection is for the domain and codomain to be the same size (although that alone does not guarantee a bijection; witness \( f \), above). Also observe that if they are the same size, then injectivity and surjectivity go hand-in-hand. Violate one, and you’re bound to violate the other. Uphold the one, and you’re bound to uphold the other. There’s a nice, pleasing, symmetrical elegance to the whole idea.
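These properties are easy to test for a finite function represented as a dict. The checker below is an illustrative sketch (the two-element sets are made up):

```python
def is_injective(f):
    # No two domain elements may share a codomain value.
    return len(set(f.values())) == len(f)

def is_surjective(f, codomain):
    # Every codomain element must be hit at least once.
    return set(f.values()) == set(codomain)

def is_bijective(f, codomain):
    return is_injective(f) and is_surjective(f, codomain)

drinks = {"Dr. Pepper", "Mt. Dew"}
g = {"Harry": "Dr. Pepper", "Ron": "Mt. Dew"}   # hypothetical two-wizard function

print(is_bijective(g, drinks))   # True: equal sizes, one-to-one and onto
```

Notice how, with equal-sized domain and codomain, breaking injectivity (two wizards sharing a drink) automatically breaks surjectivity too.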
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 65
Context: # 3.8. EXERCISES
29. **Is faveSport injective?**
No, because Julie and Sam (and Chuck) all map to the same value (basketball). For a function to be injective, there must be no two domain elements that map to the same codomain element.
30. **Is there any way to make it injective?**
Not without altering the underlying sets. There are three athletes and two sports, so we can’t help but map multiple athletes to the same sport.
31. **Fine. Is faveSport surjective?**
No, because no one maps to volleyball.
32. **Is there any way to make it surjective?**
Sure, for instance, change Sam from basketball to volleyball. Now both of the codomain elements are "reachable" by some domain element, so it’s surjective.
33. **Is faveSport now also bijective?**
No, because it’s still not injective.
34. **How can we alter things so that it’s bijective?**
One way is to add a third sport—say, kickboxing—and move either Julie or Chuck over to kickboxing. If we have Julie map to kickboxing, Sam map to volleyball, and Chuck map to basketball, we have a bijection.
35. **How do we normally write the fact that "Julie maps to kickboxing"?**
faveSport(Julie) = kickboxing.
36. **What’s another name for "injective"?**
One-to-one.
37. **What’s another name for "surjective"?**
Onto.
38. **What’s another name for "range"?**
Image.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 67
Context: # Chapter 4
## Probability
Probability is the study of **uncertainty**. This may seem like a hopeless endeavor, sort of like knowing the unknowable, but it's not. The study of probability gives us tools for taming the uncertain world we live and program in, and for reasoning about it in a precise and helpful way.
We may not know exactly how long a particular visitor is willing to wait for our webpage to load in their browser, but we can use probability to estimate how much traffic we’ll lose if this takes longer than a certain average duration. We may not know which specific passwords a hacker will try as he attempts to break our security protocol, but we can use probability to estimate how feasible this approach will be for him. We may not know exactly when a certain program will run out of RAM and have to swap its data out to virtual memory, but we can predict how often this is likely to occur — and how painful it will be for us — given a certain system load and user behavior.
The trick is to use the tools we've already built — sets, relations, functions — to characterize and structure our notions of the relative likelihood of various outcomes. Once those underpinnings are secured, a layer of deductive reasoning will help us make good use of that information to begin to predict the future.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 68
Context: # 4.1 Outcomes and events
Since life is uncertain, we don’t know for sure what is going to happen. But let’s start by assuming we know what things might happen. Something that might happen is called an outcome. You can think of this as the result of an experiment if you want to, although normally we won’t be talking about outcomes that we have explicitly manipulated and measured via scientific means. It’s more like we’re just curious how some particular happening is going to turn out, and we’ve identified the different ways it can turn out and called them outcomes.
Now we’ve been using the symbol \( \Omega \) to refer to “the domain of discourse” or “the universal set” or “all the stuff we’re talking about.” We’re going to give it yet another name now: the sample space. \( \Omega \), the sample space, is simply the *set of all possible outcomes*. Any particular outcome — call it \( O \) — is an element of this set, just like in Chapter 1 every conceivable element was a member of the domain of discourse.
If a woman is about to have a baby, we might define \( \Omega \) as \{ boy, girl \}. Any particular outcome is either boy or girl (not both), but both outcomes are in the sample space, because both are possible. If we roll a die, we’d define \( \Omega \) as \{ 1, 2, 3, 4, 5, 6 \}. If we’re interested in motor vehicle safety, we might define \( \Omega \) for a particular road trip as \{ safe, accident \}. The outcomes don’t have to be equally likely, an important point we’ll return to soon.
In probability, we define an event as a *subset of the sample space*. In other words, an event is a group of related outcomes (though an event might contain just one outcome, or even zero). I always thought this was a funny definition for the word “event”: it’s not the first thing that word brings to mind. But it turns out to be a useful concept, because sometimes we’re not interested in any particular outcome necessarily, but rather in whether the outcome — whatever it is — has a certain property. For instance, suppose at the start of some game, my opponent and I each roll the die, agreeing that the highest roller gets to go first. Suppose he rolls a 2. Now it’s my turn. The \( \Omega \) for my die roll is of course \( \{ 1, 2, 3, 4, 5, 6 \} \).
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 71
Context: # 4.2. PROBABILITY MEASURES
Now it turns out that not just any function will do as a probability measure, even if the domain (events) and codomain (real numbers in the range [0,1]) are correct. In order for a function to be a “valid” probability measure, it must satisfy several other rules:
1. \( \text{Pr}(\Omega) = 1 \)
2. \( \text{Pr}(A) \geq 0 \) for all \( A \subseteq \Omega \)
3. \( \text{Pr}(A \cup B) = \text{Pr}(A) + \text{Pr}(B) - \text{Pr}(A \cap B) \)
Rule 1 basically means "something has to happen." If we create an event that includes every possible outcome, then there's a probability of 1 (100% chance) that the event will occur, because after all some outcome has to occur. (And of course \( \text{Pr}(\Omega) \) can’t be greater than 1, either, because it doesn’t make sense to have any probability over 1.) Rule 2 says there are no negative probabilities: you can’t define any event, no matter how remote, that has a less than zero chance of happening.
Rule 3 is called the “additivity property,” and is a bit more difficult to get your head around. A diagram works wonders. Consider Figure 4.1, called a "Venn diagram," which visually depicts sets and their contents. Here we have defined three events: \( F \) (as above) is the event that the winner is a woman; \( R \) is the event that the winner is a rock musician (perhaps in addition to other musical genres); and \( U \) is the event that the winner is underage (i.e., becomes a multimillionaire before they can legally drink). Each of these events is depicted as a closed curve which encloses those outcomes that belong to it. There is obviously a great deal of overlap.
Now back to rule 3. Suppose I ask “what’s the probability that the All-time Idol winner is underage or a rock star?” Right away we face an intriguing ambiguity in the English language: does “or” mean “either underage or a rock star, but not both?” Or does it mean “underage and/or rock star?” The former interpretation is called an exclusive or and the latter an inclusive or. In computer science, we will almost always be assuming an inclusive or, unless explicitly noted otherwise.
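Rule 3 can be checked mechanically. Below is a minimal sketch in Python using exact fractions over a set of equally likely outcomes; the sample space and the members of \( F \) and \( R \) are invented for illustration (the actual Figure 4.1 is not reproduced in these excerpts).

```python
from fractions import Fraction

# Hypothetical sample space of six equally likely outcomes (invented names,
# loosely echoing the All-time Idol example in the text).
omega = {"Kelly", "Fantasia", "Carrie", "Dave", "Chris", "Adam"}
F = {"Kelly", "Fantasia", "Carrie"}     # event: winner is a woman
R = {"Kelly", "Dave", "Chris", "Adam"}  # event: winner is a rock musician

def pr(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(omega))

# The inclusive or is the union. Rule 3 (additivity) says both sides agree:
print(pr(F | R))                    # Pr(F ∪ R)
print(pr(F) + pr(R) - pr(F & R))    # Pr(F) + Pr(R) − Pr(F ∩ R), same value
```

Using `Fraction` rather than floats keeps the two sides exactly equal, which is the whole point of the check.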
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 75
Context: # 4.3. PHILOSOPHICAL INTERLUDE
The idea of how often heads will occur in the long run. But if we flip it a million times and get 500,372 heads, we can confidently say that the probability of getting a head on a single flip is approximately .500.
This much isn’t controversial: it’s more like common sense. But the frequentist philosophy states that this is really the only way that probability can be defined. It’s what probability is: the frequency with which we can expect certain outcomes to occur, based on our observations of their past behavior. Probabilities only make sense for things that are repeatable, and reflect a known, reliable trend in how often they produce certain results. Historical proponents of this philosophy include John Venn, the inventor of the aforementioned Venn diagram, and Ronald Fisher, one of the greatest biologists and statisticians of all time.
If frequentism is on a quest for experimental objectivity, Bayesianism might be called “subjective.” This isn’t to say it’s arbitrary or sloppy. It simply has a different notion of what probability ultimately means. Bayesians interpret probability as a quantitative personal assessment of the likelihood of something happening. They point out that for many (most) events of interest, trials are neither possible nor sensible. Suppose I’m considering asking a girl out to the prom, and I’m trying to estimate how likely it is she’ll go with me. It’s not like I’m going to ask her a hundred times and count how many times she says yes, then divide by 100 to get a probability. There is in fact no way to perform a trial or use past data to guide me, and at any rate she’s only going to say yes or no once. So based on my background knowledge and my assumptions about her, myself, and the world, I form an opinion which could be quantified as a “percent chance.”
Once I’ve formed this opinion (which of course involves guesswork and subjectivity) I can then reason about it mathematically, using all the tools we’ve been developing. Of special interest to Bayesians is the notion of updating probabilities when new information comes to light, a topic we’ll return to in a moment. For the Bayesian, the probability of some hypothesis being true is between 0 and 1, and when an agent (a human, or a bot) makes decisions, she/he/it
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 79
Context: # 4.5 Total Probability
Many psychologists, by the way, claim that we're constantly doing this sort of thing in our minds: gathering facts, then revising our beliefs about the world in light of those facts. We start by believing that \( \Pr(X) \) is approximately some value. Then we learn \( K_1 \) has occurred, and we update this to \( \Pr(X|K_1) \). Then we learn that \( K_2 \) has also occurred, and so now we have \( \Pr(X|K_1 \cap K_2) \). (Can you see why it’s the intersection?) The more we learn, the more we revise our estimates up or down, presumably getting more accurate as we go. Another way of looking at it is that every time we learn something new is true, we also learn that its opposite is not true, and therefore we can eliminate some parts of the theoretically possible universe that we have now ruled out. The denominator gets smaller and smaller as we eliminate possibilities.
Keep in mind, by the way, that unlike union and intersection, conditional probability is not commutative. In other words, \( \Pr(X|Y) \neq \Pr(Y|X) \) in general. To take just one example, look again at the \( F \) and \( U \) sets from All-time Idol. \( \Pr(F|U) \), as we already computed, is equal to 1, since if \( U \) has occurred, we automatically know that \( F \) has also occurred (there aren't any underage contestants except females). But the reverse is certainly not true: just because we have a female winner doesn't mean we have an underage winner, since the winner might be Carrie. Working it out, \( \Pr(U|F) = \frac{\Pr(U \cap F)}{\Pr(F)} = \frac{1}{4} \). Higher than \( \Pr(U) \), but not 1.
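The non-commutativity is easy to verify numerically. Here is a sketch with an invented sample space of eight equally likely outcomes in which every underage winner is female (\( U \subseteq F \)), echoing the setup in the text:

```python
from fractions import Fraction

omega = set(range(8))
F = {0, 1, 2, 3}   # hypothetical "female winner" outcomes
U = {0}            # hypothetical "underage winner" outcome

def pr(event):
    return Fraction(len(event), len(omega))

def pr_given(a, b):
    """Pr(A|B) = Pr(A ∩ B) / Pr(B)."""
    return pr(a & b) / pr(b)

print(pr_given(F, U))  # 1   -- knowing U guarantees F
print(pr_given(U, F))  # 1/4 -- knowing F only makes U somewhat more likely
```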
## 4.5 Total Probability
There's a very useful fact that goes by the grandiose name "The Law of Total Probability." It goes like this: If there's an event whose probability you’d like to know, you can split it into pieces and add up their probabilities, as long as we do it in the right way.
"The right way" is the key, of course. And it has to do with partitions. Recall from section 2.12 that a partition of a set is a mutually exclusive and collectively exhaustive group of subsets. One example is that any set and its complement together form a partition of \( \Omega \). By the same token, for any sets \( A \) and \( B \), these two sets together form a partition of \( A \):
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 81
Context: # 4.5 TOTAL PROBABILITY
\[
\begin{aligned}
\Pr(A) &= \Pr(A \cap C_1) + \Pr(A \cap C_2) + \cdots + \Pr(A \cap C_N) \\
       &= \Pr(A|C_1)\Pr(C_1) + \Pr(A|C_2)\Pr(C_2) + \cdots + \Pr(A|C_N)\Pr(C_N) \\
       &= \sum_{k=1}^{N} \Pr(A|C_k)\Pr(C_k)
\end{aligned}
\]
is the formula.¹
Let's take an example of this approach. Suppose that as part of a promotion for Muvico Cinemas movie theatre, we're planning to give a door prize to the 1000th customer this Saturday afternoon. We want to know, though, the probability that this person will be a minor. Figuring out how many patrons overall will be under 18 might be difficult. But suppose we're showing these three films on Saturday: *Spider-Man: No Way Home*, *Here Before*, and *Sonic the Hedgehog 2*. We can estimate the fraction of each movie's viewers that will be minors: .6, .01, and .95, respectively. We can also predict how many tickets will be sold for each film: 2,000 for *Spider-Man*, 500 for *Here Before*, and 1,000 for *Sonic*.
Applying frequentist principles, we can compute the probability that a particular visitor will be seeing each of the movies:
- \(Pr(Spider-Man) = \frac{2000}{2000 + 500 + 1000} = 0.571\)
- \(Pr(Here \ Before) = \frac{500}{2000 + 500 + 1000} = 0.143\)
- \(Pr(Sonic) = \frac{1000}{2000 + 500 + 1000} = 0.286\)
¹ If you're not familiar with the notation in that last line, realize that \(Σ\) (a capital Greek "sigma") just represents a sort of loop with a counter. The \(k = 1\) below the sign means that the counter is \(k\) and that it starts at 1. The \(N\) above the sign means the counter goes up to \(N\), which is its last value. And what does the loop do? It adds up a cumulative sum: the thing being added to the total each time through the loop is the expression to the right of the sign. The last line with the \(Σ\) is just a more compact way of expressing the preceding line.
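The movie-theatre example can be reproduced in a few lines. The ticket counts and minor fractions are exactly the figures given in the text; the three films partition the audience, so the Law of Total Probability applies:

```python
# Pr(film) comes from relative ticket sales; Pr(minor | film) is the
# estimated fraction of minors in each film's audience.
tickets = {"Spider-Man": 2000, "Here Before": 500, "Sonic": 1000}
minor_fraction = {"Spider-Man": 0.6, "Here Before": 0.01, "Sonic": 0.95}

total = sum(tickets.values())

# Law of Total Probability: sum over the partition of
# Pr(minor | film) * Pr(film).
pr_minor = sum(minor_fraction[f] * tickets[f] / total for f in tickets)
print(round(pr_minor, 4))  # 0.6157
```

So there is roughly a 62% chance the 1000th customer is a minor, even though that number never appears in any single film's statistics.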
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 89
Context: # 4.7 INDEPENDENCE
Whoops!
One last point on the topic of independence: please don’t make the mistake of thinking that mutually exclusive events are independent! This is by no means the case, and, in fact, the opposite is true. If two events are mutually exclusive, they are extremely dependent on each other!
Consider the most trivial case: I choose a random person on campus, and define \( I \) as the event that they’re an in-state student, and \( O \) as the event that they’re out-of-state. Clearly these events are mutually exclusive. But are they independent? Of course not! Think about it: if I told you a person was in-state, would that tell you anything about whether they were out-of-state? Duh. In a mutually exclusive case like this, event \( I \) completely rules out \( O \) (and vice versa), which means that although \( \text{Pr}(I) \) might be .635, \( \text{Pr}(I|O) \) is a big fat zero. More generally, \( \text{Pr}(A|B) \) is most certainly not going to be equal to \( \text{Pr}(A) \) if the two events are mutually exclusive, because learning about one event tells you everything about the other.
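The in-state/out-of-state example can be made concrete with an invented campus of 1000 students, 635 of them in-state (matching the .635 figure above):

```python
from fractions import Fraction

omega = set(range(1000))
I = set(range(635))   # in-state students (hypothetical split)
O = omega - I         # out-of-state students

def pr(event):
    return Fraction(len(event), len(omega))

def pr_given(a, b):
    return pr(a & b) / pr(b)

print(pr(I))           # 127/200, i.e. .635
print(pr_given(I, O))  # 0 -- knowing O rules out I entirely
```

Since \( \Pr(I|O) = 0 \neq \Pr(I) \), the two events fail the independence test as badly as possible.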
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 91
Context: # 4.8 EXERCISES
9. Okay, so we have the probabilities of our four swimmers Ben, Chad, Grover, and Tim each winning the heat at .1, .2, .4, and .3, respectively.
Now suppose Ben, Chad, and Grover are UWM athletes; Tim is from Marymount. Ben and Tim are juniors, and Chad and Grover are sophomores. We'll define:
\( U = \{ \text{Ben, Chad, Grover} \} \)
\( M = \{ \text{Tim} \} \)
\( J = \{ \text{Ben, Tim} \} \)
\( S = \{ \text{Chad, Grover} \} \)
What's \( P(U) \)?
.7
10. What's \( P(J) \)?
.4
11. What's \( P(U^c) \)?
.3 (\(1 - P(U)\), of course.)
12. What's \( P(J \cup S) \)?
Exactly 1. All of the outcomes are represented in the two sets \( J \) and \( S \). (Put another way, all competitors are juniors or sophomores.)
13. What's \( P(J \cap S) \)?
Zero. Sets \( J \) and \( S \) have no elements in common; therefore their intersection is a set with no outcomes, and the probability of a non-existent outcome happening is 0. (Put another way, nobody is both a junior and a sophomore.)
14. What's the probability of a UWM junior winning the heat?
This is \( P(J \cap U) \), which is the probability that the winner is a junior and a UWM student. Since \( U \cap J = \{ \text{Ben} \} \), the answer is .1.
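All of these exercise answers follow mechanically once the outcome probabilities are stored in a dictionary. A sketch, using exact fractions for the .1/.2/.4/.3 figures from question 9:

```python
from fractions import Fraction

pr_win = {"Ben": Fraction(1, 10), "Chad": Fraction(2, 10),
          "Grover": Fraction(4, 10), "Tim": Fraction(3, 10)}

def pr(event):
    """Probability of an event = sum of its outcomes' probabilities."""
    return sum(pr_win[x] for x in event)

U = {"Ben", "Chad", "Grover"}
J = {"Ben", "Tim"}
S = {"Chad", "Grover"}

print(pr(U))      # 7/10
print(pr(J))      # 2/5
print(1 - pr(U))  # 3/10  (complement)
print(pr(J | S))  # 1     (union covers every swimmer)
print(pr(J & S))  # 0     (empty intersection)
print(pr(J & U))  # 1/10  (Ben, the only UWM junior)
```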
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 93
Context: # Chapter 5
## Structures
Much of computer science deals with representing and manipulating information. To do this, people have devised various **structures** for organizing chunks of data in a way that makes it easy to store, search, and retrieve. There’s a whole course in most computer science curricula called “data structures” which covers how to implement these structures in code. In this book, we won’t be talking about the code, but rather the abstract structures themselves. This chapter has a lot of pictures in it, which depict examples of the various structures in a very general way. The concepts here map directly to code when you need to put them into practice.
There are all kinds of data structures — arrays, linked lists, queues, stacks, hashtables, and heaps, to name a few — but they almost all boil down to one of two fundamental kinds of things: graphs, and trees. These are the two structures we’ll focus on in this chapter. A graph is just about the most general structure you can envision: a bunch of scattered data elements that are related to each other in some way. Almost every data structure imaginable can be recast as a type of graph. Trees are sort of a special case of graphs, but also sort of a topic in their own right, kind of like functions were a special type of relation, but also kind of different. A tree can be seen as a type of graph that imposes extra special conditions which give some navigational benefit.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 102
Context: # CHAPTER 5. STRUCTURES
## Figure 5.7: A graph depicting a symmetric endorelationship.
```
(Harry, Ron)
(Ron, Harry)
(Ron, Hermione)
(Hermione, Ron)
(Harry, Harry)
(Neville, Neville)
```
```
 Harry (loop)
   |
   |
  Ron
   |
   |
Hermione            Neville (loop)
```
Notice how the loops (edges from a node back to itself) in these diagrams represent ordered pairs in which both elements are the same.
Another connection between graphs and sets has to do with partitions. Figure 5.7 was not a connected graph: Neville couldn't be reached from any of the other nodes. Now consider: isn't a graph like this similar in some ways to a partition of \( A \) — namely, this one?
- \( \{ Harry, Ron, Hermione \} \)
- \( \{ Neville \} \)
We've simply partitioned the elements of \( A \) into the groups that are connected. If you remove the edge between Harry and Ron in that graph, you have:
- \( \{ Harry \} \)
- \( \{ Ron, Hermione \} \)
- \( \{ Neville \} \)
Then add one between Hermione and Neville, and you now have:
- \( \{ Harry \} \)
- \( \{ Ron, Hermione, Neville \} \)
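The correspondence between connected chunks of a graph and a partition can be made concrete with a short breadth-first search. The vertices and edges below come from the ordered pairs of Figure 5.7 (loops included, duplicates collapsed):

```python
from collections import deque

A = {"Harry", "Ron", "Hermione", "Neville"}
edges = {("Harry", "Ron"), ("Ron", "Hermione"),
         ("Harry", "Harry"), ("Neville", "Neville")}

# Build an undirected adjacency map (symmetric, so add both directions).
adj = {v: set() for v in A}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def components(vertices, adj):
    """Group vertices into connected components via BFS."""
    seen, parts = set(), []
    for v in vertices:
        if v in seen:
            continue
        part, queue = set(), deque([v])
        while queue:
            u = queue.popleft()
            if u in part:
                continue
            part.add(u)
            queue.extend(adj[u] - part)
        seen |= part
        parts.append(part)
    return parts

print(components(sorted(A), adj))
# two components: {Harry, Ron, Hermione} and {Neville}
```

The returned components are exactly the partition listed above, and removing or adding an edge reshuffles the partition just as the text describes.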
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 136
Context: # CHAPTER 5. STRUCTURES
Searching for “Molly” in a simple unsorted list of names is an O(n) prospect. If there’s a thousand nodes in the list, on average you’ll find Molly after scanning through 500 of them. You might get lucky and find Molly at the beginning, but of course you might get really unlucky and not find her until the end. This averages out to about half the size of the list in the normal case. If there’s a million nodes, however, it’ll take you 500,000 traversals on average before finding Molly. Ten times as many nodes means ten times as long to find Molly, and a thousand times as many means a thousand times as long. Bummer.
Looking up Molly in a BST, however, is an O(log n) process. Recall that “log” means the logarithm (base-2). This means that doubling the number of nodes gives you a minuscule increase in the running time. Suppose there were a thousand nodes in your tree, as above. You wouldn’t have to look through 500 of them on average; you’d only have to look through about ten, since log(1000) ≈ 10. Now increase it to a million nodes. You wouldn’t have to look through 500,000 to find Molly; you’d only have to look through about twenty. Absolutely mind-boggling.
## Adding nodes to a BST
Finding things in a BST is lightning fast. Turns out, so is adding things to it. Suppose we acquire a new customer named Jennifer, and we need to add her to our BST so we can retrieve her account information in the future. All we do is follow the same process we would if we were looking for Jennifer, but as soon as we find the spot where she should be, we add her there. In this case, Jennifer comes before Molly (go left), and before Jessica (go left again), and after Ben (go right). Ben has no right child, so we put Jennifer in the tree at that right spot. (See Figure 5.26.)
This adding process is also an O(log n) algorithm, since the number of nodes we need to look at is at most the height of the tree.
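The search-then-attach insertion just described can be sketched in a few lines (the book itself deliberately stays away from code, so this is an illustration, not the book's implementation). Names are compared alphabetically, as in the example:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Walk down as if searching for key; attach a new leaf where the
    search falls off the tree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

root = None
for name in ["Molly", "Jessica", "Ben", "Jennifer"]:
    root = insert(root, name)

print(contains(root, "Jennifer"))  # True
print(contains(root, "Zelda"))     # False
```

Tracing the inserts reproduces the text's path: Jennifer goes left of Molly, left of Jessica, and becomes Ben's right child.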
Note that a new entry always becomes a leaf when added. In fact,
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 143
Context: # 5.4 Exercises
1. How many vertices are there in the graph below?

2. How many edges are there?
7.
3. What’s the degree of vertex B?
3.
4. Is this graph directed?
No. (No arrowheads on the lines.)
5. Is this graph connected?
No – there is no path from A, B, E, or F to either C or D.
6. Is this graph weighted?
No. (No numbers annotating the edges.)
7. Is it a tree?
No. (A tree must be connected and must also have no cycles, which this graph clearly does: e.g., B–to–A–to–E–to–B.)
8. Is it a DAG?
Not remotely; it is neither directed nor acyclic.
9. If this graph represented an undirected relation, how many ordered pairs would it have?
14. (If you said 7, remember that since there are no arrowheads on the lines, this is an undirected graph, which corresponds to a symmetric relation, and hence both (A, E) and (E, A) will be present.)
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 149
Context: # Chapter 6
## Counting
If the title of this chapter seems less than inspiring, it’s only because the kind of counting we learned as children was mostly of a straightforward kind. In this chapter, we’re going to learn to answer some more difficult questions like “how many different semester schedules could a college student possibly have?” and “how many different passwords can a customer choose for this e-commerce website?” and “how likely is this network buffer to overflow, given that its packets are addressed to three different destinations?”
The more impressive-sounding name for this topic is **combinatorics**. In combinatorics, we focus on two tasks: counting things (to find out how many there are), and enumerating things (to systematically list them as individuals). Some things turn out to be hard to count but easy to enumerate, and vice versa.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 150
Context: # 6.1 The Fundamental Theorem
We start with a basic rule that goes by the audacious name of **The Fundamental Theorem of Counting**. It goes like this:
If a whole can be divided into \( k \) parts, and there’s \( n_i \) choices for the \( i \)-th part, then there’s \( n_1 \times n_2 \times n_3 \times \cdots \times n_k \) ways of doing the whole thing.
## Example
Jane is ordering a new Lamborghini. She has twelve different paint colors to choose from (including Luscious Red and Sassy Yellow), three different interiors (Premium Leather, Bonded Leather, or Vinyl), and three different stereo systems. She must also choose between automatic and manual transmission, and she can get power locks & windows (or not). How many different configurations does Jane have to choose from?
To put it another way, how many different kinds of cars could come off the line for her?
The key is that every one of her choices is independent of all the others. Choosing an Envious Green exterior doesn't constrain her choice of transmission, stereo, or anything else. So no matter which of the 12 paint colors she chooses, she can independently choose any of the three interiors, and no matter what these first two choices are, she can freely choose any of the stereos, etc. It’s a mix-and-match. Therefore, the answer is:
\[
12 \times 3 \times 3 \times 2 \times 2 = 432
\]
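Jane's configuration count can be checked two ways: by multiplying the choice counts, and by brute-force enumerating every mix-and-match configuration.

```python
from itertools import product
from math import prod

# Choice counts from the example: paint colors, interiors, stereos,
# transmissions, and power locks & windows (or not).
choices = [12, 3, 3, 2, 2]
print(prod(choices))  # 432

# Brute force: enumerate every configuration and count them.
configs = list(product(*(range(n) for n in choices)))
print(len(configs))   # 432
```

That the two numbers agree is exactly what the Fundamental Theorem promises: independent choices multiply.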
Here’s an alternate notation you’ll run into for this, by the way:
\[
\prod_{i=1}^{k} n_i
\]
**How many other "Fundamental Theorems" of math do you know?** Here are a few: the Fundamental Theorem of Arithmetic says that any natural number can be broken down into its prime factors in only one way. The Fundamental Theorem of Algebra says that the highest power of a polynomial is how many roots (zeros) it has. The Fundamental Theorem of Linear Algebra says that the row space and the column space of a matrix have the same dimension. The Fundamental Theorem of Calculus says that integration and differentiation are the inverses of each other.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 151
Context: # 6.1. THE FUNDAMENTAL THEOREM
which is just a shorter way of writing
\[ n_1 \times n_2 \times n_3 \cdots \times n_k. \]
As mentioned in section 4.5, the \( \Sigma \) notation is essentially a loop with a counter, and it says to add up the expression to the right of it for each value of the counter. The \( \Pi \) notation is exactly the same, only instead of adding the expressions together for each value of the counter, we're multiplying them. (The reason mathematicians chose the symbols \( \Sigma \) (sigma) and \( \Pi \) (pi) for this, by the way, is that “sigma” and “pi” start with the same letter as “sum” and “product,” respectively.)
We can actually get a lot of leverage just with the fundamental theorem. How many different PINs are possible for an ATM card? There are four digits, each of which can be any value from 0 to 9 (ten total values), so the answer is:
\[ 10 \times 10 \times 10 \times 10 = 10,000 \text{ different PINs.} \]
So a thief at an ATM machine frantically entering PINs at random (hoping to break your account before you call and stop your debit card) would have to try about 5,000 of them on average before cracking the code.
What about middle school bullies who are trying to break into your locker? Well, most combination locks are opened by a three-number sequence, each number of which is anything from 0 to 39. So there are:
\[ 40 \times 40 \times 40 = 64,000 \text{ different combinations.} \]
That's probably slightly overstated, since I'll bet consecutive repeat numbers are not allowed (Master probably doesn't manufacture a
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 152
Context: # CHAPTER 6. COUNTING
Every car in the state of Virginia must be issued its own license plate number. That's a lot of cars. How many different license plate combinations are available?
This one requires a bit more thought, since not all license numbers have the same number of characters. In addition to `SBE4756` and `PY9127`, you can also have `DAM6` or `LUV6` or even `U2`. How can we incorporate these?
The trick is to divide up our set into mutually exclusive subsets, and then add up the cardinalities of the subsets. If only 7 characters fit on a license plate, then clearly every license plate number has either 1, 2, 3, 4, 5, 6, or 7 characters. And no license plate has two of these (i.e., there is no plate that is both 5 characters long and 6 characters long). Therefore they’re mutually exclusive subsets, and safe to add. This last point is often not fully appreciated, leading to errors. Be careful not to cavalierly add the cardinalities of non-mutually-exclusive sets! You’ll end up double-counting.
So we know that the number of possible license plates is equal to:
- The # of 7-character plates
- The # of 6-character plates
- The # of 5-character plates
- ...
- The # of 1-character plates.
Very well. Now we can figure out each one separately. How do we know how many 7-character plates there are? Well, if every character must be either a letter or a digit, then we have \(26 + 10 = 36\) choices for each character. This implies \(36^7\) different possible 7-character license plates. The total number of plates is therefore:
\[
36^7 + 36^6 + 36^5 + 36^4 + 36^3 + 36^2 + 36^1 = 80,603,140,212 \text{ plates}
\]
which is about ten times the population of the earth, so I think we're safe for now.
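The license-plate total is a one-liner to verify: 36 choices (26 letters + 10 digits) per character, over the mutually exclusive length cases 1 through 7.

```python
# Sum the mutually exclusive cases: plates of length 1, 2, ..., 7.
total = sum(36**length for length in range(1, 8))
print(total)  # 80603140212
```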
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 156
Context: # CHAPTER 6. COUNTING
This pattern comes up so much that mathematicians have established a special notation for it:
\[ n \times (n - 1) \times (n - 2) \cdots \times 1 = n! \; (\text{``n-factorial''}) \]
We say there are “3-factorial” different brushing orders for the Davies kids. For our purposes, the notion of factorial will only apply for integers, so there’s no such thing as 23.46! or π! (In advanced mathematics, however, factorial is sometimes generalized to non-integers.) We also define 0! to be 1, which might surprise you.
This comes up a heck of a lot. If I give you a jumbled set of letters to unscramble, like “KRIBS” (think of the Jumble® word game in the newspaper), how many different unscramblings are there? The answer is 5! or 120, one of which is BRISK. Let's say I shuffle a deck of cards before playing War. How many different games of War are there? The answer is 52!, since any of the cards in the deck might be shuffled on top, then any but that top card could be second, then any of those two could be third, etc. Ten packets arrive near-simultaneously at a network router. How many ways can they be queued up for transmission? 10! ways, just like a larger Davies family.
The factorial function grows really, really fast, by the way, even faster than exponential functions. A five letter word like “BRISK” has 120 permutations, but “AMBIDEXTROUSLY” has 87,178,291,200, ten times the population of the earth. The number of ways to shuffle a deck is
\[ 52! \approx 8.065 \times 10^{67} \]
so I don’t think my boys will end up playing the same War game twice any time soon, nor my wife and I the same bridge hand.
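The factorial figures quoted above are easy to confirm with Python's `math.factorial`:

```python
from math import factorial

print(factorial(5))   # 120 -- permutations of a five-letter word
print(factorial(14))  # 87178291200 -- permutations of a 14-letter word
print(factorial(52))  # ways to shuffle a deck (about 8.0658e67)
```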
---
**“War”** is a mindless card game which involves no strategy or decision-making on the part of the players. Once you shuffle the initial deck, the entire outcome of the game is fixed.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 157
Context: # 6.2. PERMUTATIONS
## Enumerating permutations
We've discovered that there are 120 permutations of BRISK, but how would we go about listing them all? You can play around with the Davies kids and stumble upon all 6 permutations, but for larger numbers it’s harder. We need a systematic way.
Two of the easiest ways to enumerate permutations involve recursion. Here’s one:
### Algorithm #1 for enumerating permutations
1. Begin with a set of \( n\) objects.
a. If \( n = 1\), there is only one permutation; namely, the object itself.
b. Otherwise, remove one of the objects, and find the permutations of the remaining \( n - 1\) objects. Then, insert the removed object at every possible position, creating another permutation each time.
As always with recursion, solving a bigger problem depends on solving smaller problems. Let’s start with RISK. We’ve already discovered from the toothbrushing example that the permutations of ISK are ISK, IKS, SIK, SKI, KIS, and KSI. So to find the permutations of RISK, we insert an R into each possible location for each of these ISK-permutations. This gives us:
- RISK
- IRSK
- ISRK
- ISKR
- RIKS
- IRKS
- IKRS
- IKSR
- RSIK
- SRIK
- SIRK
- SIKR
- RSKI
- ...
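Algorithm #1 translates directly into a short recursive function. This is a sketch of the insert-into-every-position idea, not code from the book:

```python
def permutations(s):
    """Enumerate permutations of the string s by Algorithm #1: remove the
    first object, recursively permute the rest, then insert the removed
    object at every possible position."""
    if len(s) <= 1:
        return [s]
    first, rest = s[0], s[1:]
    result = []
    for perm in permutations(rest):
        for i in range(len(perm) + 1):
            result.append(perm[:i] + first + perm[i:])
    return result

print(sorted(permutations("ISK")))   # the six permutations of ISK
print(len(permutations("RISK")))     # 24
print(len(permutations("BRISK")))    # 120
```

Note how the counts match the factorial pattern: 3! = 6, 4! = 24, 5! = 120.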
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 160
Context: # 152
## CHAPTER 6. COUNTING
10. Bubba, then it doesn’t matter whether Jason finished 11th or 45th.
It’s easy to see that there are 45 possible winners; then for each winner there are 44 possible second-placers, etc., so this total turns out to be:
45×44×43×42×41×40×39×38×37×36 = 11,576,551,623,436,800 finishes.
Each of the finishes is called a *partial permutation*. It’s a permutation of k items chosen from n total, and is denoted P(n, k). The number of such permutations works out to:
$$
n \times (n - 1) \times (n - 2) \times \cdots \times (n - k + 1)
$$
The “n - k + 1” bit can be confusing, so take your time and think it through. For the golf tournament case, our highest term was 45 and our lowest term was 36. This is because n was 45 and k was 10, and so we only wanted to carry out the multiplication to 36 (not 35), and 36 is 45 - 10 + 1.
This can be expressed more compactly in a few different ways. First, we can use factorials to represent it:
$$
n \times (n - 1) \times \cdots \times (n - k + 1) = \frac{n \times (n - 1) \times \cdots \times (n - k + 1) \times (n - k) \times (n - k - 1) \times \cdots \times 1}{(n - k) \times (n - k - 1) \times \cdots \times 1} = \frac{n!}{(n - k)!}
$$
Also, we could use our compact product notation:
$$
n \times (n - 1) \times (n - 2) \times \cdots \times (n - k + 1) = \prod_{i=0}^{k-1} (n - i).
$$
Finally, as with (non-partial) permutations, this comes up so much that the professionals have invented a special notation for it. It looks like a power, but has an underline under the exponent:
$$
n \times (n - 1) \times (n - 2) \times \cdots \times (n - k + 1) = n^{\underline{k}}.
$$
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 161
Context: # 6.2. PERMUTATIONS
This is pronounced “n-to-the-k-falling,” and was invented by one of the most brilliant computer scientists in history, Donald Knuth.
To keep straight what \( n^{\underline{k}} \) means, think of it as the same as plain exponentiation, except that the product diminishes instead of staying the same. For example, “17-to-the-6th” is
\[
17^6 = 17 \cdot 17 \cdot 17 \cdot 17 \cdot 17 \cdot 17
\]
but “17-to-the-6th-falling” is
\[
17^{\underline{6}} = 17 \cdot 16 \cdot 15 \cdot 14 \cdot 13 \cdot 12.
\]
In both cases, you’re multiplying the same number of terms; it’s just that in the second case, these terms are “falling.”
Anyway, notation aside, partial permutations abound in practice. A late night movie channel might show four classic films back to back every evening. If there are 500 films in the studio’s library, how many nightly TV schedules are possible? Answer: \( 500^{\underline{4}} \), since there are 500 choices of what to show at 7 pm, then 499 choices for 9 pm, 498 for 11 pm, and 497 for the 1 am late show.
The fastest 41 auto racers will qualify for Sunday’s race, and will be placed from Pole Position on down depending on their qualifying time. If 60 cars participate in the qualifying heat, then there are \( 60^{\underline{41}} \) different possible starting configurations for Sunday.
Middle schoolers entering sixth grade will be assigned a semester schedule that consists of five “blocks” (periods), each of which will have one of thirteen classes (science, math, orchestra, study hall, etc.). How many schedules are possible? You guessed it, \( 13^{\underline{5}} \). Notice that this is the correct answer only because no repeats are allowed: we don’t want to schedule any student for American History more than once. If a student could take the same class more than once in a day, then there would be \( 13^5 \) (not “falling”) different possible schedules.
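The falling power is simple to compute. A sketch, using the chapter's own examples as checks (and the identity \( P(n, k) = \frac{n!}{(n-k)!} \) from the previous section):

```python
from math import factorial

def falling(n, k):
    """n-to-the-k-falling: n * (n-1) * ... * (n-k+1)."""
    result = 1
    for i in range(k):
        result *= n - i
    return result

print(falling(17, 6))   # 8910720, i.e. 17*16*15*14*13*12
print(falling(500, 4))  # 61752747000 nightly movie schedules
print(falling(45, 10) == factorial(45) // factorial(35))  # True
```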
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 162
Context: # 6.3 Combinations
All the stuff with permutations has emphasized order. Somebody gets first place in the golf tournament, and somebody else gets second, and you bet your bottom dollar that it matters which is which. What if it turns out we don’t care about the order, though? Maybe we don’t care who got what place, but just which golfers were in the top ten. Maybe we don’t care which film is showing in which time slot, but only which films are in tonight’s lineup.
This counting scenario involves something called combinations rather than permutations. A combination of k objects out of a possible n is a choice of any set of k of them, without regard to order. For instance, suppose all three Davies kids want to play on the Wii, but only two can play at a time. Who will get to play first after school? One possibility is Lizzy and T.J., another is Lizzy and Johnny, and the last one is T.J. and Johnny. These are the three (and only three) combinations of 2 objects out of 3.
To see how to count these in general, let’s return to the golf tournament example. Suppose that in addition to winning money, the top three finishers of our local tournament will also advance to the regional tournament. This is a great honor and brings with it far greater additional winning potential than the local money did. Question: how many different possible trios might we send to regional competition?
At first glance, this seems just like the “how many prize money allocations” problem from before, except that we’re taking 3 instead of 10. But there is a twist. In the former problem, it mattered who was first vs. second vs. third. Now the order is irrelevant. If you finish in the top three, you advance, period. You don’t “advance more forcefully” for finishing first locally instead of third.
It’s not as obvious how to count this, but of course there is a trick. The trick is to count the partial permutations, but then realize how much we overcounted, and then compensate for it accordingly.
If we count the partial permutations of 3 out of 45 golfers, we have \( 45^{\underline{3}} \) such partial permutations. One of those partial permutations is:
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 163
Context: # 6.3. COMBINATIONS
1. Phil
2. Bubba
3. Tiger
Another one is:
1. Phil
2. Tiger
3. Bubba
and yet another is:
1. Tiger
2. Phil
3. Bubba
Now the important thing to recognize is that in our present problem — counting the possible number of regional-bound golf trios — all three of these different partial permutations represent the same combination. In all three cases, it's Bubba, Phil, and Tiger who will represent our local golf association in the regional competition. So by counting all three of them as separate partial permutations, we've overcounted the combinations.
Obviously, we want to count Bubba/Phil/Tiger only once. Okay then. How many times did we overcount it when we counted partial permutations? The answer is that we counted this trio once for every way it can be permuted. The three permutations, above, were examples of this, and so are these three:
1. Tiger
2. Bubba
3. Phil
1. Bubba
2. Tiger
3. Phil
1. Bubba
2. Phil
3. Tiger
This makes a total of six times that we (redundantly) counted the same combination when we counted the partial permutations. Why 6? Because that's the value of 3!, of course. There are 3! different ways to arrange Bubba, Phil, and Tiger, since that's just a straight permutation of three elements. And so we find that for every threesome we want to account for, we have counted it 6 times.
The way to get the correct answer, then, is obviously to correct for this overcounting by dividing by 6:
\[
\frac{45 \times 44 \times 43}{3!} = \frac{45 \times 44 \times 43}{6} = 14,190 \text{ different threesomes.}
\]
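The divide-out-the-overcount recipe can be sanity-checked in a few lines of Python (a sketch; the standard library's `math.comb` performs the same computation directly):

```python
from math import comb, factorial

partial_perms = 45 * 44 * 43                 # ordered top-three finishes
threesomes = partial_perms // factorial(3)   # divide out the 3! orderings

# The library function agrees with the hand computation.
print(threesomes)     # 14190
print(comb(45, 3))    # 14190
```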
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 164
Context: # CHAPTER 6. COUNTING
And in general, that’s all we have to do. To find the number of combinations of \( k \) things taken from a total of \( n \) things, we have:
\[
\frac{n^{\underline{k}}}{k!} = \frac{n!}{(n-k)! \, k!}
\]
This pattern, too, comes up so often that mathematicians have invented (yet) another special notation for it. It looks a bit strange at first, almost like a fraction without a horizontal bar:
\[
\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{(n-k)! \, k!}
\]
This is pronounced “n-choose-k.”
Again, examples abound. How many different 5-card poker hands are there? Answer: \( \binom{52}{5} \), since it doesn’t matter what order you’re dealt the cards, only which five cards you get. If there are 1024 sectors on our disk, but only 256 cache blocks in memory to hold them, how many different combinations of sectors can be in memory at one time? \( \binom{1024}{256} \). If we want to choose 4 or 5 of our top 10 customers to participate in a focus group, how many different combinations of participants could we have? \( \binom{10}{4} + \binom{10}{5} \), since we want the number of ways to pick 4 of them plus the number of ways to pick 5 of them. And for our late night movie channel, of course, there are \( \binom{500}{4} \) possible movie lineups to attract audiences, if we don’t care which film is aired at which time.
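A sketch of these counts in Python, using the standard library's `math.comb` for "n-choose-k":

```python
from math import comb

poker_hands = comb(52, 5)                  # 5-card hands from a 52-card deck
focus_groups = comb(10, 4) + comb(10, 5)   # pick 4 of 10 customers, or 5
movie_lineups = comb(500, 4)               # which 4 of 500 films air tonight
```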
## Binomial coefficients
The “n-choose-k” notation \( \binom{n}{k} \) has another name: values of this sort are called **binomial coefficients**. This is because one way to generate them, believe it or not, is to repeatedly multiply a binomial times itself (or, equivalently, take a binomial to a power.)
A binomial, recall, is a polynomial with just two terms:
\[
x + y
\]
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 165
Context: # 6.3. COMBINATIONS
The coefficients for this binomial are of course 1 and 1, since "x" really means "1 · x" (and likewise for "y"). Now if we multiply this by itself, we get:
$ (x + y) \cdot (x + y) = x^2 + 2xy + y^2 $,
the coefficients of the terms being 1, 2, and 1. We do it again:
$ (x^2 + 2xy + y^2) \cdot (x + y) = x^3 + 3x^2y + 3xy^2 + y^3 $
to get 1, 3, 3, and 1, and do it again:
$ (x^3 + 3x^2y + 3xy^2 + y^3) \cdot (x + y) = x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4 $
to get 1, 4, 6, 4, and 1. At this point you might be having flashbacks to Pascal's triangle, which perhaps you learned about in grade school, in which each entry in a row is the sum of the two entries immediately above it (to the left and right), as in Figure 6.1. (If you never learned that, don't worry about it.)
```
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
```
**Figure 6.1:** The first six rows of Pascal's triangle.
Now you might be wondering where I'm going with this. What do fun algebra tricks have to do with counting combinations of items? The answer is that the values of $ \binom{n}{k} $ are precisely the coefficients of these multiplied polynomials. Let n be 4, which corresponds to the last polynomial we multiplied out. We can then compute all the combinations of items taken from a group of four:
- $ \binom{4}{0} = 1 $
- $ \binom{4}{1} = 4 $
- $ \binom{4}{2} = 6 $
- $ \binom{4}{3} = 4 $
- $ \binom{4}{4} = 1 $
In other words, there is exactly **one** way of taking no items out of 4 (you simply don't take any). There are **four** ways of taking one out of 4.
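You can watch the coefficients emerge by repeatedly multiplying by \( (x + y) \). A small Python sketch that tracks only the coefficient list (our own helper; it is equivalent to building Pascal's triangle one row at a time):

```python
from math import comb

def next_row(row):
    """Multiply a polynomial in x and y (given as its coefficient list)
    by (x + y): shift-and-add, just like Pascal's triangle."""
    return [a + b for a, b in zip([0] + row, row + [0])]

row = [1]                 # coefficients of (x + y)^0
for n in range(1, 5):
    row = next_row(row)

print(row)                # [1, 4, 6, 4, 1] -- coefficients of (x + y)^4
```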
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 168
Context: # CHAPTER 6. COUNTING
We each do a quick 30-minute workout session divided into six 5-minute blocks, and we work with one of the machines during each block, taking turns spotting each other. One day my friend asks me, “Hey Stephen, have you ever wondered: how many different workout routines are possible for us?”
I was, of course, wondering exactly that. But the correct answer turns out to hinge very delicately on exactly what “a workout routine” is. If we could select any weight machine for any 5-minute block, then the answer is \( 18^6 \), since we have 18 choices for our first block, 18 choices for our second, and so on. (This comes to 34,012,224 different routines, if you’re interested.)
However, on further inspection, we might change our mind about this. Does it make sense to choose the same machine more than once in a 30-minute workout? Would we really complete a workout that consisted of “1. Biceps, 2. Abs, 3. Pecs, 4. Biceps, 5. Biceps, 6. Biceps?” If not (and most trainers would probably recommend against such monomaniacal approaches to exercise), then the real answer is only \( 18^{\underline{6}} \), since we have 18 choices for our first block, and then only 17 for the second, 16 for the third, etc. (This reduces the total to 13,366,080.)
But perhaps the phrase “a workout routine” means something different even than that. If I tell my physical therapist what “my workout routine” consisted of this morning, does he really care whether I did triceps first, last, or in the middle? He probably only cares about which machines (and therefore which muscle groups) I worked out that morning, not what order I did them in. If this is true, then our definition of a workout routine is somewhat different than before. It’s no longer a consecutive sequence of machines chosen, but rather a set of six machine choices. There would be only \( \binom{18}{6} \) of those, or a mere 18,564. So as you can see, the answer radically depends on the precise interpretation of the concepts, which means that to successfully do combinatorics, you have to slow down and think very carefully.
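The three interpretations give three very different counts. A Python sketch, assuming (as above) 18 machines and six 5-minute blocks:

```python
from math import comb

machines, blocks = 18, 6

# Interpretation 1: any machine in any block, order matters.
any_machine = machines ** blocks

# Interpretation 2: no machine repeated, order still matters (falling power).
no_repeats = 1
for i in range(blocks):
    no_repeats *= machines - i        # 18 * 17 * 16 * 15 * 14 * 13

# Interpretation 3: just the set of machines used, order irrelevant.
just_the_set = comb(machines, blocks)
```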
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 170
Context: # CHAPTER 6. COUNTING
1. Go back to when the child did have to choose something from each category, but now say they can choose any number of accessories (so they could have the wizard's cape, a batman helmet, plus a lightsaber, pipe, and scepter). Now how many costumes are there?
This is \(4 \times 5 \times 2^9\), or a whopping 10,240 for those of you keeping score. The \(9\) changed to a \(2^9\) because now for each accessory, a costume might include it, or exclude it. That's two independent choices for each accessory.
2. Okay, that’s overkill. A kid only has two hands, after all, so handling nine accessories would be a disastrous challenge. Let’s say instead that a child can choose up to three accessories (but must have at least one). Now how many costume choices are there?
Now it’s \(4 \times 5 \times \left( \binom{9}{1} + \binom{9}{2} + \binom{9}{3} \right)\), which is equal to \(4 \times 5 \times (9 + 36 + 84)\) or 2,580 possible costumes.
3. Ignoring the at-least-one-child-and-adult constraint for the moment, the total number of groups would seem to be \( \binom{17}{3} + \binom{17}{4} + \binom{17}{5} = 680 + 2380 + 6188 = 9{,}248 \) possible groups. But of course this is an overcount, since it includes groups with no children and groups with no adults. We’ll use the trick from p. 146 to subtract those out. How many size-3-to-5 groups with no kids (all adults) are there? \( \binom{11}{3} + \binom{11}{4} + \binom{11}{5} = 165 + 330 + 462 = 957 \). And how many size-3-to-5 groups with no adults (all kids) are there? \( \binom{6}{3} + \binom{6}{4} + \binom{6}{5} = 20 + 15 + 6 = 41 \). Therefore, by the p. 146 trick, the total number of legal groups is \( 9248 - 957 - 41 = 8{,}250 \). Final answer.
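The arithmetic above (9248 − 957 − 41 = 8,250) is consistent with a pool of 6 children and 11 adults (17 people in all) forming groups of size 3 to 5; those pool sizes are our inference from the totals, not stated in this excerpt. A quick Python check:

```python
from math import comb

kids, adults = 6, 11          # inferred pool sizes; 17 people in all
sizes = range(3, 6)           # groups of size 3, 4, or 5

all_groups = sum(comb(kids + adults, s) for s in sizes)
all_adult = sum(comb(adults, s) for s in sizes)   # groups with no kids
all_kid = sum(comb(kids, s) for s in sizes)       # groups with no adults

legal = all_groups - all_adult - all_kid          # subtract the overcount
```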
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 173
Context: # Chapter 7
## Numbers
Wow, last chapter was about “counting,” and this one is about “numbers.” It sure seems like we’re regressing back to first grade or earlier. And indeed, this chapter will contain a repeat of some elementary school concepts! But this is so we can re-examine the foundations and generalize them somewhat. The mechanical processes you’ve always used with numbers — adding, subtracting, comparing, checking whether something divides evenly, working with place value — are all correct, but they’re all hard-coded for decimal numbers. The word “decimal,” in this chapter, won’t mean “a number with a decimal point,” like 5.62, but rather a number expressed in base 10.
And what does “expressed in base 10” mean? It means that the digits, from right to left, represent a “one’s place,” a “ten’s place,” a “hundred’s place,” and so on. This is what we all learned in grade school, and perhaps you thought that’s just how numbers “were.” But it turns out that 1, 10, 100, 1000, …, is just one choice of place values, and that we could equally well choose many other things, like 1, 2, 4, 8, …, or 1, 16, 256, 4096, …, or even 1, 23, 529, 12167, …, as long as those values are of a certain type (successive powers of the base).
It’s the concept of bases, and specifically bases other than 10, that will cause us to rethink some things. It’ll feel unnatural at first, but soon you’ll discover that there are aspects of how you work with numbers that are unnecessarily specific to base 10, and that leaving them behind is freeing.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 175
Context: # 7.1. WHAT IS A “NUMBER?”
When you think of a number, I want you to try to erase the sequence of digits from your mind. Think of a number as what it is: a **quantity**. Here's what the number seventeen really looks like:
[Figure: seventeen circles]
It’s just an **amount**. There are more circles in that picture than in some pictures, and fewer than in others. But in no way is it “two digits,” nor do the particular digits “1” and “7” come into play any more or less than any other digits.
Let’s keep thinking about this. Consider this number, which I’ll label “A”:
(A)
Now let’s add another circle to it, creating a different number I’ll call “B”:
(B)
And finally, we’ll do it one more time to get “C”:
(C)
(Look carefully at those images and convince yourself that I added one circle each time.)
When going from A to B, I added one circle. When going from B to C, I also added one circle. Now I ask you: was going from B to C any more “significant” than going from A to B? Did anything qualitatively different happen?
The answer is obviously no. Adding a circle is adding a circle; there’s nothing more to it than that. But if you had been writing
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 177
Context: # 7.2. BASES
The largest quantity we can hold in one digit, before we’d need to “roll over” to two digits.
In base 10 (decimal), we use ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Consequently, the number nine is the highest value we can hold in a single digit. Once we add another element to a set of nine, we have no choice but to add another digit to express it. This makes a “ten’s place” because it will represent the number of sets-of-10 (which we couldn't hold in the 1's place) that the value contains.
Now why is the next place over called the “hundred’s place” instead of, say, the “twenty’s place”? Simply because twenty — as well as every other number less than a hundred — comfortably fits in two digits. We can have up to 9 in the one’s place, and also up to 9 in the ten’s place, giving us a total of ninety-nine before we ever have to cave in to using three digits. The number one hundred is exactly the point at which we must roll over to three digits; therefore, the sequence of digits 1-0-0 represents one hundred.
If the chosen base isn’t obvious from context (as it often won’t be in this chapter) then when we write out a sequence of digits we’ll append the base as a subscript to the end of the number. So the number “four hundred and thirty-seven” will be written as \( 437_{10} \).
The way we interpret a decimal number, then, is by counting the right-most digit as a number of individuals, the digit to its left as the number of groups of ten individuals, the digit to its left as the number of groups of a hundred individuals, and so on. Therefore, \( 5472_{10} \) is just a way of writing \( 5 \times 10^3 + 4 \times 10^2 + 7 \times 10^1 + 2 \times 10^0 \).
By the way, we will often use the term **least significant digit** to refer to the right-most digit (2, in the above example), and **most significant digit** to refer to the left-most (5). “Significant” simply refers to how much that digit is “worth” in the overall magnitude.
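Interpreting a digit string this way is purely mechanical. A short Python sketch (the function name `value_of` is ours), using the standard multiply-and-add loop that walks from most to least significant digit:

```python
def value_of(digits, base=10):
    """Interpret a sequence of digit values, most significant first."""
    total = 0
    for d in digits:
        total = total * base + d
    return total

print(value_of([5, 4, 7, 2]))   # 5*10^3 + 4*10^2 + 7*10 + 2 = 5472
```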
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 181
Context: # 7.3. HEXADECIMAL (BASE 16)
In hexadecimal, what's the highest value that can fit in one digit?
Answer: F (which is fifteen). What's the highest that can fit in two digits?
FF (which is two hundred fifty-five). What about three digits?
FFF (which is four thousand ninety-five.)
And so on. If you count in hexadecimal, you do the same thing as in decimal, only you "roll over the odometer" when you get to F, not when you get to 9.
## Converting to and from decimal
So we know how to take a hexadecimal number (like \( 7E23_{16} \)) and find its decimal equivalent: we just interpret each place's value as 1, 16, 256, 4096, and so on. What about going the other way? If we had a decimal number, how would we write its value hexadecimally?
First, let's learn two operations (if you don't already know them) that come in handy when working with integers. The first is called the modulo operator (written "mod"), and simply gives the remainder when dividing two numbers. This is a concept you probably learned in elementary school but might not have used since then. As we get older (and use calculators), we tend to think of a division operation like \(13 \div 3\) as being 4.33... But that's when we want a real-valued (instead of integer-valued) answer. If we only want integers, then we say that \(13 \div 3\) is 4 with a remainder of 1. (The "4" is called the quotient.) This means that if you have 13 objects, you can take four groups of 3 out of them, and then have 1 object left over. The way we write the remainder operation mathematically is \(13 \bmod 3\). In this case, it turns out that \(13 \bmod 3 = 1\).
Let's think about how the modulo operator yields different values. We know that \(13 \bmod 3 = 1\). What about \(14 \bmod 3\)? That is equal to 2, since we can (again) take four groups of 3, but then we have two left over. What about \(15 \bmod 3\)? That yields 0, since 3 goes into 15 evenly, leaving no remainder at all. And \(16 \bmod 3\) again gives us 1, just like 13 did. If you think it through, you'll realize that \(19 \bmod 3\) will also be 1, as will \(22 \bmod 3\) and \(25 \bmod 3\). These numbers that give the same remainder are said to be "congruent mod 3." The numbers 2, 5, 8, 11, 14, etc. are also all congruent to each other mod 3, since they all give remainder 2.
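Python's `%` operator is exactly this "mod," and `divmod` gives the quotient and remainder together. A quick sketch of the congruence classes described above:

```python
quotient, remainder = divmod(13, 3)    # 13 = 4 * 3 + 1

# Numbers congruent to 1 (mod 3): each leaves the same remainder.
class_one = [n % 3 for n in (13, 16, 19, 22, 25)]

# Numbers congruent to 2 (mod 3).
class_two = [n % 3 for n in (2, 5, 8, 11, 14)]
```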
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 190
Context: # CHAPTER 7. NUMBERS
## Adding binary numbers
Adding two binary numbers is the same as adding in decimal, hexadecimal, or any other base; you just have to know when to "roll over the odometer," which in this case is almost instantly, since the highest value a bit can hold is 1! Let's give it a shot:
```
  111001₂
+ 011010₂
─────────
 1010011₂
```
A child could follow the rules: when we add two zeros, we get zero. Adding a one to a zero gives one. Adding two ones gives zero, and a carry to the next significant digit. And adding two ones plus a carry gives a one and a carry. See if you can follow the flow:
```
 111
  111001₂
+ 011010₂
─────────
 1010011₂
```
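The ripple-carry process can be written out directly. A Python sketch (our own helper; it adds two binary numerals given as strings, carrying left exactly as described above):

```python
def add_binary(a, b):
    """Add two binary numerals given as strings of 0s and 1s."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    bits, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry   # 0, 1, 2, or 3
        bits.append(str(total % 2))         # the digit we write down
        carry = total // 2                  # the carry to the next column
    if carry:
        bits.append("1")
    return "".join(reversed(bits))
```

For instance, `add_binary("111001", "011010")` agrees with `bin(0b111001 + 0b011010)`.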
## Capacity
How large a number can a byte store? There are 8 bits, and each one can independently have either of two values (0 or 1), so by the Fundamental Theorem of Counting, there are \(2^8\) different combinations. This works out to 256, but we can't actually store the number 256 in a byte, since we use the bit pattern `00000000₂` (or `00₁₆`) to represent zero. The highest value would be `11111111₂` (or `FF₁₆`), which is 255₁₀.
How do we store a number larger than that? Simply use more than one byte, of course. If we used two bytes of memory, and treated them as concatenated one after the other, that would give us 16 bits, allowing us to store up to the number `1111111111111111₂` (or `FFFF₁₆`), which is 65,535₁₀. We'd call one of these bytes (the one representing the \(2^0\) place up to the \(2^7\) place) the least significant byte, and the other one (containing places \(2^8\) through \(2^{15}\)) the most significant byte. Extending to more than two bytes to accommodate even larger numbers is done in the obvious way.
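A quick Python check of these capacities:

```python
one_byte_patterns = 2 ** 8      # 256 distinct combinations of 8 bits
one_byte_max = 2 ** 8 - 1       # 255, i.e. 11111111 in binary, FF in hex
two_byte_max = 2 ** 16 - 1      # 65,535, i.e. FFFF in hex
```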
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 197
Context: # 7.4. BINARY (BASE 2)
will contain one more digit than the original numbers did. You’ve probably seen this on a hand calculator when you press “=” and get an “E” (for “error”) in the display. If there are only ten digits on your display, adding two ten-digit numbers will (sometimes) result in an eleven-digit number that your calculator can’t display, and it’s alerting you to that fact so you don’t misinterpret the result. Here, we might add two 8-bit quantities and end up with a 9-bit quantity that can’t fit in one byte. This situation is called overflow, and we need to detect when it occurs.
The rules for detecting overflow are different depending on the scheme. For unsigned numbers, the rule is simple: if a 1 is carried out from the MSB (far left side), then we have overflow. So if I were to try to add \( 155_{10} \) and \( 108_{10} \):
```
  11111
   10011011 ← 155_{10}
 + 01101100 ← 108_{10}
 ----------
  100000111
```
then I get a carry out left into the 9th digit. Since we can only hold eight digits in our result, we would get a nonsensical answer (\( 7_{10} \)), which we can detect as bogus because the carry out indicated overflow.
Sign-magnitude works the same way, except that I have one fewer bit when I’m adding and storing results. (Instead of a byte’s worth of bits representing magnitude, the left-end bit has been reserved for a special purpose: indicating the number’s sign.) Therefore, if I add the remaining 7-bit quantities and get a carry out left into the eighth digit, that would indicate overflow.
Now with two’s-complement, things are (predictably) not that easy. But in this case, they’re almost as easy. There’s still a simple rule to detect overflow; it’s just a different rule. The rule is: if the carry in to the last (left-most) bit is different than the carry out from the last bit, then we have overflow.
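Both overflow rules can be simulated in a few lines of Python (a sketch on 8-bit quantities; the helper names are ours):

```python
def add_unsigned8(a, b):
    """8-bit unsigned add: overflow iff a 1 is carried out of the MSB."""
    total = a + b
    return total & 0xFF, total > 0xFF              # (result, overflow?)

def add_twos_complement8(a, b):
    """8-bit two's-complement add: overflow iff the carry INTO the
    left-most bit differs from the carry OUT of it."""
    carry_in = ((a & 0x7F) + (b & 0x7F)) >> 7      # carry into bit 7
    carry_out = (a + b) >> 8                       # carry out of bit 7
    return (a + b) & 0xFF, carry_in != carry_out
```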
Let’s try adding two numbers, \( 103_{10} \) and \( 95_{10} \), in two’s-complement:
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 202
Context: # CHAPTER 7. NUMBERS
12. Are the numbers 617,418 and 617,424 congruent mod 3?
Yes. The number 617,418 is exactly 6 less than 617,424. Let's say there are k stones left over after removing groups of three from 617,418. (k must be 0, 1, or 2, of course.) Now if we did the same remove-groups-of-three thing but starting with 617,424 instead, we'll have two more groups of three than we did before, but then also have exactly k stones left over.
13. Are the numbers 617,418 and 617,424 congruent mod 2?
Yes. The number 617,418 is exactly 6 less than 617,424. If there are k stones left over after removing pairs of stones from 617,418, we'd get three additional pairs if we had instead started with 617,424, but then also have exactly k stones left over.
14. Are the numbers 617,418 and 617,424 congruent mod 5?
No. Five doesn’t go evenly into six.
15. Are the numbers 617,418 and 617,424 congruent mod 6?
Yes. The number 617,418 is exactly 6 less than 617,424. If there are k stones left over after removing groups of six from 617,418, we’d get one additional group if we had instead started with 617,424, and then have exactly k stones left over.
16. What’s 11₁₆ + 2₁₆?
13₁₆.
17. What’s 11₁₆ + 9₁₆?
1A₁₆.
18. What’s 11₁₆ + E₁₆?
1F₁₆.
19. What’s 1₁₆ + F₁₆?
10₁₆. (This is the first time we've had to "carry".)
20. What’s 11₁₆ + 22₁₆?
33₁₆.
21. What’s 11₁₆ + 99₁₆?
AA₁₆.
22. What’s 11₁₆ + EF₁₆?
100₁₆. (As in exercise 19, we must carry.)
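Hexadecimal sums like these are easy to verify in Python, which accepts `0x` literals and can format results back into hex:

```python
# A couple of the sums above, checked mechanically.
print(format(0x11 + 0x22, "X"))   # 33
print(format(0x11 + 0xEF, "X"))   # 100  (the carries ripple through)
print(format(0x1 + 0xF, "X"))     # 10   (a single carry)
```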
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 207
Context: # 8.1. PROPOSITIONAL LOGIC
- \( \land \) — "and"
- \( \neg \) — "not"
- \( \lor \) — "or"
- \( \Rightarrow \) — "implies" (or "if...then...")
- \( \oplus \) — "xor" (exclusive "or")
- \( \equiv \) — "equiv" (equivalent)
Just as the ordinary arithmetic operators (\(+\), \(-\), etc.) can be used to join numbers and produce another number, and just as the set operators can be used to join sets and produce another set, the logical operators can be used to join propositions and produce another proposition.
The expression \( 34 + 59 \) produces the number 93. The expression \( \{X, Y\} \cup \{Z\} \) produces the set \( \{X, Y, Z\} \). And the expression \( A \land B \) produces the value **false**, since although \( UMW \) is located in Virginia, the King is not female.
Let's run through the six operators, some of which are intuitive and some of which are not:
### \( \land \) ("and")
The proposition \( X \land Y \) is true when both \( X \) and \( Y \) are true propositions. "\( A \land C \)" represents the proposition "UMW is in Virginia and dogs are carnivores," which has a truth value of **true** since both components are true. This operation is sometimes called a **conjunction**. Notice that the "∧" sign somewhat resembles the "∩" sign for set intersection. This is not an accident. An element is in the intersection of two sets if it is a member of the first and the second set. Hence mathematicians have chosen symbols which reinforce this connection.
### \( \lor \) ("or")
The proposition \( X \lor Y \) is true when either \( X \) or \( Y \) (or both) are true propositions. "\( B \lor C \)" represents the proposition "The King of England is female or dogs are carnivores," which has a truth value of **true** since the second component is true. This operation is sometimes called a **disjunction**. The \( \lor \) looks somewhat like the "∪" sign for union, since an element is in the union of two sets if it is an element of the first set or the second set (or both). This operator is sometimes called an **inclusive or** since it is true even when both propositions are true.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 212
Context: # CHAPTER 8. LOGIC
| X | Y | X∨Y | X∧Y | ¬X | X≡Y | X⊕Y |
|---|---|-----|-----|----|-----|-----|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 0 |
Take a moment and look carefully through the entries in that table, and make sure you agree that this correctly represents the outputs for the five operators. (Note that ¬, being a unary operator, only has X as an input, which means that the value of Y is effectively ignored for that column.)
Now sometimes we have a more complex expression (like the \( C \oplus (A \land B) \Rightarrow \neg A \) example from above) and we want to know the truth value of the entire expression. Under what circumstances — i.e., for what truth values of A, B, and C — is that expression true? We can use truth tables to calculate this piece by piece.
Let’s work through that example in its entirety. First, we set up the inputs for our truth table:
| A | B | C |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |
In this case, there are three inputs to the expression (A, B, and C) and so we have \( 2^3 \), or eight, rows in the truth table.
Now we work our way through the expression inside out, writing down the values of intermediate parts of the expression. We need to know the value of ¬A to figure some other things out, so let’s start with that one:
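The whole fill-in-the-truth-table procedure is easy to mechanize. A Python sketch that evaluates the example expression \( C \oplus (A \land B) \Rightarrow \neg A \) on all eight rows (`implies` is our own helper, since Python has no implication operator; `!=` on booleans plays the role of xor):

```python
from itertools import product

def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

rows = []
for a, b, c in product([False, True], repeat=3):
    value = implies(c != (a and b), not a)
    rows.append(((a, b, c), value))
```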
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 216
Context: # CHAPTER 8. LOGIC
| X | Y | X ∧ Y | ¬(X ∧ Y) | ¬X | ¬Y | X ∨ Y | ¬(X ∨ Y) | ¬(X ∨ Y) ⇔ (¬X ∧ ¬Y) |
|---|---|-------|-----------|----|----|-------|-----------|-----------------------|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
This is one of De Morgan's Laws, which we've seen previously with regards to sets (p. 21). Here is the other:
| X | Y | X ∧ Y | ¬(X ∧ Y) | ¬X | ¬Y | X ∨ Y | ¬(X ∨ Y) | ¬(X ∧ Y) ⇔ (¬X ∨ ¬Y) |
|---|---|-------|-----------|----|----|-------|-----------|-----------------------|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
The first can be expressed as “the negation of the disjunction is equal to the conjunction of the negations,” and the second as “the negation of the conjunction is equal to the disjunction of the negations.” If that helps at all.
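Both laws can be verified exhaustively; with only two variables there are just four cases to check. A Python sketch:

```python
from itertools import product

for x, y in product([False, True], repeat=2):
    # Negation of the disjunction == conjunction of the negations.
    assert (not (x or y)) == ((not x) and (not y))
    # Negation of the conjunction == disjunction of the negations.
    assert (not (x and y)) == ((not x) or (not y))
```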
One last identity is this one:
| X | Y | Z | Y ∨ Z | X ∧ (Y ∨ Z) | X ∧ Y | X ∧ Z | (X ∧ Y) ∨ (X ∧ Z) | A* |
|---|---|---|-------|-------------|-------|-------|--------------------|----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
*Here, "A" is X ∧ (Y ∨ Z) ⇔ (X ∧ Y) ∨ (X ∧ Z).
This is none other than the distributive law, which we also saw for set union and intersection (p. 20) and which you should also remember from introductory algebra: \( x \cdot (y + z) = x \cdot y + x \cdot z \).
It's interesting, actually, when you compare the distributive law from algebra to the distributive law for logic:
\[ x \cdot (y + z) = (x \cdot y) + (x \cdot z) \]
\[ X \land (Y \lor Z) \equiv (X \land Y) \lor (X \land Z) \]
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 217
Context: # 8.2 Predicate logic
The `∧` operator is analogous to `·` (times), while `∨` corresponds to `+` (plus). In fact, if you look at the truth tables for these two operators again, you’ll see an uncanny resemblance:
| X | Y | X ∧ Y | X ∨ Y |
|---|---|-------|-------|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | (1) |
Except for the `(1)` that I put in parentheses, this truth table is exactly what you’d get if you mathematically multiplied (`∧`) and added (`∨`) the inputs! At some level, logically “and-ing” is multiplying, while “or-ing” is adding. Fascinating.
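In Python, where `True` and `False` behave as 1 and 0, the analogy is directly checkable (capping the sum at 1 accounts for the parenthesized entry):

```python
for x in (0, 1):
    for y in (0, 1):
        assert x * y == (x and y)           # and-ing is multiplying
        assert min(x + y, 1) == (x or y)    # or-ing is (capped) adding
```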
## 8.2 Predicate logic
Propositional logic can represent a lot of things, but it turns out to be too limiting to be practically useful. That has to do with the atomic nature of propositions. Every proposition is its own opaque chunk of truthhood or falsity, with no way to break it down into constituent parts.
Suppose I wanted to claim that every state in the union had a governor. To state this in propositional logic, I’d have to create a brand new proposition for each state:
- Let G1 be the proposition that Alabama has a governor.
- Let G2 be the proposition that Alaska has a governor.
- Let G3 be the proposition that Arizona has a governor.
- ...
and then, finally, I could assert:
G1 ∧ G2 ∧ G3 ∧ ... ∧ G50.
That’s a lot of work just to create a whole bunch of individual propositions that are essentially the same. What we need is some
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 222
Context: 214
# CHAPTER 8. LOGIC
So these two cases both result in true. But perhaps surprisingly, we also get true for oatmeal:
- **Human**(oatmeal) → **Adult**(oatmeal) ⊕ **Child**(oatmeal)
false → false ⊕ false
false → false
true ✓
Whoa, how did true pop out of that? Simply because the premise was false, and so all bets were off. We effectively said “if a bowl of oatmeal is human, then it will either be an adult or a child. But it’s not, so never mind!” Put another way, the bowl of oatmeal did not turn out to be a counterexample, and so we’re confident claiming that this expression is true for all \( x \), which is exactly what the “\( \forall x \)” out front asserts.
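This “false premise, so all bets are off” behavior is just material implication. A tiny Python sketch (`implies` is our helper name; `!=` on booleans acts as xor):

```python
def implies(p, q):
    """p => q is false only when p is true and q is false."""
    return (not p) or q

# The bowl of oatmeal: not human, not an adult, not a child.
human, adult, child = False, False, False
oatmeal_case = implies(human, adult != child)
```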
The other kind of quantifier is called the **existential quantifier**. As its name suggests, it asserts the existence of something. We write it \( \exists \) and pronounce it “there exists.” For example,
- \( \exists x \; \text{HasGovernor}(x) \) asserts that there is at least one state that has a governor. This doesn’t tell us how many states this is true for, and in fact despite their name, quantifiers really aren’t very good at “quantifying” things for us, at least numerically. As of 2008, the statement
- \( \exists x \, (\text{President}(x) \land \text{African-American}(x)) \)
is true, and always will be, no matter how many more African-American U.S. presidents we have. Note that in compound expressions like this, a variable (like \( x \)) always stands for a single entity wherever it appears. For hundreds of years there existed African-Americans, and there have existed Presidents, so the expression above would be ridiculously obvious if it meant only “there have been Presidents, and there have been African-Americans.” But the same variable \( x \) being used as inputs to both predicates is what seals the deal and makes it represent the much stronger statement: “there is at least one individual who is personally both African-American and President of the United States at the same time.”
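A toy sketch (invented membership sets) of why the shared variable matters:

```python
# A toy domain showing why the shared variable x makes
# ∃x (President(x) ∧ African-American(x)) stronger than two separate
# existential claims with independent variables.
people = ["Washington", "Douglass", "Obama"]

def president(x):
    return x in {"Washington", "Obama"}

def african_american(x):
    return x in {"Douglass", "Obama"}

# Same x in both predicates: one individual must satisfy both.
both = any(president(x) and african_american(x) for x in people)
# Independent claims: some president exists, and some African-American exists.
separate = any(map(president, people)) and any(map(african_american, people))
print(both, separate)  # True True; drop "Obama" and only `separate` stays True
```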
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 224
Context: # CHAPTER 8. LOGIC
## Order matters
When you're facing an intimidating morass of \( \forall \)s and \( \exists \)s and \( \vee \)s and \( \wedge \)s, and God knows what else, it's easy to get lost in the sauce. But you have to be very careful to dissect the expression to find out what it means. Consider this one:
\[
\forall x \in \mathbb{R}, \exists y \in \mathbb{R} \quad y = x + 1.
\] (8.5)
This statement is true. It says that for every single real number (call it \( x \)), it's true that you can find some other number (call it \( y \)) that's one greater than it. If you generate some examples it's easy to see this is true. Suppose we have the real number \( x = 5 \). Is there some other number \( y \) that's equal to \( x + 1 \)? Of course, the number 6. What if \( x = -32.4 \)? Is there a number \( y \) that satisfies this equation? Of course, \( y = -31.4 \). Obviously no matter what number \( x \) we choose, we can find the desired number \( y \) just by adding one. Hence this statement is true for all \( x \), just like it says.
What happens, though, if we innocently switch the order of the quantifiers? Let's try asserting this:
\[
\exists y \in \mathbb{R}, \forall x \in \mathbb{R} \quad y = x + 1.
\] (8.6)
Is this also true? Look carefully. It says “there exists some magic number \( y \) that has the following amazing property: no matter what value of \( x \) you choose, this \( y \) is one greater than \( x \)!” Obviously this is not true. There is no such number \( y \). If I choose \( y = 13 \), that works great as long as I choose \( x = 12 \), but for any other choice of \( x \), it’s dead in the water.
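Both statements can be checked over a small finite stand-in for the reals:

```python
# Checking (8.5) and (8.6) over small finite domains: swapping the
# quantifiers changes the truth value.
xs = [-2, -1, 0, 1, 2]
ys = [-1, 0, 1, 2, 3]

# ∀x ∃y (y = x + 1): for each x separately, some y works.
forall_exists = all(any(y == x + 1 for y in ys) for x in xs)

# ∃y ∀x (y = x + 1): one fixed y must work for every x at once.
exists_forall = any(all(y == x + 1 for x in xs) for y in ys)

print(forall_exists, exists_forall)  # True False
```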
## The value of precision
This fluency with the basic syntax and meaning of predicate logic was our only goal in this chapter. There are all kinds of logical rules that can be applied to predicate logic statements in order to
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 235
Context:
9.2 TYPES OF PROOF
==================
to be true, and so it is legal grounds from which to start. A proof can't even get off the ground without axioms. For instance, in step 1 of the above proof, we noted that either A or B must be true, and so if B isn't true, then A must be. But we couldn't have taken this step without knowing that disjunctive syllogism is a valid form of reasoning. It's not important to know all the technical names of the rules that I included in parentheses. But it is important to see that we made use of an axiom of reasoning on every step, and that if any of those axioms were incorrect, it could lead to a faulty conclusion.
When you create a valid proof, the result is a new bit of knowledge called a theorem which can be used in future proofs. Think of a theorem like a subroutine in programming: a separate bit of code that does a job and can be invoked at will in the course of doing other things. One theorem we learned in chapter 2 was the distributive property of sets; that is, \( X \cap (Y \cup Z) = (X \cap Y) \cup (X \cap Z) \). This can be proven through the use of Venn diagrams, but once you've proven it, it's accepted to be true, and can be used as a "given" in future proofs.
## 9.2 Types of Proof
There are a number of accepted "styles" of doing proofs. Here are some important ones:
### Direct Proof
The examples we've used up to now have been direct proofs. This is where you start from what's known and proceed directly by positive steps towards your conclusion.
Direct proofs remind me of a game called "word ladders," invented by Lewis Carroll, that you might have played as a child:
```
WARM
|
|
? ? ? ?
|
|
COLD
```
```
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 240
Context: 232
# CHAPTER 9. PROOF
“recursion,” because this is exactly what it is. Remember that we discussed recursion in the context of rooted trees (see p.116). A tree can be thought of as a node with several children—each of which are, in turn, trees. Each of them is the root node of a tree comprised of yet smaller trees, and so on and so forth. If you flip back to the left-hand side of Figure 5.16 on p.113, you’ll see that A is the root of one tree, and its two children, F and B, are roots of their own smaller trees in turn. If we were to traverse this tree in (say) pre-order, we’d visit the root, then visit the left and right subtrees in turn, treating each of them as their own tree. In this way we’ve broken up a larger problem (traversing the big tree) into smaller problems (traversing the smaller trees F and B). The A node has very little to do: it just visits itself, then defers all the rest of the work onto its children. This idea of pawning off most of the work onto smaller subproblems that you trust will work is key to the idea of inductive proofs.
Mathematical induction is hard to wrap your head around because it feels like cheating. It seems like you never actually prove anything: you defer all the work to someone else, and then declare victory. But the chain of reasoning, though delicate, is strong as iron.
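The recursion/induction parallel can be seen in code: computing 1 + 2 + ⋯ + n recursively has exactly the base-case-plus-inductive-step shape:

```python
# The recursive structure mirrors an inductive proof: the base case handles
# n = 1 directly, and the recursive call is the smaller subproblem we trust.
def triangular(n):
    if n == 1:                    # base case: P(1)
        return 1
    return n + triangular(n - 1)  # inductive step: defer to P(n - 1)

print(triangular(100))  # 5050
```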
## Casting the problem in the right form
Let’s examine that chain. The first thing you have to be able to do is express the thing you’re trying to prove as a predicate about natural numbers. In other words, you need to form a predicate that has one input, which is a natural number. You’re setting yourself up to prove that the predicate is true for all natural numbers. (Or at least, all natural numbers of at least a certain size.)
Suppose I want to prove that in the state of Virginia, all legal drinkers can vote. Then I could say “let VOTE(n) be the proposition that a citizen of age n can vote.”
If I want to prove an algebraic identity, like
\[
\sum_{i=1}^{n} i = \frac{n(n+1)}{2},
\]
then I have to figure out which variable is the one that needs to vary across the natural numbers. In this case it’s the \( n \) variable in my equation.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 244
Context: # Example 2
A famous story tells of Carl Friedrich Gauss, perhaps the most brilliant mathematician of all time, getting in trouble one day as a schoolboy. As punishment, he was sentenced to tedious work: adding together all the numbers from 1 to 100. To his teacher's astonishment, he came up with the correct answer in a moment, not because he was quick at adding integers, but because he recognized a trick. The first number on the list (1) and the last (100) add up to 101. So do 2 and 99, and so do 3 and 98, etc., all the way up to 50 and 51. So really what you have here is 50 different sums of 101 each, so the answer is \( 50 \times 101 = 5050 \). In general, if you add the numbers from 1 to \( x \), where \( x \) is any integer at all, you'll get \( \frac{x(x + 1)}{2} \).
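Gauss's trick in code, with the closed form checked against brute-force addition:

```python
# Gauss's pairing trick: 50 pairs that each sum to 101 give 50 * 101, i.e.
# the closed form x(x + 1)/2, which agrees with adding the numbers directly.
def gauss_sum(x):
    return x * (x + 1) // 2

print(gauss_sum(100), sum(range(1, 101)))  # 5050 5050
```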
Now, use mathematical induction to prove that Gauss was right (i.e., that \( \sum_{i=1}^{x} i = \frac{x(x + 1)}{2} \)) for all numbers \( x \).
First, we have to cast our problem as a predicate about natural numbers. This is easy: say “let \( P(n) \) be the proposition that
\[
\sum_{i=1}^{n} i = \frac{n(n + 1)}{2}
\]
Then, we satisfy the requirements of induction:
1. **Base Case**: We prove that \( P(1) \) is true simply by plugging it in. Setting \( n = 1 \), we have:
\[
\sum_{i=1}^{1} i = \frac{1(1 + 1)}{2}
\]
\[
1 = \frac{1(2)}{2}
\]
\[
1 = 1 \quad \text{✓}
\]
2. **Inductive Step**: We now must prove that \( P(k) \) implies \( P(k + 1) \). Put another way, we assume \( P(k) \) is true, and then use that assumption to prove that \( P(k + 1) \) is also true.
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 245
Context: # 9.3. PROOF BY INDUCTION
Let’s be crystal clear where we’re going with this. Assuming that \( P(k) \) is true means we can count on the fact that
\[
1 + 2 + 3 + \cdots + k = \frac{k(k + 1)}{2}
\]
What we need to do, then, is prove that \( P(k + 1) \) is true, which amounts to proving that
\[
1 + 2 + 3 + \cdots + (k + 1) = \frac{(k + 1)(k + 2)}{2}
\]
Very well. First we make the inductive hypothesis, which allows us to assume:
\[
1 + 2 + 3 + \cdots + k = \frac{k(k + 1)}{2}
\]
The rest is just algebra. We add \( k + 1 \) to both sides of the equation, then multiply things out and factor it all together. Watch carefully:
\[
1 + 2 + 3 + \cdots + k + (k + 1) = \frac{k(k + 1)}{2} + (k + 1)
\]
\[
= \frac{k^2 + k}{2} + \frac{2k + 2}{2}
\]
\[
= \frac{k^2 + 3k + 2}{2}
\]
\[
= \frac{(k + 1)(k + 2)}{2}
\]
Therefore, \( \forall n \geq 1 \, P(n) \).
## Example 3
Another algebra one. You learned in middle school that \( (ab)^n = a^n b^n \). Prove this by mathematical induction.
**Solution:** Let \( P(n) \) be the proposition that \( (ab)^n = a^n b^n \).
####################
File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf
Page: 253
Context: # Index
- n-choose-k notation, 156
- n-to-the-k-falling operator, 152
- a priori, 68
- modus ponens, 201, 226
- modus tollens, 226
- quod erat demonstrandum (Q.E.D.), 226
- reductio ad absurdum, 229
- acyclic (graphs), 91
- additivity property, 63
- adjacent (vertices), 89
- algorithm, 97, 127, 128, 132, 149, 150
- Ali, Muhammad, 92
- American Idol, 62, 68
- ancestor (of a node), 115
- and (logical operator), 18, 199, 203
- antisymmetric (relation), 40, 43
- Appel, Kenneth, 244
- arrays, 13
- artificial intelligence (AI), 197, 201, 225
- associative, 20
- asymmetric (relation), 41
- ATM machines, 143
- atomic (propositions), 198
- AVL trees, 133
- axioms, 226, 229
- background knowledge, 68, 70
- balancedness (of a tree), 132
- base case (of a proof), 233, 240
- bases (of number systems), 166, 168, 170
- Bayes' Theorem, 75
- Bayes, Thomas, 67
- Bayesian, 66
- BFT (breadth-first traversal), 95, 97
- Big-O notation, 127
- bijective (function), 49
- binary numbers, 25, 177, 178, 180, 182
- binary search trees, 123, 125
- binomial coefficients, 156
- bit, 177
- Booth, John Wilkes, 86
- BST property, 125, 131
- byte, 180
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 8
Context: 2. Notation
==============
- The notation log represents the logarithm in base two, while ln represents the natural logarithm.
- \( \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) \) indicates that random vector \(\mathbf{x}\) is distributed according to a multivariate Gaussian distribution with mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\). The multivariate Gaussian pdf is denoted as \( \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \Sigma) \) as a function of \(\mathbf{x}\).
- \( \mathbf{x} \sim U[a,b] \) indicates that \(\mathbf{x}\) is distributed according to a uniform distribution in the interval [a, b]. The corresponding uniform pdf is denoted as \( U(a,b) \).
- \( \delta(x) \) denotes the Dirac delta function or the Kronecker delta function, as clear from the context.
- \( \|\mathbf{a}\|_2^2 = \sum_{i=1}^N a_i^2 \) is the quadratic, or \( l_2 \) norm of a vector \(\mathbf{a} = [a_1, \ldots, a_N]^T\). We similarly define the \( l_1 \) norm as \( \|\mathbf{a}\|_1 = \sum_{i=1}^N |a_i| \), and the \( l_0 \) pseudo-norm \( \|\mathbf{a}\|_0 \) as the number of non-zero entries of vector \(\mathbf{a}\).
- \( I \) denotes the identity matrix, whose dimensions will be clear from the context. Similarly, \( \mathbf{1} \) represents a vector of all ones.
- \( \mathbb{R} \) is the set of real numbers; \( \mathbb{R}^+ \) the set of non-negative real numbers; \( \mathbb{R}^- \) the set of non-positive real numbers; and \( \mathbb{R}^N \) is the set of all vectors of \( N \) real numbers.
- \( |S| \) represents the cardinality of a set \( S \).
- \( \mathbf{x}_{\mathcal{S}} \) represents the set of random variables \( \mathbf{x}_k \) indexed by the integers \( k \in \mathcal{S} \).
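The three norms defined above can be sketched in plain Python (no NumPy assumed):

```python
# Sketch of the l2, l1, and l0 (pseudo-)norms from the notation list.
def l2_norm_sq(a):
    # ||a||_2^2: sum of squared entries.
    return sum(x * x for x in a)

def l1_norm(a):
    # ||a||_1: sum of absolute values.
    return sum(abs(x) for x in a)

def l0_pseudo_norm(a):
    # ||a||_0: number of non-zero entries (a pseudo-norm).
    return sum(1 for x in a if x != 0)

a = [3, 0, -4]
print(l2_norm_sq(a), l1_norm(a), l0_pseudo_norm(a))  # 25 7 2
```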
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 11
Context: # Part I
## Basics
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 33
Context: # 2.3. Frequentist Approach
Since the latter accounts for all pairs \((x, y) \sim p(x, y)\), while the former includes only the training points used for learning. More importantly, the key observation here is that increasing \(M\) allows one to better fit—and possibly overfit—the training set, hence reducing the training loss \(L_D(M)\). The generalization loss \(L_p(M)\) instead tends to decrease at first, as we move away from the underfitting regime, but it eventually increases for sufficiently large \(M\). The widening of the gap between training and generalization provides evidence that overfitting is occurring. From Fig. 2.3, we can hence conclude that, in this example, model orders larger than \(M = 7\) should be avoided since they lead to overfitting, while model orders less than \(M = 3\) should also not be considered in order to avoid underfitting.

**Figure 2.4**: Square root of the generalization loss \(L_p(M)\) and of the training loss \(L_D(M)\) as a function of the training set size \(N\). The asymptote of the generalization and training losses is given by the minimum generalization loss \(L_p(w^*)\) (cf. (2.21)) achievable for the given model order (see Fig. 2.5).
## What if we had more data?
Extrapolating from the behavior observed in Fig. 2.2, we can surmise that, as the number \(N\) of data points increases, overfitting is avoided even for larger values of \(M\). In fact, when the training set is big as compared to the number of parameters in \(\theta\),
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 50
Context: ```
# A Gentle Introduction through Linear Regression
Produces a description of approximately
$$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} $$
bits. This description is, however, not sufficient, since the decoder of the description should also be informed about the parameters $(w_{ML}, \beta_{ML})$.
Using quantization, the parameters can be described by means of a number $C(M)$ of bits that is proportional to the number of parameters, here $M + 2$. Concatenating these bits with the description produced by the ML model yields the overall description length
$$ - \sum_{n=1}^{N} \log{p[y_n | x_n, w_{ML}, \beta_{ML}]} + C(M). \quad (2.39) $$
MDL – in the simplified form discussed here – selects the model order $M$ that minimizes the description length (2.39). Accordingly, the term $C(M)$ acts as a regularizer. The optimal value of $M$ for the MDL criterion is hence the result of the trade-off between the minimization of the overhead $C(M)$, which calls for a small value of $M$, and the minimization of the NLL, which decreases with $M$.
Under some technical assumptions, the overhead term can be often evaluated in the form
$$ \frac{K}{2} \log N + c, $$
where $K$ is the number of parameters in the model and $c$ is a constant. This expression is of limited direct use in practice, but it provides intuition about the mechanism used by MDL to combat overfitting.
## 2.6 Information-Theoretic Metrics
We now provide a brief introduction to information-theoretic metrics by leveraging the example studied in this chapter. As we will see in the following chapters, information-theoretic metrics are used extensively in the definition of learning algorithms. Appendix A provides a detailed introduction to information-theoretic measures in terms of inferential tasks. Here we introduce the key metrics of Kullback-Leibler (KL) divergence and entropy by examining the asymptotic behavior of ML in the regime of large $N$. The case with finite $N$ is covered in Chapter 6 (see Sec. 6.4.3).
```
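Returning to the MDL selection rule (2.39) above, a toy sketch. The NLL values are invented for illustration, and the overhead uses a description length of the \( (K/2)\log N \) form with \( K = M + 2 \) parameters, as in the text:

```python
import math

# Simplified MDL model selection: pick the model order M minimizing
# NLL + C(M), where C(M) = (K/2) log N with K = M + 2 parameters.
def mdl_score(nll, M, N):
    K = M + 2  # weights plus noise precision, per the text
    return nll + 0.5 * K * math.log(N)

N = 100
nll = {1: 250.0, 3: 180.0, 7: 165.0, 9: 164.0}  # hypothetical NLL per order
best_M = min(nll, key=lambda M: mdl_score(nll[M], M, N))
print(best_M)  # 7: the overhead penalizes M = 9 despite its lower NLL
```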
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 56
Context: 50
# A Gentle Introduction through Linear Regression
When the number of data points goes to infinity; while the latter is parameterized and is subject to optimization. For this reason, divergence metrics between the two distributions play an important role in the development of learning algorithms. We will see in the rest of the monograph that other frequentist methods may involve a single distribution rather than two, as discussed in Sec. 6.6 or an additional, so-called variational distribution, as covered in Sec. 6.3 and Chapter 8.
In contrast, Bayesian methods posit a single coherent distribution over the data and the parameters, and frame the learning problem as one of inference of unobserved variables. As we will discuss in Chapter 8, variational Bayesian methods also introduce an additional variational distribution and are a building block for frequentist learning in the presence of unobserved variables.
The running example in this chapter has been one of linear regression for a Gaussian model. The next chapter provides the necessary tools to construct and learn more general probabilistic models.
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 58
Context: # 3.1 Preliminaries
We start with a brief review of some definitions that will be used throughout the chapter and elsewhere in the monograph (see [28] for more details). Readers with a background in convex analysis and calculus may just review the concept of sufficient statistic in the last paragraph.
First, we define a **convex set** as a subset of \(\mathbb{R}^D\), for some \(D\), that contains all segments between any two points in the set. Geometrically, convex sets hence cannot have “indentations.” Function \(f(x)\) is convex if its domain is a convex set and if it satisfies the inequality \(f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)\) for all \(x_1\) and \(x_2\) in its domain and for all \(0 \leq \lambda \leq 1\). Geometrically, this condition says that the function is “U”-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve. A function is strictly convex if the inequality above is strict except for \(\lambda = 0\) or \(\lambda = 1\); a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.”
The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28].
We also introduce two useful concepts from calculus. The **gradient** of a differentiable function \(f(x)\) with \(x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D\) is defined as the \(D \times 1\) vector \(\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T\) containing all partial derivatives. At any point \(x\) in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian \(\nabla^2 f(x)\) is the \(D \times D\) matrix with \((i,j)\) element given by the second-order derivative \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\). It captures the local curvature of the function.
1. A statistic is a function of the data.
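As a quick numerical check of the gradient definition, a finite-difference sketch (toy convex function, not from the text):

```python
# Central finite-difference approximation of the gradient, checked against
# the convex function f(x) = x1^2 + 3*x2^2, whose gradient is [2*x1, 6*x2].
def grad(f, x, h=1e-6):
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
g = grad(f, [1.0, 2.0])
print(g)  # approximately [2.0, 12.0]
```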
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 64
Context: ```markdown
setting \( C = 2 \) recovers the Bernoulli distribution. PMFs in this model are given as:
\[
\text{Cat}(x| \mu) = \prod_{k=0}^{C-1} \mu_k^{I(x=k)} \quad (3.16)
\]
where we have defined \( \mu_k = \Pr[x = k] \) for \( k = 1, \ldots, C - 1 \) and \( \mu_0 = 1 - \sum_{k=1}^{C-1} \mu_k = \Pr[x = 0] \). The log-likelihood (LL) function is given as:
\[
\ln \text{Cat}(x|\mu) = \sum_{k=1}^{C-1} I(x=k) \ln \left( \frac{\mu_k}{\mu_0} \right) + \ln \mu_0 \quad (3.17)
\]
This demonstrates that the categorical model is in the exponential family, with sufficient statistics vector \( u(x) = [I(x = 1), \ldots, I(x = C - 1)]^T \) and measure function \( m(x) = 1 \). Furthermore, the mean parameters \( \mu = [\mu_1, \ldots, \mu_{C-1}]^T \) are related to the natural parameter vector \( \eta = [\eta_1, \ldots, \eta_{C-1}]^T \) by the mapping:
\[
\eta_k = \ln \left( \frac{\mu_k}{1 - \sum_{j=1}^{C-1} \mu_j} \right) \quad (3.18)
\]
which again takes the form of an LLR. The inverse mapping is given by:
\[
\mu_k = \frac{e^{\eta_k}}{1 + \sum_{j=1}^{C-1} e^{\eta_j}} \quad (3.19)
\]
The parameterization given here is minimal, since the sufficient statistics \( u(x) \) are linearly independent. An overcomplete representation would instead include in the vector of sufficient statistics also the function \( I(x=0) \). In this case, the resulting vector of sufficient statistics is:
\[
u(x) = [I(x = 0), \ldots, I(x = C - 1)]^T \quad (3.20)
\]
which is known as one-hot encoding of the categorical variable, since only one entry equals 1 while the others are zero. Furthermore, with this...
```
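The mappings (3.18) and (3.19) between mean and natural parameters can be checked numerically; a sketch with \( C = 3 \) and toy mean parameters:

```python
import math

# Round trip between mean parameters mu_k and natural parameters eta_k for a
# categorical model with C = 3 classes (toy values for illustration).
mu = [0.2, 0.5]               # mu_1, mu_2; reference class has mu_0 = 0.3
mu0 = 1 - sum(mu)

# (3.18): eta_k = ln(mu_k / mu_0), an LLR against the reference class x = 0.
eta = [math.log(m / mu0) for m in mu]

# (3.19): inverse mapping, a softmax-like normalization.
denom = 1 + sum(math.exp(e) for e in eta)
mu_back = [math.exp(e) / denom for e in eta]

print([round(m, 6) for m in mu_back])  # recovers [0.2, 0.5]
```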
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 76
Context: To elaborate, let us denote as `exponential(η)` a probabilistic model in the exponential family, that is, a model of the form (3.1) with natural parameters η. We also write `exponential(μ|θ)` for a probabilistic model in the exponential family with mean parameters μ.
Using the notation adopted in the previous chapter, in its most common form, a GLM defines the probability of a target variable as:
\[
p(y|x, W) = \text{exponential}(η|Wz) \tag{3.53}
\]
where we recall that \( z \) is the vector of explanatory variables, and \( W \) here denotes a matrix of learnable weights of suitable dimensions. According to (3.53), GLMs posit that the response variable \( y \) has a conditional distribution from the exponential family, with natural parameter vector \( η = Wz \) given by a linear function of the given explanatory variables \( z \) with weights \( W \). More generally, we may have the parametrization:
\[
p(y|x, W) = \text{exponential}(η|Wϕ(x)) \tag{3.54}
\]
for some feature vector \( ϕ(x) \) obtained as a function of the input variables \( x \) (see next chapter).
While being the most common, the definition (3.54) is still not the most general for GLMs. More broadly, GLMs can be interpreted as a generalization of the linear model considered in the previous chapter, whereby the mean parameters are defined as a linear function of a feature vector. This viewpoint, described next, may also provide a more intuitive understanding of the modeling assumptions made by GLMs.
Recall that, in the recurring example of Chapter 2, the target variable was modeled as Gaussian distributed with mean given by a linear function of the covariates \( z \). Extending the example, GLMs posit the model:
\[
p(y|x, W) = \text{exponential}(η|g(Wϕ(x))) \tag{3.55}
\]
where the mean parameter vector is parametrized as a function of the feature vector \( ϕ(x) \) through a generally non-linear vector function \( g(·) \) of suitable dimensions. In words, GLMs assume that the target variable is a “noisy” measure of the mean \( μ = g(Wϕ(x)) \).
When the function \( g(·) \) is selected as the gradient of the partition function of the selected model, e.g., \( g(·) = \nabla A(·) \), then, by (3.10), we...
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 100
Context: # 4.4.3 Multi-Class Classification
Given their larger capacity, mixture models may be prone to overfitting. A way to control overfitting will be discussed in Sec. 4.7.
In the case of \( K \) classes, the relevant exponential family distribution is Categorical with natural parameters depending linearly on the feature vector. This yields the following discriminative model as a generalization of logistic regression:
\[
t \,|\, x, W \sim \text{Cat}(t \,|\, W \phi(x)),
\]
where the label vector \( t \) is defined using one-hot encoding (Chapter 3) and \( W \) is a matrix of weights. We can also equivalently write the vector of probabilities for the \( K \) classes as
\[
y = \text{softmax}(W \phi(x)) = \left( \frac{e^{w_1^T \phi(x)}}{\sum_{j=1}^K e^{w_j^T \phi(x)}}, \ldots, \frac{e^{w_K^T \phi(x)}}{\sum_{j=1}^K e^{w_j^T \phi(x)}} \right)^T,
\]
where \( y = [y_1, \ldots, y_K]^T \) with \( y_k = p(\mathcal{C}_k | x) \); and \( w_k^T \) is the \( k \)-th row of the weight matrix \( W \).
Learning follows as for logistic regression. To briefly elaborate on this point, the NLL can be written as the cross-entropy function:
\[
-\ln p(t | x, W) = -\sum_{n=1}^{N} \ln p(t_n | x_n, W) = -\sum_{n=1}^{N} t_n^T \ln(y_n), \quad (4.32)
\]
where the logarithm is applied element by element, and we have \( y_n = \text{softmax}(W \phi(x_n)) \). Note that each term in (4.32) can be expressed as the cross-entropy \( -t_n^T \ln(y_n) = H(t_n \| y_n) \). The ML problem is again convex and hence efficiently solvable. The gradient of the NLL can again be found using the general formula (3.28) for exponential models and the chain rule for derivatives. We can write
\[
\nabla_W \left( -\ln p(t | x, W) \right) = \sum_{n=1}^{N} (y_n - t_n) \, \phi(x_n)^T.
\]
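A sketch of the quantities above for a single example, with toy weights and features (all values invented): softmax probabilities, the cross-entropy NLL, and the per-example gradient of the form \( (y - t)\phi(x)^T \):

```python
import math

# Multi-class logistic regression quantities for one toy example.
def softmax(a):
    m = max(a)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

phi = [1.0, 2.0]                             # feature vector phi(x)
W = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]]   # K = 3 classes, 2 features
t = [0, 1, 0]                                # one-hot label

a = [sum(wj * pj for wj, pj in zip(row, phi)) for row in W]  # activations
y = softmax(a)
nll = -sum(tk * math.log(yk) for tk, yk in zip(t, y))  # cross-entropy H(t||y)
grad_W = [[(yk - tk) * pj for pj in phi] for yk, tk in zip(y, t)]

print(round(nll, 4))
```

Since both \( y \) and \( t \) sum to one, the gradient rows sum to the zero vector, a quick sanity check on the implementation.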
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 112
Context: where \( t \) is encoded using one-hot encoding, so that the label of each example is given by the vector \( t_n = [t_{n,1}, \ldots, t_{n,K}]^T \). Following the discussion above, moment matching yields the ML estimates
\[
\pi_{k,\text{ML}} = \frac{N[k]}{N} = \frac{1}{N} \sum_{n=1}^N t_{n,k} \quad (4.45a)
\]
\[
\mu_{k,\text{ML}} = \frac{1}{N[k]} \sum_{n=1}^N t_{n,k} \, x_n \quad (4.45b)
\]
\[
\Sigma_{k,\text{ML}} = \frac{1}{N[k]} \sum_{n=1}^N t_{n,k} (x_n - \mu_{k,\text{ML}}) (x_n - \mu_{k,\text{ML}})^T \quad (4.45c)
\]
where \( N[k] = \sum_{n=1}^N t_{n,k} \) is the number of training examples in class \( k \).
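A toy numerical sketch of the class-prior and class-mean estimates above, on made-up 1-D data with one-hot labels:

```python
# Moment-matching ML estimates (4.45a)-(4.45b) on toy 1-D data.
t = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]  # one-hot labels, K = 2
x = [1.0, 3.0, 10.0, 12.0, 14.0]
N, K = len(x), 2

N_k = [sum(t[n][k] for n in range(N)) for k in range(K)]  # class counts N[k]
pi = [N_k[k] / N for k in range(K)]                        # priors (4.45a)
mu = [sum(t[n][k] * x[n] for n in range(N)) / N_k[k]       # means (4.45b)
      for k in range(K)]

print(pi, mu)  # [0.4, 0.6] [2.0, 12.0]
```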
### 4.7 Boosting*
In this section, we return to the mixture models of the form (4.29) and discuss a popular training approach to reduce overfitting. We focus on deterministic discriminative models with activations \( \alpha_k(x, \tilde{w}_k) \), \( k = 1, \ldots, K \), in which the mixture predictor is given as
\[
a(x, \tilde{w}) = \sum_{k=1}^K \pi_k \alpha_k(x, \tilde{w}_k) \quad (4.46)
\]
with learnable parameters \( \{\pi_k\} \) and \( \{\tilde{w}_k\} \). The technique, known as boosting, trains one model \( \alpha_k(x, \tilde{w}_k) \) at a time in a sequential fashion, from \( k = 1 \) to \( k = K \), hence adding one predictor at each training step. As a result, boosting increases the capacity of the model in a sequential manner, extending the sum in (4.46) over a growing number of predictors. In this way, one starts by training a model with a large bias, or approximation error, and progressively decreases the bias at the cost of a potentially larger estimation error (see Chapter 2 and the next chapter for further discussion on bias and estimation error). As we will discuss below, each model is trained by solving an ERM problem in which the contribution of a training example is weighted according to the errors of the previously trained models.
To elaborate, boosting—more specifically the AdaBoost scheme—can be described as solving an ERM problem with the exponential loss function \( l(t, a(x, \tilde{w})) = \exp(-t \cdot a(x, \tilde{w})) \), as plotted in Fig. 4.4.
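A minimal sketch of the sequential scheme described above, assuming simple threshold "stumps" and invented 1-D data (not the book's example); the stump with threshold 5.5 lies beyond all data points and so acts as a constant (bias) predictor:

```python
import math

# Minimal AdaBoost sketch: pick the best weighted stump, compute its vote
# alpha, and re-weight the examples so misclassified ones count more.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
t = [1, 1, -1, -1, 1, 1]                 # labels in {-1, +1}

def stump(thr, sign):
    return lambda v: sign if v > thr else -sign

candidates = [stump(th, s) for th in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
              for s in (1, -1)]

w = [1.0 / len(x)] * len(x)              # uniform initial example weights
ensemble = []                            # pairs (alpha_k, predictor_k)
for _ in range(3):                       # K = 3 boosting rounds
    # Pick the stump with the smallest weighted training error.
    errs = [sum(wi for wi, xi, ti in zip(w, x, t) if h(xi) != ti)
            for h in candidates]
    best = min(range(len(candidates)), key=errs.__getitem__)
    h, err = candidates[best], max(errs[best], 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    ensemble.append((alpha, h))
    # Misclassified examples gain weight; then renormalize.
    w = [wi * math.exp(-alpha * ti * h(xi)) for wi, xi, ti in zip(w, x, t)]
    z = sum(w)
    w = [wi / z for wi in w]

def predict(v):
    return 1 if sum(a * h(v) for a, h in ensemble) > 0 else -1

print([predict(v) for v in x])  # [1, 1, -1, -1, 1, 1] matches the labels
```

No single stump classifies this pattern correctly, but the weighted vote of three sequentially trained stumps does, illustrating how the capacity grows with each round.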
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 130
Context: # 5.4 VC Dimension and Fundamental Theorem of PAC Learning
We have seen that finite classes are PAC learnable with sample complexity proportional to the model capacity \( \ln|\mathcal{H}| \) by using ERM. In this section, we address the following questions: Is this the smallest possible sample complexity? How can we define the capacity of infinite hypothesis classes? We have discussed at the end of Section 5.3 that the answer to the latter question cannot be found by extrapolating from results obtained when considering finite hypothesis classes. In contrast, we will see here that the answers to both of these questions rely on the concept of VC dimension, which serves as a more fundamental definition of capacity of a model. The VC dimension is defined next.
## Definition 5.4
A hypothesis class \(\mathcal{H}\) is said to shatter a set of domain points \(X = \{x_1,\ldots,x_V\}\) if, no matter how the corresponding labels \(\{y_v \in \{0,1\}\}\) are selected, there exists a hypothesis \(h \in \mathcal{H}\) that ensures \(h(x_v) = y_v\) for all \(v = 1,\ldots,V\).
## Definition 5.5
The VC dimension \(\text{VCdim}(\mathcal{H})\) of the model \(\mathcal{H}\) is the size of the largest set \(X\) that is shattered by \(\mathcal{H}\).
Based on the definitions above, to prove that a model has \(\text{VCdim}(\mathcal{H}) = V\), we need to carry out the following two steps:
1. **Step 1:** Demonstrate the existence of a set \(X\) with \(|X| = V\) that is shattered by \(\mathcal{H}\);
2. **Step 2:** Prove that no set \(X'\) of size \(V + 1\) exists that is shattered by \(\mathcal{H}\).
The second step is typically seen to be more difficult, as illustrated by the following examples.
## Example 5.3
The threshold function model (5.8) has \(\text{VCdim}(\mathcal{H}) = 1\), since there is clearly a set \(X\) of one sample (\(V = 1\)) that can be shattered.
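Shattering can be checked by brute force over all labelings; a sketch for a threshold-function model (here \( h_a(x) = 1 \) if \( x \geq a \), else 0, a hypothetical stand-in for (5.8)):

```python
from itertools import product

# Brute-force shattering check: a point set is shattered if every one of the
# 2^V labelings is realized by some hypothesis (here, some threshold a).
def shatters(points, thresholds):
    for labels in product([0, 1], repeat=len(points)):
        realized = any(
            all((1 if x >= a else 0) == y for x, y in zip(points, labels))
            for a in thresholds
        )
        if not realized:
            return False
    return True

thresholds = [-10, 0.5, 1.5, 10]
print(shatters([1.0], thresholds))       # True: a single point is shattered
print(shatters([1.0, 2.0], thresholds))  # False: labeling (1, 0) unreachable
```

Two points cannot be shattered because a threshold labeling the smaller point 1 must also label the larger point 1, matching Step 2 of the recipe above for \( V = 1 \).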
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 133
Context: # 5.5. Summary
To elaborate, consider a probabilistic model \( \mathcal{H} \) defined as the set of all pmfs \( p(x|\theta) \) parameterized by \( \theta \) in a given set. With some abuse of notation, we take \( \mathcal{H} \) to be also the domain of parameter \( \theta \). To fix the ideas, assume that \( \theta \) takes values over a finite alphabet. We know from Section 2.5 that a distribution \( q(x) \) is associated with a lossless compression scheme that requires around \( -\log q(x) \) bits to describe a value. Furthermore, if we were informed about the true parameter \( \theta \), the minimum average coding length would be the entropy \( H(p(\cdot|\theta)) \), which requires setting \( q(x) = p(x|\theta) \) (see Appendix A).
Assume now that we only know that the parameter \( \theta \) lies in set \( \mathcal{H} \), and hence the true distribution \( p(x|\theta) \) is not known. In this case, we cannot select the true parameter distribution, and we need instead to choose a generally different distribution \( q(x) \) to define a compression scheme. With a given distribution \( q(x) \), the average coding length is given by
\[
\bar{L}(q(x), \theta) = -\sum_{x} p(x|\theta) \log q(x).
\]
Therefore, the choice of a generally different distribution \( q(x) \) entails a redundancy of
\[
\Delta R(q(x), \theta) = -\sum_{x} p(x|\theta) \log \frac{q(x)}{p(x|\theta)} \geq 0 \tag{5.22}
\]
bits.
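As a small numerical sketch of (5.22), assuming a hypothetical two-symbol alphabet: the excess of the average coding length over the entropy equals the redundancy, which is a KL divergence and hence non-negative.

```python
import math

def avg_code_length(q, p):
    """Average coding length (bits) when compressing with q while data follow p."""
    return -sum(p[x] * math.log2(q[x]) for x in p)

def redundancy(q, p):
    """Redundancy Delta R = KL(p || q) in bits; zero iff q matches p."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

# Hypothetical example: true distribution p, mismatched coding distribution q.
p = {'a': 0.8, 'b': 0.2}
q = {'a': 0.5, 'b': 0.5}
H = avg_code_length(p, p)            # entropy H(p): the optimal average length
print(avg_code_length(q, p) - H)     # equals redundancy(q, p), and is >= 0
```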
The redundancy \( \Delta R(q(x), \theta) \) in (5.22) depends on the true value of \( \theta \). Since the latter is not known, this quantity cannot be computed. We can instead obtain a computable metric by maximizing over all values of \( \theta \in \mathcal{H} \), which yields the worst-case redundancy
\[
\Delta R(q(x), \mathcal{H}) = \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta). \tag{5.23}
\]
This quantity can be minimized over \( q(x) \), yielding the so-called min-max redundancy:
\[
\Delta R(\mathcal{H}) = \min_{q} \Delta R(q(x), \mathcal{H}) \tag{5.24}
\]
\[
= \min_{q} \max_{\theta \in \mathcal{H}} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)}. \tag{5.25}
\]
The minimum redundancy can be taken as a measure of the capacity of model \( \mathcal{H} \), since a richer model tends to yield a larger \( \Delta R(\mathcal{H}) \). In fact,
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 156
Context: # Unsupervised Learning

## Example 6.3
Consider data generated from the multi-modal distribution shown in Fig. 6.5 as a dashed line, which was obtained as a mixture of two Gaussians and an exponential distribution. When \( M = 1 \), the mixture of Gaussian distribution learned via EM corresponds to the conventional ML estimate, and is hence obtained via moment matching (see Chapter 3). This is plotted in the figure for a given data realization with \( N = 100 \). Running EM with the larger values \( M = 2 \) and \( M = 3 \) also returns the same distribution. This distribution is seen to be inclusive of the entire support of the true distribution and to smooth out the edges of the original distribution. As we will further discuss in Sec. 6.4.3, this is a consequence of the fact that ML minimizes the KL divergence over the second argument.
A similar model that applies to binary data, rather than continuous data, such as black-and-white images, is the mixture of Bernoulli models. The EM algorithm can be derived by following the same steps detailed here [23].
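The EM updates for a mixture of Gaussians can be sketched in one dimension as follows. This is a minimal illustration of the standard E- and M-steps (not code from the book); with \( M = 1 \) it reduces to the ML estimate obtained via moment matching, as noted in the example above.

```python
import math, random

def em_gmm_1d(xs, M, iters=200, seed=0):
    """Minimal EM for a 1-D mixture of M Gaussians (a sketch, not production code)."""
    rng = random.Random(seed)
    pi = [1.0 / M] * M
    mu = [rng.choice(xs) for _ in range(M)]
    var = [1.0] * M
    for _ in range(iters):
        # E-step: responsibilities r[n][m] proportional to pi_m * N(x_n | mu_m, var_m)
        r = []
        for x in xs:
            w = [pi[m] * math.exp(-(x - mu[m]) ** 2 / (2 * var[m]))
                 / math.sqrt(2 * math.pi * var[m]) for m in range(M)]
            s = sum(w)
            r.append([wi / s for wi in w])
        # M-step: moment matching per component
        for m in range(M):
            Nm = sum(r[n][m] for n in range(len(xs)))
            pi[m] = Nm / len(xs)
            mu[m] = sum(r[n][m] * xs[n] for n in range(len(xs))) / Nm
            var[m] = sum(r[n][m] * (xs[n] - mu[m]) ** 2 for n in range(len(xs))) / Nm + 1e-6

    return pi, mu, var

# With M = 1, EM returns the sample mean and variance (the conventional ML estimate).
rng = random.Random(1)
xs = [rng.gauss(0.4, 1.0) for _ in range(100)]
pi, mu, var = em_gmm_1d(xs, M=1)
```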
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 163
Context: # 6.5. Undirected Generative Models

From the model, it is easy to compute the distribution of each hidden or observed variable when conditioning on all the observed or hidden variables, respectively. In particular, we have that the variables in \( z \) are mutually independent when conditioned on \( x \), and, similarly, the variables in \( x \) are independent given \( z \). Furthermore, the conditional distributions are given as:
\[
p(z_j = 1 | x, \theta) = \sigma(w_j^T x + a_j) \tag{6.45a}
\]
\[
p(x_i = 1 | z, \theta) = \sigma(\tilde{w}_i^T z + b_i) \tag{6.45b}
\]
where \( w_j \) is the \( j \)-th column of matrix \( W \), and \( \tilde{w}_i \) is the \( i \)-th row of \( W \) arranged into a column vector via transposition. These relationships reveal the significance of the binary hidden variables \( z \) in terms of features.
In fact, as for PPCA, we can consider each column of matrix \( W \) as a feature vector that contributes to explaining the observations. To see this, note that (6.45a) suggests that the \( j \)-th hidden variable \( z_j \) tends to equal 1 when the feature \( w_j \) is well correlated with the data point \( x \). Probability (6.45b) confirms this interpretation, since each active variable \( z_j \) adds the corresponding feature \( w_j \) to the argument of the sigmoid that generates \( x \) (see Sec. 3.2). Furthermore, the distributed representation of the data in terms of the binary hidden vector \( z \) is more efficient than that provided by models with categorical variables, such as the mixture of Gaussians model. This is in the following sense: while categorical models require learning a number of parameters that increases linearly with the number of classes, the distributed representation with a binary vector \( z \) can distinguish \( 2^D \) combinations of features with a number of parameters that increases only linearly with \( D \) [20].
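A tiny numerical sketch of the conditionals (6.45a)–(6.45b), with a hypothetical matrix \( W \) and bias vectors \( a, b \) chosen only for illustration:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_z_given_x(x, W, a):
    """P(z_j = 1 | x) = sigmoid(w_j^T x + a_j), with w_j the j-th column of W."""
    D = len(W[0])
    return [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + a[j])
            for j in range(D)]

def p_x_given_z(z, W, b):
    """P(x_i = 1 | z) = sigmoid(w~_i^T z + b_i), with w~_i the i-th row of W."""
    return [sigmoid(sum(W[i][j] * z[j] for j in range(len(z))) + b[i])
            for i in range(len(W))]

# Hypothetical tiny model: 3 visible units, 2 hidden units.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
a = [0.0, -0.5]
b = [0.1, 0.0, -0.1]
print(p_z_given_x([1, 0, 1], W, a))   # hidden units are conditionally independent
print(p_x_given_z([1, 1], W, b))      # and so are the visible units given z
```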
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 192
Context: # 8
## Approximate Inference and Learning
In Chapters 6 and 7, we have seen that learning and inference tasks are often made difficult by the need to compute the posterior distribution \( p(z|x) \) of an unobserved variable \( z \) given an observed variable \( x \). This task requires the computation of the normalizing marginal:
\[
p(x) = \sum_{z} p(x, z), \tag{8.1}
\]
where the sum is replaced by an integral for continuous variables.¹ This computation is intractable when the alphabet of the hidden variable \( z \) is large enough. Chapter 7 has shown that the complexity of computing (8.1) can be alleviated in the special case in which the factorized joint distribution \( p(x, z) \) is defined by specific classes of probabilistic graphical models.
What to do when the complexity of computing (8.1) is excessive? In this chapter, we provide a brief introduction to two popular approximate inference approaches, namely MC methods and Variational Inference (VI). We also discuss their application to learning. As for the
¹ Note that this task assumes (7.23) after appropriate redefinitions of the variables.
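The marginalization (8.1) can be carried out exactly by enumeration when the hidden alphabet is small; the cost grows with the size of that alphabet (exponentially in the number of hidden variables), which is what motivates approximate inference. A minimal sketch with a hypothetical joint distribution:

```python
# Exact marginalization by enumeration: feasible only for small hidden alphabets.
def marginal(joint, x):
    """p(x) = sum_z p(x, z), for a joint given as a dict {(x, z): prob}."""
    return sum(p for (xv, zv), p in joint.items() if xv == x)

# Hypothetical joint over x in {0, 1} and z in {0, 1, 2}.
joint = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
         (1, 0): 0.25, (1, 1): 0.05, (1, 2): 0.3}
px = marginal(joint, 1)                                  # normalizing constant
posterior = {z: joint[(1, z)] / px for z in range(3)}    # p(z | x = 1)
print(px, posterior)
```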
####################
File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf
Page: 198
Context: # Approximate Inference and Learning
is that the M-projection is obtained by moment matching. Specifically, one needs to find a value of the parameter ϕ such that the expectations of the sufficient statistics of the model \( q(z|ϕ) \) under \( q(z|ϕ) \) match the corresponding expectations under the true distribution \( p(z|x) \).
In mathematical terms, the M-projection in the exponential family model \( q(z|ϕ) \propto \exp(ϕ^T u(z)) \), with sufficient statistics \( u(z) \), yields natural parameters \( ϕ^* \) that satisfy the moment matching condition:
\[
E_{p(z|x)}[u(z)] = E_{q(z|ϕ)}[u(z)].
\]
This derivation is detailed in the Appendix of this chapter. Amortized inference can be defined in a similar way as for I-projection.

## Example 8.2
This simple example is meant to provide an intuitive comparison between the approximations produced by I- and M-projections. To this end, consider a mixture of Gaussians distribution:
\[
p(x) = 0.3\mathcal{N}(x|μ_1 = -1, σ_1^2 = 0.3) + 0.7\mathcal{N}(x|μ_2 = 1, σ_2^2 = 0.3).
\]
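For a Gaussian approximating family, the M-projection reduces to matching the first two moments (the expectations of the sufficient statistics \( x \) and \( x^2 \)). For the mixture above this can be computed in closed form; the arithmetic below is a sketch of that calculation:

```python
# Moment matching (M-projection onto a single Gaussian) for the mixture
# 0.3 * N(-1, 0.3) + 0.7 * N(1, 0.3).
weights, means, variances = [0.3, 0.7], [-1.0, 1.0], [0.3, 0.3]

mean = sum(w * m for w, m in zip(weights, means))                        # E[x]
second = sum(w * (v + m ** 2) for w, m, v in zip(weights, means, variances))
var = second - mean ** 2                                                 # Var[x]
print(mean, var)   # approximately 0.4 and 1.14: the matched Gaussian N(0.4, 1.14)
```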
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 9
Context: # Learning and Intuition
We have all experienced the situation that the solution to a problem presents itself while riding your bike, walking home, “relaxing” in the washroom, waking up in the morning, taking your shower, etc. Importantly, it did not appear while banging your head against the problem in a conscious effort to solve it, staring at the equations on a piece of paper. In fact, I would claim that all my bits and pieces of progress have occurred while taking a break and “relaxing out of the problem.” Greek philosophers walked in circles when thinking about a problem; most of us stare at a computer screen all day. The purpose of this chapter is to make you more aware of where your creative mind is located and to interact with it in a fruitful manner.
My general thesis is that, contrary to popular belief, creative thinking is not performed by conscious thinking. It is rather an interplay between your conscious and unconscious mind: the conscious mind prepares the seeds to be planted into the unconscious part of your mind. The unconscious mind will munch on the problem "out of sight" and return promising roads to solutions to the consciousness. This process iterates until the conscious mind decides the problem is sufficiently solved, intractable, or plain dull and moves on to the next. It may be a little unsettling to learn that at least part of your thinking goes on in a part of your mind that seems inaccessible and has a very limited interface with what you think of as yourself. But it is undeniable that it is there, and it is also undeniable that it plays a role in the creative thought process.
To become a creative thinker, one should learn to play this game more effectively. To do so, we should think about the language in which to represent knowledge that is most effective in terms of communication with the unconscious. In other words, what type of “interface” between conscious and unconscious mind should we use? It is probably not a good idea to memorize all the details of a complicated equation or problem. Instead, we should extract the abstract idea and capture the essence of it in a picture. This could be a movie with colors and other elements.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 11
Context: Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored, I recommend moving on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see [books](#)). But I hope that for most beginning students, this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun!
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 14
Context: # CHAPTER 1. DATA AND INFORMATION
## Interpretation
Here we seek to answer questions about the data. For instance, what property of this drug was responsible for its high success rate? Does a security officer at the airport apply racial profiling in deciding whose luggage to check? How many natural groups are there in the data?
## Compression
Here we are interested in compressing the original data, a.k.a. the number of bits needed to represent it. For instance, files in your computer can be "zipped" to a much smaller size by removing much of the redundancy in those files. Also, JPEG and GIF (among others) are compressed representations of the original pixel map.
All of the above objectives depend on the fact that there is structure in the data. If data is completely random, there is nothing to predict, nothing to interpret, and nothing to compress. Hence, all tasks are somehow related to discovering or leveraging this structure. One could say that data is highly redundant and that this redundancy is exactly what makes it interesting. Take the example of natural images. If you are required to predict the color of the pixels neighboring some random pixel in an image, you would be able to do a pretty good job (for instance, 20% may be blue sky, and predicting the neighbors of a blue-sky pixel is easy). Also, if we were to generate images at random, they would not look like natural scenes at all. For one, they would not contain objects. Only a tiny fraction of all possible images looks "natural," and so the space of natural images is highly structured.
Thus, all of these concepts are intimately related: structure, redundancy, predictability, regularity, interpretability, compressibility. They refer to the "food" for machine learning. Without structure, there is nothing to learn. The same thing is true for human learning. From the day we are born, we start noticing that there is structure in this world. Our survival depends on discovering and recording this structure. If I walk into this brown cylinder with a green canopy, I suddenly stop; it won’t give way. In fact, it damages my body. Perhaps this holds for all these objects. When I cry, my mother suddenly appears. Our game is to predict the future accurately, and we predict it by learning its structure.
## 1.1 Data Representation
What does "data" look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. But it will be useful to convert data into a
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 17
Context: # 1.2. PREPROCESSING THE DATA
Attribute separately) and then added and divided by \( N \). You have perhaps noticed that variance does not have the same units as \( X \) itself. If \( X \) is measured in grams, then variance is measured in grams squared. To scale the data to have the same scale in every dimension, we divide by the square-root of the variance, which is usually called the sample standard deviation:
\[
X'_{nm} = \frac{X_{nm}}{\sqrt{V[X_{m}]}} \quad \forall n
\]
Note again that sphering requires centering, implying that we always have to perform these operations in this order, first center, then sphere. Figure ??a,b,c illustrates this process.
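The two steps in order, center then sphere, can be sketched as follows (hypothetical data, plain Python for clarity):

```python
def standardize(X):
    """Center each attribute, then divide by its sample standard deviation."""
    N, K = len(X), len(X[0])
    means = [sum(row[k] for row in X) / N for k in range(K)]
    centered = [[row[k] - means[k] for k in range(K)] for row in X]
    stds = [(sum(row[k] ** 2 for row in centered) / N) ** 0.5 for k in range(K)]
    return [[row[k] / stds[k] for k in range(K)] for row in centered]

# Hypothetical data: two attributes on very different scales (e.g. grams vs. meters).
X = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
Z = standardize(X)   # each attribute now has mean 0 and standard deviation 1
print(Z)
```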
You may now be asking, “well what if the data where elongated in a diagonal direction?”. Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and will postpone this issue until chapter ?? on “principal components analysis”. However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process in that it identifies structure (and then removes it). By remembering the sequence of transformations you performed, you have implicitly built a model. Reversely, many algorithms can be easily adapted to model the mean and scale of the data. Now, the preprocessing is no longer necessary and becomes integrated into the model.
Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists in removing this redundancy and transforming the original data into a less redundant (and hence more succinct) code. Models and structure reducing data transformations are in essence each others reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Reversely, pre-processing starts with the data and understands how we can get back to the unstructured random state of the data.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 18
Context: # CHAPTER 1. DATA AND INFORMATION
The origin. If the data happens to be strictly positive, it does not fit this assumption very well. Taking a logarithm can help in that case:
$$
X'_{nm} = \log(\alpha + X_{nm}) \tag{1.5}
$$
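A one-line sketch of transform (1.5), on hypothetical skewed positive data (e.g. word counts):

```python
import math

def log_transform(X, alpha=1.0):
    """Apply x -> log(alpha + x) entrywise; compresses large positive values."""
    return [[math.log(alpha + x) for x in row] for row in X]

X = [[0.0, 3.0, 80.0], [1.0, 9.0, 8000.0]]
print(log_transform(X))   # large values are compressed far more than small ones
```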
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 27
Context: # 3.1 In a Nutshell
Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game.
Learning = generalization = abstraction = compression.
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 31
Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generally also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it.
This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it?
Let's go back to the news-articles. Assume we have control over which article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or take the mouse in the maze. Given that it needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning".
One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example, being a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don't have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it has hopefully become clearer that we are trying to learn models for many different yet related problems, and that we can build better models if we share some of the things learned for one task with the other ones. The trick is to share neither too much nor too little, and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning "multi-task learning".
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 46
Context: # CHAPTER 7. THE PERCEPTRON
I like to warn the reader at this point that more features is not necessarily a good thing if the new features are uninformative for the classification task at hand. The problem is that they introduce noise in the input that can mask the actual signal (i.e. the good, discriminative features). In fact, there is a whole subfield of ML that deals with selecting relevant features from a set that is too large. The problem of too many dimensions is sometimes called "the curse of high dimensionality." Another way of seeing this is that more dimensions often lead to more parameters in the model (as in the case for the perceptron) and can hence lead to overfitting. To combat that in turn we can add regularizers as we will see in the following.
With the introduction of regularizers, we can sometimes play magic and use an infinite number of features. How we play this magic will be explained when we discuss kernel methods in the next sections. But let us first start simple with the perceptron.
## 7.1 The Perceptron Model
Our assumption is that a line can separate the two classes of interest. To make our life a little easier, we will switch to the \( Y = \{+1, -1\} \) representation. With this, we can express the condition mathematically as:
\[
y_n \approx \text{sign}\left(\sum_k w_k x_{nk} - \alpha\right) \tag{7.1}
\]
where "sign" is the sign-function (+1 for nonnegative reals and -1 for negative reals). We have introduced \( K + 1 \) parameters \( \{w_1, \ldots, w_K, \alpha\} \) which define the line for us. The vector \( w \) represents the direction orthogonal to the decision boundary depicted in Figure ??. For example, a line through the origin is represented by \( w^T x = 0 \), i.e., all vectors \( x \) with a vanishing inner product with \( w \). The scalar quantity \( \alpha \) represents the offset of the line \( w^T x = \alpha \) from the origin, i.e., the shortest distance from the origin to the line. This can be seen by writing the points on the line as \( x = y + v \), where \( y \) is a fixed vector pointing to an arbitrary point on the line and \( v \) is a vector along the line (see Figure ??). Then \( w^T (y + v) - \alpha = 0 \), and since by definition \( w^T v = 0 \), we find \( w^T y = \alpha \), which means that \( \alpha \) is the projection of \( y \) onto \( w \), i.e., the shortest distance from the origin to the line.
1. Note that we can replace \( x_{nk} \rightarrow \phi_k(x_n) \), but for the sake of simplicity we will refrain from doing so at this point.
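The classification rule (7.1) can be sketched in a few lines, with a hypothetical weight vector and offset chosen for illustration:

```python
def perceptron_predict(x, w, alpha):
    """y = sign(w^T x - alpha), with sign(0) taken as +1."""
    s = sum(wk * xk for wk, xk in zip(w, x)) - alpha
    return 1 if s >= 0 else -1

# Hypothetical 2-D decision boundary: w = (1, 1), alpha = 1, i.e. the line x1 + x2 = 1.
w, alpha = [1.0, 1.0], 1.0
print(perceptron_predict([2.0, 2.0], w, alpha))   # +1, above the line
print(perceptron_predict([0.0, 0.0], w, alpha))   # -1, below the line
```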
####################
File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf
Page: 90
Context: # APPENDIX B. KERNEL DESIGN
## B.1 Term
\[
\left(x^T y\right)^s = (x_1y_1 + x_2y_2 + \ldots + x_ny_n)^s = \sum_{i_1 + i_2 + \ldots + i_n = s} \frac{s!}{i_1! \, i_2! \ldots i_n!} (x_1y_1)^{i_1}(x_2y_2)^{i_2} \ldots (x_ny_n)^{i_n} \quad (B.4)
\]
Taken together with eqn. B.3 we see that the features correspond to:
\[
\phi_t(x) = \sqrt{\frac{d!}{(d - s)! \, i_1! i_2! \ldots i_n!} R^{d-s}} \; x_1^{i_1} x_2^{i_2} \ldots x_n^{i_n} \quad \text{with } i_1 + i_2 + \ldots + i_n = s \le d \quad (B.5)
\]
The point is really that in order to efficiently compute the total sum over \(\binom{n+d-1}{d}\) terms we have inserted very special coefficients. The only true freedom we have left is in choosing \(R\): for larger \(R\) we down-weight higher-order polynomials more.
The question we want to answer is: how much freedom do we have in choosing different coefficients while still being able to compute the inner product efficiently?
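The efficiency point can be made concrete: evaluating the polynomial kernel directly costs on the order of \(n\) operations, while the number of monomial features it implicitly sums over grows combinatorially. A small sketch (the vectors and parameters below are hypothetical):

```python
from math import comb

def poly_kernel(x, y, R=1.0, d=3):
    """Direct evaluation of (x^T y + R)^d: O(n) work instead of summing all features."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + R) ** d

# Number of monomial features of total degree <= d in n variables:
n, d = 10, 3
print(comb(n + d, d))                              # 286 implicit features
print(poly_kernel([1.0] * 10, [0.1] * 10, R=1.0, d=3))   # yet ~n operations
```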
## B.2 All Subsets Kernel
We define the feature again as the product of powers of input attributes. However, in this case, the choice of power is restricted to \(\{0, 1\}\), i.e. the feature is present or absent. For \(n\) input dimensions (number of attributes) we have \(2^n\) possible combinations.
Let’s compute the kernel function:
\[
K(x,y) = \sum_{I} \phi_I(x) \phi_I(y) = \sum_{I} \prod_{j \in I} x_j y_j = \prod_{i=1}^{n}(1 + x_i y_i) \quad (B.6)
\]
where the last identity follows from the fact that,
\[
\prod_{i=1}^{n} (1 + z_i) = 1 + \sum_{i} z_i + \sum_{i < j} z_iz_j + \ldots + z_1 z_2 \ldots z_n \quad (B.7)
\]
i.e. a sum over all possible combinations. Note that in this case again, it is much more efficient to compute the kernel directly than to sum over the features. Also note that in this case there is no decaying factor multiplying the monomials.
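The identity (B.6) can be verified numerically: summing over all \(2^n\) subset features agrees with the direct \(O(n)\) product form. A small sketch with hypothetical vectors:

```python
from itertools import combinations

def all_subsets_kernel(x, y):
    """Direct evaluation: K(x, y) = prod_i (1 + x_i * y_i), O(n) work."""
    k = 1.0
    for xi, yi in zip(x, y):
        k *= 1.0 + xi * yi
    return k

def all_subsets_kernel_explicit(x, y):
    """Same kernel via the 2^n subset features phi_I(x) = prod_{j in I} x_j."""
    n = len(x)
    total = 0.0
    for r in range(n + 1):
        for I in combinations(range(n), r):
            fx, fy = 1.0, 1.0
            for j in I:
                fx *= x[j]
                fy *= y[j]
            total += fx * fy
    return total

x, y = [1.0, 2.0, -1.0], [0.5, 1.0, 3.0]
print(all_subsets_kernel(x, y), all_subsets_kernel_explicit(x, y))  # both agree
```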
##########
"""QUERY: Was sind Mengen und wie kann man sie kombinieren?"""
Consider the chat history for relevant information. Use all information included. Use as many tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown. Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead!
Final Files Sources: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 3, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 8, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 9, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 11, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 13, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 15, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 17, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 19, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 21, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 22, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 23, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 24, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 25, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 26, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 27, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 28, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 29, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 30, 
A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 31, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 33, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 34, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 36, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 37, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 39, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 40, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 43, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 47, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 52, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 55, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 60, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 65, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 67, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 68, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 71, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 75, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 79, 
A Cool Brisk Walk Through Discrete Mathematics - Stephen Davies (PDF).pdf - Pages 81, 89, 91, 93, 102, 136, 143, 149, 150, 151, 152, 156, 157, 160, 161, 162, 163, 164, 165, 168, 170, 173, 175, 177, 181, 190, 197, 202, 207, 212, 216, 217, 222, 224, 235, 240, 244, 245, 253
A Brief Introduction to Machine Learning for Engineers - Osvaldo Simeone (PDF).pdf - Pages 8, 11, 33, 50, 56, 58, 64, 76, 100, 112, 130, 133, 156, 163, 192, 198
A First Encounter with Machine Learning - Max Welling (PDF).pdf - Pages 9, 11, 14, 17, 18, 27, 31, 46, 90
==================================================
**Elapsed Time: 0.00 seconds**
==================================================
FINAL ANSWER
Answer: Sets are collections of objects called elements. They can contain anything, from numbers and letters to people and places. Sets are usually written with curly braces, e.g. {1, 2, 3} or {red, green, blue}.
There are several ways to combine sets:
* **Union:** The union of two sets is a new set containing every element that appears in either of the original sets. It is written with the symbol ∪. For example, the union of the sets {1, 2, 3} and {3, 4, 5} is the set {1, 2, 3, 4, 5}.
* **Intersection:** The intersection of two sets is a new set containing only the elements that appear in both original sets. It is written with the symbol ∩. For example, the intersection of the sets {1, 2, 3} and {3, 4, 5} is the set {3}.
* **Difference:** The difference of two sets is a new set containing only the elements that appear in the first set but not in the second. It is written with the symbol \. For example, the difference of the sets {1, 2, 3} and {3, 4, 5} is the set {1, 2}.
* **Cartesian product:** The Cartesian product of two sets is a new set containing all possible ordered pairs whose first component comes from the first set and whose second component comes from the second. It is written with the symbol ×. For example, the Cartesian product of the sets {1, 2} and {a, b} is the set {(1, a), (1, b), (2, a), (2, b)}.
These operations are useful for building new sets from existing ones and for studying relationships between sets. They play an important role in many areas of mathematics, computer science, and other disciplines.
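The four operations above map directly onto Python's built-in `set` type; a minimal sketch (the variable names `a`, `b`, `c`, `d` are illustrative, not from the source):

```python
from itertools import product

a = {1, 2, 3}
b = {3, 4, 5}

union = a | b           # all elements from either set: {1, 2, 3, 4, 5}
intersection = a & b    # elements in both sets: {3}
difference = a - b      # elements in a but not in b: {1, 2}

# Sets have no built-in × operator, so the Cartesian product
# is built from itertools.product, which yields all ordered pairs.
c = {1, 2}
d = {"a", "b"}
cartesian = set(product(c, d))  # {(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')}
```

Note that the operators mirror the mathematical symbols: `|` for ∪, `&` for ∩, and `-` for \.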
==================================================
**Elapsed Time: 6.90 seconds**
==================================================