INITIALIZATION Knowledgebase: ki-dev-large Base Query: Summarize in 5 sentences the contents of your context. Speak in Old English manner similar to how Shakespeare writes his sonnets. Model: gpt-4o Use Curl?: ================================================== **Elapsed Time: 0.00 seconds** ================================================== ROUTING Query type: summary ================================================== **Elapsed Time: 1.69 seconds** ================================================== PRIMER Primer: You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that can read knowledgebases through the "CONTEXT" that is included in the user's chat message. Your role is to act as an expert at summarization and analysis. In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality. Be honest by admitting when a topic falls outside your scope of knowledge, and suggest alternative avenues for obtaining information when necessary. Make effective use of chat history to avoid redundancy and enhance response relevance, continuously adapting to integrate all necessary details in your interactions. Use as many tokens as needed to provide a detailed response. ================================================== **Elapsed Time: 0.39 seconds** ================================================== FINAL QUERY Final Query: CONTEXT: ########## File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 1 Context: # A First Encounter with Machine Learning **Max Welling** *Donald Bren School of Information and Computer Science* *University of California Irvine* November 4, 2011 #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 3 Context: # Contents Preface iii Learning and Intuition vii 1. Data and Information 1.1 Data Representation ...................................................... 2 1.2 Preprocessing the Data .................................................. 4 2. Data Visualization .................................................................. 7 3. Learning 3.1 In a Nutshell ............................................................... 15 4. Types of Machine Learning 4.1 In a Nutshell ............................................................... 20 5. Nearest Neighbors Classification 5.1 The Idea In a Nutshell .................................................. 23 6. The Naive Bayesian Classifier 6.1 The Naive Bayes Model .................................................. 25 6.2 Learning a Naive Bayes Classifier ............................... 27 6.3 Class-Prediction for New Instances ............................. 28 6.4 Regularization ............................................................... 30 6.5 Remarks .................................................................... 31 6.6 The Idea In a Nutshell .................................................. 32 7. The Perceptron 7.1 The Perceptron Model .................................................. 34 #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 4 Context: ``` # CONTENTS 7.2 A Different Cost Function: Logistic Regression ..........
37 7.3 The Idea In a Nutshell .................................................. 38 # 8 Support Vector Machines ................................................ 39 8.1 The Non-Separable Case ................................................. 43 # 9 Support Vector Regression .............................................. 47 # 10 Kernel Ridge Regression ............................................... 51 10.1 Kernel Ridge Regression ............................................... 52 10.2 An alternative derivation ............................................. 53 # 11 Kernel K-means and Spectral Clustering ............................... 55 # 12 Kernel Principal Components Analysis ................................ 59 12.1 Centering Data in Feature Space ....................................... 61 # 13 Fisher Linear Discriminant Analysis .................................. 63 13.1 Kernel Fisher LDA .................................................... 66 13.2 A Constrained Convex Programming Formulation of FDA ................. 68 # 14 Kernel Canonical Correlation Analysis ................................ 69 14.1 Kernel CCA ............................................................ 71 # A Essentials of Convex Optimization ..................................... 73 A.1 Lagrangians and all that ............................................... 73 # B Kernel Design .......................................................... 77 B.1 Polynomial Kernels .................................................... 77 B.2 All Subsets Kernel ...................................................... 78 B.3 The Gaussian Kernel ..................................................... 79 ``` #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 5 Context: # Preface In winter quarter 2007, I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at a graduate level, it soon became clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is built upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities, and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them to be too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction into the concepts of machine learning. A first read to whet the appetite, so to speak; a prelude to the more technical and advanced textbooks. Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer. Machine learning is a relatively young discipline that emerged from the general field of artificial intelligence. To build intelligent machines, researchers realized that these machines should learn from and adapt to their environment. It is simply too costly and impractical to design intelligent systems by first gathering all the expert knowledge ourselves and then hard-wiring it into a machine. For instance, after many years of intense research, we can now recognize faces in images to a high degree of accuracy.
But the world has approximately 30,000 visual object categories according to some estimates (Biederman). Should we invest the same effort to build good classifiers for monkeys, chairs, pencils, axes, etc., or should we build systems that can observe millions of training images, some with labels (e.g., these pixels in the image correspond to a car) but most of them without side information? Although there currently is no system which can recognize even on the order of 1,000 object categories (the best systems get #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 6 Context: # PREFACE about 60% correct on 100 categories), the fact that we pull it off seemingly effortlessly serves as a “proof of concept” that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data. The first reason for the recent successes of machine learning and the growth of the field as a whole is rooted in its multidisciplinary character. Machine learning emerged from AI but quickly incorporated ideas from fields as diverse as statistics, probability, computer science, information theory, convex optimization, control theory, cognitive science, theoretical neuroscience, physics, and more. To give an example, the main conference in this field is called: **advances in neural information processing systems**, referring to information theory and theoretical neuroscience and cognitive science. The second, perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. While the field is built on theory and tools developed in statistics, machine learning recognizes that the most exciting progress can be made by leveraging the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text, and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data as all previous satellites produced together. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on much larger datasets (the latter sometimes called **data-mining**). Hence the emphasis shifted to algorithmic efficiency and as a result many machine learning faculty (like myself) can typically be found in computer science departments. To give some examples of recent successes of this approach one only has to turn on one's computer and perform an internet search. Modern search engines do not run terribly sophisticated algorithms, but they manage to store and sift through almost the entire content of the internet to return sensible search results. There has also been much success in the field of machine translation, not because a new model was invented but because many more translated documents became available. The field of machine learning is multifaceted and expanding fast. To sample a few sub-disciplines: statistical learning, kernel methods, graphical models, artificial neural networks, fuzzy logic, Bayesian methods, and so on. The field also covers many types of learning problems, such as supervised learning, unsupervised learning, semi-supervised learning, active learning, reinforcement learning, etc.
I will only cover the most basic approaches in this book from a highly personal perspective. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 7 Context: # MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). I would like to thank Hans for discussion on intuition. I also thank Prof. Bert Kappen, who leads an excellent group of postdocs and students, for his hospitality. Marga, kids, UCI,... --- There are a few main aspects I want to cover from a personal perspective. Instead of trying to address every aspect of the entire field, I have chosen to present a few popular and perhaps useful tools and approaches. What will (hopefully) be significantly different from most other scientific books is the manner in which I will present these methods. I have always been frustrated by the lack of proper explanation of equations. Many times, I have been staring at a formula without the slightest clue where it came from or how it was derived. Many books also excel in stating facts in an almost encyclopedic style, without providing the proper intuition of the method. This is my primary mission: to write a book that conveys intuition. The first chapter will be devoted to why I think this is important. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 9 Context: # Learning and Intuition We have all experienced the situation that the solution to a problem presents itself while riding your bike, walking home, “relaxing” in the washroom, waking up in the morning, taking your shower, etc. Importantly, it did not appear while banging your head against the problem in a conscious effort to solve it, staring at the equations on a piece of paper. In fact, I would claim that all my bits and pieces of progress have occurred while taking a break and “relaxing out of the problem.” Greek philosophers walked in circles when thinking about a problem; most of us stare at a computer screen all day. The purpose of this chapter is to make you more aware of where your creative mind is located and to interact with it in a fruitful manner. My general thesis is that, contrary to popular belief, creative thinking is not performed by conscious thinking. It is rather an interplay between your conscious mind and the unconscious part of your mind: the conscious mind prepares the seeds to be planted into the unconscious. The unconscious mind will munch on the problem “out of sight” and return promising roads to solutions to the consciousness. This process iterates until the conscious mind decides the problem is sufficiently solved, intractable, or plain dull and moves on to the next. It may be a little unsettling to learn that at least part of your thinking goes on in a part of your mind that seems inaccessible and has a very limited interface with what you think of as yourself. But it is undeniable that it is there, and it is also undeniable that it plays a role in the creative thought process. To become a creative thinker, one should learn to play this game more effectively. To do so, we should think about the language in which to represent knowledge that is most effective in terms of communication with the unconscious.
In other words, what type of “interface” between conscious and unconscious mind should we use? It is probably not a good idea to memorize all the details of a complicated equation or problem. Instead, we should extract the abstract idea and capture the essence of it in a picture. This could be a movie with colors and other #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 10 Context: # LEARNING AND INTUITION baroque features, or a more “dull” representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvelous book “XXX” (Hadamard). By building accurate visual representations of abstract ideas, we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way, I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the “new” idea was already in his/her database, which therefore needs no or very little updating. Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you’ll find that you’ll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether and when you walk back to work, suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with. In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process but it is ultimately you who will have to do the hard work.
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 11 Context: Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored, I recommend moving on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see ??). But I hope that for most beginning students, this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun! #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 13 Context: # Chapter 1 ## Data and Information Data is everywhere in abundant amounts. Surveillance cameras continuously capture video; every time you make a phone call your name and location get recorded. Often, your clicking pattern is recorded when surfing the web. Most financial transactions are recorded; satellites and observatories generate terabytes of data every year; the FBI maintains a DNA database of most convicted criminals. Soon, all written text from our libraries will be digitized. Need I go on? But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? When analyzing data, we typically have a specific question in mind, such as: - “How many types of car can be discerned in this video?” - “What will be the weather next week?” The answer can take the form of a single number (there are 5 cars), a sequence of numbers (the temperatures next week), or a complicated pattern (the cloud configuration next week). If the answer to our query is complex, we like to visualize it using graphs, bar plots, or even little movies. One should keep in mind that the particular analysis depends on the task one has in mind. Let me spell out a few tasks that are typically considered in machine learning: ### Prediction Here we ask ourselves whether we can extrapolate the information in the data to new use cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold, etc., and another database of attributes for Ferraris, then we can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather given all the recorded weather patterns in the past: can we predict the weather next week, or the stock prices? #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 14 Context: # CHAPTER 1. DATA AND INFORMATION ## Interpretation Here we seek to answer questions about the data. For instance, what property of this drug was responsible for its high success rate? Does a security officer at the airport apply racial profiling in deciding whose luggage to check? How many natural groups are there in the data? ## Compression Here we are interested in compressing the original data, i.e., reducing the number of bits needed to represent it.
For instance, files in your computer can be "zipped" to a much smaller size by removing much of the redundancy in those files. Also, JPEG and GIF (among others) are compressed representations of the original pixel map. All of the above objectives depend on the fact that there is structure in the data. If data is completely random, there is nothing to predict, nothing to interpret, and nothing to compress. Hence, all tasks are somehow related to discovering or leveraging this structure. One could say that data is highly redundant and that this redundancy is exactly what makes it interesting. Take the example of natural images. If you are required to predict the color of the pixels neighboring some random pixel in an image, you would be able to do a pretty good job (for instance, 20% may be blue sky and predicting the neighbors of a blue sky pixel is easy). Also, if we were to generate images at random, they would not look like natural scenes at all. For one, they wouldn't contain objects. Only a tiny fraction of all possible images looks "natural," and so the space of natural images is highly structured. Thus, all of these concepts are intimately related: structure, redundancy, predictability, regularity, interpretability, compressibility. They refer to the "food" for machine learning. Without structure, there is nothing to learn. The same thing is true for human learning. From the day we are born, we start noticing that there is structure in this world. Our survival depends on discovering and recording this structure. If I walk into this brown cylinder with a green canopy, I suddenly stop; it won't give way. In fact, it damages my body. Perhaps this holds for all these objects. When I cry, my mother suddenly appears. Our game is to predict the future accurately, and we predict it by learning its structure. ## 1.1 Data Representation What does "data" look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. But it will be useful to convert data into a standard format. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 15 Context: # 1.1. DATA REPRESENTATION Most datasets can be represented as a matrix, \( X = [X_{in}] \), with rows indexed by "attribute-index" \( i \) and columns indexed by "data-index" \( n \). The value \( X_{in} \) for attribute \( i \) and data-case \( n \) can be binary, real, discrete, etc., depending on what we measure. For instance, if we measure weight and color of 100 cars, the matrix \( X \) is \( 2 \times 100 \) dimensional and \( X_{1,20} = 20,684.57 \) is the weight of car nr. 20 in some units (a real value) while \( X_{2,20} = 2 \) is the color of car nr. 20 (say one of 6 predefined colors). Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a number and simply count how often a word was present. Say the word "book" is defined to have nr. 10,568 in the vocabulary; then \( X_{10568,5076} = 4 \) would mean: the word "book" appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images about cats. You’ll retrieve a large variety of images, most with a different number of pixels. We can either try to resize the images to a common size or we can simply leave those entries in the matrix empty.
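To make the matrix convention concrete, here is a minimal sketch in Python with NumPy; all names and values are made up for illustration, and `np.nan` stands in for entries that are left empty (and for the "?" placeholder introduced next):

```python
import numpy as np

# A toy attribute-by-data-case matrix X in the spirit of the text:
# rows = attributes (weight, color code), columns = data-cases (cars).
X = np.array([
    [20684.57, 1830.00, 2450.10],  # attribute 1: weight, in some unit
    [2.0,      5.0,     np.nan],   # attribute 2: color (one of 6 codes); NaN = unobserved
])

print(X.shape)            # (2, 3): 2 attributes, 3 data-cases
print(X[0, 0])            # weight of car nr. 0
print(np.isnan(X[1, 2]))  # True: this entry "couldn't be measured"
```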
It may also occur that a certain entry is supposed to be there but it couldn’t be measured. For instance, if we run an optical character recognition system on a scanned document some letters will not be recognized. We’ll use a question mark “?” to indicate that that entry wasn’t observed. It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in one representation the structure may be obvious while in another representation it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as, “Hummers and Ferraris can be separated with a line,” see figure ??. While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to re-code these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about the representation in which the structure is as obvious as possible, and to transform the data if necessary before applying standard algorithms. In the next section we’ll discuss some standard preprocessing operations. It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you if the structure is a good match for the algorithm you had in mind for further analysis. Chapter ?? will discuss some elementary visualization techniques. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 16 Context: # CHAPTER 1. DATA AND INFORMATION ## 1.2 Preprocessing the Data As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in Figure 1. The algorithm consists of estimating the area that the data occupy. It grows a circle starting at the origin, and at the point where it contains all the data we record the area of the circle. The figure shows why this will be a bad estimate: the data-cloud is not centered. If we had first centered it we would have obtained a reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data. To center data we will introduce the sample mean of the data, given by, \[ E[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in} \tag{1.1} \] Hence, for every attribute \(i\) separately, we simply add all the attribute values across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set, \[ X'_{in} = X_{in} - E[X_i], \quad \forall n \tag{1.2} \] It is now easy to check that the sample mean of \(X'\) indeed vanishes. An illustration of the global shift is given in Figure 2. We also see in this figure that the algorithm described above now works much better! In a similar spirit to centering, we may also wish to scale the data along the coordinate axes in order to make it more “spherical.” Consider Figure 3. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension.
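A minimal sketch, assuming NumPy, of the centering step in equations 1.1 and 1.2, together with the scaling ("sphering") step that the text defines next via the sample standard deviation; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-attribute data: off-center and elongated, like the figures described.
X = rng.normal(size=(2, 100)) * np.array([[5.0], [0.5]]) + 3.0

def center(X):
    # Eqs. 1.1-1.2: subtract the per-attribute (per-row) sample mean.
    return X - X.mean(axis=1, keepdims=True)

def sphere(Xc):
    # Divide by the sample standard deviation (defined in the text below);
    # assumes the input has already been centered -- order matters.
    return Xc / np.sqrt((Xc ** 2).mean(axis=1, keepdims=True))

Xs = sphere(center(X))          # first center, then sphere
print(Xs.mean(axis=1))          # ~ [0, 0]
print((Xs ** 2).mean(axis=1))   # ~ [1, 1]
```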
To define this operation we first introduce the notion of sample variance, \[ V[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in}^2 \tag{1.3} \] where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative signs might cancel each other out. By first taking the square, all data-cases first get mapped to the positive half of the axes (for each dimension or #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 17 Context: # 1.2. PREPROCESSING THE DATA attribute separately) and then added and divided by \( N \). You have perhaps noticed that variance does not have the same units as \( X \) itself. If \( X \) is measured in grams, then variance is measured in grams squared. To scale the data to have the same scale in every dimension, we divide by the square-root of the variance, which is usually called the sample standard deviation: \[ X''_{in} = \frac{X'_{in}}{\sqrt{V[X_i]}} \quad \forall n \tag{1.4} \] Note again that sphering requires centering, implying that we always have to perform these operations in this order, first center, then sphere. Figure ??a,b,c illustrates this process. You may now be asking, “well what if the data were elongated in a diagonal direction?”. Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and we will postpone this issue until chapter ?? on “principal components analysis”. However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the pre-processing steps can be viewed as part of the modeling process in that they identify structure (and then remove it). By remembering the sequence of transformations you performed, you have implicitly built a model. Conversely, many algorithms can be easily adapted to model the mean and scale of the data. Then the preprocessing is no longer necessary and becomes integrated into the model. Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists in removing this redundancy and transforming the original data into a less redundant (and hence more succinct) code. Models and structure-reducing data transformations are in essence each other's reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Conversely, pre-processing starts with the data and works out how we can get back to the unstructured random state of the data. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 18 Context: # CHAPTER 1. DATA AND INFORMATION Some transformations assume that the data is distributed symmetrically around the origin. If data happens to be just positive, it doesn’t fit this assumption very well.
Taking the following logarithm can help in that case: $$ X'_{in} = \log(\alpha + X_{in}) \tag{1.5} $$ #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 19 Context: # Chapter 2 ## Data Visualization The process of data analysis does not just consist of picking an algorithm, fitting it to the data and reporting the results. We have seen that we need to choose a representation for the data, necessitating data preprocessing in many cases. Depending on the data representation and the task at hand we then have to choose an algorithm to continue our analysis. But even after we have run the algorithm and studied the results we are interested in, we may realize that our initial choice of algorithm or representation may not have been optimal. We may therefore decide to try another representation/algorithm, compare the results and perhaps combine them. This is an iterative process. What may help us in deciding the representation and algorithm for further analysis? Consider the two datasets in Figure ??. In the left figure we see that the data naturally forms clusters, while in the right figure we observe that the data is approximately distributed on a line. The left figure suggests a clustering approach while the right figure suggests a dimensionality reduction approach. This illustrates the importance of looking at the data before you start your analysis instead of (literally) blindly picking an algorithm. After your first peek, you may decide to transform the data and then look again to see if the transformed data better suit the assumptions of the algorithm you have in mind. “Looking at the data” sounds easier than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance, image patches of size 10 × 10 live in a 100-dimensional pixel space. How are we going to visualize it? There are many answers to this problem, but most involve **projection**: we determine a number of, say, 2 or 3 dimensional subspaces onto which we project the data. The simplest choice of subspaces is the ones aligned with the features, e.g. we can plot \( x_1 \) vs \( x_2 \). #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 20 Context: # CHAPTER 2. DATA VISUALIZATION An example of such a scatter plot is given in Figure ??. Note that we have a total of \( d(d - 1) / 2 \) possible two-dimensional projections which amounts to 4950 projections for 100-dimensional data. This is usually too many to manually inspect. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out to be not a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the **central limit theorem** which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote with \( \mathbf{w} \) a vector in \( \mathbb{R}^d \) and by \( \mathbf{x} \) the d-dimensional random variable, then \( y = \mathbf{w}^T \mathbf{x} \) is the value of the projection. This is clearly a weighted sum of the random variables \( x_i, \; i = 1, \ldots, d \).
If we assume that \( x_i \) are approximately independent, then we can see that their sum will be governed by the central limit theorem. Analogously, a dataset \( \{X_n\} \) can thus be visualized in one dimension by "histogramming" the values of \( y_n = \mathbf{w}^T \mathbf{x}_n \), see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data. In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions and the Gaussian distribution is one of the few distributions which have very nice analytical properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of "uninformative" can actually be made very precise using information theory and states: > Given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance. This is rather unfortunate for our purposes because Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure. A large number of algorithms have been devised to search for informative projections. The simplest is "principal components analysis", or PCA for short ??. Here, interesting means dimensions of high variance. However, high variance is not always a good measure of interestingness, and one should rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimensions #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 21 Context: # Data Visualization Techniques Distributions that have heavy tails relative to Gaussian distributions are important to consider. Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold. ## Scatter Plots Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate, I will give a few examples from a variety of techniques: 1. **Histogram**: A useful way to represent the distribution of a dataset. 2. **Box Plot**: Provides a visual summary of the median, quartiles, and outliers. 3. **Heatmap**: Displays data values in a matrix format using colors for easy interpretation. 4. **Line Graph**: Ideal for showing trends over time. Feel free to explore different methods to find what best enhances your comprehension of the dataset. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 22 Context: # CHAPTER 2: DATA VISUALIZATION ## Introduction Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. ## Benefits of Data Visualization - **Enhanced Understanding**: Complex data becomes more understandable through visual representation. - **Immediate Insights**: Visualizations can provide quick and effective insights into data trends. - **Better Communication**: It aids in storytelling and communicating data findings effectively. ## Common Types of Data Visualizations 1.
**Bar Charts** - Useful for comparing quantities across categories. 2. **Line Graphs** - Ideal for showing trends over time. 3. **Pie Charts** - Best for illustrating proportions of a whole. 4. **Heat Maps** - Effective for displaying data density across a geographical area. ## Tools for Data Visualization | Tool | Description | Cost | |---------------|--------------------------------------------------|-------------| | Tableau | Leading data visualization tool | Subscription | | Microsoft Excel | Popular for creating basic charts and graphs | License fee | | Power BI | Business analytics service from Microsoft | Subscription | | Google Data Studio | Free online tool for data visualization | Free | ## Conclusion Data visualization is a crucial technique for data analysis and communication. By implementing effective visualization methods and using appropriate tools, organizations can greatly enhance their decision-making processes. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 23 Context: # Chapter 3 ## Learning This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter. Ok, let’s start with an example. Alice has a rather strange ailment. She is not able to recognize objects by their visual appearance. At home, she is doing just fine: her mother explained to Alice, for every object in her house, what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room, she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting." What is the matter with Alice? Nothing is wrong with her memory because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object, she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice’s disease is so rare is understandable; there must have been a strong selection pressure against this disease. Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to logically infer the possibility that this animal may be dangerous. Alice’s contemporaries noticed that the animal was yellow-brown, had manes, etc., and immediately #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 24 Context: # CHAPTER 3. LEARNING understood that this was a lion. They understood that all lions have these particular characteristics in common, but may differ in some other ones (like the presence of a scar someplace). Bob has another disease, which is called over-generalization.
Once he has seen an object he believes almost everything is some, perhaps twisted, instance of the same object class. (In fact, I seem to suffer from this every now and then, when I think all of machine learning can be explained by some one new exciting principle.) If ancestral Bob walks the savanna and he has just encountered an instance of a lion and fled into a tree with his buddies, the next time he sees a squirrel he believes it is a small instance of a dangerous lion and flees into the trees again. Over-generalization seems to be rather common among small children. One of the main conclusions from this discussion is that we should neither over-generalize nor over-fit. We need to be on the edge of being just right. But just right about what? It doesn’t seem there is one correct God-given definition of the category "chair." We seem to all agree, but one can surely find examples that would be difficult to classify. When do we generalize exactly right? The magic word is **PREDICTION**. From an evolutionary standpoint, all we have to do is make correct predictions about aspects of life that help us survive. Nobody really cares about the definition of lion, but we do care about our responses to the various animals (run away for lion, chase for deer). And there are a lot of things that can be predicted in the world. This food kills me but that food is good for me. Drumming my fists on my hairy chest in front of a female generates opportunities for sex; sticking my hand into that yellow-orange flickering “flame” hurts my hand and so on. The world is wonderfully predictable and we are very good at predicting it. So why do we care about object categories in the first place? Well, apparently they help us organize the world and make accurate predictions. The category *lions* is an abstraction and abstractions help us to generalize. In a certain sense, learning is all about finding useful abstractions or concepts that describe the world. Take the concept “fluid”; it describes all watery substances and summarizes some of their physical properties. Or the concept of “weight”: an abstraction that describes a certain property of objects. Here is one very important corollary for you: > “machine learning is not in the business of remembering and regurgitating observed information, it is in the business of transferring (generalizing) properties from observed data onto new, yet unobserved data.” This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the final exam). The information we receive from the world has two components to it. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 25 Context: # Information Theory and Image Compression The part of the information that does not carry over to the future, the unpredictable information, is called "noise." There is also the information that is predictable, the learnable part of the information stream. The task of any learning algorithm is to separate the predictable part from the unpredictable part. Now imagine Bob wants to send an image to Alice. He has to pay 1 cent for every bit that he sends. If the image were completely white, it would be really stupid of Bob to send the message: ``` pixel 1: white, pixel 2: white, pixel 3: white, .... ``` He could just have sent the message: ``` all pixels are white! ``` The blank image is completely predictable but carries very little information.
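To see this in code before turning to the opposite extreme below: a minimal sketch of a run-length code, a simple stand-in for Bob's "rules plus corrections" (not the book's construction), which collapses predictable pixel streams but, as printed here, gains almost nothing on noise:

```python
import random

def run_length_encode(pixels):
    # Collapse runs of identical pixels into [value, count] pairs.
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return runs

blank = ["white"] * 256                                           # fully predictable
noise = [random.choice(["white", "black"]) for _ in range(256)]   # no structure

print(len(run_length_encode(blank)))   # 1 run: "all pixels are white!"
print(len(run_length_encode(noise)))   # ~128 runs: hardly cheaper than every pixel
```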
Now imagine an image that consists of white noise (your television screen if the cable is not connected). To send the exact image Bob will have to send: ``` pixel 1: white, pixel 2: black, pixel 3: black, .... ``` Bob cannot do better because there is no predictable information in that image, i.e., there is no structure to be modeled. You can imagine playing a game, revealing one pixel at a time to someone and paying him for every next pixel he predicts correctly. For the white image you could do this perfectly; for the noisy picture you would be guessing at random. Real pictures are in between: some pixels are very hard to predict, while others are easier. To compress the image, Bob can extract rules such as: always predict the same color as the majority of the pixels next to you, except when there is an edge. These rules constitute the model for the regularities of the image. Instead of sending the entire image pixel by pixel, Bob will now first send his rules and ask Alice to apply the rules. Every time the rule fails, Bob also sends a correction: ``` pixel 103: white, pixel 245: black. ``` A few rules and two corrections are obviously cheaper than 256 pixel values and no rules. There is one fundamental tradeoff hidden in this game. Since Bob is sending only a single image, it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all pixel values. If he were sending 1 billion images it would pay off to first send the complicated model because he would be saving a fraction of all bits for every image. On the other hand, if Bob wants to send 2 pixels, there really is no point in sending a model whatsoever. Therefore, the size of Bob's model depends on the amount of data he wants to transmit. In this context, the boundary between what is model and what is noise depends on how much data we are dealing with! If we use a model that is too complex, we overfit to the data at hand, i.e., part of the model represents noise. On the other hand, if we use too simple a model we "underfit" (over-generalize) and valuable structure remains unmodeled. Both lead to sub-optimal compression of the image. The compression game can therefore be used to find the right size of model complexity for a given dataset. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 26 Context: # CHAPTER 3. LEARNING Now let’s think for a moment what we really mean with “a model”. A model represents our prior knowledge of the world. It imposes structure that is not necessarily present in the data. We call this the **inductive bias**. Our inductive bias often comes in the form of a parameterized model. That is to say, we define a family of models but let the data determine which of these models is most appropriate. A strong inductive bias means that we don’t leave flexibility in the model for the data to work on. We are so convinced of ourselves that we basically ignore the data. The downside is that we may be creating a “bad bias” towards the wrong model. On the other hand, if we are correct, we can learn the remaining degrees of freedom in our model from very few data-cases. Conversely, we may leave the door open for a huge family of possible models. If we now let the data zoom in on the model that best explains the training data it will overfit to the peculiarities of that data.
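A minimal sketch of this overfitting effect, assuming NumPy; the straight-line ground truth and the polynomial model family are made up for illustration, and it previews the repeated-datasets thought experiment described next:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=10):
    # Hypothetical world: a straight line y = 2x, observed with noise.
    x = rng.uniform(-1, 1, n)
    return x, 2.0 * x + rng.normal(0.0, 0.3, n)

# Fit a rigid model (degree 1) and a very flexible one (degree 9) to 10
# independently sampled datasets, tracking one parameter theta: the
# coefficient of the linear term.
for degree in (1, 9):
    thetas = [np.polyfit(*sample_dataset(), degree)[-2] for _ in range(10)]
    print(degree, np.mean(thetas), np.std(thetas))  # flexible -> far higher variance
```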
Now imagine you sampled 10 datasets of the same size \(N\) and train these very flexible models separately on each of these datasets (note that in reality you only have access to one such dataset but please play along in this thought experiment). Let’s say we want to determine the value of some parameter \(\theta\). Because the models are so flexible, we can actually model the idiosyncrasies of each dataset. The result is that the value for \(\theta\) is likely to be very different for each dataset. But because we didn’t impose much inductive bias, the average of many such estimates will be about right. We say that the bias is small, but the variance is high. In the case of very restrictive models the opposite happens: the bias is potentially large but the variance small. Note that not only is a large bias bad (for obvious reasons), a large variance is bad as well: because we only have one dataset of size \(N\), our estimate could be very far off simply because we were unlucky with the dataset we were given. What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. If we don’t trust our prior knowledge we should let the data speak. However, letting the data speak too much might lead to overfitting, so we need to find the boundary between too complex and too simple a model and get its complexity just right. Access to more data means that the data can speak more relative to prior knowledge. That, in a nutshell, is what machine learning is all about. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 27 Context: # 3.1 In a Nutshell Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. Learning = generalization = abstraction = compression. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 28 Context: # CHAPTER 3. LEARNING ## Introduction Learning involves acquiring knowledge, skills, attitudes, or competencies. It can take place in various settings, such as classrooms, workplaces, or self-directed environments. This chapter discusses the types and processes of learning. ## Types of Learning 1. **Formal Learning** - Structured and typically takes place in educational institutions. - Includes degrees, diplomas, and certifications. 2. **Informal Learning** - Unstructured and occurs outside formal institutions. - Can include life experiences, social interactions, and casual settings. 3. **Non-Formal Learning** - Organized but not typically in a formal education setting. - Often community-based, such as workshops and training sessions. ## Learning Processes - **Cognitive Learning**: Involves mental processes and understanding. - **Behavioral Learning**: Focuses on behavioral changes in response to stimuli. - **Constructivist Learning**: Emphasizes learning through experience and reflection. ## Table of Learning Theories | Theory | Key Contributor | Description | |--------------------------|----------------------|--------------------------------------------------| | Behaviorism | B.F.
Skinner | Learning as a change in behavior due to reinforcement. | | Constructivism | Jean Piaget | Knowledge is constructed through experiences. | | Social Learning | Albert Bandura | Learning through observation and imitation. | ## Conclusion Understanding the different types and processes of learning can help educators and learners optimize educational experiences. Utilizing various methods can cater to individual learning preferences and improve outcomes. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 29 Context: # Chapter 4 ## Types of Machine Learning We will now turn our attention to some learning problems that we will encounter in this book. The most well-studied problem in ML is that of **supervised learning**. To explain this, let’s first look at an example. Bob wants to learn how to distinguish between bobcats and mountain lions. He types these words into Google Image Search and closely studies all catlike images of bobcats on the one hand and mountain lions on the other. Some months later on a hiking trip in the San Bernardino mountains he sees a big cat... The data that Bob collected was labelled because Google is supposed to only return pictures of bobcats when you search for the word "bobcat" (and similarly for mountain lions). Let’s call the images \( X_1, \ldots, X_n \), and the labels \( Y_1, \ldots, Y_n \). Note that \( X_i \) are much higher dimensional objects because they represent all the information extracted from the image (approximately 1 million pixel color values), while \( Y_i \) is simply −1 or 1 depending on how we choose to label our classes. So, that would be a ratio of about 1 million to 1 in terms of information content! The classification problem can usually be posed as finding (a.k.a. learning) a function \( f(x) \) that approximates the correct class labels for any input \( x \). For instance, we may decide that \( \text{sign}(f(x)) \) is the predictor for our class label. In the following, we will be studying quite a few of these classification algorithms. There is also a different family of learning problems known as **unsupervised learning** problems. In this case there are no labels \( Y \) involved, just the features \( X \). Our task is not to classify, but to organize the data, or to discover the structure in the data. This may be very useful for visualizing data, compressing data, or organizing data for easy accessibility. Extracting structure in data often leads to the discovery of concepts, topics, abstractions, factors, causes, and more such terms that all really mean the same thing. These are the underlying semantic structures. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 30 Context: # CHAPTER 4. TYPES OF MACHINE LEARNING Such factors can explain the data. Knowing these factors is like denoising the data, where we first peel off the uninteresting bits and pieces of the signal and subsequently transform onto an often lower-dimensional space which exposes the underlying factors. There are two dominant classes of unsupervised learning algorithms: clustering-based algorithms assume that the data organizes into groups.
Finding these groups is then the task of the ML algorithm and the identity of the group is the semantic factor. Another class of algorithms strives to project the data onto a lower dimensional space. This mapping can be nonlinear, but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Unrolling that manifold is then the task of the learning algorithm. In this case, the dimensions should be interpreted as semantic factors. There are many variations on the above themes. For instance, one is often confronted with a situation where you have access to much more unlabeled data (only \(X_i\)) and many fewer labeled instances (both \((X_i, Y_i)\)). Take the task of classifying news articles by topic (weather, sports, national news, international news, etc.). Some people may have labeled some news-articles by hand but there won’t be all that many of those. However, we do have a very large digital library of scanned newspapers available. Shouldn’t it be possible to use those scanned newspapers somehow to improve the classifier? Imagine that the data naturally clusters into well-separated groups (for instance because news articles reporting on different topics use very different words). This is depicted in Figure ??. Note that there are only very few cases which have labels attached to them. From this figure it becomes clear that the expected optimal decision boundary nicely separates these clusters. In other words, you do not expect that the decision boundary will cut through one of the clusters. Yet that is exactly what would happen if you were only using the labeled data. Hence, by simply requiring that decision boundaries do not cut through regions of high probability we can improve our classifier. The subfield that studies how to improve classification algorithms using unlabeled data goes under the name “semi-supervised learning”. A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food. While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). It’s only at the end when he reaches the cheese that he receives his positive feedback, and he will have to use this to reinforce his perhaps random earlier decisions that led him to the cheese. These problems are generally known as reinforcement learning. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 31 Context: These problems fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generality also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments.
The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is its biological relevance. Do we not also have to figure out how the world works and survive in it? Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or take the mouse in the maze. Given that he needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning". One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example, a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don’t have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it has hopefully become clear that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones. The trick is to share not too much and not too little, and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning". #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 32 Context: # Chapter 4: Types of Machine Learning ## 4.1 In a Nutshell There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes, unsupervised learning tries to discover interesting structure in data, semi-supervised learning uses both labeled and unlabeled data to improve predictive performance, reinforcement learning can handle simple feedback in the form of delayed reward, active learning optimizes the next sample to include in the learning algorithm, and multi-task learning deals with sharing common model components between related learning tasks. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 33 Context: # Chapter 5 ## Nearest Neighbors Classification Perhaps the simplest algorithm to perform classification is the **k nearest neighbors (kNN) classifier**. As usual, we assume that we have data of the form \(\{X_{in}, Y_n\}\), where \(X_{in}\) is the value of attribute \(i\) for data-case \(n\) and \(Y_n\) is the label for data-case \(n\).
We also need a measure of similarity between data-cases, which we will denote with \(K(X_n, X_m)\), where larger values of \(K\) denote more similar data-cases. Given these preliminaries, classification is embarrassingly simple: when you are provided with the attributes \(X_t\) for a new (unseen) test-case, you first find the \(k\) most similar data-cases in the dataset by computing \(K(X_t, X_n)\) for all \(n\). Call this set \(S\). Then, each of these \(k\) most similar neighbors in \(S\) can cast a vote on the label of the test case, where each neighbor predicts that the test case has the same label as itself. Assuming binary labels and an odd number of neighbors, this will always result in a decision. Although kNN algorithms are often associated with this simple voting scheme, more sophisticated ways of combining the information of these neighbors are allowed. For instance, one could weigh each vote by the similarity to the test-case. This results in the following decision rule: \[ Y_t = 1 \quad \text{if} \quad \sum_{n \in S} K(X_t, X_n)(2Y_n - 1) > 0 \tag{5.1} \] \[ Y_t = 0 \quad \text{if} \quad \sum_{n \in S} K(X_t, X_n)(2Y_n - 1) < 0 \tag{5.2} \] and flipping a coin if it is exactly 0. Why do we expect this algorithm to work intuitively? The reason is that we expect data-cases with similar labels to cluster together in attribute space. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 34 Context: # CHAPTER 5. NEAREST NEIGHBORS CLASSIFICATION To figure out the label of a test case, we simply look around and see what labels our neighbors have. Asking your closest neighbor is like betting all your money on a single piece of advice, and you might get really unlucky if your closest neighbor happens to be an odd one out. It’s typically better to ask several opinions before making your decision. However, if you ask around too much, you will be forced to take advice from data-cases that are no longer very similar to you. So there is some optimal number of neighbors to ask, which may be different for every problem. Determining this optimal number of neighbors is not easy, but we can again use cross-validation (see section ??) to estimate it. ## What is Good and Bad about kNN? First, its simplicity makes it attractive. Very few assumptions about the data are used in the classification process. This property can also be a disadvantage: if you have prior knowledge about how the data was generated, it’s better to use it, because less information has to be extracted from the data. A second consideration is computation time and memory efficiency. Assume you have a very large dataset, but you need to make decisions very quickly. As an example, consider serving the web pages at Amazon.com. Whenever you search for a book, it likes to suggest 10 others. To do that it could classify books into categories and suggest the top ranked in that category. kNN requires Amazon to store all features of all books at a location that is accessible for fast computation. Moreover, to classify, kNN has to perform the neighborhood search all over again every time. Clearly, there are tricks that can be played with smart indexing, but wouldn’t it be much easier if we could have summarized all books by a simple classification function \( f(X) \) that “spits out” a class for any combination of features \( X \)? This distinction, between models that summarize the data in a fixed set of parameters and algorithms that must remember every data-item, is often called “parametric” versus “non-parametric”.
It’s important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function \( f_0(X) \), or whether we retain all the data to do the classification “on the fly”. kNN is also known to suffer from the “curse of high dimensions”. If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then kNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two-dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100-dimensional space. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 35 Context: Because 98 noisy dimensions have been added, the nearest neighbors are largely determined by noise, and this effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with much care and to preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection", on which a whole book could be written in itself. 5.1 The Idea In a Nutshell ========================== To classify a new data item, you first look for the \( k \) nearest neighbors in feature space and assign it the same label as the majority of these neighbors. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 36 Context: # CHAPTER 5. NEAREST NEIGHBORS CLASSIFICATION ## 5.1 Introduction The nearest neighbors classification is a type of instance-based learning. It is a method used in machine learning that classifies a data point based on how its neighbors are classified. ## 5.2 Algorithm The basic algorithm for k-nearest neighbors (k-NN) can be summarized as follows: 1. **Choose the number of neighbors, k.** 2. **Calculate the distance** between the new data point and all existing data points. 3. **Sort the distances** and determine the nearest k neighbors. 4. **Vote** for the most common class among the k neighbors. ## 5.3 Distance Metrics Common distance metrics include: - **Euclidean Distance** - **Manhattan Distance** - **Minkowski Distance** | Distance Metric | Formula | |------------------|--------------------------------------------| | Euclidean | \(d(x, y) = \sqrt{\sum{(x_i - y_i)^2}}\) | | Manhattan | \(d(x, y) = \sum{|x_i - y_i|}\) | | Minkowski | \(d(x, y) = \left( \sum{|x_i - y_i|^p} \right)^{1/p}\) | ## 5.4 Applications Nearest neighbors classification can be applied in various fields such as: - **Image Recognition** - **Recommendation Systems** - **Medical Diagnosis** ## 5.5 Pros and Cons ### Pros - Simple and easy to implement. - Requires no explicit training phase. ### Cons - Computation cost can be high with large datasets. - Sensitive to irrelevant features and the scale of the data. ## 5.6 Conclusion The nearest neighbors classification is a powerful technique in machine learning. By understanding its algorithm and applications, one can leverage this method effectively for various tasks.
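To make the weighted voting rule of Eqs. 5.1–5.2 concrete, here is a minimal sketch in Python. It is not from the book: the Gaussian similarity standing in for \(K\), the default value of \(k\), and all names are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, Y_train, x_test, k=3, gamma=1.0, rng=None):
    """Weighted kNN vote of Eqs. 5.1-5.2, with a Gaussian similarity playing the role of K."""
    rng = rng or np.random.default_rng()
    # Similarity K(X_t, X_n) for all n; larger values mean more similar data-cases.
    K = np.exp(-gamma * np.sum((X_train - x_test) ** 2, axis=1))
    S = np.argsort(K)[-k:]                       # the k most similar data-cases: the set S
    score = np.sum(K[S] * (2 * Y_train[S] - 1))  # each neighbor votes with weight K
    if score > 0:
        return 1
    if score < 0:
        return 0
    return int(rng.integers(2))                  # flip a coin if the score is exactly 0

# Toy usage: two small clusters with labels 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
Y = np.array([0, 0, 1, 1])
print(knn_predict(X, Y, np.array([0.95, 0.9])))  # prints 1
```

Replacing the Gaussian similarity by one derived from the Euclidean, Manhattan, or Minkowski distances of the table above changes only the first line of the function body.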
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 37 Context: # Chapter 6 ## The Naive Bayesian Classifier In this chapter we will discuss the "Naive Bayes" (NB) classifier. It has proven to be very useful in many applications, both in science as well as in industry. In the introduction, I promised I would try to avoid the use of probabilities as much as possible. However, in this chapter, I'll make an exception, because the NB classifier is most naturally explained with the use of probabilities. Fortunately, we will only need the most basic concepts. ### 6.1 The Naive Bayes Model NB is mostly used when dealing with discrete-valued attributes. We will explain the algorithm in this context, but note that extensions to continuous-valued attributes are possible. We will restrict attention to classification problems between two classes and refer to section ?? for approaches to extend this to more than two classes. In our usual notation, we consider \( D \) discrete-valued attributes \( X_i \in \{0, \ldots, V_i\} \), \( i = 1, \ldots, D \). Note that each attribute can have a different number of values \( V_i \). If the original data was supplied in a different format, e.g., \( X_1 \in \{\text{Yes}, \text{No}\} \), then we simply reassign these values to fit the above format, \( \text{Yes} = 1, \text{No} = 0 \) (or reversed). In addition, we are also provided with a supervised signal; in this case, the labels are \( Y = 0 \) and \( Y = 1 \), indicating that the data-item fell in class 0 or class 1. Again, which class is assigned to 0 or 1 is arbitrary and has no impact on the performance of the algorithm. Before we move on, let's consider a real world example: spam filtering. Every day, your mailbox gets bombarded with hundreds of spam emails. To give an... #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 38 Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER An example of the traffic that it generates: the University of California Irvine receives on the order of 2 million spam emails a day. Fortunately, the bulk of these emails (approximately 97%) is filtered out or dumped into your spam box and will never reach your attention. How is this done? Well, it turns out to be a classic example of a classification problem: spam or ham, that’s the question. Let’s say that spam will receive a label 1 and ham a label 0. Our task is thus to label each new email with either 0 or 1. What are the attributes? Rephrasing this question, what would you measure in an email to see if it is spam? Certainly, if I read “viagra” in the subject I would stop right there and dump it in the spam box. What else? Here are a few: “enlargement,” “cheap,” “buy,” “pharmacy,” “money,” “loan,” “mortgage,” “credit,” and so on. We can build a dictionary of words that we can detect in each email. This dictionary could also include word phrases such as “buy now” and “penis enlargement”; one can make phrases as sophisticated as necessary. One could measure whether the words or phrases appear at least once, or one could count the actual number of times they appear. Spammers know about the way these spam filters work and counteract by slight misspellings of certain key words. Hence we might also want to detect words like “via gra” and so on.
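As an aside, here is a small sketch (not from the book) of how such attributes might be measured: it counts occurrences of dictionary words in an email and caps the count at "more than once", matching the 0/1/2 recoding used below. The dictionary contents and the cap are illustrative assumptions.

```python
import re

DICTIONARY = ["viagra", "enlargement", "cheap", "buy", "pharmacy",
              "money", "loan", "mortgage", "credit"]

def extract_attributes(email_text, dictionary=DICTIONARY):
    """Map an email to attribute values X_i in {0, 1, 2}: absent, once, more than once."""
    words = re.findall(r"[a-z]+", email_text.lower())
    counts = {w: 0 for w in dictionary}
    for w in words:
        if w in counts:
            counts[w] += 1
    return [min(counts[w], 2) for w in dictionary]

print(extract_attributes("Buy cheap viagra now! Viagra works."))
# -> [2, 0, 1, 1, 0, 0, 0, 0, 0]
```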
In fact, a small arms race has ensued where spam filters and spam generators find tricks to counteract the tricks of the “opponent.” Putting all these subtleties aside for a moment, we’ll simply assume that we measure a number of these attributes for every email in a dataset. We’ll also assume that we have spam/ham labels for these emails, which were acquired by someone removing spam emails by hand from his/her inbox. Our task is then to train a predictor for spam/ham labels for future emails where we have access to attributes but not to labels. The NB model is what we call a "generative" model. This means that we imagine how the data was generated in an abstract sense. For emails, this works as follows: an imaginary entity first decides how many spam and ham emails it will generate on a daily basis. Say, it decides to generate 40% spam and 60% ham. We will assume this doesn’t change with time (of course it does, but we will make this simplifying assumption for now). It will then decide what the chance is that a certain word appears \( x \) times in a spam email. For example, the word “viagra” has a chance of 96% to not appear at all, 1% to appear once, 0.9% to appear twice, etc. These probabilities are clearly different for spam and ham. “Viagra” should have a much smaller probability to appear in a ham email (but not zero, of course; consider that I send this text to my publisher by email). Given these probabilities, we could then go on and try to generate emails that actually look like real emails, i.e., with proper sentences, but we won’t need that in the following. Instead we make the simplifying assumption that an email consists of “a bag of words,” in random order. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 39 Context: 6.2 Learning a Naive Bayes Classifier =============================== Given a dataset, \(\{(X_{in}, Y_n),\ i = 1, \ldots, D,\ n = 1, \ldots, N\}\), we wish to estimate what these probabilities are. To start with the simplest one, what would be a good estimate of the percentage of spam versus ham emails that our imaginary entity uses to generate emails? Well, we can simply count how many spam and ham emails we have in our data. This is given by: \[ P(\text{spam}) = \frac{\#\ \text{spam emails}}{\#\ \text{total emails}} = \frac{\sum_n \mathbb{I}[Y_n = 1]}{N} \tag{6.1} \] Here we mean with \(\mathbb{I}[A]\) a function that is equal to 1 if its argument \(A\) is true, and zero otherwise. Hence, in the equation above it counts the number of instances for which \(Y_n = 1\). Since the remainder of the emails must be ham, we also find that \[ P(\text{ham}) = 1 - P(\text{spam}) = \frac{\#\ \text{ham emails}}{\#\ \text{total emails}} = \frac{\sum_n \mathbb{I}[Y_n = 0]}{N} \tag{6.2} \] where we have used that \(P(\text{ham}) + P(\text{spam}) = 1\), since an email is either ham or spam. Next, we need to estimate how often we expect to see a certain word or phrase in either a spam or a ham email. In our example, we could for instance ask ourselves what the probability is that we find the word "viagra" \(k\) times, with \(k = 0\), \(k = 1\), or \(k > 1\), in a spam email. Let's recode this as \(X_{\text{viagra}} = 0\) meaning that we didn't observe "viagra", \(X_{\text{viagra}} = 1\) meaning that we observed it once, and \(X_{\text{viagra}} = 2\) meaning that we observed it more than once. The answer is again that we can count how often these events happened in our data and use that as an estimate for the real probabilities according to which it generated emails.
First for spam we find, \[ P_{\text{spam}}(X_i = j) = \frac{\#\ \text{spam emails in which word } i \text{ appeared } j \text{ times}}{\text{total}\ \#\ \text{of spam emails}} \tag{6.3} \] \[ = \frac{\sum_n \mathbb{I}[X_{in} = j \land Y_n = 1]}{\sum_n \mathbb{I}[Y_n = 1]} \tag{6.4} \] Here we have defined the symbol \(\land\) to mean that both statements to the left and right of this symbol should hold true in order for the entire sentence to be true. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 40 Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER For ham emails, we compute exactly the same quantity, $$ P_{\text{ham}}(X_i = j) = \frac{\#\ \text{ham emails in which word } i \text{ appeared } j \text{ times}}{\text{total}\ \#\ \text{of ham emails}} $$ (6.5) $$ = \frac{\sum_{n} \mathbb{I}[X_{in} = j \land Y_{n} = 0]}{\sum_{n} \mathbb{I}[Y_{n} = 0]} $$ (6.6) Both these quantities should be computed for all words or phrases (or, more generally, attributes). We have now finished the phase where we estimate the model from the data. We will often refer to this phase as "learning" or training a model. The model helps us understand how data was generated in some approximate setting. The next phase is that of prediction or classification of new email. ## 6.3 Class-Prediction for New Instances New email does not come with a label ham or spam (if it did, we could throw spam in the spam-box right away). What we do see are the attributes $\{X_i\}$. Our task is to guess the label based on the model and the measured attributes. The approach we take is simple: calculate whether the email has a higher probability of being generated from the spam or the ham model. For example, because the word "viagra" has a tiny probability of being generated under the ham model, it will end up with a higher probability under the spam model. But clearly, all words have a say in this process. It’s like a large committee of experts, one for each word; each member casts a vote and can say things like: "I am 99% certain it's spam", or "It’s almost definitely not spam (0.1% spam)". Each of these opinions will be multiplied together to generate a final score. We then figure out whether ham or spam has the highest score. There is one little practical caveat with this approach, namely that the product of a large number of probabilities, each of which is necessarily smaller than one, very quickly gets so small that your computer can’t handle it. There is an easy fix though. Instead of multiplying probabilities as scores, we use the logarithms of those probabilities and add the logarithms. This is numerically stable and leads to the same conclusion, because if \(a > b\) then we also have \(\log(a) > \log(b)\) and vice versa. In equations, we compute the score as follows: $$ S_{\text{spam}} = \sum_{i} \log P_{\text{spam}}(X_i = v_i) + \log P(\text{spam}) $$ (6.7) #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 41 Context: # 6.3. Class-Prediction for New Instances where with \(v_i\) we mean the value for attribute \(i\) that we observe in the email under consideration, i.e. if the email contains no mention of the word “viagra” we set \(v_{\text{viagra}} = 0\). The first term in Eqn. 6.7 adds all the log-probabilities under the spam model of observing the particular value of each attribute. Every observed word that has high probability under the spam model, because it was often observed in the dataset's spam emails, will boost this score.
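Before continuing with the decision rule, note that the counting estimates of Eqs. 6.1–6.6 translate almost literally into code. A minimal sketch (not the book's implementation), assuming `X` is an \(N \times D\) integer array of attribute values, `Y` a length-\(N\) 0/1 label vector, and `V[i]` the number of values of attribute \(i\); all names are illustrative.

```python
import numpy as np

def train_naive_bayes(X, Y, V):
    """Counting estimates of Eqs. 6.1-6.6."""
    p_spam = np.mean(Y == 1)                   # Eq. 6.1
    p_ham = 1.0 - p_spam                       # Eq. 6.2
    cond = {}                                  # cond[(c, i)][j] approximates P_c(X_i = j)
    for c in (0, 1):                           # c = 1: spam, c = 0: ham
        mask = (Y == c)
        for i in range(X.shape[1]):
            # numerator of Eqs. 6.3-6.6: counts of I[X_in = j and Y_n = c] for each j
            counts = np.bincount(X[mask, i], minlength=V[i])
            cond[(c, i)] = counts / mask.sum()
    return p_spam, p_ham, cond
```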
The last term in Eqn. 6.7 adds an extra factor to the score that expresses our prior belief of receiving a spam email instead of a ham email. We compute a similar score for ham, namely: \[ S_{\text{ham}} = \sum_i \log P_{\text{ham}}(X_i = v_i) + \log P(\text{ham}) \tag{6.8} \] and compare the two scores. Clearly, a large score for spam relative to ham provides evidence that the email is indeed spam. If your goal is to minimize the total number of errors (whether they involve spam or ham), then the decision should be to choose the class which has the highest score. In reality, one type of error could have more serious consequences than another. For instance, a spam email making it into my inbox is not too bad, but an important email that ends up in my spam box (which I never check) may have serious consequences. To account for this, we introduce a general threshold \(\theta\) and use the following decision rule (writing \(S_1 = S_{\text{spam}}\) and \(S_0 = S_{\text{ham}}\)): \[ Y = 1 \quad \text{if } S_1 > S_0 + \theta \tag{6.9} \] \[ Y = 0 \quad \text{if } S_1 < S_0 + \theta \tag{6.10} \] If these quantities are equal, we flip a coin. If \(\theta = -\infty\), we always decide in favor of label \(Y = 1\), while if we use \(\theta = +\infty\) we always decide in favor of \(Y = 0\). The actual value is a matter of taste. To evaluate a classifier we often draw an ROC curve. An ROC curve is obtained by sliding \(\theta\) between \(-\infty\) and \(+\infty\) and plotting the true positive rate (the number of examples with label \(Y = 1\) also classified as \(Y = 1\), divided by the total number of examples with \(Y = 1\)) versus the false positive rate (the number of examples with label \(Y = 0\) classified as \(Y = 1\), divided by the total number of examples with \(Y = 0\)). For more details see chapter ??. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 42 Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER ## 6.4 Regularization The spam filter algorithm that we discussed in the previous sections unfortunately does not work very well if we wish to use many attributes (words, word-phrases). The reason is that for many attribute values we may not encounter a single example in the dataset. Say for example that we defined the word “Nigeria” as an attribute, but that our dataset did not include one of those spam emails where you are promised mountains of gold if you invest your money in someone's bank in Nigeria. Also assume there are indeed a few ham emails which talk about the nice people in Nigeria. Then any future email that mentions Nigeria is classified as ham with 100% certainty. More importantly, one cannot recover from this decision even if the email also mentions viagra, enlargement, mortgage and so on, all in a single email! This can be seen by the fact that \[ \log P_{\text{spam}}(X_{\text{Nigeria}} > 0) = -\infty \tag{6.11} \] while the final score is a sum of these individual word-scores. To counteract this phenomenon, we give each word in the dictionary a small probability of being present in any email (spam or ham), before seeing the data. This process is called smoothing. The impact on the estimated probabilities is given below: \[ P_{\text{spam}}(X_i = j) = \frac{\alpha + \sum_{n} \mathbb{I}[X_{in} = j \land Y_n = 1]}{V_i \alpha + \sum_{n} \mathbb{I}[Y_n = 1]} \tag{6.12} \] \[ P_{\text{ham}}(X_i = j) = \frac{\alpha + \sum_{n} \mathbb{I}[X_{in} = j \land Y_n = 0]}{V_i \alpha + \sum_{n} \mathbb{I}[Y_n = 0]} \tag{6.13} \] where \( V_i \) is the number of possible values of attribute \( i \).
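In code, the smoothed estimates of Eqs. 6.12–6.13 amount to adding \(\alpha\) pseudo-counts before normalizing, and the rule of Eqs. 6.7–6.10 is a sum of log-probabilities compared against a threshold. A minimal sketch under the same assumptions as the previous block; the values of \(\alpha\) and \(\theta\) are arbitrary illustrations, and tie-breaking by coin flip is omitted for brevity.

```python
import numpy as np

def train_smoothed(X, Y, V, alpha=0.5):
    """Smoothed class-conditional log-probabilities, Eqs. 6.12-6.13, plus log-priors."""
    logp = {}
    for c in (0, 1):
        mask = (Y == c)
        logp[("prior", c)] = np.log(np.mean(Y == c))
        for i in range(X.shape[1]):
            counts = np.bincount(X[mask, i], minlength=V[i])
            logp[(c, i)] = np.log((alpha + counts) / (V[i] * alpha + mask.sum()))
    return logp

def classify(logp, v, D, theta=0.0):
    """Scores of Eqs. 6.7-6.8 and the thresholded decision of Eqs. 6.9-6.10."""
    S = {c: logp[("prior", c)] + sum(logp[(c, i)][v[i]] for i in range(D))
         for c in (0, 1)}
    return 1 if S[1] > S[0] + theta else 0
```

Sweeping `theta` and recording the true and false positive rates on held-out data traces out the ROC curve described above.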
Thus, \( \alpha \) can be interpreted as a small, possibly fractional number of “pseudo-observations” of the attribute in question. It’s like adding these observations to the actual dataset. What value for \( \alpha \) do we use? Fitting its value on the dataset will not work, because the reason we added it was exactly that we assumed there was too little data in the first place (we hadn’t received one of those annoying “Nigeria” emails yet); fitting \( \alpha \) on the same data would simply lead to overfitting. However, we can use the trick described in section ??, where we split the data into two pieces. We learn a model on one chunk and adjust \( \alpha \) such that performance on the other chunk is optimal. We play this game multiple times with different splits and average the results. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 43 Context: # 6.5 Remarks One of the main limitations of the NB classifier is that it assumes independence between attributes (this is presumably the reason why we call it the naive Bayesian classifier). This is reflected in the fact that each attribute has an independent vote in the final score. However, imagine that I measure the words “home” and “mortgage”. Observing “mortgage” certainly raises the probability of observing “home”. We say that they are positively correlated. It would therefore be fairer if we attributed a smaller weight to “home” if we already observed “mortgage”, because they partly convey the same thing: this email is about mortgages for your home. One way to obtain a fairer voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy, because too many parameters may lead to overfitting. # 6.6 The Idea In a Nutshell Consider Figure ??. We can classify data by building a model of how the data was generated. For NB we first decide whether we will generate a data-item from class \( Y = 0 \) or class \( Y = 1 \). Given that decision, we generate the values for \( D \) attributes independently. Each class has a different model for generating attributes. Classification is achieved by computing which model was more likely to generate the new data-point, biasing the outcome towards the class that is expected to generate more data. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 44 Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER ## Introduction The Naive Bayesian Classifier is a simple yet powerful algorithm based on Bayes' theorem, used extensively in machine learning for classification tasks. ## Key Concepts 1. **Bayes' Theorem**: \[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \] 2. **Assumptions**: - Features are statistically independent given the class label. - The prior distribution of each class is known or estimated from data. ## Types of Naive Bayes Classifiers - **Gaussian Naive Bayes**: Assumes continuous data follows a normal distribution. - **Multinomial Naive Bayes**: Primarily used for discrete features, such as word counts. - **Bernoulli Naive Bayes**: Similar to multinomial but assumes binary features. ## Classification Process 1. **Calculate Prior Probabilities**: \[ P(C_k) = \frac{N_k}{N} \] 2.
**Calculate Likelihoods**: - For Gaussian: \[ P(x|C_k) = \frac{1}{\sqrt{2\pi \sigma_k^2}} e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}} \] 3. **Apply Bayes' Theorem**: \[ P(C_k|x) = \frac{P(x|C_k) \cdot P(C_k)}{P(x)} \] 4. **Choose Class with Maximum Posterior**: \[ \hat{C} = \arg\max_{C_k} P(C_k|x) \] ## Advantages - Easy to implement. - Requires a small amount of training data. - Performs well in multi-class problems. ## Disadvantages - The assumption of feature independence is often violated in real-world applications. - Poor performance when features are highly correlated. ## Conclusion The Naive Bayesian Classifier serves as an excellent baseline for classification tasks and is crucial for understanding more advanced techniques in machine learning. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 45 Context: # Chapter 7 ## The Perceptron We will now describe one of the simplest parametric classifiers: the **perceptron** and its cousin the **logistic regression** classifier. However, despite its simplicity, it should not be underestimated! It is the workhorse for most companies involved with some form of machine learning (perhaps tying with the **decision tree** classifier). One could say that it represents the canonical parametric approach to classification, where we believe that a straight line is sufficient to separate the two classes of interest. An example of this is given in Figure ??, where the assumption that the two classes can be separated by a line is clearly valid. However, this assumption need not always be true. Looking at Figure ??, we clearly observe that there is no straight line that will do the job for us. What can we do? Our first inclination is probably to try and fit a more complicated separation boundary. However, there is another trick that we will be using often in this book. Instead, we can increase the dimensionality of the space by “measuring” more things about the data. These features can be highly nonlinear functions of the original attributes. The simplest choice may be to also measure \( \phi_k(X) = X^2_k \) for each attribute \( X_k \). But we may also measure cross-products such as \( \phi_{ij}(X) = X_i X_j \), \( \forall i, j \). The latter allow you to explicitly model correlations between attributes. For example, if \( X_i \) represents the presence (1) or absence (0) of the word “viagra” and similarly \( X_j \) the presence/absence of the word “dysfunction”, then the cross-product feature \( X_i X_j \) lets you model the presence of both words simultaneously (which should be helpful in trying to find out what this document is about). We can add as many features as we like, adding another dimension for every new feature. In this higher dimensional space, we can now be more confident in assuming that the data can be separated by a line. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 46 Context: # CHAPTER 7. THE PERCEPTRON I would like to warn the reader at this point that more features are not necessarily a good thing if the new features are uninformative for the classification task at hand. The problem is that they introduce noise in the input that can mask the actual signal (i.e. the good, discriminative features).
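As a concrete illustration of the feature-expansion trick described above, the following sketch (not from the book) augments each input with the squares \(\phi_k(X) = X_k^2\) and the cross-products \(\phi_{ij}(X) = X_i X_j\):

```python
import numpy as np

def expand_features(x):
    """Append squares X_k^2 and cross-products X_i * X_j (i < j) to the raw attributes."""
    n = len(x)
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x, x ** 2, np.array(cross)])

# E.g. x = (presence of "viagra", presence of "dysfunction"):
print(expand_features(np.array([1.0, 1.0])))  # [1. 1. 1. 1. 1.] -- the product feature fires
print(expand_features(np.array([1.0, 0.0])))  # [1. 0. 1. 0. 0.] -- it does not
```

A classifier that is linear in this expanded space corresponds to a quadratic decision boundary in the original space.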
In fact, there is a whole subfield of ML that deals with selecting relevant features from a set that is too large. The problem of too many dimensions is sometimes called "the curse of high dimensionality." Another way of seeing this is that more dimensions often lead to more parameters in the model (as is the case for the perceptron) and can hence lead to overfitting. To combat that, in turn, we can add regularizers, as we will see in the following. With the introduction of regularizers, we can sometimes play magic and use an infinite number of features. How we play this magic will be explained when we discuss kernel methods in the next sections. But let us first start simple with the perceptron. ## 7.1 The Perceptron Model Our assumption is that a line can separate the two classes of interest. To make our life a little easier, we will switch to the \( Y \in \{+1, -1\} \) representation. With this, the condition can be expressed mathematically as: \[ Y_n \approx \text{sign}\left(\sum_k w_k X_{nk} - \alpha\right) \tag{7.1} \] where "sign" is the sign-function (+1 for nonnegative reals and -1 for negative reals). We have introduced \( K + 1 \) parameters \( \{w_1, \ldots, w_K, \alpha\} \), which define the line for us. The vector \( w \) represents the direction orthogonal to the decision boundary depicted in Figure ??. For example, a line through the origin is represented by \( w^T x = 0 \), i.e., all vectors \( x \) with a vanishing inner product with \( w \). The scalar quantity \( \alpha \) represents the offset of the line \( w^T x = \alpha \) from the origin, i.e., the shortest distance from the origin to the line. This can be seen by writing the points on the line as \( x = y + v \), where \( y \) is a fixed vector pointing to an arbitrary point on the line and \( v \) is a vector lying in the line, starting at that point (see Figure ??). Here, \( w^T (y + v) - \alpha = 0 \). Since by definition \( w^T v = 0 \), we find \( w^T y = \alpha \), which means that \( \alpha \) is the projection of \( y \) onto \( w \), which for normalized \( w \) is the shortest distance from the origin to the line. 1. Note that we can replace \( X_{nk} \rightarrow \phi_k(X_n) \), but for the sake of simplicity we will refrain from doing so at this point. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 47 Context: # 7.1. THE PERCEPTRON MODEL We would like to estimate these parameters from the data (which we will do in a minute), but it is important to notice that the number of parameters is fixed in advance. In some sense, we believe so much in our assumption that the data is linearly separable that we stick to it irrespective of how many data-cases we will encounter. This fixed capacity of the model is typical for parametric methods, but perhaps a little unrealistic for real data. A more reasonable assumption is that the decision boundary may become more complex as we see more data. Too few data-cases simply do not provide the resolution (evidence) necessary to see more complex structure in the decision boundary. Recall that non-parametric methods, such as the “nearest-neighbors” classifiers, actually do have this desirable feature. Nevertheless, the linear separability assumption comes with some computational advantages as well, such as very fast class prediction on new test data. I believe that this computational convenience may be at the root of its popularity.
By the way, when we take the limit of an infinite number of features, we will have happily returned to the land of “non-parametrics”, but we will have to exercise a little patience before we get there. Now let’s write down a cost function that we wish to minimize in order for our linear decision boundary to become a good classifier. Clearly, we would like to control performance on future, yet unseen test data. However, this is a little hard (since we don’t have access to this data by definition). As a surrogate, we will simply fit the line parameters on the training data. It cannot be stressed enough that this is dangerous in principle due to the phenomenon of overfitting (see section ??). If we have introduced very many features and no form of regularization, then we have many parameters to fit. When this capacity is too large relative to the number of data-cases at our disposal, we will be fitting the idiosyncrasies of this particular dataset and these will not carry over to the future test data. So, one should split off a subset of the training data and reserve it for monitoring performance (one should not use this set in the training procedure). Cycling through multiple splits and averaging the result was the cross-validation procedure discussed in section ??. If we do not use too many features relative to the number of data-cases, the model class is very limited and overfitting is not an issue. (In fact, one may want to worry more about “underfitting” in this case.) Ok, so now that we agree on writing down a cost on the training data, we need to choose an explicit expression. Consider now the following choice: \[ C(w, \alpha) = \frac{1}{2N} \sum_{n=1}^{N} (Y_n - w^T X_n + \alpha)^2 \tag{7.2} \] #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 48 Context: # CHAPTER 7. THE PERCEPTRON where we have rewritten \( w^TX_n = \sum_k w_kX_{nk} \). If we minimize this cost, then \( w^TX_n - \alpha \) tends to be positive when \( Y_n = +1 \) and negative when \( Y_n = -1 \). This is what we want! Once optimized, we can then easily use our optimal parameters to perform prediction on new test data \( X_{\text{test}} \) as follows: \[ \hat{Y}_{\text{test}} = \text{sign}\left(\sum_k w_k X_{\text{test},k} - \alpha\right) \tag{7.3} \] where \( \hat{Y} \) is used to indicate the predicted value for \( Y \). So far so good, but how do we obtain our values for \((w^*, \alpha^*)\)? The simplest approach is to compute the gradient and slowly descend on the cost function (see appendix ?? for background). In this case, the gradients are simple: \[ \nabla_w C(w, \alpha) = -\frac{1}{N} \sum_{n=1}^{N}(Y_n - w^TX_n + \alpha)X_n = -\frac{1}{N}X(Y - X^Tw + \alpha\mathbf{1}) \tag{7.4} \] \[ \nabla_\alpha C(w, \alpha) = \frac{1}{N} \sum_{n=1}^{N}(Y_n - w^TX_n + \alpha) = \frac{1}{N}\mathbf{1}^T(Y - X^Tw + \alpha\mathbf{1}) \tag{7.5} \] where in the latter matrix expressions we have used the convention that \( X \) is the matrix with columns \( X_n \) and \( \mathbf{1} \) is the vector of all ones. Our gradient descent is now simply given as, \[ w_{t+1} = w_t - \eta \nabla_w C(w_t, \alpha_t) \tag{7.6} \] \[ \alpha_{t+1} = \alpha_t - \eta \nabla_\alpha C(w_t, \alpha_t) \tag{7.7} \] Iterating these equations until convergence will minimize the cost function. One may criticize plain vanilla gradient descent for many reasons. For example, you need to choose the stepsize \( \eta \) carefully, or risk either excruciatingly slow convergence or exploding values of the iterates \( w_t \). Even if convergence is achieved asymptotically, it is typically slow.
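In code, the descent of Eqs. 7.6–7.7 on the cost of Eq. 7.2 is only a few lines. A minimal sketch, not the book's implementation; the stepsize, iteration count, and zero initialization are arbitrary illustrative choices, and `X` is stored as an \(N \times K\) array with one row per data-case.

```python
import numpy as np

def fit_linear_classifier(X, Y, eta=0.1, iters=1000):
    """Gradient descent (Eqs. 7.6-7.7) on the quadratic cost of Eq. 7.2."""
    N, K = X.shape
    w, alpha = np.zeros(K), 0.0
    for _ in range(iters):
        r = Y - X @ w + alpha       # residuals Y_n - w^T X_n + alpha
        w += eta * (X.T @ r) / N    # w - eta * grad_w C, with grad_w from Eq. 7.4
        alpha -= eta * r.mean()     # alpha - eta * grad_alpha C, from Eq. 7.5
    return w, alpha

def predict(w, alpha, X_new):
    """Eq. 7.3: predict the sign of w^T x - alpha."""
    return np.sign(X_new @ w - alpha)
```

The stochastic variant discussed next replaces the full-batch residual `r` by that of a single randomly chosen data-case.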
Using a Newton-Raphson method will improve convergence properties considerably, but it is also very expensive. Many methods have been developed to improve the optimization of the cost function, but that is not the focus of this book. However, I do want to mention a very popular approach to optimization on very large datasets known as “stochastic gradient descent”. The idea is to select a single data-item randomly and perform an update on the parameters based on that item alone: \[ w_{t+1} = w_t + \eta(Y_n - w_t^TX_n + \alpha_t)X_n \tag{7.8} \] \[ \alpha_{t+1} = \alpha_t - \eta(Y_n - w_t^TX_n + \alpha_t) \tag{7.9} \] #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 49 Context: # 7.2 A Different Cost Function: Logistic Regression The fact that we are picking data-cases randomly injects noise into the updates, so even close to convergence we are "wiggling around" the solution. If we decrease the stepsize, however, the wiggles get smaller. So a sensible strategy seems to be to slowly decrease the stepsize and wiggle our way to the solution. This stochastic gradient descent is actually very efficient in practice if we can find a good annealing schedule for the stepsize. Why is that? It seems that if we use more data-cases in a mini-batch to perform a parameter update, we should be able to make larger steps in parameter space by using bigger stepsizes. While this reasoning holds close to the solution, it does not hold far away from the solution. The intuitive reason is that far away from convergence every datapoint will tell you the same story: move in direction X to improve your model. You simply do not need to query many datapoints in order to extract that information. So for a bad model there is a lot of redundancy in the information that data-cases can convey about improving the parameters, and querying a few is sufficient. Closer to convergence you need to either use more data or decrease the stepsize to increase the resolution of your gradients. This type of reasoning clearly makes an effort to include the computational budget as part of the overall objective. This, as we have argued in chapter ??, is the distinguishing feature of machine learning. If you are not convinced about how important this is in the face of modern-day datasets, imagine the following. Company C organizes a contest where they provide a virtually infinite dataset for some prediction task. You can earn 1 million dollars if you make accurate predictions on some test set by Friday next week. You can choose between a single parameter update based on all the data or many updates on small subsets of the data. Who do you think will win the contest? ## 7.2 A Different Cost Function: Logistic Regression The cost function of Eq. 7.2 penalizes gross violations of one's predictions rather severely (quadratically). This is sometimes counter-productive, because the algorithm might get obsessed with improving the performance of one single data-case at the expense of all the others. The real cost simply counts the number of mislabelled instances, irrespective of how bad the prediction \( w^T X_n - \alpha \) was. So, a different cost function is often used: \[ C(w, \alpha) = \frac{1}{N} \sum_{n=1}^{N} \left[1 - Y_n \tanh(w^T X_n - \alpha)\right] \tag{7.10} \] #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 50 Context: # CHAPTER 7. THE PERCEPTRON The function `tanh(·)` is plotted in Figure ??
It shows that the cost incurred by a single data-case can never be larger than 2, which ensures robustness against outliers. We leave it to the reader to derive the gradients and formulate the gradient descent algorithm. ## 7.3 The Idea In a Nutshell Figure ?? tells the story. One assumes that your data can be separated by a line. Any line can be represented by `w^T x = α`. Data cases from one class satisfy `w^T x_n ≤ α`, while data cases from the other class satisfy `w^T x_n ≥ α`. To achieve that, you write down a cost function that penalizes data cases falling on the wrong side of the line and minimize it over `(w, α)`. For a test case, you simply compute the sign of `w^T x_{test} - α` to make a prediction as to which class it belongs to. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 51 Context: # Chapter 8 ## Support Vector Machines Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form \(\{(x_i, y_i), i = 1, \ldots, n\}\) with \(x_i \in \mathbb{R}^d\) and \(y_i \in \{-1, +1\}\). We call \(\{x_i\}\) the co-variates or input vectors and \(\{y_i\}\) the response variables or labels. We consider a very simple example where the data are in fact linearly separable: i.e., I can draw a straight line \(f(x) = w^T x - b\) such that all cases with \(y_i = -1\) fall on one side and have \(f(x_i) < 0\), and cases with \(y_i = +1\) fall on the other and have \(f(x_i) > 0\). Given that we have achieved that, we could classify new test cases according to the rule \(\hat{y}_{test} = \text{sign}(f(x_{test}))\). However, typically there are infinitely many such hyper-planes, obtained by small perturbations of a given solution. How do we choose between all these hyper-planes which solve the separation problem for our training data, but may have different performance on the newly arriving test cases? For instance, we could choose a plane that lies very close to the training cases of one class, say \(y = -1\). Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified as \(y = +1\), but we will very easily make mistakes on the cases with \(y = -1\) (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible choice thus seems to be a separation line as far away from both the \(y = -1\) and \(y = +1\) training cases as we can manage, i.e., right in the middle. Geometrically, the vector \(w\) is directed orthogonal to the line defined by \(w^T x = b\). This can be understood as follows. First take \(b = 0\). Now it is clear that all vectors \(x\) with vanishing inner product with \(w\) satisfy this equation, i.e., all vectors orthogonal to \(w\) satisfy this equation. Now translate the hyperplane away from the origin over a vector \(a\). The equation for the plane now becomes: \((x - a)^T w = 0\). #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 52 Context: # CHAPTER 8. SUPPORT VECTOR MACHINES i.e., we find for the offset \( b = a^T w \), which is the projection of \(a\) onto the vector \(w\). Without loss of generality we may thus choose \(a\) perpendicular to the plane, in which case the length \( \|a\| = |b|/\|w\| \) represents the shortest, orthogonal distance between the origin and the hyperplane. We now define two more hyperplanes parallel to the separating hyperplane. They represent the planes that cut through the closest training examples on either side.
We will call them "support hyper-planes" in the following, because the data-vectors they contain support the plane. We define the distance between the these hyperplanes and the separating hyperplane to be \( d_- \) and \( d_+ \), respectively. The margin, \( \gamma \), is defined to be \( d_- + d_+ \). Our goal is now to find a the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both. We can write the following equations for the support hyperplanes: \[ w^T x = b + \delta \tag{8.1} \] \[ w^T x = b - \delta \tag{8.2} \] We now note that we have over-parameterized the problem: if we scale \( w, b \) and \( \delta \) by a constant factor \( \alpha \), the equations for \( x \) are still satisfied. To remove this ambiguity we will require that \( \delta = 1 \), this sets the scale of the problem, i.e. if we measure distance in millimeters or meters. We can now also compute the values for \( d_- = (||b+1|-|b||)/||w|| = 1/||w|| \) (this is only true if \( b > (-1,0) \) since the origin doesn’t fall in between the hyperplanes in that case. If \( b < (-1,0) \) you should use \( d_- = (||b+1| + |b||)/||w|| = 1/||w|| \)). Hence the margin is equal to twice that value: \( \gamma = 2/||w|| \). With the above definition of the support planes we can write down the following constraint that any solution must satisfy: \[ w^T x_i - b \leq -1 \quad \forall y_i = -1 \tag{8.3} \] \[ w^T x_i - b \geq 1 \quad \forall y_i = +1 \tag{8.4} \] or in one equation, \[ y_i(w^T x_i - b) \geq 1 \quad \forall i \tag{8.5} \] We now formulate the primal problem of the SVM: \[ \begin{align*} \text{minimize}_{w,b} & \quad \frac{1}{2} ||w||^2 \\ \text{subject to} & \quad y_i(w^T x_i - b) \geq 1 \quad \forall i \tag{8.6} \end{align*} \] #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 53 Context: Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyper-planes. The data-cases that lie on the hyperplane are called support vectors, since they support the hyper-planes and hence determine the solution to the problem. The primal problem can be solved by a quadratic program. However, it is not ready to be kernelized, because its dependence is not only on inner products between data-vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian, \[ L(w, b, \alpha) = \frac{1}{2} \| w \|^2 - \sum_{i=1}^{N} \alpha_i [y_i (w^T x_i - b) - 1] \] (8.7) The solution that minimizes the primal problem subject to the constraints is given by \(\min_{w, b} \max_{\alpha} L(w, \alpha)\), i.e., a saddle point problem. When the original objective function is convex, (and only then), we can interchange the minimization and maximization. Doing that, we find that we can find the condition on \(w\) that must hold at the saddle point we are solving for. This is done by taking derivatives with respect to \(w\) and solving. 
\[ w - \sum_{i} \alpha_i y_i x_i = 0 \quad \Rightarrow \quad w^* = \sum_{i} \alpha_i y_i x_i \tag{8.8} \] \[ \sum_{i} \alpha_i y_i = 0 \tag{8.9} \] Inserting this back into the Lagrangian we obtain what is known as the dual problem, maximize \[ L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \tag{8.10} \] subject to \[ \sum_{i} \alpha_i y_i = 0, \quad \alpha_i \geq 0 \quad \forall i \tag{8.11} \] The dual formulation of the problem is also a quadratic program, but note that the number of variables \(\alpha_i\) in this problem is equal to the number of data-cases, \(N\). The crucial point is, however, that this problem only depends on the \(x_i\) through the inner products \(x_i^T x_j\). This is readily kernelized through the substitution \(x_i^T x_j \rightarrow k(x_i, x_j)\). This is a recurrent theme: the dual problem lends itself to kernelization, while the primal problem did not. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 54 Context: # CHAPTER 8. SUPPORT VECTOR MACHINES The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover, that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have \( L_P(w^*) = L_D(\alpha^*) \), i.e. the “duality gap” is zero. Next we turn to the conditions that must necessarily hold at the saddle point and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives w.r.t. \(w\) and \(b\) to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important condition called “complementary slackness” needs to be satisfied. \[ \begin{align*} \partial_w L_P = 0 & \rightarrow \quad w - \sum_i \alpha_i y_i x_i = 0 \quad (8.12) \\ \partial_b L_P = 0 & \rightarrow \sum_i \alpha_i y_i = 0 \quad (8.13) \\ \text{constraint-1} & \quad y_i (w^T x_i - b) - 1 \geq 0 \quad (8.14) \\ \text{multiplier condition} & \quad \alpha_i \geq 0 \quad (8.15) \\ \text{complementary slackness} & \quad \alpha_i [y_i (w^T x_i - b) - 1] = 0 \quad (8.16) \end{align*} \] It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied but not saturated, \( y_i (w^T x_i - b) - 1 > 0 \), in which case \( \alpha_i \) for that data-case must be zero, or the inequality constraint is saturated, \( y_i (w^T x_i - b) - 1 = 0 \), in which case \( \alpha_i \) can be any value \( \geq 0 \). Inequality constraints which are saturated are said to be “active”, while unsaturated constraints are inactive. One could imagine the process of searching for a solution as a ball which runs down the primal objective function using gradient descent. At some point, it will hit a wall, which is the constraint, and although the derivative is still pointing partially towards the wall, the constraint prohibits the ball from going on. This is an active constraint because the ball is glued to that wall. When a final solution is reached, we could remove some constraints without changing the solution; these are the inactive constraints. One could think of the terms \( \alpha_i y_i x_i \) in \( \partial_w L_P \) as the forces acting on the ball.
We see from the first equation above that only the terms with \( \alpha_i \neq 0 \) exert a force on the ball, which balances with the force \( w \) from the curved quadratic surface. The training cases with \( \alpha_i > 0 \), representing active constraints on the position of the support hyperplane, are called support vectors. These are the vectors #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 55 Context: # 8.1. THE NON-SEPARABLE CASE that are situated in the support hyperplane, and they determine the solution. Typically, there are only a few of them, which people call a "sparse" solution (most \(\alpha\)'s vanish). What we are really interested in is the function \( f(\cdot) \) which can be used to classify future test cases: \[ f(x) = w^{*T} x - b^* = \sum_{i} \alpha_i y_i x_i^T x - b^* \tag{8.17} \] As an application of the KKT conditions, we derive a solution for \( b^* \) by using the complementary slackness condition: \[ b^* = \sum_{j} \alpha_j y_j x_j^T x_i - y_i, \quad i \text{ a support vector} \tag{8.18} \] where we used \( y_i^2 = 1 \). So, using any support vector one can determine \( b^* \), but for numerical stability it is better to average over all of them (although they should obviously be consistent). The most important conclusion is again that this function \( f(\cdot) \) can be expressed solely in terms of inner products \( x_i^T x \), which we can replace with kernel evaluations \( k(x_i, x) \) to move to high-dimensional non-linear spaces. Moreover, since \( \alpha \) is typically very sparse, we don’t need to evaluate many kernel entries in order to predict the class of a new input \( x \). ## 8.1 The Non-Separable Case Obviously, not all datasets are linearly separable, and so we need to change the formalism to account for that. Clearly, the problem lies in the constraints, which cannot always be satisfied. So, let’s relax those constraints by introducing "slack variables" \( \xi_i \): \[ w^T x_i - b \leq -1 + \xi_i \quad \forall y_i = -1 \tag{8.19} \] \[ w^T x_i - b \geq 1 - \xi_i \quad \forall y_i = +1 \tag{8.20} \] \[ \xi_i \geq 0 \quad \forall i \tag{8.21} \] The variables \( \xi_i \) allow for violations of the constraints. We should penalize the objective function for these violations; otherwise the above constraints become void (simply always pick \( \xi_i \) very large). Penalty functions of the form \( C \sum_{i} \xi_i^k \) #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 56 Context: # CHAPTER 8. SUPPORT VECTOR MACHINES
The new primal problem thus becomes: \[ \begin{align*} \text{minimize}_{w, b, \xi} \quad L_P & = \frac{1}{2} ||w||^2 + C \sum_{i} \xi_i \\ \text{subject to} \quad y_i(w^T x_i - b) & \geq 1 + \xi_i \quad \forall i \\ \xi_i & \geq 0 \quad \forall i \end{align*} \] leading to the Lagrangian, \[ L(w, b, \xi, \alpha, \mu) = \frac{1}{2} ||w||^2 + C \sum_{i} \xi_i - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1 + \xi_i] - \sum_{i=1}^{N} \mu_i \xi_i \] from which we derive the KKT conditions: 1. \( \partial_w L_P = 0 \) \(\Rightarrow\) \( w = \sum_i \alpha_i y_i x_i \) 2. \( \partial_b L_P = 0 \) \(\Rightarrow\) \( \sum_i \alpha_i y_i = 0 \) 3. \( \partial_{\xi} L_P = 0 \) \(\Rightarrow\) \( C - \alpha_i - \mu_i = 0 \) 4. **constraint-1** \quad \( y_i(w^T x_i - b) - 1 + \xi_i \geq 0 \) 5. **constraint-2** \quad \( \xi_i \geq 0 \) 6. **multiplier condition-1** \quad \( \alpha_i \geq 0 \) 7. **multiplier condition-2** \quad \( \mu_i \geq 0 \) 8. **complementary slackness-1** \quad \( \alpha_i [y_i(w^T x_i - b) - 1 + \xi_i] = 0 \) 9. **complementary slackness-2** \quad \( \mu_i \xi_i = 0 \) From here we can deduce the following facts. If we assume that \( \xi_i > 0 \), then \( \mu_i = 0 \) (9), hence \( \alpha_i = C (1) \) and thus \( \xi_i = 1 - y_i(w^T x_i - b) \) (8). Also, when \( \xi_i = 0 \) we have \( \mu_i > 0 \) and hence \( \alpha_i < C \). If in addition to \( \xi_i = 0 \) we also have that \( y_i(w^T x_i - b) = 1 \) then \( \alpha_i > 0 \) (8). Otherwise, if \( y_i(w^T x_i - b) > 1 \) then \( \alpha_i = 0 \) (8). #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 57 Context: # 8.1. THE NON-SEPARABLE CASE In summary, as before for points not on the support plane and on the correct side we have \( \xi_i = 0 \) (all constraints inactive). On the support plane, we still have \( \xi_i = 0 \), but now \( \alpha_i > 0 \). Finally, for data-cases on the wrong side of the support hyperplane \( \alpha_i \) max-out to \( \alpha_i = C \) and the \( \xi_i \) balance the violation of the constraint such that \( y_i(w^T x_i - b) - 1 + \xi_i = 0 \). Geometrically, we can calculate the gap between support hyperplane and the violating data-case to be \( \xi_i/||w|| \). This can be seen because the plane defined by \( y_i(w^T x - b) - 1 + \xi_i = 0 \) is parallel to the support plane at a distance \( 1 + y_i b - \xi_i/||w|| \) from the origin. Since the support plane is at a distance \( 1 + y_i b/||w|| \) the result follows. Finally, we need to convert to the dual problem to solve it efficiently and to kernelise it. Again, we use the KKT equations to get rid of \( w \), \( b \), and \( \xi \). ## Formulation \[ \begin{align*} \text{maximize} & \quad L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \\ \text{subject to} & \quad \sum_{i} \alpha_i y_i = 0 \\ & \quad 0 \leq \alpha_i \leq C \quad \forall i \end{align*} \] Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers \( \alpha_i \) which now live in a box. This constraint is derived from the fact that \( \alpha_i = C - \mu_i \) and \( \mu_i \geq 0 \). We also note that it only depends on inner products \( x_i^T x_j \) which are ready to be kernelised. 
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 58 Context:

# CHAPTER 8: SUPPORT VECTOR MACHINES

## Introduction

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.

## Key Concepts

1. **Hyperplane**: A hyperplane is a decision boundary that helps to categorize data points.
2. **Support Vectors**: Support vectors are the data points that are closest to the hyperplane and influence its position and orientation.
3. **Margin**: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin.

## Implementation Steps

- **Step 1**: Choose the appropriate kernel (linear, polynomial, RBF).
- **Step 2**: Train the SVM model using the training dataset.
- **Step 3**: Evaluate the model using a test dataset.

## Advantages of SVM

- Effective in high-dimensional spaces.
- Robust against overfitting, especially in high-dimensional datasets.

## Disadvantages of SVM

- Less effective on very large datasets.
- Poor performance with overlapping classes.

## Conclusion

Support Vector Machines are powerful tools for classification and regression tasks, offering advantages in high-dimensional spaces while having limitations on very large datasets.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 59 Context:

# Chapter 9

## Support Vector Regression

In kernel ridge regression we have seen that the final solution was not sparse in the variables \( \alpha \). We will now formulate a regression method that is sparse, i.e., it has the concept of support vectors that determine the solution. The thing to notice is that the sparseness arose from the complementary slackness conditions, which in turn came from the fact that we had inequality constraints. In the SVM, the penalty that was paid for being on the wrong side of the support plane was given by \( C \sum_i \xi_i^k \) for positive integer \( k \), where \( \xi_i \) is the (scaled) orthogonal distance away from the support plane. Note that the term \( \|w\|^2 \) was there to penalize large \( w \) and hence to regularize the solution. Importantly, there was no penalty if a data-case was on the right side of the plane. Because all those data-points have no effect on the final solution, the \( \alpha \) was sparse.

Here we do the same thing: we introduce a penalty for being too far away from the predicted function \( w^T\Phi(x) + b \), but once you are close enough, i.e., in some "epsilon-tube" around this function, there is no penalty. We thus expect that all the data-cases which lie inside the tube will have no impact on the final solution and hence have corresponding \( \alpha_i = 0 \). Using the analogy of springs: in the case of ridge-regression the springs were attached between the data-cases and the decision surface, hence every item had an impact on the position of this boundary through the force it exerted (recall that the surface was made of "rubber" and pulled back, either because it was parameterized using a finite number of degrees of freedom or because it was regularized). For SVR there are only springs attached between data-cases outside the tube, and these attach to the tube, not the decision boundary. Hence, data-items inside the tube have no impact on the final solution (or rather, changing their position slightly doesn't perturb the solution).
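The ε-tube idea can be summarized in one line of code; the sketch below is an illustration of the tube penalty only (not the chapter's full quadratic-penalty formulation): it is zero inside the tube and grows with the violation outside it:

```python
import numpy as np

def eps_insensitive(y_true, y_pred, eps):
    """Zero inside the epsilon-tube, proportional to the violation outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)
```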
We introduce different constraints for violating the tube constraint from above

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 60 Context:

# CHAPTER 9. SUPPORT VECTOR REGRESSION

and from below:

\[ \begin{align*} \text{minimize}_{w, \xi, \hat{\xi}} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i (\xi_i^2 + \hat{\xi}_i^2) \\ \text{subject to} \quad & w^T\Phi_i + b - y_i \leq \epsilon + \xi_i \quad \forall i \\ & y_i - w^T\Phi_i - b \leq \epsilon + \hat{\xi}_i \quad \forall i \end{align*} \tag{9.1} \]

The primal Lagrangian becomes,

\[ L_P = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i (\xi_i^2 + \hat{\xi}_i^2) + \sum_i \alpha_i (w^T\Phi_i + b - y_i - \epsilon - \xi_i) + \sum_i \hat{\alpha}_i (y_i - w^T\Phi_i - b - \epsilon - \hat{\xi}_i) \tag{9.2} \]

**Remark 1:** We could have added the constraints \( \xi_i \geq 0 \) and \( \hat{\xi}_i \geq 0 \). However, it is not hard to see that the final solution satisfies them automatically, and there is no sense in constraining the optimization further. To see this, imagine some \( \xi_i \) is negative; then, by setting \( \xi_i = 0 \), the cost is lower and none of the constraints is violated, so it is preferred. We also note that, due to the above reasoning, we will always have at least one of \( \xi_i, \hat{\xi}_i \) equal to zero: inside the tube both are zero, outside the tube one of them is zero. This means that at the solution we have \( \xi_i \hat{\xi}_i = 0 \).

**Remark 2:** Note that we don't scale \( \epsilon = 1 \) like in the SVM case. The reason is that \( \{y_i\} \) now determines the scale of the problem, i.e., we have not over-parameterized the problem.

We now take the derivatives w.r.t. \( w \), \( b \), and \( \xi_i, \hat{\xi}_i \) to find the following KKT conditions (there are more, of course),

\[ w = \sum_i (\hat{\alpha}_i - \alpha_i)\Phi_i \tag{9.3} \]

\[ \xi_i = \frac{\alpha_i}{C}, \quad \hat{\xi}_i = \frac{\hat{\alpha}_i}{C} \tag{9.4} \]

Plugging this back in, and using that now we also have \( \alpha_i \hat{\alpha}_i = 0 \), we find the dual problem,

\[ \begin{align*} \text{maximize}_{\alpha, \hat{\alpha}} \quad & -\frac{1}{2}\sum_{ij} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\left(K_{ij} + \frac{1}{C}\delta_{ij}\right) + \sum_i (\hat{\alpha}_i - \alpha_i) y_i - \epsilon \sum_i (\hat{\alpha}_i + \alpha_i) \\ \text{subject to} \quad & \sum_i (\hat{\alpha}_i - \alpha_i) = 0 \\ & \alpha_i \geq 0, \ \hat{\alpha}_i \geq 0 \quad \forall i \end{align*} \tag{9.5} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 61 Context:

From the complementary slackness conditions we can read off the sparseness of the solution:

\[ \alpha_i (w^T\Phi_i + b - y_i - \epsilon - \xi_i) = 0 \tag{9.6} \]
\[ \hat{\alpha}_i (y_i - w^T\Phi_i - b - \epsilon - \hat{\xi}_i) = 0 \tag{9.7} \]
\[ \alpha_i \hat{\alpha}_i = 0 \tag{9.8} \]

where we added the last condition by hand (it doesn't seem to directly follow from the formulation). Now we clearly see that if a case is above the tube, \( \hat{\xi}_i \) will take on its smallest possible value in order to make the constraint satisfied: \( \hat{\xi}_i = y_i - w^T\Phi_i - b - \epsilon \). This implies that \( \hat{\alpha}_i \) will take on a positive value, and the farther outside the tube the larger the \( \hat{\alpha}_i \) (you can think of it as a compensating force). Note that in this case \( \alpha_i = 0 \). A similar story goes for a case below the tube, with \( \xi_i > 0 \) and \( \alpha_i > 0 \). If a data-case is inside the tube, the \( \alpha_i, \hat{\alpha}_i \) are necessarily zero, and hence we obtain sparseness.

We now change variables to make this optimization problem look more similar to the SVM and ridge-regression cases. Introduce \( \beta_i = \hat{\alpha}_i - \alpha_i \) and use \( \alpha_i \hat{\alpha}_i = 0 \) to write \( \alpha_i + \hat{\alpha}_i = |\beta_i| \).
\[ \begin{align*} \text{maximize}_{\beta} \quad & -\frac{1}{2}\sum_{ij} \beta_i\beta_j \left(K_{ij} + \frac{1}{C}\delta_{ij}\right) + \sum_i \beta_i y_i - \epsilon \sum_i |\beta_i| \\ \text{subject to} \quad & \sum_i \beta_i = 0 \end{align*} \tag{9.9} \]

From the slackness conditions we can also find a value for \( b \) (similar to the SVM case). Also, as usual, the prediction for a new data-case is given by:

\[ y(x) = w^T\Phi(x) + b = \sum_i \beta_i K(x_i, x) + b \tag{9.10} \]

> **Note:** By the way, we could not use the trick from ridge-regression of defining a constant feature \( \Phi_0 = 1 \) and \( b = w_0 \). The reason is that the objective does not depend on \( b \).

It is an interesting exercise for the reader to work her way through the case

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 62 Context:

# CHAPTER 9. SUPPORT VECTOR REGRESSION

where the penalty is linear instead of quadratic, i.e.,

\[ \begin{align*} \text{minimize}_{w, \xi, \hat{\xi}} \quad & \frac{1}{2}\|w\|^2 + C\sum_i (\xi_i + \hat{\xi}_i) \\ \text{subject to} \quad & w^T\Phi_i + b - y_i \leq \epsilon + \xi_i \quad \forall i \\ & y_i - w^T\Phi_i - b \leq \epsilon + \hat{\xi}_i \quad \forall i \tag{9.11} \\ & \xi_i \geq 0, \quad \hat{\xi}_i \geq 0 \quad \forall i \tag{9.12} \end{align*} \]

leading to the dual problem,

\[ \begin{align*} \text{maximize}_{\beta} \quad & -\frac{1}{2}\sum_{ij}\beta_i\beta_j K_{ij} + \sum_i \beta_i y_i - \epsilon \sum_i |\beta_i| \\ \text{subject to} \quad & \sum_i \beta_i = 0 \tag{9.13} \\ & -C \leq \beta_i \leq C \quad \forall i \tag{9.14} \end{align*} \]

where we note that the quadratic penalty on the size of \( \beta \) is replaced by a box constraint, as is to be expected in switching from the \( L_2 \) norm to the \( L_1 \) norm.

Final remark: Let's remind ourselves that the quadratic programs we have derived are convex optimization problems, which have a unique optimal solution that can be found efficiently using numerical methods. This is often claimed as great progress w.r.t. the old neural-network days, which were plagued by many local optima.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 63 Context:

# Chapter 10

## Kernel Ridge Regression

Possibly the most elementary algorithm that can be kernelized is ridge regression. Here our task is to find a linear function that models the dependencies between covariates \( \{x_i\} \) and response variables \( \{y_i\} \), both continuous. The classical way to do that is to minimize the quadratic cost,

\[ C(w) = \frac{1}{2}\sum_i (y_i - w^T x_i)^2 \tag{10.1} \]

However, if we are going to work in feature space, where we replace \( x_i \to \Phi(x_i) \), there is a clear danger that we overfit. Hence we need to regularize. This is an important topic that will return in future classes. A simple yet effective way to regularize is to penalize the norm of \( w \). This is sometimes called "weight-decay". It remains to be determined how to choose the regularization constant \( \lambda \); the most common approach is to use cross-validation or leave-one-out estimates. The total cost function hence becomes,

\[ C = \frac{1}{2}\sum_i (y_i - w^T x_i)^2 + \frac{\lambda}{2}\|w\|^2 \tag{10.2} \]

which needs to be minimized. Taking derivatives and equating them to zero gives,

\[ \sum_i (y_i - w^T x_i)x_i = \lambda w \ \Rightarrow\ w = \left( \lambda I + \sum_i x_i x_i^T \right)^{-1} \left( \sum_j y_j x_j \right) \tag{10.3} \]

We see that the regularization term helps to stabilize the inverse numerically by bounding the smallest eigenvalues away from zero.
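Eqn. (10.3) translates directly into a few lines of numpy; this is a minimal sketch of primal ridge regression (the kernelized version follows in the next section):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """w = (lam*I + sum_i x_i x_i^T)^{-1} (sum_j y_j x_j), eqn (10.3).
    X: (N, d) matrix of covariates, y: (N,) responses."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```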
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 64 Context:

# CHAPTER 10. KERNEL RIDGE REGRESSION

## 10.1 Kernel Ridge Regression

We now replace all data-cases with their feature vectors: \( x_i \to \Phi_i = \Phi(x_i) \). In this case, the number of dimensions can be much higher, or even infinitely higher, than the number of data-cases. There is a neat trick that allows us to perform the inverse above in the smaller space of the two possibilities, either the dimension of the feature space or the number of data-cases. The trick is given by the following identity:

\[ (P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1} \tag{10.4} \]

Now note that if \( B \) is not square, the inverse is performed in spaces of different dimensionality. To apply this to our case, we define the \( d \times N \) matrix \( \Phi \) with columns \( \Phi_i \), and the vector \( \mathbf{y} \) with entries \( y_i \). The solution is then given by:

\[ w = (\lambda I_d + \Phi\Phi^T)^{-1}\Phi\mathbf{y} = \Phi(\Phi^T\Phi + \lambda I_N)^{-1}\mathbf{y} \tag{10.5} \]

This equation can be rewritten as \( w = \sum_i \alpha_i \Phi(x_i) \) with \( \alpha = (\Phi^T\Phi + \lambda I_N)^{-1}\mathbf{y} \). This is an equation that will be a recurrent theme, and it can be interpreted as: the solution \( w \) must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

We finally need to show that we never actually need access to the feature vectors, which could be infinitely long (which would be rather impractical). What we need in practice is the predicted value for a new test point \( x \). This is computed by projecting it onto the solution \( w \),

\[ y(x) = w^T\Phi(x) = \mathbf{y}^T(\Phi^T\Phi + \lambda I_N)^{-1}\kappa(x) \tag{10.6} \]

where \( K_{ij} = \Phi(x_i)^T\Phi(x_j) \) and \( \kappa(x)_i = K(x_i, x) \). The important message here is of course that we only need access to the kernel \( K \).

We can now add a bias to the whole story by adding one more, constant feature to \( \Phi \): \( \Phi_0 = 1 \). The value of \( w_0 \) then represents the bias since,

\[ w^T\Phi = \sum_a w_a \Phi_a + w_0 \tag{10.7} \]

Hence, the story goes through unchanged.
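A minimal sketch of eqns (10.5)-(10.6), assuming the kernel matrix \( K \) has already been computed and `kernel` is a hypothetical pairwise kernel function:

```python
import numpy as np

def krr_fit(K, y, lam):
    """alpha = (K + lam*I)^{-1} y, so that w = sum_i alpha_i Phi(x_i)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(alpha, X_train, x, kernel):
    """y(x) = sum_i alpha_i K(x_i, x), eqn (10.6)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))
```

Note that, as promised, only kernel evaluations are needed; the (possibly infinite-dimensional) feature vectors never appear.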
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 65 Context:

# 10.2 An Alternative Derivation

Instead of optimizing the cost function above, we can introduce Lagrange multipliers into the problem. This will have the effect that the derivation goes along similar lines as the SVM case. We introduce new variables \( \xi_i = y_i - w^T\Phi_i \) and rewrite the objective as the following constrained QP:

\[ \begin{align*} \text{minimize}_{w, \xi} \quad & L_P = \sum_i \xi_i^2 \\ \text{subject to} \quad & y_i = w^T\Phi_i + \xi_i \quad \forall i \tag{10.8} \\ & \|w\|^2 \leq B^2 \tag{10.9} \end{align*} \]

This leads to the Lagrangian:

\[ L_P = \sum_i \xi_i^2 + \sum_i \beta_i [y_i - w^T\Phi_i - \xi_i] + \lambda[\|w\|^2 - B^2] \tag{10.10} \]

Two of the KKT conditions tell us that at the solution we have:

\[ 2\xi_i = \beta_i \quad \forall i, \qquad 2\lambda w = \sum_i \beta_i \Phi_i \tag{10.11} \]

Plugging these back into the Lagrangian, we obtain the dual Lagrangian:

\[ L_D = \sum_i \left( -\frac{1}{4}\beta_i^2 + \beta_i y_i \right) - \frac{1}{4\lambda}\sum_{ij} \beta_i\beta_j K_{ij} - \lambda B^2 \tag{10.12} \]

We now redefine \( \alpha_i = \beta_i/(2\lambda) \) to arrive at the following dual optimization problem:

## Dual Problem

\[ \text{maximize}_{\alpha} \quad L_D = -\lambda^2 \sum_i \alpha_i^2 + 2\lambda \sum_i \alpha_i y_i - \lambda \sum_{ij} \alpha_i\alpha_j K_{ij} - \lambda B^2 \quad \text{s.t.} \quad \lambda \geq 0 \tag{10.13} \]

Taking derivatives w.r.t. \( \alpha \) gives precisely the solution we had already found:

\[ \alpha^* = (K + \lambda I)^{-1}\mathbf{y} \tag{10.14} \]

Formally, we also need to maximize over \( \lambda \). However, different choices of \( \lambda \) correspond to different choices for \( B \). Either \( \lambda \) or \( B \) should be chosen using cross-validation or some other measure, so we may as well vary \( \lambda \) in this process.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 66 Context:

# CHAPTER 10. KERNEL RIDGE REGRESSION

One big disadvantage of ridge-regression is that we don't have sparseness in the \( \alpha \) vector, i.e., there is no concept of support vectors. Sparseness is useful because, when we test a new example, we only have to sum over the support vectors, which is much faster than summing over the entire training set. In the SVM, the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that if a constraint was inactive, then the corresponding multiplier \( \alpha_i \) was zero. There is no such effect here.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 67 Context:

# Chapter 11

## Kernel K-means and Spectral Clustering

The objective in K-means can be written as follows:

\[ C(z, \mu) = \sum_i \|x_i - \mu_{z_i}\|^2 \tag{11.1} \]

where we wish to minimize over the assignment variables \( z_i \) (which can take values \( z_i = 1, \ldots, K \)) for all data-cases \( i \), and over the cluster means \( \mu_k,\ k = 1, \ldots, K \). It is not hard to show that the following iterations achieve that:

\[ z_i = \arg\min_k \|x_i - \mu_k\|^2 \tag{11.2} \]
\[ \mu_k = \frac{1}{N_k}\sum_{i \in C_k} x_i \tag{11.3} \]

where \( C_k \) is the set of data-cases assigned to cluster \( k \).
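The two updates (11.2) and (11.3) are easy to implement; a minimal sketch (random initialization is one common choice among several):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate the assignment (11.2) and mean (11.3) updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # eqn 11.2: assign each case to its nearest mean
        z = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        # eqn 11.3: recompute each mean over its assigned cases
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return z, mu
```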
Now, let's assume we have defined many features, \( \phi(x_i) \), and wish to do clustering in feature space. The objective is similar to before,

\[ C(z, \mu) = \sum_i \|\phi(x_i) - \mu_{z_i}\|^2 \tag{11.4} \]

We will now introduce an \( N \times K \) assignment matrix \( Z_{nk} \), each row of which represents a data-case and contains exactly one 1 in column \( k \) if that case is assigned to cluster \( k \). As a result, we have \( \sum_k Z_{nk} = 1 \) and \( N_k = \sum_n Z_{nk} \). Also define:

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 68 Context:

# 11. KERNEL K-MEANS AND SPECTRAL CLUSTERING

\( L = \text{diag}\left(1 / \sum_n Z_{nk}\right) = \text{diag}(1/N_k) \). Finally, define the matrix \( \Phi \) with columns \( \Phi_n = \phi(x_n) \). With these definitions you can now check that the matrix \( M \) defined as,

\[ M = \Phi Z L Z^T \tag{11.5} \]

consists of \( N \) columns, one for each data-case, where each column contains a copy of the cluster mean \( \mu_k \) to which that data-case is assigned. Using this we can write out the K-means cost as,

\[ C = \text{tr}[(\Phi - M)(\Phi - M)^T] \tag{11.6} \]

Next we can show that \( Z^T Z = L^{-1} \) (check this), and thus that \( (Z L Z^T)^2 = Z L Z^T \). In other words, \( ZLZ^T \) is a projection. Similarly, \( I - Z L Z^T \) is a projection on the complement space. Using this we simplify eqn. 11.6 as,

\[ C = \text{tr}\left[\Phi(I - ZLZ^T)(I - ZLZ^T)\Phi^T\right] \tag{11.7} \]
\[ = \text{tr}\left[\Phi(I - ZLZ^T)\Phi^T\right] \tag{11.8} \]
\[ = \text{tr}[\Phi\Phi^T] - \text{tr}[\Phi ZLZ^T\Phi^T] \tag{11.9} \]
\[ = \text{tr}[K] - \text{tr}[L^{\frac{1}{2}} Z^T K Z L^{\frac{1}{2}}] \tag{11.10} \]

where we used that \( \text{tr}[AB] = \text{tr}[BA] \), and \( L^{\frac{1}{2}} \) is defined by taking the square root of the diagonal elements. Note that only the second term depends on the clustering matrix \( Z \), so we can now formulate the following equivalent kernel clustering problem,

\[ \max_Z \ \text{tr}[L^{\frac{1}{2}} Z^T K Z L^{\frac{1}{2}}] \tag{11.11} \]
\[ \text{such that: } Z \text{ is a binary clustering matrix.} \tag{11.12} \]

This objective is entirely specified in terms of kernels, and so we have once again managed to move to the "dual" representation. Note also that this problem is very difficult to solve due to the constraint, which forces us to search over binary matrices. Our next step will be to approximate this problem through a relaxation of this constraint. First we recall that \( Z^T Z = L^{-1} \Rightarrow L^{\frac{1}{2}} Z^T Z L^{\frac{1}{2}} = I \). Renaming \( H = Z L^{\frac{1}{2}} \), with \( H \) an \( N \times K \) dimensional matrix, we can formulate the following relaxation of the problem,

\[ \max_H \ \text{tr}[H^T K H] \tag{11.13} \]
\[ \text{subject to } H^T H = I \tag{11.14} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 69 Context:

Note that we did not require \( H \) to be binary any longer. The hope is that the solution is close to some clustering solution that we can then extract a posteriori. The above problem should look familiar. Interpret the columns of \( H \) as a collection of \( K \) mutually orthonormal basis vectors. The objective can then be written as,

\[ \sum_{k=1}^{K} h_k^T K h_k \tag{11.15} \]

By choosing \( h_k \) proportional to the \( K \) largest eigenvectors of \( K \) we will maximize the objective, i.e., we have

\[ K = U \Lambda U^T \ \Rightarrow\ H = U_{[:,1:K]} R \tag{11.16} \]

where \( R \) is a rotation inside the eigenvalue space, \( R^T R = R R^T = I \). Using this you can now easily verify that \( \text{tr}[H^T K H] = \sum_{k=1}^{K} \lambda_k \), where \( \{\lambda_k\}, k = 1..K \) are the largest \( K \) eigenvalues. What is perhaps surprising is that the solution to this relaxed kernel-clustering problem is given by kernel-PCA! Recall that for kernel-PCA we also solved for the eigenvalues of \( K \). How then do we extract a clustering solution from kernel-PCA?
Recall that the columns of \( H \) (the eigenvectors of \( K \)) should approximate the binary matrix \( Z \), which had a single 1 per row indicating to which cluster data-case \( n \) is assigned. We could try to simply threshold the entries of \( H \) so that the largest value in each row is set to 1 and the remaining ones to 0. However, it often works better to first normalize the rows of \( H \),

\[ \hat{H}_{nk} = \frac{H_{nk}}{\sqrt{\sum_{k} H_{nk}^2}} \tag{11.17} \]

All rows of \( \hat{H} \) are then located on the unit sphere. We can now run a simple clustering algorithm such as K-means on the data matrix \( \hat{H} \) to extract \( K \) clusters. The above procedure is sometimes referred to as "spectral clustering."

**Conclusion**: Kernel-PCA can be viewed as a nonlinear feature extraction technique. The input is a matrix of similarities (the kernel matrix or Gram matrix), which should be positive semi-definite and symmetric. If you extract two or three features (dimensions) you can use it as a non-linear dimensionality reduction method (for purposes of visualization). If you use the result as input to a simple clustering method (such as K-means), it becomes a nonlinear clustering method.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 70 Context:

# CHAPTER 11: KERNEL K-MEANS AND SPECTRAL CLUSTERING

## 11.1 Introduction

Kernel K-means is an extension of the K-means clustering algorithm. It allows for clustering when the original data are not linearly separable, using a kernel function to compare data-cases in feature space.

## 11.2 Algorithm

The steps of the Kernel K-means algorithm are as follows:

1. **Choose a kernel function** \( K(x_i, x_j) \).
2. **Initialize** \( K \) clusters randomly.
3. **Assign points** to the nearest cluster mean, with distances computed via the kernel.
4. **Update cluster means** implicitly as the mean of the assigned points in feature space.
5. **Repeat** steps 3 and 4 until convergence.

## 11.3 Spectral Clustering

Spectral clustering is another clustering technique that uses the eigenvectors of a similarity matrix of the data; a sketch follows after this section.

### 11.3.1 Algorithm

1. Compute the **similarity matrix**.
2. Compute the **Laplacian matrix** from the similarity matrix.
3. Compute the eigenvalues and eigenvectors of the Laplacian matrix.
4. Use the top \( k \) eigenvectors to form a new feature space.
5. Apply K-means to cluster the data in the new space.

## 11.4 Conclusion

Kernel K-means and spectral clustering provide powerful methods for clustering in complex spaces, enabling better data segmentation in many applications.
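Putting eqns (11.16) and (11.17) together gives the following minimal sketch of the recipe of this chapter (eigenvectors of the kernel matrix itself; the Laplacian-based variant listed above differs only in which matrix is decomposed). It reuses the `kmeans` sketch from earlier as an assumption:

```python
import numpy as np

def spectral_clustering(K_mat, n_clusters):
    """Top eigenvectors of K, row-normalized (eqn 11.17), then K-means."""
    eigvals, eigvecs = np.linalg.eigh(K_mat)      # ascending eigenvalues
    H = eigvecs[:, -n_clusters:]                  # K largest eigenvectors
    H_hat = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-sphere rows
    z, _ = kmeans(H_hat, n_clusters)              # helper sketched earlier
    return z
```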
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 71 Context:

# Chapter 12

## Kernel Principal Components Analysis

Let's first see what PCA is when we do not worry about kernels and feature spaces. We will always assume that we have centered data, i.e., \( \sum_i x_i = 0 \). This can always be achieved by a simple translation of the axes. Our aim is to find meaningful projections of the data. However, we are facing an unsupervised problem where we don't have access to any labels. If we had, we should be doing Linear Discriminant Analysis. Due to this lack of labels, our aim will be to find the subspace of largest variance, where we choose the number of retained dimensions beforehand. This is clearly a strong assumption, because it may happen that there is interesting signal in the directions of small variance, in which case PCA is not a suitable technique (and we should perhaps use a technique called independent component analysis). However, usually it is true that the directions of smallest variance represent uninteresting noise.

To make progress, we start by writing down the sample covariance matrix \( C \):

\[ C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \tag{12.1} \]

The eigenvalues of this matrix represent the variance in the eigen-directions of data-space. The eigenvector corresponding to the largest eigenvalue is the direction in which the data is most stretched out. The second direction is orthogonal to it and picks the direction of largest variance in that orthogonal subspace, and so on. Thus, to reduce the dimensionality of the data, we project the data onto the retained eigen-directions of largest variance:

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 72 Context:

# CHAPTER 12. KERNEL PRINCIPAL COMPONENTS ANALYSIS

\[ U \Lambda U^T = C \ \Rightarrow\ C = \sum_{a} \lambda_a u_a u_a^T \tag{12.2} \]

and the projection is given by,

\[ y_i = U_k^T x_i \quad \forall i \tag{12.3} \]

where \( U_k \) means the \( d \times k \) sub-matrix containing the first \( k \) eigenvectors as columns. As a side effect, we can now show that the projected data are de-correlated in this new basis:

\[ \frac{1}{N} \sum_{i} y_i y_i^T = \frac{1}{N} \sum_{i} U_k^T x_i x_i^T U_k = U_k^T C U_k = U_k^T U \Lambda U^T U_k = \Lambda_k \tag{12.4} \]

where \( \Lambda_k \) is the (diagonal) \( k \times k \) sub-matrix corresponding to the largest eigenvalues. Another convenient property of this procedure is that the \( L_2 \) reconstruction error,

\[ \sum_{i} \| x_i - P_k x_i \|^2 \tag{12.5} \]

where \( P_k = U_k U_k^T \) is the projection onto the subspace spanned by the columns of \( U_k \), is minimal.
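Eqns (12.1)-(12.3) in a minimal numpy sketch:

```python
import numpy as np

def pca_project(X, k):
    """Project data onto the k leading eigen-directions of C (12.1-12.3)."""
    X = X - X.mean(axis=0)              # center the data: sum_i x_i = 0
    C = (X.T @ X) / len(X)              # sample covariance, eqn 12.1
    eigvals, U = np.linalg.eigh(C)      # ascending eigenvalues
    U_k = U[:, -k:][:, ::-1]            # k largest eigen-directions
    return X @ U_k                      # y_i = U_k^T x_i, eqn 12.3
```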
Now imagine that there are more dimensions than data-cases, i.e., some dimensions remain unoccupied by the data. In this case it is not hard to show that the eigenvectors that span the projection space must lie in the subspace spanned by the data-cases. This can be seen as follows:

\[ \lambda_a u_a = C u_a = \frac{1}{N} \sum_{i} x_i x_i^T u_a = \frac{1}{N} \sum_{i} (x_i^T u_a) x_i \ \Rightarrow\ u_a = \sum_{i} \frac{(x_i^T u_a)}{N \lambda_a} x_i \tag{12.6} \]

where \( u_a \) is some arbitrary eigenvector of \( C \). The last expression can be interpreted as: "every eigenvector can be exactly written (i.e., losslessly) as some linear combination of the data-vectors, and hence it must lie in their span". This also implies that, instead of the eigenvalue equations \( C u_a = \lambda_a u_a \), we may consider the \( N \) projected equations \( x_k^T C u_a = \lambda_a x_k^T u_a \). From this equation the coefficients

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 73 Context:

# 12.1 CENTERING DATA IN FEATURE SPACE

\( \alpha_i^a \) in \( u_a = \sum_i \alpha_i^a x_i \) can be computed efficiently in a space of dimension \( N \) (and not \( d \)) as follows:

\[ x_k^T \frac{1}{N} \sum_i x_i x_i^T \sum_j \alpha_j^a x_j = \lambda_a x_k^T \sum_j \alpha_j^a x_j \ \Rightarrow\ \frac{1}{N} \sum_{ij} \alpha_j^a K_{ki} K_{ij} = \lambda_a \sum_j \alpha_j^a K_{kj} \tag{12.7} \]

We now rename the matrix \( [x_i^T x_j] = K_{ij} \) to arrive at,

\[ K^2 \alpha^a = N \lambda_a K \alpha^a \ \Rightarrow\ K \alpha^a = \tilde{\lambda}_a \alpha^a \quad \text{with } \tilde{\lambda}_a = N \lambda_a \tag{12.8} \]

So, we have derived an eigenvalue equation for \( K \), which in turn completely determines the eigenvectors \( u_a \). By requiring that \( u_a \) is normalized, we find,

\[ u_a^T u_a = 1 \ \Rightarrow\ \sum_{ij} \alpha_i^a \alpha_j^a\, x_i^T x_j = \alpha^{aT} K \alpha^a = \tilde{\lambda}_a\, \alpha^{aT} \alpha^a = 1 \ \Rightarrow\ \|\alpha^a\| = \frac{1}{\sqrt{\tilde{\lambda}_a}} \tag{12.9} \]

Finally, when we receive a new data-case \( t \) and we would like to compute its projections onto the new reduced space, we compute,

\[ u_a^T t = \sum_i \alpha_i^a x_i^T t = \sum_i \alpha_i^a K(x_i, t) \tag{12.10} \]

This equation should look familiar; it is central to most kernel methods. Obviously, the whole exposition was set up so that in the end we only needed the matrix \( K \) to do our calculations. This implies that we are now ready to kernelize the procedure by replacing \( x_i \to \Phi(x_i) \) and defining \( K_{ij} = \Phi(x_i)^T \Phi(x_j) \), where we write \( \Phi_i = \Phi(x_i) \).

## 12.1 Centering Data in Feature Space

It is in fact very difficult to explicitly center the data in feature space. But we know that the final algorithm only depends on the kernel matrix, so if we can center the kernel matrix we are done as well. A kernel matrix is given by \( K_{ij} = \Phi_i^T \Phi_j \). We now center the features using,

\[ \hat{\Phi}_i = \Phi_i - \frac{1}{N} \sum_k \Phi_k \tag{12.11} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 74 Context:

# CHAPTER 12. KERNEL PRINCIPAL COMPONENTS ANALYSIS

Hence the kernel in terms of the new features is given by:

\[ \hat{K}_{ij} = \left( \Phi_i - \frac{1}{N} \sum_k \Phi_k \right)^T \left( \Phi_j - \frac{1}{N} \sum_l \Phi_l \right) \tag{12.12} \]
\[ = \Phi_i^T \Phi_j - \frac{1}{N} \sum_k \Phi_k^T \Phi_j - \frac{1}{N} \sum_l \Phi_i^T \Phi_l + \frac{1}{N^2} \sum_{kl} \Phi_k^T \Phi_l \tag{12.13} \]
\[ = K_{ij} - \kappa_i - \kappa_j + \bar{k} \tag{12.14} \]

with

\[ \kappa_i = \frac{1}{N} \sum_k K_{ik} \tag{12.15} \]

and

\[ \bar{k} = \frac{1}{N^2} \sum_{ij} K_{ij} \tag{12.16} \]

Hence, we can compute the centered kernel in terms of the non-centered kernel alone, and no features need to be accessed.
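The centering and eigenvalue steps combine into the following minimal sketch of kernel PCA; it assumes the leading \( k \) eigenvalues of the centered kernel are positive:

```python
import numpy as np

def kernel_pca(K, k):
    """Solve K alpha = lambda~ alpha (eqn 12.8) on the centered kernel
    (eqn 12.14), normalizing ||alpha^a|| = 1/sqrt(lambda~_a) (eqn 12.9)."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one   # centered kernel, eqn 12.14
    lam, A = np.linalg.eigh(Kc)                  # ascending eigenvalues
    lam, A = lam[-k:][::-1], A[:, -k:][:, ::-1]  # k largest
    A = A / np.sqrt(lam)                         # eqn 12.9 (assumes lam > 0)
    return A, lam    # projections of the training cases: Kc @ A
```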
At test-time we need to compute:

\[ \hat{K}(t, x_j) = \left[ \Phi(t) - \frac{1}{N} \sum_k \Phi(x_k) \right]^T \left[ \Phi(x_j) - \frac{1}{N} \sum_l \Phi(x_l) \right] \tag{12.17} \]

Using a similar calculation (left for the reader), you can find that this can be expressed easily in terms of \( K(t, x_j) \) and \( K(x_i, x_j) \) as follows:

\[ \hat{K}(t, x_j) = K(t, x_j) - \frac{1}{N} \sum_l K(t, x_l) - \frac{1}{N} \sum_k K(x_k, x_j) + \bar{k} \tag{12.18} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 75 Context:

# Chapter 13

## Fisher Linear Discriminant Analysis

The most famous example of dimensionality reduction is "principal components analysis". This technique searches for directions in the data that have largest variance and subsequently projects the data onto them. In this way, we obtain a lower-dimensional representation of the data that removes some of the "noisy" directions. There are many difficult issues with how many directions one needs to choose, but that is beyond the scope of this note.

PCA is an unsupervised technique and, as such, does not include label information of the data. For instance, imagine two cigar-like clusters in two dimensions, one cigar with \( y = 1 \) and the other with \( y = -1 \). The cigars are positioned in parallel and very closely together, such that the variance in the total data set, ignoring the labels, is in the direction of the cigars. For classification this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e., in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space).

So the question is: how do we utilize the label information to find informative projections? To that purpose, Fisher-LDA considers maximizing the following objective:

\[ J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{13.1} \]

where \( S_B \) is the "between classes scatter matrix" and \( S_W \) is the "within classes scatter matrix". Note that, due to the fact that scatter matrices are proportional to covariance matrices, we could have defined \( J \) using covariance matrices; the proportionality constant would have no effect on the solution.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 76 Context:

# CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS

The scatter matrices are:

\[ S_B = \sum_{c} N_c (\mu_c - \bar{x})(\mu_c - \bar{x})^T \tag{13.2} \]
\[ S_W = \sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T \tag{13.3} \]

where,

\[ \mu_c = \frac{1}{N_c} \sum_{i \in c} x_i \tag{13.4} \]
\[ \bar{x} = \frac{1}{N} \sum_{i} x_i = \frac{1}{N} \sum_{c} N_c \mu_c \tag{13.5} \]

and \( N_c \) is the number of cases in class \( c \). Oftentimes you will see that for 2 classes \( S_B \) is defined as \( \tilde{S}_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \). This is the scatter of the mean of class 1 with respect to the mean of class 2, and one can show that \( S_B = \frac{N_1 N_2}{N} \tilde{S}_B \); since this boils down to multiplying the objective by a constant, it makes no difference to the final solution.
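The scatter matrices (13.2)-(13.5) and the objective (13.1) are straightforward to compute; a minimal sketch:

```python
import numpy as np

def scatter_matrices(X, labels):
    """S_B and S_W from eqns 13.2-13.5."""
    d = X.shape[1]
    xbar = X.mean(axis=0)
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - xbar, mu_c - xbar)  # eqn 13.2
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                   # eqn 13.3
    return S_B, S_W

def fisher_objective(w, S_B, S_W):
    """J(w) of eqn 13.1."""
    return (w @ S_B @ w) / (w @ S_W @ w)
```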
Why does this objective make sense? Well, it says that a good solution is one where the class means are well separated, measured relative to the (sum of the) variances of the data assigned to a particular class. This is precisely what we want, because it implies that the gap between the classes is expected to be big. It is also interesting to observe that, since the total scatter,

\[ S_T = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \tag{13.6} \]

is given by \( S_T = S_W + S_B \), the objective can be rewritten as,

\[ J(w) = \frac{w^T S_T w}{w^T S_W w} - 1 \tag{13.7} \]

and hence can be interpreted as maximizing the total scatter of the data while minimizing the within scatter of the classes.

An important property to notice about the objective \( J \) is that it is invariant w.r.t. rescalings of the vector, \( w \to \alpha w \). Hence, we can always choose \( w \) such that the denominator is simply \( w^T S_W w = 1 \), since it is a scalar itself. For this reason we can transform the problem of maximizing \( J \) into the following constrained

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 77 Context:

# 13.1 Kernel Fisher LDA

optimization problem,

\[ \begin{align*} \min_w \quad & -\frac{1}{2} w^T S_B w \tag{13.8} \\ \text{s.t.} \quad & w^T S_W w = 1 \tag{13.9} \end{align*} \]

corresponding to the Lagrangian,

\[ L_P = -\frac{1}{2} w^T S_B w + \frac{1}{2} \lambda (w^T S_W w - 1) \tag{13.10} \]

(the halves are added for convenience). The KKT conditions tell us that the following equation needs to hold at the solution,

\[ S_B w = \lambda S_W w \tag{13.11} \]

This almost looks like an eigenvalue equation. In fact, it is called a generalized eigen-problem, and just like a normal eigenvalue problem there are standard ways to solve it. It remains to choose which eigenvalue and eigenvector correspond to the desired solution. Plugging the solution back into the objective \( J \), we find,

\[ J(w) = \frac{w^T S_B w}{w^T S_W w} = \frac{\lambda\, w^T S_W w}{w^T S_W w} = \lambda \tag{13.12} \]

from which it immediately follows that we want the largest eigenvalue to maximize the objective¹.

## Kernel Fisher LDA

So how do we kernelize this problem? Unlike for SVMs, it doesn't seem that the dual problem reveals the kernelized problem naturally. But, inspired by the SVM case, we make the following key assumption,

\[ w = \sum_{i} \alpha_i \phi(x_i) \tag{13.13} \]

¹ If you try to find the dual and maximize that, you'll get the wrong sign, it seems. My best guess of what goes wrong is that the constraint is not linear, and as a result the problem is not convex, so we cannot expect the optimal dual solution to be the same as the optimal primal solution.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 78 Context:

# CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS

This is a central recurrent equation that keeps popping up in every kernel machine. It says that, although the feature space is very high (or even infinite) dimensional, with a finite number of data-cases the final solution, \( w^* \), will not have a component outside the space spanned by the data-cases. It would not make much sense to apply this transformation if the number of data-cases were larger than the number of dimensions, but this is typically not the case for kernel methods. So, we argue that although there are possibly infinitely many dimensions available a priori, at most \( N \) are being occupied by the data, and the solution \( w \) must lie in their span. This is a case of the "representer theorem", which intuitively reasons as follows.
The solution \( w \) is the solution of some eigenvalue equation, \( S_W^{-1} S_B w = \lambda w \), where both \( S_B \) and \( S_W \) (and hence its inverse) act within the span of the data-cases. Hence, the part \( w^\perp \) that is perpendicular to this span will be projected to zero, and the equation above puts no constraints on those dimensions: they can be arbitrary and have no impact on the solution. If we now assume a very general form of regularization on the norm of \( w \), then these orthogonal components will be set to zero in the final solution: \( w^\perp = 0 \).

In terms of \( \alpha \), the objective \( J(\alpha) \) becomes:

\[ J(\alpha) = \frac{\alpha^T S_B^\phi \alpha}{\alpha^T S_W^\phi \alpha} \tag{13.14} \]

where it is understood that vector notation now applies to a different space, namely the space spanned by the data-vectors, \( \mathbb{R}^N \). The scatter matrices in kernel space can be expressed in terms of the kernel only as follows (this requires some algebra to verify):

\[ S_B^\phi = \sum_{c} N_c [\kappa_c \kappa_c^T - \kappa \kappa^T] \tag{13.15} \]
\[ S_W^\phi = K^2 - \sum_{c} N_c \kappa_c \kappa_c^T \tag{13.16} \]
\[ (\kappa_c)_j = \frac{1}{N_c} \sum_{i \in c} K_{ji} \tag{13.17} \]
\[ \kappa_j = \frac{1}{N} \sum_{i} K_{ji} \tag{13.18} \]

So, we have managed to express the problem in terms of kernels only, which is what we were after. Note that, since the objective in terms of \( \alpha \) has exactly the same form as that in terms of \( w \), we can solve it by solving the generalized

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 79 Context:

# 13.2 A CONSTRAINED CONVEX PROGRAMMING FORMULATION OF FDA

eigenvalue equation. This scales as \( \mathcal{O}(N^3) \), which is certainly expensive for many datasets. More efficient optimization schemes, solving a slightly different problem and based on efficient quadratic programs, exist in the literature.

Projections of new test points onto the solution space can be computed by:

\[ w^T \Phi(x) = \sum_{i} \alpha_i K(x_i, x) \tag{13.19} \]

as usual. In order to classify a test point, we still need to divide the space into regions which belong to one class. The easiest possibility is to pick the class with the smallest Mahalanobis distance in the projected space:

\[ d(x, c) = \frac{(w^T \Phi(x) - \mu_c)^2}{\sigma_c^2} \tag{13.20} \]

where \( \mu_c \) and \( \sigma_c \) represent the class mean and standard deviation in the 1-d projected space, respectively. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization. Clearly, as it stands, the kernel machine will overfit. To regularize, we can add a term to the denominator:

\[ S_W \to S_W + \beta I \tag{13.21} \]

By adding a diagonal term to this matrix, we ensure that very small eigenvalues are bounded away from zero, which improves numerical stability when computing the inverse. If we write the Lagrangian formulation, where we maximize a constrained quadratic form in \( \alpha \), the extra term appears as a penalty proportional to \( \|\alpha\|^2 \), which acts as a weight-decay term favoring smaller values of \( \alpha \) over larger ones. Fortunately, the optimization problem has exactly the same form in the regularized case.
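A minimal sketch that maximizes \( J \) by solving the generalized eigenproblem (13.11) with the regularization (13.21); it assumes the `scatter_matrices` helper sketched earlier, and uses scipy's generalized symmetric eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(X, labels, beta=1e-3):
    """Solve S_B w = lambda S_W w (eqn 13.11) with S_W -> S_W + beta*I."""
    S_B, S_W = scatter_matrices(X, labels)       # helper sketched earlier
    d = S_W.shape[0]
    lam, W = eigh(S_B, S_W + beta * np.eye(d))   # ascending eigenvalues
    return W[:, -1], lam[-1]                     # J is maximized at lambda_max
```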
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 80 Context:

# CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS

## Introduction

Fisher Linear Discriminant Analysis (LDA) is a supervised learning method used mainly for classification. It projects data onto a lower-dimensional space while maximizing class separability.

## Key Concepts

1. **Discriminant Function**: The goal is to find a linear combination of features that separates two or more classes of objects or events.
2. **LDA vs. PCA**: LDA focuses on maximizing the separation between multiple classes, while Principal Component Analysis (PCA) focuses on variance.
3. **Regularization**: Regularization techniques can be applied to improve results when the number of features is high compared to the number of samples.

## Mathematical Formulation

Given a set of classes \( C_1, C_2, \ldots, C_k \), let \( \mu_i \) be the mean of class \( C_i \), and \( S_W \) and \( S_B \) the within-class and between-class scatter matrices, respectively. The LDA objective is to maximize:

\[ J(w) = \frac{w^T S_B w}{w^T S_W w} \]

where \( w \) is the projection vector.

## Implementation Steps

1. Calculate the mean vectors for each class.
2. Compute the scatter matrices \( S_W \) and \( S_B \).
3. Solve the generalized eigenvalue problem for the matrix \( S_W^{-1} S_B \).
4. Choose the top eigenvectors to form a new feature space.

## Use Cases

- Face recognition
- Medical diagnosis
- Marketing segmentation

## Conclusion

Fisher LDA is a powerful tool for classification tasks with a focus on maximizing the separation between classes. Its mathematical foundation allows for effective implementation across various domains.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 81 Context:

# Chapter 14

## Kernel Canonical Correlation Analysis

Imagine you are given 2 copies of a corpus of documents, one written in English, the other written in German. You may consider an arbitrary representation of the documents, but for definiteness we will use the "vector space" representation, where there is an entry for every possible word in the vocabulary and a document is represented by count values for every word, i.e., if the word "the" appeared 12 times and is the first word in the vocabulary, we have \( x_1(\text{doc}) = 12 \), etc.

Let's say we are interested in extracting low-dimensional representations for each document. If we had only one language, we could consider running PCA to extract directions in word space that carry most of the variance. This has the ability to infer semantic relations between the words, such as synonymy, because if words tend to co-occur often in documents, i.e., they are highly correlated, they tend to be combined into a single dimension in the new space. These spaces can often be interpreted as topic spaces.

If we have two translations, we can try to find projections of each representation separately such that the projections are maximally correlated. Hopefully, this implies that they represent the same topic in two different languages. In this way we can extract language-independent topics.
Let \( x \) be a document in English and \( y \) a document in German. Consider the projections \( u = a^T x \) and \( v = b^T y \). Also assume that the data have zero mean. We now consider the following objective:

\[ \rho = \frac{E[uv]}{\sqrt{E[u^2]E[v^2]}} \tag{14.1} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 82 Context:

# CHAPTER 14. KERNEL CANONICAL CORRELATION ANALYSIS

We want to maximize this objective, because this would maximize the correlation between the univariates \( u \) and \( v \). Note that we divided by the standard deviations of the projections to remove scale dependence. This exposition is very similar to the Fisher discriminant analysis story, and I encourage you to reread that. For instance, there you can find how to generalize to cases where the data is not centered.

We also introduce the following "trick". Since we can rescale \( a \) and \( b \) without changing the problem, we can constrain the denominators to be equal to 1. This then allows us to write the problem as:

\[ \begin{align*} \text{maximize}_{a,b} \quad & \rho = E[uv] \\ \text{subject to} \quad & E[u^2] = 1 \\ & E[v^2] = 1 \end{align*} \tag{14.2} \]

Or, if we construct a Lagrangian and write out the expectations, we find,

\[ \min_{a,b} \max_{\lambda_1, \lambda_2} \sum_{i} a^T x_i y_i^T b - \frac{1}{2} \lambda_1 \left( \sum_{i} a^T x_i x_i^T a - N \right) - \frac{1}{2} \lambda_2 \left( \sum_{i} b^T y_i y_i^T b - N \right) \tag{14.3} \]

where we have multiplied by \( N \). Let's take derivatives w.r.t. \( a \) and \( b \) to see what the KKT equations tell us,

\[ \sum_{i} x_i y_i^T b - \lambda_1 \sum_{i} x_i x_i^T a = 0 \tag{14.4} \]
\[ \sum_{i} y_i x_i^T a - \lambda_2 \sum_{i} y_i y_i^T b = 0 \tag{14.5} \]

First notice that if we multiply the first equation by \( a^T \) and the second by \( b^T \) and subtract the two, while using the constraints, we arrive at \( \lambda_1 = \lambda_2 = \lambda \). Next, rename \( S_{xy} = \sum_{i} x_i y_i^T \), \( S_x = \sum_{i} x_i x_i^T \), \( S_y = \sum_{i} y_i y_i^T \). We define the following larger matrices: \( S_D \) is the block-diagonal matrix with \( S_x \) and \( S_y \) on the diagonal and zeros on the off-diagonal blocks, and \( S_C \) is the matrix with \( S_{xy} \) and \( S_{yx} \) on the off-diagonal blocks and zeros on the diagonal blocks. Finally, we define \( c = [a; b] \). The two equations can then be written jointly as,

\[ S_C c = \lambda S_D c \ \Rightarrow\ S_D^{-1} S_C c = \lambda c \tag{14.6} \]

which is again a regular eigenvalue equation for \( c \).
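Eqn (14.6) can be solved directly with a generalized eigensolver. A minimal sketch, assuming centered data matrices and enough data-cases that \( S_D \) is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y):
    """Top canonical pair from S_C c = lambda S_D c (eqn 14.6), c = [a; b].
    X: (N, p) and Y: (N, q), both with zero-mean columns."""
    p, q = X.shape[1], Y.shape[1]
    S_x, S_y, S_xy = X.T @ X, Y.T @ Y, X.T @ Y
    S_D = np.block([[S_x, np.zeros((p, q))], [np.zeros((q, p)), S_y]])
    S_C = np.block([[np.zeros((p, p)), S_xy], [S_xy.T, np.zeros((q, q))]])
    lam, C = eigh(S_C, S_D)       # generalized symmetric eigenproblem
    c = C[:, -1]                  # largest eigenvalue = maximal correlation
    return c[:p], c[p:], lam[-1]  # a, b, rho
```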
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 83 Context:

# 14.1 Kernel CCA

As usual, the starting point is to map the data-cases to feature vectors \( \Phi(x) \) and \( \Psi(y) \). When the dimensionality of the space is larger than the number of data-cases in the training set, the solution must lie in the span of the data-cases, i.e.,

\[ a = \sum_{i} \alpha_i \Phi(x_i), \quad b = \sum_{i} \beta_i \Psi(y_i) \tag{14.7} \]

Using this equation in the Lagrangian, we get,

\[ L = \alpha^T K_x K_y \beta - \frac{1}{2} \lambda (\alpha^T K_x^2 \alpha - N) - \frac{1}{2} \lambda (\beta^T K_y^2 \beta - N) \tag{14.8} \]

where \( \alpha \) is a vector in a different, \( N \)-dimensional space than e.g. \( a \), which lives in a \( D \)-dimensional space, and \( (K_x)_{ij} = \Phi(x_i)^T \Phi(x_j) \), and similarly for \( K_y \).

Taking derivatives w.r.t. \( \alpha \) and \( \beta \), we find,

\[ K_x K_y \beta = \lambda K_x^2 \alpha \tag{14.9} \]
\[ K_y K_x \alpha = \lambda K_y^2 \beta \tag{14.10} \]

Let's try to solve these equations by assuming that \( K_x \) is full rank (which is typically the case). We get \( \alpha = \lambda^{-1} K_x^{-1} K_y \beta \), and hence \( K_y^2 \beta = \lambda^2 K_y^2 \beta \), which always has a solution with \( \lambda = 1 \). By recalling that,

\[ \rho = \frac{1}{N} a^T S_{xy} b = \frac{\lambda}{N} a^T S_x a = \lambda \tag{14.11} \]

we observe that \( \lambda = 1 \) represents the solution with maximal correlation, and hence the preferred one. This is a typical case of overfitting, which emphasizes the need to regularize in kernel methods. This can be done by adding a diagonal term to the constraints in the Lagrangian (or, equivalently, to the denominator of the original objective), leading to the Lagrangian,

\[ L = \alpha^T K_x K_y \beta - \frac{1}{2} \lambda (\alpha^T K_x^2 \alpha + \eta \|\alpha\|^2 - N) - \frac{1}{2} \lambda (\beta^T K_y^2 \beta + \eta \|\beta\|^2 - N) \tag{14.12} \]

One can see that this acts as a quadratic penalty on the norms of \( \alpha \) and \( \beta \). The resulting equations are,

\[ K_x K_y \beta = \lambda (K_x^2 + \eta I) \alpha \tag{14.13} \]
\[ K_y K_x \alpha = \lambda (K_y^2 + \eta I) \beta \tag{14.14} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 84 Context:

# 14. KERNEL CANONICAL CORRELATION ANALYSIS

Analogous to the primal problem, we define big matrices: \( K_D \), which contains \( (K_x^2 + \eta I) \) and \( (K_y^2 + \eta I) \) as blocks on the diagonal and zeros at the blocks off the diagonal, and \( K_C \), which has \( K_x K_y \) in the upper-right off-diagonal block and \( K_y K_x \) in the lower-left off-diagonal block. Also, we define \( \gamma = [\alpha; \beta] \). This leads to the equation,

\[ K_C \gamma = \lambda K_D \gamma \ \Rightarrow\ K_D^{-1} K_C \gamma = \lambda \gamma \tag{14.15} \]

which is again a regular eigenvalue equation. Note that the regularization also moved the smallest eigenvalue away from zero, and hence made the inverse numerically stable. The value for \( \eta \) needs to be chosen using cross-validation or some other measure. Solving the equations through this larger eigenvalue problem is actually not quite necessary, and more efficient methods exist (see book). The solutions are not expected to be sparse, because eigenvectors are not expected to be sparse. One would have to replace the \( L_2 \) norm penalties with \( L_1 \) norm penalties to obtain sparsity.
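The regularized kernel equations (14.13)-(14.15) translate into the following minimal sketch (note that \( (K_x K_y)^T = K_y K_x \) for symmetric kernel matrices, so the big matrix \( K_C \) is symmetric and a symmetric generalized eigensolver applies):

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(K_x, K_y, eta):
    """Solve K_C gamma = lambda K_D gamma (eqn 14.15), gamma = [alpha; beta].
    K_x, K_y: (N, N) centered kernel matrices, eta > 0 the regularizer."""
    N = K_x.shape[0]
    Z, I = np.zeros((N, N)), np.eye(N)
    K_D = np.block([[K_x @ K_x + eta * I, Z], [Z, K_y @ K_y + eta * I]])
    K_C = np.block([[Z, K_x @ K_y], [K_y @ K_x, Z]])
    lam, G = eigh(K_C, K_D)               # K_D is positive definite for eta > 0
    gamma = G[:, -1]                      # maximal canonical correlation
    return gamma[:N], gamma[N:], lam[-1]  # alpha, beta, rho
```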
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 85 Context:

# Appendix A

## Essentials of Convex Optimization

### A.1 Lagrangians and all that

Most kernel-based algorithms fall into two classes: either they use spectral techniques to solve the problem, or they use convex optimization techniques to solve the problem. Here we will discuss convex optimization. A constrained optimization problem can be expressed as follows:

\[ \begin{align*} \text{minimize} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \leq 0 \quad \forall i \\ & h_j(x) = 0 \quad \forall j \end{align*} \tag{A.1} \]

That is, we have inequality constraints and equality constraints. We now write the primal Lagrangian of this problem, which will be helpful in the following development:

\[ L_P(x, \lambda, \nu) = f_0(x) + \sum_{i} \lambda_i f_i(x) + \sum_{j} \nu_j h_j(x) \tag{A.2} \]

where we will assume in the following that \( \lambda_i \geq 0 \ \forall i \). From here, we can define the dual Lagrangian by:

\[ L_D(\lambda, \nu) = \inf_x L_P(x, \lambda, \nu) \tag{A.3} \]

This objective can actually become \( -\infty \) for certain values of its arguments. We will call parameters \( \lambda \geq 0, \nu \) for which \( L_D > -\infty \) dual feasible.

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 86 Context:

# APPENDIX A. ESSENTIALS OF CONVEX OPTIMIZATION

It is important to notice that the dual Lagrangian is a concave function of \( \lambda, \nu \), because it is a pointwise infimum of a family of functions that are linear in \( \lambda, \nu \). Hence, even if the primal is not convex, the dual is certainly concave! It is not hard to show that

\[ L_D(\lambda, \nu) \leq p^* \tag{A.4} \]

where \( p^* \) is the primal optimal value. This simply follows because \( \sum_i \lambda_i f_i(x^*) + \sum_j \nu_j h_j(x^*) \leq 0 \) for a primal feasible point \( x^* \). Thus, the dual problem always provides a lower bound to the primal problem. The optimal lower bound can be found by solving the dual problem,

\[ \begin{align*} \text{maximize}_{\lambda, \nu} \quad & L_D(\lambda, \nu) \\ \text{subject to} \quad & \lambda_i \geq 0 \ \forall i \end{align*} \tag{A.5} \]

which is therefore a convex optimization problem. If we call \( d^* \) the dual optimal value, we always have \( d^* \leq p^* \), which is called weak duality; \( p^* - d^* \) is called the duality gap. Strong duality holds when \( p^* = d^* \). Strong duality is very nice, in particular if we can express the primal solution \( x^* \) in terms of the dual solution \( (\lambda^*, \nu^*) \), because then we can simply solve the dual problem and convert the answer to the primal domain, since we know that solution must then be optimal. Often the dual problem is easier to solve.

So when does strong duality hold? Up to some mathematical details, the answer is: if the primal problem is convex and the equality constraints are linear. This means that \( f_0(x) \) and \( \{f_i(x)\} \) are convex functions and \( h_j(x) = Ax - b \).

The primal problem can be written as follows,

\[ p^* = \inf_x \sup_{\lambda \geq 0, \nu} L_P(x, \lambda, \nu) \tag{A.6} \]

This can be seen by noting that \( \sup_{\lambda \geq 0, \nu} L_P(x, \lambda, \nu) = f_0(x) \) when \( x \) is feasible, but \( \infty \) otherwise. To see this, first check that by violating one of the constraints you can find a choice of \( \lambda, \nu \) that makes the Lagrangian infinite. Also, when all the constraints are satisfied, the best we can do is maximize the additional terms to be zero, which is always possible: for instance, we can simply set all \( \lambda, \nu \) to zero, even though this is not necessary if the constraints themselves vanish. The dual problem by definition is given by,

\[ d^* = \sup_{\lambda \geq 0, \nu} \inf_x L_P(x, \lambda, \nu) \tag{A.7} \]

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 87 Context:

# A.1. LAGRANGIANS AND ALL THAT

Hence, the "sup" and "inf" can be interchanged if strong duality holds; the optimal solution is then a saddle-point.
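As a tiny worked example (not from the text) of these definitions, take \( f_0(x) = x^2 \) with the single constraint \( f_1(x) = 1 - x \leq 0 \). Then

\[ L_P(x, \lambda) = x^2 + \lambda(1 - x), \qquad L_D(\lambda) = \inf_x L_P(x, \lambda) = \lambda - \frac{\lambda^2}{4} \]

where the infimum is attained at \( x = \lambda/2 \). Maximizing \( L_D \) over \( \lambda \geq 0 \) gives \( \lambda^* = 2 \) and \( d^* = 1 \), which equals \( p^* = 1 \), attained at \( x^* = 1 \): strong duality holds, the duality gap is zero, and \( \lambda^* f_1(x^*) = 0 \) with the constraint active.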
It is important to realize that the order of maximization and minimization matters for arbitrary functions (but not when strong duality holds). Try to imagine a "V"-shaped valley which runs diagonally across the coordinate system. If we first maximize over one direction, keeping the other direction fixed, and then minimize the result, we end up with the lowest point on the rim. If we reverse the order, we end up with the highest point in the valley.

There are a number of important necessary conditions that hold for problems with zero duality gap. These Karush-Kuhn-Tucker conditions turn out to be sufficient for convex optimization problems. They are given by:

\[ \nabla f_0(x^*) + \sum_{i} \lambda_i^* \nabla f_i(x^*) + \sum_{j} \nu_j^* \nabla h_j(x^*) = 0 \tag{A.8} \]
\[ f_i(x^*) \leq 0 \tag{A.9} \]
\[ h_j(x^*) = 0 \tag{A.10} \]
\[ \lambda_i^* \geq 0 \tag{A.11} \]
\[ \lambda_i^* f_i(x^*) = 0 \tag{A.12} \]

The first equation is easily derived, because we already saw that \( p^* = \inf_x L_P(x, \lambda^*, \nu^*) \), and hence all the derivatives w.r.t. \( x \) must vanish. This condition has a nice interpretation as a "balancing of forces". Imagine a ball rolling down a surface defined by \( f_0(x) \) (i.e., you are doing gradient descent to find the minimum). The ball gets blocked by a wall, which is the constraint. If the surface and constraint are convex, then if the ball doesn't move we have reached the optimal solution. At that point, the forces on the ball must balance. The first term represents the force of the ball against the wall due to gravity (the ball is still on a slope). The second term represents the reaction force of the wall in the opposite direction. The \( \lambda \) represents the magnitude of the reaction force, which needs to be larger if the surface slopes more. We say that this constraint is "active". Other constraints which do not exert a force are "inactive" and have \( \lambda = 0 \). The latter statement can be read off from the last KKT condition, which we call "complementary slackness". It says that either \( f_i(x) = 0 \) (the constraint is saturated and hence active), in which case \( \lambda_i \) is free to take on a non-zero value; or the constraint is inactive, \( f_i(x) < 0 \), and then \( \lambda_i \) must vanish. As we will see soon, the active constraints will correspond to the support vectors in SVMs!

#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 88 Context:

# APPENDIX A. ESSENTIALS OF CONVEX OPTIMIZATION

Complementary slackness is easily derived by,

\[ f_0(x^*) = L_D(\lambda^*, \nu^*) = \inf_x \left( f_0(x) + \sum_{i} \lambda_i^* f_i(x) + \sum_{j} \nu_j^* h_j(x) \right) \]
\[ \leq f_0(x^*) + \sum_{i} \lambda_i^* f_i(x^*) + \sum_{j} \nu_j^* h_j(x^*) \tag{A.13} \]
\[ \leq f_0(x^*) \tag{A.14} \]

where the first line follows from strong duality (eqn. A.6), the second because the infimum is never larger than the value at the particular point \( x^* \), and the last because \( f_i(x^*) \leq 0 \), \( \lambda_i^* \geq 0 \), and \( h_j(x^*) = 0 \). Hence all inequalities are equalities; since each term \( \lambda_i^* f_i(x^*) \) is non-positive, each term must vanish separately.
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 89 Context: # Appendix B ## Kernel Design ### B.1 Polynomial Kernels The construction that we will follow below is to first write feature vectors as products of subsets of the input attributes, i.e., to define feature vectors as follows, \[ \phi_{\mathbf{r}}(x) = x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} \tag{B.1} \] where we can put various restrictions on the possible combinations of exponents which are allowed. For instance, we could require that the exponents sum to a constant \(s\), i.e., that there are precisely \(s\) factors in the product. Or we could require that each \(r_j \in \{0, 1\}\). Generally speaking, the best choice depends on the problem you are modeling, but another important constraint is that the corresponding kernel must be easy to compute. Let's define the kernel as usual as, \[ K(x, y) = \sum_{\mathbf{r}} \phi_{\mathbf{r}}(x) \phi_{\mathbf{r}}(y) \tag{B.2} \] where the sum runs over the allowed exponent vectors \( \mathbf{r} = [r_1, r_2, \ldots, r_n] \). We have already encountered the polynomial kernel as, \[ K(x, y) = (R + x^T y)^d = \sum_{s=0}^{d} \binom{d}{s} R^{d-s}(x^T y)^s \tag{B.3} \] where the last equality follows from a binomial expansion. If we write out the #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 90 Context: # APPENDIX B. KERNEL DESIGN term \( (x^T y)^s \) using the multinomial expansion, \[ (x^T y)^s = (x_1 y_1 + x_2 y_2 + \ldots + x_n y_n)^s = \sum_{r_1 + r_2 + \ldots + r_n = s} \frac{s!}{r_1! \, r_2! \cdots r_n!} (x_1 y_1)^{r_1} (x_2 y_2)^{r_2} \cdots (x_n y_n)^{r_n} \tag{B.4} \] Taken together with eqn. B.3 we see that the features correspond to: \[ \phi_{\mathbf{r}}(x) = \sqrt{\frac{d!}{(d-s)! \, r_1! \, r_2! \cdots r_n!} R^{d-s}} \; x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} \quad \text{with } r_1 + r_2 + \ldots + r_n = s \leq d \tag{B.5} \] The point is really that in order to efficiently compute the total sum of \(\binom{n+d-1}{d}\) terms we have inserted very special coefficients. The only true freedom we have left is in choosing \(R\): for larger \(R\) we down-weight higher order polynomials more. The question we want to answer is: how much freedom do we have in choosing different coefficients while still being able to compute the inner product efficiently? ## B.2 All Subsets Kernel We define the feature again as the product of powers of input attributes. However, in this case, the choice of power is restricted to \(\{0,1\}\), i.e. the feature is present or absent. For \(n\) input dimensions (number of attributes) we have \(2^n\) possible combinations. Let's compute the kernel function: \[ K(x,y) = \sum_{\mathbf{r} \in \{0,1\}^n} \phi_{\mathbf{r}}(x) \phi_{\mathbf{r}}(y) = \sum_{\mathbf{r} \in \{0,1\}^n} \prod_{j=1}^{n} (x_j y_j)^{r_j} = \prod_{i=1}^{n}(1 + x_i y_i) \tag{B.6} \] where the last identity follows from the fact that, \[ \prod_{i=1}^{n} (1 + z_i) = 1 + \sum_{i} z_i + \sum_{i < j} z_i z_j + \ldots + z_1 z_2 \cdots z_n \quad (B.7) \] with \( z_i = x_i y_i \), i.e. a sum over all possible subsets. Note that in this case again, it is much more efficient to compute the kernel directly than to sum over the features. Also note that in this case there is no decaying factor multiplying the monomials. (Both kernel identities are checked numerically in the sketch following Section B.3 below.) #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 91 Context: # B.3 The Gaussian Kernel This is given by, \[ K(x, y) = \exp\left(-\frac{1}{2\sigma^2} \|x - y\|^2\right) \tag{B.8} \] where \( \sigma \) controls the flexibility of the kernel: for very small \( \sigma \), the Gram matrix becomes the identity and every point is very dissimilar to any other point.
On the other hand, for very large \( \sigma \), we find the constant kernel, with all entries equal to 1, and hence all points look completely similar. This underscores the need for regularization in kernel methods: it is easy to perform perfectly on the training data, but that does not imply you will do well on new test data. In the RKHS construction, the features corresponding to the Gaussian kernel are Gaussians centered on the data-cases, i.e., smoothed versions of the data-cases, \[ \phi_x(\cdot) = \exp\left(-\frac{1}{2\sigma^2} \|x - \cdot\|^2\right) \tag{B.9} \] and thus every new direction which is added to the feature space is going to be orthogonal to all directions beyond the width of the Gaussian, and somewhat aligned to close-by points. Since the inner product of any feature vector with itself is 1, all feature vectors have length 1. Moreover, inner products between any two different feature vectors are positive, implying that all feature vectors can be represented in the positive orthant (or any other single orthant), i.e., they lie on a sphere of radius 1 within a single orthant.
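An editor's numerical check (not from the book) of the three kernels in this appendix: the explicit feature expansions behind (B.5) and (B.6) against the closed-form kernels (B.3) and (B.6), plus the two extremes of \( \sigma \) for the Gaussian kernel (B.8). NumPy and the small dimensions are assumptions for illustration only.

```python
from itertools import product
from math import factorial
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 3, 2, 1.0
x, y = rng.normal(size=n), rng.normal(size=n)

def monomial(v, r):
    out = 1.0
    for vi, ri in zip(v, r):
        out *= vi ** ri
    return out

# Polynomial kernel (B.3) vs. the sum over explicit features (B.5):
# phi_r(x) * phi_r(y) = d! / ((d-s)! r_1!...r_n!) * R^(d-s) * monomials.
kernel_poly = (R + float(x @ y)) ** d
explicit_poly = sum(
    factorial(d) / (factorial(d - sum(r)) * np.prod([factorial(k) for k in r]))
    * R ** (d - sum(r)) * monomial(x, r) * monomial(y, r)
    for r in product(range(d + 1), repeat=n) if sum(r) <= d
)
print(kernel_poly, explicit_poly)     # agree up to rounding

# All-subsets kernel (B.6): product form vs. sum over all 2^n subset features.
kernel_subsets = float(np.prod(1.0 + x * y))
explicit_subsets = sum(
    monomial(x, r) * monomial(y, r) for r in product((0, 1), repeat=n)
)
print(kernel_subsets, explicit_subsets)

# Gaussian kernel (B.8): Gram matrix at extreme sigmas.
X = rng.normal(size=(4, n))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
print(np.round(np.exp(-sq / (2 * 1e-3 ** 2)), 3))  # ~identity: all dissimilar
print(np.round(np.exp(-sq / (2 * 1e3 ** 2)), 3))   # ~all-ones: all similar
```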
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 1 Context: # A Cool Brisk Walk Through Discrete Mathematics **Version 2.2.1** **Stephen Davies, Ph.D.** Image Analysis: ### Analysis of the Attached Visual Content #### 1. Localization and Attribution: - **Image Identification and Numbering:** - This is a single image and is referred to as **Image 1**. #### 2. Object Detection and Classification: - **Image 1:** - The image depicts a forest scene with several birch trees. - **Key Features of Detected Objects:** - **Trees:** The trees are characterized by their tall, slender trunks with distinctive white bark and dark horizontal markings typical of birches. - **Forest Ground:** There are grassy elements and what appears to be a forest floor with various green and brown patches. #### 3. Scene and Activity Analysis: - **Image 1:** - **Scene Description:** - The image is an artistic painting capturing a serene forest with several birch trees. - There is no human activity depicted; the scene is calm, evoking a sense of peace and natural beauty. #### 4. Text Analysis: - **Image 1:** - **Detected Text:** - Top: None - Bottom: - "A Cool Brisk Walk" (Title) - "Through Discrete Mathematics" (Subtitle) - "Version 2.2.1" (Version Information) - "Stephen Davies, Ph.D." (Author Name) - **Text Significance:** - The text suggests that this image is probably the cover of a book titled "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies, Ph.D. The version number indicates it is version 2.2.1, signifying a possibly updated or revised edition. #### 8. Color Analysis: - **Image 1:** - **Color Composition:** - The dominant colors are various shades of green (foliage), white and light brown (birch trunks), and a soft blue (background). - **Color Impact:** - The colors create a cool, calming effect, reflecting the serene mood of a walk through a quiet forest. #### 9. Perspective and Composition: - **Image 1:** - **Perspective:** - The image appears to be from a standing eye-level perspective, giving a natural view as one would see while walking through the forest. - **Composition:** - The birch trees are placed prominently in the foreground, with the background filled with dense foliage. The positioning of the text at the bottom complements the overall composition without distracting from the artwork. #### 10. Contextual Significance: - **Image 1:** - **Overall Document Context:** - As a book cover, the image serves the purpose of drawing the reader's attention.
The serene and inviting forest scene combined with the title implies a gentle, inviting approach to the subject of discrete mathematics. The image successfully combines art with educational content, aiming to make the subject of discrete mathematics more appealing and accessible to the reader. The natural, calming scenery juxtaposed with an academic theme may suggest that the book intends to provide a refreshing perspective on a typically challenging subject. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 2 Context: Copyright © 2023 Stephen Davies. ## University of Mary Washington **Department of Computer Science** James Farmer Hall 1301 College Avenue Fredericksburg, VA 22401 --- Permission is granted to copy, distribute, transmit and adapt this work under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). The accompanying materials at [www.allthemath.org](http://www.allthemath.org) are also under this license. If you are interested in distributing a commercial version of this work, please contact the author at [stephen@umw.edu](mailto:stephen@umw.edu). The LaTeX source for this book is available from: [https://github.com/divilian/cool-brisk-walk](https://github.com/divilian/cool-brisk-walk). --- Cover art copyright © 2014 Elizabeth M. Davies. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 3 Context: # Contents at a glance - **Contents at a glance** …… i - **Preface** …… iii - **Acknowledgements** …… v 1. **Meetup at the trailhead** …… 1 2. **Sets** …… 7 3. **Relations** …… 35 4. **Probability** …… 59 5. **Structures** …… 85 6. **Counting** …… 141 7. **Numbers** …… 165 8. **Logic** …… 197 9. **Proof** …… 223 Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 4 Context: Image Analysis: Sure, let's analyze the provided visual content in detail according to the specified aspects: **Image 1:** ### 1. Localization and Attribution - **Image 1:** The only image on the page. ### 2. Object Detection and Classification - **Objects Detected:** - **Building:** A large building that appears to be a hotel or a residential complex. - **Street:** A surrounding street with sidewalks. - **Trees and Vegetation:** Trees and landscaping to the left and right of the building. - **Cars:** Multiple vehicles parked in front of the building. ### 3. Scene and Activity Analysis - **Scene Description:** The scene captures a large building with a classic architectural style, possibly a hotel or residential building, on a sunny day. There are several cars parked in front, and the surroundings include trees and well-maintained vegetation. - **Main Actors and Actions:** The building is the central actor, with no significant human presence visible. ### 4. Text Analysis - **Text Detected:** None within the image. ### 9. Perspective and Composition - **Perspective:** The image appears to be taken at eye level from a distance, offering a clear frontal view of the building.
- **Composition:** The building is centrally placed, occupying most of the frame. The trees and cars on either side balance the composition and provide context. ### 8. Color Analysis - **Dominant Colors:** - **White/Cream:** The building's exterior. - **Green:** The trees and vegetation. - **Gray:** The pavement and cars. - **Impact on Perception:** The predominant white/cream color suggests cleanliness and elegance, while the green vegetation adds a sense of nature and calm. ### 6. Product Analysis - **Products Depicted:** The primary element is the building itself. - **Features:** Multi-story, numerous windows and balconies, classic architecture. - **Materials:** Likely concrete or brick with a stucco exterior. - **Colors:** Mainly white/cream with darker-colored windows and roofing. ### 10. Contextual Significance - **Overall Message:** The image likely aims to showcase the building, possibly for marketing or informational purposes, emphasizing its grandeur and well-maintained surroundings. - **Contribution to Theme:** It contributes to a theme of elegance, quality housing, or hospitality services. ### 7. Anomaly Detection - **Possible Anomalies:** None detected. The scene appears typical for this type of location. ### 11. Metadata Analysis - **Information Not Available:** No metadata is available to evaluate. --- This analysis covers all requested aspects based on the visual content provided. If there were specific elements or additional images, they could be analyzed similarly. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 5 Context: # Preface Discrete math is a popular book topic — start Googling around and you'll find a zillion different textbooks about it. Take a closer look, and you'll discover that most of these are pretty thick, dense volumes packed with lots of equations and proofs. They're principled approaches, written by mathematicians and (seemingly) for mathematicians. I speak with complete frankness when I say I'm comforted to know that the human race is well covered in this area. We need smart people who can derive complex expressions and prove theorems from scratch, and I'm glad we have them. Your average computer science practitioner, however, might be better served by a different approach. There are elements to the discrete math mindset that a budding software developer needs experience with. This is why discrete math is (properly, I believe) part of the mandatory curriculum for most computer science undergraduate programs. But for future programmers and engineers, the emphasis should be different than it is for mathematicians and researchers in computing theory. A practical computer scientist mostly needs to be able to use these tools, not to derive them. She needs familiarity, and practice, with the fundamental concepts and the thought processes they involve. The number of times the average software developer will need to construct a proof in graph theory is probably near zero. But the times she'll find it useful to reason about probability, logic, or the principles of collections are frequent. I believe the majority of computer science students benefit most from simply gaining an appreciation for the richness and rigor of #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 6 Context: # PREFACE this material, what it means, and how it impacts their discipline.
Becoming an expert theorem prover is not required, nor is deriving closed-form expressions for the sizes of trees with esoteric properties. Basic fluency with each topic area, and an intuition about when it can be applied, is the proper aim for most of those who would go forward and build tomorrow's technology. To this end, the book in your hands is a quick guided tour of introductory-level discrete mathematics. It's like a cool, brisk walk through a pretty forest. I point out the notable features of the landscape and try to instill a sense of appreciation and even of awe. I want the reader to get a feel for the lay of the land and a little exercise. If the student acquires the requisite vocabulary, gets some practice playing with the toys, and learns to start thinking in terms of the concepts here described, I will count it as a success. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 7 Context: # Acknowledgements A hearty thanks to Karen Anewalt, Crystal Burson, Prafulla Giri, Tayyar Hussain, Jennifer Magee, Veena Ravishankar, Jacob Shtabnoy, and a decade's worth of awesome UMW Computer Science students for their many suggestions and corrections to make this text better! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 8 Context: Image Analysis: ### Image Analysis: #### 1. **Localization and Attribution:** - **Image Number**: Image 1 #### 2. **Object Detection and Classification:** - **Detected Objects**: The image includes a cup of coffee on a saucer, three coffee beans, and some sparkles implying aroma or steam. - **Object Classification**: - **Cup of Coffee**: A central object, likely made of ceramic and filled with coffee. - **Coffee Beans**: Close to the cup, classifiable as roasted beans. - **Sparkles/Steam**: Above the cup, representing the aroma of the hot coffee. #### 3. **Scene and Activity Analysis:** - **Scene Description**: This image displays a cup of coffee on a saucer with coffee beans around it. Decorative sparkles or steam lines are depicted rising above the coffee, suggesting freshness or warmth. - **Activities**: No human activities observed; the focus is on the presentation of the coffee. #### 4. **Text Analysis:** - No text detected in this image. #### 5. **Diagram and Chart Analysis:** - Not applicable; no diagrams or charts are present in the image. #### 6. **Product Analysis:** - **Product**: A cup of coffee. - **Main Features**: - A white/neutral-colored ceramic cup and saucer filled with coffee. - Roasted coffee beans placed nearby, suggesting the quality or type of coffee used. - **Materials**: Ceramic cup and saucer, roasted coffee beans. - **Colors**: Predominantly brown (coffee and beans) and white (cup and saucer). #### 7. **Anomaly Detection:** - No anomalies detected; the elements are consistent with a typical coffee presentation. #### 8. **Color Analysis**: - **Dominant Colors**: Brown and white. - **Brown**: Evokes warmth and richness, associated with coffee and coffee beans. - **White**: Clean and neutral, ensures focus on the coffee. #### 9. **Perspective and Composition**: - **Perspective**: Bird's eye view. - **Composition**: - Centralized with the cup of coffee as the main focus. - Balanced by the placement of coffee beans around the cup.
- Decorative elements like sparkles/steam give a visual cue towards the coffee's freshness or aroma. #### 10. **Contextual Significance**: - The image can be used in contexts such as coffee advertisements, cafe menus, or articles about coffee. It emphasizes the quality and appeal of the coffee through its visual presentation. #### 11. **Metadata Analysis**: - No metadata available for analysis. #### 12. **Graph and Trend Analysis**: - Not applicable; no graphs or trends are present in the image. #### 13. **Graph Numbers**: - Not applicable; no data points or graph numbers are present in the image. ### Additional Aspects: #### **Process Flows (Ablaufprozesse)**: - Not applicable; no process flows depicted. #### **Process Descriptions (Prozessbeschreibungen)**: - Not applicable; no processes described. #### **Type Designations (Typen Bezeichnung)**: - Not applicable; no type designations specified. #### **Trend and Interpretation**: - Not applicable; no trends depicted. #### **Tables**: - Not applicable; no tables included. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 9 Context: # Chapter 1 ## Meetup at the trailhead Before we set out on our "cool, brisk walk," let's get oriented. What is discrete mathematics, anyway? Why is it called that? What does it encompass? And what is it good for? Let's take the two words of the subject, in reverse order. First, **math**. When most people hear "math," they think "numbers." After all, isn't math the study of quantity? And isn't that the class where we first learned to count, add, and multiply? Mathematics certainly has its root in the study of numbers — specifically, the "natural numbers" (the integers from 1 on up) that fascinated the ancient Greeks. Yet math is broader than this, almost to the point where numbers can be considered a special case of something deeper. In this book, when we talk about trees, sets, or formal logic, there might not be a number in sight. Math is about **abstract, conceptual objects that have properties,** and the implications of those properties. An "object" can be any kind of "thought material" that we can define and reason about precisely. Much of math deals with questions like, "Suppose we defined a certain kind of thing that had certain attributes. What would be the implications of this, if we reasoned it all the way out?" The "thing" may or may not be numerical, whatever it turns out to be. Like a number, however, it will be crisply defined, have certain known aspects to it, and be capable of combining with other things in some way. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 10 Context: # CHAPTER 1. MEETUP AT THE TRAILHEAD Fundamental to math is that it deals with the **abstract**. Abstract, which is the opposite of concrete, essentially means something that can't be perceived with the senses. A computer chip is concrete: you can touch it, you can see it. A number is not; nor is a function, a binary tree, or a logical implication. The only way to perceive these things is with the power of the mind. We will write expressions and draw pictures of many of our mathematical structures in order to help visualize them, and nearly everything we study will have practical applications whereby the abstractness gets grounded in concreteness for some useful purpose.
But the underlying mathematical entity remains abstract and ethereal — only accessible to the mind's eye. We may use a pencil to form the figure "5" on a piece of paper, but that is only a concrete manifestation of the underlying concept of "five-ness." Don't mistake the picture or the symbol for the thing itself, which always transcends any mere physical representation. The other word in the name of our subject is **"discrete"** (not to be confused with "discreet," which means something else entirely). The best way to appreciate what discrete means is to contrast it with its opposite, **continuous**. Consider the following list: | Discrete | Continuous | |-------------------------------|------------------------| | whole numbers (\( \mathbb{Z} \)) | real numbers (\( \mathbb{R} \)) | | int | double | | digital | analog | | quantum | continuum | | counting | measuring | | number theory | analysis | | \( \Sigma \) | \( \int \) | | \( - \) | \( \frac{d}{dx} \) | What do the left-hand entries have in common? They describe things that are measured in crisp, distinct intervals, rather than varying smoothly over a range. Discrete things jump suddenly from position to position, with rigid precision. If you're 5 feet tall, you might some day grow to 5.3 feet; but though there might be 5 #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 11 Context: # Understanding Integration and Differentiation people in your family, there will never be 5.3 of them (although there could be 6 someday). The last couple of entries on this list are worth a brief comment. They are math symbols, some of which you may be familiar with. On the right side — in the continuous realm — are \( \int \) and \( \frac{d}{dx} \), which you'll remember if you've taken calculus. They stand for the two fundamental operations of integration and differentiation. Integration, which can be thought of as finding "the area under a curve," is basically a way of adding up a whole infinite bunch of numbers over some range. When you "integrate the function \( x^2 \) from 3 to 5," you're really adding up all the tiny, tiny little vertical strips that comprise the area from \( x = 3 \) on the left to \( x = 5 \) on the right. Its corresponding entry in the left-column of the table is \( \Sigma \), which is just a short-hand for "sum up a bunch of things." Integration and summation are equivalent operations; it's just that when you integrate, you're adding up all the (infinitely many) slivers across the real-line continuum. When you sum, you're adding up a fixed sequence of entries, one at a time, like in a loop. \( \Sigma \) is just the discrete "version" of \( \int \). The same sort of relationship holds between ordinary subtraction (\(-\)) and differentiation (\(\frac{d}{dx}\)). If you've plotted a bunch of discrete points on \( x \)-\( y \) axes, and you want to find the slope between two of them, you just subtract their \( y \) values and divide by the \( x \) distance between them. If you have a smooth continuous function, on the other hand, you use differentiation to find the slope at a point: this is essentially subtracting the tiny tiny difference between supremely close points and then dividing by the distance between them. Don't worry; you don't need to have fully understood any of the integration or differentiation stuff I just talked about, or even to have taken calculus yet.
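A tiny numerical sketch (an editor's aside, not from the book) of both correspondences just described: summation as discrete integration, and subtraction as discrete differentiation.

```python
import numpy as np

# "Integrate x^2 from 3 to 5": exact antiderivative vs. a plain discrete sum.
exact = (5**3 - 3**3) / 3          # x^3 / 3 evaluated at 5, minus at 3
dx = 1e-4
xs = np.arange(3, 5, dx)
riemann = np.sum(xs**2) * dx       # sigma-style sum of tiny vertical strips
print(exact, riemann)              # 32.666..., ~32.666...

# Slope of x^2 at x = 2: discrete subtraction vs. the derivative 2x.
h = 1e-6
print(((2 + h)**2 - 2**2) / h, 2 * 2)   # ~4.0 vs. exactly 4
```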
I'm just trying to give you some feel for what "discrete" means, and how the dichotomy between discrete and continuous really runs through all of math and computer science. In this book, we will mostly be focusing on discrete values and structures, which turn out to be of more use in computer science. That's partially because, as you probably know, computers #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 12 Context: # CHAPTER 1. MEETUP AT THE TRAILHEAD themselves are discrete, and can only store and compute discrete values. There can be many of them — megabytes, gigabytes, terabytes — but each value stored is fundamentally comprised of bits, each of which has a value of either 0 or 1. This is unlike the human brain, by the way, whose neuronal synapses communicate based on the continuous quantities of chemicals present in their axons. So I guess "computer" and "brain" are another pair of entities we could add to our discrete vs. continuous list. There's another reason, though, why discrete math is of more use to computer scientists than continuous math is, beyond just the bits-and-bytes thing. Simply put, computers operate algorithmically. They carry out programs in step-by-step, iterative fashion. First do this, then do that, then move on to something else. This mechanical execution, like the ticking of a clock, permeates everything the computer can do, and everything we can tell it to do. At a given moment in time, the computer has completed step 7, but not step 8; it has accumulated 38 values, but not yet 39; its database has exactly 15 entries in it, no more and no less; it knows that after accepting this friend request, there will be exactly 553 people in your set of friends. The whole paradigm behind reasoning about computers and their programs is discrete, and that's why we computer scientists find different problems worth thinking about than most of the world did a hundred years ago. But it's still math. It's just discrete math. There's a lot to come, so limber up and let me know when you're ready to hit the road. ## 1.1 Exercises Use an index card or a piece of paper folded lengthwise, and cover up the right-hand column of the exercises below. Read each exercise in the left-hand column, answer it in your mind, then slide the index card down to reveal the answer and see if you're right! For every exercise you missed, figure out why you missed it before moving on.
Discrete, or continuous?** Abstract, since you can't see or touch or smell "twenty-seven." Probably discrete, since it's an integer, and when we think of whole numbers we think "discrete." (Food for thought: in real life, how would you know whether I meant the integer "27" or the decimal number "27.00"? And does it matter?) 5. **Consider a bit in a computer's memory. Would you call it abstract, or concrete? Discrete, or continuous?** Clearly it's discrete. Abstract vs. concrete, though, is a little tricky. If we're talking about the actual transistor and capacitor that's physically present in the hardware, holding a charge in some little chip, then it's concrete. But if we're talking about the value "1" that is conceptually part of the computer's currently executing state, then it's really abstract, just like 27 was. In this book, we'll always be talking about bits in this second, abstract sense. 6. **If math isn't just about numbers, what else is it about?** Any kind of abstract object that has properties we can reason about. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 14 Context: Image Analysis: ### Analysis of Attached Visual Content #### Image Identification and Localization - **Image 1**: Single image provided in the content. #### Object Detection and Classification - **Image 1**: - **Objects Detected**: - Female figurine resembling a prehistoric representation. - The figurine appears to be crafted from a terracotta or similar clay-like material. - **Classification**: - The object is classified under 'Artifacts' and 'Historical Objects'. - **Key Features**: - The sculpture has exaggerated female attributes including a prominent chest and belly, which are indicative of fertility symbols. - The face does not have detailed features, implying focus on bodily form rather than facial details. - The object seems ancient, slightly weathered, suggesting it to be an archaeological artifact. #### Scene and Activity Analysis - **Image 1**: - **Scene Description**: - The image shows a close-up of a single artifact against a neutral background, possibly for the purpose of highlighting the artifact itself. - **Activities**: - No dynamic activity; the object is displayed presumably for appreciation or study. #### Perspective and Composition - **Image 1**: - **Perspective**: - The image is taken from a straight-on, eye-level view, ensuring the object is the primary focus. - Close-up perspective to capture detailed features. - **Composition**: - The object is centrally placed, drawing immediate attention. - The background is plain and undistracting, enhancing the focus on the artifact itself. #### Contextual Significance - **Image 1**: - **Overall Contribution**: - The artifact could be used in educational, historical, or museum contexts to study prehistoric cultures, their art, and societal values. - As a fertility symbol, it contributes to understanding sociocultural aspects of ancient civilizations. #### Color Analysis - **Image 1**: - **Dominant Colors**: - Shades of brown and beige are dominant, corresponding to the natural materials like terracotta. - **Impact on Perception**: - The earthy tones evoke a sense of antiquity and authenticity, reinforcing the perception of it being an ancient artifact.
### Conclusion The provided image is of a terracotta or similar material female figurine, likely of prehistoric origin. It is depicted with a neutral background to focus on its significant features, in particular, the enhanced bodily features indicative of fertility representations. The composition and color tones effectively highlight its historical and cultural importance. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 15 Context: # Chapter 2 ## Sets The place from which we'll start our walk is a body of mathematics called "set theory." Set theory has an amazing property: it's so simple and applicable that almost all the rest of mathematics can be based on it! This is all the more remarkable because set theory itself came along pretty late in the game (as things go) — it was singlehandedly invented by one brilliant man, Georg Cantor, in the 1870s. That may seem like a long time ago, but consider that by the time Cantor was born, mankind had already accumulated an immense wealth of mathematical knowledge: everything from geometry to algebra to calculus to prime numbers. Set theory was so elegant and universal, though, that after it was invented, nearly everything in math was redefined from the ground up to be couched in the language of sets. It turns out that this simple tool is an amazingly powerful way to reason about mathematical concepts of all flavors. Everything else in this book stands on set theory as a foundation. Cantor, by the way, went insane as he tried to extend set theory to fully encompass the concept of infinity. Don't let that happen to you! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 16 Context: # 2.1 The Idea of a Set A set is a selection of certain things out of a (normally larger) group. When we talk about a set, we're declaring that certain specific items from that group are in the set, and certain items are not in the set. There are no shades of gray: every element is either in or out. For instance, maybe the overall group I'm considering is my family, which consists of five people: Dad, Mom, Lizzy, T.J., and Johnny. We could define one set—call it \( A \)—that contains Dad and Lizzy, but not the other three. Another set \( B \) might have Lizzy, T.J., and Johnny in it, but not the two parents. The set \( C \) might have Dad and only Dad in it. The set \( D \) might have all five Davieses, and the set \( E \) might have nobody at all. Etc. You can see that every set is just a way of specifying which elements are in and which are out. Normally a set will be based on some property of its members, rather than just being some random assortment of elements. That's what makes it worth thinking about. For example, the set \( P \) (for "parents") might be "all the Davieses who are parents": this set would contain Dad and Mom, and no one else. The set \( F \) (for "female") might be declared as the female members, and contain Mom and Lizzy. The set \( H \) (for "humans") would contain all five elements of the group. And so on. As with most of math, it turns out to be useful to define symbols for these concepts, because then we can talk about them more precisely and concisely. We normally list the members of a set using curly braces, like this: \[ A = \{ Dad, Lizzy \} \] or \[ B = \{ Lizzy, T.J., Johnny \} \] Note that it doesn't matter what order you list the members in.
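In programming terms (a hypothetical aside, not part of the text), the strict in-or-out nature of membership, and the irrelevance of listing order, are exactly how Python's built-in sets behave:

```python
A = {"Dad", "Lizzy"}
B = {"Lizzy", "T.J.", "Johnny"}
E = set()                                    # the empty set ({} is a dict!)

print({"Dad", "Lizzy"} == {"Lizzy", "Dad"})  # True: order doesn't matter
print("Lizzy" in B, "Dad" in B)              # True False: strictly in or out
print(len(E))                                # 0
```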
The set \( F \) of females contains Mom and Lizzy, but it's not like Mom is the "first" female or anything. That doesn't even make any sense. There is no "first." A set's members are all equally members. So \[ P = \{ Mom, Dad \} \] #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 17 Context: # 2.1. THE IDEA OF A SET is the same whether we write it like this: \( P = \{ \text{Dad}, \text{Mom} \} \) or this: \( P = \{ \text{Mom}, \text{Dad} \} \) Those are just two different ways of writing the same thing. The set \( E \) that had nobody in it can be written like this, of course: \( E = \{ \} \) but we sometimes use this special symbol instead: \( E = \emptyset \) However you write it, this kind of set (one that has no elements) is referred to as an **empty set**. The set \( H \), above, contained all the members of the group under consideration. Sometimes we'll refer to "the group under consideration" as the **domain of discourse**. It too is a set, and we usually use the symbol \( \Omega \)[^1] to refer to it. So in this case, \( \Omega = \{ \text{Mom}, \text{Johnny}, \text{T.J.}, \text{Dad}, \text{Lizzy} \} \) Another symbol we'll use a lot is \( \in \), which means "is a member of." Since Lizzy is a female, we can write: \( \text{Lizzy} \in F \) to show that Lizzy is a member of the set \( F \). Conversely, we write: \( \text{T.J.} \notin F \) to show that T.J. is not. As an aside, I mentioned that every item is either in, or not in, a set: there are no shades of gray. Interestingly, researchers have developed another body of mathematics called [I kid you not] *fuzzy* set theory. [^1]: Some authors use the symbol \( U \) for this, and call it the **universal set**. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 18 Context: # CHAPTER 2. SETS Fuzzy sets change this membership assumption: items can indeed be "partially in" a set. One could declare, for instance, that Dad is "10% female," which means he's only 10% in the \( F \) set. That might or might not make sense for gender, but you can imagine that if we defined a set \( T \) of "the tall people," such a notion might be useful. At any rate, this example illustrates a larger principle which is important to understand: in math, things are the way they are simply because we've decided it's useful to think of them that way. If we decide there's a different useful way to think about them, we can define new assumptions and proceed according to new rules. It doesn't make any sense to say "sets are (or aren't) really fuzzy": because there is no "really." All mathematics proceeds from whatever mathematicians have decided is useful to define, and any of it can be changed at any time if we see fit. ## 2.2 Defining sets There are two ways to define a set: extensionally and intensionally. I'm not saying there are two kinds of sets; rather, there are simply two ways to specify a set. To define a set extensionally is to list its actual members. That's what we did when we said \( P = \{ Dad, Mom \} \) above. In this case, we're not giving any "meaning" to the set; we're just mechanically spelling out what's in it. The elements Dad and Mom are called the **extension** of the set \( P \). The other way to specify a set is intensionally, which means to describe its meaning.
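A hypothetical code aside (mine, not the book's): an extensional definition corresponds to a literal listing, while an intensional definition corresponds to a membership rule, which is precisely the reading the text develops next. The names omega and is_parent are invented for illustration.

```python
omega = {"Dad", "Mom", "Lizzy", "T.J.", "Johnny"}   # domain of discourse

P_ext = {"Dad", "Mom"}                # extensional: list the members outright

def is_parent(person):                # intensional: a rule deciding membership
    return person in ("Dad", "Mom")

P_int = {p for p in omega if is_parent(p)}   # a set comprehension applies the rule
print(P_ext == P_int)                        # True: same extension
```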
Another way to think of this is specifying a rule by which it can be determined whether or not a given element is in the set. If I say "Let \( P \) be the set of all parents," I am defining \( P \) intensionally. I haven't explicitly said which specific elements of the set are in \( P \). I've just given the meaning of the set, from which you can figure out the extension. We call "parent-ness" the **intension** of \( P \). #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 19 Context: # 2.2. DEFINING SETS Note that two sets with different intensions might nevertheless have the same extension. Suppose \( O \) is "the set of all people over 25 years old" and \( R \) is "the set of all people who wear wedding rings." If our \( \Omega \) is the Davies family, then \( O \) and \( R \) have the same extension (namely, Mom and Dad). They have different intensions, though; conceptually speaking, they're describing different things. One could imagine a world in which older people don't all wear wedding rings, or one in which some younger people do. Within the domain of discourse of the Davies family, however, the extensions happen to coincide. **Fact:** We say two sets are equal if they have the same extension. This might seem unfair to intensionality, but that's the way it is. So it is totally legit to write: \[ O = R \] since by the definition of set equality, they are in fact equal. I thought this was weird at first, but it's really no weirder than saying "the number of years the Civil War lasted = Brett Favre's jersey number when he played for the Packers." The things on the left and right side of that equals sign refer conceptually to two very different things, but that doesn't stop them from both having the value 4, and thus being equal. By the way, we sometimes use the curly brace notation in combination with a colon to define a set intensionally. Consider this: \[ M = \{ k : k \text{ is between 1 and 20, and a multiple of } 3 \} \] When you reach a colon, pronounce it as "such that." So this says "M is the set of all numbers \( k \) such that \( k \) is between 1 and 20, and a multiple of 3." (There's nothing special about \( k \), here; I could have picked any letter.) This is an intensional definition since we haven't listed the specific numbers in the set, but rather given a rule for finding them. Another way to specify this set would be to write: \[ M = \{ 3, 6, 9, 12, 15, 18 \} \] which is an extensional definition of the same set. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 20 Context: # CHAPTER 2. SETS Interesting thought experiment: what happens if you enlarge the intension of a set by adding conditions to it? Answer: increasing the intension decreases the extension. For example, suppose \( M \) is initially defined as the set of all males (in the Davies family). Now suppose I decide to add to that intension by making it the set of all adult males. By adding to the intension, I have now reduced the extension from \( \{ Dad, T.J., Johnny \} \) to just \( \{ Dad \} \). The reverse is true as well: trimming down the intension by removing conditions effectively increases the extension of the set. Changing "all male persons" to just "all persons" includes Mom and Lizzy in the mix. ## 2.3 Finite and infinite sets Sets can have an infinite number of members.
That doesn't make sense for the Davies family example, but for other things it does, of course, like: \[ I = \{ k : k \text{ is a multiple of } 3 \} \] Obviously, there are infinitely many multiples of 3, and so \( I \) has an unlimited number of members. Not surprisingly, we call \( I \) an **infinite set**. More surprisingly, it turns out that there are different sizes of infinite sets, and hence different kinds of infinity. For instance, even though there are infinitely many whole numbers, and also infinitely many real (decimal) numbers, there are nevertheless more real numbers than whole numbers. This is the thing that drove Cantor insane, so we won't discuss it more here. For now, just realize that every set is either finite or infinite. You might think, by the way, that there's no way to define an infinite set extensionally, since that would require infinite paper. This isn't true, though, if we creatively use an ellipsis: \[ I = \{ 3, 6, 9, 12, 15, \ldots \} \] This is an extensional definition of \( I \), since we're explicitly listing all the members. It could be argued, though, that it's really intensional, since the interpretation of "..." requires the reader to infer the pattern. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 21 Context: 2.4 Sets are not arrays ======================= If you've done some computer programming, you might see a resemblance between sets and the collections of items often used in a program: arrays, perhaps, or linked lists. To be sure, there are some similarities. But there are also some very important differences, which must not be overlooked: - **No order.** As previously mentioned, there is no order to the members of a set. "{Mom, Dad}" is the same set as "{Dad, Mom}". In a computer program, of course, most arrays or lists have first, second, and last elements, and an index number assigned to each. - **No duplicates.** Suppose \( M \) is the set of all males. What would it possibly mean to say \( M = \{T.J., T.J., Johnny\} \)? Would that mean that "T.J. is twice the man that Johnny is"? This is obviously nonsensical. The set \( M \) is based on a property: maleness. Each element of \( M \) is either male, or it isn't. It can't be "male three times." Again, in an array or linked list, you could certainly have more than one copy of the same item in different positions. - **Infinite sets.** 'Nuff said. I've never seen an array with infinitely many elements, and neither will you. - **Untyped.** Most of the time, an array or other collection in a computer program contains elements of only a single type; it's an array of integers, or a linked list of Customer objects, for example. This is important because the program often needs to treat all elements in the collection the same way. Perhaps it needs to loop over the array to add up all the numbers, or iterate through a linked list and search for customers who have not placed an order in the last six months. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 22 Context: ``` # CHAPTER 2. SETS A program would run into problems if it tried to add a string of text to its cumulative total, or encountered a Product object in the middle of its list of Customers. Sets, though, can be heterogeneous, meaning they can contain different kinds of things.
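A short illustrative contrast in code (an editor's aside; it anticipates the heterogeneous set \( S \) of the next paragraph, with the \( \star \) symbol omitted):

```python
males_list = ["T.J.", "T.J.", "Johnny"]   # a list happily keeps duplicates
males_set = {"T.J.", "T.J.", "Johnny"}    # a set silently collapses them
print(males_set, len(males_list), len(males_set))   # {'T.J.', 'Johnny'} 3 2

# No order: sets can't be indexed.
# males_set[0]  ->  TypeError: 'set' object is not subscriptable

# Heterogeneous: one set may mix any hashable things.
S = {"Jack Nicholson", "Kim Kardashian", "Universal Studios", 5756}
print(5756 in S)                          # True
```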
The Davies family example had all human beings, but nothing stops me from creating a set \( S = \{ \text{Jack Nicholson}, \text{Kim Kardashian}, \text{Universal Studios}, 5756, \star \} \). I don't press this point too hard for a couple of reasons. First, most programming languages do allow heterogeneous collections of some sort, even if they're not the most natural thing to express. In Java, you can define a non-generic `ArrayList` that simply holds items of class `Object`. In C, you can have an array of void pointers to some unspecified type, which allows your array to point to different kinds of things. Unless it's a loosely-typed language, though (like Perl or JavaScript), it sort of feels like you're bending over backwards to do this. The other reason I make this distinction lightly is that when we're dealing with sets, we often find it useful to deal with things of only one type, and so our \( S \) ends up being homogeneous anyway. ## 2.5 Sets are not ordered pairs (or tuples) You'll remember from high school algebra the notion of an ordered pair \( (x, y) \). We dealt with those when we wanted to specify a point to plot on a graph: the first coordinate gave the distance from the origin on the x-axis, and the second coordinate on the y-axis.[^3] Clearly an ordered pair is not a set, because as the name implies it is ordered.[^2] #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 23 Context: # 2.6 Sets of Sets Sets are heterogeneous — a single set can contain four universities, seven integers, and an image — and so it might occur to you that they can contain other sets as well. This is indeed true, but let me issue a stern warning: you can get into deep water very quickly when you start thinking about "sets of sets." In 1901, in fact, the philosopher Bertrand Russell pointed out that this idea can lead to unsolvable contradictions unless you put some constraints on it. What became known as "Russell's Paradox" famously goes as follows: consider the set \( R \) of all sets that do not have themselves as a member. [^2]: For this reason, we'll be very careful to use curly braces to denote sets, and parentheses to denote ordered pairs. [^3]: By the way, although the word "coordinate" is often used to describe the elements of an ordered pair, that's really a geometry-centric word that implies a visual plot of some kind. Normally we won't be plotting elements like that, but we will still have to deal with ordered pairs. I'll just call the constituent parts "elements" to make it more general. Three-dimensional points need ordered triples \( (x, y, z) \), and it doesn't take a rocket scientist to deduce that we could extend this to any number of elements. The question is what to call them, and you do sort of sound like a rocket scientist (or other generic nerd) when you say tuple. Some people rhyme this word with "Drupal," and others with "couple," by the way, and there seems to be no consensus. If you have an ordered pair-type thing with 5 elements, therefore, it's a 5-tuple (or a quintuple). If it has 117 elements, it's a 117-tuple, and there's really nothing else to call it. The general term (if we don't know or want to specify how many elements) is n-tuple. In any case, it's an ordered sequence of elements that may contain duplicates, so it's very different than a set. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 24 Context: # CHAPTER 2. SETS
Now is \( R \) a member of itself, or isn't it? Either way you answer turns out to be wrong (try it!), which means that this whole setup must be flawed at some level. The good news is that as long as you don't deal with this kind of self-referential loop ("containing yourself as a member") then it's pretty safe to try at home. Consider this set: \[ V = \{ 3, 5, \{ 5, 4 \}, 2 \} \] This set has four (not five) members. Three of \( V \)'s members are integers: 2, 3, and 5. The other one is a set (with no name given). That other set, by the way, has two members of its own: 4 and 5. If you were asked, "is \( 4 \in V \)?" the answer would be no. As a corollary to this, there's a difference between \[ \emptyset \] and \[ \{ \emptyset \} \] The former is a set with no elements. The latter is a set with one element, and that element just happens to be a set with nothing in it. ## 2.7 Cardinality When we talk about the number of elements in a set, we use the word cardinality. You'd think we could just call it the "size" of the set, but mathematicians sometimes like words that sound cool. The cardinality of \( M \) (the set of males, where the Davies family is the domain of discourse) is 3, because there are three elements in it. The cardinality of the empty set \( \emptyset \) is 0. The cardinality of the set of all integers is \( \aleph_0 \). Simple as that. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 25 Context: 2.8 Some special sets ===================== In addition to the empty set, there are symbols for some other common sets, including: - **Z** — the integers (positive, negative, and zero) - **N** — the natural numbers (positive integers and zero) - **Q** — the rational numbers (all numbers that can be expressed as an integer divided by another integer) - **R** — the real numbers (all numbers that aren't imaginary, even decimal numbers that aren't rational) The cardinality of all these sets is infinity, although, as I alluded to previously, \(\mathbb{R}\) is in some sense "greater than" \(\mathbb{N}\). For the curious, we say that \(\mathbb{N}\) is a countably infinite set, whereas \(\mathbb{R}\) is uncountably infinite. Speaking very loosely, this can be thought of this way: if we start counting up all the natural numbers, 0, 1, 2, 3, 4, ..., we will never get to the end of them. But at least we can start counting. With the real numbers, we can't even get off the ground. Where do you begin? Starting with 0 is fine, but then what's the "next" real number? Choosing anything for your second number inevitably skips a lot in between. Once you've digested this, I'll spring another shocking truth on you: \( |\mathbb{Q}| \) is actually equal to \( |\mathbb{N}| \), not to \( |\mathbb{R}| \) as you might expect. Cantor came up with an ingenious numbering scheme whereby all the rational numbers — including 3, −9, and \( \frac{1}{17} \) — can be listed off regularly, in order, just like the integers can. And so \( |\mathbb{Q}| = |\mathbb{N}| \neq |\mathbb{R}| \). This kind of stuff can blow your mind. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 26 Context: # 2.9 Combining Sets Okay, so we have sets. Now what can we do with them? When you first learn about numbers back before kindergarten, the next thing you learn is how to combine numbers using various operations to produce other numbers.
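In code (an illustrative aside; Python requires the immutable frozenset for an inner set), the \( V \) example and the \( \emptyset \) versus \( \{ \emptyset \} \) distinction look like this:

```python
V = {3, 5, frozenset({5, 4}), 2}
print(len(V))                    # 4 -- the inner set counts as ONE member
print(4 in V)                    # False: 4 is in the inner set, not in V

empty = set()                    # the set with no elements: cardinality 0
nested = {frozenset()}           # one element, which happens to be an empty set
print(len(empty), len(nested))   # 0 1
```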
Numbers can be combined with \( -, \times, +, \div \), exponents, roots, and so on. Sets, too, have operations that are useful for combining them to make other sets. These include: - **Union (\(\cup\))**: The union of two sets is a set that includes the elements that either (or both) of them have as members. For instance, if \( A = \{ \text{Dad}, \text{Lizzy} \} \) and \( B = \{ \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \), then \( A \cup B = \{ \text{Dad}, \text{Lizzy}, \text{T.J.}, \text{Johnny} \} \). Note that any element is in the union if it is in \( A \) or \( B \). For this reason, there is a strong relationship between the union operator of sets and the "or" (\( \vee \)) operator of boolean logic that we'll see later. - **Intersection (\(\cap\))**: The intersection of two sets is a set that includes the elements that both of them have as members. In the above example, \( A \cap B = \{ \text{Lizzy} \} \). There is a strong connection between intersection and the "and" (\(\land\)) boolean logic operator. - **(Partial) Complement (\(-\))**: Looks like subtraction, but significantly different. \( A - B \) contains the elements from \( A \) that are not also in \( B \). So you start with \( A \) and then "subtract off" the contents of \( B \), if they occur. In the above example, \( A - B = \{ \text{Dad} \} \). (Note that T.J. and Johnny didn't really enter into the calculation.) Unlike \( \cup \) and \( \cap \), it is not commutative, meaning it's not symmetric: \( A - B \) doesn't (normally) give the same answer as \( B - A \). In this example, \( B - A = \{ \text{T.J.}, \text{Johnny} \} \); with union or intersection, by contrast, reversing the operands always gives the same result as before. - **(Total) Complement (\( \overline{X} \))**: Same as the partial complement, above, except that the implied first operand is \( \Omega \). In other words, \( A - B \) is "all the things in \( A \) that aren't in \( B \)," whereas #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 27 Context: # 2.9. COMBINING SETS \( \overline{B} \) is "all the things *period* that aren't in \( B \)." Of course, "all the things period" means "all the things that we're currently talking about." The domain of discourse \( \Omega \) is very important here. If we're talking about the Davies family, we would say that \( \overline{M} = \{ \text{Mom}, \text{Lizzy} \} \), because those are all the Davieses who aren't male. If, on the other hand, \( \Omega \) is "the grand set of absolutely everything," then not only is Mom a member of \( \overline{M} \), but so is the number 12, the French Revolution, and my nightmare last Tuesday about a rabid platypus. - **Cartesian product** (\( \times \)). Looks like multiplication, but very different. When you take the Cartesian product of two sets \( A \) and \( B \), you don't even get the elements from the sets in the result. Instead, you get ordered pairs of elements. These ordered pairs represent each combination of an element from \( A \) and an element from \( B \). For instance, suppose \( A = \{ \text{Bob}, \text{Dave} \} \) and \( B = \{ \text{Jenny}, \text{Gabrielle}, \text{Tiffany} \} \). Then: \[ A \times B = \{ (\text{Bob}, \text{Jenny}), (\text{Bob}, \text{Gabrielle}), (\text{Bob}, \text{Tiffany}), (\text{Dave}, \text{Jenny}), (\text{Dave}, \text{Gabrielle}), (\text{Dave}, \text{Tiffany}) \} \] Study that list. The first thing to realize is that it consists of neither guys nor girls, but of ordered pairs.
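All of these operators, including the Cartesian product, have direct Python counterparts; the following sketch (an editor's aside, reusing the text's own examples) mirrors the list above:

```python
from itertools import product

omega = {"Dad", "Mom", "Lizzy", "T.J.", "Johnny"}
A = {"Dad", "Lizzy"}
B = {"Lizzy", "T.J.", "Johnny"}
M = {"Dad", "T.J.", "Johnny"}               # the males

print(A | B)         # union: {'Dad', 'Lizzy', 'T.J.', 'Johnny'}
print(A & B)         # intersection: {'Lizzy'}
print(A - B, B - A)  # partial complements: {'Dad'} and {'T.J.', 'Johnny'}
print(omega - M)     # total complement relative to omega: {'Mom', 'Lizzy'}

guys = {"Bob", "Dave"}
girls = {"Jenny", "Gabrielle", "Tiffany"}
pairs = set(product(guys, girls))           # Cartesian product: ordered pairs
print(len(pairs))                           # 6 == len(guys) * len(girls)
print(("Bob", "Jenny") in pairs, "Jenny" in pairs)   # True False
```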
(Clearly, for example, \( \text{Jenny} \notin A \times B \).) Every guy appears exactly once with every girl, and the guy is always the first element of the ordered pair. Since we have two guys and three girls, there are \( 2 \times 3 = 6 \) elements in the result, which is an easy way to remember why the \( \times \) sign is used for the Cartesian product. (Don’t make the common mistake, though, of thinking that \( A \times B \) *is* 6. \( A \times B \) is a set, not a number. The cardinality of the set, of course, is 6, so it’s appropriate to write \( |A \times B| = 6 \).) All five of these operations are played out in the short sketch below.
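Python’s built-in set type mirrors this notation almost symbol for symbol, so the operations above can be replayed directly. The sketch below is my own illustration (not from the book); `itertools.product` stands in for \( \times \), the set `M` of males is inferred from the text’s statements that \( |M| = 3 \) and \( \overline{M} = \{\text{Mom}, \text{Lizzy}\} \), and the print order of set elements may vary.

```python
from itertools import product

A = {"Dad", "Lizzy"}
B = {"Lizzy", "T.J.", "Johnny"}

print(A | B)   # union: {'Dad', 'Lizzy', 'T.J.', 'Johnny'}
print(A & B)   # intersection: {'Lizzy'}
print(A - B)   # partial complement: {'Dad'}
print(B - A)   # {'T.J.', 'Johnny'} -- not the same as A - B

# The total complement needs an explicit domain of discourse, Omega.
Omega = {"Dad", "Mom", "Lizzy", "T.J.", "Johnny"}
M = {"Dad", "T.J.", "Johnny"}  # the males
print(Omega - M)               # complement of M: {'Mom', 'Lizzy'}

# Cartesian product: a set of ordered pairs, guys always first.
guys = {"Bob", "Dave"}
girls = {"Jenny", "Gabrielle", "Tiffany"}
pairs = set(product(guys, girls))
print(len(pairs))              # 6, since |A x B| = |A| * |B| = 2 * 3
```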
## Laws of Combining Sets

There are a bunch of handy facts that arise when combining sets using the above operators. The important thing is that these are all easily seen just by thinking about them for a moment. Put another way, these aren’t facts to memorize; they’re facts to look at and see.
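A few of those look-and-see facts can even be checked mechanically. The following lines (again my own illustration, not the book’s) confirm the commutativity claims made earlier: union and intersection are indifferent to operand order, while the partial complement is not.

```python
A = {"Dad", "Lizzy"}
B = {"Lizzy", "T.J.", "Johnny"}

# Union and intersection are commutative: operand order never matters...
assert A | B == B | A
assert A & B == B & A

# ...but the partial complement is not: A - B and B - A differ here.
assert A - B != B - A
print("all three facts check out")
```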
##########
"""QUERY: Summarize in 5 sentences the contents of your context. Speak in Old English manner similar to how Shakespeare writes his sonnets."""
Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown. Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead!
Final Files Sources: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Pages 1–93, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Pages 1–27
==================================================
**Elapsed Time: 6.00 seconds**
==================================================
FINAL ANSWER
Answer: In this tome, we embark upon a journey through the realm of discrete mathematics, a field both abstract and profound, where numbers and logic intertwine in a dance of intellect. The preface doth set the stage, extolling the virtues of a practical approach to discrete math, tailored for the aspiring computer scientist, who needeth not the rigor of theorem proving but the utility of fundamental concepts.
We delve into the essence of sets, those collections of distinct objects, defined either by enumeration or by the properties their members share, and explore their operations such as union, intersection, and Cartesian product. The text doth also caution against the perils of self-referential sets, invoking the paradox of Russell, and elucidates the notion of cardinality, the measure of a set’s magnitude. Thus, with clarity and wit, the author guideth us through the foundational principles that underpin the vast landscape of discrete mathematics.
==================================================
**Elapsed Time: 0.00 seconds**
==================================================