{ "query": "You are a super intelligent assistant. Please answer all my questions precisely and comprehensively. \n You have a large Knowledge Base with all the information of the user. This is the initial message to start the chat. Based on the following summary/context you should formulate an initial message greeting the user with the following user name diverse NF SepTest tell them that you are the AI Chatbot Simon using the Large Language Model gemini-1.5-flash to answer all questions.\n Formulate the initial message in the Usersettings Language German\n Please use the following context to suggest some questions or topics to chat about this knowledge base. List at least 3-10 possible topics or suggestions up and use emojis. The chat should be professional and in business terms. At the end ask an open question what the user would like to check on the list \n\n You provided excerpts from three different books on mathematics and machine learning. Let's summarize each of them:\n\n**Book 1: \"A Cool Brisk Walk Through Discrete Mathematics\" by Stephen Davies**\n\nThis book provides a gentle introduction to various concepts in discrete mathematics, a branch of mathematics that deals with distinct, separate values. \n\n* **Chapter 2: Sets** discusses the fundamental concept of sets, collections of distinct objects, and how to define them extensionally (listing members) and intensionally (describing meaning). It also explains the concept of subsets and proper subsets.\n* **Chapter 3: Relations** delves into relations, which describe connections between elements of sets. It covers properties of relations like reflexivity, symmetry, antisymmetry, and transitivity. It also introduces partial orders and posets, which establish partial hierarchies within sets.\n* **Chapter 4: Probability** introduces probability measures, which quantify the likelihood of events. 
It discusses the frequentist and Bayesian interpretations of probability and explains concepts like sample spaces, events, conditional probability, and independence.\n* **Chapter 5: Structures** explores different data structures like graphs and trees. It explains how graphs can represent relations and covers different types of trees, including binary trees, binary search trees (BSTs), and their traversal methods.\n* **Chapter 6: Counting** focuses on combinatorial principles, including the Fundamental Theorem of Counting, permutations, and combinations. It explains factorials and binomial coefficients.\n* **Chapter 7: Numbers** revisits the concept of numbers and number systems, focusing on different bases (like base 7 and binary). It explains place value and two's complement representation for negative numbers in binary.\n* **Chapter 8: Logic** introduces propositional logic and predicate logic, which provide formal systems for reasoning about truth and falsehood. It covers truth tables, logical connectives, and quantifiers.\n* **Chapter 9: Proof** discusses the concept of mathematical proof, explaining direct and indirect proof methods. It emphasizes the importance of rigorous proof in establishing mathematical truth.\n\n**Book 2: \"A Brief Introduction to Machine Learning for Engineers\" by Osvaldo Simeone**\n\nThis book provides a concise introduction to machine learning concepts, focusing on probabilistic models and their applications in supervised and unsupervised learning.\n\n* **Chapter 1: Introduction** defines machine learning, discusses when it's appropriate to use, and outlines the book's goals.\n* **Chapter 2: A Gentle Introduction through Linear Regression** introduces supervised learning, inference, and the frequentist and Bayesian approaches to learning. 
It uses linear regression as a running example.\n* **Chapter 3: Probabilistic Models for Learning** covers the exponential family of probabilistic models, including Gaussian, Bernoulli, and Categorical distributions. It discusses frequentist learning, conjugate priors, and energy-based models.\n* **Chapter 4: Classification** focuses on classification problems, where the goal is to assign data points to discrete categories. It covers discriminative deterministic models (like SVM), discriminative probabilistic models (like logistic regression), and generative probabilistic models.\n* **Chapter 5: Statistical Learning Theory** provides a theoretical framework for understanding the relationship between data size and generalization performance. It covers PAC learnability, sample complexity, and structural risk minimization.\n* **Chapter 6: Unsupervised Learning** explores unsupervised learning, where the goal is to discover patterns in unlabeled data. It covers clustering (like K-means), dimensionality reduction (like PCA), and generative models (like GANs and RBMs).\n* **Chapter 7: Probabilistic Graphical Models** introduces Bayesian Networks (BNs) and Markov Random Fields (MRFs), which provide graphical representations of probabilistic models. It covers conditional independence, factorization, and inference in graphical models.\n* **Chapter 8: Approximate Inference and Learning** discusses approximate inference methods like Monte Carlo (MC) methods and Variational Inference (VI). 
It covers their application to learning and introduces techniques like REINFORCE and amortized VI.\n* **Chapter 9: Concluding Remarks** summarizes the book's key concepts and discusses additional topics like privacy, robustness, domain adaptation, communication-efficient learning, and reinforcement learning.\n\n**Book 3: \"A First Encounter with Machine Learning\" by Max Welling**\n\nThis book offers a beginner-friendly introduction to machine learning, covering various algorithms and concepts.\n\n* **Chapter 5: Nearest Neighbors Classification** introduces the k-nearest neighbors (kNN) algorithm, a simple classification method based on finding the most similar data points. It discusses the advantages and disadvantages of kNN.\n* **Chapter 6: Bayes Classifiers** covers Bayes classifiers, which use probability theory to make predictions. It explains how to compute class probabilities and make decisions based on them.\n* **Chapter 7: The Perceptron** introduces the perceptron, a linear classifier that learns a decision boundary by minimizing a cost function. It explains gradient descent and stochastic gradient descent for training the perceptron.\n* **Chapter 8: Support Vector Machines** discusses Support Vector Machines (SVMs), a powerful classification method that maximizes the margin between classes. It covers key concepts like hyperplanes, support vectors, and margins.\n* **Chapter 10: Kernel Ridge Regression** introduces kernel ridge regression, a regression method that uses kernel functions to model non-linear relationships. It discusses the lack of sparseness in kernel ridge regression compared to SVMs.\n* **Chapter 14: Kernel Canonical Correlation Analysis** covers kernel canonical correlation analysis (KCCA), a method for finding correlations between two sets of variables using kernel functions. 
It explains how to formulate and solve the KCCA problem.\n\nOverall, these excerpts provide a glimpse into the fascinating world of mathematics and machine learning, covering fundamental concepts, algorithms, and theoretical frameworks. They highlight the importance of probabilistic models, optimization techniques, and the trade-off between model complexity and generalization performance.", "namespace": "e791d701-717d-4c06-9602-2878acdbb6ad", "messages": [ { "role": "assistant", "content": "" } ], "stream": true, "language_level": "", "chat_channel": ":4041:1726667082838", "language": "German", "tone": "neutral", "writing_style": "standard", "model": "gemini-1.5-flash", "knowledgebase": "ki-dev-large", "seed": 1055, "client_id": 1726667082838, "all_context": false, "follow_up_for": null, "knowledgebase_files_count": 3, "override_command": null, "disable_clarity_check": false, "custom_primer": "", "logging": true, "query_route": "" } INITIALIZATION Knowledgebase: ki-dev-large Base Query: You are a super intelligent assistant. Please answer all my questions precisely and comprehensively. You have a large Knowledge Base with all the information of the user. This is the initial message to start the chat. Based on the following summary/context you should formulate an initial message greeting the user with the following user name diverse NF SepTest tell them that you are the AI Chatbot Simon using the Large Language Model gemini-1.5-flash to answer all questions. Formulate the initial message in the Usersettings Language German Please use the following context to suggest some questions or topics to chat about this knowledge base. List at least 3-10 possible topics or suggestions up and use emojis. The chat should be professional and in business terms. At the end ask an open question what the user would like to check on the list You provided excerpts from three different books on mathematics and machine learning. 
Let's summarize each of them: **Book 1: "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies** This book provides a gentle introduction to various concepts in discrete mathematics, a branch of mathematics that deals with distinct, separate values. * **Chapter 2: Sets** discusses the fundamental concept of sets, collections of distinct objects, and how to define them extensionally (listing members) and intensionally (describing meaning). It also explains the concept of subsets and proper subsets. * **Chapter 3: Relations** delves into relations, which describe connections between elements of sets. It covers properties of relations like reflexivity, symmetry, antisymmetry, and transitivity. It also introduces partial orders and posets, which establish partial hierarchies within sets. * **Chapter 4: Probability** introduces probability measures, which quantify the likelihood of events. It discusses the frequentist and Bayesian interpretations of probability and explains concepts like sample spaces, events, conditional probability, and independence. * **Chapter 5: Structures** explores different data structures like graphs and trees. It explains how graphs can represent relations and covers different types of trees, including binary trees, binary search trees (BSTs), and their traversal methods. * **Chapter 6: Counting** focuses on combinatorial principles, including the Fundamental Theorem of Counting, permutations, and combinations. It explains factorials and binomial coefficients. * **Chapter 7: Numbers** revisits the concept of numbers and number systems, focusing on different bases (like base 7 and binary). It explains place value and two's complement representation for negative numbers in binary. * **Chapter 8: Logic** introduces propositional logic and predicate logic, which provide formal systems for reasoning about truth and falsehood. It covers truth tables, logical connectives, and quantifiers. 
* **Chapter 9: Proof** discusses the concept of mathematical proof, explaining direct and indirect proof methods. It emphasizes the importance of rigorous proof in establishing mathematical truth. **Book 2: "A Brief Introduction to Machine Learning for Engineers" by Osvaldo Simeone** This book provides a concise introduction to machine learning concepts, focusing on probabilistic models and their applications in supervised and unsupervised learning. * **Chapter 1: Introduction** defines machine learning, discusses when it's appropriate to use, and outlines the book's goals. * **Chapter 2: A Gentle Introduction through Linear Regression** introduces supervised learning, inference, and the frequentist and Bayesian approaches to learning. It uses linear regression as a running example. * **Chapter 3: Probabilistic Models for Learning** covers the exponential family of probabilistic models, including Gaussian, Bernoulli, and Categorical distributions. It discusses frequentist learning, conjugate priors, and energy-based models. * **Chapter 4: Classification** focuses on classification problems, where the goal is to assign data points to discrete categories. It covers discriminative deterministic models (like SVM), discriminative probabilistic models (like logistic regression), and generative probabilistic models. * **Chapter 5: Statistical Learning Theory** provides a theoretical framework for understanding the relationship between data size and generalization performance. It covers PAC learnability, sample complexity, and structural risk minimization. * **Chapter 6: Unsupervised Learning** explores unsupervised learning, where the goal is to discover patterns in unlabeled data. It covers clustering (like K-means), dimensionality reduction (like PCA), and generative models (like GANs and RBMs). 
* **Chapter 7: Probabilistic Graphical Models** introduces Bayesian Networks (BNs) and Markov Random Fields (MRFs), which provide graphical representations of probabilistic models. It covers conditional independence, factorization, and inference in graphical models. * **Chapter 8: Approximate Inference and Learning** discusses approximate inference methods like Monte Carlo (MC) methods and Variational Inference (VI). It covers their application to learning and introduces techniques like REINFORCE and amortized VI. * **Chapter 9: Concluding Remarks** summarizes the book's key concepts and discusses additional topics like privacy, robustness, domain adaptation, communication-efficient learning, and reinforcement learning. **Book 3: "A First Encounter with Machine Learning" by Max Welling** This book offers a beginner-friendly introduction to machine learning, covering various algorithms and concepts. * **Chapter 5: Nearest Neighbors Classification** introduces the k-nearest neighbors (kNN) algorithm, a simple classification method based on finding the most similar data points. It discusses the advantages and disadvantages of kNN. * **Chapter 6: Bayes Classifiers** covers Bayes classifiers, which use probability theory to make predictions. It explains how to compute class probabilities and make decisions based on them. * **Chapter 7: The Perceptron** introduces the perceptron, a linear classifier that learns a decision boundary by minimizing a cost function. It explains gradient descent and stochastic gradient descent for training the perceptron. * **Chapter 8: Support Vector Machines** discusses Support Vector Machines (SVMs), a powerful classification method that maximizes the margin between classes. It covers key concepts like hyperplanes, support vectors, and margins. * **Chapter 10: Kernel Ridge Regression** introduces kernel ridge regression, a regression method that uses kernel functions to model non-linear relationships. 
It discusses the lack of sparseness in kernel ridge regression compared to SVMs. * **Chapter 14: Kernel Canonical Correlation Analysis** covers kernel canonical correlation analysis (KCCA), a method for finding correlations between two sets of variables using kernel functions. It explains how to formulate and solve the KCCA problem. Overall, these excerpts provide a glimpse into the fascinating world of mathematics and machine learning, covering fundamental concepts, algorithms, and theoretical frameworks. They highlight the importance of probabilistic models, optimization techniques, and the trade-off between model complexity and generalization performance. Model: gemini-1.5-flash Use Curl?: None ================================================== **Elapsed Time: 0.00 seconds** ================================================== ROUTING Query type: summary ================================================== **Elapsed Time: 1.66 seconds** ================================================== RAG PARAMETERS Max Context To Include: 100 Lowest Score to Consider: 0.1 ================================================== **Elapsed Time: 0.00 seconds** ================================================== VECTOR SEARCH ALGORITHM TO USE Use MMR search?: True Use Similarity search?: False ================================================== **Elapsed Time: 0.10 seconds** ================================================== VECTOR SEARCH DONE ================================================== **Elapsed Time: 6.41 seconds** ================================================== PRIMER Primer: You are Simon, a highly intelligent personal assistant in a system called KIOS. You are a chatbot that can read knowledgebases through the "CONTEXT" that is included in the user's chat message. Your role is to act as an expert at summarization and analysis. In your responses to enterprise users, prioritize clarity, trustworthiness, and appropriate formality. 
Be honest by admitting when a topic falls outside your scope of knowledge, and suggest alternative avenues for obtaining information when necessary. Make effective use of chat history to avoid redundancy and enhance response relevance, continuously adapting to integrate all necessary details in your interactions. Use as much tokens as possible to provide a detailed response. ================================================== **Elapsed Time: 0.18 seconds** ================================================== FINAL QUERY Final Query: CONTEXT: ########## File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 1 Context: # A First Encounter with Machine Learning **Max Welling** *Donald Bren School of Information and Computer Science* *University of California Irvine* November 4, 2011 #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 4 Context: ``` # CONTENTS 7.2 A Different Cost function: Logistic Regression .......................... 37 7.3 The Idea In a Nutshell .................................................. 38 # 8 Support Vector Machines ................................................ 39 8.1 The Non-Separable case ................................................. 43 # 9 Support Vector Regression .............................................. 47 # 10 Kernel ridge Regression ............................................... 51 10.1 Kernel Ridge Regression ............................................... 52 10.2 An alternative derivation ............................................. 53 # 11 Kernel K-means and Spectral Clustering ............................... 55 # 12 Kernel Principal Components Analysis ................................ 59 12.1 Centering Data in Feature Space ....................................... 61 # 13 Fisher Linear Discriminant Analysis .................................. 
63 13.1 Kernel Fisher LDA .................................................... 66 13.2 A Constrained Convex Programming Formulation of FDA ................. 68 # 14 Kernel Canonical Correlation Analysis ................................ 69 14.1 Kernel CCA ............................................................ 71 # A Essentials of Convex Optimization ..................................... 73 A.1 Lagrangians and all that ............................................... 73 # B Kernel Design .......................................................... 77 B.1 Polynomials Kernels .................................................... 77 B.2 All Subsets Kernel ...................................................... 78 B.3 The Gaussian Kernel ..................................................... 79 ``` #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 5 Context: # Preface In winter quarter 2007, I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at a graduate level, it soon became clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is built upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities, and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them to be too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction to the concepts of machine learning. A first read to whet the appetite so to speak, a prelude to the more technical and advanced textbooks. 
Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer. Machine learning is a relatively recent discipline that emerged from the general field of artificial intelligence. To build intelligent machines, researchers realized that these machines should learn from and adapt to their environment. It is simply too costly and impractical to design intelligent systems by first gathering all the expert knowledge ourselves and then hard-wiring it into a machine. For instance, after many years of intense research, we can now recognize faces in images to a high degree of accuracy. But the world has approximately 30,000 visual object categories according to some estimates (Biederman). Should we invest the same effort to build good classifiers for monkeys, chairs, pencils, axes, etc., or should we build systems that can observe millions of training images, some with labels (e.g., these pixels in the image correspond to a car) but most of them without side information? Although there currently is no system which can recognize even in the order of 1000 object categories (the best system can get #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 6 Context: # PREFACE about 60% correct on 100 categories), the fact that we pull it off seemingly effortlessly serves as a “proof of concept” that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data. The first reason for the recent successes of machine learning and the growth of the field as a whole is rooted in its multidisciplinary character. 
Machine learning emerged from AI but quickly incorporated ideas from fields as diverse as statistics, probability, computer science, information theory, convex optimization, control theory, cognitive science, theoretical neuroscience, physics, and more. To give an example, the main conference in this field is called **Advances in Neural Information Processing Systems**, referring to information theory, theoretical neuroscience, and cognitive science. The second, perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. While the field is built on theory and tools developed in statistics, machine learning recognizes that the most exciting progress can be made by leveraging the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text, and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data as all previous satellites produced together. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on much larger data-sets (the latter sometimes called **data-mining**). Hence the emphasis shifted to algorithmic efficiency, and as a result many machine learning faculty (like myself) can typically be found in computer science departments. To give some examples of recent successes of this approach one would only have to turn on a computer and perform an internet search. Modern search engines do not run terribly sophisticated algorithms, but they manage to store and sift through almost the entire content of the internet to return sensible search results. 
There has also been much success in the field of machine translation, not because a new model was invented but because many more translated documents became available. The field of machine learning is multifaceted and expanding fast. To sample a few sub-disciplines: statistical learning, kernel methods, graphical models, artificial neural networks, fuzzy logic, Bayesian methods, and so on. The field also covers many types of learning problems, such as supervised learning, unsupervised learning, semi-supervised learning, active learning, reinforcement learning, etc. I will only cover the most basic approaches in this book from a highly per #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 7 Context: # MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). I would like to thank Hans for discussion on intuition. I also thank Prof. Bert Kappen, who leads an excellent group of postdocs and students, for his hospitality. Marga, kids, UCI,... --- There are a few main aspects I want to cover from a personal perspective. Instead of trying to address every aspect of the entire field, I have chosen to present a few popular and perhaps useful tools and approaches. What will (hopefully) be significantly different from most other scientific books is the manner in which I will present these methods. I have always been frustrated by the lack of proper explanation of equations. Many times, I have been staring at a formula without the slightest clue where it came from or how it was derived. Many books also excel in stating facts in an almost encyclopedic style, without providing the proper intuition of the method. This is my primary mission: to write a book that conveys intuition. The first chapter will be devoted to why I think this is important. 
#################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 9 Context: # Learning and Intuition We have all experienced the situation that the solution to a problem presents itself while riding your bike, walking home, “relaxing” in the washroom, waking up in the morning, taking your shower, etc. Importantly, it did not appear while banging your head against the problem in a conscious effort to solve it, staring at the equations on a piece of paper. In fact, I would claim that all my bits and pieces of progress have occurred while taking a break and “relaxing out of the problem.” Greek philosophers walked in circles when thinking about a problem; most of us stare at a computer screen all day. The purpose of this chapter is to make you more aware of where your creative mind is located and to interact with it in a fruitful manner. My general thesis is that, contrary to popular belief, creative thinking is not performed by conscious thinking. It is rather an interplay between your conscious mind, which prepares the seeds to be planted into the unconscious part of your mind. The unconscious mind will munch on the problem “out of sight” and return promising roads to solutions to the consciousness. This process iterates until the conscious mind decides the problem is sufficiently solved, intractable, or plain dull and moves on to the next. It may be a little unsettling to learn that at least part of your thinking goes on in a part of your mind that seems inaccessible and has a very limited interface with what you think of as yourself. But it is undeniable that it is there, and it is also undeniable that it plays a role in the creative thought process. To become a creative thinker, one should learn to play this game more effectively. To do so, we should think about the language in which to represent knowledge that is most effective in terms of communication with the unconscious. 
In other words, what type of “interface” between conscious and unconscious mind should we use? It is probably not a good idea to memorize all the details of a complicated equation or problem. Instead, we should extract the abstract idea and capture the essence of it in a picture. This could be a movie with colors and other elements. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 11 Context: Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored, I recommend moving on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see [books](#)). But I hope that for most beginning students, this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun! #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 13 Context: # Chapter 1 ## Data and Information Data is everywhere in abundant amounts. Surveillance cameras continuously capture video; every time you make a phone call your name and location get recorded. Often, your clicking pattern is recorded when surfing the web. Most financial transactions are recorded; satellites and observatories generate terabytes of data every year; the FBI maintains a DNA database of most convicted criminals. Soon, all written text from our libraries is digitized. Need I go on? But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? 
When analyzing data, we typically have a specific question in mind, such as: - “How many types of car can be discerned in this video?” - “What will be the weather next week?” The answer can take the form of a single number (there are 5 cars), a sequence of numbers (the temperature next week), or a complicated pattern (the cloud configuration next week). If the answer to our query is complex, we like to visualize it using graphs, bar plots, or even little movies. One should keep in mind that the particular analysis depends on the task one has in mind. Let me spell out a few tasks that are typically considered in machine learning: ### Prediction Here we ask ourselves whether we can extrapolate the information in the data to new use cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold, etc., and another database of attributes for Ferraris, then we can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather given all the recorded weather patterns in the past: can we predict the weather next week, or the stock prices? #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 14 Context: # CHAPTER 1. DATA AND INFORMATION ## Interpretation Here we seek to answer questions about the data. For instance, what property of this drug was responsible for its high success rate? Does a security officer at the airport apply racial profiling in deciding whose luggage to check? How many natural groups are there in the data? ## Compression Here we are interested in compressing the original data, i.e., reducing the number of bits needed to represent it. For instance, files in your computer can be "zipped" to a much smaller size by removing much of the redundancy in those files. Also, JPEG and GIF (among others) are compressed representations of the original pixel map. 
All of the above objectives depend on the fact that there is structure in the data. If data is completely random, there is nothing to predict, nothing to interpret, and nothing to compress. Hence, all tasks are somehow related to discovering or leveraging this structure. One could say that data is highly redundant and that this redundancy is exactly what makes it interesting. Take the example of natural images. If you are required to predict the color of the pixels neighboring some random pixel in an image, you would be able to do a pretty good job (for instance, 20% may be blue sky and predicting the neighbors of a blue sky pixel is easy). Also, if we would generate images at random, they would not look like natural scenes at all. For one, it wouldn’t contain objects. Only a tiny fraction of all possible images looks "natural," and so the space of natural images is highly structured. Thus, all of these concepts are intimately related: structure, redundancy, predictability, regularity, interpretability, compressibility. They refer to the "food" for machine learning. Without structure, there is nothing to learn. The same thing is true for human learning. From the day we are born, we start noticing that there is structure in this world. Our survival depends on discovering and recording this structure. If I walk into this brown cylinder with a green canopy, I suddenly stop; it won’t give way. In fact, it damages my body. Perhaps this holds for all these objects. When I cry, my mother suddenly appears. Our game is to predict the future accurately, and we predict it by learning its structure. ## 1.1 Data Representation What does "data" look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance, it could be words from a document or pixels from an image. 
But it will be useful to convert data into a #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 16 Context: # CHAPTER 1. DATA AND INFORMATION ## 1.2 Preprocessing the Data As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in Figure 1. The algorithm we consider consists of estimating the area that the data occupy. It grows a circle starting at the origin, and at the point where it contains all the data we record the area of the circle. The figure shows why this will be a bad estimate: the data-cloud is not centered. If we had first centered it, we would have obtained a reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data. To center data we will introduce the sample mean of the data, given by, \[ E[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in} \tag{1.1} \] Hence, for every attribute $i$ separately, we simply add all the attribute values across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set, \[ X'_{in} = X_{in} - E[X_i], \quad \forall n \tag{1.2} \] It is now easy to check that the sample mean of $X'$ indeed vanishes. An illustration of the global shift is given in Figure 2. We also see in this figure that the algorithm described above now works much better! In a similar spirit as centering, we may also wish to scale the data along the coordinate axes in order to make it more “spherical.” Consider Figure 3. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension.
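As a minimal sketch of equations (1.1) and (1.2), the following pure-Python snippet centers a small dataset one attribute at a time. The helper names `sample_mean` and `center` and the toy numbers are illustrative assumptions, not part of the original text.

```python
# Hypothetical toy data: data[i][n] is attribute i of data-case n.
data = [
    [2.0, 4.0, 6.0, 8.0],  # attribute 1
    [1.0, 1.0, 3.0, 3.0],  # attribute 2
]

def sample_mean(values):
    # Eq. (1.1): add the attribute values over all data-cases, divide by N.
    return sum(values) / len(values)

def center(values):
    # Eq. (1.2): subtract the sample mean from every data-case.
    mean = sample_mean(values)
    return [v - mean for v in values]

centered = [center(row) for row in data]
means_after = [sample_mean(row) for row in centered]
print(centered)
print(means_after)
```

After centering, the sample mean of every attribute is zero, which is the property the circle-growing area estimate relies on.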
To define this operation we first introduce the notion of sample variance, \[ V[X_i] = \frac{1}{N} \sum_{n=1}^{N} X_{in}^2 \tag{1.3} \] where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative signs might cancel each other out. By first taking the square, all data-cases first get mapped to the positive half of the axes (for each dimension or #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 20 Context: # CHAPTER 2. DATA VISUALIZATION An example of such a scatter plot is given in Figure ??. Note that we have a total of $ d(d - 1) / 2 $ possible two-dimensional projections, which amounts to 4950 projections for 100-dimensional data. This is usually too many to inspect manually. How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out to be not a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the **central limit theorem**, which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote by $ \mathbf{w} $ a vector in $ \mathbb{R}^d $ and by $ \mathbf{x} $ the d-dimensional random variable, then $ y = \mathbf{w}^T \mathbf{x} $ is the value of the projection. This is clearly a weighted sum of the random variables $ x_i, \; i = 1, \ldots, d $. If we assume that the $ x_i $ are approximately independent, then we can see that their sum will be governed by the central limit theorem.
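The effect described here can be checked numerically. The sketch below is only an illustration under assumed settings (uniformly distributed attributes, 2000 data-cases, a Gaussian random direction, and excess kurtosis as a crude measure of non-Gaussianity); none of these choices come from the original text.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

d = 100         # number of dimensions, as in the example in the text
n_cases = 2000  # hypothetical number of data-cases

# Number of distinct two-dimensional (axis-pair) projections: d(d - 1) / 2.
num_pairs = d * (d - 1) // 2

# Clearly non-Gaussian data: every attribute is uniform on [-1, 1].
data = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n_cases)]

# A random projection direction w and the projected values y = w^T x.
w = [random.gauss(0.0, 1.0) for _ in range(d)]
projected = [sum(wi * xi for wi, xi in zip(w, x)) for x in data]

def excess_kurtosis(values):
    # Zero for a Gaussian, about -1.2 for a uniform distribution.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

kurt_single = excess_kurtosis([x[0] for x in data])  # one raw attribute
kurt_projected = excess_kurtosis(projected)          # random projection
print(num_pairs, round(kurt_single, 2), round(kurt_projected, 2))
```

A single raw attribute keeps its clearly non-Gaussian shape, while the random projection has excess kurtosis close to zero, that is, it looks approximately Gaussian.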
Analogously, a dataset $ \{X_n\} $ can thus be visualized in one dimension by "histogramming" the values of $ y = \mathbf{w}^T \mathbf{x} $, see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data. In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions and the Gaussian distribution is one of the few distributions which have very nice analytical properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of "uninformative" can actually be made very precise using information theory and states: > Given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance. This is rather unfortunate for our purposes because Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure. A large number of algorithms have been devised to search for informative projections, the simplest being "principal component analysis" or PCA for short ??. Here, interesting means dimensions of high variance. However, high variance is not always a good measure of interestingness and one should rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimen- #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 23 Context: # Chapter 3 ## Learning This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter. Ok, let’s start with an example. Alice has a rather strange ailment.
She is not able to recognize objects by their visual appearance. At home, she is doing just fine: her mother explained to Alice for every object in her house what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room, she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting." What is the matter with Alice? Nothing is wrong with her memory because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object, she reasons that the object in front of her is surely not a chair because it doesn’t have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice’s disease is so rare is understandable; there must have been a strong selection pressure against this disease. Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to infer logically that this animal may be dangerous. Alice’s contemporaries noticed that the animal was yellow-brown, had a mane, etc., and immediately un- #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 24 Context: # CHAPTER 3. LEARNING They understood that this was a lion. They understood that all lions have these particular characteristics in common, but may differ in some other ones (like the presence of a scar someplace).
Bob has another disease which is called over-generalization. Once he has seen an object he believes almost everything is some, perhaps twisted, instance of the same object class (In fact, I seem to suffer from this every now and then when I think all of machine learning can be explained by this one new exciting principle). If ancestral Bob walks the savanna and he has just encountered an instance of a lion and fled into a tree with his buddies, the next time he sees a squirrel he believes it is a small instance of a dangerous lion and flees into the trees again. Over-generalization seems to be rather common among small children. One of the main conclusions from this discussion is that we should neither over-generalize nor over-fit. We need to be on the edge of being just right. But just right about what? It doesn’t seem there is one correct God-given definition of the category chairs. We seem to all agree, but one can surely find examples that would be difficult to classify. When do we generalize exactly right? The magic word is **PREDICTION**. From an evolutionary standpoint, all we have to do is make correct predictions about aspects of life that help us survive. Nobody really cares about the definition of lion, but we do care about our responses to the various animals (run away for lion, chase for deer). And there are a lot of things that can be predicted in the world. This food kills me but that food is good for me. Drumming my fists on my hairy chest in front of a female generates opportunities for sex; sticking my hand into that yellow-orange flickering “flame” hurts my hand and so on. The world is wonderfully predictable and we are very good at predicting it. So why do we care about object categories in the first place? Well, apparently they help us organize the world and make accurate predictions. The category *lions* is an abstraction and abstractions help us to generalize.
In a certain sense, learning is all about finding useful abstractions or concepts that describe the world. Take the concept “fluid”; it describes all watery substances and summarizes some of their physical properties. Or the concept of “weight”: an abstraction that describes a certain property of objects. Here is one very important corollary for you: > “machine learning is not in the business of remembering and regurgitating observed information, it is in the business of transferring (generalizing) properties from observed data onto new, yet unobserved data.” This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the final exam). The information we receive from the world has two components to it: there #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 25 Context: # Information Theory and Image Compression The part of the information that does not carry over to the future, the unpredictable information, is called "noise." There is also the information that is predictable, the learnable part of the information stream. The task of any learning algorithm is to separate the predictable part from the unpredictable part. Now imagine Bob wants to send an image to Alice. He has to pay 1 dollar cent for every bit that he sends. If the image were completely white, it would be really stupid of Bob to send the message: ``` pixel 1: white, pixel 2: white, pixel 3: white, .... ``` He could just have sent the message: ``` all pixels are white! ``` The blank image is completely predictable but carries very little information. Now imagine an image that consists of white noise (your television screen if the cable is not connected). To send the exact image Bob will have to send: ``` pixel 1: white, pixel 2: black, pixel 3: black, ....
``` Bob cannot do better because there is no predictable information in that image, i.e., there is no structure to be modeled. You can imagine playing a game and revealing one pixel at a time to someone and paying him 15 for every next pixel he predicts correctly. For the white image you can do this perfectly; for the noisy picture you would be guessing at random. Real pictures are in between: some pixels are very hard to predict, while others are easier. To compress the image, Bob can extract rules such as: always predict the same color as the majority of the pixels next to you, except when there is an edge. These rules constitute the model for the regularities of the image. Instead of sending the entire image pixel by pixel, Bob will now first send his rules and ask Alice to apply the rules. Every time the rule fails, Bob also sends a correction: ``` pixel 103: white, pixel 245: black. ``` A few rules and two corrections are obviously cheaper than 256 pixel values and no rules. There is one fundamental tradeoff hidden in this game. Since Bob is sending only a single image, it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all pixel values. If he were sending 1 billion images it would pay off to first send the complicated model because he would be saving a fraction of all bits for every image. On the other hand, if Bob wants to send 2 pixels, there really is no need to send a model whatsoever. Therefore, the size of Bob's model depends on the amount of data he wants to transmit. In this context, the boundary between what is model and what is noise depends on how much data we are dealing with! If we use a model that is too complex, we overfit to the data at hand, i.e., part of the model represents noise. On the other hand, if we use a model that is too simple we "underfit" (over-generalize) and valuable structure remains unmodeled. Both lead to sub-optimal compression of the image.
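The trade-off in Bob's game can be made concrete with a toy cost calculation. The sketch below is an illustration only: it assumes a one-dimensional black/white image, a single rule (predict the previous pixel), a made-up price of 16 bits for describing that rule, and corrections that cost one pixel index each. The function names `raw_cost` and `model_cost` are hypothetical.

```python
import math
import random

def raw_cost(pixels):
    # Sending every black/white pixel separately costs one bit per pixel.
    return len(pixels)

def model_cost(pixels, model_bits=16):
    # Rule (the model): predict each pixel to equal the previous pixel;
    # the first pixel is predicted to be white (0).
    # Every failed prediction needs a correction naming the pixel index.
    # model_bits is an assumed price for describing the rule itself.
    index_bits = max(1, math.ceil(math.log2(len(pixels))))
    corrections = 0
    previous = 0
    for p in pixels:
        if p != previous:
            corrections += 1
        previous = p
    return model_bits + corrections * index_bits

random.seed(1)
white = [0] * 256                                   # fully predictable
edges = [0] * 100 + [1] * 56 + [0] * 100            # structure with two edges
noise = [random.randint(0, 1) for _ in range(256)]  # no structure at all
tiny = [0, 1]                                       # only two pixels

for name, image in (('white', white), ('edges', edges),
                    ('noise', noise), ('tiny', tiny)):
    print(name, raw_cost(image), model_cost(image))
```

Under these assumptions the rule pays off for the structured images, while for the noise image and for the two-pixel image it costs more than simply sending the pixels, in line with the argument in the text.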
The compression game can therefore be used to find the right size of model complexity for a given dataset. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 26 Context: # CHAPTER 3. LEARNING Now let’s think for a moment what we really mean by “a model”. A model represents our prior knowledge of the world. It imposes structure that is not necessarily present in the data. We call this the **inductive bias**. Our inductive bias often comes in the form of a parameterized model. That is to say, we define a family of models but let the data determine which of these models is most appropriate. A strong inductive bias means that we don’t leave flexibility in the model for the data to work on. We are so convinced of ourselves that we basically ignore the data. The downside is that if we are wrong, we create a “bad bias” towards the wrong model. On the other hand, if we are correct, we can learn the remaining degrees of freedom in our model from very few data-cases. Conversely, we may leave the door open for a huge family of possible models. If we now let the data zoom in on the model that best explains the training data it will overfit to the peculiarities of that data. Now imagine you sampled 10 datasets of the same size $N$ and trained these very flexible models separately on each of these datasets (note that in reality you only have access to one such dataset but please play along in this thought experiment). Let’s say we want to determine the value of some parameter $\theta$. Because the models are so flexible, we can actually model the idiosyncrasies of each dataset. The result is that the value for $\theta$ is likely to be very different for each dataset. But because we didn’t impose much inductive bias the average of many such estimates will be about right. We say that the bias is small, but the variance is high.
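The thought experiment can be simulated directly. The following sketch rests on assumed choices that are not in the original text: a true value $\theta = 2$, Gaussian data, the sample mean as the flexible estimator, and a restrictive estimator that stays close to a (deliberately wrong) prior guess of 0.

```python
import random
import statistics

random.seed(0)

TRUE_THETA = 2.0   # hypothetical true parameter value
N = 5              # data-cases per dataset
NUM_DATASETS = 10  # as in the thought experiment

def flexible_estimate(dataset):
    # Little inductive bias: trust the data completely (sample mean).
    return sum(dataset) / len(dataset)

def restrictive_estimate(dataset, prior_guess=0.0, trust_in_data=0.2):
    # Strong inductive bias: move only a little away from the prior guess.
    return prior_guess + trust_in_data * (flexible_estimate(dataset) - prior_guess)

datasets = [[random.gauss(TRUE_THETA, 3.0) for _ in range(N)]
            for _ in range(NUM_DATASETS)]

flexible = [flexible_estimate(ds) for ds in datasets]
restrictive = [restrictive_estimate(ds) for ds in datasets]

for name, estimates in (('flexible', flexible), ('restrictive', restrictive)):
    bias = statistics.mean(estimates) - TRUE_THETA
    spread = statistics.pstdev(estimates)
    print(name, round(bias, 2), round(spread, 2))
```

Across the ten datasets the flexible estimates scatter widely around the true value, whereas the restrictive estimates agree closely with one another but sit systematically far from it, because the prior guess was wrong.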
In the case of very restrictive models the opposite happens: the bias is potentially large but the variance small. Note that not only is a large bias bad (for obvious reasons), a large variance is bad as well: because we only have one dataset of size $N$, our estimate could be very far off simply because we were unlucky with the dataset we were given. What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. If we don’t trust our prior knowledge we should let the data speak. However, letting the data speak too much might lead to overfitting, so we need to find the boundary between too complex and too simple a model and get its complexity just right. Access to more data means that the data can speak more relative to prior knowledge. That, in a nutshell, is what machine learning is all about. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 27 Context: # 3.1 In a Nutshell Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of the model can be determined by playing a compression game. Learning = generalization = abstraction = compression. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 28 Context: # CHAPTER 3. LEARNING ## Introduction Learning involves acquiring knowledge, skills, attitudes, or competencies. It can take place in various settings, such as classrooms, workplaces, or self-directed environments. This chapter discusses the types and processes of learning. ## Types of Learning 1.
**Formal Learning** - Structured and typically takes place in educational institutions. - Includes degrees, diplomas, and certifications. 2. **Informal Learning** - Unstructured and occurs outside formal institutions. - Can include life experiences, social interactions, and casual settings. 3. **Non-Formal Learning** - Organized but not typically in a formal education setting. - Often community-based, such as workshops and training sessions. ## Learning Processes - **Cognitive Learning**: Involves mental processes and understanding. - **Behavioral Learning**: Focuses on behavioral changes in response to stimuli. - **Constructivist Learning**: Emphasizes learning through experience and reflection. ## Table of Learning Theories | Theory | Key Contributor | Description | |--------------------------|----------------------|--------------------------------------------------| | Behaviorism | B.F. Skinner | Learning as a change in behavior due to reinforcement. | | Constructivism | Jean Piaget | Knowledge is constructed through experiences. | | Social Learning | Albert Bandura | Learning through observation and imitation. | ## Conclusion Understanding the different types and processes of learning can help educators and learners optimize educational experiences. Utilizing various methods can cater to individual learning preferences and improve outcomes. ## References - Smith, J. (2020). *Learning Theories: A Comprehensive Approach*. Educational Press. - Doe, A. (2019). *Techniques for Effective Learning*. Learning Publishers. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 29 Context: # Chapter 4 ## Types of Machine Learning We now will turn our attention and discuss some learning problems that we will encounter in this book. The most well-studied problem in ML is that of **supervised learning**. To explain this, let’s first look at an example. 
Bob wants to learn how to distinguish between bobcats and mountain lions. He types these words into Google Image Search and closely studies all catlike images of bobcats on the one hand and mountain lions on the other. Some months later on a hiking trip in the San Bernardino mountains he sees a big cat... The data that Bob collected was labelled because Google is supposed to only return pictures of bobcats when you search for the word "bobcat" (and similarly for mountain lions). Let’s call the images $ X_1, \ldots, X_n $, and the labels $ Y_1, \ldots, Y_n $. Note that $ X_i $ are much higher dimensional objects because they represent all the information extracted from the image (approximately 1 million pixel color values), while $ Y_i $ is simply −1 or 1 depending on how we choose to label our classes. So, that would be a ratio of about 1 million to 1 in terms of information content! The classification problem can usually be posed as finding (a.k.a. learning) a function $ f(x) $ that approximates the correct class labels for any input $ x $. For instance, we may decide that $ \text{sign}(f(x)) $ is the predictor for our class label. In the following, we will be studying quite a few of these classification algorithms. There is also a different family of learning problems known as **unsupervised learning** problems. In this case there are no labels $ Y $ involved, just the features $ X $. Our task is not to classify, but to organize the data, or to discover the structure in the data. This may be very useful for visualization data, compressing data, or organizing data for easy accessibility. Extracting structure in data often leads to the discovery of concepts, topics, abstractions, factors, causes, and more such terms that all really mean the same thing. These are the underlying semantic structures. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 30 Context: # CHAPTER 4. 
TYPES OF MACHINE LEARNING Factors that can explain the data. Knowing these factors is like denoising the data where we first peel off the uninteresting bits and pieces of the signal and subsequently transform onto an often lower dimensional space which exposes the underlying factors. There are two dominant classes of unsupervised learning algorithms: clustering-based algorithms assume that the data organizes into groups. Finding these groups is then the task of the ML algorithm and the identity of the group is the semantic factor. Another class of algorithms strives to project the data onto a lower dimensional space. This mapping can be nonlinear, but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Unrolling that manifold is then the task of the learning algorithm. In this case, the dimensions should be interpreted as semantic factors. There are many variations on the above themes. For instance, one is often confronted with a situation where you have access to many more unlabeled data (only $X_i$) and many fewer labeled instances (both $(X_i, Y_i)$). Take the task of classifying news articles by topic (weather, sports, national news, international etc.). Some people may have labeled some news-articles by hand but there won’t be all that many of those. However, we do have a very large digital library of scanned newspapers available. Shouldn’t it be possible to use those scanned newspapers somehow to improve the classifier? Imagine that the data naturally clusters into well-separated groups (for instance because news articles reporting on different topics use very different words). This is depicted in Figure ??. Note that there are only very few cases which have labels attached to them. From this figure it becomes clear that the expected optimal decision boundary nicely separates these clusters. 
In other words, you do not expect that the decision boundary will cut through one of the clusters. Yet that is exactly what would happen if you only used the labeled data. Hence, by simply requiring that decision boundaries do not cut through regions of high probability we can improve our classifier. The subfield that studies how to improve classification algorithms using unlabeled data goes under the name “semi-supervised learning”. A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food. While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). It’s only at the end when he reaches the cheese that he receives his positive feedback, and he will have to use this to reinforce his perhaps random earlier decisions that led him to the cheese. These problems are generally known as reinforcement learning. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 31 Context: fall under the name "reinforcement learning". It is a very general setup in which almost all known cases of machine learning can be cast, but this generality also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments.
The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is its biological relevance. Do we not also have to figure out how the world works and survive in it? Let's go back to the news-articles. Assume we have control over what article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze. Given that he needs to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning". One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don’t have access to any movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it has hopefully become clearer that we are trying to learn models for many different yet related problems and that we can build better models if we share some of the things learned for one task with the other ones.
The trick is not to share too much nor too little and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning: "multi-task learning". #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 34 Context: # CHAPTER 5. NEAREST NEIGHBORS CLASSIFICATION To figure out the label of a test case, we simply look around and see what labels our neighbors have. Asking your closest neighbor is like betting all your money on a single piece of advice and you might get really unlucky if your closest neighbor happens to be an odd one out. It’s typically better to ask several opinions before making your decision. However, if you ask around too much you will be forced to ask advice from data cases that are no longer very similar to you. So there is some optimal number of neighbors to ask, which may be different for every problem. Determining this optimal number of neighbors is not easy, but we can again use cross validation (see section ??) to estimate it. ## What is Good and Bad about KNN? First, its simplicity makes it attractive. Very few assumptions about the data are used in the classification process. This property can also be a disadvantage: if you have prior knowledge about how the data was generated, it’s better to use it, because less information has to be extracted from the data. A second consideration is computation time and memory efficiency. Assume you have a very large dataset, but you need to make decisions very quickly. As an example, consider surfing the web pages at Amazon.com. Whenever you search for a book, it likes to suggest 10 others. To do that it could classify books into categories and suggest the top ranked in that category. KNN requires Amazon to store all features of all books at a location that is accessible for fast computation.
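The neighbor-voting idea can be sketched in a few lines. This is a minimal illustration under assumptions of our own (Euclidean distance, majority vote, a hypothetical function name `knn_predict`, and made-up toy data containing one odd-one-out label); it is not the book's implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Rank all stored data-cases by Euclidean distance to the query...
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # ...and let the k nearest neighbors vote on the label.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data with one odd one out: (1.2, 1.1) carries label 'B'
# although it sits in the middle of the 'A' group.
points = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4), (1.2, 1.1),
          (5.0, 5.0), (5.5, 4.8), (4.9, 5.3)]
labels = ['A', 'A', 'A', 'B', 'B', 'B', 'B']

query = (1.25, 1.1)
print(knn_predict(points, labels, query, k=1))
print(knn_predict(points, labels, query, k=3))
```

With a single neighbor the odd one out decides the label, while asking three neighbors outvotes it. Note that every prediction sorts the full training set, which is exactly the storage and search cost discussed in the text.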
Moreover, to classify a new case KNN has to redo the neighborhood search every time. Clearly, there are tricks that can be played with smart indexing, but wouldn’t it be much easier if we could have summarized all books by a simple classification function $ f(X) $, that “spits out” a class for any combination of features $ X $? The distinction between algorithms/models that summarize the data and those that require remembering every data-item is often called “parametric” versus “non-parametric”. It’s important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function $ f_\theta(X) $, or whether we retain all the data to do the classification “on the fly”. KNN is also known to suffer from the “curse of high dimensions”. If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then KNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100 dimensional space. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 38 Context: # CHAPTER 6. THE NAIVE BAYESIAN CLASSIFIER An example of the traffic that it generates: the University of California Irvine receives on the order of 2 million spam emails a day. Fortunately, the bulk of these emails (approximately 97%) is filtered out or dumped into your spam box and will not reach your attention. How is this done? Well, it turns out to be a classic example of a classification problem: spam or ham, that’s the question.
Let’s say that spam will receive a label 1 and ham a label 0. Our task is thus to label each new email with either 0 or 1. What are the attributes? Rephrasing this question, what would you measure in an email to see if it is spam? Certainly, if I read “viagra” in the subject I would stop right there and dump it in the spam box. What else? Here are a few: “enlargement,” “cheap,” “buy,” “pharmacy,” “money,” “loan,” “mortgage,” “credit,” and so on. We can build a dictionary of words that we can detect in each email. This dictionary could also include word phrases such as “buy now” and “penis enlargement”; one can make phrases as sophisticated as necessary. One could measure whether the words or phrases appear at least once or one could count the actual number of times they appear. Spammers know about the way these spam filters work and counteract by slight misspellings of certain key words. Hence we might also want to detect words like “via gra” and so on. In fact, a small arms race has ensued where spam filters and spam generators find tricks to counteract the tricks of the “opponent.” Putting all these subtleties aside for a moment we’ll simply assume that we measure a number of these attributes for every email in a dataset. We’ll also assume that we have spam/ham labels for these emails, which were acquired by someone removing spam emails by hand from his/her inbox. Our task is then to train a predictor for spam/ham labels for future emails where we have access to attributes but not to labels. The NB model is what we call a "generative" model. This means that we imagine how the data was generated in an abstract sense. For emails, this works as follows: an imaginary entity first decides how many spam and ham emails it will generate on a daily basis. Say, it decides to generate 40% spam and 60% ham. We will assume this doesn’t change with time (of course it does, but we will make this simplifying assumption for now).
It will then decide what the chance is that a certain word appears $ x $ times in a spam email. For example, the word “viagra” has a chance of 96% to not appear at all, 1% to appear once, 0.9% to appear twice, etc. These probabilities are clearly different for spam and ham. “Viagra” should have a much smaller probability to appear in a ham email (but of course; consider I send this text to my publisher by email). Given these probabilities, we can then go on and try to generate emails that actually look like real emails, i.e., with proper sentences, but we won’t need that in the following. Instead we make the simplifying assumption that email consists of “a bag of words,” in random order. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 46 Context: # CHAPTER 7. THE PERCEPTRON I like to warn the reader at this point that more features is not necessarily a good thing if the new features are uninformative for the classification task at hand. The problem is that they introduce noise in the input that can mask the actual signal (i.e. the good, discriminative features). In fact, there is a whole subfield of ML that deals with selecting relevant features from a set that is too large. The problem of too many dimensions is sometimes called "the curse of high dimensionality." Another way of seeing this is that more dimensions often lead to more parameters in the model (as in the case for the perceptron) and can hence lead to overfitting. To combat that in turn we can add regularizers as we will see in the following. With the introduction of regularizers, we can sometimes play magic and use an infinite number of features. How we play this magic will be explained when we will discuss kernel methods in the next sections. But let us first start simple with the perceptron. ## 7.1 The Perceptron Model Our assumption is that a line can separate the two classes of interest. 
To make our life a little easier we will switch to the $ Y = \{+1, -1\} $ representation. With this, we can express the condition mathematically as: \[ y_n \approx \text{sign}\left(\sum_k w_k x_{kn} - \alpha\right) \tag{7.1} \] where "sign" is the sign-function (+1 for nonnegative reals and -1 for negative reals). We have introduced $ K + 1 $ parameters $ \{w_1, \ldots, w_K, \alpha\} $ which define the line for us. The vector $ w $ represents the direction orthogonal to the decision boundary depicted in Figure ??. For example, a line through the origin is represented by $ w^T x = 0 $, i.e., all vectors $ x $ with a vanishing inner product with $ w $. The scalar quantity $ \alpha $ represents the offset of the line $ w^T x = \alpha $ from the origin, i.e., (for a unit-length $ w $) the shortest distance from the origin to the line. This can be seen by writing the points on the line as $ x = y + v $ where $ y $ is a fixed vector pointing to an arbitrary point on the line and $ v $ is the vector on the line starting at $ y $ (see Figure ??). Hence, $ w^T (y + v) - \alpha = 0 $. Since by definition $ w^T v = 0 $, we find $ w^T y = \alpha $, which means that $ \alpha $ is the projection of $ y $ onto $ w $, which is the shortest distance from the origin to the line. 1. Note that we can replace $ x_k \rightarrow \phi_k(x) $ but for the sake of simplicity we will refrain from doing so at this point. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 58 Context: # CHAPTER 8: SUPPORT VECTOR MACHINES ## Introduction Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection. ## Key Concepts 1. **Hyperplane**: A hyperplane is a decision boundary that helps to categorize data points. 2. **Support Vectors**: Support vectors are the data points that are closest to the hyperplane and influence its position and orientation. 3.
**Margin**: The margin is the distance between the hyperplane and the nearest data point from either class. SVM aims to maximize the margin. ## Implementation Steps - **Step 1**: Choose the appropriate kernel (linear, polynomial, RBF). - **Step 2**: Train the SVM model using the training dataset. - **Step 3**: Evaluate the model using a test dataset. ## Advantages of SVM - Effective in high-dimensional spaces. - Robust against overfitting, especially in high-dimensional datasets. ## Disadvantages of SVM - Less effective on very large datasets. - Poor performance with overlapping classes. ## Conclusion Support Vector Machines are powerful tools for classification and regression tasks, offering advantages in high-dimensional spaces while having limitations in very large datasets. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 70 Context: # CHAPTER 11: KERNEL K-MEANS AND SPECTRAL CLUSTERING ## 11.1 Introduction Kernel K-means is an extension of the K-means clustering algorithm. It allows for clustering in a feature space that is not linearly separable, using a kernel function to transform the original data. ## 11.2 Algorithm The steps of the Kernel K-means algorithm are as follows: 1. **Choose a kernel function** $ K(x_i, x_j) $. 2. **Initialize** $ k $ centroids randomly. 3. **Assign points** to the nearest centroid using the kernel. 4. **Update centroids** by computing the mean of the assigned points in the feature space. 5. **Repeat** steps 3 and 4 until convergence. ## 11.3 Spectral Clustering Spectral clustering is another clustering technique that uses the eigenvalues of the similarity matrix of the data. ### 11.3.1 Algorithm 1. Compute the **similarity matrix**. 2. Compute the **Laplacian matrix** from the similarity matrix. 3. Compute the eigenvalues and eigenvectors of the Laplacian matrix. 4. Use the top $ k $ eigenvectors to form a new feature space. 5. 
Apply K-means to cluster the data in the new space. ## 11.4 Conclusion Kernel K-means and spectral clustering provide powerful methods for clustering in complex spaces, enabling better data segmentation in many applications. For further reading, explore the works on kernel methods and spectral analysis in clustering contexts. #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 80 Context: # CHAPTER 13. FISHER LINEAR DISCRIMINANT ANALYSIS ## Introduction Fisher Linear Discriminant Analysis (LDA) is a supervised learning method used mainly for classification. It projects data onto a lower-dimensional space while maximizing class separability. ## Key Concepts 1. **Discriminant Function** The goal is to find a linear combination of features that separates two or more classes of objects or events. 2. **LDA vs. PCA** - LDA focuses on maximizing the separation between multiple classes, while Principal Component Analysis (PCA) focuses on variance. 3. **Regularization** Techniques can be applied to improve results when the number of features is high compared to the number of samples. ## Mathematical Formulation Given a set of classes $C_1, C_2, \ldots, C_k$, let $ \mu_i $ be the mean of class $ C_i $ and $ S_w $ and $ S_b $ the within-class and between-class scatter matrices, respectively. The LDA objective is to maximize: \[ J(w) = \frac{w^T S_b w}{w^T S_w w} \] where $w$ is the projection vector. ## Implementation Steps 1. Calculate the mean vectors for each class. 2. Compute the scatter matrices $ S_w $ and $ S_b $. 3. Solve the generalized eigenvalue problem for the matrix $ S_w^{-1}S_b $. 4. Choose the top eigenvectors to form a new feature space.
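The implementation steps above can be sketched in plain Python for the simplest case of two classes in two dimensions. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`mean`, `scatter`, `fisher_direction`, `project`) and the toy data are invented for this example. It relies on the standard fact that, for two classes, the leading eigenvector of $ S_w^{-1} S_b $ is proportional to $ S_w^{-1}(\mu_1 - \mu_2) $, so no eigen-solver is needed.

```python
# Hypothetical two-class, 2-D sketch of Fisher LDA (names and data are illustrative).

def mean(points):
    """Step 1: mean vector of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """2x2 scatter matrix: sum of outer products of the centered points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Steps 2-4 for two classes: unit vector along S_w^{-1} (mu_a - mu_b)."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    s_a, s_b = scatter(class_a, mu_a), scatter(class_b, mu_b)
    # Within-class scatter is the sum of the per-class scatter matrices.
    s_w = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    if det == 0:
        raise ValueError('S_w is singular; regularization would be needed')
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]

def project(point, w):
    """One-dimensional coordinate of a point along direction w."""
    return point[0] * w[0] + point[1] * w[1]

# Toy data, made up for the example.
class_a = [(1, 2), (2, 3), (3, 3), (2, 1)]
class_b = [(6, 5), (7, 8), (8, 6), (7, 7)]

w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
print('direction:', w)
print('class A projections:', proj_a)
print('class B projections:', proj_b)
```

On this toy data the two sets of projections do not overlap, which is the separation the objective $ J(w) $ rewards. With more than two classes or more dimensions, a general eigen-solver would replace the closed-form shortcut used here.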
## Use Cases - Face recognition - Medical diagnosis - Marketing segmentation ## Conclusion Fisher LDA is a powerful tool for classification tasks with a focus on maximizing the separation between classes. Its mathematical foundation allows for effective implementation across various domains. ## References 1. McLachlan, G. J. (2004). *Discriminant Analysis and Statistical Pattern Recognition*. Wiley-Interscience. 2. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). *Pattern Classification*. Wiley-Interscience. ## Further Reading - [Wikipedia: Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) - [Scikit-learn: Linear Discriminant Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) #################### File: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf Page: 87 Context: # A.I. LAGRANGIANS AND ALL THAT Hence, the “sup” and “inf” can be interchanged if strong duality holds, hence the optimal solution is a saddle-point. It is important to realize that the order of maximization and minimization matters for arbitrary functions (but not for convex functions). Try to imagine a “V” shapes valley which runs diagonally across the coordinate system. If we first maximize over one direction, keeping the other direction fixed, and then minimize the result we end up with the lowest point on the rim. If we reverse the order we end up with the highest point in the valley. There are a number of important necessary conditions that hold for problems with zero duality gap. These Karush-Kuhn-Tucker conditions turn out to be sufficient for convex optimization problems. 
They are given by: $$ \nabla f_0(x^*) + \sum_{i} \lambda_i \nabla f_i(x^*) + \sum_{j} \nu_j \nabla h_j(x^*) = 0 \quad (A.8) $$ $$ f_i(x^*) \leq 0 \quad (A.9) $$ $$ h_j(x^*) = 0 \quad (A.10) $$ $$ \lambda_i \geq 0 \quad (A.11) $$ $$ \lambda_i f_i(x^*) = 0 \quad (A.12) $$ The first equation is easily derived because we already saw that $ p^* = \inf_{x} L_P(x, \lambda^*, \nu^*) $ and hence all the derivatives must vanish. This condition has a nice interpretation as a “balancing of forces”. Imagine a ball rolling down a surface defined by $ f_0(x) $ (i.e., you are doing gradient descent to find the minimum). The ball gets blocked by a wall, which is the constraint. If the surface and constraint are convex then if the ball doesn’t move we have reached the optimal solution. At that point, the forces on the ball must balance. The first term represents the force of the ball against the wall due to gravity (the ball is still on a slope). The second term represents the reaction force of the wall in the opposite direction. The $ \lambda $ represents the magnitude of the reaction force, which needs to be higher if the surface slopes more. We say that this constraint is “active”. Other constraints which do not exert a force are “inactive” and have $ \lambda = 0 $. The latter statement can be read from the last KKT condition which we call “complementary slackness”. It says that either $ f_i(x) = 0 $ (the constraint is saturated and hence active) in which case $ \lambda $ is free to take on a non-zero value. However, if the constraint is inactive, $ f_i(x) < 0 $, then $ \lambda $ must vanish. As we will see soon, the active constraints will correspond to the support vectors in SVMs! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 1 Context: # A Cool Brisk Walk Through Discrete Mathematics **Version 2.2.1** **Stephen Davies, Ph.D.** ## Table of Contents 1. Introduction 2.
Concepts - Sets - Functions - Relations 3. Applications 4. Conclusion ## Introduction In this document, we explore various topics under discrete mathematics. ## Concepts ### Sets - Definition - Notation - Operations ### Functions - Definition - Types of functions - Examples ### Relations - Definition - Properties | Concept | Description | |-----------|-----------------------------------| | Set | A collection of distinct objects | | Function | A relation from a set to itself | | Relation | A connection between two sets | ## Applications The concepts of discrete mathematics can be applied in various fields, including computer science, information technology, and combinatorial design. ## Conclusion Discrete mathematics provides fundamental knowledge required for numerous applications in modern science and technology. Image Analysis: ### Analysis of the Attached Visual Content #### 1. Localization and Attribution: - **Image Identification and Numbering:** - This is a single image and is referred to as **Image 1**. #### 2. Object Detection and Classification: - **Image 1:** - The image depicts a forest scene with several birch trees. - **Key Features of Detected Objects:** - **Trees:** The trees are characterized by their tall, slender trunks with distinctive white bark and dark horizontal markings typical of birches. - **Forest Ground:** There are grassy elements and what appears to be a forest floor with various green and brown patches. #### 3. Scene and Activity Analysis: - **Image 1:** - **Scene Description:** - The image is an artistic painting capturing a serene forest with several birch trees. - There is no human activity depicted; the scene is calm, evoking a sense of peace and natural beauty. #### 4. Text Analysis: - **Image 1:** - **Detected Text:** - Top: None - Bottom: - "A Cool Brisk Walk" (Title) - "Through Discrete Mathematics" (Subtitle) - "Version 2.2.1" (Version Information) - "Stephen Davies, Ph.D." 
(Author Name) - **Text Significance:** - The text suggests that this image is probably the cover of a book titled "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies, Ph.D. The version number indicates it is version 2.2.1, signifying a possibly updated or revised edition. #### 8. Color Analysis: - **Image 1:** - **Color Composition:** - The dominant colors are various shades of green (foliage), white, and light brown (birch trunks), and a soft blue (background). - **Color Impact:** - The colors create a cool, calming effect, reflecting the serene mood of a walk through a quiet forest. #### 9. Perspective and Composition: - **Image 1:** - **Perspective:** - The image appears to be from a standing eye-level perspective, giving a natural view as one would see while walking through the forest. - **Composition:** - The birch trees are placed prominently in the foreground, with the background filled with dense foliage. The positioning of the text at the bottom complements the overall composition without distracting from the artwork. #### 10. Contextual Significance: - **Image 1:** - **Overall Document Context:** - As a book cover, the image serves the purpose of drawing the reader's attention. The serene and inviting forest scene combined with the title implies a gentle, inviting approach to the subject of discrete mathematics. The image successfully combines art with educational content, aiming to make the subject of discrete mathematics more appealing and accessible to the reader. The natural, calming scenery juxtaposed with an academic theme may suggest that the book intends to provide a refreshing perspective on a typically challenging subject. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 2 Context: Copyright © 2023 Stephen Davies. 
## University of Mary Washington **Department of Computer Science** James Farmer Hall 1301 College Avenue Fredericksburg, VA 22401 --- Permission is granted to copy, distribute, transmit and adapt this work under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). The accompanying materials at [www.allthemath.org](http://www.allthemath.org) are also under this license. If you are interested in distributing a commercial version of this work, please contact the author at [stephen@umw.edu](mailto:stephen@umw.edu). The LaTeX source for this book is available from: [https://github.com/divilian/cool-brisk-walk](https://github.com/divilian/cool-brisk-walk). --- Cover art copyright © 2014 Elizabeth M. Davies. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 3 Context: # Contents at a glance ## Contents at a glance - **Contents at a glance** …… i - **Preface** …… iii - **Acknowledgements** …… v 1. **Meetup at the trailhead** …… 1 2. **Sets** …… 7 3. **Relations** …… 35 4. **Probability** …… 59 5. **Structures** …… 85 6. **Counting** …… 141 7. **Numbers** …… 165 8. **Logic** …… 197 9. **Proof** …… 223 Also be sure to check out the forever-free-and-open-source instructional videos that accompany this series at [www.allthemath.org](http://www.allthemath.org)! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 5 Context: # Preface Discrete math is a popular book topic — start Googling around and you’ll find a zillion different textbooks about it. Take a closer look, and you’ll discover that most of these are pretty thick, dense volumes packed with lots of equations and proofs. They’re principled approaches, written by mathematicians and (seemingly) to mathematicians. 
I speak with complete frankness when I say I’m comforted to know that the human race is well covered in this area. We need smart people who can derive complex expressions and prove theorems from scratch, and I’m glad we have them. Your average computer science practitioner, however, might be better served by a different approach. There are elements to the discrete math mindset that a budding software developer needs experience with. This is why discrete math is (properly, I believe) part of the mandatory curriculum for most computer science undergraduate programs. But for future programmers and engineers, the emphasis should be different than it is for mathematicians and researchers in computing theory. A practical computer scientist mostly needs to be able to use these tools, not to derive them. She needs familiarity, and practice, with the fundamental concepts and the thought processes they involve. The number of times the average software developer will need to construct a proof in graph theory is probably near zero. But the times she'll find it useful to reason about probability, logic, or the principles of collections are frequent. I believe the majority of computer science students benefit most from simply gaining an appreciation for the richness and rigor of #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 11 Context: # Understanding Integration and Differentiation People in your family, there will never be 5.3 of them (although there could be 6 someday). The last couple of entries on this list are worth a brief comment. They are math symbols, some of which you may be familiar with. On the right side (in the continuous realm) are $ \int $ and $ \frac{d}{dx} $, which you’ll remember if you’ve taken calculus. They stand for the two fundamental operations of integration and differentiation.
Integration, which can be thought of as finding “the area under a curve,” is basically a way of adding up a whole infinite bunch of numbers over some range. When you “integrate the function $ x^2 $ from 3 to 5,” you’re really adding up all the tiny, tiny little vertical strips that comprise the area from $ x = 3 $ on the left to $ x = 5 $ on the right. Its corresponding entry in the left column of the table is $ \Sigma $, which is just a short-hand for “sum up a bunch of things.” Integration and summation are equivalent operations; it’s just that when you integrate, you’re adding up all the (infinitely many) slivers across the real-line continuum. When you sum, you’re adding up a fixed sequence of entries, one at a time, like in a loop. $ \Sigma $ is just the discrete “version” of $ \int $. The same sort of relationship holds between ordinary subtraction ($-$) and differentiation ($\frac{d}{dx}$). If you’ve plotted a bunch of discrete points on $ x $-$ y $ axes, and you want to find the slope between two of them, you just subtract their $ y $ values and divide by the $ x $ distance between them. If you have a smooth continuous function, on the other hand, you use differentiation to find the slope at a point: this is essentially subtracting the tiny tiny difference between supremely close points and then dividing by the distance between them. Don’t worry; you don’t need to have fully understood any of the integration or differentiation stuff I just talked about, or even to have taken calculus yet. I’m just trying to give you some feel for what “discrete” means, and how the dichotomy between discrete and continuous really runs through all of math and computer science. In this book, we will mostly be focusing on discrete values and structures, which turn out to be of more use in computer science.
That’s partially because, as you probably know, computers #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 12 Context: # CHAPTER 1. MEETUP AT THE TRAILHEAD Themselves are discrete, and can only store and compute discrete values. There can be many of them — megabytes, gigabytes, terabytes — but each value stored is fundamentally comprised of bits, each of which has a value of either 0 or 1. This is unlike the human brain, by the way, whose neuronal synapses communicate based on the continuous quantities of chemicals present in their axons. So I guess “computer” and “brain” are another pair of entities we could add to our discrete vs. continuous list. There’s another reason, though, why discrete math is of more use to computer scientists than continuous math is, beyond just the bits-and-bytes thing. Simply put, computers operate algorithmically. They carry out programs in step-by-step, iterative fashion. First do this, then do that, then move on to something else. This mechanical execution, like the ticking of a clock, permeates everything the computer can do, and everything we can tell it to do. At a given moment in time, the computer has completed step 7, but not step 8; it has accumulated 38 values, but not yet 39; its database has exactly 15 entries in it, no more and no less; it knows that after accepting this friend request, there will be exactly 553 people in your set of friends. The whole paradigm behind reasoning about computers and their programs is discrete, and that’s why we computer scientists find different problems worth thinking about than most of the world did a hundred years ago. But it’s still math. It’s just discrete math. There’s a lot to come, so limber up and let me know when you’re ready to hit the road. ## 1.1 Exercises Use an index card or a piece of paper folded lengthwise, and cover up the right-hand column of the exercises below. 
Read each exercise in the left-hand column, answer it in your mind, then slide the index card down to reveal the answer and see if you’re right! For every exercise you missed, figure out why you missed it before moving on. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 13 Context: # 1.1 EXERCISES 1. **What’s the opposite of concrete?** Abstract. 2. **What’s the opposite of discrete?** Continuous. 3. **Consider a quantity of water in a glass. Would you call it abstract, or concrete? Discrete, or continuous?** Concrete, since it’s a real entity you can experience with the senses. Continuous, since it could be any number of ounces (or liters, or tablespoons, or whatever). The amount of water certainly doesn’t have to be an integer. (Food for thought: since all matter is ultimately comprised of atoms, are even substances like water discrete?) 4. **Consider the number 27. Would you call it abstract, or concrete? Discrete, or continuous?** Abstract, since you can’t see or touch or smell “twenty-seven.” Probably discrete, since it’s an integer, and when we think of whole numbers we think “discrete.” (Food for thought: in real life, how would you know whether I meant the integer “27” or the decimal number “27.00”? And does it matter?) 5. **Consider a bit in a computer’s memory. Would you call it abstract, or concrete? Discrete, or continuous?** Clearly it’s discrete. Abstract vs. concrete, though, is a little tricky. If we’re talking about the actual transistor and capacitor that’s physically present in the hardware, holding a tiny charge in some little chip, then it’s concrete. But if we’re talking about the value “1” that is conceptually part of the computer’s currently executing state, then it’s really abstract just like 27 was. In this book, we’ll always be talking about bits in this second, abstract sense. 6.
**If math isn’t just about numbers, what else is it about?** Any kind of abstract object that has properties we can reason about. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 15 Context: # Chapter 2 ## Sets The place from which we'll start our walk is a body of mathematics called "set theory." Set theory has an amazing property: it's so simple and applicable that almost all the rest of mathematics can be based on it! This is all the more remarkable because set theory itself came along pretty late in the game (as things go) — it was singlehandedly invented by one brilliant man, Georg Cantor, in the 1870s. That may seem like a long time ago, but consider that by the time Cantor was born, mankind had already accumulated an immense wealth of mathematical knowledge: everything from geometry to algebra to calculus to prime numbers. Set theory was so elegant and universal, though, that after it was invented, nearly everything in math was redefined from the ground up to be couched in the language of sets. It turns out that this simple tool is an amazingly powerful way to reason about mathematical concepts of all flavors. Everything else in this book stands on set theory as a foundation. Cantor, by the way, went insane as he tried to extend set theory to fully encompass the concept of infinity. Don't let that happen to you! #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 26 Context: # 2.9 Combining Sets Okay, so we have sets. Now what can we do with them? When you first learn about numbers back before kindergarten, the next thing you learn is how to combine numbers using various operations to produce other numbers. These include $ -, \times, +, \div $, exponents, roots, etc. Sets, too, have operations that are useful for combining to make other sets. 
These include: - **Union ($\cup$)**: The union of two sets is a set that includes the elements that either (or both) of them have as members. For instance, if $ A = \{ \text{Dad}, \text{Lizzy} \} $ and $ B = \{ \text{Lizzy}, \text{T.J.}, \text{Johnny} \} $, then $ A \cup B = \{ \text{Dad}, \text{Lizzy}, \text{T.J.}, \text{Johnny} \} $. Note that any element is in the union if it is in $ A $ or $ B $. For this reason, there is a strong relationship between the union operator of sets and the “or” ($ \vee $) operator of boolean logic that we’ll see later. - **Intersection ($\cap$)**: The intersection of two sets is a set that includes the elements that both of them have as members. In the above example, $ A \cap B = \{ \text{Lizzy} \} $. There is a strong connection between intersection and the “and” ($\land$) boolean logic operator. - **(Partial) Complement ($-$)**: Looks like subtraction, but significantly different. $ A - B $ contains the elements from $ A $ that are not also in $ B $. So you start with $ A $ and then “subtract off” the contents of $ B $ if they occur. In the above example, $ A - B = \{ \text{Dad} \} $. (Note that T.J. and Johnny didn’t really enter into the calculation.) Unlike $ \cup $ and $ \cap $, it is not commutative. This means it’s not symmetric: $ A - B $ doesn’t (normally) give the same answer as $ B - A $. In this example, $ B - A = \{ \text{T.J.}, \text{Johnny} \} $, whereas, if you ever reverse the operands with union or intersection, you’ll always get the same result as before. - **(Total) Complement ($\overline{X}$)**: Same as the partial complement, above, except that the implied first operand is $ \Omega $. In other words, $ A - B $ is “all the things in $ A $ that aren’t in $ B $,” whereas...
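The set operations described in this excerpt map directly onto Python's built-in `set` type, so the family example can be checked by hand. The snippet below is only a small sketch: the variable names are made up, and `OMEGA` (with the extra member 'Mom') is a hypothetical domain of discourse chosen for illustration, since the total complement only makes sense once some $ \Omega $ is fixed.

```python
# Sets from the example in the text.
A = {'Dad', 'Lizzy'}
B = {'Lizzy', 'T.J.', 'Johnny'}

union = A | B            # elements in A or B (or both)
intersection = A & B     # elements in both A and B
a_minus_b = A - B        # partial complement: in A but not in B
b_minus_a = B - A        # not commutative: differs from A - B

# Hypothetical domain of discourse, assumed for this sketch only.
OMEGA = A | B | {'Mom'}
complement_of_a = OMEGA - A   # total complement of A relative to OMEGA

print(sorted(union))
print(sorted(intersection))
print(sorted(a_minus_b))
print(sorted(b_minus_a))
print(sorted(complement_of_a))
```

Union and intersection give the same result with the operands swapped, while the two differences above do not, which matches the remark about commutativity.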
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 37 Context: # 2.13 Exercises Use an index card or a piece of paper folded lengthwise, and cover up the right-hand column of the exercises below. Read each exercise in the left-hand column, answer it in your mind, then slide the index card down to reveal the answer and see if you're right! For every exercise you missed, figure out why you missed it before moving on. 1. Is the set $ \{ Will, Smith \} $ the same as the set $ \{ Smith, Will \} $? Yes indeed. 2. Is the ordered pair $ (Will, Smith) $ the same as $ (Smith, Will) $? No. Order matters with ordered pairs (hence the name), and with any size tuple for that matter. 3. Is the set $ \{ \{ Luke, Leia \}, Han \} $ the same as the set $ \{ Luke, \{ Leia, Han \} \} $? No. For instance, the first set has Han as a member but the second does not. (Instead, it has another set as a member, and that inner set happens to include Han.) 4. What's the first element of the set $ \{ Cowboys, Redskins, Steelers \} $? The question doesn't make sense. There is no "first element" of a set. All three teams are equally members of the set and could be listed in any order. 5. Let $ G $ be $ \{ Matthew, Mark, Luke, John \} $, $ J $ be $ \{ Luke, Obi-wan, Yoda \} $, $ S $ be the set of all Star Wars characters, and $ F $ be the four gospels from the New Testament. Now then. Is $ J \subseteq G $? No. 6. Is $ J \subseteq S $? Yes. 7. Is Yoda $ \in J $? Yes. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 67 Context: # Chapter 4 ## Probability Probability is the study of **uncertainty**. This may seem like a hopeless endeavor, sort of like knowing the unknowable, but it's not. The study of probability gives us tools for taming the uncertain world we live and program in, and for reasoning about it in a precise and helpful way.
We may not know exactly how long a particular visitor is willing to wait for our webpage to load in their browser, but we can use probability to estimate how much traffic we’ll lose if this takes longer than a certain average duration. We may not know which specific passwords a hacker will try as he attempts to break our security protocol, but we can use probability to estimate how feasible this approach will be for him. We may not know exactly when a certain program will run out of RAM and have to swap its data out to virtual memory, but we can predict how often this is likely to occur — and how painful it will be for us — given a certain system load and user behavior. The trick is to use the tools we've already built — sets, relations, functions — to characterize and structure our notions of the relative likelihood of various outcomes. Once those underpinnings are secured, a layer of deductive reasoning will help us make good use of that information to begin to predict the future. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 85 Context: # 4.6. BAYES' THEOREM See how that works? If I do have the disease (and there's a 1 in 1,000 chance of that), there's a .99 probability of me testing positive. On the other hand, if I don't have the disease (a 999 in 1,000 chance of that), there's a .01 probability of me testing positive anyway. The sum of those two mutually exclusive probabilities is 0.01098. Now we can use our Bayes’ Theorem formula to deduce: \[ P(D|T) = \frac{P(T|D) P(D)}{P(T)} \] \[ P(D|T) = \frac{.99 \cdot \frac{1}{1000}}{0.01098} \approx .0902 \] Wow. We tested positive on a 99% accurate medical exam, yet we only have about a 9% chance of actually having the disease! Great news for the patient, but a head-scratcher for the math student. How can we understand this? Well, the key is to look back at that Total Probability calculation in equation 4.1. 
Remember that there were two ways to test positive: one where you had the disease, and one where you didn't. Look at the contribution to the whole that each of those two probabilities produced. The first was 0.00099, and the second was 0.00999, over ten times higher. Why? Simply because the disease is so rare. Think about it: the test fails once every hundred times, but a random person only has the disease once every thousand times. If you test positive, it’s far more likely that the test screwed up than that you actually have the disease, which is rarer than blue moons. Anyway, all of this about diseases and tests is a side note. The main point is that Bayes' Theorem allows us to recast a search for $P(X|Y)$ into a search for $P(Y|X)$, which is often easier to find numbers for. One of many computer science applications of Bayes' Theorem is in text mining. In this field, we computationally analyze the words in documents in order to automatically classify them or form summaries or conclusions about their contents. One goal might be to identify the true author of a document, given samples of the writing of various suspected authors. Consider the Federalist Papers, #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 95 Context: # 5.1. GRAPHS The edges, we have precisely the structure of a graph. Psychologists have given this another name: a semantic network. It is thought that the myriad of concepts you have committed to memory — Abraham Lincoln, and bar of soap, and my fall schedule, and perhaps millions of others — are all associated in your mind in a vast semantic network that links the related concepts together. When your mind recalls information, or deduces facts, or even drifts randomly in idle moments, it’s essentially traversing this graph along the various edges that exist between vertices. That’s deep.
But you don’t have to go near that deep to see the appearance of graph structures all throughout computer science. What’s MapQuest, if not a giant graph where the vertices are travelable locations and the edges are routes between them? What’s Facebook, if not a giant graph where the vertices are people and the edges are friendships? What’s the World Wide Web, if not a giant graph where the vertices are pages and the edges are hyperlinks? What’s the Internet, if not a giant graph where the vertices are computers or routers and the edges are communication links between them? This simple scheme of linked vertices is powerful enough to accommodate a whole host of applications, which is worth studying. ## Graph Terms The study of graphs brings with it a whole bevy of new terms which are important to use precisely: 1. **Vertex.** Every graph contains zero or more vertices.¹ (These are also sometimes called nodes, concepts, or objects.) 2. **Edge.** Every graph contains zero or more edges. (These are also sometimes called links, connections, associations, or relationships.) Each edge connects exactly two vertices, unless the edge connects a vertex to itself, which is possible, believe it. > ¹ The phrase "zero or more" is common in discrete math. In this case, it indicates that the empty graph, which contains no vertices at all, is still a legitimate graph. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 132 Context: # CHAPTER 5. STRUCTURES If this doesn’t seem amazing to you, it’s probably because you don’t fully appreciate how quickly this kind of exponential growth can accumulate. Suppose you had a perfect binary tree of height 30—certainly not an awe-inspiring figure. One could imagine it fitting on a piece of paper—height-wise, that is. But run the numbers and you’ll discover that such a tree would have over half a billion leaves, more than one for every person in the United States. 
Increase the tree’s height to a mere 34—just 4 additional levels—and suddenly you have over 8 billion leaves, easily greater than the population of planet Earth. The power of exponential growth is only fully reached when the binary tree is perfect, since a tree with some “missing” internal nodes does not carry the maximum capacity that it’s capable of. It’s got some holes in it. Still, as long as the tree is fairly bushy (i.e., it’s not horribly lopsided in just a few areas) the enormous growth predicted for perfect trees is still approximately the case. The reason this is called “exponential” growth is that the quantity we’re varying—the height—appears as an exponent in the number of leaves, which is $2^h$. Every time we add just one level, we double the number of leaves. The number of leaves (call it $l$) is $2^h$, if $h$ is the height of the tree. Flipping this around, we say that $h = \log_2(l)$. The function “log” is a logarithm, specifically a logarithm with base-2. This is what computer scientists often use, rather than a base of 10 (which is written “log”) or a base of $e$ (which is written “ln”). Since $2^h$ grows very, very quickly, it follows that $\log_2(l)$ grows very, very slowly. After our tree reaches a few million nodes, we can add more and more nodes without growing the height of the tree significantly at all. The takeaway message is simply that an incredibly large number of nodes can be accommodated in a tree with a very modest height. This makes it possible to, among other things, search a huge amount of information astonishingly quickly—provided the tree’s contents are arranged properly. 
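To get a feel for these numbers, the relationship between height and leaf count can be tabulated directly. The sketch below simply applies the formulas stated above, $l = 2^h$ and $h = \log_2(l)$; the function names `leaves` and `height_for` are hypothetical, chosen for illustration.

```python
import math


def leaves(height):
    """Number of leaves in a perfect binary tree, using l = 2^h from the text."""
    return 2 ** height


def height_for(leaf_count):
    """Invert the formula: h = log2(l)."""
    return math.log2(leaf_count)


for h in (10, 20, 30, 34):
    print(f"height {h:2d}: {leaves(h):,} leaves")

# Doubling the number of leaves adds only a single level.
print(height_for(leaves(30)), height_for(2 * leaves(30)))
```

Under this formula a height of 30 gives roughly 1.07 billion leaves and a height of 34 gives roughly 17 billion, so both comfortably exceed the thresholds quoted in the text.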
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 167 Context: # 6.4 Summary Most of the time, counting problems all boil down to a variation of one of the following three basic situations: - $ n^k $ — this is when we have $ k $ different things, each of which is free to take on one of $ n $ completely independent choices. - $ n^{\underline{k}} $ — this is when we’re taking a sequence of $ k $ different things from a set of $ n $, but no repeats are allowed. (A special case of this is $ n! $, when $ k = n $.) - $ \binom{n}{k} $ — this is when we’re taking $ k $ different things from a set of $ n $, but the order doesn’t matter. Sometimes it’s tricky to deduce exactly which of these three situations apply. You have to think carefully about the problem and ask yourself whether repeated values would be allowed and whether it matters what order the values appear in. This is often subtle. As an example, suppose my friend and I work out at the same gym. This gym has 18 different weight machines to choose from, each of which exercises a different muscle group. Each morning, #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 173 Context: # Chapter 7 ## Numbers Wow, last chapter was about “counting,” and this one is about “numbers.” It sure seems like we’re regressing back to first grade or earlier. And indeed, this chapter will contain a repeat of some elementary school concepts! But this is so we can re-examine the foundations and generalize them somewhat. The mechanical processes you’ve always used with numbers — adding, subtracting, comparing, checking whether something divides evenly, working with place value — are all correct, but they’re all hard-coded for decimal numbers. The word “decimal,” in this chapter, won’t mean “a number with a decimal point,” like 5.62, but rather a number expressed in base 10.
And what does “expressed in base 10” mean? It means that the digits, from right to left, represent a “one’s place,” a “ten’s place,” a “hundred’s place,” and so on. This is what we all learned in grade school, and perhaps you thought that’s just how numbers “were.” But it turns out that 1, 10, 100, 1000, …, is just one choice of place values, and that we could equally as well choose many other things, like 1, 2, 4, 8, …, or 1, 16, 256, 4096, …, or even 1, 23, 529, 12167, …, as long as those values are of a certain type (successive powers of the base). It’s the concept of bases, and specifically bases other than 10, that will cause us to rethink some things. It’ll feel unnatural at first, but soon you’ll discover that there are aspects of how you work with numbers that are unnecessarily specific, and that it’s freeing. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 225 Context: # 8.2. Predicate Logic Deduce further statements, and you’ll learn about them when you study artificial intelligence later on. Most of them are formalized versions of common sense. “If you know A is true, and you know A⇒B is true, then you can conclude B is true.” Or “if you know X∧Y is false, and then you discover that Y is true, you can then conclude that X is false.” Etc. The power to produce new truth from existing truth is the hallmark of AI systems, and why this stuff really matters. If you can imagine a program doing this sort of automated reasoning, it will become clear why the precision of something like predicate logic—instead of the sloppiness of English—becomes important. English is a beautiful and poetic language, but its ambiguity is notorious. For example, back in Chapter 3 we used the phrase “some employee belongs to every department” when describing relations. Now consider that English sentence. What does “some employee belongs to every department” actually mean?
Does it mean that there is some special employee who happens to hold membership in every department in the company? Or does it mean that no department is empty: all departments have at least one person in them, for crying out loud? The English could mean either. In predicate logic, we’re either asserting: - $\exists x \, \text{Employee}(x) \land \forall y \, \text{BelongsTo}(x, y)$ or - $\forall y \, \exists x \, \text{Employee}(x) \land \text{BelongsTo}(x, y)$ These are two very different things. A human being would realize that it’s the second one on the speaker’s mind, drawing from a whole range of experience and common sense and context clues. But a bot has available none of these, and so it demands that the language clearly and unambiguously state exactly what’s meant. English is rife with these ambiguities, especially involving pronouns. > “After John hit George he ran away.” What happened? Did John run away after striking George, fearing that George would retaliate? Or did George run away after getting hit, fearing additional abuse? It’s unclear what “he” refers to, so we can’t say from the sentence alone. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 233 Context: # 9.1. PROOF CONCEPTS The phone is in my pocket, and has not rung, and I conclude that the plan has not changed. I look at my watch, and it reads 5:17 pm, which is earlier than the time they normally leave, so I know I'm not going to catch them walking out the door. This is, roughly speaking, my thought process that justifies the conclusion that the house will be empty when I pull into the garage. Notice, however, that this prediction depends precariously on several facts. What if I spaced out the day of the week, and this is actually Thursday? All bets are off. What if my cell phone battery has run out of charge? Then perhaps they did try to call me but couldn’t reach me. What if I set my watch wrong and it’s actually 4:17 pm?
Etc. Just like a chain is only as strong as its weakest link, a whole proof falls apart if even one step isn’t reliable. Knowledge bases in artificial intelligence systems are designed to support these chains of reasoning. They contain statements expressed in formal logic that can be examined to deduce only the new facts that logically follow from the old. Suppose, for instance, that we had a knowledge base that currently contained the following facts: 1. $ A \to C $ 2. $ \neg (C \land D) $ 3. $ (F \lor E) \to D $ 4. $ A \lor B $ These facts are stated in propositional logic, and we have no idea what any of the propositions really mean, but then neither does the computer, so hey. Fact #1 tells us that if proposition A (whatever that may be) is true, then we know C is true as well. Fact #2 tells us that we know $ C \land D $ is false, which means at least one of the two must be false. And so on. Large knowledge bases can contain thousands or even millions of such expressions. It’s a complete record of everything the system “knows.” Now suppose we learn an additional fact: $ \neg B $. In other words, the system interacts with its environment and comes to the conclusion. #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 252 Context: # 9.4 Final Word Finding proofs is an art. In some ways, it’s like programming: you have a set of building blocks, each one defined very precisely, and your goal is to figure out how to assemble those blocks into a structure that starts with only axioms and ends with your conclusion. It takes skill, patience, practice, and sometimes a little bit of luck. Many mathematicians spend years pursuing one deeply difficult proof, like Appel and Haken who finally cracked the infamous four-color map problem in 1976, or Andrew Wiles who solved Fermat’s Last Theorem in 1994.
Some famous mathematical properties may never have proofs, such as Christian Goldbach’s 1742 conjecture that every even integer greater than 2 is the sum of two primes; or the most elusive and important question in computing theory: does P=NP? (Put very simply: if you consider the class of problems where it’s easy to verify a solution once you have it, but crazy hard to find it in the first place, is there actually an easy algorithm for finding the solution that we just haven’t figured out yet?) Most computer scientists think “no,” but despite a mind-boggling number of hours invested by the brightest minds in the world, no one has ever been able to prove it one way or the other. Most practicing computer scientists spend time taking advantage of the known results about mathematical objects and structures, and rarely (if ever) have to construct a watertight proof about them. For the more theoretically-minded student, however, who enjoys probing the basis behind the tools and speculating about additional properties that might exist, devising proofs is an essential skill that can also be very rewarding.
#################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 253 Context: # Index - n-choose-k notation, 156 - n-to-the-k-falling operator, 152 - a priori, 68 - modus ponens, 201, 226 - modus tollens, 226 - quad erat demonstrandum (Q.E.D.), 226 - reductio ad absurdum, 229 - acyclic (graphs), 91 - additivity property, 63 - adjacent (vertices), 89 - algorithm, 97, 127, 128, 132, 149, 150 - Ali, Muhammad, 92 - American Idol, 62, 68 - ancestor (of a node), 115 - and (logical operator), 18, 199, 203 - antisymmetric (relation), 40, 43 - Appel, Kenneth, 244 - arrays, 13 - artificial intelligence (AI), 197, 201, 225 - associative, 20 - asymmetric (relation), 41 - ATM machines, 143 - atomic (propositions), 198 - AVL trees, 133 - axioms, 226, 229 - background knowledge, 68, 70 - balancedness (of a tree), 132 - base case (of a proof), 233, 240 - bases (of number systems), 166, 168, 170 - Bayes' Theorem, 75 - Bayes, Thomas, 67 - Bayesian, 66 - BFT (breadth-first traversal), 95, 97 - Big-O notation, 127 - bijective (function), 49 - binary numbers, 25, 177, 178, 180, 182 - binary search trees, 123, 125 - binomial coefficients, 156 - bit, 177 - Booth, John Wilkes, 86 - BST property, 125, 131 - byte, 180 #################### File: A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf Page: 254 Context: ``` # CHAPTER 9. 
PROOF - Cantor, Georg, 7, 12, 17 - capacity (of a byte), 182 - cardinality (of sets), 16, 25, 28, 66 - Carroll, Lewis, 227 - carry-in, 189 - carry-out, 189 - Cartesian product (of sets), - 19, 35 - chess, 114 - child (of a node), 115 - closed interval, 61 - codomain (of a function), 45 - collectively exhaustive, 26 - combinations, 154 - combinators, 25, 141 - commutative, 18, 20, 71 - compilers, 114 - complement laws (of sets), - 21 - complement, partial (of sets), - 18 - complement, total (of sets), - 18, 65, 146, 162 - complete binary tree, 121 - conclusion (of implication), 200 - conditional probability, - 68, 72, 74, 78 - congruent, 173 - conjunction, 199, 208 - connected (vertices/graphs), - 89, 95 - coordinates, 15 - curly brace notation, 11 - cycles, 90 ## DAGs (directed acyclic graphs), 90 - data structures, 85 - Davies family, 8, 9, 26, 147, 154 - De Morgan's laws, 21, 22, 207, 208 - decimal numbers, 165, 169, 173, 178 - degree (of a vertex), 90 - depth (of a node), 115 - dequeuing, 95 - descendant (of a node), 115 - DFT (depth-first traversal), 99, 101 ## Dijkstra's algorithm, 101, 104 - Dijkstra, Edsger, 101 - direct proof, 227 - directed graphs, 88, 91 - disjunction, 199, 208, 226 - disjunctive syllogism, 226 - disk sectors, 156 - distributive, 20, 208, 227 - domain (of a function), 45 - domain of discourse (?), 9, - 19, 21, 27, 60, 210 - domination laws (of sets), 21 - dominors, 234 - drinking age, 232, 234 - duplicates (in sets), 13 - edges, 86, 87 - elements (of sets), 8, 15, 23 - ellipses, 12 - empty graph, 87 - empty set, 9, 16, 21, 24, 25, - 36, 114 - endorelations, 38, 93 ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 1 Context: # A Brief Introduction to Machine Learning for Engineers (2018), “A Brief Introduction to Machine Learning for Engineers”, Vol. XX, No. XX, pp 1–231. DOI: XXX. 
Osvaldo Simeone Department of Informatics King’s College London osvaldo.simeone@kcl.ac.uk #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 2 Context: # Contents I Basics 5 1 Introduction 6 1.1 What is Machine Learning? 6 1.2 When to Use Machine Learning? 8 1.3 Goals and Outline 11 2 A Gentle Introduction through Linear Regression 15 2.1 Supervised Learning 15 2.2 Inference 17 2.3 Frequentist Approach 19 2.4 Bayesian Approach 36 2.5 Minimum Description Length (MDL) 42 2.6 Information-Theoretic Metrics 44 2.7 Interpretation and Causality 47 2.8 Summary 49 3 Probabilistic Models for Learning 51 3.1 Preliminaries 52 3.2 The Exponential Family 53 3.3 Frequentist Learning 59 #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 3 Context: # Learning Theory ## Table of Contents 3.4 Bayesian Learning ......................................... 63 3.5 Supervised Learning via Generalized Linear Models (GLM) ... 69 3.6 Maximum Entropy Property* ................................. 71 3.7 Energy-based Models* ...................................... 72 3.8 Some Advanced Topics* ..................................... 73 3.9 Summary ................................................... 74 ## II Supervised Learning ### 75 4 Classification ................................................ 76 4.1 Preliminaries: Stochastic Gradient Descent ............ 77 4.2 Classification as a Supervised Learning Problem ...... 78 4.3 Discriminative Deterministic Models ..................... 80 4.4 Discriminative Probabilistic Models: Generalized Linear Models ............................................... 90 4.5 Discriminative Probabilistic Models: Beyond GLM ..... 96 4.6 Generative Probabilistic Models .......................... 102 4.7 Boosting .................................................... 
106 4.8 Summary ................................................... 107 ## 5 Statistical Learning Theory* ### 113 5.1 A Formal Framework for Supervised Learning .......... 114 5.2 PAC Learnability and Sample Complexity ............... 119 5.3 PAC Learnability for Finite Hypothesis Classes ....... 120 5.4 VC Dimension and Fundamental Theorem of PAC Learning .................................................... 124 5.5 Summary ................................................... 126 ## III Unsupervised Learning ### 129 6 Unsupervised Learning ....................................... 130 6.1 Unsupervised Learning ................................... 131 6.2 K-Means Clustering ...................................... 134 6.3 ML, ELBO and EM ......................................... 136 6.4 Directed Generative Models ............................. 148 6.5 Undirected Generative Models ............................ 155 #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 4 Context: ``` 6.6 Discriminative Models........................................159 6.7 Autoencoders...............................................163 6.8 Ranking*...................................................164 6.9 Summary....................................................164 IV Advanced Modelling and Inference..............................165 7 Probabilistic Graphical Models................................166 7.1 Introduction...............................................167 7.2 Bayesian Networks..........................................170 7.3 Markov Random Fields.......................................178 7.4 Bayesian Inference in Probabilistic Graphical Models......182 7.5 Summary....................................................185 8 Approximate Inference and Learning............................186 8.1 Monte Carlo Methods........................................187 8.2 Variational 
Inference.......................................189 8.3 Monte Carlo-Based Variational Learning*....................197 8.4 Approximate Learning*......................................199 8.5 Summary....................................................201 V Conclusions...................................................202 9 Concluding Remarks............................................203 Appendices.......................................................206 A Appendix A: Information Measures..............................207 A.1 Entropy....................................................207 A.2 Conditional Entropy and Mutual Information................210 A.3 Divergence Measures.......................................212 B Appendix B: KL Divergence and Exponential Family..........215 Acknowledgements...............................................217 ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 8 Context: 2. Notation ============== - The notation log represents the logarithm in base two, while ln represents the natural logarithm. - $ \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) $ indicates that random vector $\mathbf{x}$ is distributed according to a multivariate Gaussian distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. The multivariate Gaussian pdf is denoted as $ \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \Sigma) $ as a function of $\mathbf{x}$. - $ \mathbf{x} \sim U[a,b] $ indicates that $\mathbf{x}$ is distributed according to a uniform distribution in the interval [a, b]. The corresponding uniform pdf is denoted as $ U(a,b) $. - $ \delta(x) $ denotes the Dirac delta function or the Kronecker delta function, as clear from the context. - $ \|\mathbf{a}\|_2^2 = \sum_{i=1}^N a_i^2 $ is the quadratic, or $ l_2 $ norm of a vector $\mathbf{a} = [a_1, \ldots, a_N]^T$. 
We similarly define the $ l_1 $ norm as $ \|\mathbf{a}\|_1 = \sum_{i=1}^N |a_i| $, and the $ l_0 $ pseudo-norm $ \|\mathbf{a}\|_0 $ as the number of non-zero entries of vector $\mathbf{a}$. - $ I $ denotes the identity matrix, whose dimensions will be clear from the context. Similarly, $ \mathbf{1} $ represents a vector of all ones. - $ \mathbb{R} $ is the set of real numbers; $ \mathbb{R}^+ $ the set of non-negative real numbers; $ \mathbb{R}^- $ the set of non-positive real numbers; and $ \mathbb{R}^N $ is the set of all vectors of $ N $ real numbers. - $ |S| $ represents the cardinality of a set $ S $. - $ \mathbf{Z} $ represents a set of random variables indexed by the integers $ k \in S $. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 10 Context: # Acronyms | Acronym | Meaning | |---------|------------------------------------------------| | pmf | probability mass function | | PCA | Principal Component Analysis | | PPCA | Probabilistic Principal Component Analysis | | QDA | Quadratic Discriminant Analysis | | RBM | Restricted Boltzmann Machine | | SGD | Stochastic Gradient Descent | | SVM | Support Vector Machine | | rv | random variable or random vector (depending on the context) | | s.t. | subject to | | VAE | Variational AutoEncoder | | VC | Vapnik–Chervonenkis | | VI | Variational Inference | #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 11 Context: # Part I ## Basics #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 12 Context: # 1 Introduction Having taught courses on machine learning, I am often asked by colleagues and students with a background in engineering to suggest “the best place to start” to get into this subject.
I typically respond with a list of books – for a general, but slightly outdated introduction, read this book; for a detailed survey of methods based on probabilistic models, check this other reference; to learn about statistical learning, I found this text useful; and so on. This answer strikes me, and most likely also my interlocutors, as quite unsatisfactory. This is especially so since the size of many of these books may be discouraging for busy professionals and students working on other projects. This monograph is an attempt to offer a basic and compact reference that describes key ideas and principles in simple terms and within a unified treatment, encompassing also more recent developments and pointers to the literature for further study. ## 1.1 What is Machine Learning? A useful way to introduce the machine learning methodology is by means of a comparison with the conventional engineering design flow. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 13 Context: # 1.1. What is Machine Learning? This starts with an in-depth analysis of the problem domain, which culminates with the definition of a mathematical model. The mathematical model is meant to capture the key features of the problem under study and is typically the result of the work of a number of experts. The mathematical model is finally leveraged to derive hand-crafted solutions to the problem. For instance, consider the problem of defining a chemical process to produce a given molecule. The conventional flow requires chemists to leverage their knowledge of models that predict the outcome of individual chemical reactions, in order to craft a sequence of suitable steps that synthesize the desired molecule. Another example is the design of speech translation or image/video compression algorithms. 
Both of these tasks involve the definition of models and algorithms by teams of experts, such as linguists, psychologists, and signal processing practitioners, not infrequently during the course of long standardization meetings. The engineering design flow outlined above may be too costly and inefficient for problems in which faster or less expensive solutions are desirable. The machine learning alternative is to collect large data sets, e.g., of labeled speech, images, or videos, and to use this information to train general-purpose learning machines to carry out the desired task. While the standard engineering flow relies on domain knowledge and on design optimized for the problem at hand, machine learning lets large amounts of data dictate algorithms and solutions. To this end, rather than requiring a precise model of the set-up under study, machine learning requires the specification of an objective, of a model to be trained, and of an optimization technique. Returning to the first example above, a machine learning approach would proceed by training a general-purpose machine to predict the outcome of known chemical reactions based on a large data set, and then by using the trained algorithm to explore ways to produce more complex molecules. In a similar manner, large data sets of images or videos would be used to train a general-purpose algorithm with the aim of obtaining compressed representations from which the original input can be recovered with some distortion. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 14 Context: # 1.2 When to Use Machine Learning? Based on the discussion above, machine learning can offer an efficient alternative to the conventional engineering flow when development cost and time are the main concerns, or when the problem appears to be too complex to be studied in its full generality. 
On the flip side, the approach has the key disadvantages of providing generally suboptimal performance, hindering interpretability of the solution, and applying only to a limited set of problems. In order to identify tasks for which machine learning methods may be useful, reference [31] suggests the following criteria: 1. The task involves a function that maps well-defined inputs to well-defined outputs. 2. Large data sets exist or can be created containing input-output pairs. 3. The task provides clear feedback with clearly definable goals and metrics. 4. The task does not involve long chains of logic or reasoning that depend on diverse background knowledge or common sense. 5. The task does not require detailed explanations for how the decision was made. 6. The task has a tolerance for error and no need for provably correct or optimal solutions. 7. The phenomenon or function being learned should not change rapidly over time. 8. No specialized dexterity, physical skills, or mobility is required. These criteria are useful guidelines for the decision of whether machine learning methods are suitable for a given task of interest. They also offer a convenient demarcation line between machine learning as is intended today, with its focus on training and computational statistics tools, and more general notions of Artificial Intelligence (AI) based on knowledge and common sense [87] (see [126] for an overview on AI research). #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 15 Context: # 1.2. When to Use Machine Learning? ## 1.2.1 Learning Tasks We can distinguish among three different main types of machine learning problems, which are briefly introduced below. The discussion reflects the focus of this monograph on parametric probabilistic models, as further elaborated on in the next section. 1.
**Supervised learning:** We have $ n $ labeled training examples $ D = \{(x_i, y_i)\}_{i=1}^n $, where $ x_i $ represents a covariate, or explanatory variable, while $ y_i $ is the corresponding label, or response. For instance, variable $ x_i $ may represent the text of an email, while $ y_i $ may be a binary variable indicating whether the email is spam or not. The goal of supervised learning is to predict the value of the label $ y $ for an input $ x $ that is not in the training set. In other words, supervised learning aims at generalizing the observations in the data set $ D $ to new inputs. For example, an algorithm trained on a set of emails should be able to classify a new email not present in the data set $ D $. We can generally distinguish between classification problems, in which the label $ y $ is discrete, as in the example above, and regression problems, in which variable $ y $ is continuous. An example of a regression task is the prediction of tomorrow’s temperature $ y $ based on today’s meteorological observations $ x $. An effective way to learn a predictor is to identify from the data set $ D $ a predictive distribution $ p(y|x) $ from a set of parametrized distributions. The conditional distribution $ p(y|x) $ defines a profile of beliefs over all possible values of the label $ y $ given the input $ x $. For instance, for temperature prediction, one could learn mean and variance of a Gaussian distribution $ p(y|x) $ as a function of the input $ x $. As a special case, the output of a supervised learning algorithm may be in the form of a deterministic predictive function $ \hat{y} = f(x) $. 2. **Unsupervised learning:** Suppose now that we have an unlabeled set of training examples $ D = \{x_i\}_{i=1}^n $. Less well defined than supervised learning, unsupervised learning generally refers to the task of learning properties of the mechanism that generates this data set.
Specific tasks and applications include clustering, which is the problem of grouping similar examples $ x_i $; dimensionality reduction, feature extraction, and representation learning, all related to the problem of simplifying data. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 16 Context: # Introduction representing the data in a smaller or more convenient space; and generative modelling, which is the problem of learning a generating mechanism to produce artificial examples that are similar to available data in the data set $ D $. As a generalization of both supervised and unsupervised learning, **semi-supervised learning** refers to scenarios in which not all examples are labelled, with the unlabelled examples providing information about the distribution of the covariates $ x $. ## 3. Reinforcement learning Reinforcement learning refers to the problem of inferring optimal sequential decisions based on rewards or punishments received as a result of previous actions. Under supervised learning, the “label” refers to an action to be taken when the learner is in an informational state about the environment given by a variable $ x $. Upon taking an action $ a $ in a state $ x $, the learner is provided with feedback on the immediate reward accrued via this decision, and the environment moves on to a different state. As an example, an agent can be trained to navigate a given environment in the presence of obstacles by penalizing decisions that result in collisions. Reinforcement learning is hence neither supervised, since the learner is not provided with the optimal actions $ t $ to select in a given state $ x $; nor is it fully unsupervised, given the availability of feedback on the quality of the chosen action. Reinforcement learning is also distinguished from supervised and unsupervised learning due to the influence of previous actions on future states and rewards. 
This monograph focuses on supervised and unsupervised learning. These general tasks can be further classified along the following dimensions: - **Passive vs. active learning**: A passive learner is given the training examples, while an active learner can affect the choice of training examples on the basis of prior observations. - **Offline vs. online learning**: Offline learning operates over a batch of training samples, while online learning processes samples in a streaming fashion. Note that reinforcement learning operates inherently in an online manner, while supervised and unsupervised learning can be carried out by following either offline or online formulations. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 17 Context: # 1.3 Goals and Outline This monograph considers only passive and offline learning. ## 1.3 Goals and Outline This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. Unlike other texts that are focused on one particular aspect of the field, an effort has been made here to provide a broad but concise overview in which the main ideas and techniques are systematically presented. Specifically, the material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra. 
A prior exposure to information theory is useful but not required. Detailed discussions are provided on basic concepts and ideas, including overfitting and generalization, Maximum Likelihood and regularization, and Bayesian inference. The text also endeavors to provide intuitive explanations and pointers to advanced topics and research directions. Sections and subsections containing more advanced material that may be disregarded at a first reading are marked with a star (*). The reader will find here neither discussions on computing platforms or programming frameworks, such as map-reduce, nor details on specific applications involving large data sets. These can be easily found in a vast and growing body of work. Furthermore, rather than providing exhaustive details on the existing myriad solutions in each specific category, techniques have been selected that are useful to illustrate the most salient aspects. Historical notes have also been provided only for a few selected milestone events. Finally, the monograph attempts to strike a balance between the algorithmic and theoretical viewpoints. In particular, all learning algorithms are presented in a manner that emphasizes their theoretical foundations while also providing practical insights. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 18 Context: # Introduction Algorithms are introduced on the basis of theoretical arguments, often based on information-theoretic measures. Moreover, a chapter is devoted to statistical learning theory, demonstrating how to set the field of supervised learning on solid theoretical foundations. This chapter is more theoretically involved than the others, and proofs of some key results are included in order to illustrate the theoretical underpinnings of learning.
This contrasts with other chapters, in which proofs of the few theoretical results are kept at a minimum in order to focus on the main ideas. The rest of the monograph is organized into five parts. The first part covers introductory material. Specifically, Chapter 2 introduces the frequentist, Bayesian and Minimum Description Length (MDL) learning frameworks; the discriminative and generative categories of probabilistic models; as well as key concepts such as training loss, generalization, and overfitting, all in the context of a simple linear regression problem. Information-theoretic metrics are also briefly introduced, as well as the advanced topics of interpretation and causality. Chapter 3 then provides an introduction to the exponential family of probabilistic models, to Generalized Linear Models (GLMs), and to energy-based models, emphasizing main properties that will be invoked in later chapters. The second part concerns supervised learning. Chapter 4 covers linear and non-linear classification methods via discriminative and generative models, including Support Vector Machines (SVMs), kernel methods, logistic regression, multi-layer neural networks, and boosting. Chapter 5 is a brief introduction to the statistical learning framework of the Probably Approximately Correct (PAC) theory, covering the Vapnik-Chervonenkis (VC) dimension and the fundamental theorem of PAC learning. The third part, consisting of a single chapter, introduces unsupervised learning. In particular, in Chapter 6, unsupervised learning models are addressed by distinguishing among directed models, for which Expectation Maximization (EM) is derived as the iterative maximization of the Evidence Lower Bound (ELBO); undirected models, for which Restricted Boltzmann Machines (RBMs) are discussed as a representative example; and discriminative models trained using the InfoMax principle.
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 19 Context: # 1.3 Goals and Outline Generative Adversarial Networks (GANs) are also introduced. The fourth part covers more advanced modelling and inference approaches. Chapter 7 provides an introduction to probabilistic graphical models, namely Bayesian Networks (BNs) and Markov Random Fields (MRFs), as means to encode more complex probabilistic dependencies than the models studied in previous chapters. Approximate inference and learning methods are introduced in Chapter 8 by focusing on Monte Carlo (MC) and Variational Inference (VI) techniques. The chapter briefly introduces in a unified way techniques such as variational EM, Variational AutoEncoders (VAE), and black-box inference. Some concluding remarks are provided in the last part, consisting of Chapter 9. We conclude this chapter by emphasizing the importance of probability as a common language for the definition of learning algorithms [34]. The centrality of the probabilistic viewpoint was not always recognized, but has deep historical roots. This is demonstrated by the following two quotes, the first from the first AI textbook published by P. H. Winston in 1977, and the second from an unfinished manuscript by J. von Neumann (see [126, 64] for more information): “Many ancient Greeks supported Socrates' opinion that deep, inexplicable thoughts came from the gods. Today’s equivalent to those gods is the erratic, even probabilistic neuron. It is more likely that increased randomness of neural behavior is the problem of the epileptic and the drunk, not the advantage of the brilliant.” from *Artificial Intelligence*, 1977. “All of this will lead to theories of computation which are much less rigidly of an all-or-none nature than past and present formal logic.
There are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics primarily in the form it was received from Boltzmann.” #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 21 Context: # 2 ## A Gentle Introduction through Linear Regression In this chapter, we introduce the frequentist, Bayesian, and MDL learning frameworks, as well as key concepts in supervised learning, such as discriminative and generative models, training loss, generalization, and overfitting. This is done by considering a simple linear regression problem as a recurring example. We start by introducing the problem of supervised learning and by presenting some background on inference. We then present the frequentist, Bayesian, and MDL learning approaches in this order. The treatment of MDL is limited to an introductory discussion, as the rest of the monograph concentrates on frequentist and Bayesian viewpoints. We conclude with an introduction to the important topic of information-theoretic metrics, and with a brief introduction to the advanced topics of causal inference and interpretation. ## 2.1 Supervised Learning In the standard formulation of a supervised learning problem, we are given a training set $ D $ containing $ N $ training points $ (x_n, y_n) $, $ n = 1, \ldots, N $. The observations $ x_n $ are considered to be free variables, and known as covariates, domain points, or explanatory variables; while the target #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 22 Context: ```markdown variables $ t_n $ are assumed to be dependent on $ x_n $, and are referred to as dependent variables, labels, or responses. An example is illustrated in Fig. 2.1. 
We use the notation $ \mathbf{x}_D = (x_1, \ldots, x_N)^T $ for the covariates and $ \mathbf{t}_D = (t_1, \ldots, t_N)^T $ for the labels in the training set $ D $. Based on this data, the goal of supervised learning is to identify an algorithm to predict the label $ t $ for a new, that is, as of yet unobserved, domain point $ x $. ![Figure 2.1: Example of a training set $ D $ with $ N = 10 $ points $ (x_n, t_n) $, $ n = 1, \ldots, N $.](path_to_image) The outlined learning task is clearly impossible in the absence of additional information on the mechanism relating variables $ x $ and $ t $. With reference to Fig. 2.1, unless we assume, say, that $ x $ and $ t $ are related by a function $ t = f(x) $ with some properties, such as smoothness, we have no way of predicting the label $ t $ for an unobserved domain point $ x $. This observation is formalized by the no free lunch theorem to be reviewed in Chapter 5: one cannot learn rules that generalize to unseen examples without making assumptions about the mechanism generating the data. The set of all assumptions made by the learning algorithm is known as the inductive bias. This discussion points to a key difference between memorizing and learning. While the former amounts to mere retrieval of a value $ t_n $, ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 41 Context: ## 2.3 Frequentist Approach ### 2.3.5 Regularization We have seen above that the MAP learning criterion amounts to the addition of a regularization function $ R(\theta) $ to the ML or ERM learning losses. This function penalizes values of the weight vector $ \mathbf{w} $ that are likely to occur in the presence of overfitting, or, generally, that are improbable on the basis of the available prior information.
The net effect of this addition is to effectively decrease the capacity of the model, as the set of values for the parameter vector $ \theta $ that the learning algorithm is likely to choose from is reduced. As a result, as seen, regularization can control overfitting, and its optimization requires validation. Regularization generally refers to techniques that aid in reducing overfitting during training. The discussion in the previous subsection has focused on a specific form of regularization that is grounded in a probabilistic interpretation in terms of MAP learning. We note that the same techniques, such as ridge regression and LASSO, can also be introduced independently of a probabilistic framework in an ERM formulation. Furthermore, apart from the discussed addition of regularization terms to the empirical risk, there are other ways to perform regularization. One approach is to modify the optimization scheme by using techniques such as early stopping [56]. Another is to augment the training set by generating artificial examples and hence effectively increasing the number $ N $ of training examples. Related to this idea is the technique known as bagging. With bagging, we first create a number $ K $ of bootstrap data sets. Bootstrap data sets are obtained by selecting $ N $ data points uniformly with replacement from $ D $ so that the same data point generally appears multiple times in the bootstrap data set. Then, we train the model $ K $ times, each time over a different bootstrap set. Finally, we average the results obtained from the models using equal weights. If the errors accrued by the different models were independent, bagging would yield an estimation error that decreases with $ K $. In practice, significantly smaller gains are obtained, particularly for large $ K $, given that the bootstrap data sets are all drawn from $ D $ and hence the estimation errors are not independent [23].
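The bagging procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the monograph's implementation: the base model is a closed-form least-squares line, the data set is synthetic, and the names `fit_line` and `bagged_predictor` are hypothetical.

```python
import random

def fit_line(points):
    # Closed-form least squares for t = w0 + w1 * x (assumed base model).
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        # Degenerate bootstrap set (all x equal): use a constant predictor.
        return mean_t, 0.0
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in points)
    w1 = sxt / sxx
    return mean_t - w1 * mean_x, w1

def bagged_predictor(data, num_models, rng):
    # Train K models, each on N points drawn uniformly with replacement from D.
    n = len(data)
    models = []
    for _ in range(num_models):
        bootstrap = [data[rng.randrange(n)] for _ in range(n)]
        models.append(fit_line(bootstrap))

    def predict(x):
        # Equal-weight average of the K individual predictions.
        return sum(w0 + w1 * x for w0, w1 in models) / len(models)

    return predict

# Synthetic data (an assumption for illustration): t = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0 + rng.gauss(0.0, 0.3)) for i in range(30)]
predict = bagged_predictor(data, num_models=25, rng=rng)
print(round(predict(1.5), 2))
```

Because every bootstrap set is drawn from the same $ D $, the $ K $ fitted lines are strongly correlated, which is why averaging them helps less than it would for independent models.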
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 50 Context: ``` 44 # A Gentle Introduction through Linear Regression produces a description of approximately $$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} $$ bits. This description is, however, not sufficient, since the decoder of the description should also be informed about the parameters $(w_{ML}, \beta_{ML})$. Using quantization, the parameters can be described by means of a number $C(M)$ of bits that is proportional to the number of parameters, here $M + 2$. Concatenating these bits with the description produced by the ML model yields the overall description length $$ - \sum_{n=1}^{N} \log{p(y_n | x_n, w_{ML}, \beta_{ML})} + C(M). \quad (2.39) $$ MDL – in the simplified form discussed here – selects the model order $M$ that minimizes the description length (2.39). Accordingly, the term $C(M)$ acts as a regularizer. The optimal value of $M$ for the MDL criterion is hence the result of the trade-off between the minimization of the overhead $C(M)$, which calls for a small value of $M$, and the minimization of the NLL, which decreases with $M$. Under some technical assumptions, the overhead term can be often evaluated in the form $$ \frac{K}{2} \log N + c, $$ where $K$ is the number of parameters in the model and $c$ is a constant. This expression is not quite useful in practice, but it provides intuition about the mechanism used by MDL to combat overfitting. ## 2.6 Information-Theoretic Metrics We now provide a brief introduction to information-theoretic metrics by leveraging the example studied in this chapter. As we will see in the following chapters, information-theoretic metrics are used extensively in the definition of learning algorithms. Appendix A provides a detailed introduction to information-theoretic measures in terms of inferential tasks.
Here we introduce the key metrics of Kullback-Leibler (KL) divergence and entropy by examining the asymptotic behavior of ML in the regime of large $N$. The case with finite $N$ is covered in Chapter 6 (see Sec. 6.4.3). ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 54 Context: # A Gentle Introduction through Linear Regression ![Figure 2.10: Illustration of Simpson's paradox [113]](path_to_image) The way out of this conundrum is to leverage prior information we have about the problem domain. In fact, we can explain away this spurious correlation by including another measurable variable in the model, namely age. To see this, consider the same data, now redrawn by highlighting the age of the individual corresponding to each data point. The resulting plot, seen in Fig. 2.10 (bottom), reveals that older people, within the observed bracket, tend to have a higher cholesterol as well as to exercise more. Therefore, age is a common cause of both exercise and cholesterol levels. In order to capture the causal relationship between the latter variables, we hence need to adjust for age. Doing this requires considering the trend within each age separately, recovering the expected conclusion that exercising is useful to lower one’s cholesterol. We conclude that in this example the correlation between $ x $ and $ t $, while useful for prediction, should not be acted upon for decision making. When assessing the causal relationship between $ x $ and $ t $, we should first understand which other variables may explain the observations and then discount any spurious correlations. This discussion reveals an important limitation of most existing machine learning algorithms when it comes to identifying causal relationships, or, more generally, to answering counterfactual queries [112]. The study of causality can be carried out within the elegant framework of footnotes.
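The effect of adjusting for age can be made concrete with a small numerical sketch. The numbers below are invented for illustration and are not the data of Fig. 2.10; `slope` is a hypothetical helper that computes the least-squares slope of cholesterol against exercise.

```python
def slope(points):
    # Least-squares slope of cholesterol (t) against exercise (x).
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in points)
    return sxt / sxx

# Hypothetical (exercise hours, cholesterol) pairs, grouped by age.
groups = {
    20: [(1, 150), (2, 145), (3, 140)],
    40: [(3, 180), (4, 175), (5, 170)],
    60: [(5, 210), (6, 205), (7, 200)],
}

# Pooling all ages ignores the common cause and yields a rising trend.
pooled = [point for points in groups.values() for point in points]
pooled_slope = slope(pooled)

# Conditioning on age recovers a falling trend inside every group.
within_slopes = {age: slope(points) for age, points in groups.items()}

print(pooled_slope, within_slopes)
```

With these invented numbers the pooled slope is positive while every within-age slope is negative, which is the reversal that the text attributes to age acting as a common cause.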
7 This example is an instance of the so-called Simpson's paradox: patterns visible in the data disappear, or are even reversed, on the segregated data. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 55 Context: # 2.8 Summary In this chapter, we have reviewed three key learning frameworks, namely frequentist, Bayesian and MDL, within a parametric probabilistic setup. The frequentist viewpoint postulates the presence of a true unknown distribution for the data and aims at learning a predictor that generalizes well on unseen data drawn from this distribution. This can be done either by learning a probabilistic model to be plugged into the expression of the optimal predictor or by directly solving the ERM problem over the predictor. The Bayesian approach outputs a predictive distribution that combines prior information with the data by solving the inference problem of computing the posterior distribution over the unseen label. Finally, the MDL method aims at selecting a model that allows the data to be described with the smallest number of bits, hence doing away with the need to define the task of generalizing over unobserved examples. The chapter has also focused extensively on the key problem of overfitting, demonstrating how the performance of a learning algorithm can be understood in terms of bias and estimation error. In particular, while choosing a hypothesis class is essential in order to enable learning, choosing the “wrong” class constitutes an irrecoverable bias that can make learning impossible. As a real-world example, as reported in [109], including as an independent variable the ZIP code of an individual seeking credit at a bank may disadvantage certain applicants or minorities. Another example of this phenomenon is the famous experiment by B. F. Skinner on pigeons [133].
We conclude this chapter by emphasizing an important fact about the probabilistic models that are used in modern machine learning applications. In frequentist methods, typically at least two (possibly conditional) distributions are involved: the empirical data distribution and the model distribution. The former amounts to the histogram of the data which, by the law of large numbers, tends to the real distribution. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 58 Context: # 3.1 Preliminaries We start with a brief review of some definitions that will be used throughout the chapter and elsewhere in the monograph (see [28] for more details). Readers with a background in convex analysis and calculus may just review the concept of sufficient statistic in the last paragraph. First, we define a **convex set** as a subset of $\mathbb{R}^D$, for some $D$, that contains all segments between any two points in the set. Geometrically, convex sets hence cannot have “indentations.” Function $f(x)$ is convex if its domain is a convex set and if it satisfies the inequality $f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2)$ for all $x_1$ and $x_2$ in its domain and for all $0 \leq \lambda \leq 1$. Geometrically, this condition says that the function is “U”-shaped: the curve defining the function cannot be above the segment obtained by connecting any two points on the curve. A function is strictly convex if the inequality above is strict except for $\lambda = 0$ or $\lambda = 1$; a concave, or strictly concave, function is defined by reversing the inequality above – it is hence “n-shaped.” The minimization of a convex (“U”) function over a convex constraint set or the maximization of a concave (“n”) function over a convex constraint set are known as convex optimization problems. 
For these problems, there exist powerful analytical and algorithmic tools to obtain globally optimal solutions [28]. We also introduce two useful concepts from calculus. The **gradient** of a differentiable function $f(x)$ with $x = [x_1, \ldots, x_D]^T \in \mathbb{R}^D$ is defined as the $D \times 1$ vector $\nabla f(x) = [\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_D}]^T$ containing all partial derivatives. At any point $x$ in the domain of the function, the gradient is a vector that points to the direction of locally maximal increase of the function. The Hessian $\nabla^2 f(x)$ is the $D \times D$ matrix with $(i,j)$ element given by the second-order derivative $\frac{\partial^2 f(x)}{\partial x_i \partial x_j}$. It captures the local curvature of the function. 1. A statistic is a function of the data. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 59 Context: # 3.2 The Exponential Family The probability $ p(\theta \mid x) $ of the parameters $ \theta $ depends on $ x $ only through the function $ f(x) $. As an example, for $ x \sim \mathcal{N}(0, \sigma^2) $, the function $ f(x) = x^2 $ can be easily seen to be sufficient for the estimate of the variance $ \sigma^2 $. ## 3.2 The Exponential Family In this section, we introduce the exponential family of parametric probabilistic models. As it will be discussed, this family includes as special cases most of the distributions typically assumed when solving machine learning problems. For example, it includes Gaussian, Laplace, Gamma, Beta and Dirichlet pdfs, as well as Bernoulli, Categorical, multinomial, and Poisson pmfs. An extensive list can be found in [156]. 
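The definitions of convexity, gradient, and Hessian from Sec. 3.1 can be checked numerically. The following is a rough sketch under stated assumptions: the test function $ f(x) = x_1^2 + 3 x_2^2 $ and the helper names `gradient`, `hessian`, and `satisfies_convexity` are illustrative choices, and the derivatives are approximated by central finite differences rather than computed analytically.

```python
def f(x):
    # Assumed convex test function f(x) = x1^2 + 3 * x2^2.
    return x[0] ** 2 + 3.0 * x[1] ** 2

def gradient(func, x, h=1e-5):
    # Central finite differences approximate each partial derivative.
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad

def hessian(func, x, h=1e-4):
    # Entry (i, j) approximates the second-order derivative d^2 f / dx_i dx_j.
    d = len(x)
    rows = []
    for i in range(d):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        g_up = gradient(func, up, h)
        g_down = gradient(func, down, h)
        rows.append([(g_up[j] - g_down[j]) / (2.0 * h) for j in range(d)])
    return rows

def satisfies_convexity(func, x1, x2, lam):
    # Checks f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) at one point.
    mix = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    return func(mix) <= lam * func(x1) + (1.0 - lam) * func(x2) + 1e-12

point = [1.0, 2.0]
print(gradient(f, point))
print(hessian(f, point))
print(satisfies_convexity(f, [1.0, 2.0], [-3.0, 0.5], 0.3))
```

For this quadratic the analytical gradient at $ [1, 2]^T $ is $ [2, 12]^T $ and the Hessian is the constant diagonal matrix with entries 2 and 6, so the numerical values should agree with these up to rounding. A single check of the inequality does not prove convexity; it only fails to find a counterexample.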
### 3.2.1 Basic Definitions The exponential family contains probabilistic models of the form: \[ p(x \mid \eta) = \frac{1}{Z(\eta)} \exp\left( \sum_{k=1}^{K} \eta_k u_k(x) \right) m(x) \] \[ = \frac{1}{Z(\eta)} \exp\left( \eta^T u(x) \right) m(x), \tag{3.1} \] where $ x $ is a discrete or continuous-valued vector; $ \eta = [\eta_1, \ldots, \eta_K]^T $ is the vector of natural parameters; $ u(x) = [u_1(x), \ldots, u_K(x)]^T $ is the vector of sufficient statistics, with each sufficient statistic $ u_k(x) $ being a function of $ x $; $ m(x) \geq 0 $ is the base measure, which is a function of $ x $ that is independent of the natural parameter $ \eta $; and $ Z(\eta) $ is the partition function: \[ Z(\eta) = \int \exp\left( \eta^T u(x) \right) m(x) dx \tag{3.2} \] for continuous rvs and $ Z(\eta) = \sum_{x} \exp\left( \eta^T u(x) \right) m(x) $ for discrete rvs. The sufficient statistic vector $ u(x) $ can be seen to be a sufficient statistic for the estimation of the natural parameter vector $ \eta $ given $ x $. The partition function normalizes the distribution so that it integrates, or sums, to one. It is also often useful to use the unnormalized distribution: \[ \tilde{p}(x) = \exp\left( \eta^T u(x) \right) m(x), \tag{3.3} \] #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 79 Context: # 3.8 Some Advanced Topics* With energy-based models, the key formula (3.28) of the gradient of the LL with respect to the model's parameters generalizes as $$ \frac{1}{N} \nabla_{n} \ln p(z | \eta) = - \frac{1}{N} \sum_{n=1}^{N} \nabla_{n} E_{c}(z_{n} | \eta) + \sum_{c} E_{c} - p(z_{c}) \nabla_{n} E_{c}(x_{i}).
$$ (3.58) Generalizing the discussion around (3.28), the first term in (3.58) is the “positive” component that points in a direction that minimizes the energy of the observations $ x_{i} $; while the second term is the “negative” component that pushes up the energy of the unobserved configurations. In gradient-ascent methods, the application of the first term is typically referred to as the positive phase, while the second is referred to as the negative phase. (The negative phase is even taken by some authors to model the working of the brain while dreaming!) [56] While for the exponential family the expectation in the negative phase readily yields the mean parameters, for more general models, the evaluation of this term is generally prohibitive and typically requires Monte Carlo approximations, which are discussed in Chapter 8. ## 3.8 Some Advanced Topics* The previous sections have focused on the important class of parametric probabilistic models in the exponential family. Here we briefly put the content of this chapter in the broader context of probabilistic models for machine learning. First, it is often useful to encode additional information about the relationships among the model variables by means of a graphical formalism that will be discussed in Chapter 7. Second, the problem of learning the distribution of given observations, which has been studied here using parametric models, can also be tackled using a non-parametric approach. Accordingly, the distribution is inferred making only assumptions regarding its local smoothness. Typical techniques in this family include Kernel Density Estimation and Nearest Neighbor Density Estimation (see, e.g., [140]). Furthermore, rather than learning individual probability densities, in some applications, it is more useful to directly estimate ratios of densities. 
This is the case, for instance, when one wishes to estimate the #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 82 Context: # 4 ## Classification The previous chapters have covered important background material on learning and probabilistic models. In this chapter, we use the principles and ideas covered so far to study the supervised learning problem of classification. Classification is arguably the quintessential machine learning problem, with the most advanced state of the art and the most extensive application to problems as varied as email spam detection and medical diagnosis. Due to space limitations, this chapter cannot provide an exhaustive review of all existing techniques and latest developments, particularly in the active field of neural network research. For instance, we do not cover decision trees here (see, e.g., [155]). Rather, we will provide a principled taxonomy of approaches, and offer a few representative techniques for each category within a unified framework. We will specifically proceed by first introducing as preliminary material the Stochastic Gradient Descent optimization method. Then, we will discuss deterministic and probabilistic discriminative models, and finally we will cover probabilistic generative models. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 92 Context: ``` 86 Classification The loss function is defined as $$ \ell(t, a(x, \hat{w})) = \max(0, -t \cdot a(x, \hat{w})) $$ (4.12) The perceptron loss assigns zero cost to a correctly classified example $ x $, whose functional margin $ t \cdot a(x, \hat{w}) $ is positive, and a cost equal to the absolute value of the functional margin for a misclassified example, whose functional margin is negative. A comparison with the 0-1 loss is shown in Fig. 4.4.
The perceptron algorithm tackles problem (4.11) with $ \lambda = 0 $ via SGD with mini-batch size $ S = 1 $. The resulting algorithm works as follows. First, the weights $ \hat{w}^{(0)} $ are initialized. Then, for each iteration $ i = 1, 2, \ldots $: 1. Pick a training example $ (x_n, t_n) $ uniformly with replacement from $ D $: - If the example is correctly classified, i.e., if $ t_n a(x_n, \hat{w}^{(i-1)}) \geq 0 $, do not update the weights: $$ \hat{w}^{(i)} = \hat{w}^{(i-1)}. $$ - If the example is not correctly classified, i.e., if $ t_n a(x_n, \hat{w}^{(i-1)}) < 0 $, update the weights as: $$ \hat{w}^{(i)} = \hat{w}^{(i-1)} - \nabla_{\hat{w}} \ell(t_n, a(x_n, \hat{w})) \big|_{\hat{w} = \hat{w}^{(i-1)}} = \hat{w}^{(i-1)} + t_n \varphi(x_n), $$ (4.13) where the second equality assumes the linear activation $ a(x, \hat{w}) = \hat{w}^T \varphi(x) $ with feature vector $ \varphi(x) $. It can be proved that, at each step, the algorithm reduces the term $ \ell(t_n, a(x_n, \hat{w})) $ in the perceptron loss related to the selected training example even if the latter is misclassified. It can also be shown that, if the training set is linearly separable, the perceptron algorithm finds a weight vector $ \hat{w} $ that separates the two classes exactly in a finite number of steps [23]. However, convergence can be slow. More importantly, the perceptron fails on training sets that are not linearly separable, such as the XOR training set $ D = \{(0, 0), (0, 1), (1, 0), (1, 1)\} $ [97]. This realization came as a disappointment and contributed to the first so-called AI winter, a period characterized by reduced funding for AI and machine learning research [154]. ## Support Vector Machine (SVM) SVM, introduced in its modern form by Cortes and Vapnik [37] in 1995, was among the main causes for a renewal of interest in machine learning.
``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 96 Context: # 4.3.3 Multi-Class Classification Here we briefly describe classification scenarios with $ K > 2 $ classes. As a first observation, it is possible to build multi-class classifiers based solely on multiple binary classifiers, such as SVM. This can be done by following one of two general strategies, namely one-versus-the-rest and one-versus-one [23, Chapter 7]. The one-versus-the-rest approach trains $ K $ separate binary classifiers, say $ k = 1, \ldots, K $, with the $ k $-th classifier operating on the examples relative to class $ C_k $ against the examples from all other classes. The one-versus-one method, instead, trains $ K(K - 1)/2 $ binary classifiers, one for each pair of classes. Both approaches can yield ambiguities in classification [23, Chapter 7]. # 4.4 Discriminative Probabilistic Models: Generalized Linear Models Discriminative probabilistic models are potentially more powerful than deterministic ones since they allow to model sources of uncertainty in the label assignment to the input variables. This randomness may model noise, labeling errors, e.g., for crowdsourced labels, and/or the residual uncertainty in the classification rule due to the availability of limited data. Probabilistic models can also more naturally accommodate the presence of more than two classes by producing a probability distribution over the possible label values. In this section, we study GLMs, which were introduced in Sec. 3.5. We recall that a GLM (3.5.1) posits that the conditional pdf $ p(y | \boldsymbol{x}) $ is a member of the exponential family in which the natural parameter vector $ \eta $ is given as a linear function of a feature vector $ \boldsymbol{\varphi}(\boldsymbol{x}) $, i.e., $ \eta = \boldsymbol{W}^T \boldsymbol{\varphi} $. 
It is noted that GLMs are not linear models: only the unnormalized log-likelihood is linear in the weight matrix $ \boldsymbol{W} $ (see Sec. 3.2). We start by discussing binary classification and then cover the multi-class case in Sec. 4.4.3. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 106 Context: # Classification **Figure 4.8:** Probability that the class label is the same as for the examples marked with circles according to the output of a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity. The probability is represented by the color map illustrated by the bar on the right of the figure. For reference, the solid line represents the decision line for logistic regression. ## Example 4.3 In the example in Fig. 4.8, the illustrated $ N = 300 $ training examples are used to train a logistic regression, i.e., GLM, model and a feed-forward multi-layer network with one hidden layer (L = 1) and six hidden neurons with sigmoid non-linearity $ h(z) = \sigma(z) $. The logistic model uses linear features $ \varphi(x) = [1 \; x^T]^T $. Both networks are trained using SGD. For logistic regression, the decision line is illustrated as a solid line, while for the multi-layer network we plot, as a color map, the probability that the class label is the same as for the examples marked with circles. The GLM model with linear features is seen to be unable to capture the structure of the data, while the multi-layer network can learn suitable features that improve the effectiveness of classification. ## 4.5.3 Some Advanced Topics We conclude this section by noting a few important aspects of the ongoing research on deep neural networks. A first issue concerns the theoretical understanding of the generalization properties of deep neural networks.
On the face of it, the success #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 107 Context: # 4.5. Discriminative Probabilistic Models: Beyond GLM of deep neural networks appears to defy one of the principles laid out in Chapter 2, which will be formalized in the next chapter: Highly over-parameterized models trained via ML suffer from overfitting and hence do not generalize well. Recent results suggest that the use of SGD, based on the gradient (4.37), may act as a regularizer following an MDL perspective. In fact, SGD favors the attainment of flat local maxima of the likelihood function. Flat maxima require a smaller number of bits to be specified with respect to sharp maxima; since, in flat maxima, the parameter vector can be described with limited precision as long as it remains within the flat region of the likelihood function [66, 79, 71] (see also [78] for a different perspective). Another important aspect concerns the hardware implementation of backprop. This is becoming extremely relevant given the practical applications of deep neural networks for consumers' devices. In fact, a key aspect of backprop is the need to propagate the error, which is measured at the last layer, to each synapse via the backward pass. This is needed in order to evaluate the gradient (4.37). While a software implementation of this rule does not present any conceptual difficulty, realizing the computations in (4.36) in hardware, or even on a biological neural system, is faced with a number of issues. 1. A first problem is the non-locality of the update (4.37).
An update rule is local if it only uses information available at each neuron. In contrast, as seen, rule (4.37) requires back-propagation through the neural network. 2. Another issue is the need for the backward pass to use a different neural path with respect to the forward pass, given that, unlike the backward pass, the forward pass includes also non-linearities. A useful discussion of hardware implementation aspects can be found in [12] (see also references therein). To obviate at least some of these practical issues, a number of variations of the rule (4.37) have been proposed [12]. For instance, the feedback alignment rule modifies (4.36) by using fixed random matrices in lieu of the current weight matrices $ W' $; while the broadcast alignment rule writes the back-propagated error vectors in (4.36) as a linear function, with fixed random coefficients, of the output error $ \mathbf{y} - \mathbf{t} $, hence removing the need for back-propagation [129]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 113 Context: # 4.8 Summary When training the $k$th model, the outputs $a_1(x, \mathbf{w}_1), \ldots, a_{k-1}(x, \mathbf{w}_{k-1})$ of the previously trained models, as well as their weights $\pi_1, \ldots, \pi_{k-1}$, are kept fixed. Excluding the models $k + 1, \ldots, K$, the training loss can be written as \[ L_k = \sum_{n=1}^{N} \alpha_{n}^{(k)} \exp\left(-\pi_k t_n \cdot a_k(x_n, \mathbf{w}_k)\right), \tag{4.47} \] with the weights \[ \alpha_{n}^{(k)} = \exp\left(-t_n \cdot \sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)\right). \tag{4.48} \] An important point is that the weights (4.48) are larger for training samples with smaller functional margin under the mixture model $\sum_{j=1}^{k-1} \pi_j a_j(x_n, \mathbf{w}_j)$. Therefore, when training the $k$th model, we give more importance to examples that fare worse in terms of classification margins under the current mixture model.
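As a small numerical illustration of how the weights (4.48) emphasize poorly classified examples, the following sketch evaluates them with NumPy. It is not taken from the book: the function name `boosting_weights`, the array layout, and the assumption that labels take values in $\{-1, +1\}$ and that the weights have the form $\alpha_n^{(k)} = \exp(-t_n \sum_{j<k} \pi_j a_j(x_n, \mathbf{w}_j))$ are choices made for this example.

```python
import numpy as np

def boosting_weights(t, outputs, pi):
    """Per-example weights used when training the k-th model (hypothetical helper).

    t:       (N,) labels, assumed to be in {-1, +1}
    outputs: (N, k-1) array, outputs[n, j] = output of the j-th trained model on x_n
    pi:      (k-1,) mixture weights of the previously trained models
    """
    t = np.asarray(t, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    pi = np.asarray(pi, dtype=float)
    mixture = outputs @ pi          # current mixture output for each example
    margins = t * mixture           # functional margins under the mixture
    return np.exp(-margins)         # small or negative margin -> large weight

# Three examples: well classified, slightly misclassified, badly misclassified.
t = [+1, -1, +1]
outputs = [[2.0, 1.0], [1.5, -0.5], [-1.0, -1.0]]
pi = [0.5, 0.5]
print(boosting_weights(t, outputs, pi))
```

In this toy input the third example has the most negative margin and therefore receives the largest weight, which is the qualitative behavior described in the text.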
Note that, at each training step $k$, one trains a simple model, which has the added advantage of reducing the computational complexity as compared to the direct learning of the full training model. We refer to [23, Ch. 14][33, Ch. 10] for further details. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 115 Context: 4.8. Summary ----------- Chapter 8 for details). Learning rate schedules that satisfy (4.49) include $\gamma(t) = 1/t$. The intuitive reason for the use of diminishing learning rates is the need to limit the impact of the “noise” associated with the finite-sample estimate of the gradient [22]. The proof of convergence leverages the unbiasedness of the estimate of the gradient obtained by SGD. In practice, a larger mini-batch size $S$ decreases the variance of the estimate of the gradient, hence improving the accuracy when close to a stationary point. However, choosing a smaller $S$ can improve the speed of convergence when the current solution is far from the optimum [152, Chapter 8][22]. A smaller mini-batch size $S$ is also known to improve the generalization performance of learning algorithms by avoiding sharp external points of the training loss function [66, 79] (see also Sec. 4.5). Furthermore, as an alternative to decreasing the step size, one can also increase the size of the mini-batch along the iterations of the SGD algorithm [136]. Variations and Generalizations ------------------------------- Many variations of the discussed basic SGD algorithm have been proposed and routinely used. General principles motivating these schedule variants include [56, Chapter 8]: 1. **Momentum**, or heavy-ball, memory: correct the direction suggested by the stochastic gradient by considering the “momentum” acquired during the last update; 2. 
**Adaptivity**: use a different learning rate for different parameters depending on an estimate of the curvature of the loss function with respect to each parameter; 3. **Control variates**: in order to reduce the variance of the SGD updates, add control variates that do not affect the unbiasedness of the stochastic gradient; and 4. **Second-order updates**: include information about the curvature of the cost or objective function in the parameter update. As detailed in [56, Chapter 8][76, 43], to which we refer for further discussions, methods in the first category include Nesterov momentum; in the second category, we find AdaGrad, RMSProp, and Adam; and the third encompasses SVRG and SAGA. Finally, the fourth features Newton #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 116 Context: # Classification methods, which require calculation of the Hessian to evaluate the local curvature of the objective, and approximated Newton methods, which leverage an estimate of the Hessian. The practical and theoretical implications of these methods are still under discussion [157]. Related to second-order methods is the natural gradient approach, which applies most naturally to probabilistic models in which the function to be optimized is a LL [5]. The conventional gradient method updates the parameters by moving in the direction that minimizes the cost function under a constraint on the norm of the update vector in the parameter space. A potentially problematic aspect of this method is that the Euclidean distance $ \|\theta' - \theta''\|^2 $ between two parameter vectors $ \theta' $ and $ \theta'' $, e.g., two mean parameters in a model within the exponential family, does not provide a direct measure of the distance of the two corresponding distributions in terms of relevant metrics such as the KL divergence.
The natural gradient method addresses this issue by measuring the size of the update directly in terms of the KL divergence between distributions. This modifies the update by pre-multiplying the gradient with the inverse of the Fisher information matrix [5]. The discussion above focuses on the common case of differentiable cost functions. ERM problems typically include possibly non-differentiable regularization terms. To tackle these problems, techniques such as the subgradient method and proximal gradient can be used in lieu of SGD [22]. Other important aspects of optimization schemes include parallelism and non-convexity (see, e.g., [130, 141, 161]). Alternatives to gradient methods that do not require differentiability include evolutionary schemes [128]. ## Appendix B: Kernel Methods In this section, we provide a brief introduction to kernel methods. This section requires some background in Lagrange duality. We start by revisiting the problem (4.16) solved by SVM. Using Lagrange duality, the optimization (4.16) can be solved in the dual domain, that is, by optimizing over the dual variables or the Lagrange multipliers. Referring for details to [23, Chapter 7], the resulting problem… #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 119 Context: # 5 ## Statistical Learning Theory* Statistical learning theory provides a well-established theoretical framework in which to study the trade-off between the number $ N $ of available data points and the generalization performance of a trained machine. The approach formalizes the notions of model capacity, estimation error (or generalization gap), and bias that underlie many of the design choices required by supervised learning, as we have seen in the previous chapters. This chapter is of mathematical nature, and it departs from the algorithmic focus of the text so far. 
While it may be skipped at a first reading, the chapter sheds light on the key empirical observations made in the previous chapters relative to learning in a frequentist setup. It does so by covering the theoretical underpinnings of supervised learning within the classical framework of statistical learning theory. To this end, the chapter contains a number of formal statements with proofs. The proofs have been carefully selected in order to highlight and clarify the key theoretical ideas. This chapter follows mostly the treatment in [133]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 125 Context: ``` 5.2 PAC Learnability and Sample Complexity =========================================== In order to formally address the key question posed above regarding the learnability of a model $ \mathcal{H} $, we make the following definitions. As mentioned, for simplicity, we consider binary classification under the 0-1 loss, although the analysis can be generalized under suitable conditions [133]. ### Definition 5.2 A hypothesis class $ \mathcal{H} $ is PAC learnable if, for any $ \epsilon, \delta \in (0, 1) $, there exists an $ (N, \epsilon, \delta) $ PAC learning rule as long as the inequality \[ N \geq N_{\mathcal{H}}(\epsilon, \delta) \] is satisfied for some function $ N_{\mathcal{H}}(\epsilon, \delta) < \infty $. In words, a hypothesis class is PAC learnable if, as long as enough data is collected, a learning algorithm can be found that obtains any desired level of accuracy and confidence. An illustration of the threshold $ N_{\mathcal{H}}(\epsilon, \delta) $ can be found in Fig. 5.2. A weaker definition of PAC learnability requires (5.7) to hold only for all distributions $ p(x, t) $ that can be written as \[ p(x, t) = p(x)1(t = \hat{t}(x)) \tag{5.9} \] for some marginal distribution $ p(x) $ and for some hypothesis $ \hat{t}(x) \in \mathcal{H} $.
The condition (5.9) is known as the realizability assumption, which implies that the data is generated from some mechanism that is included in the hypothesis class. Note that realizability implies the linear separability of any data set drawn from the true distribution for the class of linear predictors (see Chapter 4). A first important, and perhaps surprising, observation is that not all models are PAC learnable. As an extreme example of this phenomenon, consider the class of all functions from $ \mathbb{R}^d $ to $ \{0, 1\} $. By the no free lunch theorem, this class is not PAC learnable. In fact, given any amount of data, we can always find a distribution $ p(x, t) $ under which the PAC condition is not satisfied. Intuitively, even in the realizable case, knowing the correct predictor $ \hat{t}(x) $ in (5.9) for any number of ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 133 Context: # 5.5. Summary To elaborate, consider a probabilistic model $ \mathcal{H} $ defined as the set of all pmfs $ p(x|\theta) $ parameterized by $ \theta $ in a given set. With some abuse of notation, we take $ \mathcal{H} $ to be also the domain of parameter $ \theta $. To fix the ideas, assume that $ \theta $ takes values over a finite alphabet. We know from Section 2.5 that a distribution $ q(x) $ is associated with a lossless compression scheme that requires around $ -\log q(x) $ bits to describe a value. Furthermore, if we were informed about the true parameter $ \theta $, the minimum average coding length would be the entropy $ H(p(\cdot|\theta)) $, which requires setting $ q(x) = p(x|\theta) $ (see Appendix A). Assume now that we only know that the parameter $ \theta $ lies in set $ \mathcal{H} $, and hence the true distribution $ p(x|\theta) $ is not known.
In this case, we cannot select the true distribution $ p(x|\theta) $, and we need instead to choose a generally different distribution $ q(x) $ to define a compression scheme. With a given distribution $ q(x) $, the average coding length is given by \[ \bar{L}(q(x), \theta) = -\sum_{x} p(x|\theta) \log q(x). \] Therefore, the choice of a generally different distribution $ q(x) $ entails a redundancy of \[ \Delta R(q(x), \theta) = -\sum_{x} p(x|\theta) \log \frac{q(x)}{p(x|\theta)} \geq 0 \tag{5.22} \] bits. The redundancy $ \Delta R(q(x), \theta) $ in (5.22) depends on the true value of $ \theta $. Since the latter is not known, this quantity cannot be computed. We can instead obtain a computable metric by maximizing over all values of $ \theta \in \mathcal{H} $, which yields the worst-case redundancy \[ \Delta R(q(x), \mathcal{H}) = \max_{\theta \in \mathcal{H}} \Delta R(q(x), \theta). \tag{5.23} \] This quantity can be minimized over $ q(x) $ yielding the so-called min-max redundancy: \[ \Delta R(\mathcal{H}) = \min_{q} \Delta R(q(x), \mathcal{H}) \tag{5.24} \] \[ = \min_{q} \max_{\theta} \sum_{x} p(x|\theta) \log \frac{p(x|\theta)}{q(x)} \tag{5.25} \] \[ = \min_{q} \max_{\theta} \text{KL}(p(x|\theta) \| q(x)). \tag{5.26} \] The min-max redundancy can be taken as a measure of the capacity of model $ \mathcal{H} $, since a richer model tends to yield a larger $ \Delta R(\mathcal{H}) $. In fact, #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 135 Context: # Part III ## Unsupervised Learning ### 1. Introduction Unsupervised learning is a type of machine learning that does not require labeled output. It relies on patterns and structures in input data to generate insights. ### 2. Common Algorithms - K-means Clustering - Hierarchical Clustering - DBSCAN ### 3. Applications - Market Basket Analysis - Image Compression - Anomaly Detection ### 4.
Conclusion Unsupervised learning provides significant value across various domains, enabling organizations to extract meaningful information from unstructured data. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 138 Context: # Unsupervised Learning may wish to cluster a set $ D $ of text documents according to their topics, by modeling the latter as an unobserved label $ z_n $. Broadly speaking, this requires to group together documents that are similar according to some metric. It is important at this point to emphasize the distinction between classification and clustering: While the former assumes the availability of a labelled set of training examples and evaluates its (generalization) performance on a separate set of unlabelled examples, the latter works with a single, unlabelled, set of examples. The different notation used for the labels $ z_n $ in lieu of $ t_n $ is meant to provide a reminder of this key difference. ## Dimensions reduction and representation: Given the set $ D $, we would like to represent the data points $ x_n \in D $ in a space of lower dimensionality. This makes it possible to highlight independent explanatory factors, and/or to ease visualization and interpretation [93], e.g., for text analysis via vector embedding (see, e.g., [124]). ## Feature extraction: Feature extraction is the task of deriving functions of the data points $ x_n $ that provide useful lower-dimensional inputs for tasks such as supervised learning. The extracted features are unobserved, and hence latent, variables. As an example, the hidden layer of a deep neural network extracts features from the data for use by the output layer (see Sec. 4.5). ## Generation of new samples: The goal here is to train a machine that is able to produce samples that are approximately distributed according to the true distribution $ p(x) $. 
For example, in computer graphics for filmmaking or gaming, one may want to train software that is able to produce artistic scenes based on a given description. The variety of tasks and the difficulty in providing formal definitions, e.g., on the realism of an artificially generated image, make unsupervised learning, at least in its current state, a less formal field than supervised learning. Often, loss criteria in unsupervised learning measure the divergence between the learned model and the empirical data distribution, but there are important exceptions, as we will see. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 142 Context: # Unsupervised Learning We obtain \[ \mu^{new}_{k} = \frac{1}{\sum_{n=1}^{N} \delta_{k,z_{n}}} \sum_{n=1}^{N} \delta_{k,z_{n}} x_{n} \] The new cluster representative $\mu^{new}_{k}$ for each cluster $k$ is the mean of the data points assigned to cluster $k$. - If a convergence criterion is not satisfied, set $ \mu^{old}_{k} \leftarrow \mu^{new}_{k} $ and return to the E step. Since both the E step and M step minimize the objective function in (6.4), respectively over the cluster assignment variables $z_n$ and the cluster representatives $\mu_k$, the value of the objective function is non-increasing across the iterations. This ensures convergence. Illustrations of convergence and examples can be found in [23, Chapter 9]. As a note, the algorithm is also known as Lloyd-Max quantization [54]. At a high level, K-means alternates between: 1. Making inferences about the hidden variables $\{z_{n}\}$ based on the current model defined by the representatives $\{\mu_{k}\}$ in the E step; and 2. Updating the model $\{\mu_{k}\}$ to match the data $\{x_{n}\}$ and the inferred variables $\{z_{n}\}$ in the M step. We will see that a similar algorithmic structure is applied by many unsupervised learning algorithms.
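The alternation between the E step and the M step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the book's implementation: the function names `kmeans` and `kmeans_objective`, the optional `init` argument, the random initialization from data points, the stopping tolerance, and the use of the squared Euclidean distance as dissimilarity measure are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, K, init=None, num_iters=100, tol=1e-8, seed=0):
    """Minimal K-means sketch (hypothetical helper, not from the text).

    X: (N, D) data matrix; K: number of clusters;
    init: optional (K, D) array of initial representatives.
    Returns the representatives mu (K, D) and the assignments z (N,).
    """
    X = np.asarray(X, dtype=float)
    if init is None:
        # Assumed initialization: K distinct data points chosen at random.
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    else:
        mu = np.asarray(init, dtype=float).copy()
    for _ in range(num_iters):
        # E step: assign each point to its closest representative.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # M step: each representative becomes the mean of its assigned points.
        mu_new = mu.copy()
        for k in range(K):
            if np.any(z == k):  # keep the old representative if a cluster is empty
                mu_new[k] = X[z == k].mean(axis=0)
        converged = np.linalg.norm(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu, z

def kmeans_objective(X, mu, z):
    """Sum of squared distances between points and their representatives."""
    X = np.asarray(X, dtype=float)
    return float(((X - mu[z]) ** 2).sum())

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
mu, z = kmeans(X, K=2, init=[[0.0, 0.0], [10.0, 10.0]])
print(mu, z, kmeans_objective(X, mu, z))
```

Tracking `kmeans_objective` across iterations is one simple way to check that neither step increases the objective.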
Before we move on to discussing more general solutions, it is worth spending a few words on the problem of selecting the number $K$ of clusters. A first possibility is to add or remove clusters until certain heuristic criteria are satisfied, such as the “purity” of clusters. A second approach is hierarchical clustering, whereby one builds a tree, known as a dendrogram, that includes clustering solutions with an increasing number of clusters as one moves away from the root (see, e.g., [51]). Yet another solution is to let $K$ be selected automatically by adopting a non-parametric Bayesian approach via a Dirichlet process prior [104]. ## 6.3 ML, ELBO and EM In this section, we discuss two key technical tools that are extensively used in tackling unsupervised learning problems, namely the Evidence Lower Bound (ELBO) and the EM algorithm. The starting point is the fundamental problem of learning a probabilistic, directed or undirected, #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 160 Context: 154 # Unsupervised Learning The Jensen-Shannon divergence, among others, with specific choices of function $ g $ (see Appendix A). ## Generalizing the MI problem (6.37) GANs attempt to solve the problem \[ \min_{\phi} D_f(p_{\theta}(x)||p_{g}(\phi)). \] (6.40) More precisely, GANs parameterize the discriminator $ D_{\phi}(x) $ by choosing a differentiable function $ T_{\phi}(x) $ of the parameter vector $ \phi $. This effectively reduces the search space for the discriminator, and defines a different type of divergence for each value of $ \phi $. A typical choice for $ T_{\phi} $ is the output of a multi-layer neural network with weights $ \phi $ or a function thereof (see See 4.5). With this in mind, using (6.39) in (6.40), we obtain the minimax problem solved by GANs \[ \min \max_{x \sim p_{r}} [T_{\phi}(x)] - E_{x \sim p_{g}(\phi)}[T_{\phi}(x)]. 
\] (6.41) According to (6.41), GANs select the divergence measure adaptively through the optimization of the discriminator $ T_{\phi} $. Problem (6.41) can also be interpreted in terms of a strategic game played by generator and discriminator [57]. The original GAN method [57] operates by setting $ T(x) = D_{\phi}(x) $, where $ D_{\phi}(x) $ is the output of a multi-layer perceptron, and $ g(t) = -\log(1 - \exp(t)) $. A variation that is shown to be more useful in practice is also discussed in the first GAN paper (see [49]). As another popular example, Wasserstein GAN simply sets $ T(x) $ to be the output of a multi-layer perceptron and chooses $ g(t) = t - 1 $. ## Application to likelihood-free models GANs are typically used with likelihood-free models. Accordingly, the model distribution $ p_{\phi}(x) $ is written as \[ p_{\phi}(x) = \int \delta(x - G_{\phi}(z)) \sqrt{\rho(z)} \, dz. \] (6.42) 1. For a given function $ T(x) $, the right-hand side of (6.39), excluding the maximization, provides a lower bound on the divergence $ D_{f}(\cdot \| \cdot) $. Another useful related bound for the KL divergence is the so-called Donsker-Varadhan representation (see, e.g., [51]). 2. The version of GAN given here is known as $f$-GANs [107], which is a generalization of the original GAN. 3. It is noted that the GAN setup is reminiscent of the adversarial model used to define semantic security in cryptography. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 162 Context: # Unsupervised Learning ## 6.5.1 Restricted Boltzmann Machines (RBM) RBMs are typically characterized by an $ M $-dimensional binary hidden vector $ z \in \{0, 1\}^M $, while the observed variables may be discrete or continuous. Here we consider the case of binary observed variables, which is suitable to model, e.g., black-and-white images or positive/negative recommendations.
The RBM model is defined as (6.3), where the joint distribution of visible and latent variables is a log-linear model, and hence part of the exponential family. Mathematically, we have: \[ p(x, z | \theta) = \frac{1}{Z(\theta)} \exp(-E(x, z | \theta)), \tag{6.43} \] with the energy function given as \[ E(x, z | \theta) = -a^T z - b^T x - x^T W z, \tag{6.44} \] and the partition function as \[ Z(\theta) = \sum_{x, z} \exp(-E(x, z | \theta)). \] The parameter vector $\theta$ includes the $ M \times 1 $ vector $ a $, the $ D \times 1 $ vector $ b $, and the $ D \times M $ matrix $ W $. The qualifier "restricted" captures the fact that the energy function (6.44) only features cross-terms that include one variable from $ x $ and one from $ z $, and no term includes two variables that are both in $ x $ or both in $ z $. In other words, the model accounts for the affinities between variables in $ x $ and $ z $, but not directly between variables in either $ x $ or $ z $. For instance, if the $ (i, j) $th entry $ w_{ij} $ of matrix $ W $ is a large positive number, variables $ x_i $ and $ z_j $ tend to have high signals in order to minimize the energy and hence maximize the probability; and the opposite is true when $ w_{ij} $ is negative. As we will see in the next chapter, this type of probabilistic relationship can be represented by the undirected graphical model known as MRF shown in Fig. 6.6. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 169 Context: # 6.8 Ranking ## Denoising autoencoders An alternative approach to facilitate the learning of useful features is that taken by denoising autoencoders [150]. Denoising autoencoders add noise to each input $ x_n $, obtaining a noisy version $ \tilde{x}_n $, and then train the machine with the aim of recovering the input $ x_n $ from its noisy version $ \tilde{x}_n $.
Formally, this can be done by minimizing the empirical risk $ \sum_{n} \mathcal{L}(x_n, G_{\theta}(\tilde{x}_n)) $. ## Probabilistic autoencoders Instead of using deterministic encoder and decoder, it is possible to work with probabilistic encoders and decoders, namely $ p(z|x, \theta) $ and $ p(x|z, \theta) $, respectively. Treating the encoder as a variational distribution, learning can then be done by a variant of the EM algorithm. The resulting algorithm, known as Variational AutoEncoder (VAE), will be mentioned in Chapter 8. ## 6.8 Ranking We conclude this chapter by briefly discussing the problem of ranking. When one has available ranked examples, ranking can be formulated as a supervised learning problem [133]. Here, we focus instead on the problem of ranking a set of webpages based only on the knowledge of their underlying graph of mutual hyperlinks. This set-up may be considered as a special instance of unsupervised learning. We specifically describe a representative and popular scheme known as PageRank [110], which uses solely the web of hyperlinks as a form of supervision signal to guide the ranking. To elaborate, we define the connectivity graph by including a vertex for each webpage and writing the adjacency matrix as: \[ L_{ij} = \begin{cases} 1 & \text{if page } j \text{ links to page } i \\ 0 & \text{otherwise} \end{cases} \] The outgoing degree of a webpage is given as: \[ C_j = \sum_{i} L_{ij}. \] PageRank computes the rank $ p_i $ of a webpage $ i $ as: \[ p_i = (1 - d) + d \sum_{j} \frac{L_{ij}}{C_j} p_j, \] where $ d $ is a damping factor. Each page $ j $ that links to page $ i $ hence contributes its own rank $ p_j $, divided equally among its $ C_j $ outgoing links. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 173 Context: # 7.1 Introduction In this section, we start by discussing two examples that illustrate the type of structural information that can be encoded by means of probabilistic graphical models. We then provide an overview of this chapter. 
As illustrated by the two examples below, structured probabilistic models can be used to set up parametric models for both supervised and unsupervised learning. In the former case, all variables are observed in the training set, with some rvs being inputs, i.e., covariates ($x$), and others being considered as outputs, or targets ($y$). In contrast, in the latter case, some variables are unobserved and play the role of latent variables ($z$) that help explain or generate the observed variables ($x$).[^1] ## Example 7.1 Consider the tasks of text classification via supervised learning or topic clustering via unsupervised learning. In the supervised learning case, the problem is to classify documents depending on their topic, e.g., sport, politics or entertainment, based on a set of labeled documents. With unsupervised learning, the problem is to cluster documents according to the similarity of their contents based on the sole observation of the documents themselves. A minimal model for this problem should include a variable $t$ representing the topic and a variable $x$ for the document. The topic can be represented by a categorical variable taking $T$ values, i.e., $t \in \{1, ... , T\}$, which is observed for supervised learning and latent for unsupervised learning. As for the document, with “bag-of-words” encoding, a set of $W$ words of interest is selected, and a document is encoded as a $W \times 1$ binary vector $x = [x_1, ..., x_W]^T$, where $x_w = 1$ if word $w$ is contained in the document.[^2] To start, we could try to use an unstructured directed model defined as \[ t \sim \text{Cat}(π) \tag{7.1a} \] \[ x | t = t \sim \text{Cat}(π_t) \tag{7.1b} \] [^1]: Strictly speaking, this distinction applies to the frequentist approach, since in the Bayesian approach the model parameters are always treated as unobserved random variables. [^2]: Note that $W$ here does not represent a matrix of weights! 
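The bag-of-words encoding described in Example 7.1 can be sketched in a few lines. This is only an illustration and is not taken from the monograph: the function name `bag_of_words`, the whitespace-based tokenization, and the four-word vocabulary are all assumptions made for the example.

```python
def bag_of_words(document, vocabulary):
    """Encode a document as a binary list x of length W.

    x[w] = 1 if vocabulary word w occurs in the document, else 0.
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    tokens = set(document.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical set of W = 4 words of interest.
vocabulary = ["goal", "election", "movie", "team"]

x = bag_of_words("The team scored a late goal", vocabulary)
print(x)  # [1, 0, 0, 1]
```

Note that the encoding discards word order and word counts: only the presence or absence of each of the $W$ selected words is retained.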
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 202 Context: # Approximate Inference and Learning The equations (8.21) for approximating the posterior $ p(z|x) $ are given as \[ q(z_i) \propto \exp \left( \eta_1 \sum_{j} \mathrm{E}_{z_j \sim q(z_j)}[z_i z_j] + \eta_2 x_i z_i \right) \quad (8.22) \] and hence, upon normalization, we have \[ q(z_i = 1) = \frac{1}{1 + \exp(-2(\eta_1 \sum_{j} \mu_j + \eta_2 x_i))} = \sigma \left( 2 \left( \eta_1 \sum_{j} \mu_j + \eta_2 x_i \right) \right) \quad (8.23) \] where $ \mu_i = q(z_i = 1) - q(z_i = -1) = 2q(z_i = 1) - 1 $, and the sums are taken over the neighbors $ j $ of pixel $ i $. For a numerical example, consider a $ 4 \times 4 $ binary image $ z $ observed as matrix $ x $, where the joint distribution of $ x $ and $ z $ is given by the Ising model. Note that, according to this model, the observed matrix $ x $ is such that each pixel of the original matrix $ z $ is flipped independently with probability $ \sigma(-2\eta_2) $. In this small example, it is easy to generate an image $ z $ distributed according to the model, as well as to compute the exact posterior $ p(z|x) $ by enumeration of all possible images $ z $. The KL divergence $ \text{KL}(p(z|x) || q(z)) $ obtained at the end of each iteration of MFVI (with one iteration applying (8.23) one by one to all variables) is shown in Fig. 8.2 for $ \eta_1 = 0.15 $ and various values of $ \eta_2 $. As $ \eta_2 $ increases, the posterior distribution tends to become deterministic since $ x $ is an increasingly accurate measurement of $ z $. As a result, the final mean-field approximation is more faithful to the real posterior, since a product distribution can capture a deterministic posterior. For smaller values of $ \eta_2 $, however, the bias due to the mean-field assumption yields a significantly larger KL divergence. ## Beyond MFVI MFVI assumes a fully factorized variational distribution $ q(\cdot) $. 
It is also possible to develop methods that are based on the same factorization as the joint distribution $ p(x, z) $. This is known as the Bethe approximation, and yields Loopy Belief Propagation (LBP) as a specific solution technique. LBP can also be interpreted as the application of message passing belief propagation, which was described. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 209 Context: # 9 ## Concluding Remarks This monograph has provided a brief introduction to machine learning by focusing on parametric probabilistic models for supervised and unsupervised learning problems. It has endeavored to describe fundamental concepts within a unified treatment starting from first principles. Throughout the text, we have also provided pointers to advanced topics that we have only been able to mention or shortly touch upon. Here, we offer a brief list of additional important aspects and open problems that have not been covered in the preceding chapters. - **Privacy:** In many applications, data sets used to train machine learning algorithms contain sensitive private information, such as personal preferences for recommendation systems. It is hence important to ensure that the learned model does not reveal any information about the individual entries of the training set. This constraint can be formulated using the concept of differential privacy. Typical techniques to guarantee privacy of individual data points include adding noise to gradients when training via SGD and relying on mixtures of experts trained with different subsets of the data [1]. - **Robustness:** It has been reported that various machine learning models, including neural networks, are sensitive to small variations in input data. 
#################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 210 Context: # Concluding Remarks Such models can produce the wrong response upon minor, properly chosen, changes in the explanatory variables. Note that such adversarially chosen examples, which cause a specific machine to fail, are conceptually different from the randomly selected examples that are assumed when defining the generalization properties of a network. There is evidence that finding such examples is possible even without knowing the internal structure of a machine, but solely based on black-box observations [11]. Modifying the training procedure in order to ensure robustness to adversarial examples is an active area of research with important practical implications [55]. ## Computing Platforms and Programming Frameworks In order to scale up machine learning applications, it is necessary to leverage distributed computing architectures and related standard programming frameworks [17, 7]. As a complementary and more futuristic approach, recent work has proposed leveraging the capabilities of annealing-based quantum computers as samplers [82] or discrete optimizers [103]. ## Transfer Learning Machines trained for a certain task currently need to be re-trained in order to be re-purposed for a different task. For instance, a machine that learned how to drive a car would need to be retrained in order to learn how to drive a truck. The field of transfer learning covers scenarios in which one wishes to transfer the expertise acquired from some tasks to others. Transfer learning includes different related paradigms, such as multitask learning, lifelong learning, zero-shot learning, and domain adaptation [149]. In multitask learning, several tasks are learned simultaneously. 
Typical solutions for multitask learning based on neural networks preserve the presence of common representations among neural networks trained for different tasks [19]. Lifelong learning utilizes a machine trained on a number of tasks to carry out a new task by leveraging the knowledge accumulated during the previous training phases [143]. Zero-shot learning refers to models capable of recognizing unseen classes with training examples available only for related, but different, classes. This often entails the task of learning representation of classes, such as prototype vectors, that generate data in the class through a fixed probabilistic mechanism [52]. Domain adaptation will be discussed separately in the next point. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 211 Context: - **Domain adaptation:** In many learning problems, the available data has a different distribution from the data on which the algorithm will be tested. For instance, in speech recognition, one has available data for a given user at learning time, but it may be desirable, after learning, to use the same machine for another user. For supervised learning, this is typically modeled by assuming that the distribution of the covariates $ x $ is different during training and test, while the discriminative conditional distribution $ p(y|x) $ is the same for both phases [149]. A generalization of PAC theory analyzes domain adaptation, obtaining bounds on the generalization error under the desired test distribution as a function of the difference between the training and test distributions [26]. - **Communication-efficient learning:** In distributed computing platforms, data is typically partitioned among the processors and communication among the processors entails latency and energy consumption. 
An important research problem is that of characterizing the best trade-off between learning performance and communication overhead [160]. - **Reinforcement learning:** Reinforcement learning is at the heart of recent successes of machine learning methods in acquiring the skills necessary to play video games or games against human opponents (see, e.g., [99]). In reinforcement learning, one wishes to learn the optimal mapping, say $ \phi(x, \theta) $, between the observed state $ x $ of the world and an action $ a $. Unlike supervised learning, the optimal action $ a $ is not known, but the machine observes a reward/punishment signal depending on the effect of the action. A popular approach, referred to as deep reinforcement learning, models the mapping $ \phi(x, \theta) $ using a deep neural network. This is trained to maximize the average reward via SGD by using the REINFORCE method (Chapter 8) to estimate the gradient [135, 88, 7, 9]. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 213 Context: # Appendix A: Information Measures In this appendix, we describe a principled and intuitive introduction to information measures that builds on inference, namely estimation and hypothesis testing. We focus on entropy, mutual information, and divergence measures. We also concentrate on discrete r.v.s. In the monograph, we have taken the pragmatic approach of extending the definitions to continuous variables by substituting sums with integrals. It is worth noting that this approach does not come with any practical complications when dealing with mutual information and divergence. Instead, the continuous version of the entropy, known as differential entropy, should be treated with care, as it does not satisfy some key properties of the entropy such as non-negativity. 
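The caveat about differential entropy can be checked numerically. The sketch below is an illustration rather than part of the monograph: it assumes the standard definitions of entropy for a discrete pmf and of differential entropy for a uniform density on $[0, a]$ (which equals $\log_2 a$ bits), and the function names are hypothetical.

```python
import math


def discrete_entropy(pmf):
    """Entropy in bits of a discrete pmf; it is always non-negative."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def uniform_differential_entropy(a):
    """Differential entropy in bits of a uniform density on [0, a].

    The density equals 1/a on the interval, so the integral of
    -(1/a) * log2(1/a) over [0, a] evaluates to log2(a).
    """
    return math.log2(a)


print(discrete_entropy([0.5, 0.5]))       # 1.0
print(uniform_differential_entropy(0.5))  # -1.0
```

A fair binary variable carries one bit of entropy, whereas a uniform density supported on an interval shorter than one has negative differential entropy. This is why the continuous case cannot be read directly as an amount of uncertainty.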
## A.1 Entropy As proposed by Claude Shannon, the amount of information received from the observation of a discrete random variable $ X \sim p(x) $ defined over a finite alphabet should be measured by the amount of uncertainty about its value prior to its measurement. To this end, we consider the problem of estimating the value of $ x $ when one only knows the following: #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 225 Context: # References [9] Arulkumaran, K., M. P. Deisenroth, M. Brundage, and A. A. Bharath. 2017. “Deep Reinforcement Learning: A Brief Survey”. *IEEE Signal Processing Magazine*, 34(6): 26–38. ISSN: 1053-5888. DOI: [10.1109/MSP.2017.2732100](https://doi.org/10.1109/MSP.2017.2732100). [10] Azoury, K. S. and M. K. Warmuth. 2001. “Relative loss bounds for on-line density estimation with the exponential family of distributions”. *Machine Learning*, 43(3): 211–246. [11] Bagheri, A., O. Simeone, and B. Rajendran. 2017. “Training Probabilistic Spiking Neural Networks with First-to-spike Decoding”. arXiv e-prints. Oct. arXiv: [1710.10704](https://arxiv.org/abs/1710.10704) [stat.ML]. [12] Baldi, P., P. Sadowski, and Z. Lu. 2016. “Learning in the machine: Random backpropagation and the learning channel”. arXiv preprint arXiv: [1612.02734](https://arxiv.org/abs/1612.02734). [13] Bamler, R., C. Zhang, M. Opper, and S. Mandt. 2017. “Perturbative Black Box Variational Inference”. arXiv e-prints. Sept. arXiv: [1709.07433](https://arxiv.org/abs/1709.07433) [stat.ML]. [14] Baraniuk, R. G. 2007. “Compressive sensing [lecture notes]”. *IEEE Signal Processing Magazine*, 24(4): 118–121. [15] Barber, D. 2012. *Bayesian reasoning and machine learning*. Cambridge University Press. [16] Beal, M. J. 2003. *Variational algorithms for approximate Bayesian inference*. University of London. [17] Bekkerman, R., M. Bilenko, and J. Langford. 2011. 
*Scaling up machine learning: Parallel and distributed approaches*. Cambridge University Press. [18] Belghazi, I., S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville. 2018. “MINE: Mutual Information Neural Estimation”. arXiv preprint arXiv: [1801.04562](https://arxiv.org/abs/1801.04562). [19] Bengio, Y. 2012. “Deep learning of representations for unsupervised and transfer learning”. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 17–36. [20] Bengio, Y., A. Courville, and P. Vincent. 2013. “Representation learning: A review and new perspectives”. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(8): 1798–1828. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 227 Context: References ========== [33] Cevher, V., S. Becker, and M. Schmidt. 2014. “Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics.” *IEEE Signal Processing Magazine*, 31(5): 32–43. [34] Cheeseman, P. C. 1985. “In Defence of Probability.” In: *IJCAI*, Vol. 85. 1002–1009. [35] Cichocki, A., D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. 2015. “Tensor decompositions for signal processing applications: From two-way to multiway component analysis.” *IEEE Signal Processing Magazine*, 32(2): 145–163. [36] Collins, M., S. Dasgupta, and R. E. Schapire. 2002. “A generalization of principal components analysis to the exponential family.” In: *Advances in neural information processing systems*, 617–624. [37] Cortes, C. and V. Vapnik. 1995. “Support-vector networks.” *Machine learning*, 20(3): 273–297. [38] Cover, T. M. and J. A. Thomas. 2012. *Elements of information theory*. John Wiley & Sons. [39] Cristianini, N. and J. Shawe-Taylor. 2000. *An introduction to support vector machines and other kernel-based learning methods*. Cambridge University Press. [40] Csiszár, I. and P. C. Shields. 2004. 
“Information theory and statistics: A tutorial.” *Foundations and Trends® in Communications and Information Theory*, 1(4): 417–528. [41] Davidson-Pilon, C. 2015. “Probabilistic Programming & Bayesian Methods for Hackers”. [42] Dayan, P., G. E. Hinton, R. M. Neal, and R. S. Zemel. 1995. “The Helmholtz machine”. *Neural computation*, 7(5): 889–904. [43] De, S., G. Taylor, and T. Goldstein. 2015. “Variance Reduction for Distributed Stochastic Gradient Descent”. arXiv preprint arXiv:1510.01708. [44] Di Lorenzo, P. and G. Scutari. 2016. “NEXT: In-network nonconvex optimization.” *IEEE Transactions on Signal and Information Processing over Networks*, 2(2): 120–136. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 228 Context: # References 1. Duchi, J. 2016. “Lecture Notes for Statistics 311/Electrical Engineering 377.” 2. Duchi, J. C., K. Khosravi, and F. Ruan. 2016. “Information Measures, Experiments, Multi-category Hypothesis Tests, and Surrogate Losses.” arXiv preprint arXiv:1603.00126. 3. Dumoulin, V., I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. 2016. “Adversarially learned inference.” arXiv preprint arXiv:1606.00704. 4. Efron, B. and T. Hastie. 2016. *Computer Age Statistical Inference*. Vol. 5. Cambridge University Press. 5. Fedus, W., M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. 2017. “Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step.” ArXiv e-prints. Oct. arXiv:1710.08456 [stat.ML]. 6. Feutry, C., P. Piantanida, Y. Bengio, and P. Duhamel. 2018. “Learning Anonymized Representations with Adversarial Neural Networks.” ArXiv e-prints. Feb. arXiv:1802.03986 [stat.ML]. 7. Friedman, J., T. Hastie, and R. Tibshirani. 2001. *The elements of statistical learning*. Vol. 1. Springer series in statistics New York. 8. Fu, Y., T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong. 
2017. “Recent advances in zero-shot recognition.” arXiv preprint arXiv:1710.04897. 9. Gal, Y. 2016. “Uncertainty in Deep Learning.” PhD thesis. University of Cambridge. 10. Gersho, A. and R. M. Gray. 2012. *Vector quantization and signal compression*. Vol. 159. Springer Science & Business Media. 11. Goodfellow, I. J., J. Shlens, and C. Szegedy. 2014. “Explaining and harnessing adversarial examples.” arXiv preprint arXiv:1412.6572. 12. Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. “Generative adversarial nets.” In: *Advances in neural information processing systems*, 2672-2680. #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 230 Context: ```markdown # References [72] Jain, P. and P. Kar. 2017. “Non-convex Optimization for Machine Learning.” *Foundations and Trends® in Machine Learning*, 10(3-4): 142–336. ISSN: 1938-8237. [URL](http://dx.doi.org/10.1561/2200000075) [73] Jang, E., S. Gu, and B. Poole. 2016. “Categorical reparameterization with gumbel-softmax.” *arXiv preprint arXiv:1611.01144.* [74] Jiao, J., T. A. Courtade, A. No, K. Venkat, and T. Weissman. 2014. “Information measures: the curious case of the binary alphabet.” *IEEE Transactions on Information Theory*, 60(12): 7616–7626. [75] Jiao, J., T. A. Courtade, K. Venkat, and T. Weissman. 2015. “Justification of logarithmic loss via the benefit of side information.” *IEEE Transactions on Information Theory*, 61(10): 5357–5365. [76] Johnson, R. and T. Zhang. 2013. “Accelerating stochastic gradient descent using predictive variance reduction.” In: *Advances in Neural Information Processing Systems*, 315–323. [77] Karpathy, A. “Deep Reinforcement Learning From Pixels.” [URL](http://karpathy.github.io/2016/05/31/). [78] Kawaguchi, K., L. Pack Kaelbling, and Y. Bengio. 2017. “Generalization in Deep Learning.” *arXiv e-prints*. *arXiv:1710.05468* [stat]. 
[79] Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. 2016. “On large-batch training for deep learning: Generalization gap and sharp minima.” *arXiv preprint arXiv:1609.04836*. [80] Kingma, D. P. and M. Welling. 2013. “Auto-encoding variational bayes.” *arXiv preprint arXiv:1312.6114*. [81] Koller, D. and N. Friedman. 2009. *Probabilistic Graphical Models: Principles and Techniques.* MIT Press. [82] Korenkevych, D., Y. Xue, Z. Bian, F. Chudak, W. G. Macready, J. Rolfe, and E. Andriyash. 2016. “Benchmarking quantum hardware for training of fully visible Boltzmann machines.” *arXiv preprint arXiv:1611.04528*. [83] LeCun, Y., S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. 2006. “A tutorial on energy-based learning.” *Predicting structured data*. ``` #################### File: A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf Page: 235 Context: # References [136] Smith, S. L., P.-J. Kindermans, and Q. V. Le. 2017. “Don’t Decay the Learning Rate, Increase the Batch Size.” ArXiv e-prints. Nov. arXiv:1711.00168 [cs.LG]. [137] Spectrum. “Will the Future of AI Learning Depend More on Nature or Nurture?” URL: https://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence. [138] Stigler, S. M. 2016. *The Seven Pillars of Statistical Wisdom.* Harvard University Press. [139] Subramaniam, S., T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. 2006. “Online outlier detection in sensor data using non-parametric models.” In *Proceedings of the 32nd International Conference on Very Large Data Bases.* VLDB Endowment. 187–198. [140] Sugiyama, M., T. Suzuki, and T. Kanamori. 2012. *Density Ratio Estimation in Machine Learning.* Cambridge University Press. [141] Sun, Y., P. Babu, and D. P. Palomar. 2017. “Majorization-minimization algorithms in signal processing, communications, and machine learning.” *IEEE Transactions on Signal Processing.* 65(3): 794–816. [142] Tegmark, M. 2017. 
*Life 3.0: Being Human in the Age of Artificial Intelligence.* Allen Lane. [143] Thrun, S. 1996. “Is learning the n-th thing any easier than learning the first?” In *Advances in Neural Information Processing Systems.* 640–646. [144] Times, T. N. Y. 1958. *New Navy Device Learns by Doing; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser.* URL: http://www.nytimes.com/1958/07/08/archives/new-baby-device-learns-by-doing-psychologist-shows-embryo-of.html. [145] Tishby, N., F. C. Pereira, and W. Bialek. 2000. “The information bottleneck method.” *arXiv preprint* physics/0004057. [146] Tsybakov, A. B. 2009. *Introduction to Nonparametric Estimation.* [147] Turner, R. E. and M. Sahani. 2011. “Two problems with variational expectation maximisation for time-series models.” *Bayesian Time Series Models:* 115–138. [148] Uber. *Pyro: Deep Universal Probabilistic Programming.* URL: http://pyro.ai/. ########## """QUERY: You are a super intelligent assistant. Please answer all my questions precisely and comprehensively. You have a large Knowledge Base with all the information of the user. This is the initial message to start the chat. Based on the following summary/context you should formulate an initial message greeting the user with the following user name diverse NF SepTest tell them that you are the AI Chatbot Simon using the Large Language Model gemini-1.5-flash to answer all questions. Formulate the initial message in the Usersettings Language German Please use the following context to suggest some questions or topics to chat about this knowledge base. List at least 3-10 possible topics or suggestions up and use emojis. The chat should be professional and in business terms. At the end ask an open question what the user would like to check on the list You provided excerpts from three different books on mathematics and machine learning. 
Let's summarize each of them: **Book 1: "A Cool Brisk Walk Through Discrete Mathematics" by Stephen Davies** This book provides a gentle introduction to various concepts in discrete mathematics, a branch of mathematics that deals with distinct, separate values. * **Chapter 2: Sets** discusses the fundamental concept of sets, collections of distinct objects, and how to define them extensionally (listing members) and intensionally (describing meaning). It also explains the concept of subsets and proper subsets. * **Chapter 3: Relations** delves into relations, which describe connections between elements of sets. It covers properties of relations like reflexivity, symmetry, antisymmetry, and transitivity. It also introduces partial orders and posets, which establish partial hierarchies within sets. * **Chapter 4: Probability** introduces probability measures, which quantify the likelihood of events. It discusses the frequentist and Bayesian interpretations of probability and explains concepts like sample spaces, events, conditional probability, and independence. * **Chapter 5: Structures** explores different data structures like graphs and trees. It explains how graphs can represent relations and covers different types of trees, including binary trees, binary search trees (BSTs), and their traversal methods. * **Chapter 6: Counting** focuses on combinatorial principles, including the Fundamental Theorem of Counting, permutations, and combinations. It explains factorials and binomial coefficients. * **Chapter 7: Numbers** revisits the concept of numbers and number systems, focusing on different bases (like base 7 and binary). It explains place value and two's complement representation for negative numbers in binary. * **Chapter 8: Logic** introduces propositional logic and predicate logic, which provide formal systems for reasoning about truth and falsehood. It covers truth tables, logical connectives, and quantifiers. 
* **Chapter 9: Proof** discusses the concept of mathematical proof, explaining direct and indirect proof methods. It emphasizes the importance of rigorous proof in establishing mathematical truth. **Book 2: "A Brief Introduction to Machine Learning for Engineers" by Osvaldo Simeone** This book provides a concise introduction to machine learning concepts, focusing on probabilistic models and their applications in supervised and unsupervised learning. * **Chapter 1: Introduction** defines machine learning, discusses when it's appropriate to use, and outlines the book's goals. * **Chapter 2: A Gentle Introduction through Linear Regression** introduces supervised learning, inference, and the frequentist and Bayesian approaches to learning. It uses linear regression as a running example. * **Chapter 3: Probabilistic Models for Learning** covers the exponential family of probabilistic models, including Gaussian, Bernoulli, and Categorical distributions. It discusses frequentist learning, conjugate priors, and energy-based models. * **Chapter 4: Classification** focuses on classification problems, where the goal is to assign data points to discrete categories. It covers discriminative deterministic models (like SVM), discriminative probabilistic models (like logistic regression), and generative probabilistic models. * **Chapter 5: Statistical Learning Theory** provides a theoretical framework for understanding the relationship between data size and generalization performance. It covers PAC learnability, sample complexity, and structural risk minimization. * **Chapter 6: Unsupervised Learning** explores unsupervised learning, where the goal is to discover patterns in unlabeled data. It covers clustering (like K-means), dimensionality reduction (like PCA), and generative models (like GANs and RBMs). 
* **Chapter 7: Probabilistic Graphical Models** introduces Bayesian Networks (BNs) and Markov Random Fields (MRFs), which provide graphical representations of probabilistic models. It covers conditional independence, factorization, and inference in graphical models. * **Chapter 8: Approximate Inference and Learning** discusses approximate inference methods like Monte Carlo (MC) methods and Variational Inference (VI). It covers their application to learning and introduces techniques like REINFORCE and amortized VI. * **Chapter 9: Concluding Remarks** summarizes the book's key concepts and discusses additional topics like privacy, robustness, domain adaptation, communication-efficient learning, and reinforcement learning. **Book 3: "A First Encounter with Machine Learning" by Max Welling** This book offers a beginner-friendly introduction to machine learning, covering various algorithms and concepts. * **Chapter 5: Nearest Neighbors Classification** introduces the k-nearest neighbors (kNN) algorithm, a simple classification method based on finding the most similar data points. It discusses the advantages and disadvantages of kNN. * **Chapter 6: Bayes Classifiers** covers Bayes classifiers, which use probability theory to make predictions. It explains how to compute class probabilities and make decisions based on them. * **Chapter 7: The Perceptron** introduces the perceptron, a linear classifier that learns a decision boundary by minimizing a cost function. It explains gradient descent and stochastic gradient descent for training the perceptron. * **Chapter 8: Support Vector Machines** discusses Support Vector Machines (SVMs), a powerful classification method that maximizes the margin between classes. It covers key concepts like hyperplanes, support vectors, and margins. * **Chapter 10: Kernel Ridge Regression** introduces kernel ridge regression, a regression method that uses kernel functions to model non-linear relationships. 
It discusses the lack of sparseness in kernel ridge regression compared to SVMs. * **Chapter 14: Kernel Canonical Correlation Analysis** covers kernel canonical correlation analysis (KCCA), a method for finding correlations between two sets of variables using kernel functions. It explains how to formulate and solve the KCCA problem. Overall, these excerpts provide a glimpse into the fascinating world of mathematics and machine learning, covering fundamental concepts, algorithms, and theoretical frameworks. They highlight the importance of probabilistic models, optimization techniques, and the trade-off between model complexity and generalization performance.""" Consider the chat history for relevant information. Use all information included. Use as much tokens as needed. Important: If you find information separated by a | in the context, it is a table formatted in Markdown.Important: Take a look at the QUERY and only the QUERY. If this is vague or unclear, please ignore everything and ask a follow-up question instead! 
Final Files Sources: A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 1, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 4, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 5, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 6, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 7, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 9, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 11, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 13, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 14, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 16, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 20, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 23, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 24, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 25, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 26, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 27, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 28, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 29, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 30, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 31, 
A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 34, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 38, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 46, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 58, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 70, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 80, A%20First%20Encounter%20with%20Machine%20Learning%20-%20Max%20Welling%20%28PDF%29.pdf - Page 87, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 1, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 2, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 3, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 5, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 11, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 12, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 13, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 15, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 26, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 37, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 67, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 85, 
A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 95, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 132, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 167, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 173, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 225, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 233, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 252, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 253, A%20Cool%20Brisk%20Walk%20Through%20Discrete%20Mathematics%20-%20Stephen%20Davies%20%28PDF%29.pdf - Page 254, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 1, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 2, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 3, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 4, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 8, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 10, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 11, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 12, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 13, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 14, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 15, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 16, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 17, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 18, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 19, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 21, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 22, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 41, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 50, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 54, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 55, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 58, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 59, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 79, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 82, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 92, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 96, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 106, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 107, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 113, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 115, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 116, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 119, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 125, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 133, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 135, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 138, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 142, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 160, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 162, 
A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 169, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 173, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 202, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 209, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 210, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 211, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 213, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 225, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 227, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 228, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 230, A%20Brief%20Introduction%20to%20Machine%20Learning%20for%20Engineers%20-%20Osvaldo%20Simeone%20%28PDF%29.pdf - Page 235
==================================================
**Elapsed Time: 0.00 seconds**
==================================================
FINAL ANSWER
Answer: Hello diverse NF SepTest, I am Simon, your AI chatbot, powered by the large language model gemini-1.5-flash. I can answer all your questions precisely and comprehensively.
Based on the excerpts you provided from three books on mathematics and machine learning, I can help you with the following topics:

* **Discrete mathematics:**
    * Set theory 🧮
    * Relations 🔗
    * Probability 🎲
    * Data structures (graphs, trees) 🌳
    * Combinatorics 🔢
    * Number systems 🧮
    * Logic 🧠
    * Proofs ✍️
* **Machine learning:**
    * Supervised learning 🧑‍🏫
    * Unsupervised learning 🕵️‍♀️
    * Classification 🗃️
    * Regression 📈
    * Probabilistic models 📊
    * Statistical learning theory 🧪
    * Kernel methods 🧲
    * Bayesian classifiers 🔮
    * Perceptron 🤖
    * Support Vector Machines (SVMs) 🗡️
    * K-means clustering 🧲
    * Probabilistic graphical models 🗺️
    * Approximate inference 🧮
    * Reinforcement learning 🕹️

Which item on this list would you like to look at more closely? 🤔
==================================================
**Elapsed Time: 8.02 seconds**
==================================================