A First Encounter with Machine Learning

Max Welling
Donald Bren School of Information and Computer Science
University of California Irvine

November 4, 2011

Contents

Preface
Learning and Intuition
1 Data and Information
1.1 Data Representation
1.2 Preprocessing the Data
2 Data Visualization
3 Learning
3.1 In a Nutshell
4 Types of Machine Learning
4.1 In a Nutshell
5 Nearest Neighbors Classification
5.1 The Idea In a Nutshell
6 The Naive Bayesian Classifier
6.1 The Naive Bayes Model
6.2 Learning a Naive Bayes Classifier
6.3 Class-Prediction for New Instances
6.4 Regularization
6.5 Remarks
6.6 The Idea In a Nutshell
7 The Perceptron
7.1 The Perceptron Model
7.2 A Different Cost Function: Logistic Regression
7.3 The Idea In a Nutshell
8 Support Vector Machines
8.1 The Non-Separable Case
9 Support Vector Regression
10 Kernel Ridge Regression
10.1 Kernel Ridge Regression
10.2 An Alternative Derivation
11 Kernel K-means and Spectral Clustering
12 Kernel Principal Components Analysis
12.1 Centering Data in Feature Space
13 Fisher Linear Discriminant Analysis
13.1 Kernel Fisher LDA
13.2 A Constrained Convex Programming Formulation of FDA
14 Kernel Canonical Correlation Analysis
14.1 Kernel CCA
A Essentials of Convex Optimization
A.1 Lagrangians and all that
B Kernel Design
B.1 Polynomial Kernels
B.2 All Subsets Kernel
B.3 The Gaussian Kernel
Preface

In winter quarter 2007 I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at a graduate level, it soon became clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is built upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them to be too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction into the concepts of machine learning. A first read to whet the appetite so to speak, a prelude to the more technical and advanced textbooks. Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer.

Machine learning is a relatively recent discipline that emerged from the general field of artificial intelligence only quite recently. To build intelligent machines, researchers realized that these machines should learn from and adapt to their environment. It is simply too costly and impractical to design intelligent systems by first gathering all the expert knowledge ourselves and then hard-wiring it into a machine. For instance, after many years of intense research we can now recognize faces in images to a high degree of accuracy. But the world has approximately 30,000 visual object categories according to some estimates (Biederman). Should we invest the same effort to build good classifiers for monkeys, chairs, pencils, axes etc., or should we build systems that can observe millions of training images, some with labels (e.g. these pixels in the image correspond to a car) but most of them without side information? Although there is currently no system which can recognize even in the order of 1000 object categories (the best system can get
about 60% correct on 100 categories), the fact that we pull it off seemingly effortlessly serves as a "proof of concept" that it can be done. But there is no doubt in my mind that building truly intelligent machines will involve learning from data.

The first reason for the recent successes of machine learning and the growth of the field as a whole is rooted in its multidisciplinary character. Machine learning emerged from AI but quickly incorporated ideas from fields as diverse as statistics, probability, computer science, information theory, convex optimization, control theory, cognitive science, theoretical neuroscience, physics and more. To give an example, the main conference in this field is called "Advances in Neural Information Processing Systems", referring to information theory and theoretical neuroscience and cognitive science.

The second, perhaps more important reason for the growth of machine learning is the exponential growth of both available data and computer power. While the field is built on theory and tools developed in statistics, machine learning recognizes that the most exciting progress can be made by leveraging the enormous flood of data that is generated each year by satellites, sky observatories, particle accelerators, the human genome project, banks, the stock market, the army, seismic measurements, the internet, video, scanned text and so on. It is difficult to appreciate the exponential growth of data that our society is generating. To give an example, a modern satellite generates roughly the same amount of data as all previous satellites produced together. This insight has shifted the attention from highly sophisticated modeling techniques on small datasets to more basic analysis on much larger datasets (the latter sometimes called data-mining). Hence the emphasis shifted to algorithmic efficiency, and as a result many machine learning faculty (like myself) can typically be found in computer science departments. To give some examples of recent successes of this approach, one would only have to turn on a computer and perform an internet search. Modern search engines do not run terribly sophisticated algorithms, but they manage to store and sift through almost the entire content of the internet to return sensible search results. There has also been much success in the field of machine translation, not because a new model was invented but because many more translated documents became available.

The field of machine learning is multifaceted and expanding fast. To sample a few sub-disciplines: statistical learning, kernel methods, graphical models, artificial neural networks, fuzzy logic, Bayesian methods and so on. The field also covers many types of learning problems, such as supervised learning, unsupervised learning, semi-supervised learning, active learning, reinforcement learning etc. I will only cover the most basic approaches in this book, from a highly personal perspective. Instead of trying to cover all aspects of the entire field I have chosen to present a few popular and perhaps useful tools and approaches. But what will (hopefully) be significantly different than most other scientific books is the manner in which I will present these methods. I have always been frustrated by the lack of proper explanation of equations. Many times I have been staring at a formula having not the slightest clue where it came from or how it was derived. Many books also excel in stating facts in an almost encyclopedic style, without providing the proper intuition of the method. This is my primary mission: to write a book which conveys intuition. The first chapter will be devoted to why I think this is important.

[MEANT FOR INDUSTRY AS WELL AS BACKGROUND READING]

This book was written during my sabbatical at the Radboud University in Nijmegen (Netherlands). ... Hans for discussion on intuition. I like to thank Prof. Bert Kappen, who leads an excellent group of postdocs and students, for his hospitality. Marga, kids, UCI, ...
Learning and Intuition

We have all experienced the situation that the solution to a problem presents itself while riding your bike, walking home, "relaxing" in the washroom, waking up in the morning, taking your shower etc. Importantly, it did not appear while banging your head against the problem in a conscious effort to solve it, staring at the equations on a piece of paper. In fact, I would claim that all my bits and pieces of progress have occurred while taking a break and "relaxing out of the problem". Greek philosophers walked in circles when thinking about a problem; most of us stare at a computer screen all day. The purpose of this chapter is to make you more aware of where your creative mind is located and to interact with it in a fruitful manner.

My general thesis is that, contrary to popular belief, creative thinking is not performed by conscious thinking. It is rather an interplay between your conscious mind, which prepares the seeds to be planted into the unconscious part of your mind. The unconscious mind will munch on the problem "out of sight" and return promising roads to solutions to the consciousness. This process iterates until the conscious mind decides the problem is sufficiently solved, intractable or plain dull and moves on to the next. It may be a little unsettling to learn that at least part of your thinking goes on in a part of your mind that seems inaccessible and has a very limited interface with what you think of as yourself. But it is undeniable that it is there, and it is also undeniable that it plays a role in the creative thought-process.

To become a creative thinker one should learn how to play this game more effectively. To do so, we should think about the language in which to represent knowledge that is most effective in terms of communication with the unconscious. In other words, what type of "interface" between conscious and unconscious mind should we use? It is probably not a good idea to memorize all the details of a complicated equation or problem. Instead we should extract the abstract idea and capture the essence of it in a picture.
This could be a movie with colors and other baroque features, or a more "dull" representation, whatever works. Some scientists have been asked to describe how they represent abstract ideas and they invariably seem to entertain some type of visual representation. A beautiful account of this in the case of mathematicians can be found in a marvellous book "XXX" (Hadamard).

By building accurate visual representations of abstract ideas we create a database of knowledge in the unconscious. This collection of ideas forms the basis for what we call intuition. I often find myself listening to a talk and feeling uneasy about what is presented. The reason seems to be that the abstract idea I am trying to capture from the talk clashed with a similar idea that is already stored. This in turn can be a sign that I either misunderstood the idea before and need to update it, or that there is actually something wrong with what is being presented. In a similar way I can easily detect that some idea is a small perturbation of what I already knew (I feel happily bored), or something entirely new (I feel intrigued and slightly frustrated). While the novice is continuously challenged and often feels overwhelmed, the more experienced researcher feels at ease 90% of the time because the "new" idea was already in his/her database, which therefore needs no or very little updating.

Somehow our unconscious mind can also manipulate existing abstract ideas into new ones. This is what we usually think of as creative thinking. One can stimulate this by seeding the mind with a problem. This is a conscious effort and is usually a combination of detailed mathematical derivations and building an intuitive picture or metaphor for the thing one is trying to understand. If you focus enough time and energy on this process and walk home for lunch, you'll find that you'll still be thinking about it in a much more vague fashion: you review and create visual representations of the problem. Then you get your mind off the problem altogether, and when you walk back to work suddenly parts of the solution surface into consciousness. Somehow, your unconscious took over and kept working on your problem. The essence is that you created visual representations as the building blocks for the unconscious mind to work with.

In any case, whatever the details of this process are (and I am no psychologist), I suspect that any good explanation should include both an intuitive part, including examples, metaphors and visualizations, and a precise mathematical part where every equation and derivation is properly explained. This then is the challenge I have set to myself. It will be your task to insist on understanding the abstract idea that is being conveyed and build your own personalized visual representations. I will try to assist in this process, but it is ultimately you who will have to do the hard work.

Many people may find this somewhat experimental way to introduce students to new topics counter-productive. Undoubtedly for many it will be. If you feel under-challenged and become bored I recommend you move on to the more advanced textbooks, of which there are many excellent samples on the market (for a list see (books)). But I hope that for most beginning students this intuitive style of writing may help to gain a deeper understanding of the ideas that I will present in the following. Above all, have fun!
Chapter 1
Data and Information

Data is everywhere in abundant amounts. Surveillance cameras continuously capture video, every time you make a phone call your name and location get recorded, often your clicking pattern is recorded when surfing the web, most financial transactions are recorded, satellites and observatories generate terabytes of data every year, the FBI maintains a DNA database of most convicted criminals, soon all written text from our libraries will be digitized, need I go on?

But data in itself is useless. Hidden inside the data is valuable information. The objective of machine learning is to pull the relevant information from the data and make it available to the user. What do we mean by "relevant information"? When analyzing data we typically have a specific question in mind, such as: "How many types of car can be discerned in this video?" or "What will the weather be next week?". So the answer can take the form of a single number (there are 5 cars), or a sequence of numbers (the temperature next week), or a complicated pattern (the cloud configuration next week). If the answer to our query is itself complex we like to visualize it using graphs, bar-plots or even little movies. But one should keep in mind that the particular analysis depends on the task one has in mind. Let me spell out a few tasks that are typically considered in machine learning:

Prediction: Here we ask ourselves whether we can extrapolate the information in the data to new unseen cases. For instance, if I have a database of attributes of Hummers such as weight, color, number of people it can hold etc. and another database of attributes of Ferraris, then one can try to predict the type of car (Hummer or Ferrari) from a new set of attributes. Another example is predicting the weather (given all the recorded weather patterns in the past, can we predict the weather next week), or the stock prices.

Interpretation: Here we seek to answer questions about the data. For instance, what property of this drug was responsible for its high success-rate? Does a security officer at the airport apply racial profiling in deciding whose luggage to check? How many natural groups are there in the data?

Compression: Here we are interested in compressing the original data, i.e. reducing the number of bits needed to represent it. For instance, files in your computer can be "zipped" to a much smaller size by removing much of the redundancy in those files. Also, JPEG and GIF (among others) are compressed representations of the original pixel-map.

All of the above objectives depend on the fact that there is structure in the data. If data is completely random there is nothing to predict, nothing to interpret and nothing to compress. Hence, all tasks are somehow related to discovering or leveraging this structure. One could say that data is highly redundant and that this redundancy is exactly what makes it interesting. Take the example of natural images. If you are required to predict the color of the pixels neighboring some random pixel in an image, you would be able to do a pretty good job (for instance 20% may be blue sky, and predicting the neighbors of a blue-sky pixel is easy). Also, if we would generate images at random they would not look like natural scenes at all. For one, they wouldn't contain objects. Only a tiny fraction of all possible images looks "natural", and so the space of natural images is highly structured.

Thus, all of these concepts are intimately related: structure, redundancy, predictability, regularity, interpretability, compressibility. They refer to the "food" for machine learning; without structure there is nothing to learn. The same thing is true for human learning. From the day we are born we start noticing that there is structure in this world. Our survival depends on discovering and recording this structure. If I walk into this brown cylinder with a green canopy I suddenly stop, it won't give way. In fact, it damages my body. Perhaps this holds for all these objects. When I cry my mother suddenly appears. Our game is to predict the future accurately, and we predict it by learning its structure.
1.1 Data Representation

What does "data" look like? In other words, what do we download into our computer? Data comes in many shapes and forms; for instance it could be words from a document or pixels from an image. But it will be useful to convert data into a standard format so that the algorithms that we will discuss can be applied to it. Most datasets can be represented as a matrix, X = [X_{in}], with rows indexed by "attribute-index" i and columns indexed by "data-index" n. The value X_{in} for attribute i and data-case n can be binary, real, discrete etc., depending on what we measure. For instance, if we measure weight and color of 100 cars, the matrix X is 2 x 100 dimensional, and X_{1,20} = 20,684.57 is the weight of car nr. 20 in some units (a real value) while X_{2,20} = 2 is the color of car nr. 20 (say one of 6 predefined colors).

Most datasets can be cast in this form (but not all). For documents, we can give each distinct word of a prespecified vocabulary a nr. and simply count how often a word was present. Say the word "book" is defined to have nr. 10,568 in the vocabulary; then X_{10568,5076} = 4 would mean: the word "book" appeared 4 times in document 5076. Sometimes the different data-cases do not have the same number of attributes. Consider searching the internet for images about rats. You'll retrieve a large variety of images, most with a different number of pixels. We can either try to rescale the images to a common size or we can simply leave those entries in the matrix empty. It may also occur that a certain entry is supposed to be there but it couldn't be measured. For instance, if we run an optical character recognition system on a scanned document, some letters will not be recognized. We'll use a question mark "?" to indicate that that entry wasn't observed.

It is very important to realize that there are many ways to represent data and not all are equally suitable for analysis. By this I mean that in some representation the structure may be obvious, while in another representation it may become totally obscure. It is still there, but just harder to find. The algorithms that we will discuss are based on certain assumptions, such as "Hummers and Ferraris can be separated by a line, see figure ??". While this may be true if we measure weight in kilograms and height in meters, it is no longer true if we decide to recode these numbers into bit-strings. The structure is still in the data, but we would need a much more complex assumption to discover it. A lesson to be learned is thus to spend some time thinking about in which representation the structure is as obvious as possible, and to transform the data if necessary before applying standard algorithms. In the next section we'll discuss some standard preprocessing operations. It is often advisable to visualize the data before preprocessing and analyzing it. This will often tell you if the structure is a good match for the algorithm you had in mind for further analysis. Chapter ?? will discuss some elementary visualization techniques.
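To make the matrix representation above concrete, here is a minimal sketch in Python (not from the book); the tiny vocabulary and documents are made up purely for illustration.

    import numpy as np

    # Build the attribute-by-data-case matrix X for a word-count representation:
    # rows are indexed by attribute (word) i, columns by data-case (document) n.
    vocabulary = {"book": 0, "car": 1, "learning": 2}   # word -> attribute index i
    documents = ["the book on machine learning",
                 "a car is a car",
                 "learning about the book"]

    D, N = len(vocabulary), len(documents)
    X = np.zeros((D, N))
    for n, doc in enumerate(documents):
        for word in doc.split():
            if word in vocabulary:
                X[vocabulary[word], n] += 1   # count occurrences of word i in doc n

    print(X)   # X[i, n] = number of times word i appeared in document n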
1.2 Preprocessing the Data

As mentioned in the previous section, algorithms are based on assumptions and can become more effective if we transform the data first. Consider the following example, depicted in figure ??a. The algorithm we consider consists of estimating the "area" that the data occupy. It grows a circle starting at the origin, and at the point where it contains all the data we record the area of the circle. The figure shows why this will be a bad estimate: the data-cloud is not centered. If we would have first centered it, we would have obtained a reasonable estimate. Although this example is somewhat simple-minded, there are many, much more interesting algorithms that assume centered data. To center data we will introduce the sample mean of the data, given by,

    E[X]_i = \frac{1}{N} \sum_{n=1}^N X_{in}    (1.1)

Hence, for every attribute i separately, we simply add all the attribute values across data-cases and divide by the total number of data-cases. To transform the data so that their sample mean is zero, we set,

    X'_{in} = X_{in} - E[X]_i,  ∀n    (1.2)

It is now easy to check that the sample mean of X' indeed vanishes. An illustration of the global shift is given in figure ??b. We also see in this figure that the algorithm described above now works much better!

In a similar spirit as centering, we may also wish to scale the data along the coordinate axes in order to make it more "spherical". Consider figure ??a,b. In this case the data was first centered, but the elongated shape still prevented us from using the simplistic algorithm to estimate the area covered by the data. The solution is to scale the axes so that the spread is the same in every dimension. To define this operation we first introduce the notion of sample variance,

    V[X]_i = \frac{1}{N} \sum_{n=1}^N X_{in}^2    (1.3)

where we have assumed that the data was first centered. Note that this is similar to the sample mean, but now we have used the square. It is important that we have removed the sign of the data-cases (by taking the square) because otherwise positive and negative signs might cancel each other out. By first taking the square, all data-cases first get mapped to the positive half of the axes (for each dimension or
attribute separately) and then added and divided by N. You have perhaps noticed that variance does not have the same units as X itself. If X is measured in grams, then variance is measured in grams squared. So to scale the data to have the same scale in every dimension, we divide by the square-root of the variance, which is usually called the sample standard deviation,

    X''_{in} = \frac{X'_{in}}{\sqrt{V[X']_i}},  ∀n    (1.4)

Note again that sphering requires centering, implying that we always have to perform these operations in this order: first center, then sphere. Figures ??a,b,c illustrate this process.

You may now be asking, "well, what if the data were elongated in a diagonal direction?". Indeed, we can also deal with such a case by first centering, then rotating such that the elongated direction points in the direction of one of the axes, and then scaling. This requires quite a bit more math, and we will postpone this issue until chapter ?? on "principal components analysis". However, the question is in fact a very deep one, because one could argue that one could keep changing the data using more and more sophisticated transformations until all the structure was removed from the data and there would be nothing left to analyze! It is indeed true that the preprocessing steps can be viewed as part of the modeling process, in that they identify structure (and then remove it). By remembering the sequence of transformations you performed, you have implicitly built a model. Reversely, many algorithms can easily be adapted to model the mean and scale of the data. The preprocessing is then no longer necessary and becomes integrated into the model.

Just as preprocessing can be viewed as building a model, we can use a model to transform structured data into (more) unstructured data. The details of this process will be left for later chapters, but a good example is provided by compression algorithms. Compression algorithms are based on models for the redundancy in data (e.g. text, images). The compression consists in removing this redundancy and transforming the original data into a less structured or less redundant (and hence more succinct) code. Models and structure-reducing data transformations are in a sense each other's reverse: we often associate with a model an understanding of how the data was generated, starting from random noise. Reversely, preprocessing starts with the data and understands how we can get back to the unstructured random state of the data [FIGURE].

Finally, I will mention one more popular data-transformation technique. Many algorithms are based on the assumption that data is sort of symmetric around the origin. If data happens to be just positive, it doesn't fit this assumption very well. Taking the following logarithm can help in that case,

    X'_{in} = \log(a + X_{in})    (1.5)
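The preprocessing operations of this section are easy to state in code. Below is a minimal sketch in Python (not from the book) of centering (Eqn. 1.2), sphering (Eqn. 1.4) and the log-transform (Eqn. 1.5); the toy data matrix is made up, stored as a D x N array as in section 1.1.

    import numpy as np

    def center(X):
        # Eqns. 1.1/1.2: subtract the sample mean of each attribute i.
        return X - X.mean(axis=1, keepdims=True)

    def sphere(Xc):
        # Eqns. 1.3/1.4: divide each (already centered!) attribute by its
        # sample standard deviation, so the spread is equal in every dimension.
        return Xc / np.sqrt((Xc ** 2).mean(axis=1, keepdims=True))

    def log_transform(X, a=1.0):
        # Eqn. 1.5: makes strictly positive data more symmetric around the origin.
        return np.log(a + X)

    X = np.random.rand(2, 100) * 5 + 10             # off-center, elongated toy data
    X2 = sphere(center(X))                          # first center, then sphere
    print(X2.mean(axis=1), (X2 ** 2).mean(axis=1))  # approximately zeros and ones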
Chapter 2
Data Visualization

The process of data analysis does not just consist of picking an algorithm, fitting it to the data and reporting the results. We have seen that we need to choose a representation for the data, necessitating data-preprocessing in many cases. Depending on the data representation and the task at hand we then have to choose an algorithm to continue our analysis. But even after we have run the algorithm and studied the results we are interested in, we may realize that our initial choice of algorithm or representation may not have been optimal. We may therefore decide to try another representation/algorithm, compare the results and perhaps combine them. This is an iterative process.

What may help us in deciding the representation and algorithm for further analysis? Consider the two datasets in Figure ??. In the left figure we see that the data naturally forms clusters, while in the right figure we observe that the data is approximately distributed on a line. The left figure suggests a clustering approach while the right figure suggests a dimensionality reduction approach. This illustrates the importance of looking at the data before you start your analysis, instead of (literally) blindly picking an algorithm. After your first peek, you may decide to transform the data and then look again to see if the transformed data better suits the assumptions of the algorithm you have in mind.

"Looking at the data" sounds easier than it really is. The reason is that we are not equipped to think in more than 3 dimensions, while most data lives in much higher dimensions. For instance, image patches of size 10 x 10 live in a 100-dimensional pixel space. How are we going to visualize it? There are many answers to this problem, but most involve projection: we determine a number of, say, 2 or 3 dimensional subspaces onto which we project the data. The simplest choice of subspaces are the ones aligned with the features, e.g. we can plot X_{1n} versus X_{2n} etc. An example of such a scatter plot is given in Figure ??. Note that we have a total of d(d-1)/2 possible two-dimensional projections, which amounts to 4950 projections for 100-dimensional data. This is usually too many to manually inspect.

How do we cut down on the number of dimensions? Perhaps random projections may work? Unfortunately, that turns out to be not a great idea in many cases. The reason is that data projected on a random subspace often looks distributed according to what is known as a Gaussian distribution (see Figure ??). The deeper reason behind this phenomenon is the central limit theorem, which states that the sum of a large number of independent random variables is (under certain conditions) distributed as a Gaussian distribution. Hence, if we denote with w a vector in R^d and by x the d-dimensional random variable, then y = w^T x is the value of the projection. This is clearly a weighted sum of the random variables x_i, i = 1..d. If we assume that the x_i are approximately independent, then we can see that their sum will be governed by this central limit theorem. Analogously, a dataset {X_{in}} can thus be visualized in one dimension by "histogramming"[1] the values of Y = w^T X, see Figure ??. In this figure we clearly recognize the characteristic "Bell-shape" of the Gaussian distribution of projected and histogrammed data.

[1] A histogram is a bar-plot where the height of the bar represents the number of items that had a value located in the interval on the x-axis on which the bar stands (i.e. the base of the bar). If many items have a value around zero, then the bar centered at zero will be very high.

In one sense the central limit theorem is a rather helpful quirk of nature. Many variables follow Gaussian distributions, and the Gaussian distribution is one of the few distributions which have very nice analytic properties. Unfortunately, the Gaussian distribution is also the most uninformative distribution. This notion of "uninformative" can actually be made very precise using information theory and states: given a fixed mean and variance, the Gaussian density represents the least amount of information among all densities with the same mean and variance. This is rather unfortunate for our purposes, because Gaussian projections are the least revealing dimensions to look at. So in general we have to work a bit harder to see interesting structure. A large number of algorithms has been devised to search for informative projections.
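The Gaussian character of random projections is easy to reproduce. The following small sketch in Python (not from the book) draws made-up data that is uniform in every dimension, projects it on a random direction w and prints a crude text histogram of Y = w^T X, which comes out "Bell-shaped" as described above.

    import numpy as np

    d, N = 100, 10000
    X = np.random.rand(d, N)              # made-up data, uniform per dimension
    w = np.random.randn(d)
    w /= np.linalg.norm(w)                # random unit-length projection direction
    Y = w @ X                             # y_n = w^T x_n, one value per data-case

    counts, edges = np.histogram(Y, bins=30)
    for c, e in zip(counts, edges[:-1]):  # crude text histogram of the projections
        print(f"{e:6.2f} {'#' * (60 * c // counts.max())}")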
The simplest is "principal components analysis", or PCA for short ??. Here, interesting means dimensions of high variance. However, it was recognized that high variance is not always a good measure of interestingness, and one should rather search for dimensions that are non-Gaussian. For instance, "independent components analysis" (ICA) ?? and "projection pursuit" ?? search for dimensions that have heavy tails relative to Gaussian distributions. Another criterion is to find projections onto which the data has multiple modes. A more recent approach is to project the data onto a potentially curved manifold ??.

Scatter plots are of course not the only way to visualize data. It's a creative exercise, and anything that helps enhance your understanding of the data is allowed in this game. To illustrate I will give a few examples from a ...

Chapter 3
Learning

This chapter is without question the most important one of the book. It concerns the core, almost philosophical question of what learning really is (and what it is not). If you want to remember one thing from this book, you will find it here in this chapter.

Ok, let's start with an example. Alice has a rather strange ailment. She is not able to recognize objects by their visual appearance. At her home she is doing just fine: her mother explained to Alice, for every object in her house, what it is and how you use it. When she is home, she recognizes these objects (if they have not been moved too much), but when she enters a new environment she is lost. For example, if she enters a new meeting room she needs a long time to infer what the chairs and the table are in the room. She has been diagnosed with a severe case of "overfitting". What is the matter with Alice? Nothing is wrong with her memory, because she remembers the objects once she has seen them. In fact, she has a fantastic memory. She remembers every detail of the objects she has seen. And every time she sees a new object she reasons that the object in front of her is surely not a chair, because it doesn't have all the features she has seen in earlier chairs. The problem is that Alice cannot generalize the information she has observed from one instance of a visual object category to other, yet unobserved members of the same category. The fact that Alice's disease is so rare is understandable: there must have been a strong selection pressure against this disease. Imagine our ancestors walking through the savanna one million years ago. A lion appears on the scene. Ancestral Alice has seen lions before, but not this particular one, and it does not induce a fear response. Of course, she has no time to logically infer the possibility that this animal may be dangerous. Alice's contemporaries noticed that the animal was yellow-brown, had manes etc. and immediately
understood that this was a lion. They understood that all lions have these particular characteristics in common, but may differ in some other ones (like the presence of a scar someplace).

Bob has another disease, which is called over-generalization. Once he has seen an object he believes almost everything is some, perhaps twisted, instance of the same object class. (In fact, I seem to suffer from this now and then, when I think all of machine learning can be explained by one new exciting principle.) If ancestral Bob walks the savanna and he has just encountered an instance of a lion and fled into a tree with his buddies, the next time he sees a squirrel he believes it is a small instance of a dangerous lion and flees into the trees again. Over-generalization seems to be rather common among small children.

One of the main conclusions from this discussion is that we should neither over-generalize nor overfit. We need to be on the edge of being just right. But just right about what? It doesn't seem there is one correct God-given definition of the category chairs. We seem to all agree, but one can surely find examples that would be difficult to classify. When do we generalize exactly right? The magic word is PREDICTION. From an evolutionary standpoint, all we have to do is make correct predictions about aspects of life that help us survive. Nobody really cares about the definition of lion, but we do care about our responses to the various animals (run away for lion, chase for deer). And there are a lot of things that can be predicted in the world. This food kills me but that food is good for me. Drumming my fists on my hairy chest in front of a female generates opportunities for sex; sticking my hand into that yellow-orange flickering "flame" hurts my hand; and so on. The world is wonderfully predictable and we are very good at predicting it.

So why do we care about object categories in the first place? Well, apparently they help us organize the world and make accurate predictions. The category lion is an abstraction, and abstractions help us to generalize. In a certain sense, learning is all about finding useful abstractions or concepts that describe the world. Take the concept "fluid": it describes all watery substances and summarizes some of their physical properties. Or the concept of "weight": an abstraction that describes a certain property of objects.

Here is one very important corollary for you: "machine learning is not in the business of remembering and regurgitating observed information, it is in the business of transferring (generalizing) properties from observed data onto new, yet unobserved data". This is the mantra of machine learning that you should repeat to yourself every night before you go to bed (at least until the final exam).

The information we receive from the world has two components to it: there
is the part of the information which does not carry over to the future, the unpredictable information. We call this "noise". And then there is the information that is predictable, the learnable part of the information stream. The task of any learning algorithm is to separate the predictable part from the unpredictable part.

Now imagine Bob wants to send an image to Alice. He has to pay 1 cent for every bit that he sends. If the image were completely white it would be really stupid of Bob to send the message: "pixel 1: white, pixel 2: white, pixel 3: white, ...". He could just have sent the message "all pixels are white!". The blank image is completely predictable but carries very little information. Now imagine an image that consists of white noise (your television screen if the cable is not connected). To send the exact image Bob will have to send "pixel 1: white, pixel 2: black, pixel 3: black, ...". Bob cannot do better, because there is no predictable information in that image, i.e. there is no structure to be modeled. You can imagine playing a game of revealing one pixel at a time to someone and paying him $1 for every next pixel he predicts correctly. For the white image you can do perfectly; for the noisy picture you would be randomly guessing. Real pictures are in between: some pixels are very hard to predict, while others are easier. To compress the image, Bob can extract rules such as: always predict the same color as the majority of the pixels next to you, except when there is an edge. These rules constitute the model for the regularities of the image. Instead of sending the entire image pixel by pixel, Bob will now first send his rules and ask Alice to apply the rules. Every time a rule fails, Bob also sends a correction: "pixel 103: white, pixel 245: black". A few rules and two corrections is obviously cheaper than 256 pixel values and no rules.

There is one fundamental tradeoff hidden in this game. Since Bob is sending only a single image, it does not pay to send an incredibly complicated model that would require more bits to explain than simply sending all pixel values. If he would be sending 1 billion images, it would pay off to first send the complicated model because he would be saving a fraction of all bits for every image. On the other hand, if Bob wants to send 2 pixels, there really is no need to send a model whatsoever. Therefore: the size of Bob's model depends on the amount of data he wants to transmit. Ironically, the boundary between what is model and what is noise depends on how much data we are dealing with! If we use a model that is too complex we overfit to the data at hand, i.e. part of the model represents noise. On the other hand, if we use a too simple model we "underfit" (over-generalize) and valuable structure remains unmodeled. Both lead to sub-optimal compression of the image. But both also lead to suboptimal prediction on new images. The compression game can therefore be used to find the right size of model complexity for a given dataset. And so we have discovered a deep connection between learning and compression.
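Bob's game is simple enough to simulate. Here is a toy sketch in Python (not from the book): a made-up 256-pixel binary "image", a one-line prediction rule ("predict the same color as the previous pixel"), and the corrections Bob would have to send where the rule fails.

    import numpy as np

    image = np.array([1] * 100 + [0] * 100 + [1] * 56)   # 256 pixels, mostly flat

    # Rule: predict pixel t to equal pixel t-1 (and predict white for pixel 0).
    predictions = np.concatenate(([1], image[:-1]))
    corrections = np.flatnonzero(predictions != image)   # where the rule fails

    cost_raw = image.size               # sending every pixel: 256 values
    cost_model = len(corrections) + 1   # the corrections plus the (tiny) rule
    print(f"corrections at pixels {corrections.tolist()}: "
          f"cost {cost_model} vs {cost_raw}")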
Now let's think for a moment what we really mean with "a model". A model represents our prior knowledge of the world. It imposes structure that is not necessarily present in the data. We call this the "inductive bias". Our inductive bias often comes in the form of a parametrized model. That is to say, we define a family of models but let the data determine which of these models is most appropriate. A strong inductive bias means that we don't leave much flexibility in the model for the data to work on. We are so convinced of ourselves that we basically ignore the data. The downside is that we may be creating a "bad bias" towards the wrong model. On the other hand, if we are correct, we can learn the remaining degrees of freedom in our model from very few data-cases.

Conversely, we may leave the door open for a huge family of possible models. If we now let the data zoom in on the model that best explains the training data, it will overfit to the peculiarities of that data. Now imagine you sampled 10 datasets of the same size N and trained these very flexible models separately on each of these datasets (note that in reality you only have access to one such dataset, but please play along in this thought experiment). Let's say we want to determine the value of some parameter θ. Because the models are so flexible, we can actually model the idiosyncrasies of each dataset. The result is that the value for θ is likely to be very different for each dataset. But because we didn't impose much inductive bias, the average of many such estimates will be about right. We say that the bias is small, but the variance is high. In the case of very restrictive models the opposite happens: the bias is potentially large but the variance small. Note that not only is a large bias bad (for obvious reasons), a large variance is bad as well: because we only have one dataset of size N, our estimate could be very far off simply because we were unlucky with the dataset we were given.

What we should therefore strive for is to inject all our prior knowledge into the learning problem (this makes learning easier) but avoid injecting the wrong prior knowledge. If we don't trust our prior knowledge we should let the data speak. However, letting the data speak too much might lead to overfitting, so we need to find the boundary between too complex and too simple a model and get its complexity just right. Access to more data means that the data can speak more relative to prior knowledge. That, in a nutshell, is what machine learning is all about.

3.1 In a Nutshell

Learning is all about generalizing regularities in the training data to new, yet unobserved data. It is not about remembering the training data. Good generalization means that you need to balance prior knowledge with information from data. Depending on the dataset size, you can entertain more or less complex models. The correct size of model can be determined by playing a compression game. Learning = generalization = abstraction = compression.
Chapter 4
Types of Machine Learning

We will now turn our attention to some learning problems that we will encounter in this book. The most well studied problem in ML is that of supervised learning. To explain this, let's first look at an example. Bob wants to learn how to distinguish between bobcats and mountain lions. He types these words into Google Image Search and closely studies all cat-like images of bobcats on the one hand and mountain lions on the other. Some months later, on a hiking trip in the San Bernardino mountains, he sees a big cat...

The data that Bob collected was labelled, because Google is supposed to only return pictures of bobcats when you search for the word "bobcat" (and similarly for mountain lions). Let's call the images X_1, .., X_n and the labels Y_1, ..., Y_n. Note that the X_i are much higher dimensional objects, because they represent all the information extracted from the image (approximately 1 million pixel color values), while Y_i is simply -1 or 1 depending on how we choose to label our classes. So, that would be a ratio of about 1 million to 1 in terms of information content! The classification problem can usually be posed as finding (a.k.a. learning) a function f(x) that approximates the correct class labels for any input x. For instance, we may decide that sign[f(x)] is the predictor for our class label. In the following we will be studying quite a few of these classification algorithms.
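As a concrete illustration of this setup, here is a minimal sketch in Python (not from the book): a made-up two-class dataset, a linear choice f(x) = w^T x + b, and an error-driven update of w (a preview of the perceptron of Chapter 7) so that sign[f(x)] predicts the labels.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)),    # class -1
                   rng.normal(+2, 1, (50, 2))])   # class +1
    Y = np.array([-1] * 50 + [+1] * 50)           # labels Y_n in {-1, +1}

    w, b = np.zeros(2), 0.0
    for _ in range(20):                           # a few passes over the data
        for x, y in zip(X, Y):
            if np.sign(x @ w + b) != y:           # misclassified: nudge f towards y
                w += y * x
                b += y

    print(np.mean(np.sign(X @ w + b) == Y))       # fraction classified correctly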
There is also a different family of learning problems known as unsupervised learning problems. In this case there are no labels Y involved, just the features X. Our task is not to classify, but to organize the data, or to discover the structure in the data. This may be very useful for visualizing data, compressing data, or organizing data for easy accessibility. Extracting structure in data often leads to the discovery of concepts, topics, abstractions, factors, causes, and more such terms that all really mean the same thing. These are the underlying semantic factors that can explain the data. Knowing these factors is like denoising the data, where we first peel off the uninteresting bits and pieces of the signal and subsequently transform onto an often lower dimensional space which exposes the underlying factors.

There are two dominant classes of unsupervised learning algorithms. Clustering-based algorithms assume that the data organizes into groups. Finding these groups is then the task of the ML algorithm, and the identity of the group is the semantic factor. Another class of algorithms strives to project the data onto a lower dimensional space. This mapping can be nonlinear, but the underlying assumption is that the data is approximately distributed on some (possibly curved) lower dimensional manifold embedded in the input space. Unrolling that manifold is then the task of the learning algorithm. In this case the dimensions should be interpreted as semantic factors.

There are many variations on the above themes. For instance, one is often confronted with a situation where you have access to many more unlabeled data (only X_i) and many fewer labeled instances (both (X_i, Y_i)). Take the task of classifying news articles by topic (weather, sports, national news, international etc.). Some people may have labeled some news articles by hand, but there won't be all that many of those. However, we do have a very large digital library of scanned newspapers available. Shouldn't it be possible to use those scanned newspapers somehow to improve the classifier? Imagine that the data naturally clusters into well separated groups (for instance because news articles reporting on different topics use very different words). This is depicted in Figure ??. Note that there are only very few cases which have labels attached to them. From this figure it becomes clear that the expected optimal decision boundary nicely separates these clusters. In other words, you do not expect that the decision boundary will cut through one of the clusters. Yet that is exactly what would happen if you would only be using the labeled data. Hence, by simply requiring that decision boundaries do not cut through regions of high probability, we can improve our classifier. The subfield that studies how to improve classification algorithms using unlabeled data goes under the name "semi-supervised learning".

A fourth major class of learning algorithms deals with problems where the supervised signal consists only of rewards (or costs) that are possibly delayed. Consider for example a mouse that needs to solve a labyrinth in order to obtain his food. While making his decisions he will not receive any feedback (apart from perhaps slowly getting more hungry). It's only at the end, when he reaches the cheese, that he receives his positive feedback, and he will have to use this to reinforce his perhaps random earlier decisions that led him to the cheese. These problems fall under the name "reinforcement learning". It is a very general setup, in which almost all known cases of machine learning can be cast, but this generality also means that these types of problems can be very difficult. The most general RL problems do not even assume that you know what the world looks like (i.e. the maze for the mouse), so you have to simultaneously learn a model of the world and solve your task in it. This dual task induces interesting trade-offs: should you invest time now to learn machine learning and reap the benefit later in terms of a high salary working for Yahoo!, or should you stop investing now and start exploiting what you have learned so far? This is clearly a function of age, or the time horizon that you still have to take advantage of these investments. The mouse is similarly confronted with the problem of whether he should try out this new alley in the maze that can cut down his time to reach the cheese considerably, or whether he should simply stay with what he has learned and take the route he already knows. This clearly depends on how often he thinks he will have to run through the same maze in the future. We call this the exploration versus exploitation trade-off. The reason that RL is a very exciting field of research is because of its biological relevance. Do we not also have to figure out how the world works and survive in it?

Let's go back to the news articles. Assume we have control over which article we will label next. Which one would we pick? Surely the one that would be most informative in some suitably defined sense. Or the mouse in the maze: given that he decides to explore, where does he explore? Surely he will try to seek out alleys that look promising, i.e. alleys that he expects to maximize his reward. We call the problem of finding the next best data-case to investigate "active learning".

One may also be faced with learning multiple tasks at the same time. These tasks are related but not identical. For instance, consider the problem of recommending movies to customers of Netflix. Each person is different and would really require a separate model to make the recommendations. However, people also share commonalities, especially when people show evidence of being of the same "type" (for example a sci-fi fan or a comedy fan). We can learn personalized models but share features between them. Especially for new customers, where we don't have access to many movies that were rated by the customer, we need to "draw statistical strength" from customers who seem to be similar. From this example it has hopefully become clear that we are trying to learn models for many different yet related problems, and that we can build better models if we share some of the things learned for one task with the other ones. The trick is not to share too much nor too little, and how much we should share depends on how much data and prior knowledge we have access to for each task. We call this subfield of machine learning "multi-task learning".
4.1 In a Nutshell

There are many types of learning problems within machine learning. Supervised learning deals with predicting class labels from attributes, unsupervised learning tries to discover interesting structure in data, semi-supervised learning uses both labeled and unlabeled data to improve predictive performance, reinforcement learning can handle simple feedback in the form of delayed reward, active learning optimizes the next sample to include in the learning algorithm, and multi-task learning deals with sharing common model components between related learning tasks.

Chapter 5
Nearest Neighbors Classification

Perhaps the simplest algorithm to perform classification is the "k nearest neighbors (kNN) classifier". As usual, we assume that we have data of the form {X_{in}, Y_n}, where X_{in} is the value of attribute i for data-case n and Y_n is the label for data-case n. We also need a measure of similarity between data-cases, which we will denote with K(X_n, X_m), where larger values of K denote more similar data-cases. Given these preliminaries, classification is embarrassingly simple: when you are provided with the attributes X_t for a new (unseen) test-case, you first find the k most similar data-cases in the dataset by computing K(X_t, X_n) for all n. Call this set S. Then, each of these k most similar neighbors in S can cast a vote on the label of the test case, where each neighbor predicts that the test case has the same label as itself. Assuming binary labels and an odd number of neighbors, this will always result in a decision.

Although kNN algorithms are often associated with this simple voting scheme, more sophisticated ways of combining the information of these neighbors are allowed. For instance, one could weigh each vote by the similarity to the test-case. This results in the following decision rule,

    Y_t = 1  if  \sum_{X_n \in S} K(X_t, X_n)(2Y_n - 1) > 0    (5.1)
    Y_t = 0  if  \sum_{X_n \in S} K(X_t, X_n)(2Y_n - 1) < 0    (5.2)

and flipping a coin if it is exactly 0.
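In code the weighted voting rule looks as follows; a minimal sketch in Python (not from the book), where the similarity K is taken to be a Gaussian of the distance. That is just one possible choice: the text leaves K unspecified.

    import numpy as np

    def knn_classify(Xtest, X, Y, k=5):
        # X: N x D training attributes, Y: N binary labels, Xtest: one test-case.
        K = np.exp(-np.sum((X - Xtest) ** 2, axis=1))   # similarity to every case
        S = np.argsort(-K)[:k]                          # the k most similar cases
        vote = np.sum(K[S] * (2 * Y[S] - 1))            # weighted vote, Eqns. 5.1/5.2
        if vote == 0:                                   # exactly zero: flip a coin
            return np.random.randint(2)
        return int(vote > 0)

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
    Y = np.array([0] * 30 + [1] * 30)
    print(knn_classify(np.array([2.5, 2.5]), X, Y))     # lands in the second cloud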
Why do we expect this algorithm to work, intuitively? The reason is that we expect data-cases with similar labels to cluster together in attribute space. So to figure out the label of a test-case, we simply look around and see what labels our neighbors have. Asking your closest neighbor is like betting all your money on a single piece of advice, and you might get really unlucky if your closest neighbor happens to be an odd-one-out. It's typically better to ask several opinions before making your decision. However, if you ask around too much, you will be forced to ask advice from data-cases that are no longer very similar to you. So there is some optimal number of neighbors to ask, which may be different for every problem. Determining this optimal number of neighbors is not easy, but we can again use cross-validation (section ??) to estimate it.

So what is good and bad about kNN? First, its simplicity makes it attractive. Very few assumptions about the data are used in the classification process. This property can also be a disadvantage: if you have prior knowledge about how the data was generated, it's better to use it, because less information has to be extracted from the data. A second consideration is computation time and memory efficiency. Assume you have a very large dataset, but you need to make decisions very quickly. As an example, consider surfing the web-pages of Amazon.com. Whenever you search for a book, it likes to suggest 10 others. To do that it could classify books into categories and suggest the top ranked in that category. kNN requires Amazon to store all features of all books at a location that is accessible for fast computation. Moreover, to classify, kNN has to do the neighborhood search every time again. Clearly, there are tricks that can be played with smart indexing, but wouldn't it be much easier if we would have summarized all books by a simple classification function f_θ(X) that "spits out" a class for any combination of features X?

This distinction, between algorithms/models that summarize the data through a set of parameters and those that require memorizing every data-item, is often called "parametric" versus "non-parametric". It's important to realize that this is somewhat of a misnomer: non-parametric models can have parameters (such as the number of neighbors to consider). The key distinction is rather whether the data is summarized through a set of parameters which together comprise a classification function f_θ(X), or whether we retain all the data to do the classification "on the fly".

kNN is also known to suffer from the "curse of high dimensions". If we use many features to describe our data, and in particular when most of these features turn out to be irrelevant and noisy for the classification, then kNN is quickly confused. Imagine that there are two features that contain all the information necessary for a perfect classification, but that we have added 98 noisy, uninformative features. The neighbors in the two-dimensional space of the relevant features are unfortunately no longer likely to be the neighbors in the 100-dimensional space, because 98 noisy dimensions have been added. This effect is detrimental to the kNN algorithm. Once again, it is very important to choose your initial representation with much care and preprocess the data before you apply the algorithm. In this case, preprocessing takes the form of "feature selection", on which a whole book in itself could be written.

5.1 The Idea In a Nutshell

To classify a new data-item you first look for the k nearest neighbors in feature space and assign it the same label as the majority of these neighbors.
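The curse of high dimensions described in this chapter is easy to demonstrate numerically. A small sketch in Python (not from the book), assuming Euclidean distance: two informative features are padded with 98 made-up noise features, and we check how often a data-case keeps the same nearest neighbor.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 200
    relevant = rng.normal(0, 1, (N, 2))       # 2 informative features
    noise = rng.normal(0, 1, (N, 98))         # 98 irrelevant, noisy features
    X2, X100 = relevant, np.hstack([relevant, noise])

    def nearest(X, i):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf                         # exclude the data-case itself
        return np.argmin(d)

    agree = np.mean([nearest(X2, i) == nearest(X100, i) for i in range(N)])
    print(f"nearest neighbors agree for {agree:.0%} of the data-cases")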
Chapter 6
The Naive Bayesian Classifier

In this chapter we will discuss the "Naive Bayes" (NB) classifier. It has proven to be very useful in many applications, both in science and in industry. In the introduction I promised I would try to avoid the use of probabilities as much as possible. However, in this chapter I'll make an exception, because the NB classifier is most naturally explained with the use of probabilities. Fortunately, we will only need the most basic concepts.

6.1 The Naive Bayes Model

NB is mostly used when dealing with discrete-valued attributes. We will explain the algorithm in this context, but note that extensions to continuous-valued attributes are possible. We will restrict attention to classification problems between two classes and refer to section ?? for approaches to extend this to more than two classes.

In our usual notation we consider D discrete-valued attributes X_i ∈ [0, .., V_i], i = 1..D. Note that each attribute can have a different number of values V_i. If the original data was supplied in a different format, e.g. X_1 = [Yes, No], then we simply reassign these values to fit the above format, Yes = 1, No = 0 (or reversed). In addition we are also provided with a supervised signal, in this case the labels Y = 0 and Y = 1 indicating that a data-item fell in class 0 or class 1. Again, which class is assigned to 0 or 1 is arbitrary and has no impact on the performance of the algorithm.

Before we move on, let's consider a real world example: spam-filtering. Every day your mailbox gets bombarded with hundreds of spam emails. To give an example of the traffic that it generates: the University of California Irvine receives on the order of 2 million spam emails a day. Fortunately, the bulk of these emails (approximately 97%) is filtered out or dumped into your spam-box and will never reach your attention. How is this done? Well, it turns out to be a classic example of a classification problem: spam or ham, that's the question. Let's say that spam will receive a label 1 and ham a label 0. Our task is thus to label each new email with either 0 or 1.

What are the attributes? Rephrasing this question: what would you measure in an email to see if it is spam? Certainly, if I would read "viagra" in the subject I would stop right there and dump it in the spam-box. What else? Here are a few: "enlargement, cheap, buy, pharmacy, money, loan, mortgage, credit" and so on. We can build a dictionary of words that we can detect in each email. This dictionary could also include word phrases such as "buy now", "penis enlargement"; one can make phrases as sophisticated as necessary. One could measure whether the words or phrases appear at least once, or one could count the actual number of times they appear. Spammers know about the way these spam filters work and counteract by slight misspellings of certain keywords. Hence we might also want to detect common misspellings of words like "viagra" and so on. In fact, a small arms race has ensued where spam filters and spam generators find new tricks to counteract the tricks of the "opponent". Putting all these subtleties aside for a moment, we'll simply assume that we measure a number of these attributes for every email in a dataset. We'll also assume that we have spam/ham labels for these emails, which were acquired by someone removing spam emails by hand from his/her inbox. Our task is then to train a predictor for spam/ham labels for future emails, where we have access to attributes but not to labels.

The NB model is what we call a "generative" model. This means that we imagine how the data was generated in an abstract sense. For emails this works as follows: an imaginary entity first decides how many spam and ham emails it will generate on a daily basis. Say, it decides to generate 40% spam and 60% ham. We will assume this doesn't change with time (of course it does, but we will make this simplifying assumption for now). It will then decide what the chance is that a certain word appears k times in a spam email. For example, the word "viagra" has a chance of 96% to not appear at all, 1% to appear once, 0.9% to appear twice etc. These probabilities are clearly different for spam
and ham; "viagra" should have a much smaller probability to appear in a ham email (but it could of course; consider that I send this text to my publisher by email). Given these probabilities, we could then go on and try to generate emails that actually look like real emails, i.e. with proper sentences, but we won't need that in the following. Instead, we make the simplifying assumption that an email consists of "a bag of words", in random order.

6.2 Learning a Naive Bayes Classifier

Given a dataset {X_{in}, Y_n}, i = 1..D, n = 1..N, we wish to estimate what these probabilities are. To start with the simplest one: what would be a good estimate for the percentage of spam versus ham emails that our imaginary entity uses to generate emails? Well, we can simply count how many spam and ham emails we have in our data. This is given by,

    P(spam) = (# spam emails) / (total # emails) = \frac{\sum_n I[Y_n = 1]}{N}    (6.1)

Here we mean with I[A = a] a function that is only equal to 1 if its argument is satisfied, and zero otherwise. Hence, in the equation above it counts the number of instances for which Y_n = 1. Since the remainder of the emails must be ham, we also find that

    P(ham) = 1 - P(spam) = (# ham emails) / (total # emails) = \frac{\sum_n I[Y_n = 0]}{N}    (6.2)

where we have used that P(ham) + P(spam) = 1, since an email is either ham or spam.

Next, we need to estimate how often we expect to see a certain word or phrase in either a spam or a ham email. In our example we could for instance ask ourselves what the probability is that we find the word "viagra" k times, with k = 0, 1, > 1, in a spam email. Let's recode this as X_viagra = 0 meaning that we didn't observe "viagra", X_viagra = 1 meaning that we observed it once, and X_viagra = 2 meaning that we observed it more than once. The answer is again that we can count how often these events happened in our data, and use that as an estimate for the real probabilities according to which it generated emails. First for spam we find,

    P_spam(X_i = j) = (# spam emails for which the word i was found j times) / (total # of spam emails)    (6.3)
                    = \frac{\sum_n I[X_{in} = j ∧ Y_n = 1]}{\sum_n I[Y_n = 1]}    (6.4)

Here we have defined the symbol ∧ to mean that both statements to the left and right of this symbol should hold true in order for the entire statement to be true. For ham emails, we compute exactly the same quantity,

    P_ham(X_i = j) = (# ham emails for which the word i was found j times) / (total # of ham emails)    (6.5)
                   = \frac{\sum_n I[X_{in} = j ∧ Y_n = 0]}{\sum_n I[Y_n = 0]}    (6.6)

Both these quantities should be computed for all words or phrases (or, more generally, attributes). We have now finished the phase where we estimate the model from the data. We will often refer to this phase as "learning" or training a model. The model helps us understand how data was generated in some approximate setting. The next phase is that of prediction or classification of new email.
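The counting estimates of Eqns. 6.1 to 6.6 translate directly into code. A minimal sketch in Python (not from the book); the D x N matrix X of discrete attribute values, the labels Y and the value counts V are all made up for illustration.

    import numpy as np

    def learn_naive_bayes(X, Y, V):
        # V[i] is the number of possible values of attribute i.
        p_spam = np.mean(Y == 1)                   # Eqn. 6.1
        cond = {}                                  # cond[(c, i)][j] = P_c(X_i = j)
        for c in (0, 1):                           # c = 1: spam, c = 0: ham
            Nc = np.sum(Y == c)
            for i in range(X.shape[0]):
                counts = np.array([np.sum((X[i] == j) & (Y == c))
                                   for j in range(V[i])])
                cond[(c, i)] = counts / Nc         # Eqns. 6.3-6.6
        return p_spam, cond

    X = np.array([[0, 1, 2, 0, 0, 1],    # counts (0, 1, >1) of e.g. "viagra"
                  [1, 1, 0, 0, 1, 0]])   # presence/absence of e.g. "mortgage"
    Y = np.array([1, 1, 1, 0, 0, 0])     # labels: 1 = spam, 0 = ham
    p_spam, cond = learn_naive_bayes(X, Y, V=[3, 2])
    print(p_spam, cond[(1, 0)])          # P(spam) and P_spam(X_viagra = j)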
6.3 Class-Prediction for New Instances

New email does not come with a label ham or spam (if it did, we could throw spam in the spam-box right away). What we do see are the attributes {X_i}. Our task is to guess the label based on the model and the measured attributes. The approach we take is simple: calculate whether the email has a higher probability of being generated from the spam or the ham model. For example, because the word "viagra" has a tiny probability of being generated under the ham model, it will end up with a higher probability under the spam model. But clearly, all words have a say in this process. It's like a large committee of experts, one for each word. Each member casts a vote and can say things like: "I am 99% certain it's spam", or "It's almost definitely not spam (0.1% spam)". Each of these opinions will be multiplied together to generate a final score. We then figure out whether ham or spam has the highest score.

There is one little practical caveat with this approach, namely that the product of a large number of probabilities, each of which is necessarily smaller than one, very quickly gets so small that your computer can't handle it. There is an easy fix though. Instead of multiplying probabilities as scores, we use the logarithms of those probabilities and add the logarithms. This is numerically stable and leads to the same conclusion, because if a > b then we also have that log(a) > log(b) and vice versa. In equations, we compute the score as follows:

    S_spam = \sum_i \log P_spam(X_i = v_i) + \log P(spam)    (6.7)

where with v_i we mean the value for attribute i that we observe in the email under consideration, i.e. if the email contains no mention of the word "viagra" we set v_viagra = 0. The first term in Eqn. 6.7 adds all the log-probabilities under the spam model of observing the particular value of each attribute. Every time a word is observed that has high probability under the spam model, and hence has often been observed in the spam data, the score is boosted. The last term adds an extra factor to the score that expresses our prior belief of receiving a spam email instead of a ham email. We compute a similar score for ham, namely,

    S_ham = \sum_i \log P_ham(X_i = v_i) + \log P(ham)    (6.8)

and compare the two scores. Clearly, a large score for spam relative to ham provides evidence that the email is indeed spam.
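Continuing the learn_naive_bayes() sketch above, the log-score comparison of Eqns. 6.7/6.8 becomes the following (again not from the book). A tiny probability floor keeps log(0) from occurring here; the proper fix, smoothing, is the subject of the next section.

    import numpy as np

    def classify(x, p_spam, cond, floor=1e-10):
        # x[i] = v_i, the observed value of attribute i in the new email.
        S = {}
        for c, prior in ((1, p_spam), (0, 1.0 - p_spam)):
            S[c] = np.log(prior) + sum(np.log(max(cond[(c, i)][v], floor))
                                       for i, v in enumerate(x))
        return int(S[1] > S[0])              # 1 = spam, 0 = ham

    print(classify([2, 1], p_spam, cond))    # "viagra" more than once: likely spam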
In reality, one type of error could have more serious consequences than another. For instance, a spam email making it into my inbox is not too bad, but an important email that ends up in my spam-box (which I never check) may have serious consequences. To account for this we introduce a general threshold $\theta$ and use the following decision rule,

\[ Y = 1 \quad \text{if}\ \ S_1 > S_0 + \theta \tag{6.9} \]
\[ Y = 0 \quad \text{if}\ \ S_1 \le S_0 + \theta \tag{6.10} \]

By increasing $\theta$ we make it harder for an email to be classified as spam, so that fewer important ham emails end up in the spam-box, at the expense of more spam making it into the inbox.

6.4 Regularization

Consider what happens when some word, say “Nigeria”, never occurred in any spam email of our dataset. Its estimated probability under the spam model is then exactly zero, so a single occurrence of that word in a new email yields $\log P_{\text{spam}}(X_i > 0) = -\infty$, while the final score is a sum of these individual word-scores. To counteract this phenomenon, we give each word in the dictionary a small probability of being present in any email (spam or ham) before seeing the data. This process is called smoothing. The impact on the estimated probabilities is given below,

\[ P_{\text{spam}}(X_i = j) = \frac{\alpha + \sum_n I[X_{in} = j \wedge Y_n = 1]}{V_i\,\alpha + \sum_n I[Y_n = 1]} \tag{6.12} \]
\[ P_{\text{ham}}(X_i = j) = \frac{\alpha + \sum_n I[X_{in} = j \wedge Y_n = 0]}{V_i\,\alpha + \sum_n I[Y_n = 0]} \tag{6.13} \]

where $V_i$ is the number of possible values of attribute $i$. Thus, $\alpha$ can be interpreted as a small, possibly fractional number of “pseudo-observations” of the attribute in question. It’s like adding these observations to the actual dataset. What value for $\alpha$ do we use? Fitting its value on the dataset will not work, because the reason we added it was exactly that we assumed there was too little data in the first place (we hadn’t received one of those annoying “Nigeria” emails yet), and so this relates to the phenomenon of overfitting. However, we can use the trick described in section ?? where we split the data into two pieces. We learn a model on one chunk and adjust $\alpha$ such that performance on the other chunk is optimal. We play this game multiple times with different splits and average the results.
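A minimal sketch of the smoothed estimates of Eqs. (6.12)-(6.13), continuing the earlier toy encoding (each attribute has $V_i = 3$ possible values); the value of $\alpha$ is an arbitrary illustration.

```python
import numpy as np

# Smoothed estimates, Eqs. (6.12)-(6.13): alpha pseudo-counts for each of the
# V_i possible attribute values. X holds clipped word counts, Y the labels.
X = np.array([[2, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 2]])
Y = np.array([1, 1, 0, 0])

def smoothed_conditionals(X, mask, alpha=0.5, n_values=3):
    counts = np.stack([(X[mask] == j).sum(axis=0) for j in range(n_values)], axis=1)
    return (alpha + counts) / (n_values * alpha + mask.sum())

p_spam_ij = smoothed_conditionals(X, Y == 1)
p_ham_ij = smoothed_conditionals(X, Y == 0)   # every entry is now strictly positive
```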
6.5 Remarks

One of the main limitations of the NB classifier is that it assumes independence between attributes (this is presumably the reason why we call it the naive Bayesian classifier). This is reflected in the fact that each classifier has an independent vote in the final score. However, imagine that I measure the words “home” and “mortgage”. Observing “mortgage” certainly raises the probability of observing “home”. We say that they are positively correlated. It would therefore be fairer to attribute a smaller weight to “home” if we already observed “mortgage”, because they convey the same thing: this email is about mortgages for your home. One way to obtain a fairer voting scheme is to model these dependencies explicitly. However, this comes at a computational cost (a longer time before you receive your email in your inbox) which may not always be worth the additional accuracy. One should also note that more parameters do not necessarily improve accuracy, because too many parameters may lead to overfitting.

6.6 The Idea In a Nutshell

Consider Figure ??. We can classify data by building a model of how the data was generated. For NB we first decide whether we will generate a data-item from class $Y=0$ or class $Y=1$. Given that decision, we generate the values for the $D$ attributes independently. Each class has a different model for generating attributes. Classification is achieved by computing which model was more likely to generate the new data-point, biasing the outcome towards the class that is expected to generate more data.

Chapter 7
The Perceptron

We will now describe one of the simplest parametric classifiers: the perceptron, and its cousin the logistic regression classifier. However, despite its simplicity it should not be underestimated! It is the workhorse for most companies involved with some form of machine learning (perhaps tying with the decision tree classifier). One could say that it represents the canonical parametric approach to classification, where we believe that a straight line is sufficient to separate the two classes of interest. An example of this is given in Figure ??, where the assumption that the two classes can be separated by a line is clearly valid. However, this assumption need not always be true. Looking at Figure ?? we clearly observe that there is no straight line that will do the job for us. What can we do?

Our first inclination is probably to try and fit a more complicated separation boundary. However, there is another trick that we will be using often in this book. Instead, we can increase the dimensionality of the space by “measuring” more things about the data. Call $\phi_k(X)$ feature $k$ that was measured from the data. The features can be highly nonlinear functions. The simplest choice may be to also measure $\phi_k(X) = X_k^2$ for each attribute $X_k$. But we may also measure cross-products such as $\phi_{ij}(X) = X_i X_j$ for all pairs $i, j$. The latter allows you to explicitly model correlations between attributes. For example, if $X_i$ represents the presence (1) or absence (0) of the word “viagra”, and similarly $X_j$ for the word “dysfunction”, then the cross-product feature $X_i X_j$ lets you model the presence of both words simultaneously (which should be helpful in trying to find out what this document is about). We can add as many features as we like, adding another dimension for every new feature. In this higher dimensional space we can be more confident in assuming that the data can be separated by a line.
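Before moving on, here is a small sketch of this feature construction; the helper name expand_features and the toy matrix are our assumptions for the example.

```python
import numpy as np

# Expand raw attributes with squares and pairwise cross-products, as described
# above: phi contains [X_k, X_k^2, X_i * X_j for i < j].
def expand_features(X):
    n, d = X.shape
    squares = X ** 2
    crosses = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    return np.column_stack([X, squares] + crosses)

X = np.array([[1.0, 0.0], [0.5, 2.0]])
phi = expand_features(X)   # shape (2, 2 + 2 + 1)
```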
I like to warn the reader at this point that more features are not necessarily a good thing if the new features are uninformative for the classification task at hand. The problem is that they introduce noise in the input that can mask the actual signal (i.e. the good, discriminative features). In fact, there is a whole subfield of ML that deals with selecting relevant features from a set that is too large. The problem of too many dimensions is sometimes called “the curse of high dimensionality”. Another way of seeing this is that more dimensions often lead to more parameters in the model (as in the case of the perceptron) and can hence lead to overfitting. To combat that, in turn, we can add regularizers, as we will see in the following. With the introduction of regularizers, we can sometimes play magic and use an infinite number of features. How we play this magic will be explained when we discuss kernel methods in the next sections. But let us first start simple, with the perceptron.

7.1 The Perceptron Model

Our assumption is that a line can separate the two classes of interest. To make our life a little easier we will switch to the $Y = \{+1, -1\}$ representation. With this, the condition can be expressed mathematically as¹

\[ Y_n \approx \text{sign}\Big(\sum_k w_k X_{kn} - \alpha\Big) \tag{7.1} \]

where “sign” is the sign-function (+1 for nonnegative reals and −1 for negative reals). We have introduced $K+1$ parameters $\{w_1, .., w_K, \alpha\}$ which define the line for us. The vector $w$ represents the direction orthogonal to the decision boundary depicted in Figure ??. For example, a line through the origin is represented by $w^T x = 0$, i.e. all vectors $x$ with a vanishing inner product with $w$. The scalar quantity $\alpha$ represents the offset of the line $w^T x = 0$ from the origin, i.e. the shortest distance from the origin to the line. This can be seen by writing the points on the line as $x = y + v$, where $y$ is a fixed vector pointing to an arbitrary point on the line and $v$ is the vector on the line starting at $y$ (see Figure ??). Hence, $w^T(y + v) - \alpha = 0$. Since by definition $w^T v = 0$, we find $w^T y = \alpha$, which means that $\alpha$ is the projection of $y$ onto $w$, which is the shortest distance from the origin to the line.

¹ Note that we can replace $X_k \to \phi_k(X)$, but for the sake of simplicity we will refrain from doing so at this point.
We would like to estimate these parameters from the data (which we will do in a minute), but it is important to notice that the number of parameters is fixed in advance. In some sense, we believe so much in our assumption that the data is linearly separable that we stick to it irrespective of how many data-cases we will encounter. This fixed capacity of the model is typical for parametric methods, but perhaps a little unrealistic for real data. A more reasonable assumption is that the decision boundary may become more complex as we see more data. Too few data-cases simply do not provide the resolution (evidence) necessary to see more complex structure in the decision boundary. Recall that non-parametric methods, such as the “nearest-neighbors” classifier, actually do have this desirable feature. Nevertheless, the linear separability assumption comes with some computational advantages as well, such as very fast class prediction on new test data. I believe that this computational convenience may be at the root of its popularity. By the way, when we take the limit of an infinite number of features, we will have happily returned to the land of “non-parametrics”, but we have to exercise a little patience before we get there.

Now let’s write down a cost function that we wish to minimize in order for our linear decision boundary to become a good classifier. Clearly, we would like to control performance on future, yet unseen test data. However, this is a little hard (since we don’t have access to this data by definition). As a surrogate we will simply fit the line parameters on the training data. It cannot be stressed enough that this is dangerous in principle, due to the phenomenon of overfitting (see section ??). If we have introduced very many features and no form of regularization, then we have many parameters to fit. When this capacity is too large relative to the number of data-cases at our disposal, we will be fitting the idiosyncrasies of this particular dataset, and these will not carry over to the future test data. So, one should split off a subset of the training data and reserve it for monitoring performance (one should not use this set in the training procedure). Cycling through multiple splits and averaging the result was the cross-validation procedure discussed in section ??. If we do not use too many features relative to the number of data-cases, the model class is very limited and overfitting is not an issue. (In fact, one may want to worry more about “underfitting” in this case.)

OK, so now that we agree on writing down a cost on the training data, we need to choose an explicit expression. Consider the following choice:

\[ C(w, \alpha) = \frac{1}{2n}\sum_{i=1}^{n} (Y_i - w^T X_i + \alpha)^2 \tag{7.2} \]
where we have rewritten $w^T X_n = \sum_k w_k X_{kn}$. If we minimize this cost, then $w^T X_n - \alpha$ tends to be positive when $Y_n = +1$ and negative when $Y_n = -1$. This is what we want! Once optimized, we can then easily use the optimal parameters to perform prediction on new test data $X_{\text{test}}$ as follows:

\[ \tilde Y_{\text{test}} = \text{sign}\Big(\sum_k w_k^* X_{k,\text{test}} - \alpha^*\Big) \tag{7.3} \]

where $\tilde Y$ is used to indicate the predicted value for $Y$. So far so good, but how do we obtain our values for $\{w^*, \alpha^*\}$? The simplest approach is to compute the gradient and slowly descend on the cost function (see appendix ?? for background). In this case, the gradients are simple:

\[ \nabla_w C(w, \alpha) = -\frac{1}{n}\sum_{i=1}^{n}(Y_i - w^T X_i + \alpha)\, X_i = -X(Y - X^T w + \alpha) \tag{7.4} \]
\[ \nabla_\alpha C(w, \alpha) = \frac{1}{n}\sum_{i=1}^{n}(Y_i - w^T X_i + \alpha) = (Y - X^T w + \alpha) \tag{7.5} \]

where in the latter matrix expressions we have used the convention that $X$ is the matrix with elements $X_{kn}$. Our gradient descent is now simply given as

\[ w_{t+1} = w_t - \eta\, \nabla_w C(w_t, \alpha_t) \tag{7.6} \]
\[ \alpha_{t+1} = \alpha_t - \eta\, \nabla_\alpha C(w_t, \alpha_t) \tag{7.7} \]

Iterating these equations until convergence will minimize the cost function. One may criticize plain vanilla gradient descent for many reasons. For example, you need to carefully choose the stepsize $\eta$, or risk either excruciatingly slow convergence or exploding values of the iterates $w_t, \alpha_t$. Even if convergence is achieved asymptotically, it is typically slow. Using a Newton-Raphson method will improve convergence properties considerably, but it is also very expensive. Many methods have been developed to improve the optimization of the cost function, but that is not the focus of this book. However, I do want to mention a very popular approach to optimization on very large datasets, known as “stochastic gradient descent”. The idea is to select a single data-item randomly and perform an update of the parameters based on that:

\[ w_{t+1} = w_t + \eta\,(Y_n - w^T X_n + \alpha)\, X_n \tag{7.8} \]
\[ \alpha_{t+1} = \alpha_t - \eta\,(Y_n - w^T X_n + \alpha) \tag{7.9} \]

The fact that we are picking data-cases randomly injects noise in the updates, so even close to convergence we are “wiggling around” the solution. If we decrease the stepsize, however, the wiggles get smaller. So a sensible strategy would be to slowly decrease the stepsize and wiggle our way to the solution. This stochastic gradient descent is actually very efficient in practice if we can find a good annealing schedule for the stepsize. Why really? It seems that if we use more data-cases in a mini-batch to perform a parameter update, we should be able to make larger steps in parameter space by using bigger stepsizes. While this reasoning holds close to the solution, it does not far away from the solution. The intuitive reason is that far away from convergence every data point will tell you the same story: move in direction $X$ to improve your model. You simply do not need to query many data points in order to extract that information. So for a bad model there is a lot of redundancy in the information that data-cases can convey about improving the parameters, and querying a few is sufficient. Closer to convergence you need to either use more data or decrease the stepsize to increase the resolution of your gradients. This type of reasoning clearly makes an effort to include the computational budget as part of the overall objective. This is what we have argued in chapter XX is the distinguishing feature of machine learning. If you are not convinced about how important this is in the face of modern day datasets, imagine the following. Company C organizes a contest where they provide a virtually infinite dataset for some prediction task. You can earn 1 million dollars if you make accurate predictions on some test set by Friday next week. You can choose between a single parameter update based on all the data, or many updates on small subsets of the data. Who do you think will win the contest?
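As a concrete sketch of the updates (7.8)-(7.9), here is a small stochastic gradient descent loop; the toy data, stepsize and number of sweeps are illustrative assumptions.

```python
import numpy as np

# Stochastic gradient descent for the perceptron cost, Eqs. (7.8)-(7.9).
# Labels use the {+1, -1} representation; rows of X are data-cases.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 2)) + 2.0, rng.normal(size=(20, 2)) - 2.0])
Y = np.hstack([np.ones(20), -np.ones(20)])

def train_sgd(X, Y, eta=0.05, sweeps=50):
    w, alpha = np.zeros(X.shape[1]), 0.0
    for _ in range(sweeps):
        for i in rng.permutation(len(Y)):
            r = Y[i] - X[i] @ w + alpha   # residual of data-case i
            w += eta * r * X[i]           # Eq. (7.8)
            alpha -= eta * r              # Eq. (7.9)
    return w, alpha

w, alpha = train_sgd(X, Y)
Y_pred = np.sign(X @ w - alpha)           # Eq. (7.3)
```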
7.2 A Different Cost Function: Logistic Regression

The cost function of Eq. 7.2 penalizes gross violations of one’s predictions rather severely (quadratically). This is sometimes counter-productive, because the algorithm might get obsessed with improving the performance on one single data-case at the expense of all the others. The real cost simply counts the number of mis-labelled instances, irrespective of how badly off your prediction function $w^T X_n + \alpha$ was. So, a different cost function is often used,

\[ C(w, \alpha) = -\frac{1}{n}\sum_{i=1}^{n} Y_i \tanh(w^T X_i + \alpha) \tag{7.10} \]

The function $\tanh(\cdot)$ is plotted in Figure ??. It shows that the cost can never be larger than 2, which ensures robustness against outliers. We leave it to the reader to derive the gradients and formulate the gradient descent algorithm.

7.3 The Idea In a Nutshell

Figure ?? tells the story. One assumes that the data can be separated by a line. Any line can be represented by $w^T x = \alpha$. Data-cases from one class satisfy $w^T X_n \le \alpha$, while data-cases from the other class satisfy $w^T X_n \ge \alpha$. To achieve that, you write down a cost function that penalizes data-cases falling on the wrong side of the line, and minimize it over $\{w, \alpha\}$. For a test case you simply compute the sign of $w^T X_{\text{test}} - \alpha$ to make a prediction as to which class it belongs.

Chapter 8
Support Vector Machines

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form $\{x_i, y_i\},\ i = 1, ..., n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. We call $\{x_i\}$ the covariates or input vectors and $\{y_i\}$ the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e. I can draw a straight line $f(x) = w^T x - b$ such that all cases with $y_i = -1$ fall on one side and have $f(x_i) < 0$, and cases with $y_i = +1$ fall on the other side and have $f(x_i) > 0$. Given that we have achieved that, we could classify new test cases according to the rule $y_{\text{test}} = \text{sign}(f(x_{\text{test}}))$.

However, typically there are infinitely many such hyper-planes, obtained by small perturbations of a given solution. How do we choose between all these hyper-planes, which all solve the separation problem for our training data, but may have different performance on the newly arriving test cases? For instance, we could choose to put the line very close to members of one particular class, say $y = -1$. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified with $y = +1$, but we will very easily make mistakes on the cases with $y = -1$ (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible thing to do thus seems to be to choose the separating line as far away from both the $y = -1$ and the $y = +1$ training cases as we can, i.e. right in the middle.

Geometrically, the vector $w$ is directed orthogonal to the line defined by $w^T x = b$. This can be understood as follows. First take $b = 0$. Now it is clear that all vectors $x$ with a vanishing inner product with $w$ satisfy this equation, i.e. all vectors orthogonal to $w$ satisfy this equation. Now translate the hyperplane away from the origin over a vector $a$. The equation for the plane then becomes $(x - a)^T w = 0$,
i.e. we find that for the offset $b = a^T w$, which is the projection of $a$ onto the vector $w$. Without loss of generality we may thus choose $a$ perpendicular to the plane, in which case the length $\|a\| = |b|/\|w\|$ represents the shortest, orthogonal distance between the origin and the hyperplane.

We now define two more hyperplanes parallel to the separating hyperplane. They represent the planes that cut through the closest training examples on either side. We will call them “support hyper-planes” in the following, because the data-vectors they contain support the plane. We define the distances between the support hyperplanes and the separating hyperplane to be $d_+$ and $d_-$ respectively. The margin, $\gamma$, is defined to be $d_+ + d_-$. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes.

We can write the following equations for the support hyperplanes:

\[ w^T x = b + \delta \tag{8.1} \]
\[ w^T x = b - \delta \tag{8.2} \]

We now note that we have over-parameterized the problem: if we scale $w$, $b$ and $\delta$ by a constant factor $\alpha$, the equations for $x$ are still satisfied. To remove this ambiguity we will require that $\delta = 1$; this sets the scale of the problem, i.e. whether we measure distance in millimeters or meters. We can now also compute the value $d_+ = \big(\big||b+1| - |b|\big|\big)/\|w\| = 1/\|w\|$ (this is only true if $b \notin (-1, 0)$, since the origin doesn’t fall in between the hyper-planes in that case; if $b \in (-1, 0)$ you should use $d_+ = \big(\big||b+1| + |b|\big|\big)/\|w\| = 1/\|w\|$). Hence the margin is equal to twice that value: $\gamma = 2/\|w\|$.

With the above definition of the support planes we can write down the following constraints that any solution must satisfy,

\[ w^T x_i - b \le -1 \quad \forall\, y_i = -1 \tag{8.3} \]
\[ w^T x_i - b \ge +1 \quad \forall\, y_i = +1 \tag{8.4} \]

or, in one equation,

\[ y_i(w^T x_i - b) - 1 \ge 0 \tag{8.5} \]

We now formulate the primal problem of the SVM:

\[ \text{minimize}_{w,b}\ \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to}\ \ y_i(w^T x_i - b) - 1 \ge 0\ \ \forall i \tag{8.6} \]

Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyper-planes. The data-cases that lie on the hyperplanes are called support vectors, since they support the hyper-planes and hence determine the solution to the problem.

The primal problem can be solved by a quadratic program. However, it is not ready to be kernelised, because its dependence is not only on inner products between data-vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian,

\[ L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i - b) - 1 \right] \tag{8.7} \]

The solution that minimizes the primal problem subject to the constraints is given by $\min_w \max_\alpha L(w, \alpha)$, i.e. a saddle point problem. When the original objective function is convex (and only then), we can interchange the minimization and maximization. Doing that, we can find the condition on $w$ that must hold at the saddle point we are solving for. This is done by taking derivatives w.r.t. $w$ and $b$ and solving,

\[ w - \sum_i \alpha_i y_i x_i = 0 \ \Rightarrow\ w^* = \sum_i \alpha_i y_i x_i \tag{8.8} \]
\[ \sum_i \alpha_i y_i = 0 \tag{8.9} \]

Inserting this back into the Lagrangian we obtain what is known as the dual problem,

\[ \text{maximize}\ \ L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{ij}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \]
\[ \text{subject to}\ \ \sum_i \alpha_i y_i = 0 \tag{8.10} \]
\[ \alpha_i \ge 0\ \ \forall i \tag{8.11} \]

The dual formulation of the problem is also a quadratic program, but note that the number of variables $\alpha_i$ in this problem is equal to the number of data-cases, $N$. The crucial point is, however, that this problem only depends on $x_i$ through the inner products $x_i^T x_j$. This is readily kernelised through the substitution $x_i^T x_j \to k(x_i, x_j)$. This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not.
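To make the dual concrete, here is a small numerical sketch that solves Eqs. (8.10)-(8.11) with scipy's general-purpose SLSQP solver. The toy data and the solver choice are our assumptions; a serious implementation would use a dedicated QP solver (e.g. an SMO-style method).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                            # minimize the negative of L_D
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                              # Eq. (8.11)
               constraints={"type": "eq", "fun": lambda a: a @ y})  # Eq. (8.10)
alpha = res.x
w = (alpha * y) @ X                         # Eq. (8.8)
sv = np.argmax(alpha)                       # a support vector has alpha_i > 0
b = w @ X[sv] - y[sv]                       # from y_i (w^T x_i - b) = 1
```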
The theory of duality guarantees that, for convex problems, the dual problem will be concave, and moreover, that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have $L_P(w^*) = L_D(\alpha^*)$, i.e. the “duality gap” is zero.

Next we turn to the conditions that must necessarily hold at the saddle point, and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives w.r.t. $w$ to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important condition called “complementary slackness” needs to be satisfied,

\[ \partial_w L_P = 0 \ \to\ w - \sum_i \alpha_i y_i x_i = 0 \tag{8.12} \]
\[ \partial_b L_P = 0 \ \to\ \sum_i \alpha_i y_i = 0 \tag{8.13} \]
\[ \text{constraint-1:}\quad y_i(w^T x_i - b) - 1 \ge 0 \tag{8.14} \]
\[ \text{multiplier condition:}\quad \alpha_i \ge 0 \tag{8.15} \]
\[ \text{complementary slackness:}\quad \alpha_i\left[ y_i(w^T x_i - b) - 1 \right] = 0 \tag{8.16} \]

It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied but not saturated, $y_i(w^T x_i - b) - 1 > 0$, in which case $\alpha_i$ for that data-case must be zero, or the inequality constraint is saturated, $y_i(w^T x_i - b) - 1 = 0$, in which case $\alpha_i$ can take any value $\alpha_i \ge 0$. Inequality constraints which are saturated are said to be “active”, while unsaturated constraints are inactive.

One could imagine the process of searching for a solution as a ball which runs down the primal objective function using gradient descent. At some point, it will hit a wall, which is the constraint, and although the derivative is still pointing partially towards the wall, the constraint prohibits the ball from going on. This is an active constraint because the ball is glued to that wall. When a final solution is reached, we could remove some constraints without changing the solution: these are the inactive constraints. One could think of the term $\partial_w L_P$ as the force acting on the ball. We see from the first equation above that only the data-cases with $\alpha_i \ne 0$ exert a force on the ball that balances with the force from the curved quadratic surface $w$.

The training cases with $\alpha_i > 0$, representing active constraints on the position of the support hyperplane, are called support vectors. These are the vectors that are situated in the support hyperplane, and they determine the solution. Typically, there are only few of them, which people call a “sparse” solution (most $\alpha$'s vanish).

What we are really interested in is the function $f(\cdot)$ which can be used to classify future test cases,

\[ f(x) = w^{*T} x - b^* = \sum_i \alpha_i y_i\, x_i^T x - b^* \tag{8.17} \]

As an application of the KKT conditions we derive a solution for $b^*$ by using the complementary slackness condition,

\[ b^* = \Big(\sum_j \alpha_j y_j\, x_j^T x_i\Big) - y_i \qquad i\ \text{a support vector} \tag{8.18} \]

where we used $y_i^2 = 1$. So, using any support vector one can determine $b$, but for numerical stability it is better to average over all of them (although they should obviously be consistent).

The most important conclusion is again that this function $f(\cdot)$ can be expressed solely in terms of inner products $x_i^T x_j$, which we can replace with kernel matrices $k(x_i, x_j)$ to move to high dimensional non-linear spaces. Moreover, since $\alpha$ is typically very sparse, we don’t need to evaluate many kernel entries in order to predict the class of a new input $x$.

8.1 The Non-Separable Case

Obviously, not all datasets are linearly separable, and so we need to change the formalism to account for that. Clearly, the problem lies in the constraints, which cannot always be satisfied. So, let’s relax those constraints by introducing “slack variables” $\xi_i$,

\[ w^T x_i - b \le -1 + \xi_i \quad \forall\, y_i = -1 \tag{8.19} \]
\[ w^T x_i - b \ge +1 - \xi_i \quad \forall\, y_i = +1 \tag{8.20} \]
\[ \xi_i \ge 0 \quad \forall i \tag{8.21} \]

The variables $\xi_i$ allow for violations of the constraints. We should penalize the objective function for these violations, otherwise the above constraints become void (simply always pick $\xi_i$ very large).
Penalty functions of the form $C(\sum_i \xi_i)^k$ will lead to convex optimization problems for positive integers $k$. For $k = 1, 2$ it is still a quadratic program (QP). In the following we will choose $k = 1$. $C$ controls the tradeoff between the penalty and the margin.

To be on the wrong side of the separating hyperplane, a data-case would need $\xi_i > 1$. Hence, the sum $\sum_i \xi_i$ can be interpreted as a measure of how “bad” the violations are, and is an upper bound on the number of violations.

The new primal problem thus becomes

\[ \text{minimize}_{w,b,\xi}\ \ L_P = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to}\ \ y_i(w^T x_i - b) - 1 + \xi_i \ge 0\ \ \forall i \tag{8.22} \]
\[ \xi_i \ge 0\ \ \forall i \tag{8.23} \]

leading to the Lagrangian

\[ L(w, b, \xi, \alpha, \mu) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_{i=1}^{N}\alpha_i\left[ y_i(w^T x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{N}\mu_i \xi_i \tag{8.24} \]

from which we derive the KKT conditions,

1. $\partial_w L_P = 0 \to w - \sum_i \alpha_i y_i x_i = 0$ (8.25)
2. $\partial_b L_P = 0 \to \sum_i \alpha_i y_i = 0$ (8.26)
3. $\partial_\xi L_P = 0 \to C - \alpha_i - \mu_i = 0$ (8.27)
4. constraint-1: $y_i(w^T x_i - b) - 1 + \xi_i \ge 0$ (8.28)
5. constraint-2: $\xi_i \ge 0$ (8.29)
6. multiplier condition-1: $\alpha_i \ge 0$ (8.30)
7. multiplier condition-2: $\mu_i \ge 0$ (8.31)
8. complementary slackness-1: $\alpha_i\left[ y_i(w^T x_i - b) - 1 + \xi_i \right] = 0$ (8.32)
9. complementary slackness-2: $\mu_i \xi_i = 0$ (8.33)

From here we can deduce the following facts. If we assume that $\xi_i > 0$, then $\mu_i = 0$ (by 9), hence $\alpha_i = C$ (by 3), and thus $\xi_i = 1 - y_i(x_i^T w - b)$ (by 8). Also, when $\xi_i = 0$, we have $\mu_i > 0$ (by 9) and hence $\alpha_i < C$ (by 3); if moreover the constraint is saturated, $y_i(w^T x_i - b) - 1 = 0$, then $\alpha_i > 0$ (by 8). Otherwise, if $y_i(w^T x_i - b) - 1 > 0$, then $\alpha_i = 0$.

In summary, as before, for points not on the support plane and on the correct side we have $\xi_i = \alpha_i = 0$ (all constraints inactive). On the support plane, we still have $\xi_i = 0$, but now $\alpha_i > 0$. Finally, for data-cases on the wrong side of the support hyperplane, the $\alpha_i$ max out to $\alpha_i = C$ and the $\xi_i$ balance the violation of the constraint such that $y_i(w^T x_i - b) - 1 + \xi_i = 0$.

Geometrically, we can calculate the gap between the support hyperplane and the violating data-case to be $\xi_i/\|w\|$. This can be seen because the plane defined by $y_i(w^T x - b) - 1 + \xi_i = 0$ is parallel to the support plane, at a distance $|1 + y_i b - \xi_i|/\|w\|$ from the origin. Since the support plane is at a distance $|1 + y_i b|/\|w\|$, the result follows.

Finally, we need to convert to the dual problem to solve it efficiently and to kernelise it. Again, we use the KKT equations to get rid of $w$, $b$ and $\xi$,

\[ \text{maximize}\ \ L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{ij}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \quad \text{subject to}\ \ \sum_i \alpha_i y_i = 0 \tag{8.35} \]
\[ 0 \le \alpha_i \le C\ \ \forall i \tag{8.36} \]

Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers $\alpha_i$, which now live in a box. This constraint is derived from the fact that $\alpha_i = C - \mu_i$ and $\mu_i \ge 0$. We also note that it again only depends on the inner products $x_i^T x_j$, which are ready to be kernelised.
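In practice one rarely codes this box-constrained QP by hand. As a usage sketch, scikit-learn's SVC solves a dual of exactly this form; the choice of C and of the RBF kernel below are illustrative assumptions, not the text's.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5], [0.1, 0.2]])
y = np.array([1, 1, -1, -1, 1])

# C is the box constraint of Eq. (8.36); the kernel replaces x_i^T x_j.
clf = SVC(C=1.0, kernel="rbf").fit(X, y)
print(clf.support_)              # indices of the support vectors (alpha_i > 0)
print(clf.predict([[0.0, 1.0]]))
```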
Chapter 9
Support Vector Regression

In kernel ridge regression we have seen that the final solution was not sparse in the variables $\alpha$. We will now formulate a regression method that is sparse, i.e. it has the concept of support vectors that determine the solution.

The thing to notice is that the sparseness arose from complementary slackness conditions, which in turn came from the fact that we had inequality constraints. In the SVM the penalty that was paid for being on the wrong side of the support plane was given by $C\sum_i \xi_i^k$ for positive integers $k$, where $\xi_i$ is the orthogonal distance away from the support plane. Note that the term $\|w\|^2$ was there to penalize large $w$ and hence to regularize the solution. Importantly, there was no penalty if a data-case was on the right side of the plane. Because all those data-points do not have any effect on the final solution, the $\alpha$ was sparse.

Here we do the same thing: we introduce a penalty for being too far away from the predicted line $w^T\Phi_i + b$, but once you are close enough, i.e. in some “epsilon-tube” around this line, there is no penalty. We thus expect that all the data-cases which lie inside the tube will have no impact on the final solution, and hence have corresponding $\alpha_i = 0$.

Using the analogy of springs: in the case of ridge-regression the springs were attached between the data-cases and the decision surface, hence every item had an impact on the position of this boundary through the force it exerted (recall that the surface was made of “rubber” and pulled back, because it was parameterized using a finite number of degrees of freedom, or because it was regularized). For SVR there are only springs attached between data-cases outside the tube, and these attach to the tube, not the decision boundary. Hence, data-items inside the tube have no impact on the final solution (or rather, changing their position slightly doesn’t perturb the solution).

We introduce different constraints for violating the tube constraint from above and from below,

\[ \text{minimize}_{w,\xi,\hat\xi}\ \ \tfrac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i (\xi_i^2 + \hat\xi_i^2) \]
\[ \text{subject to}\ \ w^T\Phi_i + b - y_i \le \varepsilon + \xi_i\ \ \forall i, \qquad y_i - w^T\Phi_i - b \le \varepsilon + \hat\xi_i\ \ \forall i \tag{9.1} \]

The primal Lagrangian becomes

\[ L_P = \tfrac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i(\xi_i^2 + \hat\xi_i^2) + \sum_i \alpha_i\big(w^T\Phi_i + b - y_i - \varepsilon - \xi_i\big) + \sum_i \hat\alpha_i\big(y_i - w^T\Phi_i - b - \varepsilon - \hat\xi_i\big) \tag{9.2} \]

Remark I: We could have added the constraints that $\xi_i \ge 0$ and $\hat\xi_i \ge 0$. However, it is not hard to see that the final solution will satisfy that requirement automatically, and there is no sense in constraining the optimization to the optimal solution as well. To see this, imagine some $\xi_i$ is negative; then, by setting $\xi_i = 0$, the cost is lower and none of the constraints is violated, so it is preferred. We also note that due to the above reasoning we will always have at least one of $\xi_i, \hat\xi_i$ equal to zero: inside the tube both are zero, outside the tube one of them is zero. This means that at the solution we have $\xi_i\hat\xi_i = 0$.

Remark II: Note that we don’t scale $\varepsilon = 1$ like in the SVM case. The reason is that $\{y_i\}$ now determines the scale of the problem, i.e. we have not over-parameterized the problem.

We now take the derivatives w.r.t. $w$, $b$, $\xi$ and $\hat\xi$ to find the following KKT conditions (there are more, of course),

\[ w = \sum_i (\hat\alpha_i - \alpha_i)\Phi_i \tag{9.3} \]
\[ \xi_i = \alpha_i/C, \qquad \hat\xi_i = \hat\alpha_i/C \tag{9.4} \]

Plugging this back in, and using that we now also have $\alpha_i\hat\alpha_i = 0$, we find the dual problem,

\[ \text{maximize}_{\alpha,\hat\alpha}\ \ -\tfrac{1}{2}\sum_{ij}(\hat\alpha_i - \alpha_i)(\hat\alpha_j - \alpha_j)\Big(K_{ij} + \frac{1}{C}\delta_{ij}\Big) + \sum_i(\hat\alpha_i - \alpha_i)y_i - \sum_i(\hat\alpha_i + \alpha_i)\varepsilon \]
\[ \text{subject to}\ \ \sum_i(\hat\alpha_i - \alpha_i) = 0, \qquad \alpha_i \ge 0,\ \hat\alpha_i \ge 0\ \ \forall i \tag{9.5} \]
From the complementary slackness conditions we can read off the sparseness of the solution:

\[ \alpha_i\big(w^T\Phi_i + b - y_i - \varepsilon - \xi_i\big) = 0 \tag{9.6} \]
\[ \hat\alpha_i\big(y_i - w^T\Phi_i - b - \varepsilon - \hat\xi_i\big) = 0 \tag{9.7} \]
\[ \xi_i\hat\xi_i = 0, \qquad \alpha_i\hat\alpha_i = 0 \tag{9.8} \]

where we added the last conditions by hand (they don’t seem to directly follow from the formulation). Now we clearly see that if a case is above the tube, $\hat\xi_i$ will take on its smallest possible value in order to make the constraints satisfied: $\hat\xi_i = y_i - w^T\Phi_i - b - \varepsilon$. This implies that $\hat\alpha_i$ will take on a positive value, and the farther outside the tube, the larger the $\hat\alpha_i$ (you can think of it as a compensating force). Note that in this case $\alpha_i = 0$. A similar story goes if $\xi_i > 0$ and $\alpha_i > 0$. If a data-case is inside the tube, the $\alpha_i, \hat\alpha_i$ are necessarily zero, and hence we obtain sparseness.

We now change variables to make this optimization problem look more similar to the SVM and ridge-regression cases. Introduce $\beta_i = \hat\alpha_i - \alpha_i$ and use $\hat\alpha_i\alpha_i = 0$ to write $\hat\alpha_i + \alpha_i = |\beta_i|$,

\[ \text{maximize}_\beta\ \ -\tfrac{1}{2}\sum_{ij}\beta_i\beta_j\Big(K_{ij} + \frac{1}{C}\delta_{ij}\Big) + \sum_i \beta_i y_i - \sum_i |\beta_i|\varepsilon \quad \text{subject to}\ \ \sum_i \beta_i = 0 \tag{9.9} \]

where the constraint comes from the fact that we included a bias term $b$.¹ From the slackness conditions we can also find a value for $b$ (similar to the SVM case). Also, as usual, the prediction for a new data-case is given by

\[ y = w^T\Phi(x) + b = \sum_i \beta_i K(x_i, x) + b \tag{9.10} \]

It is an interesting exercise for the reader to work her way through the case where the penalty is linear instead of quadratic, i.e.

\[ \text{minimize}_{w,\xi,\hat\xi}\ \ \tfrac{1}{2}\|w\|^2 + C\sum_i(\xi_i + \hat\xi_i) \]
\[ \text{subject to}\ \ w^T\Phi_i + b - y_i \le \varepsilon + \xi_i\ \ \forall i, \qquad y_i - w^T\Phi_i - b \le \varepsilon + \hat\xi_i\ \ \forall i \tag{9.11} \]
\[ \xi_i \ge 0,\ \hat\xi_i \ge 0\ \ \forall i \tag{9.12} \]

leading to the dual problem

\[ \text{maximize}_\beta\ \ -\tfrac{1}{2}\sum_{ij}\beta_i\beta_j K_{ij} + \sum_i \beta_i y_i - \sum_i |\beta_i|\varepsilon \quad \text{subject to}\ \ \sum_i \beta_i = 0 \tag{9.13} \]
\[ -C \le \beta_i \le +C\ \ \forall i \tag{9.14} \]

where we note that the quadratic penalty on the size of $\beta$ is replaced by a box constraint, as is to be expected in switching from an L2 norm to an L1 norm.

Final remark: Let’s remind ourselves that the quadratic programs we have derived are convex optimization problems which have a unique optimal solution that can be found efficiently using numerical methods. This is often claimed as great progress w.r.t. the old neural network days, which were plagued by many local optima.

¹ Note, by the way, that we could not use the trick we used in ridge-regression of defining a constant feature $\phi_0 = 1$ and $b = w_0$. The reason is that the objective does not depend on $b$.

Chapter 10
Kernel Ridge Regression

Possibly the most elementary algorithm that can be kernelized is ridge regression. Here our task is to find a linear function that models the dependencies between covariates $\{x_i\}$ and response variables $\{y_i\}$, both continuous. The classical way to do that is to minimize the quadratic cost,

\[ C(w) = \tfrac{1}{2}\sum_i (y_i - w^T x_i)^2 \tag{10.1} \]

However, if we are going to work in feature space, where we replace $x_i \to \Phi(x_i)$, there is a clear danger that we overfit. Hence we need to regularize. This is an important topic that will return in future classes. A simple yet effective way to regularize is to penalize the norm of $w$. This is sometimes called “weight-decay”. It remains to be determined how to choose $\lambda$. The most common approach is to use cross validation or leave-one-out estimates. The total cost function hence becomes

\[ C = \tfrac{1}{2}\sum_i (y_i - w^T x_i)^2 + \tfrac{1}{2}\lambda\|w\|^2 \tag{10.2} \]

which needs to be minimized. Taking derivatives and equating them to zero gives

\[ \sum_i (y_i - w^T x_i)\, x_i = \lambda w \ \Rightarrow\ w = \Big(\lambda I + \sum_i x_i x_i^T\Big)^{-1}\Big(\sum_j y_j x_j\Big) \tag{10.3} \]

We see that the regularization term helps to stabilize the inverse numerically by bounding the smallest eigenvalues away from zero.
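The closed-form solution of Eq. (10.3) is two lines of numpy; the toy data and the value of $\lambda$ below are illustrative assumptions.

```python
import numpy as np

# Closed-form ridge regression, Eq. (10.3). Rows of X are the data-cases x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
lam = 0.1

# (lambda I + sum_i x_i x_i^T)^{-1} (sum_j y_j x_j), via a linear solve
w = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)
y_pred = X @ w
```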
10.1 Kernel Ridge Regression

We now replace all data-cases with their feature vectors: $x_i \to \Phi_i = \Phi(x_i)$. In this case the number of dimensions can be much higher, or even infinitely higher, than the number of data-cases. There is a neat trick that allows us to perform the inverse above in the smaller space of the two possibilities: either the dimension of the feature space, or the number of data-cases. The trick is given by the following identity,

\[ (P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1} \tag{10.4} \]

Now note that if $B$ is not square, the inverse is performed in spaces of different dimensionality. To apply this to our case we define $\Phi = \Phi_{ai}$ and $y = y_i$. The solution is then given by

\[ w = (\lambda I_d + \Phi\Phi^T)^{-1}\Phi y = \Phi(\Phi^T\Phi + \lambda I_n)^{-1}y \tag{10.5} \]

This equation can be rewritten as $w = \sum_i \alpha_i\Phi(x_i)$ with $\alpha = (\Phi^T\Phi + \lambda I_n)^{-1}y$. This is an equation that will be a recurrent theme, and it can be interpreted as: the solution $w$ must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

We finally need to show that we never actually need access to the feature vectors, which could be infinitely long (which would be rather impractical). What we need in practice is the predicted value for a new test point, $x$. This is computed by projecting it onto the solution $w$,

\[ y = w^T\Phi(x) = y^T(\Phi^T\Phi + \lambda I_n)^{-1}\Phi^T\Phi(x) = y^T(K + \lambda I_n)^{-1}\kappa(x) \tag{10.6} \]

where $K(x_i, x_j) = \Phi(x_i)^T\Phi(x_j)$ and $\kappa_i(x) = K(x_i, x)$. The important message here is of course that we only need access to the kernel $K$.

We can now add a bias to the whole story by adding one more, constant feature to $\Phi$: $\Phi_0 = 1$. The value of $w_0$ then represents the bias, since

\[ w^T\Phi = \sum_a w_a\Phi_{ai} + w_0 \tag{10.7} \]

Hence, the story goes through unchanged.

10.2 An Alternative Derivation

Instead of optimizing the cost function above, we can introduce Lagrange multipliers into the problem. This will have the effect that the derivation goes along similar lines as the SVM case. We introduce new variables $\xi_i = y_i - w^T\Phi_i$ and rewrite the objective as the following constrained QP,

\[ \text{minimize}_{w,\xi}\ \ L_P = \sum_i \xi_i^2 \quad \text{subject to}\ \ y_i - w^T\Phi_i = \xi_i\ \ \forall i \tag{10.8} \]
\[ \|w\| \le B \tag{10.9} \]

This leads to the Lagrangian,

\[ L_P = \sum_i \xi_i^2 + \sum_i \beta_i\big[y_i - w^T\Phi_i - \xi_i\big] + \lambda\big(\|w\|^2 - B^2\big) \tag{10.10} \]

Two of the KKT conditions tell us that at the solution we have

\[ 2\xi_i = \beta_i\ \ \forall i, \qquad 2\lambda w = \sum_i \beta_i\Phi_i \tag{10.11} \]

Plugging this back into the Lagrangian, we obtain the dual Lagrangian,

\[ L_D = \sum_i\Big(-\tfrac{1}{4}\beta_i^2 + \beta_i y_i\Big) - \frac{1}{4\lambda}\sum_{ij}\beta_i\beta_j K_{ij} - \lambda B^2 \tag{10.12} \]

We now redefine $\alpha_i = \beta_i/(2\lambda)$ to arrive at the following dual optimization problem,

\[ \text{maximize}_{\alpha,\lambda}\ \ -\lambda^2\sum_i \alpha_i^2 + 2\lambda\sum_i \alpha_i y_i - \lambda\sum_{ij}\alpha_i\alpha_j K_{ij} - \lambda B^2 \quad \text{s.t.}\ \ \lambda \ge 0 \tag{10.13} \]

Taking derivatives w.r.t. $\alpha$ gives precisely the solution we had already found,

\[ \alpha^* = (K + \lambda I)^{-1}y \tag{10.14} \]

Formally we also need to maximize over $\lambda$. However, different choices of $\lambda$ correspond to different choices for $B$. Either $\lambda$ or $B$ should be chosen using cross-validation or some other measure, so we may as well vary $\lambda$ in this process.

One big disadvantage of ridge-regression is that we don’t have sparseness in the $\alpha$ vector, i.e. there is no concept of support vectors. Sparseness is useful, because when we test a new example we only have to sum over the support vectors, which is much faster than summing over the entire training-set. In the SVM the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that the multiplier $\alpha_i$ was zero whenever the constraint was inactive. There is no such effect here.
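Here is a short sketch of kernel ridge regression as derived above, i.e. $\alpha = (K + \lambda I)^{-1}y$ and predictions $\sum_i \alpha_i K(x_i, x)$; the Gaussian kernel and the toy regression problem are illustrative choices.

```python
import numpy as np

# Kernel ridge regression, Eqs. (10.5)-(10.6).
def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 5)[:, None]
y_pred = gaussian_kernel(X_test, X) @ alpha            # kappa(x)^T alpha
```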
Chapter 11
Kernel K-means and Spectral Clustering

The objective in K-means can be written as follows:

\[ C(z, \mu) = \sum_i \|x_i - \mu_{z_i}\|^2 \tag{11.1} \]

where we wish to minimize over the assignment variables $z_i$ (which can take values $z_i = 1, .., K$) for all data-cases $i$, and over the cluster means $\mu_k,\ k = 1..K$. It is not hard to show that the following iterations achieve that,

\[ z_i = \arg\min_k \|x_i - \mu_k\|^2 \tag{11.2} \]
\[ \mu_k = \frac{1}{N_k}\sum_{i \in C_k} x_i \tag{11.3} \]

where $C_k$ is the set of data-cases assigned to cluster $k$.

Now, let’s assume we have defined many features $\phi(x_i)$ and wish to do clustering in feature space. The objective is similar to before,

\[ C(z, \mu) = \sum_i \|\phi(x_i) - \mu_{z_i}\|^2 \tag{11.4} \]

We will now introduce an $N \times K$ assignment matrix $Z_{nk}$, each row of which represents a data-case and contains exactly one 1, in column $k$, if the case is assigned to cluster $k$. As a result we have $\sum_k Z_{nk} = 1$ and $N_k = \sum_n Z_{nk}$. Also define $L = \text{diag}[1/\sum_n Z_{nk}] = \text{diag}[1/N_k]$. Finally define $\Phi_{in} = \phi_i(x_n)$. With these definitions you can now check that the matrix $M$ defined as

\[ M = \Phi Z L Z^T \tag{11.5} \]

consists of $N$ columns, one for each data-case, where each column contains a copy of the cluster mean $\mu_k$ to which that data-case is assigned. Using this we can write out the K-means cost as

\[ C = \text{tr}\big[(\Phi - M)(\Phi - M)^T\big] \tag{11.6} \]

Next we can show that $Z^T Z = L^{-1}$ (check this), and thus that $(Z L Z^T)^2 = Z L Z^T$. In other words, $Z L Z^T$ is a projection. Similarly, $I - Z L Z^T$ is a projection on the complement space. Using this, we simplify Eqn. 11.6 as

\[ C = \text{tr}\big[\Phi(I - ZLZ^T)^2\Phi^T\big] \tag{11.7} \]
\[ = \text{tr}\big[\Phi(I - ZLZ^T)\Phi^T\big] \tag{11.8} \]
\[ = \text{tr}\big[\Phi\Phi^T\big] - \text{tr}\big[\Phi Z L Z^T\Phi^T\big] \tag{11.9} \]
\[ = \text{tr}[K] - \text{tr}\big[L^{\frac{1}{2}}Z^T K Z L^{\frac{1}{2}}\big] \tag{11.10} \]

where we used that $\text{tr}[AB] = \text{tr}[BA]$, and $L^{\frac{1}{2}}$ is defined by taking the square root of the diagonal elements. Note that only the second term depends on the clustering matrix $Z$, so we can now formulate the following equivalent kernel clustering problem,

\[ \max_Z\ \text{tr}\big[L^{\frac{1}{2}}Z^T K Z L^{\frac{1}{2}}\big] \tag{11.11} \]
\[ \text{such that } Z \text{ is a binary clustering matrix.} \tag{11.12} \]

This objective is entirely specified in terms of kernels, and so we have once again managed to move to the “dual” representation. Note also that this problem is very difficult to solve, due to the constraints, which force us to search over binary matrices.

Our next step will be to approximate this problem through a relaxation of this constraint. First we recall that $Z^T Z = L^{-1} \Rightarrow L^{\frac{1}{2}}Z^T Z L^{\frac{1}{2}} = I$. Renaming $H = Z L^{\frac{1}{2}}$, with $H$ an $N \times K$ dimensional matrix, we can formulate the following relaxation of the problem,

\[ \max_H\ \text{tr}\big[H^T K H\big] \tag{11.13} \]
\[ \text{subject to}\ \ H^T H = I \tag{11.14} \]

Note that we no longer require $H$ to be binary. The hope is that the solution is close to some clustering solution that we can then extract a posteriori.

The above problem should look familiar. Interpret the columns of $H$ as a collection of $K$ mutually orthonormal basis vectors. The objective can then be written as

\[ \sum_{k=1}^{K} h_k^T K h_k \tag{11.15} \]

By choosing $h_k$ proportional to the $K$ largest eigenvectors of $K$ we will maximize the objective, i.e. with $K = U\Lambda U^T$ we have

\[ H = U_{[1:K]}R \tag{11.16} \]

where $R$ is a rotation inside the eigenvalue space, $RR^T = R^T R = I$. Using this, you can now easily verify that $\text{tr}[H^T K H] = \sum_{k=1}^{K}\lambda_k$, where $\{\lambda_k\},\ k = 1..K$ are the largest $K$ eigenvalues.

What is perhaps surprising is that the solution to this relaxed kernel-clustering problem is given by kernel-PCA! Recall that for kernel PCA we also solved for the eigenvalues of $K$. How then do we extract a clustering solution from kernel-PCA? Recall that the columns of $H$ (the eigenvectors of $K$) should approximate the binary matrix $Z$, which had a single 1 per row indicating to which cluster data-case $n$ is assigned. We could try to simply threshold the entries of $H$ so that the largest value is set to 1 and the remaining ones to 0. However, it often works better to first normalize $H$,

\[ \hat H_{nk} = \frac{H_{nk}}{\sqrt{\sum_k H_{nk}^2}} \tag{11.17} \]

All rows of $\hat H$ are located on the unit sphere. We can now run a simple clustering algorithm such as K-means on the data matrix $\hat H$ to extract $K$ clusters. The above procedure is sometimes referred to as “spectral clustering”.

Conclusion: Kernel-PCA can be viewed as a nonlinear feature extraction technique. The input is a matrix of similarities (the kernel matrix or Gram matrix), which should be positive semi-definite and symmetric. If you extract two or three features (dimensions), you can use it as a nonlinear dimensionality reduction method (for purposes of visualization). If you use the result as input to a simple clustering method (such as K-means), it becomes a nonlinear clustering method.
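A minimal sketch of the spectral clustering recipe just described: top-$K$ eigenvectors of the kernel matrix, row-normalization as in Eq. (11.17), then ordinary K-means. The Gaussian kernel, its width, and $K = 2$ are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, K=2, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Kmat = np.exp(-d2 / (2 * sigma ** 2))      # kernel (similarity) matrix
    lam, U = np.linalg.eigh(Kmat)              # eigenvalues in ascending order
    H = U[:, -K:]                              # K largest eigenvectors
    H_hat = H / np.linalg.norm(H, axis=1, keepdims=True)   # Eq. (11.17)
    return KMeans(n_clusters=K, n_init=10).fit_predict(H_hat)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 4.0])
labels = spectral_clustering(X, K=2)
```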
Chapter 12
Kernel Principal Components Analysis

Let’s first see what PCA is when we do not worry about kernels and feature spaces. We will always assume that we have centered data, i.e. $\sum_i x_i = 0$. This can always be achieved by a simple translation of the axes.

Our aim is to find meaningful projections of the data. However, we are facing an unsupervised problem where we don’t have access to any labels. If we had, we should be doing Linear Discriminant Analysis. Due to this lack of labels, our aim will be to find the subspace of largest variance, where we choose the number of retained dimensions beforehand. This is clearly a strong assumption, because it may happen that there is interesting signal in the directions of small variance, in which case PCA is not a suitable technique (and we should perhaps use a technique called independent component analysis). However, it is usually true that the directions of smallest variance represent uninteresting noise.

To make progress, we start by writing down the sample covariance matrix $C$,

\[ C = \frac{1}{N}\sum_i x_i x_i^T \tag{12.1} \]

The eigenvalues of this matrix represent the variance in the eigen-directions of data-space. The eigenvector corresponding to the largest eigenvalue is the direction in which the data is most stretched out. The second direction is orthogonal to it, and picks the direction of largest variance in that orthogonal subspace, etc. Thus, to reduce the dimensionality of the data, we project the data onto the
retained eigen-directions of largest variance:

\[ U\Lambda U^T = C \ \Rightarrow\ C = \sum_a \lambda_a u_a u_a^T \tag{12.2} \]

and the projection is given by

\[ y_i = U_k^T x_i \quad \forall i \tag{12.3} \]

where $U_k$ is the $d \times k$ sub-matrix containing the first $k$ eigenvectors as columns. As a side effect, we can now show that the projected data are de-correlated in this new basis:

\[ \frac{1}{N}\sum_i y_i y_i^T = \frac{1}{N}\sum_i U_k^T x_i x_i^T U_k = U_k^T C U_k = U_k^T U\Lambda U^T U_k = \Lambda_k \tag{12.4} \]

where $\Lambda_k$ is the (diagonal) $k \times k$ sub-matrix corresponding to the largest eigenvalues. Another convenient property of this procedure is that it minimizes the L2-norm reconstruction error from $y$ back to $x$, i.e. the quantity

\[ \sum_i \|x_i - P_k x_i\|^2 \tag{12.5} \]

where $P_k = U_k U_k^T$ is the projection onto the subspace spanned by the columns of $U_k$, is minimal.

Now imagine that there are more dimensions than data-cases, i.e. some dimensions remain unoccupied by the data. In this case it is not hard to show that the eigenvectors that span the projection space must lie in the subspace spanned by the data-cases. This can be seen as follows,

\[ \lambda_a u_a = C u_a = \frac{1}{N}\sum_i x_i x_i^T u_a = \frac{1}{N}\sum_i (x_i^T u_a)\, x_i \ \Rightarrow\ u_a = \sum_i \frac{(x_i^T u_a)}{N\lambda_a}\, x_i = \sum_i \alpha^a_i x_i \tag{12.6} \]

where $u_a$ is some arbitrary eigenvector of $C$. The last expression can be interpreted as: “every eigenvector can be exactly written (i.e. losslessly) as some linear combination of the data-vectors, and hence it must lie in their span”. This also implies that instead of the eigenvalue equations $Cu = \lambda u$ we may consider the $N$ projected equations $x_i^T C u = \lambda x_i^T u\ \forall i$. From this equation the coefficients $\alpha^a_i$ can be computed efficiently, in a space of dimension $N$ (and not $d$), as follows:

\[ x_i^T C u_a = \lambda_a x_i^T u_a \ \Rightarrow\ x_i^T \frac{1}{N}\sum_k x_k x_k^T \sum_j \alpha^a_j x_j = \lambda_a x_i^T \sum_j \alpha^a_j x_j \ \Rightarrow\ \frac{1}{N}\sum_{j,k}\alpha^a_j [x_i^T x_k][x_k^T x_j] = \lambda_a\sum_j \alpha^a_j [x_i^T x_j] \tag{12.7} \]

We now rename the matrix $[x_i^T x_j] = K_{ij}$ to arrive at

\[ K^2\alpha^a = N\lambda_a K\alpha^a \ \Rightarrow\ K\alpha^a = \tilde\lambda_a\,\alpha^a \quad \text{with}\ \tilde\lambda_a = N\lambda_a \tag{12.8} \]

So, we have derived an eigenvalue equation for $\alpha$, which in turn completely determines the eigenvectors $u$. By requiring that $u$ is normalized we find

\[ u_a^T u_a = 1 \ \Rightarrow\ \sum_{i,j}\alpha^a_i\alpha^a_j [x_i^T x_j] = \alpha^{aT}K\alpha^a = N\lambda_a\,\alpha^{aT}\alpha^a = 1 \ \Rightarrow\ \|\alpha^a\| = 1/\sqrt{N\lambda_a} \tag{12.9} \]

Finally, when we receive a new data-case $t$ and we would like to compute its projections onto the new reduced space, we compute

\[ u_a^T t = \sum_i \alpha^a_i\, x_i^T t = \sum_i \alpha^a_i\, K(x_i, t) \tag{12.10} \]

This equation should look familiar: it is central to most kernel methods. Obviously, the whole exposition was set up so that in the end we only needed the matrix $K$ to do our calculations. This implies that we are now ready to kernelize the procedure by replacing $x_i \to \Phi(x_i)$ and defining $K_{ij} = \Phi(x_i)\Phi(x_j)^T$, where $\Phi(x_i) = \Phi_{ia}$.

12.1 Centering Data in Feature Space

It is in fact very difficult to explicitly center the data in feature space. But we know that the final algorithm only depends on the kernel matrix, so if we can center the kernel matrix we are done as well. A kernel matrix is given by $K_{ij} = \Phi_i\Phi_j^T$. We now center the features using

\[ \Phi_i = \Phi_i - \frac{1}{N}\sum_k \Phi_k \tag{12.11} \]

Hence the kernel in terms of the new features is given by

\[ K^c_{ij} = \Big(\Phi_i - \frac{1}{N}\sum_k \Phi_k\Big)\Big(\Phi_j - \frac{1}{N}\sum_l \Phi_l\Big)^T \tag{12.12} \]
\[ = \Phi_i\Phi_j^T - \Big[\frac{1}{N}\sum_k \Phi_k\Big]\Phi_j^T - \Phi_i\Big[\frac{1}{N}\sum_l \Phi_l^T\Big] + \Big[\frac{1}{N}\sum_k \Phi_k\Big]\Big[\frac{1}{N}\sum_l \Phi_l^T\Big] \tag{12.13} \]
\[ = K_{ij} - \kappa_i\mathbf{1}_j^T - \mathbf{1}_i\kappa_j^T + k\,\mathbf{1}_i\mathbf{1}_j^T \tag{12.14} \]

with

\[ \kappa_i = \frac{1}{N}\sum_k K_{ik} \tag{12.15} \]
\[ k = \frac{1}{N^2}\sum_{ij} K_{ij} \tag{12.16} \]

Hence, we can compute the centered kernel in terms of the non-centered kernel alone, and no features need to be accessed. At test-time we need to compute

\[ K^c(t_i, x_j) = \Big[\Phi(t_i) - \frac{1}{N}\sum_k \Phi(x_k)\Big]\Big[\Phi(x_j) - \frac{1}{N}\sum_l \Phi(x_l)\Big]^T \tag{12.17} \]

Using a similar calculation (left for the reader) you can find that this can be expressed easily in terms of $K(t_i, x_j)$ and $K(x_i, x_j)$ as follows,

\[ K^c(t_i, x_j) = K(t_i, x_j) - \kappa(t_i)\mathbf{1}_j^T - \mathbf{1}_i\kappa(x_j)^T + k\,\mathbf{1}_i\mathbf{1}_j^T \tag{12.18} \]
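Putting the pieces together, here is a kernel PCA sketch: center the kernel matrix (Eq. 12.14), solve $K\alpha = \tilde\lambda\alpha$ (Eq. 12.8), normalize $\alpha$ (Eq. 12.9), and project (Eq. 12.10). The polynomial kernel and toy data are illustrative choices.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    N = K.shape[0]
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one   # centered kernel, Eq. (12.14)
    lam, A = np.linalg.eigh(Kc)                  # ascending eigenvalues
    lam, A = lam[::-1], A[:, ::-1]               # largest first
    A = A[:, :n_components] / np.sqrt(lam[:n_components])  # ||alpha^a|| = 1/sqrt(N lam_a)
    return Kc @ A                                # projections u_a^T x_i, Eq. (12.10)

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))
K = (X @ X.T + 1.0) ** 2                         # a polynomial kernel, as an example
Y = kernel_pca(K, n_components=2)
```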
Chapter 13
Fisher Linear Discriminant Analysis

The most famous example of dimensionality reduction is “principal components analysis”. This technique searches for directions in the data that have largest variance and subsequently projects the data onto them. In this way, we obtain a lower dimensional representation of the data that removes some of the “noisy” directions. There are many difficult issues with how many directions one needs to choose, but that is beyond the scope of this note.

PCA is an unsupervised technique and as such does not include label information of the data. For instance, imagine 2 cigar-like clusters in 2 dimensions, where one cigar has $y = 1$ and the other $y = -1$. The cigars are positioned in parallel and very closely together, such that the variance in the total data-set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space).

So the question is: how do we utilize the label information in finding informative projections? To that purpose Fisher-LDA considers maximizing the following objective:

\[ J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{13.1} \]

where $S_B$ is the “between classes scatter matrix” and $S_W$ is the “within classes scatter matrix”. Note that, due to the fact that scatter matrices are proportional to the covariance matrices, we could have defined $J$ using covariance matrices; the proportionality constant would have no effect on the solution. The definitions of the scatter matrices are:

\[ S_B = \sum_c N_c(\mu_c - \bar x)(\mu_c - \bar x)^T \tag{13.2} \]
\[ S_W = \sum_c\sum_{i \in c}(x_i - \mu_c)(x_i - \mu_c)^T \tag{13.3} \]

where

\[ \mu_c = \frac{1}{N_c}\sum_{i \in c} x_i \tag{13.4} \]
\[ \bar x = \frac{1}{N}\sum_i x_i = \frac{1}{N}\sum_c N_c\mu_c \tag{13.5} \]

and $N_c$ is the number of cases in class $c$. Oftentimes you will see that, for 2 classes, $S_B$ is defined as $S_B' = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$. This is the scatter of class 1 with respect to the scatter of class 2, and one can show that $S_B = \frac{N_1 N_2}{N}S_B'$; but since this boils down to multiplying the objective by a constant, it makes no difference to the final solution.

Why does this objective make sense? Well, it says that a good solution is one where the class-means are well separated, measured relative to the (sum of the) variances of the data assigned to each class. This is precisely what we want, because it implies that the gap between the classes is expected to be big. It is also interesting to observe that, since the total scatter,

\[ S_T = \sum_i (x_i - \bar x)(x_i - \bar x)^T \tag{13.6} \]

is given by $S_T = S_W + S_B$, the objective can be rewritten as

\[ J(w) = \frac{w^T S_T w}{w^T S_W w} - 1 \tag{13.7} \]

and hence can be interpreted as maximizing the total scatter of the data while minimizing the within scatter of the classes.

An important property to notice about the objective $J$ is that it is invariant w.r.t. rescalings of the vector $w \to \alpha w$. Hence, we can always choose $w$ such that the denominator is simply $w^T S_W w = 1$, since it is a scalar itself. For this reason we can transform the problem of maximizing $J$ into the following constrained
optimization problem,

\[ \min_w\ -\tfrac{1}{2}w^T S_B w \tag{13.8} \]
\[ \text{s.t.}\ \ w^T S_W w = 1 \tag{13.9} \]

corresponding to the Lagrangian

\[ L_P = -\tfrac{1}{2}w^T S_B w + \tfrac{1}{2}\lambda\big(w^T S_W w - 1\big) \tag{13.10} \]

(the halves are added for convenience). The KKT conditions tell us that the following equation needs to hold at the solution,

\[ S_B w = \lambda S_W w \tag{13.11} \]

This almost looks like an eigenvalue equation. In fact, it is called a generalized eigen-problem, and just like a normal eigenvalue problem there are standard ways to solve it.

It remains to choose which eigenvalue and eigenvector correspond to the desired solution. Plugging the solution back into the objective $J$, we find

\[ J(w) = \frac{w^T S_B w}{w^T S_W w} = \lambda_k\frac{w_k^T S_W w_k}{w_k^T S_W w_k} = \lambda_k \tag{13.12} \]

from which it immediately follows that we want the largest eigenvalue to maximize the objective.¹

¹ If you try to find the dual and maximize that, you seem to get the wrong sign. My best guess of what goes wrong is that the constraint is not linear, and as a result the problem is not convex, and hence we cannot expect the optimal dual solution to be the same as the optimal primal solution.

13.1 Kernel Fisher LDA

So how do we kernelize this problem? Unlike for SVMs, the dual problem does not seem to reveal the kernelized problem naturally. But inspired by the SVM case we make the following key assumption,

\[ w = \sum_i \alpha_i\Phi(x_i) \tag{13.13} \]

This is a central recurrent equation that keeps popping up in every kernel machine. It says that although the feature space is very high (or even infinite) dimensional, with a finite number of data-cases the final solution, $w^*$, will not have a component outside the space spanned by the data-cases. It would not make much sense to do this transformation if the number of data-cases were larger than the number of dimensions, but this is typically not the case for kernel methods. So, we argue that although there are possibly infinitely many dimensions available a priori, at most $N$ are being occupied by the data, and the solution $w$ must lie in their span. This is a case of the “representer theorem”, which intuitively reasons as follows. The solution $w$ is the solution to some eigenvalue equation, $S_B^{\frac{1}{2}}S_W^{-1}S_B^{\frac{1}{2}}w = \lambda w$, where both $S_B$ and $S_W$ (and hence its inverse) lie in the span of the data-cases. Hence, the part $w_\perp$ that is perpendicular to this span will be projected to zero, and the equation above puts no constraints on those dimensions. They can be arbitrary and have no impact on the solution. If we now assume a very general form of regularization on the norm of $w$, then these orthogonal components will be set to zero in the final solution: $w_\perp = 0$.

In terms of $\alpha$ the objective $J(\alpha)$ becomes

\[ J(\alpha) = \frac{\alpha^T S_B^\Phi\alpha}{\alpha^T S_W^\Phi\alpha} \tag{13.14} \]

where it is understood that vector notation now applies to a different space, namely the space spanned by the data-vectors, $\mathbb{R}^N$. The scatter matrices in kernel space can be expressed in terms of the kernel only, as follows (this requires some algebra to verify),

\[ S_B^\Phi = \sum_c N_c\left[\kappa_c\kappa_c^T - \kappa\kappa^T\right] \tag{13.15} \]
\[ S_W^\Phi = K^2 - \sum_c N_c\,\kappa_c\kappa_c^T \tag{13.16} \]
\[ \kappa_c = \frac{1}{N_c}\sum_{i \in c} K_{ij} \tag{13.17} \]
\[ \kappa = \frac{1}{N}\sum_i K_{ij} \tag{13.18} \]

So, we have managed to express the problem in terms of kernels only, which is what we were after. Note that since the objective in terms of $\alpha$ has exactly the same form as that in terms of $w$, we can solve it by solving the generalized eigenvalue equation.
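Since the primal problem (13.11) and its kernel analogue share this generalized eigen-problem form, one sketch covers both. Here is the primal version with scipy; the two-class toy data and the small ridge added to $S_W$ (foreshadowing the regularization below) are our choices.

```python
import numpy as np
from scipy.linalg import eigh

# Fisher LDA via the generalized eigen-problem S_B w = lambda S_W w, Eq. (13.11).
rng = np.random.default_rng(0)
X1 = rng.normal(size=(30, 2))
X2 = rng.normal(size=(30, 2)) + [3.0, 1.0]
X = np.vstack([X1, X2])
mu1, mu2, mu = X1.mean(0), X2.mean(0), X.mean(0)

S_B = 30 * np.outer(mu1 - mu, mu1 - mu) + 30 * np.outer(mu2 - mu, mu2 - mu)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

lam, W = eigh(S_B, S_W + 1e-8 * np.eye(2))   # generalized, ascending eigenvalues
w = W[:, -1]                                 # largest eigenvalue, cf. Eq. (13.12)
projections = X @ w                          # 1-d projected space
```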
This scales as $N^3$, which is certainly expensive for many datasets. More efficient optimization schemes, solving a slightly different problem and based on efficient quadratic programs, exist in the literature.

Projections of new test-points onto the solution space can be computed by

\[ w^T\Phi(x) = \sum_i \alpha_i K(x_i, x) \tag{13.19} \]

as usual. In order to classify the test point we still need to divide the space into regions which belong to one class. The easiest possibility is to pick the cluster with the smallest Mahalanobis distance: $d(x, \mu_c^\Phi) = (x^\alpha - \mu_c^\alpha)^2/(\sigma_c^\alpha)^2$, where $\mu_c^\alpha$ and $\sigma_c^\alpha$ represent the class mean and standard deviation in the 1-d projected space, respectively. Alternatively, one could train any classifier in the 1-d subspace.

One very important issue that we did not pay attention to is regularization. Clearly, as it stands the kernel machine will overfit. To regularize we can add a term to the denominator,

\[ S_W \to S_W + \beta I \tag{13.20} \]

Adding a diagonal term to this matrix makes sure that very small eigenvalues are bounded away from zero, which improves numerical stability in computing the inverse. If we write the Lagrangian formulation, where we maximize a constrained quadratic form in $\alpha$, the extra term appears as a penalty proportional to $\|\alpha\|^2$, which acts as a weight decay term, favoring smaller values of $\alpha$ over larger ones. Fortunately, the optimization problem has exactly the same form in the regularized case.

13.2 A Constrained Convex Programming Formulation of FDA

Chapter 14
Kernel Canonical Correlation Analysis

Imagine you are given 2 copies of a corpus of documents, one written in English, the other written in German. You may consider an arbitrary representation of the documents, but for definiteness we will use the “vector space” representation, where there is an entry for every possible word in the vocabulary and a document is represented by count values for every word, i.e. if the word “the” appeared 12 times and is the first word in the vocabulary, we have $X_1(\text{doc}) = 12$, etc.

Let’s say we are interested in extracting low dimensional representations for each document. If we had only one language, we could consider running PCA to extract directions in word space that carry most of the variance. This has the ability to infer semantic relations between the words, such as synonymy, because if words tend to co-occur often in documents, i.e. they are highly correlated, they tend to be combined into a single dimension in the new space. These spaces can often be interpreted as topic spaces.

If we have two translations, we can try to find projections of each representation separately such that the projections are maximally correlated. Hopefully, this implies that they represent the same topic in two different languages. In this way we can extract language independent topics.

Let $x$ be a document in English and $y$ a document in German. Consider the projections $u = a^T x$ and $v = b^T y$. Also assume that the data have zero mean. We now consider the following objective,

\[ \rho = \frac{E[uv]}{\sqrt{E[u^2]E[v^2]}} \tag{14.1} \]
We want to maximize this objective, because this would maximize the correlation between the univariates $u$ and $v$. Note that we divide by the standard deviations of the projections to remove scale dependence. This exposition is very similar to the Fisher discriminant analysis story, and I encourage you to reread that. For instance, there you can find how to generalize to cases where the data is not centered.

We also introduce the following “trick”. Since we can rescale $a$ and $b$ without changing the problem, we can constrain the variances to be equal to 1. This then allows us to write the problem as

\[ \text{maximize}_{a,b}\ \ \rho = E[uv] \quad \text{subject to}\ \ E[u^2] = 1,\ \ E[v^2] = 1 \tag{14.2} \]

Or, if we construct a Lagrangian and write out the expectations, we find

\[ \min_{a,b}\max_{\lambda_1,\lambda_2}\ \sum_i a^T x_i y_i^T b - \tfrac{1}{2}\lambda_1\Big(\sum_i a^T x_i x_i^T a - N\Big) - \tfrac{1}{2}\lambda_2\Big(\sum_i b^T y_i y_i^T b - N\Big) \tag{14.3} \]

where we have multiplied by $N$. Let’s take derivatives w.r.t. $a$ and $b$ to see what the KKT equations tell us,

\[ \sum_i x_i y_i^T b - \lambda_1\sum_i x_i x_i^T a = 0 \tag{14.4} \]
\[ \sum_i y_i x_i^T a - \lambda_2\sum_i y_i y_i^T b = 0 \tag{14.5} \]

First notice that if we multiply the first equation by $a^T$ and the second by $b^T$ and subtract the two, while using the constraints, we arrive at $\lambda_1 = \lambda_2 = \lambda$. Next, rename $S_{xy} = \sum_i x_i y_i^T$, $S_x = \sum_i x_i x_i^T$ and $S_y = \sum_i y_i y_i^T$. We define the following larger matrices: $S_D$ is the block diagonal matrix with $S_x$ and $S_y$ on the diagonal and zeros on the off-diagonal blocks. Also, we define $S_O$ to be the off-diagonal matrix with $S_{xy}$ on the off-diagonal. Finally, we define $c = [a, b]$. The two equations can then be written jointly as

\[ S_O c = \lambda S_D c \ \Rightarrow\ S_D^{-1}S_O c = \lambda c \ \Rightarrow\ S_O^{\frac{1}{2}}S_D^{-1}S_O^{\frac{1}{2}}\big(S_O^{\frac{1}{2}}c\big) = \lambda\big(S_O^{\frac{1}{2}}c\big) \tag{14.6} \]

which is again a regular eigenvalue equation for $c' = S_O^{\frac{1}{2}}c$.

14.1 Kernel CCA

As usual, the starting point is to map the data-cases to feature vectors $\Phi(x_i)$ and $\Psi(y_i)$. When the dimensionality of the space is larger than the number of data-cases in the training-set, the solution must lie in the span of the data-cases, i.e.

\[ a = \sum_i \alpha_i\Phi(x_i), \qquad b = \sum_i \beta_i\Psi(y_i) \tag{14.7} \]

Using this equation in the Lagrangian we get

\[ L = \alpha^T K_x K_y\beta - \tfrac{1}{2}\lambda\big(\alpha^T K_x^2\alpha - N\big) - \tfrac{1}{2}\lambda\big(\beta^T K_y^2\beta - N\big) \tag{14.8} \]

where $\alpha$ is a vector in a different, $N$-dimensional space than e.g. $a$, which lives in a $D$-dimensional space, and $K_x$ is the Gram matrix with entries $\Phi(x_i)^T\Phi(x_j)$, and similarly for $K_y$. Taking derivatives w.r.t. $\alpha$ and $\beta$ we find

\[ K_x K_y\beta = \lambda K_x^2\alpha \tag{14.9} \]
\[ K_y K_x\alpha = \lambda K_y^2\beta \tag{14.10} \]

Let’s try to solve these equations by assuming that $K_x$ is full rank (which is typically the case). We get $\alpha = \lambda^{-1}K_x^{-1}K_y\beta$, and hence $K_y^2\beta = \lambda^2 K_y^2\beta$, which always has a solution for $\lambda = 1$. By recalling that

\[ \rho = \frac{1}{N}a^T S_{xy} b = \frac{1}{N}\lambda\, a^T S_x a = \lambda \tag{14.11} \]

we observe that this represents the solution with maximal correlation, and hence the preferred one. This is a typical case of over-fitting, and it emphasizes again the need to regularize in kernel methods. This can be done by adding a diagonal term to the constraints in the Lagrangian (or equivalently to the denominator of the original objective), leading to the Lagrangian

\[ L = \alpha^T K_x K_y\beta - \tfrac{1}{2}\lambda\big(\alpha^T K_x^2\alpha + \eta\|\alpha\|^2 - N\big) - \tfrac{1}{2}\lambda\big(\beta^T K_y^2\beta + \eta\|\beta\|^2 - N\big) \tag{14.12} \]

One can see that this acts as a quadratic penalty on the norms of $\alpha$ and $\beta$. The resulting equations are

\[ K_x K_y\beta = \lambda\big(K_x^2 + \eta I\big)\alpha \tag{14.13} \]
\[ K_y K_x\alpha = \lambda\big(K_y^2 + \eta I\big)\beta \tag{14.14} \]

Analogous to the primal problem, we will define big matrices: $K_D$, which contains $(K_x^2 + \eta I)$ and $(K_y^2 + \eta I)$ as blocks on the diagonal and zeros at the blocks off the diagonal, and the matrix $K_O$, which has the matrix $K_x K_y$ at the right-upper off-diagonal block and $K_y K_x$ at the left-lower off-diagonal block. Also, we define $\gamma = [\alpha, \beta]$. This leads to the equation

\[ K_O\gamma = \lambda K_D\gamma \ \Rightarrow\ K_D^{-1}K_O\gamma = \lambda\gamma \ \Rightarrow\ K_O^{\frac{1}{2}}K_D^{-1}K_O^{\frac{1}{2}}\big(K_O^{\frac{1}{2}}\gamma\big) = \lambda\big(K_O^{\frac{1}{2}}\gamma\big) \tag{14.15} \]

which is again a regular eigenvalue equation. Note that the regularization also moved the smallest eigenvalue away from zero, and hence made the inverse more numerically stable. The value for $\eta$ needs to be chosen using cross-validation or some other measure. Solving the equations using this larger eigenvalue problem is actually not quite necessary, and more efficient methods exist (see book). The solutions are not expected to be sparse, because eigenvectors are not expected to be sparse. One would have to replace L2 norm penalties with L1 norm penalties to obtain sparsity.
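As a sketch of the regularized kernel CCA computation, the code below builds $K_D$ and $K_O$ and solves the generalized eigen-problem $K_O\gamma = \lambda K_D\gamma$ of Eq. (14.15) directly; the linear kernels, the two synthetic “views”, and the value of $\eta$ are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(0)
N = 40
Z = rng.normal(size=(N, 2))                  # shared latent signal
X = Z + 0.1 * rng.normal(size=(N, 2))        # first view ("English")
Y = Z @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(N, 3))  # second view

Kx, Ky = X @ X.T, Y @ Y.T                    # Gram matrices (linear kernels)
eta = 1.0
K_D = np.block([[Kx @ Kx + eta * np.eye(N), np.zeros((N, N))],
                [np.zeros((N, N)), Ky @ Ky + eta * np.eye(N)]])
K_O = np.block([[np.zeros((N, N)), Kx @ Ky],
                [Ky @ Kx, np.zeros((N, N))]])

lam, G = eig(K_O, K_D)                       # generalized eigenvalue problem
gamma = G[:, np.argmax(lam.real)].real
alpha, beta = gamma[:N], gamma[N:]           # a = sum_i alpha_i Phi(x_i), Eq. (14.7)
```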
Appendix A
Essentials of Convex Optimization

A.1 Lagrangians and all that

Most kernel-based algorithms fall into two classes: either they use spectral techniques to solve the problem, or they use convex optimization techniques to solve the problem. Here we will discuss convex optimization.

A constrained optimization problem can be expressed as follows:

\[ \text{minimize}_x\ f_0(x) \quad \text{subject to}\ \ f_i(x) \le 0\ \forall i, \qquad h_j(x) = 0\ \forall j \tag{A.1} \]

That is, we have inequality constraints and equality constraints. We now write the primal Lagrangian of this problem, which will be helpful in the following development,

\[ L_P(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x) \tag{A.2} \]

where we will assume in the following that $\lambda_i \ge 0\ \forall i$. From here we can define the dual Lagrangian by

\[ L_D(\lambda, \nu) = \inf_x L_P(x, \lambda, \nu) \tag{A.3} \]

This objective can actually become $-\infty$ for certain values of its arguments. We will call parameters $\lambda \ge 0, \nu$ for which $L_D > -\infty$ dual feasible.

It is important to notice that the dual Lagrangian is a concave function of $\lambda, \nu$, because it is a pointwise infimum of a family of functions linear in $\lambda, \nu$. Hence, even if the primal is not convex, the dual is certainly concave! It is not hard to show that

\[ L_D(\lambda, \nu) \le p^* \tag{A.4} \]

where $p^*$ is the primal optimal point. This simply follows because $\sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x) \le 0$ for a primal feasible point $x^*$. Thus, the dual problem always provides a lower bound to the primal problem. The optimal lower bound can be found by solving the dual problem,

\[ \text{maximize}_{\lambda,\nu}\ L_D(\lambda, \nu) \quad \text{subject to}\ \ \lambda_i \ge 0\ \forall i \tag{A.5} \]

which is therefore a convex optimization problem. If we call $d^*$ the dual optimal point, we always have $d^* \le p^*$, which is called weak duality; $p^* - d^*$ is called the duality gap. Strong duality holds when $p^* = d^*$. Strong duality is very nice, in particular if we can express the primal solution $x^*$ in terms of the dual solution $\lambda^*, \nu^*$, because then we can simply solve the dual problem and convert the answer to the primal domain, since we know that solution must then be optimal. Often the dual problem is easier to solve.

So when does strong duality hold? Up to some mathematical details, the answer is: if the primal problem is convex and the equality constraints are linear. This means that $f_0(x)$ and $\{f_i(x)\}$ are convex functions and $h_j(x) = Ax - b$.

The primal problem can be written as follows,

\[ p^* = \inf_x \sup_{\lambda \ge 0, \nu} L_P(x, \lambda, \nu) \tag{A.6} \]

This can be seen by noting that $\sup_{\lambda \ge 0, \nu} L_P(x, \lambda, \nu) = f_0(x)$ when $x$ is feasible, but $\infty$ otherwise. To see this, first check that by violating one of the constraints you can find a choice of $\lambda, \nu$ that makes the Lagrangian infinity. Also, when all the constraints are satisfied, the best we can do is maximize the additional terms to be zero, which is always possible. For instance, we can simply set all $\lambda, \nu$ to zero, even though this is not necessary if the constraints themselves vanish.

The dual problem by definition is given by

\[ d^* = \sup_{\lambda \ge 0, \nu}\inf_x L_P(x, \lambda, \nu) \tag{A.7} \]
Hence, the “sup” and “inf” can be interchanged if strong duality holds: the optimal solution is a saddle-point. It is important to realize that the order of maximization and minimization matters for arbitrary functions (but not for convex functions). Try to imagine a “V”-shaped valley which runs diagonally across the coordinate system. If we first maximize over one direction, keeping the other direction fixed, and then minimize the result, we end up with the lowest point on the rim. If we reverse the order, we end up with the highest point in the valley.

There are a number of important necessary conditions that hold for problems with zero duality gap. These Karush-Kuhn-Tucker conditions turn out to be sufficient for convex optimization problems. They are given by

\[ \nabla f_0(x^*) + \sum_i \lambda_i^*\nabla f_i(x^*) + \sum_j \nu_j^*\nabla h_j(x^*) = 0 \tag{A.8} \]
\[ f_i(x^*) \le 0 \tag{A.9} \]
\[ h_j(x^*) = 0 \tag{A.10} \]
\[ \lambda_i^* \ge 0 \tag{A.11} \]
\[ \lambda_i^* f_i(x^*) = 0 \tag{A.12} \]

The first equation is easily derived, because we already saw that $p^* = \inf_x L_P(x, \lambda^*, \nu^*)$, and hence all the derivatives must vanish. This condition has a nice interpretation as a “balancing of forces”. Imagine a ball rolling down a surface defined by $f_0(x)$ (i.e. you are doing gradient descent to find the minimum). The ball gets blocked by a wall, which is the constraint. If the surface and constraint are convex, then if the ball doesn’t move we have reached the optimal solution. At that point, the forces on the ball must balance. The first term represents the force of the ball against the wall due to gravity (the ball is still on a slope). The second term represents the reaction force of the wall in the opposite direction. The $\lambda$ represents the magnitude of the reaction force, which needs to be higher if the surface slopes more. We say that this constraint is “active”. Other constraints which do not exert a force are “inactive” and have $\lambda = 0$. The latter statement can be read off from the last KKT condition, which we call “complementary slackness”. It says that either $f_i(x) = 0$ (the constraint is saturated and hence active), in which case $\lambda$ is free to take on a non-zero value, or the constraint is inactive, $f_i(x) < 0$, in which case $\lambda$ must vanish. As we have seen, the active constraints correspond to the support vectors in SVMs!

Complementary slackness is easily derived by

\[ f_0(x^*) = L_D(\lambda^*, \nu^*) = \inf_x\Big(f_0(x) + \sum_i \lambda_i^* f_i(x) + \sum_j \nu_j^* h_j(x)\Big) \le f_0(x^*) + \sum_i \lambda_i^* f_i(x^*) + \sum_j \nu_j^* h_j(x^*) \tag{A.13} \]
\[ \le f_0(x^*) \tag{A.14} \]

where the first line follows from Eqn. A.6, the second because the inf is always smaller than the value at any particular $x^*$, and the last because $f_i(x^*) \le 0$, $\lambda_i^* \ge 0$ and $h_j(x^*) = 0$. Hence all inequalities are equalities, and since each term is negative, each term must vanish separately.

Appendix B
Kernel Design

B.1 Polynomial Kernels

The construction that we will follow below is to first write feature vectors as products of subsets of input attributes, i.e. define feature vectors as follows,

\[ \phi_I(x) = x_1^{i_1}x_2^{i_2}\cdots x_n^{i_n} \tag{B.1} \]

where we can put various restrictions on the possible combinations of indices which are allowed. For instance, we could require that their sum is a constant $s$, i.e. that there are precisely $s$ terms in the product. Or we could require that each $i_j \in \{0, 1\}$. Generally speaking, the best choice depends on the problem you are modelling, but another important constraint is that the corresponding kernel must be easy to compute. Let’s define the kernel as usual as

\[ K(x, y) = \sum_I \phi_I(x)\phi_I(y) \tag{B.2} \]

where $I = [i_1, i_2, ..., i_n]$. We have already encountered the polynomial kernel as

\[ K(x, y) = (R + x^T y)^d = \sum_{s=0}^{d}\frac{d!}{s!(d-s)!}R^{d-s}(x^T y)^s \tag{B.3} \]

where the last equality follows from a binomial expansion. If we write out the term

\[ (x^T y)^s = (x_1 y_1 + x_2 y_2 + \ldots + x_n y_n)^s = \sum_{\substack{i_1, i_2, \ldots, i_n \\ i_1 + i_2 + \ldots + i_n = s}}\frac{s!}{i_1!\,i_2!\cdots i_n!}(x_1 y_1)^{i_1}(x_2 y_2)^{i_2}\cdots(x_n y_n)^{i_n} \tag{B.4} \]

then, taken together with Eqn. B.3, we see that the features correspond to

\[ \phi_I(x) = \sqrt{\frac{d!}{(d-s)!\,i_1!\,i_2!\cdots i_n!}\,R^{d-s}}\ x_1^{i_1}x_2^{i_2}\cdots x_n^{i_n} \qquad \text{with}\ i_1 + i_2 + \ldots + i_n = s
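As a small numeric check of Eqs. (B.2)-(B.3), the sketch below compares the kernel value $(R + x^T y)^d$ against the explicit feature expansion for a tiny case; the brute-force enumeration of multi-indices is our illustration, feasible only for small $n$ and $d$.

```python
import numpy as np
from itertools import product
from math import factorial

# Verify that (R + x^T y)^d equals sum_I phi_I(x) phi_I(y) for the features above.
def phi(x, d, R):
    n = len(x)
    feats = []
    for I in product(range(d + 1), repeat=n):   # multi-indices i_1, ..., i_n
        s = sum(I)
        if s > d:
            continue
        coef = factorial(d) / (factorial(d - s) * np.prod([factorial(i) for i in I]))
        feats.append(np.sqrt(coef * R ** (d - s)) *
                     np.prod([x[j] ** I[j] for j in range(n)]))
    return np.array(feats)

x, y, R, d = np.array([0.5, -1.0]), np.array([2.0, 0.3]), 1.0, 3
assert np.isclose((R + x @ y) ** d, phi(x, d, R) @ phi(y, d, R))
```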