{ "query": "what is this about?", "namespace": "83089d10-8f38-4906-ba85-0fd4104650c2", "messages": [], "stream": false, "language_level": "", "chat_channel": ":421:1", "language": "German", "tone": "neutral", "writing_style": "standard", "model": "claude-3-5-haiku-latest", "knowledgebase": "ki-dev-large", "seed": 1, "client_id": 1, "all_context": true, "follow_up_for": null, "knowledgebase_files_count": 8, "override_command": "", "disable_clarity_check": true, "high_consistency": false, "custom_primer": "", "logging": true, "query_route": "", "web_search": false } QUERY ROUTE Query Route: summary ================================================== **Elapsed Time: 1.38 seconds** ================================================== RAG PARAMS RAG Parameters: {'dynamically_expand': False, 'top_k': 120, 'actual_k': 120, 'satisfying_score': 0} ================================================== **Elapsed Time: 0.00 seconds** ================================================== VECTOR SEARCH RESULTS Results: [{'id': '51085944-450d-4ac0-b8da-d50af746ab74', 'metadata': {'chunk': 2.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': '## Fairway Favorites\n' '\n' '**The Mountain** \n' 'grilled homemade biscuit topped with two sausage ' 'patties, two scrambled eggs, bacon, cheddar cheese and ' 'sausage gravy â\x80¦ served with home fries \n' '14\n' '\n' '**Nannyâ\x80\x99s Potato Pancakes** \n' 'two homemade potato pancakes and two eggs â\x80¦ served ' 'with toast and choice of bacon, ham or sausage \n' '13.50\n' '\n' '**Corned Beef Hash and Eggs** \n' 'our own recipe, served with two eggs, toast and home ' 'fries \n' '14.50\n' '\n' '**Fairway Breakfast** \n' 'three eggs with ham, bacon and sausageâ\x80¦ served ' 'with toast and home fries \n' '12.50\n' '\n' '**Silver Skillet** \n' 'sauteed onions, peppers, home fries and choice of ' 'bacon, ham or sausage, all scrambled with three eggs ' 'and topped with a blend of cheddar and mozzarellaâ\x80¦ ' 'served with toast \n' '12.50\n' '\n' '**Country Breakfast** \n' 'grilled homemade biscuit topped with two sausage ' 'patties, two eggs and sausage gravy â\x80¦ served with ' 'home fries \n' '12.50\n' '\n' '**Two by Two** \n' 'two eggs, short stack of buttermilk pancakes or French ' 'toast, and choice of two strips of bacon, two sausage ' 'links or slice of ham \n' '12\n' '\n' '**Quinoa Bowl** \n' 'Quinoa mixed with black beans, corn, roasted red ' 'peppers and a touch of chipotle pepperâ\x80¦topped with ' 'two eggs any style, fresh avocado, pico de gallo and ' 'cheddar jack cheese. \n' '13\n' '\n' '**Avocado Toast** \n' 'two slices of toast topped with mashed avocado and ' 'poached eggs, finished with our house recipe everything ' 'bagel seasoning and fresh micro greens. 
\n' '11\n' '\n' '## Benedicts\n' '\n' '_all served with home fries_\n' '\n' '**Traditional** \n' 'English muffin topped with two poached eggs, Canadian ' 'bacon, and hollandaise \n' '12.50\n' '\n' '**Eggs Mulligan** \n' 'English muffin topped with poached eggs, our own corned ' 'beef hash and hollandaise \n' '14.50\n' '\n' '**Nannyâ\x80\x99s** \n' 'homemade potato pancakes topped with poached eggs and ' 'hollandaiseâ\x80¦ served with choice of fresh fruit or ' 'home fries \n' '13\n' '\n' '**Garden** \n' 'homemade Italian herbed focaccia toast topped with two ' 'poached eggs, spinach, tomatoes, mushrooms and ' 'hollandaise \n' '12.50'}, 'score': 0.0, 'values': []}, {'id': '4a0a5bff-c57f-4b2f-b879-1ce906d0611b', 'metadata': {'chunk': 3.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': '**Fried Green Tomatoes** \n' 'a bed of spinach topped with two poached eggs, fried ' 'green tomatoes, avocado, hollandaise and garnished with ' 'basil \n' '12\n' '\n' '**Crab Cakes** \n' 'grilled homemade crab cakes on a bed of fresh spinach, ' 'topped with poached eggs and hollandaise \n' '18\n' '\n' '**Portuguese Benedict** \n' 'homemade English muffin bread topped with two poached ' 'eggs, grilled linguica, sautéed onions and peppers ' 'with hollandaise \n' '16\n' '\n' '## Egg Combos\n' '\n' '_served with toast and home fries_\n' '\n' 'add bacon, ham, chicken sausage or sausage 4 Linguica ' '6\n' '\n' 'one egg 6 two eggs 7 three eggs 8\n' '\n' '## Omelets\n' '\n' '_Sautéed toppings cooked into the eggs, folded over, ' 'and served with home fries and toast_ \n' '\n' '**Western** \n' 'sautéed onions, peppers, ham and cheese \n' '14\n' '\n' '**Green Goodness** \n' 'sautéed asparagus, zucchini, broccoli, spinach served ' 'with homemade English muffin bread toast \n' '14\n' '\n' '**Athenian** \n' 'spinach, tomatoes, sautéed onions and feta cheese \n' '14\n' '\n' '**Build Your Own** \n' 'plain \n' '9\n' '\n' '**add your own ingredients**\n' '\n' 'cheese: American, Swiss, cheddar, feta\n' '\n' '2 each\n' '\n' 'ham \n' 'bacon \n' 'sausage \n' 'tomato \n' 'onions \n' 'peppers \n' 'mushrooms \n' 'spinach\n' '\n' '2 each\n' '\n' 'fresh avocado \n' 'linguiça\n' '\n' '3 each\n' '\n' 'hash\n' '\n' '4 each\n' '\n' '## Pancakes, French Toast and Waffles\n' '\n' '_add bacon, ham, sausage, chicken sausage 4 or linguica ' '6_ \n' '_100% pure maple syrup 2_ \n' '\n' '**Cinnamon Roll French Toast** \n' 'made with â\x80\x9cHoleâ\x80\x93Inâ\x80\x93Oneâ\x80\x9d ' 'cinnamon roll \n' '10\n' '\n' '**Traditional French Toast** \n' 'made with your choice of white, wheat, or raisin \n' 'two 9 three 10.50\n' '\n' '**Belgian Waffle** \n' 'made with our own recipe with malted flour. 
\n' 'plain 9 warm apples 11\n' '\n' '**Buttermilk Pancakes** \n' 'two 7 three 8.50\n' '\n' '**Fresh Blueberry Pancakes** \n' 'two 9 three 10.50'}, 'score': 0.0, 'values': []}, {'id': '863d270e-4e38-40c6-a22f-f3cabbe2d797', 'metadata': {'chunk': 4.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': '**Fresh Blueberry Pancakes** \n' 'two 9 three 10.50\n' '\n' '**Chocolate Chip Pancakes** \n' 'two 9 three 10.50\n' '\n' '**Fresh Blueberry Crunch Pancakes** \n' 'two 10 three 11.50\n' '\n' '## Breakfast Sandwiches\n' '\n' '**Classic** \n' 'two eggs, cheese and choice of bacon, ham, sausage ' 'patty or linguiça +3 â\x80¦ served on a brioche ' 'roll \n' '8\n' '\n' 'add home fries \n' '3\n' '\n' 'substitute flour wrap, buttermilk biscuit or bagel \n' '1\n' '\n' '**Veggie Focaccia** \n' 'scrambled eggs, spinach, mushrooms, tomatoes, sauteed ' 'onions and cheddar cheese on focaccia breadâ\x80¦ ' 'served with home fries \n' '10\n' '\n' '**California Focaccia** \n' 'scrambled eggs, fresh avocado, fresh tomato and cheddar ' 'cheese on focaccia breadâ\x80¦ served with home ' 'fries \n' '10\n' '\n' '## Sides\n' '\n' '**Potato Pancakes** \n' 'served with sour cream \n' '5 \n' 'add apple sauce 2\n' '\n' '**Biscuit And Gravy** \n' '8\n' '\n' '**Homemade Corned Beef Hash** \n' '8\n' '\n' '**Home Fries** \n' '3\n' '\n' '**Loaded Home Fries** \n' 'loaded with chopped bacon, cheddar cheese, scallions ' 'topped with hollandaise sauce \n' '8\n' '\n' '**Crab Cakes** \n' '15\n' '\n' '**Breakfast Protein** \n' 'Bacon, Ham, Sausage, Canadian Bacon or Chicken sausage ' 'â\x80\x93 4 Linguiça â\x80\x93 6\n' '\n' '**Seasonal Fresh Fruit Cup** \n' 'cup 5\n' '\n' '**Oatmeal** \n' '5 \n' 'add Craisins®, raisins, walnuts or chocolate chips 1 ' 'each \n' 'add fresh blueberries 1.5\n' '\n' '[Fairway Favorites](#favorites) | ' '[Benedicts](#benedicts) | [Egg Combos](#eggcombos) | ' '[Omelets](#omelets) | [Pancakes & French ' 'Toast](#pancakes) | [Breakfast Sandwiches](#sandwiches) ' '| [Sides](#sides)\n' '\n' 'Follow us on Facebook\n' '\n' 'Comments Box SVG iconsUsed for the like, share, ' 'comment, and reaction icons\n' '\n' '[ Fairway Restaurant & Pizzeria ' '](https://facebook.com/354293134601213) \n' '\n' ' 14 hours ago'}, 'score': 0.0, 'values': []}, {'id': 'e274e22b-cfc0-444c-8f65-59cd0c2480e4', 'metadata': {'chunk': 5.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': '14 hours ago \n' '\n' '[ ](https://facebook.com/354293134601213) \n' '\n' 'Today, we pause to honor and thank our veterans for ' 'their courage, dedication, and sacrifice. Your service ' 'means the world to us. ð\x9f\x87ºð\x9f\x87¸ From all of ' 'us at Fairway, Happy Veterans Day! ... 
[See MoreSee ' 'Less](#) \n' '\n' '[View](https://scontent-iad3-2.xx.fbcdn.net/v/t39.30808-6/465907081%5F1137112068259222%5F696551884986646616%5Fn.jpg?stp=dst-jpg%5Fs720x720&%5Fnc%5Fcat=111&ccb=1-7&%5Fnc%5Fsid=127cfc&%5Fnc%5Fohc=bt-%5FxPu6sKwQ7kNvgHyMWFl&%5Fnc%5Fzt=23&%5Fnc%5Fht=scontent-iad3-2.xx&edm=AKIiGfEEAAAA&%5Fnc%5Fgid=AaQzlHjjBMgFxbUUdAhDSBi&oh=00%5FAYBodCsyQaAS0ym8GklrCWbrRkGMXBr5QYNpcprr-sGniA&oe=6738D9F7)\n' '\n' '0 Comments[Comment on ' 'Facebook](https://www.facebook.com/354293134601213%5F1139471834689912)\n' '\n' 'Find Us on Instagram\n' '\n' '[](https://www.instagram.com/fairway%5Frestaurant%5Fcapecod/) ' '#riseandshine #homemade #breakfast #capecod #yummy ' '#comfortfood #eastham #homemade #dessert #flavors ' '#openyearround #open #locals #dinner #steak #pasta ' '#tasty #family #friends #eat\n' '\n' 'Get Our E-newsletter:\n' '\n' 'Featuring: Specials, News, Events and More!'}, 'score': 0.0, 'values': []}, {'id': '71d31904-5586-41a5-8318-5689652cae7b', 'metadata': {'chunk': 6.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': 'Get Our E-newsletter:\n' '\n' 'Featuring: Specials, News, Events and More!\n' '\n' '[ ' '](https://www.tripadvisor.com/Restaurant%5FReview-g41555-d385199-Reviews-Fairway%5FRestaurant%5FPizzeria-Eastham%5FCape%5FCod%5FMassachusetts.html)[ ' '](https://www.yelp.com/biz/fairway-restaurant-and-pizzeria-north-eastham)[](https://www.facebook.com/fairwayrestaurantcapecod)\n' '\n' 'Give Us a Review\n' '\n' 'â\x80\x9cExcellent place with great service. Fresh food ' 'and large servings. Love breakfast but also ' 'dinner.â\x80\x9d\n' '\n' '\\-Denise K.\n' '\n' 'The Fairway Restaurant and Pizzeria \n' '4295 State Highway North Eastham, MA 02642\n' '\n' '[508.255.3893](tel:508-255-3893)\n' '\n' 'HOURS: \n' 'Breakfast: 6:30 AM â\x80\x93 12:00 PM \n' 'Dinner: \n' '4:00 PM â\x80\x93 8:00 PM SUN-THURS \n' '4:00 PM â\x80\x93 9:00 PM FRI-SAT\n' '\n' '[Privacy ' 'Policy](https://www.fairwaycapecod.com/privacy/)\n' '\n' '[Sitemap](https://www.fairwaycapecod.com/sitemap/)\n' '\n' '© 2024 The Hole In One Group | All Rights Reserved\n' '\n' 'PreviousNext\n' '\n' 'View on Facebook'}, 'score': 0.0, 'values': []}, {'id': '49821323-addd-4d0e-8903-db3827a2ae8a', 'metadata': {'chunk': 0.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': '[508-255-3893](tel:508-255-3893) \n' '\n' '[ ](https://www.fairwaycapecod.com/) \n' '\n' '* [Home](https://www.fairwaycapecod.com/)\n' '* [Menus](https://www.fairwaycapecod.com/menu/) \n' ' * [î\x84\x81 Order ' 'Online](https://direct.chownow.com/order/20872/locations/30204) \n' ' * ' '[Breakfast](https://www.fairwaycapecod.com/menu/breakfast/) \n' ' * ' '[Dinner](https://www.fairwaycapecod.com/menu/dinner/) \n' ' * [Weekly ' 'Specials](https://www.fairwaycapecod.com/daily-specials/) \n' ' * [Family Style ' 'Takeout](https://www.fairwaycapecod.com/menu/home-meal-replacement/) \n' ' * [Homemade ' 'Desserts](https://www.fairwaycapecod.com/menu/homemade-desserts/) \n' ' * [From the ' 'Tavern](https://www.fairwaycapecod.com/menu/from-the-tavern/) \n' ' * ' '[Beverages](https://www.fairwaycapecod.com/menu/beverages/)\n' '* [ Order ' 'Online](https://direct.chownow.com/order/20872/locations/30204)\n' '* [About ' 'Us](https://www.fairwaycapecod.com/about-us/) \n' ' * ' '[History](https://www.fairwaycapecod.com/about-us/history/) \n' ' * [Photo ' 'Gallery](https://www.fairwaycapecod.com/photo-gallery/) \n' ' * [Customer ' 
'Reviews](https://www.fairwaycapecod.com/about-us/customer-reviews/)\n' '* [Location](https://www.fairwaycapecod.com/location/)\n' '* [Store](https://www.fairwaycapecod.com/store/)\n' '* [Contact ' 'Us](https://www.fairwaycapecod.com/contact/) \n' ' * ' '[Employment](https://www.fairwaycapecod.com/contact/employment/)\n' '* [The Hole in One](https://theholecapecod.com/)'}, 'score': 0.0, 'values': []}, {'id': 'fb3d38f1-32a9-4366-a040-96f181eb781d', 'metadata': {'chunk': 1.0, 'file_name': 'www-fairwaycapecod-com-menu-breakfast-62826.txt', 'is_dict': 'no', 'text': 'Select Page \n' '* [Home](https://www.fairwaycapecod.com/)\n' '* [Menus](https://www.fairwaycapecod.com/menu/) \n' ' * [î\x84\x81 Order ' 'Online](https://direct.chownow.com/order/20872/locations/30204) \n' ' * ' '[Breakfast](https://www.fairwaycapecod.com/menu/breakfast/) \n' ' * ' '[Dinner](https://www.fairwaycapecod.com/menu/dinner/) \n' ' * [Weekly ' 'Specials](https://www.fairwaycapecod.com/daily-specials/) \n' ' * [Family Style ' 'Takeout](https://www.fairwaycapecod.com/menu/home-meal-replacement/) \n' ' * [Homemade ' 'Desserts](https://www.fairwaycapecod.com/menu/homemade-desserts/) \n' ' * [From the ' 'Tavern](https://www.fairwaycapecod.com/menu/from-the-tavern/) \n' ' * ' '[Beverages](https://www.fairwaycapecod.com/menu/beverages/)\n' '* [ Order ' 'Online](https://direct.chownow.com/order/20872/locations/30204)\n' '* [About ' 'Us](https://www.fairwaycapecod.com/about-us/) \n' ' * ' '[History](https://www.fairwaycapecod.com/about-us/history/) \n' ' * [Photo ' 'Gallery](https://www.fairwaycapecod.com/photo-gallery/) \n' ' * [Customer ' 'Reviews](https://www.fairwaycapecod.com/about-us/customer-reviews/)\n' '* [Location](https://www.fairwaycapecod.com/location/)\n' '* [Store](https://www.fairwaycapecod.com/store/)\n' '* [Contact ' 'Us](https://www.fairwaycapecod.com/contact/) \n' ' * ' '[Employment](https://www.fairwaycapecod.com/contact/employment/)\n' '* [The Hole in One](https://theholecapecod.com/)\n' '\n' 'Breakfast\n' '\n' '[Fairway Favorites](#favorites) | ' '[Benedicts](#benedicts) | [Egg Combos](#eggcombos) | ' '[Omelets](#omelets) | [Pancakes & French ' 'Toast](#pancakes) | [Breakfast Sandwiches](#sandwiches) ' '| [Sides](#sides)'}, 'score': 0.0, 'values': []}, {'id': '3cc1a48c-4071-4b74-9ca0-ff0eeba0a173-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 107.0, 'text': 'inarydissimilarity.Ifobjectsiandjaredescribedbysymmetricbinaryattributes,thentheTable2.3ContingencyTableforBinaryAttributesObjectj10sum1qrq+rObjecti0sts+tsumq+sr+tp'}, 'score': 0.0, 'values': []}, {'id': 'a34634dd-1373-44b1-a858-755ea04d754c-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 108.0, 'text': 
'dissimilarity between i and j is \\( d(i,j) = \\frac{r+s}{q+r+s+t}. \\) (2.13) For asymmetric binary attributes, the two states are not equally important, such as the positive (1) and negative (0) outcomes of a disease test. Given two asymmetric binary attributes, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary attributes are often considered “monary” (having one state). The dissimilarity based on these attributes is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and is thus ignored in the following computation: \\( d(i,j) = \\frac{r+s}{q+r+s}. \\) (2.14) Complementarily, we can measure the difference between two binary attributes based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects i and j can be computed as \\( sim(i,j) = \\frac{q}{q+r+s} = 1 - d(i,j). \\) (2.15) The coefficient sim(i,j) of Eq. (2.15) is called the Jaccard coefficient and is popularly referenced in the literature. When both symmetric and asymmetric binary attributes occur in the same data set, the mixed attributes approach described in Section 2.4.6 can be applied. Example 2.18 Dissimilarity between binary attributes. Suppose that a patient record table (Table 2.4) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary. For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects Table 2.4 Relational Table Where Patients Are Described by Binary Attributes: name | gender | fever | cough | test-1 | test-2 | test-3 | test-4; Jack | M | Y | N | P | N | N | N; Jim | M | Y | Y | N | N | N | N; Mary | F | Y | N | P | N | P | N; ...'}, 'score': 0.0, 'values': []}, {'id': '5b0f1b89-6306-4951-a328-590c700cc73e-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 109.0, 'text': '(patients) is computed based only on the asymmetric attributes. According to Eq. (2.14), the distance between each pair of the three patients, Jack, Mary, and Jim, is \\( d(Jack, Jim) = \\frac{1+1}{1+1+1} = 0.67 \\), \\( d(Jack, Mary) = \\frac{0+1}{2+0+1} = 0.33 \\), \\( d(Jim, Mary) = \\frac{1+2}{1+1+2} = 0.75 \\). These measurements suggest that Jim and Mary are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs. Of the three patients, Jack and Mary are the most likely to have a similar disease. 2.4.4 Dissimilarity of Numeric Data: Minkowski Distance. In this section, we describe distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan, and Minkowski distances. In some cases, the data are normalized before applying distance calculations. This involves transforming the data to fall within a smaller or common range, such as [−1, 1] or [0.0, 1.0]. Consider a height attribute, for example, which could be measured in either meters or inches. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such attributes greater effect or “weight.” Normalizing the data attempts to give all attributes an equal weight. It may or may not be useful in a particular application. Methods for normalizing data are discussed in detail in Chapter 3 on data preprocessing. The most popular distance measure is Euclidean distance (i.e., straight line or “as the crow flies”). Let i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) be two objects described by p numeric attributes. The Euclidean distance between objects i and j is defined as \\( d(i,j) = \\sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \\cdots + (x_{ip}-x_{jp})^2}. \\) (2.16) Another well-known measure is the Manhattan (or city block) distance, named so because it is the distance in blocks between any two points in a city (such as 2 blocks down and 3 blocks over for a total of 5 blocks). It is defined as \\( d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \\cdots + |x_{ip}-x_{jp}|. \\) (2.17) Both the Euclidean and the Manhattan distances satisfy the following mathematical properties: Non-negativity: d(i,j) ≥ 0: Distance is a non-negative number. Identity of indiscernibles: d(i,i) = 0: The distance of an ob'}, 'score': 0.0, 'values': []}, {'id': '6e96deee-25b6-4b90-ab1e-244ca409759f-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 110.0, 'text': 'Symmetry: d(i,j) = d(j,i): Distance is a symmetric function. Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j): Going directly from object i to object j in space is no more than making a detour over any other object k. A measure that satisfies these conditions is known as metric. Please note that the non-negativity property is implied by the other three properties. Example 2.19 Euclidean distance and Manhattan distance. Let x1 = (1,2) and x2 = (3,5) represent two objects as shown in Figure 2.23. The Euclidean distance between the two is \\( \\sqrt{2^2 + 3^2} = 3.61 \\). The Manhattan distance between the two is 2 + 3 = 5. Minkowski distance is a generalization of the Euclidean and Manhattan distances. It is defined as \\( d(i,j) = \\sqrt[h]{|x_{i1}-x_{j1}|^h + |x_{i2}-x_{j2}|^h + \\cdots + |x_{ip}-x_{jp}|^h}, \\) (2.18) where h is a real number such that h ≥ 1. (Such a distance is also called Lp norm in some literature, where the symbol p refers to our notation of h. We have kept p as the number of attributes to be consistent with the rest of this chapter.) It represents the Manhattan distance when h = 1 (i.e., L1 norm) and Euclidean distance when h = 2 (i.e., L2 norm). The supremum distance (also referred to as Lmax, L∞ norm and as the Chebyshev distance) is a generalization of the Minkowski distance for h → ∞. To compute it, we find the attribute f that gives the maximum difference in values between the two objects. This difference is the supremum distance, defined more formally as: \\( d(i,j) = \\lim_{h \\to \\infty} \\left( \\sum_{f=1}^{p} |x_{if}-x_{jf}|^h \\right)^{1/h} = \\max_{f=1}^{p} |x_{if}-x_{jf}|. \\) (2.19) The L∞ norm is also known as the uniform norm. Figure 2.23 Euclidean, Manhattan, and supremum distances between two objects: x1 = (1,2), x2 = (3,5); Euclidean distance = 3.61, Manhattan distance = 2 + 3 = 5, supremum distance = 5 − 2 = 3.'}, 'score': 0.0, 'values': []}, {'id': '6ca2193e-ccf1-418d-af99-369b85b32871-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 111.0, 'text': 
'Example 2.20 Supremum distance. Let’s use the same two objects, x1 = (1,2) and x2 = (3,5), as in Figure 2.23. The second attribute gives the greatest difference between values for the objects, which is 5 − 2 = 3. This is the supremum distance between both objects. If each attribute is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as \\( d(i,j) = \\sqrt{w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \\cdots + w_p|x_{ip}-x_{jp}|^2}. \\) (2.20) Weighting can also be applied to other distance measures as well. 2.4.5 Proximity Measures for Ordinal Attributes. The values of an ordinal attribute have a meaningful order or ranking about them, yet the magnitude between successive values is unknown (Section 2.1.4). An example includes the sequence small, medium, large for a size attribute. Ordinal attributes may also be obtained from the discretization of numeric attributes by splitting the value range into a finite number of categories. These categories are organized into ranks. That is, the range of a numeric attribute can be mapped to an ordinal attribute f having M_f states. For example, the range of the interval-scaled attribute temperature (in Celsius) can be organized into the following states: −30 to −10, −10 to 10, 10 to 30, representing the categories cold temperature, moderate temperature, and warm temperature, respectively. Let M represent the number of possible states that an ordinal attribute can have. These ordered states define the ranking 1, ..., M_f. “How are ordinal attributes handled?” The treatment of ordinal attributes is quite similar to that of numeric attributes when computing dissimilarity between objects. Suppose that f is an attribute from a set of ordinal attributes describing n objects. The dissimilarity computation with respect to f involves the following steps: 1. The value of f for the ith object is x_if, and f has M_f ordered states, representing the ranking 1, ..., M_f. Replace each x_if by its corresponding rank, r_if ∈ {1, ..., M_f}. 2. Since each ordinal attribute can have a different number of states, it is often necessary to map the range of each attribute onto [0.0, 1.0] so that each attribute has equal weight. We perform such data normalization by replacing the rank r_if of the ith object in the fth att'}, 'score': 0.0, 'values': []}, {'id': '6ca2193e-ccf1-418d-af99-369b85b32871-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 111.0, 'text': 'ith object in the fth attribute by \\( z_{if} = \\frac{r_{if}-1}{M_f-1}. \\) (2.21) 3. Dissimilarity can then be computed using any of the distance measures described in Section 2.4.4 for numeric attributes, using z_if to represent the f value for the ith object.'}, 'score': 0.0, 'values': []}, {'id': '00798edb-9107-4515-9ee8-e2e369f9e6c2-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 112.0, 'text': 'Example 2.21 Dissimilarity between ordinal attributes. Suppose that we have the sample data shown earlier in Table 2.2, except that this time only the object-identifier and the continuous ordinal attribute, test-2, are available. There are three states for test-2: fair, good, and excellent, that is, M_f = 3. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity matrix (lower triangle, rows 1-4): row 1: 0; row 2: 1.0, 0; row 3: 0.5, 0.5, 0; row 4: 0, 1.0, 0.5, 0. Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2,1) = 1.0 and d(4,2) = 1.0). This makes intuitive sense since objects 1 and 4 are both excellent. Object 2 is fair, which is at the opposite end of the range of values for test-2. Similarity values for ordinal attributes can be interpreted from dissimilarity as sim(i,j) = 1 − d(i,j). 2.4.6 Dissimilarity for Attributes of Mixed Types. Sections 2.4.2 through 2.4.5 discussed how to compute the dissimilarity between objects described by attributes of the same type, where these types may be either nominal, symmetric binary, asymmetric binary, numeric, or ordinal. However, in many real databases, objects are described by a mixture of attribute types. In general, a database can contain all of these attribute types. “So, how can we compute the dissimilarity between objects of mixed attribute types?” One approach is to group each type of attribute together, performing separate data mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate analysis per attribute type will generate compatible results. A more preferable approach is to process all attribute types together, performing a single analysis. One such technique combines the different attributes into a single dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the interval [0.0, 1.0]. Suppose that the data set contains p attributes of mixed type. The dissimilar'}, 'score': 0.0, 'values': []}, {'id': '00798edb-9107-4515-9ee8-e2e369f9e6c2-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 112.0, 'text': 'ed type. The dissimilarity d(i,j) between objects i and j is defined as \\( d(i,j) = \\frac{\\sum_{f=1}^{p} \\delta_{ij}^{(f)} d_{ij}^{(f)}}{\\sum_{f=1}^{p} \\delta_{ij}^{(f)}}, \\) (2.22)'}, 'score': 0.0, 'values': []}, {'id': '745355e1-76ed-4cf2-817c-413772189300-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 113.0, 'text': 'where the indicator δ_ij^(f) = 0 if either (1) x_if or x_jf is missing (i.e., there is no measurement of attribute f for object i or object j), or (2) x_if = x_jf = 0 and attribute f is asymmetric binary; otherwise, δ_ij^(f) = 1. The contribution of attribute f to the dissimilarity between i and j (i.e., d_ij^(f)) is computed dependent on its type: If f is numeric: \\( d_{ij}^{(f)} = \\frac{|x_{if}-x_{jf}|}{\\max_h x_{hf} - \\min_h x_{hf}} \\), where h runs over all nonmissing objects for attribute f. If f is nominal or binary: d_ij^(f) = 0 if x_if = x_jf; otherwise, d_ij^(f) = 1. If f is ordinal: compute the ranks r_if and \\( z_{if} = \\frac{r_{if}-1}{M_f-1} \\), and treat z_if as numeric. These steps are identical to what we have already seen for each of the individual attribute types. The only difference is for numeric attributes, where we normalize so that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects can be computed even when the attributes describing the objects are of different types. Example 2.22 Dissimilarity between attributes of mixed type. Let’s compute a dissimilarity matrix for the objects in Table 2.2. Now we will consider all of the attributes, which are of different types. In Examples 2.17 and 2.21, we worked out the dissimilarity matrices for each of the individual attributes. The procedures we followed for test-1 (which is nominal) and test-2 (which is ordinal) are the same as outlined earlier for processing attributes of mixed types. Therefore, we can use the dissimilarity matrices obtained for test-1 and test-2 later when we compute Eq. (2.22). First, however, we need to compute the dissimilarity matrix for the third attribute, test-3 (which is numeric). That is, we must compute d_ij^(3). Following the case for numeric attributes, we let max_h x_h = 64 and min_h x_h = 22. The difference between the two is used in Eq. (2.22) to normalize the values of the dissimilarity matrix. The resulting dissimilarity matrix for test-3 is (lower triangle, rows 1-4): row 1: 0; row 2: 0.55, 0; row 3: 0.45, 1.00, 0; row 4: 0.40, 0.14, 0.86, 0. We can now use the dissimilarity matrices for the three attributes in our computation of Eq. (2.22). The indicator δ_ij^(f) = 1 for each of the three attributes, f. We get, for example, \\( d(3,1) = \\frac{1(1) + 1(0.50) + 1(0.45)}{3} = 0.65. \\) The resulting dissimilarity matrix obtained for the'}, 'score': 0.0, 'values': []}, {'id': '8ba288cd-7e37-45d0-bc1b-b1a31217bf09-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 114.0, 'text': 
'data described by the three attributes of mixed types is (lower triangle, rows 1-4): row 1: 0; row 2: 0.85, 0; row 3: 0.65, 0.83, 0; row 4: 0.13, 0.71, 0.79, 0. From Table 2.2, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4,1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 1 and 2 are the least similar. 2.4.7 Cosine Similarity. A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector. For example, in Table 2.5, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric. Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values). Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping. The traditional distance measures that we have studied in this chapter do not work well for such sparse numeric data. For example, two term-frequency vectors may have many 0 values in common, meaning that the corresponding documents do not share many words, but this does not make them similar. We need a measure that will focus on the words that the two documents do have in common, and the occurrence frequency of such words. In other words, we need a measure for numeric data that ignores zero-matches. Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison. Using the cosine measure as a Table 2.5 Document Vector or Term-Frequency Vector: Document | team | coach | hockey | baseball | soccer | penalty | score | win | loss | season; Document1 | 5 | 0 | 3 | 0 | 2 | 0 | 0 | 2 | 0 | 0; Document2 | 3 | 0 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1; Document3 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0; Document4 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0.'}, 'score': 0.0, 'values': []}, {'id': '10571b4e-c7a7-43d2-81d3-ad4c6dc0c7c7-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 115.0, 'text': 'similarity function, we have \\( sim(x,y) = \\frac{x \\cdot y}{||x|| \\, ||y||}, \\) (2.23) where ||x|| is the Euclidean norm of vector x = (x_1, x_2, ..., x_p), defined as \\( \\sqrt{x_1^2 + x_2^2 + \\cdots + x_p^2} \\). Conceptually, it is the length of the vector. Similarly, ||y|| is the Euclidean norm of vector y. The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between vectors. Note that because the cosine similarity measure does not obey all of the properties of Section 2.4.4 defining metric measures, it is referred to as a nonmetric measure. Example 2.23 Cosine similarity between two term-frequency vectors. Suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y? Using Eq. (2.23) to compute the cosine similarity between the two vectors, we get: x^t · y = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25, \\( ||x|| = \\sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48 \\), \\( ||y|| = \\sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12 \\), sim(x,y) = 0.94. Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar. When attributes are binary-valued, the cosine similarity function can be interpreted in terms of shared features or attributes. Suppose an object x possesses the ith attribute if x_i = 1. Then x^t · y is the number of attributes possessed (i.e., shared) by both x and y, and |x||y| is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus, sim(x,y) is a measure of relative possession of common attributes. A simple variation of cosine similarity for the preceding scenario is \\( sim(x,y) = \\frac{x \\cdot y}{x \\cdot x + y \\cdot y - x \\cdot y}, \\) (2.24) which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biology taxonomy.'}, 'score': 0.0, 'values': []}, {'id': '9ce12bab-3348-48ba-9e46-ef560c31e09d-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 116.0, 'text': 
'HAN09-ch02-039-082-97801238147912011/6/13:15Page79#412.6Exercises792.5SummaryDatasetsaremadeupofdataobjects.Adataobjectrepresentsanentity.Dataobjectsaredescribedbyattributes.Attributescanbenominal,binary,ordinal,ornumeric.Thevaluesofanominal(orcategorical)attributearesymbolsornamesofthings,whereeachvaluerepresentssomekindofcategory,code,orstate.Binaryattributesarenominalattributeswithonlytwopossiblestates(suchas1and0ortrueandfalse).Ifthetwostatesareequallyimportant,theattributeissymmetric;otherwiseitisasymmetric.Anordinalattributeisanattributewithpossiblevaluesthathaveameaningfulorderorrankingamongthem,butthemagnitudebetweensuccessivevaluesisnotknown.Anumericattributeisquantitative(i.e.,itisameasurablequantity)representedinintegerorrealvalues.Numericattributetypescanbeinterval-scaledorratio-scaled.Thevaluesofaninterval-scaledattributearemeasuredinfixedandequalunits.Ratio-scaledattributesarenumericattributeswithaninherentzero-point.Measurementsareratio-scaledinthatwecanspeakofvaluesasbeinganorderofmagnitudelargerthantheunitofmeasurement.Basicstatisticaldescriptionsprovidetheanalyticalfoundationfordatapreprocess-ing.Thebasicstatisticalmeasuresfordatasummarizationincludemean,weightedmean,median,andmodeformeasuringthecentraltendencyofdata;andrange,quan-tiles,quartiles,interquartilerange,variance,andstandarddeviationformeasuringthedispersionofdata.Graphicalrepresentations(e.g.,boxplots,quantileplots,quantile–quantileplots,histograms,andscatterplots)facilitatevisualinspectionofthedataandarethususefulfordatapreprocessingandmining.Datavisualizationtechniquesmaybepixel-oriented,geometric-based,icon-based,orhierarchical.Thesemethodsapplytomultidimensionalrelationaldata.Additionaltechniqueshavebeenproposedforthevisualizationofcomplexdata,suchastextandsocialnetworks.Measuresofobjectsimilarityanddissimilarityareusedindataminingapplicationssuchasclustering,outlieranalysis,andnearest-neighborclassification.Suchmea-suresofproximitycanbecomputedforeachattributetypestudiedinthischapt'}, 'score': 0.0, 'values': []}, {'id': '9ce12bab-3348-48ba-9e46-ef560c31e09d-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 116.0, 'text': 'pestudiedinthischapter,orforcombinationsofsuchattributes.ExamplesincludetheJaccardcoefficientforasymmetricbinaryattributesandEuclidean,Manhattan,Minkowski,andsupremumdistancesfornumericattributes.Forapplicationsinvolvingsparsenumericdatavec-tors,suchasterm-frequencyvectors,thecosinemeasureandtheTanimotocoefficientareoftenusedintheassessmentofsimilarity.2.6Exercises2.1Givethreeadditionalcommonlyusedstatisticalmeasuresthatarenotalreadyillus-tratedinthischapterforthecharacterizationofdatadispersion.Discusshowtheycanbecomputedefficientlyinlargedatabases.'}, 'score': 0.0, 'values': []}, {'id': '90f3aa25-1025-4585-be3f-3788c79835df-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 116.0, 'text': 
'pestudiedinthischapter,orforcombinationsofsuchattributes.ExamplesincludetheJaccardcoefficientforasymmetricbinaryattributesandEuclidean,Manhattan,Minkowski,andsupremumdistancesfornumericattributes.Forapplicationsinvolvingsparsenumericdatavec-tors,suchasterm-frequencyvectors,thecosinemeasureandtheTanimotocoefficientareoftenusedintheassessmentofsimilarity.2.6Exercises2.1Givethreeadditionalcommonlyusedstatisticalmeasuresthatarenotalreadyillus-tratedinthischapterforthecharacterizationofdatadispersion.Discusshowtheycanbecomputedefficientlyinlargedatabases.'}, 'score': 0.0, 'values': []}, {'id': '38a8fd99-e27f-47e5-bb2d-1148ecf01034-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 117.0, 'text': '# Chapter 2 Getting to Know Your Data\n' '\n' '## 2.2 Data Analysis\n' '\n' 'Suppose that the data for analysis includes the ' 'attribute `age`. The age values for the data tuples are ' '(in increasing order) 13, 15, 16, 16, 19, 20, 21, 22, ' '22, 25, 25, 30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, ' '70.\n' '\n' '1. **What is the mean of the data?**\n' '2. **What is the median?**\n' '3. **What is the mode of the data?** Comment on the ' "data's modality (i.e., bimodal, trimodal, etc.).\n" '4. **What is the midrange of the data?**\n' '5. **Can you find (roughly) the first quartile (Q1) and ' 'the third quartile (Q3) of the data?**\n' '6. **Give the five-number summary of the data.**\n' '7. **Show a boxplot of the data.**\n' '8. **How is a quantile–quantile plot different from a ' 'quantile plot?**\n' '\n' '## 2.3 Frequency Distribution\n' '\n' 'Suppose that the values for a given set of data are ' 'grouped into intervals. The intervals and corresponding ' 'frequencies are as follows:\n' '\n' '| Age | Frequency |\n' '|-------|-----------|\n' '| 1–5 | 200 |\n' '| 6–15 | 450 |\n' '| 16–20 | 300 |\n' '| 21–50 | 1500 |\n' '| 51–80 | 700 |\n' '| 81–110| 44 |\n' '\n' '**Compute an approximate median value for the data.**\n' '\n' '## 2.4 Age and Body Fat Data\n' '\n' 'Suppose that a hospital tested the age and body fat ' 'data for 18 randomly selected adults with the following ' 'results:\n' '\n' '| Age | | | | | | | | | | | | ' '| | | | | |\n' '|-----|----|----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n' '| | 23 | 23 | 27| 27| 39| 41| 47| 49| 50| | | ' '| | | | | |\n' '| %fat| 9.5| 26.5| 7.8| 17.8| 31.4| 25.9| 27.4| 27.2| ' '31.2| | | | | | | | |\n' '\n' '| Age | | | | | | | | | | | | ' '| | | | | |\n' '|-----|----|----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n' '| | 52 | 54 | 54| 56| 57| 58| 58| 60| 61| | | ' '| | | | | |\n' '| %fat| 34.6| 42.5| 28.8| 33.4| 30.2| 34.1| 32.9| 41.2| ' '35.7| | | | | | | | |\n' '\n' '1. **Calculate the mean, median, and standard deviation ' 'of age and %fat.**\n' '2. **Draw the boxplots for age and %fat.**\n' '3. 
**Draw a scatter plot and a q-q plot based on these ' 'two variables.**\n' '\n' '## 2.5 Dissimilarity of Objects\n' '\n' 'Briefly outline how to compute the dissimilarity ' 'between objects described by the following:\n' '\n' '- **Nominal attributes**\n' '- **Asymmetric binary attributes**'}, 'score': 0.0, 'values': []}, {'id': '4cb89d22-c459-42bf-960e-ed438ce12a1e-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 117.0, 'text': '# Chapter 2: Getting to Know Your Data\n' '\n' '## 2.2 \n' '\n' 'Suppose that the data for analysis includes the ' 'attribute **age**. The age values for the data tuples ' 'are (in increasing order) 13, 15, 16, 16, 19, 20, 21, ' '22, 22, 25, 25, 30, 33, 33, 35, 35, 35, 36, 40, 45, 46, ' '52, 70.\n' '\n' '(a) What is the **mean** of the data? What is the ' '**median**? \n' '(b) What is the **mode** of the data? Comment on the ' "data's modality (i.e., bimodal, trimodal, etc.). \n" '(c) What is the **range** of the data? \n' '(d) Can you find (roughly) the first quartile (Q1) and ' 'the third quartile (Q3) of the data? \n' '(e) Give the **five-number summary** of the data. \n' '(f) Show a **boxplot** of the data. \n' '(g) How is a **quantile–quantile plot** different from ' 'a **quantile plot**?\n' '\n' '## 2.3 \n' '\n' 'Suppose that the values for a given set of data are ' 'grouped into intervals. The intervals and corresponding ' 'frequencies are as follows:\n' '\n' '| age | frequency |\n' '|----------|-----------|\n' '| 1–5 | 200 |\n' '| 6–15 | 450 |\n' '| 16–20 | 300 |\n' '| 21–50 | 1500 |\n' '| 51–80 | 700 |\n' '| 81–110 | 44 |\n' '\n' 'Compute an **approximate median** value for the data.\n' '\n' '## 2.4 \n' '\n' 'Suppose that a hospital tested the age and body fat ' 'data for 18 randomly selected adults with the following ' 'results:\n' '\n' '| age | 23 | 23 | 27 | 27 | 39 | 41 | 47 | 49 | 50 |\n' '|-----|----|----|----|----|----|----|----|----|----|\n' '| %fat | 9.5 | 26.5 | 7.8 | 17.8 | 31.4 | 25.9 | 27.4 | ' '27.2 | 31.2 |\n' '\n' '| age | 52 | 54 | 54 | 56 | 57 | 58 | 58 | 60 | 61 |\n' '|-----|----|----|----|----|----|----|----|----|----|\n' '| %fat | 34.6 | 42.5 | 28.8 | 33.4 | 30.2 | 34.1 | 32.9 ' '| 41.2 | 35.7 |\n' '\n' '(a) Calculate the mean, median, and standard deviation ' 'of age and %fat. \n' '(b) Draw the **boxplots** for age and %fat. \n' '(c) Draw a **scatter plot** and a **q-q plot** based on ' 'these two variables.\n' '\n' '## 2.5 \n' '\n' 'Briefly outline how to compute the dissimilarity ' 'between objects described by the following: \n' '(a) **Nominal attributes** \n' '(b) **Asymmetric binary attributes**'}, 'score': 0.0, 'values': []}, {'id': 'ae4c78ca-f450-4a00-a667-4875b4df40cb-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 118.0, 'text': '## 2.7 Bibliographic Notes\n' '\n' '### 2.6\n' '\n' 'Given two objects represented by the tuples (22, 1, 42, ' '10) and (20, 0, 36, 8):\n' '\n' '1. Compute the **Euclidean distance** between the two ' 'objects.\n' '2. Compute the **Manhattan distance** between the two ' 'objects.\n' '3. Compute the **Minkowski distance** between the two ' 'objects, using \\( q = 3 \\).\n' '4. Compute the **supremum distance** between the two ' 'objects.\n' '\n' '### 2.7\n' '\n' 'The median is one of the most important holistic ' 'measures in data analysis. 
Propose several methods for ' 'median approximation. Analyze their respective ' 'complexity under different parameter settings and ' 'decide to what extent the real value can be ' 'approximated. Moreover, suggest a heuristic strategy to ' 'balance between accuracy and complexity and then apply ' 'it to all methods you have given.\n' '\n' '### 2.8\n' '\n' 'It is important to define or select similarity measures ' 'in data analysis. However, there is no commonly ' 'accepted subjective similarity measure. Results can ' 'vary depending on the similarity measures used. ' 'Nonetheless, seemingly different similarity measures ' 'may be equivalent after some transformation.\n' '\n' 'Suppose we have the following 2-D data set:\n' '\n' '| A₁ | A₂ |\n' '|------|------|\n' '| x₁ | 1.5 | 1.7 |\n' '| x₂ | 2.2 | 1.9 |\n' '| x₃ | 1.6 | 1.8 |\n' '| x₄ | 1.2 | 1.5 |\n' '| x₅ | 1.5 | 1.0 |\n' '\n' '1. Consider the data as 2-D data points. Given a new ' 'data point, \\( x = (1.4, 1.6) \\) as a query, rank the ' 'database points based on similarity with the query ' 'using Euclidean distance, Manhattan distance, supremum ' 'distance, and cosine similarity.\n' '2. Normalize the data set to make the norm of each data ' 'point equal to 1. Use Euclidean distance on the ' 'transformed data to rank the data points.\n' '\n' '### 2.7 Bibliographic Notes\n' '\n' 'Methods for descriptive data summarization have been ' 'studied in the statistics literature long before the ' 'onset of computers. Good summaries of statistical ' 'descriptive data mining methods include Freedman, ' 'Pisani, and Purves [FP97] and Devore [Dev95].'}, 'score': 0.0, 'values': []}, {'id': '8e54491b-8ee7-4310-b6b5-a1417129d6dc-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 118.0, 'text': '# 2.7 Bibliographic Notes\n' '\n' '## 2.6\n' 'Given two objects represented by the tuples (22, 1, 42, ' '10) and (20, 0, 36, 8):\n' '\n' '1. Compute the Euclidean distance between the two ' 'objects.\n' '2. Compute the Manhattan distance between the two ' 'objects.\n' '3. Compute the Minkowski distance between the two ' 'objects, using \\( q = 3 \\).\n' '4. Compute the supremum distance between the two ' 'objects.\n' '\n' '## 2.7\n' 'The median is one of the most important holistic ' 'measures in data analysis. Propose several methods for ' 'median approximation. Analyze their respective ' 'complexity under different parameter settings and ' 'decide to what extent the real value can be ' 'approximated. Moreover, suggest a heuristic strategy to ' 'balance between accuracy and complexity and then apply ' 'it to all methods you have given.\n' '\n' '## 2.8\n' 'It is important to define or select similarity measures ' 'in data analysis. However, there is no commonly ' 'accepted subjective similarity measure. Results can ' 'vary depending on the similarity measures used. ' 'Nonetheless, seemingly different similarity measures ' 'may be equivalent after some transformation.\n' '\n' 'Suppose we have the following 2-D data set:\n' '\n' '| | A₁ | A₂ |\n' '|-----|-------|-------|\n' '| x₁ | 1.5 | 1.7 |\n' '| x₂ | 2.2 | 1.9 |\n' '| x₃ | 1.6 | 1.8 |\n' '| x₄ | 1.2 | 1.5 |\n' '| x₅ | 1.5 | 1.0 |\n' '\n' '1. Consider the data as 2-D data points. 
Given a new ' 'data point, \\( x = (1.4, 1.6) \\) as a query, rank the ' 'database points based on similarity with the query ' 'using Euclidean distance, Manhattan distance, supremum ' 'distance, and cosine similarity.\n' '2. Normalize the data set to make the norm of each data ' 'point equal to 1. Use Euclidean distance on the ' 'transformed data to rank the data points.'}, 'score': 0.0, 'values': []}, {'id': '159bf86c-fefb-4522-88f6-1cc3887343b1-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 119.0, 'text': 'HAN09-ch02-039-082-97801238147912011/6/13:15Page82#4482Chapter2GettingtoKnowYourDatastatistics-basedvisualizationofdatausingboxplots,quantileplots,quantile–quantileplots,scatterplots,andloesscurves,seeCleveland[Cle93].PioneeringworkondatavisualizationtechniquesisdescribedinTheVisualDis-playofQuantitativeInformation[Tuf83],EnvisioningInformation[Tuf90],andVisualExplanations:ImagesandQuantities,EvidenceandNarrative[Tuf97],allbyTufte,inadditiontoGraphicsandGraphicInformationProcessingbyBertin[Ber81],VisualizingDatabyCleveland[Cle93],andInformationVisualizationinDataMiningandKnowledgeDiscoveryeditedbyFayyad,Grinstein,andWierse[FGW01].MajorconferencesandsymposiumsonvisualizationincludeACMHumanFactorsinComputingSystems(CHI),Visualization,andtheInternationalSymposiumonInfor-mationVisualization.ResearchonvisualizationisalsopublishedinTransactionsonVisualizationandComputerGraphics,JournalofComputationalandGraphicalStatistics,andIEEEComputerGraphicsandApplications.Manygraphicaluserinterfacesandvisualizationtoolshavebeendevelopedandcanbefoundinvariousdataminingproducts.Severalbooksondatamining(e.g.,DataMiningSolutionsbyWestphalandBlaxton[WB98])presentmanygoodexamplesandvisualsnapshots.Forasurveyofvisualizationtechniques,see“Visualtechniquesforexploringdatabases”byKeim[Kei97].Similarityanddistancemeasuresamongvariousvariableshavebeenintroducedinmanytextbooksthatstudyclusteranalysis,includingHartigan[Har75];JainandDubes[JD88];KaufmanandRousseeuw[KR90];andArabie,Hubert,anddeSoete[AHS96].MethodsforcombiningattributesofdifferenttypesintoasingledissimilaritymatrixwereintroducedbyKaufmanandRousseeuw[KR90].'}, 'score': 0.0, 'values': []}, {'id': 'a1df9f49-4d25-47d5-aa1a-cd7a778d3571-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 119.0, 'text': 
'statistics-based visualization of data using boxplots, quantile plots, quantile–quantile plots, scatter plots, and loess curves, see Cleveland [Cle93]. Pioneering work on data visualization techniques is described in The Visual Display of Quantitative Information [Tuf83], Envisioning Information [Tuf90], and Visual Explanations: Images and Quantities, Evidence and Narrative [Tuf97], all by Tufte, in addition to Graphics and Graphic Information Processing by Bertin [Ber81], Visualizing Data by Cleveland [Cle93], and Information Visualization in Data Mining and Knowledge Discovery edited by Fayyad, Grinstein, and Wierse [FGW01]. Major conferences and symposiums on visualization include ACM Human Factors in Computing Systems (CHI), Visualization, and the International Symposium on Information Visualization. Research on visualization is also published in Transactions on Visualization and Computer Graphics, Journal of Computational and Graphical Statistics, and IEEE Computer Graphics and Applications. Many graphical user interfaces and visualization tools have been developed and can be found in various data mining products. Several books on data mining (e.g., Data Mining Solutions by Westphal and Blaxton [WB98]) present many good examples and visual snapshots. For a survey of visualization techniques, see “Visual techniques for exploring databases” by Keim [Kei97]. Similarity and distance measures among various variables have been introduced in many textbooks that study cluster analysis, including Hartigan [Har75]; Jain and Dubes [JD88]; Kaufman and Rousseeuw [KR90]; and Arabie, Hubert, and de Soete [AHS96]. Methods for combining attributes of different types into a single dissimilarity matrix were introduced by Kaufman and Rousseeuw [KR90].'}, 'score': 0.0, 'values': []}, {'id': '9646aa8f-e69d-4ac8-9cc7-d7dd90158dc7-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 125.0, 'text': 
'HAN10-ch03-083-124-97801238147912011/6/13:16Page88#688Chapter3DataPreprocessing3.2DataCleaningReal-worlddatatendtobeincomplete,noisy,andinconsistent.Datacleaning(ordatacleansing)routinesattempttofillinmissingvalues,smoothoutnoisewhileidenti-fyingoutliers,andcorrectinconsistenciesinthedata.Inthissection,youwillstudybasicmethodsfordatacleaning.Section3.2.1looksatwaysofhandlingmissingvalues.Section3.2.2explainsdatasmoothingtechniques.Section3.2.3discussesapproachestodatacleaningasaprocess.3.2.1MissingValuesImaginethatyouneedtoanalyzeAllElectronicssalesandcustomerdata.Younotethatmanytupleshavenorecordedvalueforseveralattributessuchascustomerincome.Howcanyougoaboutfillinginthemissingvaluesforthisattribute?Let’slookatthefollowingmethods.1.Ignorethetuple:Thisisusuallydonewhentheclasslabelismissing(assumingtheminingtaskinvolvesclassification).Thismethodisnotveryeffective,unlessthetuplecontainsseveralattributeswithmissingvalues.Itisespeciallypoorwhenthepercent-ageofmissingvaluesperattributevariesconsiderably.Byignoringthetuple,wedonotmakeuseoftheremainingattributes’valuesinthetuple.Suchdatacouldhavebeenusefultothetaskathand.2.Fillinthemissingvaluemanually:Ingeneral,thisapproachistimeconsumingandmaynotbefeasiblegivenalargedatasetwithmanymissingvalues.3.Useaglobalconstanttofillinthemissingvalue:Replaceallmissingattributevaluesbythesameconstantsuchasalabellike“Unknown”or−∞.Ifmissingvaluesarereplacedby,say,“Unknown,”thentheminingprogrammaymistakenlythinkthattheyformaninterestingconcept,sincetheyallhaveavalueincommon—thatof“Unknown.”Hence,althoughthismethodissimple,itisnotfoolproof.4.Useameasureofcentraltendencyfortheattribute(e.g.,themeanormedian)tofillinthemissingvalue:Chapter2discussedmeasuresofcentraltendency,whichindicatethe“middle”valueofadatadistribution.Fornormal(symmetric)datadis-tributions,themeancanbeused,whileskeweddatadistributionshouldemploythemedian(Section2.2).Forexample,supposethatthedatadistributionregard-ingtheincomeofAllElectronicscustomersissymmetricandthatt'}, 'score': 0.0, 'values': []}, {'id': '1fff04a1-9180-42f1-9fd7-ce8c9987d716-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 125.0, 'text': 
'HAN10-ch03-083-124-97801238147912011/6/13:16Page88#688Chapter3DataPreprocessing3.2DataCleaningReal-worlddatatendtobeincomplete,noisy,andinconsistent.Datacleaning(ordatacleansing)routinesattempttofillinmissingvalues,smoothoutnoisewhileidenti-fyingoutliers,andcorrectinconsistenciesinthedata.Inthissection,youwillstudybasicmethodsfordatacleaning.Section3.2.1looksatwaysofhandlingmissingvalues.Section3.2.2explainsdatasmoothingtechniques.Section3.2.3discussesapproachestodatacleaningasaprocess.3.2.1MissingValuesImaginethatyouneedtoanalyzeAllElectronicssalesandcustomerdata.Younotethatmanytupleshavenorecordedvalueforseveralattributessuchascustomerincome.Howcanyougoaboutfillinginthemissingvaluesforthisattribute?Let’slookatthefollowingmethods.1.Ignorethetuple:Thisisusuallydonewhentheclasslabelismissing(assumingtheminingtaskinvolvesclassification).Thismethodisnotveryeffective,unlessthetuplecontainsseveralattributeswithmissingvalues.Itisespeciallypoorwhenthepercent-ageofmissingvaluesperattributevariesconsiderably.Byignoringthetuple,wedonotmakeuseoftheremainingattributes’valuesinthetuple.Suchdatacouldhavebeenusefultothetaskathand.2.Fillinthemissingvaluemanually:Ingeneral,thisapproachistimeconsumingandmaynotbefeasiblegivenalargedatasetwithmanymissingvalues.3.Useaglobalconstanttofillinthemissingvalue:Replaceallmissingattributevaluesbythesameconstantsuchasalabellike“Unknown”or−∞.Ifmissingvaluesarereplacedby,say,“Unknown,”thentheminingprogrammaymistakenlythinkthattheyformaninterestingconcept,sincetheyallhaveavalueincommon—thatof“Unknown.”Hence,althoughthismethodissimple,itisnotfoolproof.4.Useameasureofcentraltendencyfortheattribute(e.g.,themeanormedian)tofillinthemissingvalue:Chapter2discussedmeasuresofcentraltendency,whichindicatethe“middle”valueofadatadistribution.Fornormal(symmetric)datadis-tributions,themeancanbeused,whileskeweddatadistributionshouldemploythemedian(Section2.2).Forexample,supposethatthedatadistributionregard-ingtheincomeofAllElectronicscustomersissymmetricandthatt'}, 'score': 0.0, 'values': []}, {'id': '645d8252-8c31-4134-8481-d457f8c96be6-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 126.0, 'text': 
{'id': '645d8252-8c31-4134-8481-d457f8c96be6-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 126.0, 'text': 'induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is introduced in Section 3.4.5.
Methods 3 through 6 bias the data—the filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values. By considering the other attributes’ values in its estimation of the missing value for income, there is a greater chance that the relationships between income and the other attributes are preserved.
It is important to note that, in some cases, a missing value may not imply an error in the data! For example, when applying for a credit card, candidates may be asked to supply their driver’s license number. Candidates who do not have a driver’s license may naturally leave this field blank. Forms should allow respondents to specify values such as “not applicable.” Software routines may also be used to uncover other null values (e.g., “don’t know,” “?” or “none”). Ideally, each attribute should have one or more rules regarding the null condition. The rules may specify whether or not nulls are allowed and/or how such values should be handled or transformed. Fields may also be intentionally left blank if they are to be provided in a later step of the business process. Hence, although we can try our best to clean the data after it is seized, good database and data entry procedure design should help minimize the number of missing values or errors in the first place.
3.2.2 Noisy Data
“What is noise?” Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some basic statistical description techniques (e.g., boxplots and scatter plots), and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we “smooth” out the data to remove the noise? Let’s look at the following data smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a'}, 'score': 0.0, 'values': []}, {'id': '645d8252-8c31-4134-8481-d457f8c96be6-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 126.0, 'text': 's are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the'}, 'score': 0.0, 'values': []},
{'id': 'b30a8b5b-9ba6-4e99-8ba7-eb4acca306b8-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 127.0, 'text': 'Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
Figure 3.2 Binning methods for data smoothing.
greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Regression is further described in Section 3.4.5.
Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to the topic of outlier analysis.
Many data smoothing methods are also used for data discretization (a form of data transformation) and data reduction. For example, the binning techniques described before reduce the number of distinct values per attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly makes value comparisons on sorted data. Concept hierarchies are a form of data discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may map real price values into inexpensive, moderately priced, and expensive, thereby reducing the number of data values to be handled by the mining'}, 'score': 0.0, 'values': []}, {'id': '02429114-bfb6-4ab2-b0b6-d41fa3e5eeb2-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 128.0, 'text': 'Figure 3.3 A 2-D customer data plot with respect to customer locations in a city, showing three data clusters. Outliers may be detected as values that fall outside of the cluster sets.
process. Data discretization is discussed in Section 3.5. Some methods of classification (e.g., neural networks) have built-in data smoothing mechanisms. Classification is the topic of Chapters 8 and 9.
3.2.3 Data Cleaning as a Process
Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have looked at techniques for handling missing data and for smoothing data. “But data cleaning is a big job. What about data cleaning as a process? How exactly does one proceed in tackling this task? Are there any tools out there to help?”
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses). Discrepancies may also arise from inconsistent data representations and inconsistent use of codes. Other sources of discrepancies include errors in instrumentation devices that record data and system errors. Errors can also occur when the data are (inadequately) used for purposes other than originally intended. There may also be inconsistencies due to data integration (e.g., where a given attribute can have different names in different databases).2
2 Data integration and the removal of redundant data that can result from such integration are further described in Section 3.3.'}, 'score': 0.0, 'values': []},
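The Figure 3.2 chunk above walks through equal-frequency binning followed by smoothing by bin means and by bin boundaries. A small sketch reproducing those numbers in plain Python (the closest_boundary helper is our own, not from the book):

```python
# Sketch of the Figure 3.2 binning methods: equal-frequency bins of size 3,
# then smoothing by bin means and by bin boundaries.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted

bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

def closest_boundary(value, low, high):
    # Replace each value by the nearer of the two bin boundaries.
    return low if value - low <= high - value else high

by_boundaries = [[closest_boundary(v, b[0], b[-1]) for v in b] for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```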
{'id': '8809c262-492c-41cc-8286-0bcfeb97de6b-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 132.0, 'text': 'χ² Correlation Test for Nominal Data
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, ..., ac. B has r distinct values, namely b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, where (A = ai, B = bj). Each and every possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as
\[ \chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^2}{e_{ij}}, \qquad (3.1) \]
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
\[ e_{ij} = \frac{count(A=a_i)\times count(B=b_j)}{n}, \qquad (3.2) \]
where n is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B. The sum in Eq. (3.1) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those for which the actual count is very different from that expected.
The χ² statistic tests the hypothesis that A and B are independent, that is, there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom. We illustrate the use of this statistic in Example 3.1. If the hypothesis can be rejected, then we say that A and B are statistically correlated.
Example 3.1 Correlation analysis of nominal attributes using χ². Suppose that a group of 1500 people was surveyed. The gender of each person was noted. Each person was polled as to whether his or her preferred type of reading material was fiction or nonfiction. Thus, we have two attributes, gender and preferred_reading. The observed frequency (or count) of each possible joint event is summarized in the contingency table shown in Table 3.1, where the numbers in parentheses are the expected frequencies. The expected frequencies are calculated based on the data distribution for both attributes using Eq. (3.2).
Using Eq. (3.2), we can verify the expected frequencies for each c'}, 'score': 0.0, 'values': []}, {'id': '8809c262-492c-41cc-8286-0bcfeb97de6b-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 132.0, 'text': 'd frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is
\( e_{11} = \frac{count(male)\times count(fiction)}{n} = \frac{300\times 450}{1500} = 90, \)
and so on. Notice that in any row, the sum of the expected frequencies must equal the total observed frequency for that row, and the sum of the expected frequencies in any column must also equal the total observed frequency for that column.'}, 'score': 0.0, 'values': []},
{'id': '5acca956-5b5b-4090-8322-9d7b6748147e-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 133.0, 'text': 'Table 3.1 Example 2.1’s 2 × 2 Contingency Table Data
              male       female       Total
fiction       250 (90)   200 (360)    450
nonfiction    50 (210)   1000 (840)   1050
Total         300        1200         1500
Note: Are gender and preferred_reading correlated?
Using Eq. (3.1) for χ² computation, we get
\[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93. \]
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution, typically available from any textbook on statistics). Since our computed value is above this, we can reject the hypothesis that gender and preferred_reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson’s product moment coefficient, named after its inventor, Karl Pearson). This is
\[ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n\sigma_A\sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\sigma_A\sigma_B}, \qquad (3.3) \]
where n is the number of tuples, ai and bi are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σA and σB are the respective standard deviations of A and B (as defined in Section 2.2.2), and Σ(ai bi) is the sum of the AB cross-product (i.e., for each tuple, the value for A is multiplied by the value for B in that tuple). Note that −1 ≤ r_{A,B} ≤ +1. If r_{A,B} is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the stronger the correlation (i.e., the more each attribute implies the other). Hence, a higher value may indicate that A (or B) may be removed as a redundancy.
If the resulting value is equal to 0, then A and B are independent and there is no correlation between them. If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other. Scatter plots can also be used to vie'}, 'score': 0.0, 'values': []}, {'id': '5acca956-5b5b-4090-8322-9d7b6748147e-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 133.0, 'text': 'ts can also be used to view correlations between attributes (Section 2.2.3). For example, Figure 2.8’s'}, 'score': 0.0, 'values': []},
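The Example 3.1 chunk above computes χ² from the Table 3.1 counts via Eqs. (3.1) and (3.2). A short sketch of the same arithmetic (the dictionary layout of the contingency table is our own choice):

```python
# Sketch of the chi-square computation from Example 3.1, using the observed
# counts and marginal totals of Table 3.1.
observed = {('male', 'fiction'): 250, ('female', 'fiction'): 200,
            ('male', 'nonfiction'): 50, ('female', 'nonfiction'): 1000}

n = sum(observed.values())                    # 1500
col_totals = {'male': 300, 'female': 1200}
row_totals = {'fiction': 450, 'nonfiction': 1050}

chi2 = 0.0
for (col, row), o in observed.items():
    e = col_totals[col] * row_totals[row] / n   # expected frequency, Eq. (3.2)
    chi2 += (o - e) ** 2 / e                    # Eq. (3.1)

print(round(chi2, 2))   # 507.94 (the text's 507.93 rounds each term first);
                        # well above 10.828, so independence is rejected at
                        # the 0.001 level with 1 degree of freedom.
```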
{'id': 'edcf5ae3-5d09-43bc-8824-966567562ad6-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 134.0, 'text': 'scatter plots respectively show positively correlated data and negatively correlated data, while Figure 2.9 displays uncorrelated data.
Note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A. For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely, population.
Covariance of Numeric Data
In probability theory and statistics, correlation and covariance are two similar measures for assessing how much two attributes change together. Consider two numeric attributes A and B, and a set of n observations {(a1, b1), ..., (an, bn)}. The mean values of A and B, respectively, are also known as the expected values on A and B, that is, \( E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n} \) and \( E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n} \). The covariance between A and B is defined as
\[ Cov(A,B) = E((A-\bar{A})(B-\bar{B})) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}. \qquad (3.4) \]
If we compare Eq. (3.3) for r_{A,B} (correlation coefficient) with Eq. (3.4) for covariance, we see that
\[ r_{A,B} = \frac{Cov(A,B)}{\sigma_A\sigma_B}, \qquad (3.5) \]
where σA and σB are the standard deviations of A and B, respectively. It can also be shown that
\[ Cov(A,B) = E(A\cdot B) - \bar{A}\bar{B}. \qquad (3.6) \]
This equation may simplify calculations.
For two attributes A and B that tend to change together, if A is larger than Ā (the expected value of A), then B is likely to be larger than B̄ (the expected value of B). Therefore, the covariance between A and B is positive. On the other hand, if one of the attributes tends to be above its expected value when the other attribute is below its expected value, then the covariance of A and B is negative.
If A and B are independent (i.e., they do not have correlation), then E(A · B) = E(A) · E(B). Therefore, the covariance is Cov(A, B) = E(A · B) − ĀB̄ = E(A) · E(B) − ĀB̄ = 0. However, the converse is not true. Some pairs of random variables (attributes) may have a covariance of 0 but are not independent. Only under some additional assumptions'}, 'score': 0.0, 'values': []}, {'id': 'fbf30844-4a66-4ca2-9825-406a4b5c112a-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 135.0, 'text': 'Table 3.2 Stock Prices for AllElectronics and HighTech
Time point   AllElectronics   HighTech
t1           6                20
t2           5                10
t3           4                14
t4           3                5
t5           2                5
(e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Example 3.2 Covariance analysis of numeric attributes. Consider Table 3.2, which presents a simplified example of stock prices observed at five time points for AllElectronics and HighTech, a high-tech company. If the stocks are affected by the same industry trends, will their prices rise or fall together?
E(AllElectronics) = (6 + 5 + 4 + 3 + 2)/5 = 20/5 = $4 and E(HighTech) = (20 + 10 + 14 + 5 + 5)/5 = 54/5 = $10.80.
Thus, using Eq. (3.4), we compute
Cov(AllElectronics, HighTech) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 − 4 × 10.80 = 50.2 − 43.2 = 7.
Therefore, given the positive covariance we can say that stock prices for both companies rise together.
Variance is a special case of covariance, where the two attributes are identical (i.e., the covariance of an attribute with itself). Variance was discussed in Chapter 2.
3.3.3 Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case). The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences. For example, if a purchase order database contains attributes for'}, 'score': 0.0, 'values': []},
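The covariance chunks above define Cov(A, B), relate it to the correlation coefficient, and apply Eq. (3.4) to the Table 3.2 stock prices in Example 3.2. A sketch of both calculations, using population standard deviations as in Section 2.2.2:

```python
# Sketch of Eqs. (3.4)/(3.6) and Eq. (3.5) on the Table 3.2 stock prices.
import math

all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]
n = len(all_electronics)

mean_a = sum(all_electronics) / n    # 4.0
mean_b = sum(high_tech) / n          # 10.8

# Covariance via Eq. (3.6): E(A*B) - mean_A * mean_B
cov = sum(a * b for a, b in zip(all_electronics, high_tech)) / n - mean_a * mean_b
print(cov)                           # 7.0 (positive: the prices move together)

# Pearson correlation via Eq. (3.5), with population standard deviations
std_a = math.sqrt(sum((a - mean_a) ** 2 for a in all_electronics) / n)
std_b = math.sqrt(sum((b - mean_b) ** 2 for b in high_tech) / n)
print(cov / (std_a * std_b))
```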
{'id': 'b5d29282-e626-4558-b3c6-de3902b7225f-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 138.0, 'text': 'cleaning as well. Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression. That is, if the same number of coefficients is retained for a DWT and a DFT of a given data vector, the DWT version will provide a more accurate approximation of the original data. Hence, for an equivalent approximation, the DWT requires less space than the DFT. Unlike the DFT, wavelets are quite localized in space, contributing to the conservation of local detail.
There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows some wavelet families. Popular wavelet transforms include the Haar-2, Daubechies-4, and Daubechies-6. The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements \((x_{2i}, x_{2i+1})\). This results in two data sets of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.
Figure 3.4 Examples of wavelet families: (a) Haar-2, (b) Daubechies-4. The number next to a wavelet name is the number of vanishing moments of the wavelet. This is a set of mathematical relationships that the coefficients must satisfy and is related to the number of coefficients.'}, 'score': 0.0, 'values': []}, {'id': '21725803-86a6-4d80-8b0d-d1704cbc4c76-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 139.0, 'text': 'set of variables. The initial data can then be projected onto this smaller set. PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.
3. The principal components are sorted in order of decreasing “significance” or strength. The principal components essentially serve as a new set of axes for the data,'}, 'score': 0.0, 'values': []}, {'id': 'ebc1038c-090a-44d6-81a9-c86c08714766-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 143.0, 'text': 'specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line. Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space. Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction (since the lower-dimensional points together typically occupy less space than the original data points) and data smoothing (since aggregate estimates in the lower-dimensional space are less subject to sampling variations than the estimates in the higher-dimensional space).
Regression and log-linear models can both be used on sparse data, although their application may be limited. While both methods can handle skewed data, regression does exceptionally well. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions. Several software packages exist to solve regression problems. Examples include SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com). Another useful resource is the book Numerical Recipes in C, by Press, Teukolsky, Vetterling, and Flannery [PTVF07], and its associated source code.
3.4.6 Histograms
Histograms use binning to approximate data distributions and are a popular form of data reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket represents only a singl'}, 'score': 0.0, 'values': []},
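The DWT chunk above outlines the hierarchical pyramid algorithm: pairwise smoothing plus a weighted difference, applied recursively. A rough sketch of that idea for the Haar case; the plain average/difference pair and the stopping point used here are one common convention, not necessarily the book's exact normalization:

```python
# Sketch of the pyramid idea with Haar-style pairwise averages (smoothing)
# and differences (detail), repeated until a single overall average remains.
def haar_dwt(x):
    assert len(x) & (len(x) - 1) == 0, 'pad with zeros to a power of 2 first'
    coeffs = []
    smooth = list(x)
    while len(smooth) > 1:
        avg = [(smooth[2 * i] + smooth[2 * i + 1]) / 2 for i in range(len(smooth) // 2)]
        diff = [(smooth[2 * i] - smooth[2 * i + 1]) / 2 for i in range(len(smooth) // 2)]
        coeffs.append(diff)      # high-frequency detail at this resolution
        smooth = avg             # low-frequency version fed to the next pass
    coeffs.append(smooth)        # overall average
    return coeffs

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```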
{'id': 'ebc1038c-090a-44d6-81a9-c86c08714766-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 143.0, 'text': 'represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
Example 3.3 Histograms. The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous value range for the given attribute. In Figure 3.8, each bucket represents a different $10 range for price.'}, 'score': 0.0, 'values': []}, {'id': '85b4ecfe-13d6-4ad2-adbf-20f73f8c463c-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 144.0, 'text': 'Figure 3.7 A histogram for price using singleton buckets—each bucket represents one price–value/frequency pair.
Figure 3.8 An equal-width histogram for price, where values are aggregated so that each bucket has a uniform width of $10.
“How are the buckets determined and the attribute values partitioned?” There are several partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., the width of $10 for the buckets in Figure 3.8).
Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples).'}, 'score': 0.0, 'values': []},
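The histogram chunks above list the Example 3.3 prices and describe the equal-width partitioning used in Figure 3.8. A small sketch that aggregates those prices into $10-wide buckets:

```python
# Sketch of the Figure 3.8 equal-width histogram: $10 buckets (1-10, 11-20, 21-30)
# over the Example 3.3 price list.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

buckets = Counter((p - 1) // 10 for p in prices)   # 0 -> 1-10, 1 -> 11-20, 2 -> 21-30
for b in sorted(buckets):
    print(f'{10 * b + 1}-{10 * (b + 1)}: {buckets[b]}')
```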
{'id': 'e4d97bf1-086b-45eb-aaaf-316f5f3ff887-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 145.0, 'text': 'Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data. The histograms described before for single attributes can be extended for multiple attributes. Multidimensional histograms can capture dependencies between attributes. These histograms have been found effective in approximating data with up to five attributes. More studies are needed regarding the effectiveness of multidimensional histograms for high dimensionalities. Singleton buckets are useful for storing high-frequency outliers.
3.4.7 Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters. Similarity is commonly defined in terms of how “close” the objects are in space, based on a distance function. The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid (denoting the “average object,” or average point in space for the cluster). Figure 3.3 showed a 2-D plot of customer data with respect to customer locations in a city. Three data clusters are visible.
In data reduction, the cluster representations of the data are used to replace the actual data. The effectiveness of this technique depends on the data’s nature. It is much more effective for data that can be organized into distinct clusters than for smeared data.
There are many measures for defining clusters and cluster quality. Clustering methods are further described in Chapters 10 and 11.
3.4.8 Sampling
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset). Suppose that a large data set, D, contains N tuples. Let’s look at the most common ways that we could sample D for data reduction, as illustrated in Figure 3.9.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s
0, a user can specify the tolerance of average noise per element against a perfect bicluster, because in Eq. (11.19) the residue on each element is
\[ residue(e_{ij}) = e_{ij} - e_{iJ} - e_{Ij} + e_{IJ}. \qquad (11.20) \]
A maximal δ-bicluster is a δ-bicluster I × J such that there does not exist another δ-bicluster I′ × J′, and I ⊆ I′, J ⊆ J′, and at least one inequality holds. Finding the'}, 'score': 0.0, 'values': []},
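The sampling chunk above introduces SRSWOR before it breaks off. A minimal sketch of that scheme using the standard library, plus its with-replacement counterpart (SRSWR, described elsewhere in the chapter) for contrast; D here is a hypothetical list standing in for the data set:

```python
# Sketch of simple random sampling for data reduction.
import random

D = list(range(1, 101))   # stand-in for a data set of N = 100 tuples
s = 10

srswor = random.sample(D, s)                     # without replacement
srswr = [random.choice(D) for _ in range(s)]     # with replacement

print(srswor)
print(srswr)
```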
{'id': 'e828546d-9670-4fad-aa96-bf3c7ef29eab-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 555.0, 'text': 'maximal δ-bicluster of the largest size is computationally costly. Therefore, we can use a heuristic greedy search method to obtain a local optimal cluster. The algorithm works in two phases.
In the deletion phase, we start from the whole matrix. While the mean-squared residue of the matrix is over δ, we iteratively remove rows and columns. At each iteration, for each row i, we compute the mean-squared residue as
\[ d(i) = \frac{1}{|J|}\sum_{j\in J}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^2. \qquad (11.21) \]
Moreover, for each column j, we compute the mean-squared residue as
\[ d(j) = \frac{1}{|I|}\sum_{i\in I}(e_{ij}-e_{iJ}-e_{Ij}+e_{IJ})^2. \qquad (11.22) \]
We remove the row or column of the largest mean-squared residue. At the end of this phase, we obtain a submatrix I × J that is a δ-bicluster. However, the submatrix may not be maximal.
In the addition phase, we iteratively expand the δ-bicluster I × J obtained in the deletion phase as long as the δ-bicluster requirement is maintained. At each iteration, we consider rows and columns that are not involved in the current bicluster I × J by calculating their mean-squared residues. A row or column of the smallest mean-squared residue is added into the current δ-bicluster.
This greedy algorithm can find one δ-bicluster only. To find multiple biclusters that do not have heavy overlaps, we can run the algorithm multiple times. After each execution where a δ-bicluster is output, we can replace the elements in the output bicluster by random numbers. Although the greedy algorithm may find neither the optimal biclusters nor all biclusters, it is very fast even on large matrices.
Enumerating All Biclusters Using MaPle
As mentioned, a submatrix I × J is a bicluster with coherent values if and only if for any \(i_1, i_2 \in I\) and \(j_1, j_2 \in J\), \(e_{i_1 j_1} - e_{i_2 j_1} = e_{i_1 j_2} - e_{i_2 j_2}\). For any 2 × 2 submatrix of I × J, we can define a p-score as
\[ \text{p-score}\begin{pmatrix} e_{i_1 j_1} & e_{i_1 j_2} \\ e_{i_2 j_1} & e_{i_2 j_2} \end{pmatrix} = |(e_{i_1 j_1}-e_{i_2 j_1})-(e_{i_1 j_2}-e_{i_2 j_2})|. \qquad (11.23) \]
A submatrix I × J is a δ-pCluster (for pattern-based cluster) if the p-score of every 2 × 2 submatrix of I × J is at most δ, where δ ≥ 0 is a threshold specifying a user’s tolerance of noise against a perfect bicluster. Here, the p-score controls the noise on every element in a bicluster, while the mean-squared residue captures the average noise.
An interesting property of δ-pClust'}, 'score': 0.0, 'values': []}, {'id': 'e828546d-9670-4fad-aa96-bf3c7ef29eab-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 555.0, 'text': 'ng property of δ-pCluster is that if I × J is a δ-pCluster, then every x × y (x, y ≥ 2) submatrix of I × J is also a δ-pCluster. This monotonicity enables'}, 'score': 0.0, 'values': []}, {'id': 'b5e26d2b-069b-4f3a-b09b-cfb267f1137f-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 556.0, 'text': 'us to obtain a succinct representation of nonredundant δ-pClusters. A δ-pCluster is maximal if no more rows or columns can be added into the cluster while maintaining the δ-pCluster property. To avoid redundancy, instead of finding all δ-pClusters, we only need to compute all maximal δ-pClusters.
MaPle is an algorithm that enumerates all maximal δ-pClusters. It systematically enumerates every combination of conditions using a set enumeration tree and a depth-first search. This enumeration framework is the same as the pattern-growth methods for frequent pattern mining (Chapter 6). Consider gene expression data. For each condition combination, J, MaPle finds the maximal subsets of genes, I, such that I × J is a δ-pCluster. If I × J is not a submatrix of another δ-pCluster, then I × J is a maximal δ-pCluster.
There may be a huge number of condition combinations. MaPle prunes many unfruitful combinations using the monotonicity of δ-pClusters. For a condition combination, J, if there does not exist a set of genes, I, such that I × J is a δ-pCluster, then we do not need to consider any superset of J. Moreover, we should consider I × J as a candidate of a δ-pCluster only if for every (|J| − 1)-subset J′ of J, I × J′ is a δ-pCluster. MaPle also employs several pruning techniques to speed up the search while retaining the completeness of returning all maximal δ-pClusters. For example, when examining a current δ-pCluster, I × J, MaPle collects all the genes and conditions that may be added to expand the cluster. If these candidate genes and conditions together with I and J form a submatrix of a δ-pCluster that has already been found, then the search of I × J and any superset of J can be pruned. Interested readers may refer to the bibliographic notes for additional information on the MaPle algorithm (Section 11.7).
An interesting observation here is that the search for maximal δ-pClusters in MaPle is somewhat similar to mining frequent closed itemsets. Consequently, MaPle borrows the depth-first search framework and ideas from the pruning techniques of pattern-growth methods for frequent pattern mining. This is an example where frequent pattern mining and cluster analysis may share similar techniques and ideas.
An advantage of MaPle and th'}, 'score': 0.0, 'values': []}, {'id': 'b5e26d2b-069b-4f3a-b09b-cfb267f1137f-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 556.0, 'text': 'dvantage of MaPle and the other algorithms that enumerate all biclusters is that they guarantee the completeness of the results and do not miss any overlapping biclusters. However, a challenge for such enumeration algorithms is that they may become very time consuming if a matrix becomes very large, such as a customer-purchase matrix of hundreds of thousands of customers and millions of products.
11.2.4 Dimensionality Reduction Methods and Spectral Clustering
Subspace clustering methods try to find clusters in subspaces of the original data space. In some situations, it is more effective to construct a new space instead of using subspaces of the original data. This is the motivation behind dimensionality reduction methods for clustering high-dimensional data.
Example 11.14 Clustering in a derived space. Consider the three clusters of points in Figure 11.9. It is not possible to cluster these points in any subspace of the original space, X × Y, because'}, 'score': 0.0, 'values': []},
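The bicluster chunks above define the row and column mean-squared residues (Eqs. 11.21 and 11.22) used by the greedy deletion phase. A sketch of those residues and one deletion step, assuming numpy; the 3 × 3 matrix is a made-up toy example:

```python
# Sketch of Eqs. (11.21)-(11.22): row/column mean-squared residues for a
# candidate bicluster, plus one greedy deletion step.
import numpy as np

E = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [9.0, 1.0, 5.0]])          # last row is deliberately noisy

e_iJ = E.mean(axis=1, keepdims=True)     # row means
e_Ij = E.mean(axis=0, keepdims=True)     # column means
e_IJ = E.mean()                          # overall mean

residue = E - e_iJ - e_Ij + e_IJ
d_row = (residue ** 2).mean(axis=1)      # Eq. (11.21)
d_col = (residue ** 2).mean(axis=0)      # Eq. (11.22)

# Deletion phase, one step: drop whichever row or column has the largest
# mean-squared residue (here the noisy third row).
if d_row.max() >= d_col.max():
    E = np.delete(E, d_row.argmax(), axis=0)
else:
    E = np.delete(E, d_col.argmax(), axis=1)
print(E)
```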
{'id': 'af513579-60b2-4b77-a4e4-c6b618ad5e37-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 557.0, 'text': 'Figure 11.9 Clustering in a derived space may be more effective.
all three clusters would end up being projected onto overlapping areas in the x and y axes. What if, instead, we construct a new dimension, \(-\frac{x^2}{2} + \frac{y^2}{2}\) (shown as a dashed line in the figure)? By projecting the points onto this new dimension, the three clusters become apparent.
Although Example 11.14 involves only two dimensions, the idea of constructing a new space (so that any clustering structure that is hidden in the data becomes well manifested) can be extended to high-dimensional data. Preferably, the newly constructed space should have low dimensionality.
There are many dimensionality reduction methods. A straightforward approach is to apply feature selection and extraction methods to the data set, such as those discussed in Chapter 3. However, such methods may not be able to detect the clustering structure. Therefore, methods that combine feature extraction and clustering are preferred. In this section, we introduce **spectral clustering**, a group of methods that are effective in high-dimensional data applications.
Figure 11.10 shows the general framework for spectral clustering approaches. The Ng-Jordan-Weiss algorithm is a spectral clustering method. Let’s have a look at each step of the framework. In doing so, we also note special conditions that apply to the Ng-Jordan-Weiss algorithm as an example.
Given a set of objects \(o_1, o_2, \ldots, o_n\), the distance between each pair of objects, \(dist(o_i, o_j)\) \((1 \leq i < j \leq n)\), and the desired number \(k\) of clusters, a spectral clustering approach works as follows:
1. Using the distance measure, calculate an affinity matrix, \(W\), such that
\[ W_{ij} = e^{-\frac{(dist(o_i, o_j))^2}{\sigma^2}}, \]
where \(\sigma\) is a scaling parameter that controls how fast the affinity \(W_{ij}\) decreases as \(dist(o_i, o_j)\) increases. In the Ng-Jordan-Weiss algorithm, \(W_{ii}\) is set to 0.'}, 'score': 0.0, 'values': []}, {'id': '243d6945-3b3f-4cb7-a6ef-937e6fac248f-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 558.0, 'text': 'Figure 11.10 The framework of spectral clustering approaches: data → affinity matrix [w_ij] → A = f(W) → compute leading k eigenvectors of A (Av = λv) → clustering in the new space → project back to cluster the original data. Source: Adapted from Slide 8 at http://videolectures.net/micued08azranmcl/.
2. Using the affinity matrix W, derive a matrix A = f(W). The way in which this is done can vary. The Ng-Jordan-Weiss algorithm defines a matrix, D, as a diagonal matrix such that D_ii is the sum of the ith row of W, that is,
\[ D_{ii} = \sum_{j=1}^{n} W_{ij}. \qquad (11.24) \]
A is then set to
\[ A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}. \qquad (11.25) \]
3. Find the k leading eigenvectors of A. Recall that the eigenvectors of a square matrix are the nonzero vectors that remain proportional to the original vector after being multiplied by the matrix. Mathematically, a vector v is an eigenvector of matrix A if Av = λv, where λ is called the corresponding eigenvalue. This step derives k new dimensions from A, which are based on the affinity matrix W. Typically, k should be much smaller than the dimensionality of the original data. The Ng-Jordan-Weiss algorithm computes the k eigenvectors with the largest eigenvalues x_1, ..., x_k of A.
4. Using the k leading eigenvectors, project the original data into the new space defined by the k leading eigenvectors, and run a clustering algorithm such as k-means to find k clusters. The Ng-Jordan-Weiss algorithm stacks the k largest eigenvectors in columns to form a matrix \(X = [x_1 x_2 \cdots x_k] \in \mathbb{R}^{n\times k}\). The algorithm forms a matrix Y by renormalizing each row in X to have unit length, that is,
\[ Y_{ij} = \frac{X_{ij}}{\sqrt{\sum_{j=1}^{k} X_{ij}^2}}. \qquad (11.26) \]
The algorithm then treats each row in Y as a point in the k-dimensional space \(\mathbb{R}^k\), and runs k-means (or any other algorithm serving the partitioning purpose) to cluster the points into k clusters.'}, 'score': 0.0, 'values': []}, {'id': '3b8ce4a9-446a-4367-83de-54a3aabc95da-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 559.0, 'text': 'Figure 11.11 The new dimensions and the clustering results of the Ng-Jordan-Weiss algorithm. Source: Adapted from Slide 9 at
5. Assign the original data points to clusters according to how the transformed points are assigned in the clusters obtained in step 4. In the Ng-Jordan-Weiss algorithm, the original object \(o_i\) is assigned to the jth cluster if and only if row \(i\) of matrix \(Y\) is assigned to the jth cluster as a result of step 4.
In spectral clustering methods, the dimensionality of the new space is set to the desired number of clusters. This setting expects that each new dimension should be able to manifest a cluster.
### Example 11.15
**The Ng-Jordan-Weiss algorithm.** Consider the set of points in Figure 11.11. The data set, the affinity matrix, the three largest eigenvectors, and the normalized vectors are shown. Note that with the three new dimensions (formed by the three largest eigenvectors), the clusters are easily detected.
Spectral clustering is effective in high-dimensional applications such as image processing. Theoretically, it works well when certain conditions apply. Scalability, however, is a challenge. Computing eigenvectors on a large matrix is costly. Spectral clustering can be combined with other clustering methods, such as biclustering. Additional information on other dimensionality reduction clustering methods, such as kernel PCA, can be found in the bibliographic notes (Section 11.7).
## 11.3 Clustering Graph and Network Data
Cluster analysis on graph and network data extracts valuable knowledge and information. Such data are increasingly popular in many applications. We discuss applications and challenges of clustering graph and network data in Section 11.3.1. Similarity measures for this form of clustering are given in Section 11.3.2. You will learn about graph clustering methods in Section 11.3.3.
In general, the terms graph and network can be used interchangeably. In the rest of this section, we mainly use the term **graph**.'}, 'score': 0.0, 'values': []}, {'id': 'b454a2ae-5a4c-43c3-8e09-f423fdc02552-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 560.0, 'text': '11.3.1 Applications and Challenges
As a customer relationship manager at AllElectronics, you notice that a lot of data relating to customers and their purchase behavior can be preferably modeled using graphs.
Example 11.16 Bipartite graph. The customer purchase behavior at AllElectronics can be represented in a bipartite graph. In a bipartite graph, vertices can be divided into two disjoint sets so that each edge connects a vertex in one set to a vertex in the other set. For the AllElectronics customer purchase data, one set of vertices represents customers, with one customer per vertex. The other set represents products, with one product per vertex. An edge connects a customer to a product, representing the purchase of the product by the customer. Figure 11.12 shows an illustration.
“What kind of knowledge can we obtain by a cluster analysis of the customer-product bipartite graph?” By clustering the customers such that those customers buying similar sets of products are placed into one group, a customer relationship manager can make product recommendations. For example, suppose Ada belongs to a customer cluster in which most of the customers purchased a digital camera in the last 12 months, but Ada has yet to purchase one. As manager, you decide to recommend a digital camera to her.
Alternatively, we can cluster products such that those products purchased by similar sets of customers are grouped together. This clustering information can also be used for product recommendations. For example, if a digital camera and a high-speed flash memory card belong to the same product cluster, then when a customer purchases a digital camera, we can recommend the high-speed flash memory card.
Bipartite graphs are widely used in many applications. Consider another example.
Example 11.17 Web search engines. In web search engines, search logs are archived to record user queries and the corresponding click-through information. (The click-through information tells us on which pages, given as a result of a search, the user clicked.) The query and click-through information can be represented using a bipartite graph, where the two sets
Figure 11.12 Bipartite graph representing customer-purchase data (vertex sets: Customers and Products).'}, 'score': 0.0, 'values': []},
'HAN18-ch11-497-542-97801238147912011/6/13:24Page524#28524Chapter11AdvancedClusterAnalysisofverticescorrespondtoqueriesandwebpages,respectively.Anedgelinksaquerytoawebpageifauserclicksthewebpagewhenaskingthequery.Valuableinformationcanbeobtainedbyclusteranalysesonthequery–webpagebipartitegraph.Forinstance,wemayidentifyqueriesposedindifferentlanguages,butthatmeanthesamething,iftheclick-throughinformationforeachqueryissimilar.Asanotherexample,allthewebpagesontheWebformadirectedgraph,alsoknownasthewebgraph,whereeachwebpageisavertex,andeachhyperlinkisanedgepointingfromasourcepagetoadestinationpage.Clusteranalysisonthewebgraphcandisclosecommunities,findhubsandauthoritativewebpages,anddetectwebspams.Inadditiontobipartitegraphs,clusteranalysiscanalsobeappliedtoothertypesofgraphs,includinggeneralgraphs,aselaboratedExample11.18.Example11.18Socialnetwork.Asocialnetworkisasocialstructure.Itcanberepresentedasagraph,wheretheverticesareindividualsororganizations,andthelinksareinterdependenciesbetweenthevertices,representingfriendship,commoninterests,orcollaborativeactivi-ties.AllElectronics’customersformasocialnetwork,whereeachcustomerisavertex,andanedgelinkstwocustomersiftheyknoweachother.Ascustomerrelationshipmanager,youareinterestedinfindingusefulinformationthatcanbederivedfromAllElectronics’socialnetworkthroughclusteranalysis.Youobtainclustersfromthenetwork,wherecustomersinaclusterknoweachotherorhavefriendsincommon.Customerswithinaclustermayinfluenceoneanotherregard-ingpurchasedecisionmaking.Moreover,communicationchannelscanbedesignedtoinformthe“heads”ofclusters(i.e.,the“best”connectedpeopleintheclusters),sothatpromotionalinformationcanbespreadoutquickly.Thus,youmayusecustomerclusteringtopromotesalesatAllElectronics.Asanotherexample,theauthorsofscientificpublicationsformasocialnetwork,wheretheauthorsareverticesandtwoauthorsareconnectedbyanedgeiftheyco-authoredapublication.Thenetworkis,ingeneral,aweightedgraphbecauseanedgebetweentwoauthorscancarryaweightrepresentingthestrength'}, 'score': 0.0, 'values': []}, {'id': '82962e3b-4174-4b0e-9df7-0a189349addd-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 561.0, 'text': 'resentingthestrengthofthecollaborationsuchashowmanypublicationsthetwoauthors(astheendvertices)coauthored.Clus-teringthecoauthornetworkprovidesinsightastocommunitiesofauthorsandpatternsofcollaboration.“Arethereanychallengesspecifictoclusteranalysisongraphandnetworkdata?”Inmostoftheclusteringmethodsdiscussedsofar,objectsarerepresentedusingasetofattributes.Auniquefeatureofgraphandnetworkdataisthatonlyobjects(asvertices)andrelationshipsbetweenthem(asedges)aregiven.Nodimensionsorattributesareexplicitlydefined.Toconductclusteranalysisongraphandnetworkdata,therearetwomajornewchallenges.“Howcanwemeasurethesimilaritybetweentwoobjectsonagraphaccordingly?”Typically,wecannotuseconventionaldistancemeasures,suchasEuclideandis-tance.Instead,weneedtodevelopnewmeasurestoquantifythesimilarity.Such'}, 'score': 0.0, 'values': []}, {'id': 'f7ccb328-3f7d-4daa-a6ef-3149aaaa28a3-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 562.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page525#2911.3ClusteringGraphandNetworkData525measuresoftenarenotmetric,andthusraisenewchallengesregardingthedevelop-mentofefficientclusteringmethods.SimilaritymeasuresforgraphsarediscussedinSection11.3.2.“Howcanwedesignclusteringmodelsandmethodsthatareeffectiveongraphandnetworkdata?”Graphandnetworkdataareoftencomplicated,carryingtopologicalstructuresthataremoresophisticatedthantraditionalclusteranalysisapplications.Manygraphdatasetsarelarge,suchasthewebgraphcontainingatleasttensofbillionsofwebpagesinthepubliclyindexableWeb.Graphscanalsobesparsewhere,onaverage,avertexisconnectedtoonlyasmallnumberofotherverticesinthegraph.Todiscoveraccurateandusefulknowledgehiddendeepinthedata,agoodclusteringmethodhastoaccommodatethesefactors.ClusteringmethodsforgraphandnetworkdataareintroducedinSection11.3.3.11.3.2SimilarityMeasures“Howcanwemeasurethesimilarityordistancebetweentwoverticesinagraph?”Inourdiscussion,weexaminetwotypesofmeasures:geodesicdistanceanddistancebasedonrandomwalk.GeodesicDistanceAsimplemeasureofthedistancebetweentwoverticesinagraphistheshortestpathbetweenthevertices.Formally,thegeodesicdistancebetweentwoverticesisthelengthintermsofthenumberofedgesoftheshortestpathbetweenthevertices.Fortwoverticesthatarenotconnectedinagraph,thegeodesicdistanceisdefinedasinfinite.Usinggeodesicdistance,wecandefineseveralotherusefulmeasurementsforgraphanalysisandclustering.GivenagraphG=(V,E),whereVisthesetofverticesandEisthesetofedges,wedefinethefollowing:Foravertextv∈V,theeccentricityofv,denotedeccen(v),isthelargestgeodesicdistancebetweenvandanyothervertexu∈V−{v}.Theeccentricityofvcaptureshowfarawayvisfromitsremotestvertexinthegraph.TheradiusofgraphGistheminimumeccentricityofallvertices.Thatis,r=minv∈Veccen(v).(11.27)Theradiuscapturesthedistancebetweenthe“mostcentralpoint”andthe“farthestborder”ofthegraph.ThediameterofgraphGisthemaximumeccentricityofallvertices.Thatis,d=maxv∈Veccen(v).(11.28)Thediameterrepresentsthelargestdistancebetw'}, 'score': 0.0, 'values': []}, {'id': 'f7ccb328-3f7d-4daa-a6ef-3149aaaa28a3-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 562.0, 'text': 'elargestdistancebetweenanypairofvertices.Aperipheralvertexisavertexthatachievesthediameter.'}, 'score': 0.0, 'values': []}, {'id': '08ab1cfa-cf42-4664-b0e3-2346252d52e2-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 563.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page526#30526Chapter11AdvancedClusterAnalysisabcedFigure11.13Agraph,G,whereverticesc,d,andeareperipheral.Example11.19Measurementsbasedongeodesicdistance.ConsidergraphGinFigure11.13.Theeccentricityofais2,thatis,eccen(a)=2,eccen(b)=2,andeccen(c)=eccen(d)=eccen(e)=3.Thus,theradiusofGis2,andthediameteris3.Notethatitisnotnecessarythatd=2×r.Verticesc,d,andeareperipheralvertices.SimRank:SimilarityBasedonRandomWalkandStructuralContextForsomeapplications,geodesicdistancemaybeinappropriateinmeasuringthesimi-laritybetweenverticesinagraph.HereweintroduceSimRank,asimilaritymeasurebasedonrandomwalkandonthestructuralcontextofthegraph.Inmathematics,arandomwalkisatrajectorythatconsistsoftakingsuccessiverandomsteps.Example11.20Similaritybetweenpeopleinasocialnetwork.Let’sconsidermeasuringthesimilaritybetweentwoverticesintheAllElectronicscustomersocialnetworkofExample11.18.Here,similaritycanbeexplainedastheclosenessbetweentwoparticipantsinthenet-work,thatis,howclosetwopeopleareintermsoftherelationshiprepresentedbythesocialnetwork.“Howwellcanthegeodesicdistancemeasuresimilarityandclosenessinsuchanetwork?”SupposeAdaandBobaretwocustomersinthenetwork,andthenetworkisundirected.Thegeodesicdistance(i.e.,thelengthoftheshortestpathbetweenAdaandBob)istheshortestpaththatamessagecanbepassedfromAdatoBobandviceversa.However,thisinformationisnotusefulforAllElectronics’customerrelationshipmanagementbecausethecompanytypicallydoesnotwanttosendaspecificmessagefromonecustomertoanother.Therefore,geodesicdistancedoesnotsuittheapplication.“Whatdoessimilaritymeaninasocialnetwork?”Weconsidertwowaystodefinesimilarity:Twocustomersareconsideredsimilartooneanotheriftheyhavesimilarneighborsinthesocialnetwork.Thisheuristicisintuitivebecause,inpractice,twopeoplereceivingrecommendationsfromagoodnumberofcommonfriendsoftenmakesimilardecisions.Thiskindofsimilarityisbasedonthelocalstructure(i.e.,theneighborhoods)ofthevertices,andthusiscalledstructuralcontext–basedsimilarity.'}, 'score': 0.0, 'values': []}, {'id': '6e67c466-9b88-4817-905b-54fb29acb9e7-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 564.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page527#3111.3ClusteringGraphandNetworkData527SupposeAllElectronicssendspromotionalinformationtobothAdaandBobinthesocialnetwork.AdaandBobmayrandomlyforwardsuchinformationtotheirfriends(orneighbors)inthenetwork.TheclosenessbetweenAdaandBobcanthenbemeasuredbythelikelihoodthatothercustomerssimultaneouslyreceivethepro-motionalinformationthatwasoriginallysenttoAdaandBob.Thiskindofsimilarityisbasedontherandomwalkreachabilityoverthenetwork,andthusisreferredtoassimilaritybasedonrandomwalk.Let’shaveacloserlookatwhatismeantbysimilaritybasedonstructuralcontext,andsimilaritybasedonrandomwalk.Theintuitionbehindsimilaritybasedonstructuralcontextisthattwoverticesinagrapharesimilariftheyareconnectedtosimilarvertices.Tomeasuresuchsimilarity,weneedtodefinethenotionofindividualneighborhood.InadirectedgraphG=(V,E),whereVisthesetofverticesandE⊆V×Visthesetofedges,foravertexv∈V,theindividualin-neighborhoodofvisdefinedasI(v)={u|(u,v)∈E}.(11.29)Symmetrically,wedefinetheindividualout-neighborhoodofvasO(v)={w|(v,w)∈E}.(11.30)FollowingtheintuitionillustratedinExample11.20,wedefineSimRank,astructural-contextsimilarity,withavaluethatisbetween0and1foranypairofver-tices.Foranyvertex,v∈V,thesimilaritybetweenthevertexanditselfiss(v,v)=1becausetheneighborhoodsareidentical.Forverticesu,v∈Vsuchthatu(cid:54)=v,wecandefines(u,v)=C|I(u)||I(v)|(cid:88)x∈I(u)(cid:88)y∈I(v)s(x,y),(11.31)whereCisaconstantbetween0and1.Avertexmaynothaveanyin-neighbors.Thus,wedefineEq.(11.31)tobe0wheneitherI(u)orI(v)is∅.ParameterCspecifiestherateofdecayassimilarityispropagatedacrossedges.“HowcanwecomputeSimRank?”AstraightforwardmethoditerativelyevaluatesEq.(11.31)untilafixedpointisreached.Letsi(u,v)betheSimRankscorecalculatedattheithround.Tobegin,wesets0(u,v)=(cid:40)0ifu(cid:54)=v1ifu=v.(11.32)WeuseEq.(11.31)tocomputesi+1fromsiassi+1(u,v)=C|I(u)||I(v)|(cid:88)x∈I(u)(cid:88)y∈I(v)si(x,y).(11.33)'}, 'score': 0.0, 'values': []}, {'id': 'f8bf5db7-f7e0-4795-93f1-1e519b430b72-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 565.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page528#32528Chapter11AdvancedClusterAnalysisItcanbeshownthatlimi→∞si(u,v)=s(u,v).AdditionalmethodsforapproximatingSimRankaregiveninthebibliographicnotes(Section11.7).Now,let’sconsidersimilaritybasedonrandomwalk.Adirectedgraphisstronglyconnectedif,foranytwonodesuandv,thereisapathfromutovandanotherpathfromvtou.Inastronglyconnectedgraph,G=(V,E),foranytwovertices,u,v∈V,wecandefinetheexpecteddistancefromutovasd(u,v)=(cid:88)t:u(cid:32)vP[t]l(t),(11.34)whereu(cid:32)visapathstartingfromuandendingatvthatmaycontaincyclesbutdoesnotreachvuntiltheend.Foratravelingtour,t=w1→w2→···→wk,itslengthisl(t)=k−1.TheprobabilityofthetourisdefinedasP[t]=(cid:40)(cid:81)k−1i=11|O(wi)|ifl(t)>00ifl(t)=0.(11.35)Tomeasuretheprobabilitythatavertexwreceivesamessagethatoriginatedsimulta-neouslyfromuandv,weextendtheexpecteddistancetothenotionofexpectedmeetingdistance,thatis,m(u,v)=(cid:88)t:(u,v)(cid:32)(x,x)P[t]l(t),(11.36)where(u,v)(cid:32)(x,x)isapairoftoursu(cid:32)xandv(cid:32)xofthesamelength.UsingaconstantCbetween0and1,wedefinetheexpectedmeetingprobabilityasp(u,v)=(cid:88)t:(u,v)(cid:32)(x,x)P[t]Cl(t),(11.37)whichisasimilaritymeasurebasedonrandomwalk.Here,theparameterCspecifiestheprobabilityofcontinuingthewalkateachstepofthetrajectory.Ithasbeenshownthats(u,v)=p(u,v)foranytwovertices,uandv.Thatis,SimRankisbasedonbothstructuralcontextandrandomwalk.11.3.3GraphClusteringMethodsLet’sconsiderhowtoconductclusteringonagraph.Wefirstdescribetheintuitionbehindgraphclustering.Wethendiscusstwogeneralcategoriesofgraphclusteringmethods.Tofindclustersinagraph,imaginecuttingthegraphintopieces,eachpiecebeingacluster,suchthattheverticeswithinaclusterarewellconnectedandtheverticesindifferentclustersareconnectedinamuchweakerway.Formally,foragraph,G=(V,E),'}, 'score': 0.0, 'values': []}, {'id': '405fc018-13aa-4703-b4bb-17fee7c21789-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 566.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page529#3311.3ClusteringGraphandNetworkData529acut,C=(S,T),isapartitioningofthesetofverticesVinG,thatis,V=S∪TandS∩T=∅.Thecutsetofacutisthesetofedges,{(u,v)∈E|u∈S,v∈T}.Thesizeofthecutisthenumberofedgesinthecutset.Forweightedgraphs,thesizeofacutisthesumoftheweightsoftheedgesinthecutset.“Whatkindsofcutsaregoodforderivingclustersingraphs?”Ingraphtheoryandsomenetworkapplications,aminimumcutisofimportance.Acutisminimumifthecut’ssizeisnotgreaterthananyothercut’ssize.Therearepolynomialtimealgorithmstocomputeminimumcutsofgraphs.Canweusethesealgorithmsingraphclustering?Example11.21Cutsandclusters.ConsidergraphGinFigure11.14.Thegraphhastwoclusters:{a,b,c,d,e,f}and{g,h,i,j,k},andoneoutliervertex,l.ConsidercutC1=({a,b,c,d,e,f,g,h,i,j,k},{l}).Onlyoneedge,namely,(e,l),crossesthetwopartitionscreatedbyC1.Therefore,thecutsetofC1is{(e,l)}andthesizeofC1is1.(Notethatthesizeofanycutinaconnectedgraphcannotbesmallerthan1.)Asaminimumcut,C1doesnotleadtoagoodclusteringbecauseitonlyseparatestheoutliervertex,l,fromtherestofthegraph.CutC2=({a,b,c,d,e,f,l},{g,h,i,j,k})leadstoamuchbetterclusteringthanC1.TheedgesinthecutsetofC2arethoseconnectingthetwo“naturalclusters”inthegraph.Specifically,foredges(d,h)and(e,k)thatareinthecutset,mostoftheedgesconnectingd,h,e,andkbelongtoonecluster.Example11.21indicatesthatusingaminimumcutisunlikelytoleadtoagoodclus-tering.Wearebetteroffchoosingacutwhere,foreachvertexuthatisinvolvedinanedgeinthecutset,mostoftheedgesconnectingtoubelongtoonecluster.Formally,letdeg(u)bethedegreeofu,thatis,thenumberofedgesconnectingtou.ThesparsityofacutC=(S,T)isdefinedas(cid:56)=cutsizemin{|S|,|T|}.(11.38)Sparsest ' 'cut C2Minimum cut ' 'C1abdefghijklcFigure11.14AgraphGandtwocuts.'}, 'score': 0.0, 'values': []}, {'id': 'f9b7e01a-6db3-4b97-8a3e-52573ecaf4a9-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 567.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page530#34530Chapter11AdvancedClusterAnalysisAcutissparsestifitssparsityisnotgreaterthanthesparsityofanyothercut.Theremaybemorethanonesparsestcut.InExample11.21andFigure11.14,C2isasparsestcut.Usingsparsityastheobjectivefunction,asparsestcuttriestominimizethenumberofedgescrossingthepartitionsandbalancethepartitionsinsize.ConsideraclusteringonagraphG=(V,E)thatpartitionsthegraphintokclusters.ThemodularityofaclusteringassessesthequalityoftheclusteringandisdefinedasQ=k(cid:88)i=1(cid:32)li|E|−(cid:18)di2|E|(cid:19)2(cid:33),(11.39)whereliisthenumberofedgesbetweenverticesintheithcluster,anddiisthesumofthedegreesoftheverticesintheithcluster.Themodularityofaclusteringofagraphisthedifferencebetweenthefractionofalledgesthatfallintoindividualclustersandthefractionthatwoulddosoifthegraphverticeswererandomlyconnected.Theoptimalclusteringofgraphsmaximizesthemodularity.Theoretically,manygraphclusteringproblemscanberegardedasfindinggoodcuts,suchasthesparsestcuts,onthegraph.Inpractice,however,anumberofchallengesexist:Highcomputationalcost:Manygraphcutproblemsarecomputationallyexpen-sive.Thesparsestcutproblem,forexample,isNP-hard.Therefore,findingtheoptimalsolutionsonlargegraphsisoftenimpossible.Agoodtrade-offbetweenefficiency/scalabilityandqualityhastobeachieved.Sophisticatedgraphs:Graphscanbemoresophisticatedthantheonesdescribedhere,involvingweightsand/orcycles.Highdimensionality:Agraphcanhavemanyvertices.Inasimilaritymatrix,avertexisrepresentedasavector(arowinthematrix)withadimensionalitythatisthenumberofverticesinthegraph.Therefore,graphclusteringmethodsmusthandlehighdimensionality.Sparsity:Alargegraphisoftensparse,meaningeachvertexonaverageconnectstoonlyasmallnumberofothervertices.Asimilaritymatrixfromalargesparsegraphcanalsobesparse.Therearetwokindsofmethodsforclusteringgraphdata,whichaddressthesechallenges.Oneusesclusteringmethodsforhigh-dimensionaldata,whiletheotherisdesignedspecificallyforclusteringgraphs.Thefirstgroupofmethodsisba'}, 'score': 0.0, 'values': []}, {'id': 'f9b7e01a-6db3-4b97-8a3e-52573ecaf4a9-1', 'metadata': {'chunk': 1.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 567.0, 'text': 'stgroupofmethodsisbasedongenericclusteringmethodsforhigh-dimensionaldata.TheyextractasimilaritymatrixfromagraphusingasimilaritymeasuresuchasthosediscussedinSection11.3.2.Agenericclusteringmethodcanthenbeappliedonthesimilaritymatrixtodiscoverclusters.Clusteringmethodsfor'}, 'score': 0.0, 'values': []}, {'id': '996ffcd2-6e4a-4419-8e21-e975be411e84-0', 'metadata': {'chunk': 0.0, 'file_name': 'Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf', 'page': 568.0, 'text': 
'HAN18-ch11-497-542-97801238147912011/6/13:24Page531#3511.3ClusteringGraphandNetworkData531high-dimensionaldataaretypicallyemployed.Forexample,inmanyscenarios,onceasimilaritymatrixisobtained,spectralclusteringmethods(Section11.2.4)canbeapplied.Spectralclusteringcanapproximateoptimalgraphcutsolutions.Foradditionalinformation,pleaserefertothebibliographicnotes(Section11.7).Thesecondgroupofmethodsisspecifictographs.Theysearchthegraphtofindwell-connectedcomponentsasclusters.Let’slookatamethodcalledSCAN(StructuralClusteringAlgorithmforNetworks)asanexample.Givenanundirectedgraph,G=(V,E),foravertex,u∈V,theneighborhoodofuis(cid:48)(u)={v|(u,v)∈E}∪{u}.Usingtheideaofstructural-contextsimilarity,SCANmeasuresthesimilaritybetweentwovertices,u,v∈V,bythenormalizedcommonneighborhoodsize,thatis,σ(u,v)=|(cid:48)(u)∩(cid:48)(v)|√|(cid:48)(u)||(cid:48)(v)|.(11.40)Thelargerthevaluecomputed,themoresimilarthetwovertices.SCANusesasimilaritythresholdεtodefinetheclustermembership.Foravertex,u∈V,theε-neighborhoodofuisdefinedasNε(u)={v∈(cid:48)(u)|σ(u,v)≥ε}.Theε-neighborhoodofucontainsallneighborsofuwithastructural-contextsimilaritytouthatisatleastε.InSCAN,acorevertexisavertexinsideofacluster.Thatis,u∈Visacorever-texif|Nε(u)|≥µ,whereµisapopularitythreshold.SCANgrowsclustersfromcorevertices.Ifavertexvisintheε-neighborhoodofacoreu,thenvisassignedtothesameclusterasu.Thisprocessofgrowingclusterscontinuesuntilnoclustercanbefurthergrown.Theprocessissimilartothedensity-basedclusteringmethod,DBSCAN(Chapter10).Formally,avertexvcanbedirectlyreachedfromacoreuifv∈Nε(u).Transitively,avertexvcanbereachedfromacoreuifthereexistverticesw1,...,wnsuchthatw1canbereachedfromu,wicanbereachedfromwi−1for10,ausercanspecifythetoleranceofaveragenoiseperelementagainstaperfectbicluster,becauseinEq.(11.19)theresidueoneachelementisresidue(eij)=eij−eiJ−eIj+eIJ.(11.20)Amaximalδ-biclusterisaδ-biclusterI×Jsuchthattheredoesnotexistanotherδ-biclusterI(cid:48)×J(cid:48),andI⊆I(cid:48),J⊆J(cid:48),andatleastoneinequalityholds.Findingthe #################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 555 Context: 
maximal δ-bicluster of the largest size is computationally costly. Therefore, we can use a heuristic greedy search method to obtain a local optimal cluster. The algorithm works in two phases.

In the deletion phase, we start from the whole matrix. While the mean-squared residue of the matrix is over δ, we iteratively remove rows and columns. At each iteration, for each row i, we compute the mean-squared residue as

\[
d(i) = \frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2. \tag{11.21}
\]

Moreover, for each column j, we compute the mean-squared residue as

\[
d(j) = \frac{1}{|I|} \sum_{i \in I} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2. \tag{11.22}
\]

We remove the row or column of the largest mean-squared residue. At the end of this phase, we obtain a submatrix \(I \times J\) that is a δ-bicluster. However, the submatrix may not be maximal.

In the addition phase, we iteratively expand the δ-bicluster \(I \times J\) obtained in the deletion phase as long as the δ-bicluster requirement is maintained. At each iteration, we consider rows and columns that are not involved in the current bicluster \(I \times J\) by calculating their mean-squared residues. A row or column of the smallest mean-squared residue is added into the current δ-bicluster.

This greedy algorithm can find only one δ-bicluster. To find multiple biclusters that do not have heavy overlaps, we can run the algorithm multiple times. After each execution where a δ-bicluster is output, we can replace the elements in the output bicluster by random numbers. Although the greedy algorithm may find neither the optimal biclusters nor all biclusters, it is very fast even on large matrices.
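As a rough illustration of the deletion phase just described, the sketch below computes the mean-squared residue of the current submatrix and repeatedly removes the row or column with the largest residue, following Eqs. (11.21) and (11.22), until the δ threshold is met. It assumes the data is a complete NumPy matrix; the function names, the tie-breaking, and the stopping details are illustrative rather than a reference implementation.

```python
import numpy as np

def mean_squared_residue(M):
    """Average squared residue of a submatrix M: e_ij - e_iJ - e_Ij + e_IJ."""
    row_means = M.mean(axis=1, keepdims=True)   # e_iJ
    col_means = M.mean(axis=0, keepdims=True)   # e_Ij
    overall = M.mean()                          # e_IJ
    return ((M - row_means - col_means + overall) ** 2).mean()

def deletion_phase(E, delta):
    """Greedily drop rows/columns until the remaining submatrix is a delta-bicluster.

    Returns the surviving row indices I and column indices J.
    """
    I = list(range(E.shape[0]))
    J = list(range(E.shape[1]))
    while len(I) > 1 and len(J) > 1:
        M = E[np.ix_(I, J)]
        if mean_squared_residue(M) <= delta:
            break
        residue = M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()
        d_rows = (residue ** 2).mean(axis=1)    # d(i), Eq. (11.21)
        d_cols = (residue ** 2).mean(axis=0)    # d(j), Eq. (11.22)
        if d_rows.max() >= d_cols.max():
            I.pop(int(d_rows.argmax()))         # remove the worst row
        else:
            J.pop(int(d_cols.argmax()))         # remove the worst column
    return I, J
```

The addition phase would mirror this loop: scan the rows and columns outside \(I \times J\) and admit the one with the smallest mean-squared residue, as long as the δ requirement still holds.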
**Enumerating All Biclusters Using MaPle**

As mentioned, a submatrix \(I \times J\) is a bicluster with coherent values if and only if for any \(i_1, i_2 \in I\) and \(j_1, j_2 \in J\), \(e_{i_1 j_1} - e_{i_2 j_1} = e_{i_1 j_2} - e_{i_2 j_2}\). For any \(2 \times 2\) submatrix of \(I \times J\), we can define a p-score as

\[
\text{p-score}\begin{pmatrix} e_{i_1 j_1} & e_{i_1 j_2} \\ e_{i_2 j_1} & e_{i_2 j_2} \end{pmatrix} = \left| (e_{i_1 j_1} - e_{i_2 j_1}) - (e_{i_1 j_2} - e_{i_2 j_2}) \right|. \tag{11.23}
\]

A submatrix \(I \times J\) is a δ-pCluster (for pattern-based cluster) if the p-score of every \(2 \times 2\) submatrix of \(I \times J\) is at most δ, where \(\delta \geq 0\) is a threshold specifying a user's tolerance of noise against a perfect bicluster. Here, the p-score controls the noise on every element in a bicluster, while the mean-squared residue captures the average noise.
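The δ-pCluster condition is easy to check directly, since it only involves 2 × 2 submatrices. The brute-force sketch below evaluates the p-score of Eq. (11.23) for every pair of rows and columns; the example matrix and thresholds are made up for illustration.

```python
import itertools
import numpy as np

def p_score(e11, e12, e21, e22):
    """p-score of a 2x2 submatrix (Eq. 11.23)."""
    return abs((e11 - e21) - (e12 - e22))

def is_delta_pcluster(E, rows, cols, delta):
    """True if the submatrix E[rows, cols] is a delta-pCluster:
    every 2x2 submatrix must have a p-score of at most delta."""
    for i1, i2 in itertools.combinations(rows, 2):
        for j1, j2 in itertools.combinations(cols, 2):
            if p_score(E[i1, j1], E[i1, j2], E[i2, j1], E[i2, j2]) > delta:
                return False
    return True

# Rows 0 and 1 differ by a constant, so together they form a perfect pCluster.
E = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0],
              [1.5, 3.5, 5.6]])
print(is_delta_pcluster(E, [0, 1], [0, 1, 2], delta=0.0))      # True
print(is_delta_pcluster(E, [0, 1, 2], [0, 1, 2], delta=0.05))  # False (row 2 deviates by 0.1)
```

MaPle avoids this kind of exhaustive checking over all candidate submatrices by exploiting the monotonicity property discussed next.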
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 555 Context: An interesting property of δ-pClusters is that if \(I \times J\) is a δ-pCluster, then every \(x \times y\) (\(x, y \geq 2\)) submatrix of \(I \times J\) is also a δ-pCluster. This monotonicity enables

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 556 Context: us to obtain a succinct representation of nonredundant δ-pClusters. A δ-pCluster is maximal if no more rows or columns can be added into the cluster while maintaining the δ-pCluster property. To avoid redundancy, instead of finding all δ-pClusters, we only need to compute all maximal δ-pClusters.

MaPle is an algorithm that enumerates all maximal δ-pClusters. It systematically enumerates every combination of conditions using a set enumeration tree and a depth-first search. This enumeration framework is the same as the pattern-growth methods for frequent pattern mining (Chapter 6). Consider gene expression data. For each condition combination, J, MaPle finds the maximal subsets of genes, I, such that \(I \times J\) is a δ-pCluster. If \(I \times J\) is not a submatrix of another δ-pCluster, then \(I \times J\) is a maximal δ-pCluster.

There may be a huge number of condition combinations. MaPle prunes many unfruitful combinations using the monotonicity of δ-pClusters. For a condition combination, J, if there does not exist a set of genes, I, such that \(I \times J\) is a δ-pCluster, then we do not need to consider any superset of J. Moreover, we should consider \(I \times J\) as a candidate of a δ-pCluster only if for every \((|J| - 1)\)-subset \(J'\) of J, \(I \times J'\) is a δ-pCluster. MaPle also employs several pruning techniques to speed up the search while retaining the completeness of returning all maximal δ-pClusters. For example, when examining a current δ-pCluster, \(I \times J\), MaPle collects all the genes and conditions that may be added to expand the cluster. If these candidate genes and conditions together with I and J form a submatrix of a δ-pCluster that has already been found, then the search of \(I \times J\) and any superset of J can be pruned. Interested readers may refer to the bibliographic notes for additional information on the MaPle algorithm (Section 11.7).

An interesting observation here is that the search for maximal δ-pClusters in MaPle is somewhat similar to mining frequent closed itemsets. Consequently, MaPle borrows the depth-first search framework and ideas from the pruning techniques of pattern-growth methods for frequent pattern mining. This is an example where frequent pattern mining and cluster analysis may share similar techniques and ideas. An advantage of MaPle and the

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 556 Context: An advantage of MaPle and the other algorithms that enumerate all biclusters is that they guarantee the completeness of the results and do not miss any overlapping biclusters. However, a challenge for such enumeration algorithms is that they may become very time consuming if a matrix becomes very large, such as a customer-purchase matrix of hundreds of thousands of customers and millions of products.

**11.2.4 Dimensionality Reduction Methods and Spectral Clustering**

Subspace clustering methods try to find clusters in subspaces of the original data space. In some situations, it is more effective to construct a new space instead of using subspaces of the original data. This is the motivation behind dimensionality reduction methods for clustering high-dimensional data.

**Example 11.14 Clustering in a derived space.** Consider the three clusters of points in Figure 11.9. It is not possible to cluster these points in any subspace of the original space, \(X \times Y\), because

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 557 Context: Figure 11.9 Clustering in a derived space may be more effective.

all three clusters would end up being projected onto overlapping areas in the x and y axes. What if, instead, we construct a new dimension, \(-\frac{x^2}{2} + \frac{y^2}{2}\) (shown as a dashed line in the figure)? By projecting the points onto this new dimension, the three clusters become apparent.

Although Example 11.14 involves only two dimensions, the idea of constructing a new space (so that any clustering structure that is hidden in the data becomes well manifested) can be extended to high-dimensional data. Preferably, the newly constructed space should have low dimensionality.

There are many dimensionality reduction methods. A straightforward approach is to apply feature selection and extraction methods to the data set as those discussed in Chapter 3. However, such methods may not be able to detect the clustering structure. Therefore, methods that combine feature extraction and clustering are preferred. In this section, we introduce **spectral clustering**, a group of methods that are effective in high-dimensional data applications.

Figure 11.10 shows the general framework for spectral clustering approaches. The Ng-Jordan-Weiss algorithm is a spectral clustering method. Let's have a look at each step of the framework. In doing so, we also note special conditions that apply to the Ng-Jordan-Weiss algorithm as an example.

Given a set of objects \(o_1, o_2, \ldots, o_n\), the distance between each pair of objects, \(dist(o_i, o_j)\) \((1 \leq i < j \leq n)\), and the desired number \(k\) of clusters, a spectral clustering approach works as follows:

1. Using the distance measure, calculate an affinity matrix, \(W\), such that

   \[
   W_{ij} = e^{-\frac{(dist(o_i, o_j))^2}{\sigma^2}},
   \]

   where \(\sigma\) is a scaling parameter that controls how fast the affinity \(W_{ij}\) decreases as \(dist(o_i, o_j)\) increases. In the Ng-Jordan-Weiss algorithm, \(W_{ii}\) is set to 0.

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 558 Context: Figure 11.10 The framework of spectral clustering approaches: data → affinity matrix \([w_{ij}]\) → \(A = f(W)\) → compute the k leading eigenvectors of A (\(Av = \lambda v\)) → clustering in the new space → project back to cluster the original data. Source: Adapted from Slide 8 at http://videolectures.net/micued08azranmcl/.

2. Using the affinity matrix \(W\), derive a matrix \(A = f(W)\). The way in which this is done can vary. The Ng-Jordan-Weiss algorithm defines a matrix, \(D\), as a diagonal matrix such that \(D_{ii}\) is the sum of the ith row of \(W\), that is,

   \[
   D_{ii} = \sum_{j=1}^{n} W_{ij}. \tag{11.24}
   \]

   \(A\) is then set to

   \[
   A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}. \tag{11.25}
   \]

3. Find the \(k\) leading eigenvectors of \(A\). Recall that the eigenvectors of a square matrix are the nonzero vectors that remain proportional to the original vector after being multiplied by the matrix. Mathematically, a vector \(v\) is an eigenvector of matrix \(A\) if \(Av = \lambda v\), where \(\lambda\) is called the corresponding eigenvalue. This step derives \(k\) new dimensions from \(A\), which are based on the affinity matrix \(W\). Typically, \(k\) should be much smaller than the dimensionality of the original data. The Ng-Jordan-Weiss algorithm computes the \(k\) eigenvectors with the largest eigenvalues \(x_1, \ldots, x_k\) of \(A\).

4. Using the \(k\) leading eigenvectors, project the original data into the new space defined by the \(k\) leading eigenvectors, and run a clustering algorithm such as k-means to find \(k\) clusters. The Ng-Jordan-Weiss algorithm stacks the \(k\) largest eigenvectors in columns to form a matrix \(X = [x_1\, x_2\, \cdots\, x_k] \in \mathbb{R}^{n \times k}\). The algorithm forms a matrix \(Y\) by renormalizing each row in \(X\) to have unit length, that is,

   \[
   Y_{ij} = \frac{X_{ij}}{\sqrt{\sum_{j=1}^{k} X_{ij}^2}}. \tag{11.26}
   \]

   The algorithm then treats each row in \(Y\) as a point in the k-dimensional space \(\mathbb{R}^k\), and runs k-means (or any other algorithm serving the partitioning purpose) to cluster the points into \(k\) clusters.
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 559 Context:

Figure 11.11 The new dimensions and the clustering results of the Ng-Jordan-Weiss algorithm. Source: Adapted from Slide 9 at http://videolectures.net/micued08azranmcl/.

5. Assign the original data points to clusters according to how the transformed points are assigned in the clusters obtained in step 4.

   In the Ng-Jordan-Weiss algorithm, the original object \( o_i \) is assigned to the jth cluster if and only if the ith row of matrix \( Y \) is assigned to the jth cluster as a result of step 4 (a compact end-to-end sketch of these steps appears below).

In spectral clustering methods, the dimensionality of the new space is set to the desired number of clusters. This setting expects that each new dimension should be able to manifest a cluster.

### Example 11.15

**The Ng-Jordan-Weiss algorithm.** Consider the set of points in Figure 11.11. The data set, the affinity matrix, the three largest eigenvectors, and the normalized vectors are shown. Note that with the three new dimensions (formed by the three largest eigenvectors), the clusters are easily detected.

Spectral clustering is effective in high-dimensional applications such as image processing. Theoretically, it works well when certain conditions apply. Scalability, however, is a challenge. Computing eigenvectors on a large matrix is costly. Spectral clustering can be combined with other clustering methods, such as biclustering. Additional information on other dimensionality reduction clustering methods, such as kernel PCA, can be found in the bibliographic notes (Section 11.7).

## 11.3 Clustering Graph and Network Data

Cluster analysis on graph and network data extracts valuable knowledge and information. Such data are increasingly popular in many applications. We discuss applications and challenges of clustering graph and network data in Section 11.3.1. Similarity measures for this form of clustering are given in Section 11.3.2. You will learn about graph clustering methods in Section 11.3.3.

In general, the terms graph and network can be used interchangeably. In the rest of this section, we mainly use the term **graph**.
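Before turning to graph and network data, the sketch below wires steps 1 through 5 together: build the Gaussian affinity matrix with a zero diagonal, form \(A = D^{-1/2} W D^{-1/2}\), take the \(k\) leading eigenvectors, renormalize the rows, cluster the rows of \(Y\) with k-means, and carry the labels back to the original objects. It is a minimal sketch assuming NumPy and scikit-learn are available and that a full pairwise distance matrix fits in memory; the helper name and the toy data are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_cluster(dist, k, sigma):
    """Sketch of the spectral clustering framework of Figure 11.10 (steps 1-5)."""
    # Step 1: Gaussian affinity from pairwise distances; diagonal set to 0.
    W = np.exp(-(dist ** 2) / (sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: A = D^{-1/2} W D^{-1/2}, D being the diagonal row-sum matrix (Eqs. 11.24-11.25).
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Step 3: the k eigenvectors of A with the largest eigenvalues (A is symmetric).
    eigvals, eigvecs = np.linalg.eigh(A)
    X = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: renormalize each row of X to unit length (Eq. 11.26) and run k-means on the rows.
    Y = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Y)
    # Step 5: object o_i belongs to cluster j exactly when row i of Y is assigned to cluster j.
    return labels

# Toy data: three well-separated 2-D blobs (hypothetical).
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in (0.0, 3.0, 6.0)])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(njw_cluster(dist, k=3, sigma=1.0))
```

On data like this a plain k-means in the original space would also work; the point of the embedding is that it still behaves well when the clusters are only separable in a derived space, as in Example 11.14.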
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 560 Context:

### 11.3.1 Applications and Challenges

As a customer relationship manager at AllElectronics, you notice that a lot of data relating to customers and their purchase behavior can be preferably modeled using graphs.

**Example 11.16 Bipartite graph.** The customer purchase behavior at AllElectronics can be represented in a bipartite graph. In a bipartite graph, vertices can be divided into two disjoint sets so that each edge connects a vertex in one set to a vertex in the other set. For the AllElectronics customer purchase data, one set of vertices represents customers, with one customer per vertex. The other set represents products, with one product per vertex. An edge connects a customer to a product, representing the purchase of the product by the customer. Figure 11.12 shows an illustration.

"What kind of knowledge can we obtain by a cluster analysis of the customer-product bipartite graph?" By clustering the customers such that those customers buying similar sets of products are placed into one group, a customer relationship manager can make product recommendations. For example, suppose Ada belongs to a customer cluster in which most of the customers purchased a digital camera in the last 12 months, but Ada has yet to purchase one. As manager, you decide to recommend a digital camera to her.

Alternatively, we can cluster products such that those products purchased by similar sets of customers are grouped together. This clustering information can also be used for product recommendations. For example, if a digital camera and a high-speed flash memory card belong to the same product cluster, then when a customer purchases a digital camera, we can recommend the high-speed flash memory card.

Bipartite graphs are widely used in many applications. Consider another example.

**Example 11.17 Web search engines.** In web search engines, search logs are archived to record user queries and the corresponding click-through information. (The click-through information tells us on which pages, given as a result of a search, the user clicked.) The query and click-through information can be represented using a bipartite graph, where the two sets

Figure 11.12 Bipartite graph representing customer-purchase data (one vertex set for customers, the other for products).
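As a toy illustration of the customer-product bipartite view, the snippet below stores purchases as sets and groups customers whose product sets overlap strongly. The data, the Jaccard measure, and the greedy grouping are all made up for illustration; the chapter's point is only that clustering such a graph supports recommendations, not that this particular heuristic should be used.

```python
# Hypothetical customer-product purchase data (the bipartite edges).
purchases = {
    "Ada": {"camera", "memory card", "tripod"},
    "Bob": {"camera", "memory card"},
    "Cid": {"laptop", "mouse"},
    "Dee": {"laptop", "mouse", "keyboard"},
}

def jaccard(a, b):
    """Overlap of the product sets bought by two customers."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Greedily group customers whose purchase sets are similar (threshold is arbitrary).
groups = []
for name in purchases:
    for group in groups:
        if any(jaccard(purchases[name], purchases[member]) >= 0.5 for member in group):
            group.append(name)
            break
    else:
        groups.append([name])

print(groups)   # e.g., [['Ada', 'Bob'], ['Cid', 'Dee']]
```

A recommendation then follows the text's logic: products popular inside Ada's group but missing from her own purchase set become candidates to suggest to her.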
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 561 Context:

of vertices correspond to queries and web pages, respectively. An edge links a query to a web page if a user clicks the web page when asking the query. Valuable information can be obtained by cluster analyses on the query–web page bipartite graph. For instance, we may identify queries posed in different languages, but that mean the same thing, if the click-through information for each query is similar.

As another example, all the web pages on the Web form a directed graph, also known as the web graph, where each web page is a vertex, and each hyperlink is an edge pointing from a source page to a destination page. Cluster analysis on the web graph can disclose communities, find hubs and authoritative web pages, and detect web spams.

In addition to bipartite graphs, cluster analysis can also be applied to other types of graphs, including general graphs, as elaborated in Example 11.18.

**Example 11.18 Social network.** A social network is a social structure. It can be represented as a graph, where the vertices are individuals or organizations, and the links are interdependencies between the vertices, representing friendship, common interests, or collaborative activities. AllElectronics' customers form a social network, where each customer is a vertex, and an edge links two customers if they know each other. As customer relationship manager, you are interested in finding useful information that can be derived from AllElectronics' social network through cluster analysis. You obtain clusters from the network, where customers in a cluster know each other or have friends in common. Customers within a cluster may influence one another regarding purchase decision making. Moreover, communication channels can be designed to inform the "heads" of clusters (i.e., the "best" connected people in the clusters), so that promotional information can be spread out quickly. Thus, you may use customer clustering to promote sales at AllElectronics.

As another example, the authors of scientific publications form a social network, where the authors are vertices and two authors are connected by an edge if they coauthored a publication. The network is, in general, a weighted graph because an edge between two authors can carry a weight representing the strength

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 561 Context: representing the strength of the collaboration such as how many publications the two authors (as the end vertices) coauthored. Clustering the coauthor network provides insight as to communities of authors and patterns of collaboration.

"Are there any challenges specific to cluster analysis on graph and network data?" In most of the clustering methods discussed so far, objects are represented using a set of attributes. A unique feature of graph and network data is that only objects (as vertices) and relationships between them (as edges) are given. No dimensions or attributes are explicitly defined. To conduct cluster analysis on graph and network data, there are two major new challenges.

"How can we measure the similarity between two objects on a graph accordingly?" Typically, we cannot use conventional distance measures, such as Euclidean distance. Instead, we need to develop new measures to quantify the similarity. Such

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 562 Context:
measures often are not metric, and thus raise new challenges regarding the development of efficient clustering methods. Similarity measures for graphs are discussed in Section 11.3.2.

"How can we design clustering models and methods that are effective on graph and network data?" Graph and network data are often complicated, carrying topological structures that are more sophisticated than traditional cluster analysis applications. Many graph data sets are large, such as the web graph containing at least tens of billions of web pages in the publicly indexable Web. Graphs can also be sparse where, on average, a vertex is connected to only a small number of other vertices in the graph. To discover accurate and useful knowledge hidden deep in the data, a good clustering method has to accommodate these factors. Clustering methods for graph and network data are introduced in Section 11.3.3.

### 11.3.2 Similarity Measures

"How can we measure the similarity or distance between two vertices in a graph?" In our discussion, we examine two types of measures: geodesic distance and distance based on random walk.

**Geodesic Distance**

A simple measure of the distance between two vertices in a graph is the shortest path between the vertices. Formally, the geodesic distance between two vertices is the length in terms of the number of edges of the shortest path between the vertices. For two vertices that are not connected in a graph, the geodesic distance is defined as infinite.

Using geodesic distance, we can define several other useful measurements for graph analysis and clustering. Given a graph \(G = (V, E)\), where \(V\) is the set of vertices and \(E\) is the set of edges, we define the following:

- For a vertex \(v \in V\), the eccentricity of \(v\), denoted \(eccen(v)\), is the largest geodesic distance between \(v\) and any other vertex \(u \in V - \{v\}\). The eccentricity of \(v\) captures how far away \(v\) is from its remotest vertex in the graph.
- The radius of graph \(G\) is the minimum eccentricity of all vertices. That is,
  \[
  r = \min_{v \in V} eccen(v). \tag{11.27}
  \]
  The radius captures the distance between the "most central point" and the "farthest border" of the graph.
- The diameter of graph \(G\) is the maximum eccentricity of all vertices. That is,
  \[
  d = \max_{v \in V} eccen(v). \tag{11.28}
  \]
  The diameter represents the largest distance between

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 562 Context: the largest distance between any pair of vertices. A peripheral vertex is a vertex that achieves the diameter.
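The definitions above translate directly into breadth-first search on an unweighted, undirected graph. The sketch below computes geodesic distances, eccentricity, radius, and diameter for a graph given as an adjacency dictionary; the edge set in the example is assumed for illustration (the figure itself is not reproduced here) but reproduces the numbers quoted in Example 11.19 on the following page.

```python
from collections import deque

def geodesic_distances(adj, source):
    """Shortest path length (number of edges) from source to each reachable vertex (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, v):
    """Largest geodesic distance from v to any other vertex; infinite if G is disconnected."""
    dist = geodesic_distances(adj, v)
    if len(dist) < len(adj):
        return float("inf")
    return max(d for u, d in dist.items() if u != v)

def radius_and_diameter(adj):
    ecc = {v: eccentricity(adj, v) for v in adj}
    return min(ecc.values()), max(ecc.values()), ecc

# Assumed edge set consistent with Example 11.19:
# eccen(a) = eccen(b) = 2, eccen(c) = eccen(d) = eccen(e) = 3, radius 2, diameter 3.
G = {"a": ["b", "c"], "b": ["a", "d", "e"], "c": ["a"], "d": ["b"], "e": ["b"]}
radius, diameter, ecc = radius_and_diameter(G)
print(ecc, radius, diameter)
```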
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 563 Context:

Figure 11.13 A graph, G, where vertices c, d, and e are peripheral.

**Example 11.19 Measurements based on geodesic distance.** Consider graph G in Figure 11.13. The eccentricity of a is 2, that is, \(eccen(a) = 2\), \(eccen(b) = 2\), and \(eccen(c) = eccen(d) = eccen(e) = 3\). Thus, the radius of G is 2, and the diameter is 3. Note that it is not necessary that \(d = 2 \times r\). Vertices c, d, and e are peripheral vertices.

**SimRank: Similarity Based on Random Walk and Structural Context**

For some applications, geodesic distance may be inappropriate in measuring the similarity between vertices in a graph. Here we introduce SimRank, a similarity measure based on random walk and on the structural context of the graph. In mathematics, a random walk is a trajectory that consists of taking successive random steps.

**Example 11.20 Similarity between people in a social network.** Let's consider measuring the similarity between two vertices in the AllElectronics customer social network of Example 11.18. Here, similarity can be explained as the closeness between two participants in the network, that is, how close two people are in terms of the relationship represented by the social network.

"How well can the geodesic distance measure similarity and closeness in such a network?" Suppose Ada and Bob are two customers in the network, and the network is undirected. The geodesic distance (i.e., the length of the shortest path between Ada and Bob) is the shortest path that a message can be passed from Ada to Bob and vice versa. However, this information is not useful for AllElectronics' customer relationship management because the company typically does not want to send a specific message from one customer to another. Therefore, geodesic distance does not suit the application.

"What does similarity mean in a social network?" We consider two ways to define similarity:

Two customers are considered similar to one another if they have similar neighbors in the social network. This heuristic is intuitive because, in practice, two people receiving recommendations from a good number of common friends often make similar decisions. This kind of similarity is based on the local structure (i.e., the neighborhoods) of the vertices, and thus is called structural context–based similarity.
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 564 Context:

Suppose AllElectronics sends promotional information to both Ada and Bob in the social network. Ada and Bob may randomly forward such information to their friends (or neighbors) in the network. The closeness between Ada and Bob can then be measured by the likelihood that other customers simultaneously receive the promotional information that was originally sent to Ada and Bob. This kind of similarity is based on the random walk reachability over the network, and thus is referred to as similarity based on random walk.

Let's have a closer look at what is meant by similarity based on structural context, and similarity based on random walk.

The intuition behind similarity based on structural context is that two vertices in a graph are similar if they are connected to similar vertices. To measure such similarity, we need to define the notion of individual neighborhood. In a directed graph \(G = (V, E)\), where \(V\) is the set of vertices and \(E \subseteq V \times V\) is the set of edges, for a vertex \(v \in V\), the individual in-neighborhood of \(v\) is defined as

\[
I(v) = \{u \mid (u, v) \in E\}. \tag{11.29}
\]

Symmetrically, we define the individual out-neighborhood of \(v\) as

\[
O(v) = \{w \mid (v, w) \in E\}. \tag{11.30}
\]

Following the intuition illustrated in Example 11.20, we define SimRank, a structural-context similarity, with a value that is between 0 and 1 for any pair of vertices. For any vertex, \(v \in V\), the similarity between the vertex and itself is \(s(v, v) = 1\) because the neighborhoods are identical. For vertices \(u, v \in V\) such that \(u \neq v\), we can define

\[
s(u, v) = \frac{C}{|I(u)||I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s(x, y), \tag{11.31}
\]

where \(C\) is a constant between 0 and 1. A vertex may not have any in-neighbors. Thus, we define Eq. (11.31) to be 0 when either \(I(u)\) or \(I(v)\) is \(\emptyset\). Parameter \(C\) specifies the rate of decay as similarity is propagated across edges.

"How can we compute SimRank?" A straightforward method iteratively evaluates Eq. (11.31) until a fixed point is reached. Let \(s_i(u, v)\) be the SimRank score calculated at the ith round. To begin, we set

\[
s_0(u, v) = \begin{cases} 0 & \text{if } u \neq v \\ 1 & \text{if } u = v. \end{cases} \tag{11.32}
\]

We use Eq. (11.31) to compute \(s_{i+1}\) from \(s_i\) as

\[
s_{i+1}(u, v) = \frac{C}{|I(u)||I(v)|} \sum_{x \in I(u)} \sum_{y \in I(v)} s_i(x, y). \tag{11.33}
\]
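The fixed-point iteration of Eqs. (11.31) through (11.33) can be written down almost verbatim. The sketch below runs it on a directed edge list with a fixed number of rounds instead of an explicit convergence test; the decay constant and the tiny example graph are arbitrary.

```python
def simrank(edges, C=0.8, iters=10):
    """Iterative SimRank for a directed graph given as (u, v) edge pairs."""
    nodes = sorted({x for edge in edges for x in edge})
    in_nbrs = {v: [] for v in nodes}
    for u, v in edges:
        in_nbrs[v].append(u)                      # I(v), Eq. (11.29)
    # s_0: 1 on the diagonal, 0 elsewhere (Eq. 11.32).
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        s_next = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    s_next[(u, v)] = 1.0
                elif not in_nbrs[u] or not in_nbrs[v]:
                    s_next[(u, v)] = 0.0          # convention when I(u) or I(v) is empty
                else:
                    total = sum(s[(x, y)] for x in in_nbrs[u] for y in in_nbrs[v])
                    s_next[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        s = s_next                                # Eq. (11.33)
    return s

# Two "parents" both pointing at two "children": the children share all in-neighbors.
edges = [("p1", "c1"), ("p1", "c2"), ("p2", "c1"), ("p2", "c2")]
print(round(simrank(edges)[("c1", "c2")], 3))
```

The quadratic number of vertex pairs makes this direct iteration impractical for large graphs, which is one reason the approximation methods referenced next are of practical interest.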
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 565 Context:

It can be shown that \(\lim_{i \to \infty} s_i(u, v) = s(u, v)\). Additional methods for approximating SimRank are given in the bibliographic notes (Section 11.7).

Now, let's consider similarity based on random walk. A directed graph is strongly connected if, for any two nodes \(u\) and \(v\), there is a path from \(u\) to \(v\) and another path from \(v\) to \(u\). In a strongly connected graph, \(G = (V, E)\), for any two vertices, \(u, v \in V\), we can define the expected distance from \(u\) to \(v\) as

\[
d(u, v) = \sum_{t: u \rightsquigarrow v} P[t]\, l(t), \tag{11.34}
\]

where \(u \rightsquigarrow v\) is a path starting from \(u\) and ending at \(v\) that may contain cycles but does not reach \(v\) until the end. For a traveling tour, \(t = w_1 \to w_2 \to \cdots \to w_k\), its length is \(l(t) = k - 1\). The probability of the tour is defined as

\[
P[t] = \begin{cases} \prod_{i=1}^{k-1} \frac{1}{|O(w_i)|} & \text{if } l(t) > 0 \\ 0 & \text{if } l(t) = 0. \end{cases} \tag{11.35}
\]

To measure the probability that a vertex \(w\) receives a message that originated simultaneously from \(u\) and \(v\), we extend the expected distance to the notion of expected meeting distance, that is,

\[
m(u, v) = \sum_{t: (u, v) \rightsquigarrow (x, x)} P[t]\, l(t), \tag{11.36}
\]

where \((u, v) \rightsquigarrow (x, x)\) is a pair of tours \(u \rightsquigarrow x\) and \(v \rightsquigarrow x\) of the same length. Using a constant \(C\) between 0 and 1, we define the expected meeting probability as

\[
p(u, v) = \sum_{t: (u, v) \rightsquigarrow (x, x)} P[t]\, C^{l(t)}, \tag{11.37}
\]

which is a similarity measure based on random walk. Here, the parameter \(C\) specifies the probability of continuing the walk at each step of the trajectory.

It has been shown that \(s(u, v) = p(u, v)\) for any two vertices, \(u\) and \(v\). That is, SimRank is based on both structural context and random walk.

### 11.3.3 Graph Clustering Methods

Let's consider how to conduct clustering on a graph. We first describe the intuition behind graph clustering. We then discuss two general categories of graph clustering methods.

To find clusters in a graph, imagine cutting the graph into pieces, each piece being a cluster, such that the vertices within a cluster are well connected and the vertices in different clusters are connected in a much weaker way. Formally, for a graph, \(G = (V, E)\),
#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 566 Context:

a cut, \(C = (S, T)\), is a partitioning of the set of vertices \(V\) in \(G\), that is, \(V = S \cup T\) and \(S \cap T = \emptyset\). The cut set of a cut is the set of edges, \(\{(u, v) \in E \mid u \in S, v \in T\}\). The size of the cut is the number of edges in the cut set. For weighted graphs, the size of a cut is the sum of the weights of the edges in the cut set.

"What kinds of cuts are good for deriving clusters in graphs?" In graph theory and some network applications, a minimum cut is of importance. A cut is minimum if the cut's size is not greater than any other cut's size. There are polynomial time algorithms to compute minimum cuts of graphs. Can we use these algorithms in graph clustering?

**Example 11.21 Cuts and clusters.** Consider graph G in Figure 11.14. The graph has two clusters: {a, b, c, d, e, f} and {g, h, i, j, k}, and one outlier vertex, l.

Consider cut \(C_1 = (\{a, b, c, d, e, f, g, h, i, j, k\}, \{l\})\). Only one edge, namely, (e, l), crosses the two partitions created by \(C_1\). Therefore, the cut set of \(C_1\) is {(e, l)} and the size of \(C_1\) is 1. (Note that the size of any cut in a connected graph cannot be smaller than 1.) As a minimum cut, \(C_1\) does not lead to a good clustering because it only separates the outlier vertex, l, from the rest of the graph.

Cut \(C_2 = (\{a, b, c, d, e, f, l\}, \{g, h, i, j, k\})\) leads to a much better clustering than \(C_1\). The edges in the cut set of \(C_2\) are those connecting the two "natural clusters" in the graph. Specifically, for edges (d, h) and (e, k) that are in the cut set, most of the edges connecting d, h, e, and k belong to one cluster.

Example 11.21 indicates that using a minimum cut is unlikely to lead to a good clustering. We are better off choosing a cut where, for each vertex u that is involved in an edge in the cut set, most of the edges connecting to u belong to one cluster. Formally, let deg(u) be the degree of u, that is, the number of edges connecting to u. The sparsity of a cut \(C = (S, T)\) is defined as

\[
\text{sparsity} = \frac{\text{cut size}}{\min\{|S|, |T|\}}. \tag{11.38}
\]

Figure 11.14 A graph G and two cuts: the minimum cut \(C_1\) and the sparsest cut \(C_2\).

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 567 Context:

A cut is sparsest if its sparsity is not greater than the sparsity of any other cut. There may be more than one sparsest cut. In Example 11.21 and Figure 11.14, \(C_2\) is a sparsest cut. Using sparsity as the objective function, a sparsest cut tries to minimize the number of edges crossing the partitions and balance the partitions in size.

Consider a clustering on a graph \(G = (V, E)\) that partitions the graph into \(k\) clusters. The modularity of a clustering assesses the quality of the clustering and is defined as

\[
Q = \sum_{i=1}^{k} \left( \frac{l_i}{|E|} - \left( \frac{d_i}{2|E|} \right)^2 \right), \tag{11.39}
\]

where \(l_i\) is the number of edges between vertices in the ith cluster, and \(d_i\) is the sum of the degrees of the vertices in the ith cluster. The modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connected. The optimal clustering of graphs maximizes the modularity.
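The three quantities just introduced are straightforward to evaluate for a candidate partition. The sketch below computes cut size, sparsity (Eq. 11.38), and modularity (Eq. 11.39) for an undirected edge list; the toy graph (two triangles joined by one bridge edge) and the partition are made up for illustration.

```python
def cut_size(edges, S):
    """Number of edges with exactly one endpoint in S."""
    S = set(S)
    return sum(1 for u, v in edges if (u in S) != (v in S))

def sparsity(edges, S, T):
    """Eq. (11.38): cut size divided by the size of the smaller side."""
    return cut_size(edges, S) / min(len(S), len(T))

def modularity(edges, clusters):
    """Eq. (11.39): sum over clusters of l_i/|E| - (d_i / (2|E|))^2."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    Q = 0.0
    for cluster in clusters:
        members = set(cluster)
        l_i = sum(1 for u, v in edges if u in members and v in members)
        d_i = sum(degree[v] for v in members)
        Q += l_i / m - (d_i / (2 * m)) ** 2
    return Q

# Two triangles {a, b, c} and {d, e, f} joined by the bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
S, T = {"a", "b", "c"}, {"d", "e", "f"}
print(cut_size(edges, S), sparsity(edges, S, T), modularity(edges, [S, T]))
```

Cutting the bridge gives a small, balanced cut and a clearly positive modularity; splitting one of the triangles instead would raise the cut size and lower Q, which is the behavior the two measures are meant to reward.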
Theoretically, many graph clustering problems can be regarded as finding good cuts, such as the sparsest cuts, on the graph. In practice, however, a number of challenges exist:

- **High computational cost:** Many graph cut problems are computationally expensive. The sparsest cut problem, for example, is NP-hard. Therefore, finding the optimal solutions on large graphs is often impossible. A good trade-off between efficiency/scalability and quality has to be achieved.
- **Sophisticated graphs:** Graphs can be more sophisticated than the ones described here, involving weights and/or cycles.
- **High dimensionality:** A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) with a dimensionality that is the number of vertices in the graph. Therefore, graph clustering methods must handle high dimensionality.
- **Sparsity:** A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices. A similarity matrix from a large sparse graph can also be sparse.

There are two kinds of methods for clustering graph data, which address these challenges. One uses clustering methods for high-dimensional data, while the other is designed specifically for clustering graphs. The first group of methods is based

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 567 Context: The first group of methods is based on generic clustering methods for high-dimensional data. They extract a similarity matrix from a graph using a similarity measure such as those discussed in Section 11.3.2. A generic clustering method can then be applied on the similarity matrix to discover clusters. Clustering methods for

#################### File: Data%20Mining%20Concepts%20and%20Techniques%20-%20Jiawei%20Han%2C%20Micheline%20Kamber%2C%20Jian%20Pei%20%28PDF%29.pdf Page: 568 Context:
high-dimensional data are typically employed. For example, in many scenarios, once a similarity matrix is obtained, spectral clustering methods (Section 11.2.4) can be applied. Spectral clustering can approximate optimal graph cut solutions. For additional information, please refer to the bibliographic notes (Section 11.7).

The second group of methods is specific to graphs. They search the graph to find well-connected components as clusters. Let's look at a method called SCAN (Structural Clustering Algorithm for Networks) as an example.

Given an undirected graph, \(G = (V, E)\), for a vertex, \(u \in V\), the neighborhood of \(u\) is \(\Gamma(u) = \{v \mid (u, v) \in E\} \cup \{u\}\). Using the idea of structural-context similarity, SCAN measures the similarity between two vertices, \(u, v \in V\), by the normalized common neighborhood size, that is,

\[
\sigma(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\sqrt{|\Gamma(u)||\Gamma(v)|}}. \tag{11.40}
\]

The larger the value computed, the more similar the two vertices. SCAN uses a similarity threshold \(\varepsilon\) to define cluster membership. For a vertex, \(u \in V\), the \(\varepsilon\)-neighborhood of \(u\) is defined as \(N_\varepsilon(u) = \{v \in \Gamma(u) \mid \sigma(u, v) \geq \varepsilon\}\). The \(\varepsilon\)-neighborhood of \(u\) contains all neighbors of \(u\) with a structural-context similarity to \(u\) that is at least \(\varepsilon\).

In SCAN, a core vertex is a vertex inside of a cluster. That is, \(u \in V\) is a core vertex if \(|N_\varepsilon(u)| \geq \mu\), where \(\mu\) is a popularity threshold. SCAN grows clusters from core vertices. If a vertex \(v\) is in the \(\varepsilon\)-neighborhood of a core \(u\), then \(v\) is assigned to the same cluster as \(u\). This process of growing clusters continues until no cluster can be further grown. The process is similar to the density-based clustering method, DBSCAN (Chapter 10).

Formally, a vertex \(v\) can be directly reached from a core \(u\) if \(v \in N_\varepsilon(u)\). Transitively, a vertex \(v\) can be reached from a core \(u\) if there exist vertices \(w_1, \ldots, w_n\) such that \(w_1\) can be reached from \(u\), \(w_i\) can be reached from \(w_{i-1}\) for \(1 < i \leq n\), and \(v\) can be reached from \(w_n\).
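To make the SCAN procedure concrete, the sketch below implements the core pieces for a small undirected graph stored as an adjacency dictionary: the structural similarity σ of Eq. (11.40), ε-neighborhoods, core detection, and growing clusters from cores by breadth-first expansion. Border handling is simplified (non-core members are labeled but not expanded), hubs and outliers are simply left unlabeled, and the thresholds are arbitrary.

```python
from collections import deque
from math import sqrt

def scan(adj, eps=0.7, mu=2):
    """Minimal SCAN-style clustering sketch; adj maps each vertex to a set of neighbors."""
    gamma = {u: set(adj[u]) | {u} for u in adj}          # closed neighborhood Gamma(u)

    def sigma(u, v):                                     # Eq. (11.40)
        return len(gamma[u] & gamma[v]) / sqrt(len(gamma[u]) * len(gamma[v]))

    eps_nbrs = {u: {v for v in adj[u] if sigma(u, v) >= eps} for u in adj}
    cores = {u for u in adj if len(eps_nbrs[u]) >= mu}   # popularity threshold mu

    label, cluster_id = {}, 0
    for seed in cores:
        if seed in label:
            continue
        cluster_id += 1
        label[seed] = cluster_id
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            if u not in cores:                           # only cores propagate reachability
                continue
            for v in eps_nbrs[u]:
                if v not in label:
                    label[v] = cluster_id
                    queue.append(v)
    return label                                         # unlabeled vertices: hubs/outliers

# Two tightly knit triangles connected by the single weak edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(scan(adj))   # vertices 1-3 and 4-6 end up in two different clusters
```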