[MUSIC PLAYING]

MIKE TEODORESCU: Hello, and welcome to this module on protected attributes and fairness through unawareness. My name is Mike Teodorescu. I'm an assistant professor of information systems at Boston College, as well as a visiting scholar at MIT D-Lab. This module will cover examples of laws that codify protected attributes, as well as the base case for fairness in machine learning, which is called fairness through unawareness.

The use of machine learning presents both risks and opportunities. Machine learning can reduce costs by automating repetitive tasks, but it can also amplify biases. Certain individual attributes are commonly labeled as protected attributes, as they can be sources of social bias. These are race, religion, national origin, gender, marital status, age, and socioeconomic status.

In the United States, discrimination based on these protected attributes in housing, lending, and employment is illegal. Some of the relevant laws are listed here for your reference. However, regardless of the legal framework, machine learning still has the potential to unintentionally embed bias. In this lecture, we look at a few examples. The next lecture will explore some approaches to mitigating unintentional bias.

Even large companies can unintentionally discriminate. For example, Amazon used a machine learning algorithm to screen resumes and later found that the algorithm was discriminating against female applicants.

In another example, this time from the criminal justice system, machine learning is used to estimate the risk of recidivism. This system has been questioned in a variety of studies, in particular with reference to the protected attributes of race and gender.

In yet another example of a large organization employing machine learning, this time to display ads, Facebook has been named in a suit over alleged violations of the Fair Housing Act.
Therefore, even large organizations like Amazon and Facebook find machine learning fairness challenging. We hope this course will be helpful in preparing software engineers to better address machine learning fairness.

Generally speaking, before you implement a machine learning algorithm, you need to collect data to train that algorithm. This data set would contain your outcome variable, in this figure Y, for example, the decision to hire or not to hire, as well as all of the predictors, for example, features collected from resumes. This complete data set would then have to be split into a training set, on which the model learns, and a test set, on which we determine the model's performance. There are other details, like cross-validation, which we will not cover here, but I encourage you to read more about them.

Fairness starts with a good quality training set. Low-quality data, generally speaking, leads to bad predictions. The individuals labeling the data, for example, managers labeling resumes as hire or no hire, may carry biases, which are then picked up by the machine learning algorithm. Training data may not be representative of all groups, which can lead to bias. There may also be hidden correlations in the input data, for example, between a protected attribute and a predictor, which can likewise lead to bias. And individuals labeling the training data may misremember past situations, a phenomenon known as selective perception, which may itself become a source of bias.

The default fairness method in machine learning is fairness through unawareness. Fairness through unawareness refers to leaving out protected attributes such as gender, race, and other characteristics deemed sensitive. While it was thought to erase inequality, it was actually found to perpetuate it. It may do so because other attributes in the data can be correlated with the protected attributes, and by ignoring the protected attributes we may still include those correlated attributes in our model.
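To make this concrete, here is a minimal sketch of fairness through unawareness on a synthetic hiring data set. It is not from the lecture: the column names (gender, attended_college_x, years_experience), the generated data, and the use of pandas and scikit-learn are illustrative assumptions. The protected column is dropped before the train/test split, yet a proxy column that correlates with it can still carry its signal.

```python
# Minimal sketch (not from the lecture): fairness through unawareness on
# hypothetical hiring data. The protected attribute is dropped before training,
# but a correlated proxy column can still encode the same information.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical applicant features; "gender" is the protected attribute and
# "attended_college_x" is an ordinary resume feature that correlates with it.
gender = rng.integers(0, 2, size=n)
attended_college_x = (gender + rng.normal(0.0, 0.4, n) > 0.5).astype(int)
years_experience = rng.normal(5.0, 2.0, n)
# Historical labels partly influenced by gender, i.e. a biased labeler.
hired = (0.3 * years_experience + 1.5 * gender + rng.normal(0.0, 1.0, n) > 2.5).astype(int)

df = pd.DataFrame({
    "gender": gender,
    "attended_college_x": attended_college_x,
    "years_experience": years_experience,
    "hired": hired,
})

# Fairness through unawareness: remove the protected attribute from the predictors.
X = df.drop(columns=["gender", "hired"])
y = df["hired"]

# Split into a training set (to fit the model) and a test set (to measure performance).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# The proxy still correlates with the dropped protected attribute, so the model
# can reproduce the historical bias even though "gender" was never used directly.
print("corr(gender, attended_college_x):", df["gender"].corr(df["attended_college_x"]))
```

In this synthetic example the proxy remains strongly correlated with the protected attribute that was dropped, which is exactly the redundant encoding problem described next.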
This could actually perpetuate inequality. When race, gender, and other sensitive variables are treated as protected, other variables that remain unprotected, such as college attended, hometown, or other resume indicators, may still be highly correlated with these protected attributes. Thus, ignoring the protected attributes altogether may leave these hidden correlations in the data undetected. This is called redundant encoding.

In one example, researchers at Carnegie Mellon University found that gender caused an unintentional change in Google's advertising system, such that ad listings targeted at users seeking high-income jobs were presented to men at nearly six times the rate they were presented to women. This may be an example of fairness through unawareness.

Here are some review questions about this material. What are the sensitive attributes in the context in which you work? Do you think that the current list of protected attributes is exhaustive? What is fairness through unawareness? What variables might lead to biased predictions for a machine learning hiring system in your country? What are some risks to an organization that chooses unawareness?

I'd like to thank my co-authors Lily Morse, Gerald Kane, and Yazeed Awwad for a study that also helped me put together these slides, as well as USAID for a grant on the appropriate use of machine learning in developing countries and the Carroll School of Management at Boston College for research funding. I'd also like to thank everyone who contributed feedback on these videos, as well as on the accompanying manuscripts.

I'd like to share with you some references I found helpful in preparing this material. I hope you will read more about this topic, and I thank you for your attention. Thank you so much for watching this video. We hope you find it useful and that you'll continue watching the rest of the class.

[MUSIC PLAYING]