[MUSIC PLAYING]

AMIT GANDHI: Hi, my name is Amit Gandhi, and I'm a graduate researcher at MIT. Welcome to this course on exploring fairness in machine learning for international development. In this video, we will examine bias in machine learning models through a pulmonary health diagnostic case study. In particular, we will explore the influence of representative data on accuracy when building a model.

Pulmonary diseases, including asthma, COPD, allergic rhinitis, and others, can have significant detrimental health impacts if undetected. In remote areas with limited access to health care, they can often go undiagnosed and untreated. The motivation for this work was to develop a screening tool for community health workers to determine if patients who were presenting symptoms of pulmonary disease actually have pulmonary disease.

To develop the tool, data was collected from 303 patients who sought medical care at health clinics between 2015 and 2018 in Pune, India. Patient data was collected at health clinics from two exams administered by researchers: a mobile health diagnostic kit developed by Dr. Fletcher's group and a set of measurements from a pulmonary function test lab. Health diagnoses were performed by medical staff with a focus on asthma, allergic rhinitis, and COPD.

The overall disease distribution among the patients is shown in the plot. The data included 175 patients with pulmonary diseases and 87 healthy patients. Patients may also have multiple pulmonary diseases, for example, asthma and COPD.

The exploration of the effect of representative sampling on accuracy was conducted across two protected variables, gender and income. The population distributions for the two variables can be seen in the slides. For income considerations, patients were categorized as either low income or high income.

The overall approach to the bias study was to divide the data set into a larger training data superset and a test data set. A logistic regression model with L2 regularization was used to make predictions on disease. To train the model, training data subsets were randomly sampled from the superset in ways that intentionally introduced imbalances along protected variables.
For example, with regard to income, training data subsets ranged from 50% high income and 50% low income to 87.5% high income and 12.5% low income. To account for stochastic error, this process was run 1,000 times for each test. The area under the receiver operating characteristic curve was used as the metric for accuracy.

Starting with the gender bias analysis, our training data sets and test data set were divided as shown. Male-female representativeness was varied from 50/50 to 87.5/12.5. The results for predictive accuracy for allergic rhinitis, asthma, and COPD are shown on the slide. The data shows no significant decrease in algorithm accuracy as gender imbalances are introduced in the data. This may be surprising considering how we have highlighted the principle of representativeness in data throughout this course. However, it is important to note that protected variables do not necessarily affect outcome variables, and a lack of representativeness may not always introduce bias or unfairness into models.
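The experimental protocol just described (sample training subsets with a controlled subgroup mix, fit an L2-regularized logistic regression, score with ROC AUC, and repeat to average out sampling noise) can be sketched in code. This is a minimal illustration only: the data is synthetic, the helper names are invented for this sketch, and it is not the study's actual pipeline.

```python
# Sketch of the representativeness experiment: vary the fraction of one
# subgroup in the training data and measure test AUC. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg_l2(X, y, lam=1.0, lr=0.1, steps=500):
    """Gradient-descent logistic regression with an L2 penalty."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w / len(y))
        b -= lr * np.mean(p - y)
    return w, b

def roc_auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def sample_with_mix(X, y, group, frac_a, n):
    """Draw a training subset with a target fraction of group A."""
    idx_a, idx_b = np.flatnonzero(group == 1), np.flatnonzero(group == 0)
    n_a = int(round(frac_a * n))
    pick = np.concatenate([rng.choice(idx_a, n_a, replace=True),
                           rng.choice(idx_b, n - n_a, replace=True)])
    return X[pick], y[pick]

# Synthetic stand-in for the patient data: features, disease label, group flag.
n_total = 400
X = rng.normal(size=(n_total, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_total) > 0).astype(int)
group = rng.integers(0, 2, size=n_total)   # e.g. 1 = high income, 0 = low income
X_test, y_test = X[300:], y[300:]

# Sweep the mix from 50/50 to 87.5/12.5, averaging AUC over repeats.
for frac in [0.5, 0.625, 0.75, 0.875]:
    aucs = []
    for _ in range(50):   # the study used 1,000 repeats; fewer here for speed
        Xtr, ytr = sample_with_mix(X[:300], y[:300], group[:300], frac, 200)
        w, b = fit_logreg_l2(Xtr, ytr)
        aucs.append(roc_auc(y_test, X_test @ w + b))
    print(f"{frac:.3f} group-A fraction: mean AUC = {np.mean(aucs):.3f}")
```

Because the synthetic group flag is independent of the label here, the AUC stays roughly flat across mixes, mirroring the study's finding that imbalance along a protected variable need not hurt accuracy.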
Looking at our results, we also notice that our algorithm is more accurate at predicting COPD in women than in men. Exploring the results further, we look at other variables and their correlation with gender. In our data set, we found that smoking heavily correlated with gender: 55% of men reported that they were nonsmokers, whereas 100% of women reported that they were nonsmokers. As a result, the population of women was more homogeneous, allowing for higher predictive accuracy.

Moving on to the income bias analysis, the training data sets and test data sets were divided as shown. Similar to the gender study, representativeness based on income was varied for the training data set. The results for predictive accuracy for allergic rhinitis, asthma, and COPD are shown on the slide. Again, we see very little difference in accuracy as we change representativeness within the sample. COPD is the most sensitive to socioeconomic status, with a 4% difference in model accuracy between high-income and low-income populations. Asthma and allergic rhinitis show no difference in performance.
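Gaps like the COPD difference between women and men are surfaced by evaluating the same model separately on each subgroup. A minimal sketch of that disaggregated evaluation, assuming synthetic scores and an illustrative gender flag (none of this comes from the study's actual code):

```python
# Disaggregated evaluation: compute ROC AUC separately per subgroup to check
# whether model accuracy differs across a protected variable. Synthetic data.
import numpy as np

rng = np.random.default_rng(1)

def roc_auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic test set: model scores, true labels, and a subgroup flag.
n = 500
gender = rng.integers(0, 2, size=n)          # 0 = male, 1 = female (illustrative)
y = rng.integers(0, 2, size=n)
scores = y + rng.normal(scale=0.8, size=n)   # noisy scores correlated with label

aucs_by_group = {}
for g, name in [(0, "male"), (1, "female")]:
    mask = gender == g
    aucs_by_group[name] = roc_auc(y[mask], scores[mask])
    print(f"{name}: AUC = {aucs_by_group[name]:.3f}")
```

A large gap between the per-group numbers is the cue to look for correlated variables, as the smoking example above illustrates.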
In summary, we found that representativeness across the protected variables of gender and income does not play a large role in model accuracy for this example on pulmonary diseases in India. As part of building a machine learning model, it is always important to check what effects, if any, protected attributes may have on the model. In the real world, it will be impossible to find perfectly balanced data sets, and tests such as the one described here can be used to check for the effect of representativeness across protected variables on data and model accuracy. It is important to understand these tradeoffs so that you can make informed decisions when building models.

Thank you for taking the time to watch this case study. And we hope that you'll watch the other content in the series.

[MUSIC PLAYING]