[MUSIC PLAYING]

MIKE TEODORESCU: Hello, and welcome to this module on protected attributes and fairness through unawareness. My name is Mike Teodorescu. I'm an assistant professor of information systems at Boston College, as well as a visiting scholar at MIT D-Lab. This module will cover examples of laws that codify protected attributes, as well as the base case for fairness in machine learning, which is called fairness through unawareness.

The use of machine learning presents both risks and opportunities. Machine learning can reduce costs by automating repetitive tasks, but it can also amplify biases. Certain individual attributes are commonly labeled as protected attributes, as they can be sources of social bias. These are race, religion, national origin, gender, marital status, age, and socioeconomic status.

In the United States, discrimination based on these protected attributes in housing, lending, and employment is illegal. Some of the relevant laws are listed here for your reference. However, regardless of the legal framework, machine learning still has the potential to unintentionally embed bias. In this lecture, we look at a few examples. The next lecture will explore some approaches to mitigating unintentional bias.

Even large companies can unintentionally discriminate. For example, Amazon used a machine learning algorithm to screen resumes and later found that the algorithm was discriminating against female applicants.

In another example, this time from the criminal justice system, machine learning is used to estimate the risk of recidivism. This system has been questioned in a variety of studies, in particular with reference to the protected attributes of race and gender.

In yet another example of a large organization employing machine learning, this time to display ads, Facebook has been named in a suit over alleged violations of the Fair Housing Act.
Therefore, even large organizations like Amazon and Facebook find machine learning fairness challenging. We hope this course will be helpful in preparing software engineers to better address machine learning fairness.

Generally speaking, before you implement a machine learning algorithm, you need to collect data to train that algorithm. This data set would contain your outcome variable, in this figure Y, for example, the decision to hire or not to hire, as well as all of the predictors, for example, features collected from resumes. This complete data set would then have to be split into a training set, on which the model learns, and a test set, on which we determine the model's performance. There are other details, like cross-validation, which we will not cover here, but I encourage you to read more about them.

Fairness starts with a good quality training set. Low-quality data, generally speaking, leads to bad predictions. The individuals labeling the data, for example, managers labeling resumes as hire or no hire, may carry biases, which are then picked up by the machine learning algorithm. Training data may not be representative of all groups, which can lead to bias. There may also be hidden correlations in the input data, for example, between a protected attribute and a predictor, which can likewise lead to bias. And individuals labeling the training data may misremember past situations, a phenomenon known as selective perception, which may itself become a source of bias.

The default fairness method in machine learning is fairness through unawareness. Fairness through unawareness refers to leaving out protected attributes such as gender, race, and other characteristics deemed sensitive. While it was thought to erase inequality, it was actually found to perpetuate it. It may do so because other attributes in the data can be correlated with the protected attributes, and by ignoring the protected attributes we may still include those correlated attributes in our model.
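To make this concrete, here is a minimal sketch of fairness through unawareness on a synthetic hiring data set. It is not from the lecture: the column names (gender, attended_college_x, years_experience), the generated data, and the use of pandas and scikit-learn are illustrative assumptions. The protected column is dropped before the train/test split, yet a proxy column that correlates with it can still carry its signal.

```python
# Minimal sketch (not from the lecture): fairness through unawareness on
# hypothetical hiring data. The protected attribute is dropped before training,
# but a correlated proxy column can still encode the same information.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical applicant features; "gender" is the protected attribute and
# "attended_college_x" is an ordinary resume feature that correlates with it.
gender = rng.integers(0, 2, size=n)
attended_college_x = (gender + rng.normal(0.0, 0.4, n) > 0.5).astype(int)
years_experience = rng.normal(5.0, 2.0, n)
# Historical labels partly influenced by gender, i.e. a biased labeler.
hired = (0.3 * years_experience + 1.5 * gender + rng.normal(0.0, 1.0, n) > 2.5).astype(int)

df = pd.DataFrame({
    "gender": gender,
    "attended_college_x": attended_college_x,
    "years_experience": years_experience,
    "hired": hired,
})

# Fairness through unawareness: remove the protected attribute from the predictors.
X = df.drop(columns=["gender", "hired"])
y = df["hired"]

# Split into a training set (to fit the model) and a test set (to measure performance).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# The proxy still correlates with the dropped protected attribute, so the model
# can reproduce the historical bias even though "gender" was never used directly.
print("corr(gender, attended_college_x):", df["gender"].corr(df["attended_college_x"]))
```

In this synthetic example the proxy remains strongly correlated with the protected attribute that was dropped, which is exactly the redundant encoding problem described next.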
This could actually perpetuate inequality. When race, gender, and other sensitive variables are treated as protected, other variables that remain unprotected, such as college attended, hometown, or other resume indicators, may still be highly correlated with these protected attributes. Thus, ignoring the protected attributes altogether may leave these hidden correlations in the data undetected. This is called redundant encoding.

In one example, researchers at Carnegie Mellon University found that gender caused an unintentional change in Google's advertising system, such that ad listings targeted at users seeking high-income jobs were presented to men at nearly six times the rate they were presented to women. This may be an example of fairness through unawareness.

Here are some review questions about this material. What are the sensitive attributes in the context in which you work? Do you think that the current list of protected attributes is exhaustive? What is fairness through unawareness? What variables might lead to biased predictions for a machine learning hiring system in your country? What are some risks to an organization that chooses unawareness?

I'd like to thank my co-authors Lily Morse, Gerald Kane, and Yazeed Awwad for a study that also helped me put together these slides, as well as USAID for a grant on the appropriate use of machine learning in developing countries and the Carroll School of Management at Boston College for research funding. I'd also like to thank everyone who contributed feedback on these videos, as well as on the accompanying manuscripts.

I'd like to share with you some references I found helpful in preparing this material. I hope you will read more about this topic, and I thank you for your attention. Thank you so much for watching this video. We hope you find it useful and that you'll continue watching the rest of the class.

[MUSIC PLAYING]