Man vs Machine Learning: Criminal Justice in the 21st Century | Jens Ludwig | TEDxPennsylvaniaAvenue
I have an idea for how to make the world a better place, and like all truly good ideas, this one starts with a road trip. The protagonists on the road trip are me, a University of Chicago professor who studies crime and the criminal justice system, and a friend of mine who's a professor at an Ivy League medical school. And so, there we were, two at least moderately distinguished academics driving up Route 95. (Laughter)

We had several hours to kill on the way from New York City to New England, so I tried to use some of those conversational skills that professors are famous for. I turned to my friend and said, "Tell me about the biggest mistake that you've ever made in your life." He paused, then turned to me and said, "I've got an idea: why don't you go first?" (Laughter) So I said to my friend, "I'm the one who asked the question. Why don't you go first?"

So my friend told me about a time he was working in the ER, and a patient came in complaining of chest pain. The standard protocol in situations like that is to do a cardiac enzyme test, to see what's in the blood and try to predict whether the patient is having a heart attack. The patient comes in, and they administer the test. The level of the enzyme is above the threshold, and usually the default would be to take the patient to the intensive care unit. But before they do that, my friend goes into the waiting room to see the patient in person first, and the patient is sitting there snacking on a watermelon. My friend talks to him for a few minutes, then goes back out to meet the rest of the team.

Now, the rest of the team - the doctors and nurses on duty in the ER - haven't seen the patient. All they've seen are the data in the chart and the test level above the threshold. So they start saying, "We've got to go; let's get this guy up to the ICU." My friend says, "No, no, no. I went and met with the patient; he's totally chill. He's sitting there, he's having a snack. I think he's okay; let's leave him where he is." And then, a half hour later, the guy goes into cardiac arrest, and they have to race him to the operating room.

That is an illustration of a lesson we've learned from behavioral economics and psychology about how easy it is for the human brain to get distracted by irrelevant but very salient information. It got me thinking about a problem that the research center I help run at the University of Chicago, the Crime Lab, has been working on for several years: the problem of the jail system in the United States. Millions and millions of times a year, when someone is arrested, a judge has to decide where that person awaits trial: do they get to go home, or do they have to sit in jail? And by law, that decision is supposed to hinge on the judge's prediction of what the defendant would do if they were released: Is that person a flight risk? Is that person a public safety risk?

This is an enormously high-stakes decision. If the judge puts you in jail, you will on average sit there for two to three months, sometimes much, much longer. The flip side is that if the judge releases someone who goes on to commit a new crime, that could be horrible in its own way. And this decision is very difficult for the judge for the same reason the emergency room decision was difficult for my doctor friend.
ER doctors at least have the benefit of something like a cardiac enzyme test to help them make those sorts of decisions. We give judges a stack of manila files with some information about what the person was arrested for and the person's prior criminal record, and then the judge has to make the decision in their head. To see how crazy this is, consider that the very same judge who spends all day reading through folders and making predictions that will change the course of people's lives goes home at the end of the day, wants to relax by watching a movie on TV, and for help with that critical decision gets access to Netflix, which uses some of the most sophisticated machine learning technology on the planet to predict what movie the judge is going to like. Why aren't we using some of these technologies, deployed so productively in the commercial sector, to help us solve these really important public policy problems as well?

Now, to think about whether this would actually be helpful or not, it's useful for starters to have a little more sense of what machine learning is and how it works. Let me briefly talk you through a canonical problem in computer science called sentiment analysis. Here's what that is: it is basically taking a snippet of text and trying to determine the author's affect: is the author trying to convey a positive or a negative emotion?

Here's how that looks for a more or less randomly selected consumer product, the Hutzler 571 Banana Slicer. (Laughter) Here's a review by Thrifty E: "I bought this in order to speed up cutting up a banana for my cereal. Any time I saved in that endeavor was spent cleaning this implement." (Laughter) "It is not easy to clean. You have to scrub between every rung to thoroughly clean it." We read that, and it's trivially easy for us to tell it's a negative review. And we can confirm our assessment by looking at the star rating: a mere two out of five stars. Here's another one by Uncle Pookie, who says, "Great gift." (Laughter) "Once I figured out I had to peel the banana before using it," (Laughter) "it works much better." A five-star review. Here's one by Q-Tip: "Confusing. There's no way to tell if this is a standard or metric banana slicer." (Laughter) "Additional markings on it would help greatly." And here's one more by J. Anderson: "Angle is wrong. I tried this banana slicer and found it unacceptable. As shown in the picture, the slice is curved from left to right, and all of my bananas are bent the other way." (Laughter)

Reading through these text reviews, you realize that it is very, very easy for us to do this, and that gave the early computer scientists an idea about how to get computers to do it: why don't we just introspect on how we do this, and then program the computer to do exactly what we're doing? Here are the results of a study that does sentiment analysis on movie reviews using what's called a programming approach. The data set of movie reviews is half positive reviews and half negative reviews, so an accuracy rate of 50% would basically be random guessing. You get a bunch of programmers to introspect on what words you would expect to see in a positive review and in a negative review. Here are some of the words you'd expect to see in a good review, and some of the words you'd expect to see in a negative one.
And when you do this, you get an accuracy rate on the order of 60%. Now, that's better than random guessing, but not much better. This is the challenge the computer scientists kept running into in this area: even with pretty basic problems, it turned out to be very, very hard to program computers to do what we're doing and get good performance. The reason is that it's much more difficult than we realize to fully introspect and figure out what we are actually doing when we do these tasks. My psychology friends call that the "introspection illusion."

Progress in this area really only came once the computer scientists realized that we needed to completely forget that we know how to do these things ourselves and turn these tasks into brute-force data exercises. In the movie review case, here's what that looks like: you take a large sample of movie reviews where you know from the star rating whether each one is a good or bad review, and you let the computer learn which words tend to come up in good reviews and which words tend to come up in bad reviews. Then you use those words as your prediction algorithm for future reviews. Once you adopt that data-driven approach, the machine learns which words are indicative of positive and negative reviews, and accuracy rates get up on the order of 95%. This, I think, is really the magic behind machine learning.
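To make the contrast concrete, here is a minimal sketch of the two approaches in Python. The tiny labeled corpus and the hand-picked word lists are invented for this illustration, not the talk's data, and the scikit-learn model is just one of many ways to learn word weights:

```python
# A sketch contrasting the two approaches: a hand-coded word list
# (the "programming approach") versus a model that learns word
# weights from labeled examples.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled reviews (1 = positive, 0 = negative). A real study would
# use thousands of reviews, labeled by their star ratings.
reviews = [
    ("great gift works perfectly and saves time", 1),
    ("wonderful idea simple and fun to use", 1),
    ("works exactly as advertised would buy again", 1),   # no lexicon words
    ("love it best purchase this year", 1),
    ("not great at all would return it immediately", 0),  # negation fools a lexicon
    ("broke after one use very disappointing", 0),
    ("terrible design awkward and flimsy", 0),
    ("it is not easy to clean total waste of money", 0),
]
texts, labels = zip(*reviews)

# Approach 1: programmers introspect and hand-pick sentiment words.
POSITIVE = {"great", "wonderful", "love", "best", "fun", "perfectly"}
NEGATIVE = {"waste", "broke", "terrible", "disappointing", "awkward", "flimsy"}

def lexicon_predict(text: str) -> int:
    """Count positive minus negative lexicon hits; positive if > 0."""
    words = text.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1 if score > 0 else 0

lexicon_acc = sum(lexicon_predict(t) == y for t, y in reviews) / len(reviews)

# Approach 2: let the model learn word weights from the labeled data.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)
learned_acc = model.score(X, labels)  # in-sample, since this is a toy set

print(f"hand-coded lexicon accuracy: {lexicon_acc:.0%}")
print(f"learned model accuracy:      {learned_acc:.0%}")
```

On this toy set the lexicon misses reviews whose sentiment words it didn't anticipate and is fooled by negation, which is exactly the introspection problem; on a real corpus, with accuracy measured on held-out reviews, the gap is the one the talk describes, roughly 60% versus roughly 95%.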
You can see how you would apply this to something like pre-trial release decisions: let the computer learn which case characteristics, or combinations of case characteristics, are actually most predictive of flight risk or public safety risk. I've been working as part of a research team for the last several years trying to build a prediction algorithm for pre-trial release, to see if we can be helpful to judges. We've been doing this with data from a large, anonymous American city of 8.5 million people. (Laughter)

What we discovered is that it's not so hard to actually build the algorithm. You can download free software off the internet and figure out how to do that. The hard part is testing the algorithm and seeing whether it will actually make the world a better place or not. For Netflix, this is not such a hard problem: everything Netflix does happens in a self-contained online environment. But testing an algorithm in the real world for public policy applications is often much more complicated. Absent the ability to run a randomized trial, this is a difficult problem to solve, and the problem we run into is a difficult social science problem, not a difficult computer science problem. It's so difficult that many of the people now thinking about bringing these machine learning tools into the public policy arena are tempted to just give up on the testing stage and take tools straight from the computer scientist's drawing board into the real world. I think that would be a mistake: it is very possible to inadvertently build a tool that winds up making the world a worse place, not a better one.

For the project we've been working on, the hardest part has been figuring out how to test the tool and make sure it's actually helpful. The way we came up with to test it builds on two insights. Notice why this problem is difficult in the pre-trial case. Suppose we build an algorithmic rule to inform pre-trial release that says: prioritize the people with the highest predicted risk for jail, and let everybody else go. That rule will inevitably want to release some people the judge jailed, and when it does, we can't see what those people would have done had they been released, because the judge actually jailed them. We have a very difficult missing data problem. On the flip side, though, if the algorithm wants to jail someone the judge released, we don't have an evaluation problem, because we know what the effect of putting someone in jail is on their flight risk and their public safety risk: being in jail mechanically eliminates the risk that you fail to show up in court or get re-arrested. That's insight number one: the missing data challenge is one-sided.

The second insight is that in the big city in which we've been working, cases are more or less randomly assigned to judges. That means we have a sample of judges hearing very similar caseloads, and the judges turn out to differ a lot in their strictness and leniency. So here's what we can do. Imagine we have two judges: a lenient judge who releases 90% of cases and a stricter judge who releases 80%. As a fair test of the algorithm's performance, we can compare how the judges perform when they become stricter with how the algorithm would choose to become stricter. Here's what that looks like. Take the lenient judge, who releases 90% of their cases; we can observe the outcomes for everyone that judge releases. To go from a 90% to an 80% release rate, the algorithm would identify the highest-risk 10% of people in that judge's caseload and prioritize them for jail. Now we're down to an 80% release rate, we can observe what the resulting crime rate would be, and we can compare that to how the stricter judge did in getting us from a 90% to an 80% release rate. This gives us a way to fairly compare the algorithm's performance against the judges' on a comparable set of cases, focusing on the task where we don't have the missing data problem: the algorithm is only ever selecting people to jail from among the pool of people the judges let go.
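Here is a minimal sketch of that comparison in Python, run on synthetic data. The risk distribution, the noise in the algorithm's scores and the judges' rankings, and the release rates are all invented for illustration; the actual study used court records with cases randomly assigned to real judges:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # cases per judge; random assignment makes caseloads comparable

def simulate_judge(release_rate):
    """Simulate one judge's caseload. Each case has a true risk of
    'failure' (flight or re-arrest) if released; the judge releases
    a fixed share of cases, ranked by a noisy reading of that risk."""
    risk = rng.beta(2, 8, size=n)                    # true failure risk if released
    score = risk + rng.normal(0, 0.15, size=n)       # the algorithm's noisy estimate
    judge_view = risk + rng.normal(0, 0.40, size=n)  # the judge's noisier ranking
    cutoff = np.quantile(judge_view, release_rate)
    released = judge_view <= cutoff                  # judge jails the perceived-worst cases
    failed = released & (rng.random(n) < risk)       # outcomes observed only if released
    return score, released, failed

# A lenient judge releasing 90% and a stricter judge releasing 80%.
score_l, released_l, failed_l = simulate_judge(0.90)
_, released_s, failed_s = simulate_judge(0.80)

# Contraction: start from the lenient judge's released pool (where all
# outcomes are observed) and let the algorithm jail its riskiest cases
# until the overall release rate falls to 80%. Jailing mechanically
# removes a case's failure risk, so no missing-data problem arises.
n_to_jail = int(0.10 * n)
released_idx = np.flatnonzero(released_l)
riskiest = released_idx[np.argsort(score_l[released_idx])[-n_to_jail:]]
still_released = released_l.copy()
still_released[riskiest] = False
failures_algo = int((failed_l & still_released).sum())

print(f"stricter judge at 80% release: {int(failed_s.sum())} failures")
print(f"algorithm at 80% release:      {failures_algo} failures")
```

The key move is that the algorithm only ever moves people from "released" to "jailed," so every outcome it needs is observed, and the comparison against the stricter judge is fair because random assignment gives both judges similar caseloads.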
Now, having solved the evaluation problem, the testing problem, we can run policy simulations that suggest what would happen if we actually followed the algorithm's rule instead of standard practice in the criminal justice system. What we find is that if you follow the recommendations of the algorithm, you'd be able to reduce the crime rate by fully 25% without having to put a single additional person in jail. Alternatively, you could reduce the jail population by fully 42% without any increase in the crime rate at all. The reason the algorithm can deliver such big gains over the status quo criminal justice system is that we can see in the data that the judges, just like my ER doctor friend, are getting distracted by irrelevant but very salient information about these cases, and that's especially true among the highest-risk cases in the defendant pool.

So far I've shown you the upside of applying machine learning to these policy problems. There's a potential downside as well, which is the possibility that these algorithms, once we apply them to policy problems, and maybe especially to criminal justice problems, might get us gains on some outcomes but compromise other things we care about, like fairness. You can see why people are worried about this. In the city in which we are working, fully 89% of the people in jail are minorities, in a city where I can assure you the overall population is nowhere near 89% minority. The people who are concerned about the use of machine learning for these problems are, I think, right in a way: we have discovered that if you build an algorithm into a release rule that ignores this issue entirely, it is indeed possible to build a tool that makes the problem, if anything, a little bit worse. But what we've also found is that if you build the algorithm paying attention to this problem, you can design a decision aid that would simultaneously let you reduce crime, reduce jail populations, and reduce racial disparities in the criminal justice system as well. How does the algorithm let you do that? Well, what is race, after all, but an irrelevant but highly salient piece of information in the courtroom? What is an implicit bias other than a version of the introspection illusion? The algorithm is not prone to those challenges to human judgment and decision making.

I think what's particularly exciting about bail is that it is just one illustration of a larger class of public policy problems that hinge on a prediction a human being is currently making, but that could in principle be informed by machine learning algorithms. There is an active debate underway about whether it's a good idea or a bad idea to take these algorithms from the commercial sector and bring them into the public policy arena. Should we do that or not? I think that is the wrong way to frame the question, and here's a thought exercise about why. Imagine that I could magically transport you back to the beginning of the 20th century, and you arrived telling people about a new technology on the horizon that would very quickly become one of the leading causes of death and have massively adverse impacts on the environment. And yet I think relatively few of us would argue that we shouldn't have adopted the internal combustion engine automobile. Imagine what life would be like without cars. We wouldn't have had anything like the economic growth we've seen over the last 100 years. Our lives would be impoverished in countless ways, and we wouldn't have road trips. (Laughter) And so I think the right conversation to be having about the use of machine learning for policy applications over the next ten years is not whether to adopt these new technologies, but how. Thank you very much. (Applause)
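The talk does not spell out the mechanism behind the fairness-aware version of the tool. Purely as an illustrative assumption, and not the research team's actual method, here is one simple design in Python: rank predicted risk within each group and apply a common detention rate per group, rather than one pooled cutoff that can concentrate detention in a single group:

```python
# Hypothetical fairness-aware selection rule, invented for illustration:
# detain the riskiest `detain_frac` of cases *within* each group, so
# detention rates are balanced across groups by construction.

import numpy as np

def detain_with_group_quotas(scores, groups, detain_frac):
    """Boolean mask selecting the top `detain_frac` of cases for
    detention within each group, ranked by predicted risk."""
    scores = np.asarray(scores)
    groups = np.asarray(groups)
    detain = np.zeros(len(scores), dtype=bool)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        k = int(round(detain_frac * len(idx)))
        if k > 0:
            detain[idx[np.argsort(scores[idx])[-k:]]] = True
    return detain

# Hypothetical example: detain the riskiest 10% within each group.
rng = np.random.default_rng(1)
scores = rng.random(1_000)
groups = np.array(["A"] * 600 + ["B"] * 400)
mask = detain_with_group_quotas(scores, groups, 0.10)
print({g: round(float(mask[groups == g].mean()), 3) for g in ("A", "B")})  # ~0.10 each
```

Whether a constraint like this is the right notion of fairness is itself a policy choice; the point of the sketch is only that an algorithmic selection rule, unlike a judge's intuition, can be written down, constrained, and audited.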