Friday, January 8, 2016

Inside the Wall Street Journal's Prediction Calculator

By Martin Burch


Our interactive asked for just two things: a person’s last name and address. In return, it produced a guess about that person’s race and ethnicity. For us at the Wall Street Journal, it also produced another kind of result—an unusually thoughtful dialogue with our readers on social media.

The interactive was called “How the Government Predicts Race and Ethnicity,” and it accompanied our story on an $80 million auto-loan discrimination settlement. To create the graphic, reporters AnnaMaria Andriotis and Rachel Ensign dug up methodology from the Consumer Financial Protection Bureau (CFPB), showing how the CFPB calculated alleged discrimination against minority borrowers. Then, as our Halloween deadline loomed, I Frankensteined it all together, recreating the CFPB’s calculations in an interactive calculator.

Beyond a single sentence, our finished interactive doesn’t attempt to explain how the CFPB’s predictions work or how accurate they are. I think this factor fueled the discussion that happened next.

A Conversation Readers Wanted to Have

My initial social share was this Twitter poll:

As my unread notifications icon on Twitter hit the double digits, I was momentarily alarmed. Was something broken? Had I been thoroughly insensitive? Was my interactive caught up in some partisan bickering? No, none of the above. People just wanted to share screenshots of the interactive while trying to make sense of the results.

Reader engagement can become a rote checklist. “Share this on Twitter!” we urge. “Take our poll!” “Leave a comment!”

This interactive was different. Readers were eager to have a conversation with me, sharing their discoveries and reasoning about how the calculator worked.

Why We Made the Calculator

I work for a large and growing interactive graphics team at The Wall Street Journal. I mostly work on data development (e.g. web scraping) for large projects in collaboration with a front-end person. But, for small projects like this one, we work as one-person teams. As a result, this project had a simple user interface.

We first considered showing how a name and address combine to produce a race and ethnicity probability using the CFPB’s methodology. And we thought about making a list of example names and addresses (try Clinton at the White House) to show potential problems with the CFPB’s method. Those ideas didn’t make the cut.

Instead, we focused on making a simple graphic, embedding it in the story, and letting the story provide context. We also produced a separate version, as we often do, to share on social media. It’s this standalone version that became unexpectedly popular.

The Effect of the Standalone Graphic

Chart of traffic to the WSJ story

Our initial traffic mostly came from article views, but at 8pm on the day of publication, we had our first hour where a majority of pageviews came from the standalone version of the graphic. Over the first 10 days, there were 0.97 calculations per pageview, and 54% of the interactive’s pageviews came from the standalone version. (The rest of our readers saw the interactive within the story’s context.)

We had a chance to re-run this interactive almost a month later, embedded in a follow-up story. This time, we didn’t promote the standalone version on social media. Over the first follow-up day, 99% of interactive views came from the embedded version, compared to 61% on the interactive’s initial day of publication. There were 0.38 calculations per pageview during the first follow-up day, compared to 0.91 calculations per pageview during a comparable time of the initial day.

This data suggests that the standalone version resulted in more exploration, but we’d need more detailed analytics to be sure.

The standalone version did link back to the article, but not prominently. Only about 5% of readers clicked through from the standalone interactive to read the story, according to one of our analytics reports. I’d wager nobody clicked the gray hyperlinked source line leading to the Consumer Financial Protection Bureau’s 37-page methodology paper.

How We Made the Calculator

As our reporters note in the story, some people have criticized the CFPB’s use of a race-guessing algorithm. But one thing is hard to criticize: how open the CFPB has been about how the algorithm works.

We used the CFPB’s public GitHub repository to recreate the agency’s calculations. The repo also includes Stata-formatted data files with the racial probabilities for each name and location; a quick trip through Python’s pandas module turned them into a MySQL database. (The CFPB’s calculations used five Census racial categories: Black, White, Asian, American Indian, and multiracial, all non-Hispanic, and one Census ethnic category, Hispanic.)
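For readers curious about that conversion step, here’s a minimal sketch in Python. The file name, table name, and the use of SQLite as a stand-in for MySQL are all assumptions for illustration; the real repo’s file names differ.

```python
# Sketch: load a Stata (.dta) probability table into a SQL database.
# File name, table name, and SQLite stand-in are hypothetical.
import sqlite3

import pandas as pd


def stata_to_sql(stata_path, table_name, conn):
    """Read a Stata file and write its rows to a SQL table."""
    df = pd.read_stata(stata_path)  # .dta file -> DataFrame
    # Replace any existing table; don't store the DataFrame index.
    df.to_sql(table_name, conn, if_exists="replace", index=False)
    return len(df)
```

With each surname and geography table loaded this way, the interactive’s back end only needs simple lookups by name or Census area ID.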

We also used another open government resource: the U.S. Census Bureau’s open geocoder. This allows us to easily attach Census areas—tracts and block group identifiers—to addresses. Then, we can use these identifiers to look up race and ethnicity percentages in the CFPB’s data tables.

Can You Read This Formula? If Not, Don’t Panic.

The official name for the CFPB’s methodology is Bayesian Improved Surname Geocoding. It was developed by statistician Marc Elliott at RAND Corporation, a think tank. Here’s the formula, as presented by the CFPB:

Formula used in CFPB method

Early on in this process, someone at the Journal asked me if I understood the math involved here. Yes, I do, but not from reading this formula. (I’m a journalist by training, not a mathematician.) Thankfully, the CFPB provided the same logic in computer code, not mathematical notation.

The agency also provided an example calculation, finding the race and ethnicity probabilities for a person named Smith who lives in California. When the formula is broken out into its component parts—and familiar numbers replace letters and symbols—the math becomes basic arithmetic and looks a lot less intimidating.

In this specific calculation, 22.22% of people named Smith in the 2000 Census reported a race of non-Hispanic Black. This is multiplied by the percentage of the non-Hispanic Black population who lives in California, 6.03%. (The other 93.97% of non-Hispanic Blacks live in other states.)

The product of this multiplication is then divided by the sum of six such products, one for each race/ethnicity group, each pairing that group’s surname percentage with its matching location percentage. (Notice that 22.22% * 6.03% appears in the denominator, too.) This step folds in the location-based race information: how likely each group is to be found among Californians.

Formula used in CFPB method

The final result of this calculation suggests a 16.61% chance that a person named Smith who lives in California is Black. Without knowing that Smith lives in California, we would have guessed a 22.22% chance based on last name alone.
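The arithmetic above can be sketched in a few lines of Python. Only the Black figures (22.22% and 6.03%) come from the CFPB’s worked example; the numbers for the other five groups are invented for illustration, so the computed result will not match the published 16.61%.

```python
# A minimal sketch of the BISG update for "Smith in California".
# The "black" entries are from the CFPB's worked example; every
# other number below is a made-up stand-in for illustration.
p_race_given_surname = {  # P(race/ethnicity | surname is Smith)
    "black": 0.2222, "white": 0.70, "api": 0.01,
    "aian": 0.02, "multi": 0.03, "hispanic": 0.0178,
}
p_geo_given_race = {      # P(lives in California | race/ethnicity)
    "black": 0.0603, "white": 0.09, "api": 0.35,
    "aian": 0.12, "multi": 0.15, "hispanic": 0.31,
}

# Numerator for each group: surname percentage * location percentage.
numerators = {g: p_race_given_surname[g] * p_geo_given_race[g]
              for g in p_race_given_surname}

# The denominator is the sum of all six numerators.
denominator = sum(numerators.values())

# Posterior probability: each numerator's share of the total.
posterior = {g: n / denominator for g, n in numerators.items()}
```

Because each numerator is divided by the sum of all of them, the six posterior probabilities always total 100%, whatever the inputs.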

Let’s See Some Code

In the interest of open source, here’s the function for computing the racial/ethnic probabilities that I wrote in PHP. All the array key names come from the Stata data files.

    function makeBISG($surname_data, $geo_data) {
        // Numerators of the BISG formula: P(race | surname) * P(location | race).
        // All array key names come from the CFPB's Stata data files.
        $bisg_array = array(
            "hispanic" => $surname_data["pcthispanic"] * $geo_data["here_given_hispanic"],
            "white" => $surname_data["pctwhite"] * $geo_data["here_given_white"],
            "black" => $surname_data["pctblack"] * $geo_data["here_given_black"],
            "api" => $surname_data["pctapi"] * $geo_data["here_given_api"],
            "aian" => $surname_data["pctaian"] * $geo_data["here_given_aian"],
            "multi" => $surname_data["pct2prace"] * $geo_data["here_given_mult_other"]
        );

        // The denominator is the sum of the six numerators; dividing by it
        // normalizes the results so they total 100%.
        $bisg_denominator = array_sum($bisg_array);

        // Express each group's share as a percentage, rounded to two decimals.
        $bisg_results = array();
        foreach ($bisg_array as $group => $numerator) {
            $bisg_results[$group] = (string) round(($numerator / $bisg_denominator) * 100, 2);
        }

        return $bisg_results;
    }

The other bit of magic you’ll need to make this come together is an understanding of how to call the Census geocoder. In this bit of jQuery, I ask the geocoder API to look up the geographical coordinates for an address, attach two Census statistical area IDs, and return a JSONP response (callback) to the web browser.

$.getJSON("http://geocoding.geo.census.gov/geocoder/geographies/onelineaddress?callback=?", {
        "benchmark": "Public_AR_Census2010",
        "vintage": "Census2010_Census2010",
        "layers": "Census Block Groups,Census Tracts",
        "format": "jsonp",
        "address": "1211 Ave of the Americas, New York, NY"
    }, function (response) {
        // The matched address's Census tract and block group records
        // arrive under response.result.addressMatches[0].geographies.
        console.log(response);
    });

Data sources: benchmark, vintage, layers
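Once the geocoder responds, pulling out the tract and block group identifiers is a matter of walking the JSON. This sketch uses a hand-built, abridged sample of the response shape (an assumption based on the geocoder’s output format, not real API data):

```python
# Extract Census geography IDs from a hand-built, abridged sample of
# the geocoder's JSON response structure (not real API output).
sample_response = {
    "result": {
        "addressMatches": [{
            "geographies": {
                "Census Tracts": [{"GEOID": "36061010400"}],
                "Census Block Groups": [{"GEOID": "360610104001"}],
            }
        }]
    }
}


def census_geoids(response):
    """Return the tract and block group GEOIDs of the first address match."""
    geos = response["result"]["addressMatches"][0]["geographies"]
    return {
        "tract": geos["Census Tracts"][0]["GEOID"],
        "block_group": geos["Census Block Groups"][0]["GEOID"],
    }
```

These GEOID strings are the keys used to look up race and ethnicity percentages in the CFPB’s geography tables.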

What We Learned

The CFPB calculation is certainly prone to error, because people of different races and ethnicities are not completely segregated by last name and location. Thankfully.

(We didn’t use the word segregation in the story or graphic, by the way. It was suggested by our audience.)

The method’s errors in guessing race and ethnicity can come from several sources. The first problem: last names that aren’t common enough to appear in the data (about 10% of Americans) don’t produce a probability at all. That may represent an important portion of the auto loan holders.

Also, if a last name is not a good predictor of race or ethnicity, the geography is very influential. Thus, predictions for people who live in racially integrated areas will be imprecise.

Finally, name changes (such as a spouse taking a partner’s name) can cause wild swings in the prediction.

The CFPB validated its method using mortgage data, showing that the number of minority borrowers was overestimated. But, the agency said, the minority status of car loan borrowers could be more accurately determined.

Our reporters wrote:

A CFPB spokesman said that the bureau believes the methodology is more accurate with car loans, because the universe of borrowers is more representative of the general population.

Our interactive highlighted the problem with the CFPB’s statement: if the predictive method is inaccurate to begin with, the fact that it is more accurate for car loan borrowers may not be that important. (Without an expensive study to validate the CFPB’s methodology as it applies to these specific, allegedly discriminatory auto loans, it’s impossible to know if the method was accurate.)

Readers’ results—i.e. voting in my unscientific Twitter poll—show a combination of all these errors. A majority of respondents, 57%, still found the prediction to be accurate, but the other 43% did not. If the error rate of the auto loan discrimination prediction was 43%, is that acceptable?

With a large reported error rate among our readers, you might think the CFPB would have come in for heavy criticism on social media. But some were understanding of the limitations:

Thinking About the 'Box'

On reflection, I realized that I’d just published a “black box” graphic.

In this diagram of a basic black box system, the observer (reader) provides input (name and address) to the black box, receives output, and may experiment with other input.

Diagram of a black box system

A black box system. (Wikipedia.)

This opacity sometimes led our audience to guess at, and occasionally misunderstand, how the interactive worked.

“I’ve been playing with your address/last name tool,” one reader wrote to me on Twitter. “It is FAR more based on racial makeup of neighborhood than anything else.”

But a colleague of mine had discovered a surname that is 100% White in the data, showing that location doesn’t always matter. I was excited to share this.

“Name can also be an important factor if heavily predictive,” I wrote back, giving the reader an example to try (“Slobin”).

Readers with theories can be a thorn in the side of reporters, introducing the looming horror of a correction as we investigate and ultimately dismiss misinformed ideas. However, with this “black box” concept, readers had room to share ideas, and I could respond with suggestions that readers could try themselves.

Choosing to Use a Black Box Graphic

There are many cases where a black box would not be desirable.

  • Quick-hit stories, where readers need the information up front.
  • Predictable stories, which produce boring black boxes with obvious inputs and outputs.
  • Black boxes that contain dice rolls, coin tosses, or other randomness, which makes results unreproducible.
  • Investigative stories that focus on laying out proof of wrongdoing step by step.

(And of course, a black box graphic wouldn’t work well in print, where readers would need to use arithmetic or complicated tables.)

However, a black box graphic worked out well for us. It stimulated critical engagement with a story about a government agency using last names and addresses to find minority borrowers. Readers had to experiment (or read the story) to figure out how the calculation worked.

To Explain or Not to Explain

As people who make interactive graphics, it may not be in our natures to consider making black box graphics. We usually want to explain a concept fully, drawing our audience’s attention to important ideas.

But, as we search for new forms possible only in interactive graphics, a black box is worth considering—particularly to increase readers’ engagement with the story, and with you.



Read Full Story from Source https://source.opennews.org/articles/inside-wall-street-journals-pre/
This article by Martin Burch originally appeared on source.opennews.org on January 08, 2016 at 07:00PM
