From the data to the system

I'm sharing this magnificent post by Jer Thorp, who talks about taking a systems approach to the design of data visualization — one that not only lets us solve problems more efficiently, but also helps us reflect on how data affects our everyday lives.

In the fall of 2009, I wrote a pair of algorithms to place nearly 3,000 names on the 9/11 memorial in Manhattan. The crux of the problem was to design a layout for the names that allowed for what the memorial designers called ‘meaningful adjacencies’. These were requests made by next-of-kin for their family members to appear on the memorial next to — or as close as possible to — other victims. Siblings, mothers and daughters, business partners, co-workers: these connections represented deep affinities in the real world. There were nearly 1,400 of these adjacencies that a layout of the names would ideally honour.
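To make the problem concrete, a toy version might grade a candidate layout against the list of requested adjacencies. The Python sketch below is purely illustrative: the positions, the names, and the adjacency rule are invented, and the memorial’s real geometry (and my actual algorithm) was far more involved.

```python
# Purely illustrative: invented names, positions, and adjacency rule,
# not the memorial's real geometry or the actual placement algorithm.

def adjacency_score(layout, adjacencies):
    """Return the fraction of requested adjacencies a layout satisfies.

    layout: dict mapping each name to a (panel, row, column) position.
    adjacencies: list of (name_a, name_b) pairs requested by next-of-kin.
    """
    def are_adjacent(pos_a, pos_b):
        # Toy rule: same panel, within one row and one column of each other.
        panel_a, row_a, col_a = pos_a
        panel_b, row_b, col_b = pos_b
        return (panel_a == panel_b
                and abs(row_a - row_b) <= 1
                and abs(col_a - col_b) <= 1)

    satisfied = sum(1 for a, b in adjacencies
                    if are_adjacent(layout[a], layout[b]))
    return satisfied / len(adjacencies)

layout = {"A. Rivera": (1, 0, 0), "B. Rivera": (1, 0, 1), "C. Chen": (2, 4, 7)}
requests = [("A. Rivera", "B. Rivera"),
            ("A. Rivera", "C. Chen"),
            ("B. Rivera", "C. Chen")]
print(adjacency_score(layout, requests))  # 0.333...: one of three requests met
```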

In December of that year, I flew to New York to meet with some of the project’s stakeholders and to present the results of the algorithms that I’d developed. I came into the meeting disheveled and nervous. Disheveled because I’d flown into La Guardia that morning, having spent much of the plane ride revising and re-revising my presentation. Nervous because I had found out the day before that another team had also been working on the layout problem; a group of financial analysts (‘quants’) who almost certainly all had at least one PhD.

It must’ve been a strange sight. A small army of besuited financial professionals, across the table from a long-haired artist from Canada with an old, broken laptop. The quants went first: they’d run permutation after permutation on their server clusters, and they were confident they’d found the optimal solution for the adjacencies: a maximum of about 93 percent of them could be satisfied. They’d asked to speak first because they wanted to ‘save us all some time’, since they knew, mathematically, that they had found the most highly optimized solution.

It was a persuasive argument. I let them finish, then I turned my laptop around on the table to show them a layout that I’d generated about a week before — one that was 99.99% solved.

The lesson here is not ‘don’t get a math PhD’. Nor is it (specifically) ‘hire a long-haired data artist from Canada’. The lesson is to not look just at the data, but at the entire system that the data is a part of. Taking a systems approach to data thinking allows you not only to solve problems more efficiently, but to more deeply understand (and critique) the data machinery that ubiquitously affects our day-to-day lives.

An over-simplified and dangerously reductive diagram of a data system might look like this:

Collection → Computation → Representation

Whenever you look at data, whether as a spreadsheet, a database view, or a visualization, you are looking at an artifact of such a system. What this diagram doesn’t capture is the immense branching of choice that happens at each step along the way. As you make each decision (to omit a row of data, to implement a particular database structure, or to use a specific colour palette), you are treading down a path through this wild, tall grass of possibility. It will be tempting to look back and see your trail as the only one that you could have taken, but in reality a slightly divergent you who’d made slightly divergent choices might have ended up somewhere altogether different. To think in data systems is to consider all three of these stages at once, but for now let’s look at them one at a time.
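One way to picture the branching is to imagine each stage as a function whose parameters are the decisions. The sketch below is a deliberately toy model, not real tooling; every keyword argument marks a fork in the path:

```python
# A toy model of the pipeline, not real tooling. Each keyword argument
# is a decision point; different choices produce different artifacts.

def collect(source, drop_missing=True):
    # Decision: what do we do with incomplete measurements?
    return [r for r in source if r is not None or not drop_missing]

def compute(rows, scale=1.0, precision=0):
    # Decisions: how do we scale, and how aggressively do we round?
    return [round(r * scale, precision) for r in rows]

def represent(values, chart="bar", palette="viridis"):
    # Decisions: which chart type, which colours?
    return {"chart": chart, "palette": palette, "values": values}

raw = [3.14, None, 2.71, 9.81]
view = represent(compute(collect(raw)), chart="line")
print(view)  # {'chart': 'line', 'palette': 'viridis', 'values': [3.0, 3.0, 10.0]}
```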

Collection

Any path through a data system starts with collection. As data are artifacts of measurement, they are beholden to the processes by which we measure. This means that by the time you look at your .CSV or your .JSON feed or your Excel graph, it has already been molded by the methodologies, constraints, and omissions of the act of collection and recording.

The most obvious thing that can go wrong at the start of a data system is error, which is rife in data collection. Consider the medical field: A 2012 study of a set of prestigious East Coast hospitals found that only 3% of clocks in hospital devices were set correctly, meaning that any data carrying a timestamp was fundamentally incorrect. In 2013, researchers in India analyzed results from the humbly analogue blood pressure cuff in hospitals and clinics and found the devices carried calibration errors in the range of 10% across the board.
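To see why a calibration error of that size matters, consider a back-of-envelope sketch (the numbers and the threshold below are assumed for illustration, not taken from either study):

```python
# Assumed numbers for illustration only. A systematic 10% calibration
# error shifts every reading, and can move a patient across a threshold.

THRESHOLD = 130                 # a common systolic cutoff (mmHg)
true_systolic = 135             # the patient's actual pressure (mmHg)
calibration_bias = -0.10        # the cuff reads 10% low

measured = true_systolic * (1 + calibration_bias)
print(measured)                      # 121.5
print(true_systolic >= THRESHOLD)    # True: the patient is over the cutoff
print(measured >= THRESHOLD)         # False: the dataset will never know
```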

These kinds of measurement errors are pervasive, inside of hospitals and out. Errors may be unintended, the results of mis-calibrated sensors, poorly worded surveys, or uncounted ballots. They can also be deliberate, stemming from purposeful omissions or applications of heavy-handed filters or conveniently beneficial calibrations.

Going further back from how the data is collected, you should also ask why — or why not. Artist and data researcher Mimi Onuoha, whose practice focuses on missing data, tells us that the very decision of what to collect or what not to collect is political. “For every dataset where there’s an impetus for someone not to collect”, she writes, “there’s a group of people who would benefit from its presence”. Onuoha neatly distilled the importance of understanding collection to the understanding of a data system as a whole in her recent talk at the Eyeo Festival in Minneapolis: “If you haven’t considered the collection process”, she stated, “you haven’t considered the data.”

Computation

After collection, data is almost certainly bound to be computed upon. It may be rounded up or down, truncated, filtered, scaled or edited. Very often it’ll be fed into some kind of algorithmic machinery, meant to classify it into meaningful categories, to detect a pattern, or to predict what future data points from the same system might look like. We’ve seen over the last few years that these algorithms can carry tremendous bias and wield alarming amounts of power. But this isn’t another essay about algorithmic bias. There are many other aspects of computation that should be considered when taking the measure of a data system.
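Even the mundane operations in that list can change the story. A small sketch with invented readings shows how three defensible computation choices produce three different summaries of the same data:

```python
# Invented readings. Three defensible choices, three different answers.
import math

readings = [1.4, 1.5, 2.5, 3.6, 47.0]   # one extreme value; outlier or signal?

mean_raw = sum(readings) / len(readings)
mean_truncated = sum(math.trunc(r) for r in readings) / len(readings)

filtered = [r for r in readings if r < 10.0]   # a plausible "outlier" filter
mean_filtered = sum(filtered) / len(filtered)

print(round(mean_raw, 2))        # 11.2
print(round(mean_truncated, 2))  # 10.8
print(round(mean_filtered, 2))   # 2.25
```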

In Jacob Harris’s 2015 essay Consider the Boolean, he writes about how seemingly inconsequential coding decisions can have an extraordinary impact on the stories our data might ultimately tell. Harris proposes that the harsh true-false logic of computation and the ‘ideal views’ of data that we endeavour to create with code are often insufficient to represent the ‘murky reality the data is trying to describe’. Importantly, he underlines the fact that while computational bias can come from big decisions, it can also come from small ones. While we urgently need to be critical of the way we author machine learning systems, we also need to pay attention to the impact of procedural minutiae — like whether we’re storing a data point as a boolean or a string.
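Harris’s point is easy to make concrete. The sketch below is hypothetical (the field names and categories are mine, not his), but it shows how a strict boolean coerces murky reality while a richer type preserves it:

```python
# Hypothetical sketch of the boolean-versus-string point: a strict boolean
# forces murky reality into two buckets; a richer type keeps the murkiness.

# Schema A: a boolean column. Every record must be True or False, so an
# unverified report gets coerced, and the coercion is invisible later.
incident_confirmed: bool = False          # unverified? disputed? we can't tell

# Schema B: a small enumeration keeps the uncertainty in the data itself.
from enum import Enum

class Status(Enum):
    CONFIRMED = "confirmed"
    DENIED = "denied"
    UNVERIFIED = "unverified"
    DISPUTED = "disputed"

incident_status = Status.UNVERIFIED

# Downstream code is now forced to decide, explicitly, what counts.
strict_count = incident_status is Status.CONFIRMED
generous_count = incident_status in (Status.CONFIRMED, Status.DISPUTED)
print(strict_count, generous_count)       # False False
```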

Representation

As you’ve seen, the processes of collection and computation are rampant with decision points, each of which can greatly increase or greatly limit the ways in which our data systems function. When we reach the representation stage, and begin to decide how our data might tell its story to humans, the possibility space goes critical. Each time you pick a chart type, a colour palette, a line weight or an axis label, you’re trimming the possibility space of communication. Even before that, the choice of a medium for representation has already had a predestinatory effect. A web page, a gatefold print, a bronze parapet — each of these media is embedded with its own special opportunities, and its own unavoidable constraints.
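A minimal example of that trimming, with invented numbers: the same four values plotted twice, once with the axis starting at zero and once truncated. Neither chart is ‘wrong’, but they tell different stories:

```python
# Minimal sketch (invented numbers): the same data, two axis decisions.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [100, 101, 102, 104]

fig, (ax_zero, ax_cropped) = plt.subplots(1, 2, figsize=(8, 3))

ax_zero.bar(quarters, revenue)
ax_zero.set_ylim(0, 120)            # baseline at zero: change looks modest
ax_zero.set_title("Axis from zero")

ax_cropped.bar(quarters, revenue)
ax_cropped.set_ylim(99, 105)        # truncated axis: change looks dramatic
ax_cropped.set_title("Truncated axis")

plt.tight_layout()
plt.show()
```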

Whatever the medium, many of the points that Mimi Onuoha makes about collection can be mapped directly to visualization: questions about what is shown in a visualization and how it is shown must be paired with questions about what isn’t shown and why someone has chosen not to show it. In a quest to avoid the daunting spectre of bias, data visualization practitioners often style themselves as apolitical. However, the very process of visualization is necessarily a political one; as I’ve said for years to my students at NYU, the true medium of data visualization is not color or shape; it’s the decision.

By being mindful of the decisions we’re making when we’re authoring visualizations, we can make better work; by seeing these decisions in work made by others, we can be more usefully critical of the data media that we consume.
