First They Came For The Data Analysts, And I Did Not Speak Out…

Data storage is cheap, and odds are good that any information you store today – if you care just a little about preserving it – can last well beyond your own lifespan. If you’re an intelligence agency and you’re collecting all of the surveillance information you possibly can, the easiest part of your job is probably siloing it so that you’ll have it for hundreds of years. If you’ve got any kind of budget for it, it’s easy to hold on to data practically indefinitely. So, if you’re the subject of surveillance by any of that sort of intelligence agency, all sorts of information collected about you may exist in intelligence silos for decades to come, probably long after you’ve forgotten it. That information exists, for practical purposes, effectively forever.

Suppose that your nation’s intelligence agency decides to collect information in bulk on every citizen it can, including you, and you judge that they are responsible and deserving of your trust, so you don’t mind that they are gathering this information about you and storing it indefinitely. Suppose that they actually are deserving of your trust, and the potentially massive amount of information that they collect and silo about you (and everyone else) is never abused, or even seen by a human analyst. Instead it sits in some massive underground data center, occasionally browsed through by algorithms combing for actual, specific security threats.

Trustworthy governments seem to be pretty stable governments, which is fortunate for people lucky enough to be governed by them. Year after year, there is a very high likelihood that the government will still be pretty great. But that likelihood can never be 100%, which is unfortunate because when you have a non-zero likelihood of something happening and you then compound it over a time scale like “effectively forever”, that puts you in uncomfortable territory. It’s hard to anticipate what sort of threats might exist five years from now, and harder to anticipate what might happen in 20. You have no idea what sort of world you’ll live in 40 years from now, but there are good odds that the extensive information siloed away today will still be around.
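To make the compounding concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that every year independently carries the same small chance of a trustworthy government going bad. The function name and the 1% figure are hypothetical, not estimates drawn from any data.

```python
def prob_failure_within(annual_risk: float, years: int) -> float:
    """Chance of at least one failure within `years` years, assuming each
    year independently carries the same `annual_risk` (illustrative only)."""
    return 1.0 - (1.0 - annual_risk) ** years


# Hypothetical 1% annual risk, compounded over a few time horizons.
for years in (5, 20, 40, 100):
    print(f"{years:>3} years: {prob_failure_within(0.01, years):.1%}")
```

Under those assumptions, a government that is 99% likely to stay great in any given year still has about a one-in-three chance of going wrong at some point over 40 years, and the odds only get worse the longer the data sits around.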

When I read Scott Alexander’s review of Manufacturing Consent, it was apparent that throughout the 20th century and clear into the present day, places that were stable at one point in time became unstable, and death squads followed shortly after. The Khmer Rouge killed about 25% of the population of Cambodia from 1975 to 1979. 1975 is too close to the present to comfortably say that we exist in a modern world where we don’t have to worry about genocide and mass-murdering states.

We have no idea what the mass-murderers of the distant future will care about. Many of them will probably have fairly commonplace criteria for the groups they want to purge based on such things as race, religion, cultural heritage, sexual orientation, and so on. But some will devise criteria we can’t even begin to imagine. In the middle of the 19th century, only a tiny minority of people had even heard of communism, but a generation or so later that doctrine caused the death of millions of people in camps, wars, purges, and famines. Perhaps we’ve exhausted the space of ideologies that are willing to kill entire categories of people, and maybe we’ve identified all of the categories of people that you can identify and decide to purge. But are you willing to bet money, much less your life, on the prediction that you won’t belong to some future class of deplorables?

In some of the purges of history, people had a chance to pretend not to be one of the undesirables. There’s no obvious sign that a Pear Party-affiliated death squad can use to identify a member of the Pineapple Party when the Pineapple Party government is toppled, so long as the Pineapplists know that they’re being targeted by Pear partisans and now is the time to scrape off their Pineapple Party ’88 bumper stickers. High-profile Pineapplists have no option but to flee the country, but the average member can try to lie low through the ensuing sectarian violence. That’s how it used to be, at least. But today people can scroll back 5 years in your Facebook profile and see that you were posting pro-Pineapple links then that you’ve since forgotten.

But open support of the Pineapple Party is too obvious. The undesirables of the future may have enough foresight to cover their tracks when it comes to clear-cut evidence like that. But, returning to the trustworthy intelligence agency we’ve mandated with finding people who want to harm us but also don’t want to be found, there are other ways to filter people. Machine learning and big data analysis are mixed bags. If you really, really need them to preemptively identify people who are about to commit atrocities, you’re probably going to be let down. It’s hard to sift through immense streams of data to find people who don’t want to be found. Not impossible, but machine learning isn’t a magic wand. That said, people are impressed with machine learning for a reason. Sometimes it pulls a surprising amount of signal out of what was previously only noise. And we are, today, the worst at discerning signal from noise that we will ever be. Progress in computational statistics could hit a wall next year, and then we can all temper our paranoia about targeted advertisements predicting our deepest, darkest secrets and embarrassing us with extremely specific ad pitches when our friends are looking over our shoulders. Maybe.

But perhaps it’s possible, if you’re patient and have gigantic piles of data lying around, to combine text analysis, social graph information, and decades-old Foursquare check-ins in order to identify closeted Pineapple Party members. And maybe it requires a small army of statisticians and programmers to do so, so you’re really not worried when the first paper is published that shows that researchers were able to identify supporters of Pineapplism with 65% accuracy. But then maybe another five years go by and the work that previously took that small army of researchers months to do is now available as an R package that anyone with a laptop and knowledge of Statistics 101 can download and use. And that is the point where having gigantic piles of data siloed for a practically infinite amount of time becomes a scary liability.

The scenario where Pearists topple the government, swarm into the intelligence agency’s really big data center, and then know exactly where to go to round up undesirables might be fairly unlikely on its own. But there’s actually a much larger number of less-obvious opportunities for would-be Pearist mass-murderers. Maybe someone finds a decades-old flaw in a previously trusted security protocol and Pear-affiliated hackers breach the silo. Maybe they get information from the giant surveillance silo of a country that, now that we think of it, no one should have sold all of that surveillance software to. Maybe the intelligence agency has a Pearist mole. Maybe the whole intelligence apparatus is Pear-leaning the whole time. Maybe a sizeable majority of the country elects a Pearist demagogue who promises to round up Pineapplists and put them in camps. This sort of thing isn’t behind us.

The data silo is a threat to everyone. In the long run, we can’t anticipate who will have access to it. We can’t anticipate what new category will define the undesirables of the future. And those unknowing future undesirables don’t know what presently-inconspicuous evidence is being filed away in the silo now to resurface decades in the future. But the trend, as it exists, points to a future where large caches of personal data are a liability because future off-the-shelf machine learning tools may be as easy to use and overpowered relative to machine learning’s bleeding edge today as our smartphones are compared to the Apollo Guidance Computer. The wide availability of information on the open internet might itself be dangerous looked at through this lens. But if your public tweets are like dry leaves accumulating in your yard and increasing the risk of a dangerous data-fueled-pogrom wildfire, then mass surveillance silos are like giant rusty storage tanks next to your house that intelligence agencies are pumping full of high-octane petroleum as fast as they can.



Picture credit: Wikimedia Foundation Servers by Wikipedia user Victor Grigas, licensed under CC-BY-SA-3.0.