First They Came For The Data Analysts, And I Did Not Speak Out…

Data storage is cheap, and odds are good that any information you store today – if you care just a little about preserving it – can last well beyond your own lifespan. If you’re an intelligence agency collecting all of the surveillance information you possibly can, the easiest part of your job is probably siloing it so that you’ll have it for hundreds of years; with any kind of budget for it, holding on to data practically indefinitely is easy. So, if you’re the subject of surveillance by that sort of intelligence agency, all sorts of information collected about you may sit in intelligence silos for decades to come, probably long after you’ve forgotten it. For practical purposes, that information exists forever.

Suppose that your nation’s intelligence agency decides to collect information in bulk on every citizen it can, including you, and you judge that they are responsible and deserving of your trust, so you don’t mind that they are gathering this information about you and storing it indefinitely. Suppose that they actually are deserving of your trust, and the potentially massive amount of information that they collect and silo about you (and everyone else) is never abused, or even seen by a human analyst. Instead it sits in some massive underground data center, occasionally browsed through by algorithms combing for actual, specific security threats.

Trustworthy governments seem to be pretty stable governments, which is fortunate for the people lucky enough to be governed by them. Year after year, there is a very high likelihood that the government will still be pretty great. But that likelihood can never be 100%, which is unfortunate, because when you compound even a small chance of something going wrong over a time scale like “effectively forever”, you end up in uncomfortable territory. It’s hard to anticipate what sort of threats might exist five years from now, and harder to anticipate what might happen in 20. You have no idea what sort of world you’ll live in 40 years from now, but there are good odds that the extensive information siloed away today will still be around.
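To see why “compounded over effectively forever” is uncomfortable, here is a back-of-the-envelope sketch in Python; the 1% annual probability is invented purely for illustration, not an estimate of any real government:

```python
# Back-of-the-envelope arithmetic only: the 1% annual figure is made up
# for illustration, not an estimate of any real government.
p_bad_year = 0.01  # chance in any given year that the government turns abusive
for years in (5, 20, 40, 100):
    p_ever = 1 - (1 - p_bad_year) ** years
    print(f"{years:>3} years: {p_ever:.0%} chance of at least one bad stretch")
```

Even a risk that looks negligible in any single year adds up to respectable odds over the lifetime of a data silo.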

When I read Scott Alexander’s review of Manufacturing Consent, it was apparent that throughout the 20th century and clear into the present day, places that were stable at one point in time became unstable, and death squads followed shortly after. The Khmer Rouge killed about 25% of the population of Cambodia between 1975 and 1979. 1975 is too close to the present to comfortably say that we live in a modern world where we no longer have to worry about genocide and mass-murdering states.

We have no idea what the mass-murderers of the distant future will care about. Many of them will probably have fairly commonplace criteria for the groups they want to purge: race, religion, cultural heritage, sexual orientation, and so on. But some will devise criteria we can’t even begin to imagine. In the middle of the 19th century, only a tiny minority of people had even heard of communism, yet within a couple of generations that doctrine caused the deaths of millions of people in camps, wars, purges, and famines. Perhaps we’ve exhausted the space of ideologies that are willing to kill entire categories of people, and maybe we’ve already identified every category of people a state might decide to purge. But are you willing to bet money, much less your life, on the prediction that you won’t belong to some future class of deplorables?

In some of the purges of history, people had a chance to pretend not to be one of the undesirables. There’s no obvious sign that a Pear Party-affiliated death squad can use to identify a member of the Pineapple Party when the Pineapple Party government is toppled, so long as the Pineapplists know they’re being targeted by Pear partisans and that now is the time to scrape off their Pineapple Party ’88 bumper stickers. High-profile Pineapplists have no option but to flee the country, but the average member can try to lie low through the ensuing sectarian violence. That’s how it used to be, at least. But today people can scroll back five years in your Facebook profile and find the pro-Pineapple links you posted then and have since forgotten.

But open support of the Pineapple Party is too obvious. The undesirables of the future may have enough foresight to cover their tracks when it comes to clear-cut evidence like that. But, returning to the trustworthy intelligence agency we’ve charged with finding people who want to harm us but also don’t want to be found, there are other ways to filter people. Machine learning and big data analysis are mixed bags. If you really, really need them to preemptively identify people who are about to commit atrocities, you’re probably going to be let down. It’s hard to sift through immense streams of data to find people who don’t want to be found. Not impossible, but machine learning isn’t a magic wand. That said, people are impressed with machine learning for a reason. Sometimes it pulls a surprising amount of signal out of what was previously only noise. And we are, today, the worst at discerning signal from noise that we will ever be. Progress in computational statistics could hit a wall next year, and then we can all temper our paranoia about targeted advertisements predicting our deepest, darkest secrets and embarrassing us with extremely specific ad pitches when our friends are looking over our shoulders. Maybe.

But perhaps it’s possible, if you’re patient and have gigantic piles of data lying around, to combine text analysis, social graph information, and decades-old Foursquare check-ins in order to identify closeted Pineapple Party members. And maybe it requires a small army of statisticians and programmers to do so, so you’re really not worried when the first paper is published that shows that researchers were able to identify supporters of Pineapplism with 65% accuracy. But then maybe another five years goes by and the work that previously took that small army of researchers months to do is now available as an R package that anyone with a laptop and knowledge of Statistics 101 can download and use. And that is the point where having gigantic piles of data siloed for a practically infinite amount of time becomes a scary liability.
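To make that concrete, here is a deliberately tiny sketch in Python standing in for the hypothetical R package; every post, label, and party name below is made up, and real work of this kind would need far more data and far better features:

```python
# A toy, entirely hypothetical stand-in for the "off-the-shelf package"
# scenario: once someone else has done the hard research, the classification
# step itself is a few lines of scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical archived posts with known party affiliation.
posts = [
    "grilled pineapple belongs on every plate",
    "proud to canvass for the Pineapple Party '88 campaign",
    "nothing beats pineapple juice at a summer picnic",
    "pears are the only honest fruit",
    "the Pear Party rally downtown was inspiring",
    "pear orchards built this country",
]
labels = ["pineapple", "pineapple", "pineapple", "pear", "pear", "pear"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

# Score a long-forgotten post pulled out of the silo decades later.
old_post = ["time to scrape the old bumper sticker off before the rally"]
print(model.predict(old_post)[0], model.predict_proba(old_post)[0])
```

The point is not that this toy model works; it is that the barrier to running it is a laptop and an afternoon, not a research team.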

The scenario where Pearists topple the government, swarm into the intelligence agency’s really big data center, and then know exactly where to go to round up undesirables might be fairly unlikely on its own. But there is a much larger number of less-obvious opportunities for would-be Pearist mass-murderers. Maybe someone finds a decades-old flaw in a previously trusted security protocol and Pear-affiliated hackers breach the silo. Maybe they get information from the giant surveillance silo of a country that, now that we think of it, no one should have sold all of that surveillance software to. Maybe the intelligence agency has a Pearist mole. Maybe the whole intelligence apparatus was Pear-leaning the whole time. Maybe a sizeable majority of the country elects a Pearist demagogue who promises to round up Pineapplists and put them in camps. This sort of thing isn’t behind us.

The data silo is a threat to everyone. In the long run, we can’t anticipate who will have access to it. We can’t anticipate what new category will define the undesirables of the future. And those unknowing future undesirables don’t know what presently-inconspicuous evidence is being filed away in the silo now, only to resurface decades in the future. But the trend, as it exists, points to a future where large caches of personal data are a liability, because future off-the-shelf machine learning tools may be as easy to use and as overpowered relative to today’s bleeding edge as our smartphones are compared to the Apollo Guidance Computer. Looked at through this lens, even the wide availability of information on the open internet might be dangerous. But if your public tweets are like dry leaves accumulating in your yard and increasing the risk of a dangerous data-fueled-pogrom wildfire, then mass surveillance silos are like giant rusty storage tanks next to your house that intelligence agencies are pumping full of high-octane petroleum as fast as they can.



Picture credit: Wikimedia Foundation Servers by Wikipedia user Victor Grigas, licensed under CC-BY-SA-3.0.

Legal Innovation: Warrant Canaries

I recently came across a fascinating legal concept called warrant canaries. I’m going to cover the idea briefly, but if you want to know more about them in detail, I highly recommend this Warrant Canary FAQ at the Electronic Frontier Foundation.

The context is that many online services based in the United States can be compelled by the FBI, through National Security Letters, to hand over whatever information they have to law enforcement. Those letters often gag the companies, barring them from informing their customers that they are being spied on, even if the service exists specifically so that users can get encrypted, private communication. It’s hard to pin down the exact constitutionality of NSLs: they were ruled unconstitutional in 2013, but it looks like the case was remanded in 2015 after the passage of the USA Freedom Act. Given the government’s continued efforts to obtain information regardless of constitutionality and of the limitations placed on it by Congress, it would be nice to have some way of telling whether a service is under duress from the government.

The usefulness of warrant canaries (I’ll get to what they are in a moment) is based on two legal concepts: (1) it’s not illegal to inform anyone of a warrant you haven’t been served, and (2) the state cannot compel false speech.

The first statement is common sense: you can’t be barred from simply stating that something hasn’t happened yet. The second is a bit more subtle; a stronger statement would be that the state cannot compel speech at all, but that isn’t always true. The state can sometimes compel commercial speech, requiring companies to disclose information so that consumers can make accurate decisions. The EFF elaborates that “…the cases on compelled speech have tended to rely on truth as a minimum requirement”.

This is essential because it allows companies with encryption products to convey highly relevant information to their customers. Because of the first legal concept, a company can publicly post a message saying it has not received a warrant, and because the state cannot compel false speech, it can take that message down the moment it does receive one. Attentive customers then read the canary’s disappearance as the warning the company itself is forbidden to give.
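As an illustration of how a customer might watch a canary automatically, here is a minimal sketch in Python; the URL, the date-line format, and the monthly update cadence are all assumptions made up for the example:

```python
# Minimal sketch of a client-side canary watcher. The URL, the "Date:" line
# format, and the one-month cadence are assumptions for the example; a real
# check should also verify the PGP signature (see the next example).
from datetime import datetime, timedelta
from urllib.error import URLError
from urllib.request import urlopen

CANARY_URL = "https://example.com/canary.txt"  # hypothetical
MAX_AGE = timedelta(days=35)                   # say the canary is promised monthly

try:
    text = urlopen(CANARY_URL, timeout=10).read().decode("utf-8")
except URLError:
    print("Canary missing or unreachable; treat this as a warning sign.")
else:
    # Assume the canary includes a line like "Date: 2016-02-01".
    date_line = next(line for line in text.splitlines() if line.startswith("Date:"))
    signed_on = datetime.strptime(date_line.split(":", 1)[1].strip(), "%Y-%m-%d")
    if datetime.now() - signed_on > MAX_AGE:
        print("Canary is stale; it may have quietly stopped being updated.")
    else:
        print("Canary is present and current.")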

To ensure the authenticity of the message stating that the given company has not been subject to an NSL, many go an extra step and sign their messages with a PGP key (example here).
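Verifying such a signature is easy to automate. Here is a small sketch using Python and the standard gpg command-line tool; it assumes gpg is installed, the company’s public key is already in your keyring, and the clearsigned canary has been saved locally under a file name made up for the example:

```python
# Sketch of checking a clearsigned canary with the gpg command-line tool.
# Assumes gpg is installed, the signer's public key is in the local keyring,
# and the canary is saved as canary.txt (file name made up for the example).
import subprocess

result = subprocess.run(
    ["gpg", "--verify", "canary.txt"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("Good signature: the canary really was signed by the key's owner.")
    print(result.stderr)  # gpg prints the signature details on stderr
else:
    print("Signature check failed: do not trust this canary.")
```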

Of course, a more foolproof way to ensure no useful data can be handed over is to encrypt it end to end, as Apple has done with the iPhone, as ProtonMail does for email, and as everyone who has ever sent PGP-encrypted email has been doing since the 90s. But I still like this idea, because individuals who run encryption services should not be forced to be government puppets, like the FBI tried to do to Ladar Levison.

The weakness is that we don’t know what we don’t know, so it’s possible the government already has some new Secret National Security Letter that it uses to compel companies to lie under some made-up interpretation of an arcane piece of legislation. The only real security is end-to-end encrypted communication, or being Hillary Clinton.