Pizzagate’s Big Data Problem

One of the most interesting phenomena to come out of last year’s leak of e-mails belonging to John Podesta, Hillary Clinton’s campaign chairman, was the Pizzagate affair, which has since morphed into “Paedogate”. (For those of you interested in getting a flavour of Pizzagate, the #Pizzagate hashtag on Twitter will give you a good overview.)

For those of you not familiar with Pizzagate, it is a narrative (or, depending on your point of view, a conspiracy theory) which claims that a child sex ring operated within the senior echelons of the United States Democratic Party.  This child sex ring was claimed to be linked to various restaurants, in particular a pizza restaurant in Washington DC called the Comet Ping Pong Pizzeria.  It started with the leaking of the e-mails by WikiLeaks, and the story spread rapidly on social media, particularly on Twitter, 4chan and Reddit.  Subsequent publication of the story by certain Turkish pro-government official news media gave it a further boost.

The basis of the claim was that some of the leaked e-mails contained coded references to child abuse, prostitution and human trafficking.  For example, “cheese pizza” was supposed to be a code for child pornography.  Other words in the e-mails are also claimed to be codes, such as:

Hot dog = boy
Pizza = girl
Ice cream = male prostitute
Sauce = orgy

The whole Pizzagate story has been debunked by the US police, fact-checking organisations and mainstream news organisations; however, its proponents claim that the code words occur so often in the e-mails that the story must have some truth in it.  At first appraisal this is not necessarily an unreasonable position; there is absolutely no reason why an e-mail, or any other form of text, could not contain a secret code.  However, on closer examination the theory suffers from a problem that is both obvious and fundamental to anybody familiar with data analysis: the sheer number of e-mails.

There were some 19,000 e-mails leaked in October 2016, many of them running to multiple pages and containing attachments.  Analysing such a huge dataset leads to a common “Big Data” problem: random matches.  Given the huge number of e-mails, and the fact that some will unavoidably be about food (and pizza is a food option), words will come together purely by chance in sequences that can be interpreted to mean something completely different from their literal meaning.  So finding a phrase that can be read as referring to child abuse, or indeed to any other subject, is not only feasible but likely, and the larger the dataset the closer it comes to a certainty.
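To put rough numbers on this, here is a minimal sketch of the arithmetic.  It assumes each e-mail independently has some small probability of containing a phrase that could be read as “code” purely by chance.  That per-e-mail probability is a made-up illustrative figure, not something measured from the leaked e-mails, and the function names are my own.

```python
def chance_of_any_match(n_messages: int, p_per_message: float) -> float:
    """Probability that at least one of n independent messages
    contains a chance match: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_per_message) ** n_messages


def expected_matches(n_messages: int, p_per_message: float) -> float:
    """Average number of chance matches expected across n messages."""
    return n_messages * p_per_message


# 19,000 is the figure quoted above; the probabilities below are
# ASSUMED values chosen purely for illustration.
N_EMAILS = 19_000

for p in (1e-5, 1e-4, 1e-3):
    print(
        f"p = {p:g}: "
        f"expected matches = {expected_matches(N_EMAILS, p):.2f}, "
        f"chance of at least one = {chance_of_any_match(N_EMAILS, p):.1%}"
    )
```

Under these assumptions, even if only one e-mail in 10,000 happened to contain a suspicious-looking phrase by accident, the chance of finding at least one somewhere in the dataset would be roughly 85%.  The independence assumption is a simplification, but the general point holds: the more text you search, the more chance matches you should expect to find.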

Given the lack of any other hard evidence to support the existence of a child sex ring, it is clear that people who believe the Pizzagate e-mails relate to one are being fooled by random matches.

Of course, it is not just the proponents of Pizzagate who are confounded by random matches in big datasets.  Random matches are one of the problems encountered in the analysis of any large dataset of any type.  This includes the vast amounts of data hoovered up by governments through mass surveillance and, in the UK, the sort of internet snooping brought about by the Investigatory Powers Act.  That is something that should be remembered by those politicians and securocrats who view a surveillance state as some sort of security utopia.