An exercise in e-discovery built around Hillary Clinton’s “damn emails” underscores the importance of AI assistance in the era of big data.
At the first Democratic primary debate in 2015, Bernie Sanders famously quipped to Hillary Clinton, “We’re tired of hearing about your damn emails.”
Since then, we’ve been reminded of Hillary’s emails almost daily. It’s July 2018, and James Comey is on a book tour (indeed, he made a stop at the OpenText Enfuse Conference – did you catch his keynote?). But at least the saga popularized e-discovery and now my friends and family kind of understand what I do for a living.
In the midst of all this, my colleagues and I took on a little project surrounding Hillary Clinton’s emails from her tenure as secretary of state, released under the Freedom of Information Act. Forget politics or Hillary’s conduct – we thought it would be interesting to discuss what we learned from loading a set of 20,000 emails into Axcelerate and investigating them with AI and analytics. Here are the lessons:
Lesson #1: When life gives you PDFs stripped of data, get scripting
The first step in an e-discovery project is converting the data into a standard, usable format for processing. Normally, this is a simple matter of ingesting an EnCase LEF file. Of course, the government’s FOIA response came in one of the least usable formats possible. Each email was redacted, printed and then scanned back in as an individual PDF file, which essentially stripped out the metadata. I presume they repeated this process a couple more times, because the images have artifacts and noise all over.
It turned out not to be a huge problem. I used scripts from my friends in OpenText Professional Services to OCR the PDFs and extract and map the key metadata fields. We could accurately leverage the metadata necessary to filter on the basics – like time, day, sender, recipient, domain and more. Of course, my millennial influence only goes so far, so I couldn’t get the dataset perfect. But to borrow from Voltaire, I didn’t want to let perfect be the enemy of good. After all, e-discovery is about reasonableness, not perfection. To address simple data failures like corrupt documents and incorrect file types, I leveraged a basic set of exclusionary filters applied inside the review interface.
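To give a flavor of what that kind of scripting involves, here is a minimal Python sketch of pulling header fields back out of OCR text. It is not the Professional Services script; the header labels, the date format, the field names and the sample text are all assumptions made for illustration, and real scans will be noisier than this.

```python
import re
from datetime import datetime

# Assumed header layout: one "Label: value" pair per line near the top of the OCR text.
HEADER_PATTERNS = {
    "sender": re.compile(r"^From:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "recipient": re.compile(r"^To:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "sent": re.compile(r"^Sent:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "subject": re.compile(r"^Subject:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
}
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
SENT_FORMAT = "%A, %B %d, %Y %I:%M %p"  # assumed format, e.g. "Monday, March 1, 2010 9:15 AM"


def extract_metadata(ocr_text):
    """Map OCR text to a dict of header fields; missing fields become None."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1).strip() if match else None

    # Derive the sender domain so documents can be filtered by domain later.
    address = EMAIL_RE.search(fields["sender"] or "")
    fields["sender_domain"] = address.group(1).lower() if address else None

    # Parse the sent date; leave it as None when OCR noise breaks the format.
    fields["sent_at"] = None
    if fields["sent"]:
        try:
            fields["sent_at"] = datetime.strptime(fields["sent"], SENT_FORMAT)
        except ValueError:
            pass
    return fields


# Hypothetical OCR output, not taken from the FOIA release.
sample = (
    "From: Jane Doe <jane.doe@example.com>\n"
    "Sent: Monday, March 1, 2010 9:15 AM\n"
    "To: H\n"
    "Subject: Speech notes\n"
    "\n"
    "Draft attached.\n"
)
meta = extract_metadata(sample)
```

Documents where a field comes back as None are natural candidates for the exclusionary filters mentioned above, which is one way to keep "reasonable" from drifting into "sloppy."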
Lesson #2: Jump-start the investigation with unsupervised machine learning (AI)
Artificial intelligence is a force-multiplier for human effort; it enables lean teams to investigate at scale. We work with clients that use AI on every project to prioritize hot documents and double-check human decisions. There are two kinds of AI for this scenario: unsupervised machine learning (generally known as concept analysis or clustering) and supervised machine learning (generally known as predictive coding or TAR).
Concept analysis can use a statistical algorithm or a taxonomy to categorize conceptually similar documents into related groups. I’m using a statistical version (probabilistic latent semantic analysis, or PLSA) that learns semantic relationships based on the content itself, looking at dimensions like repetition, proximity, phrases and more. It has the added benefit of adapting to the domain, changing as the corpus changes, and working in any language. As labels, it displays the actual words from the data that it considers most representative, indicating the kind of content you might find within each concept group.
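For the curious, the core of PLSA fits in a few dozen lines. The sketch below is a textbook-style toy written in plain Python, not Axcelerate’s implementation: it works on bags of words only (no proximity or phrase handling), and the function names, parameters and four tiny "documents" are assumptions for illustration.

```python
import random
from collections import Counter


def plsa(docs, n_topics=2, n_iter=50, seed=0):
    """Fit a toy PLSA model with EM. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    counts = [Counter(doc) for doc in docs]

    def random_dist(n):
        weights = [rng.random() + 0.1 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    # Random starting points for P(topic | doc) and P(word | topic).
    p_z_d = [random_dist(n_topics) for _ in docs]
    p_w_z = [dict(zip(vocab, random_dist(len(vocab)))) for _ in range(n_topics)]

    for _ in range(n_iter):
        new_w_z = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in docs]
        for d, doc_counts in enumerate(counts):
            for word, n in doc_counts.items():
                # E-step: P(topic | doc, word) is proportional to
                # P(topic | doc) * P(word | topic).
                joint = [p_z_d[d][z] * p_w_z[z][word] for z in range(n_topics)]
                norm = sum(joint) or 1.0
                for z in range(n_topics):
                    resp = n * joint[z] / norm
                    new_w_z[z][word] += resp
                    new_z_d[d][z] += resp
        # M-step: renormalise the accumulated counts into probabilities.
        for z in range(n_topics):
            total = sum(new_w_z[z].values()) or 1.0
            p_w_z[z] = {w: v / total for w, v in new_w_z[z].items()}
        for d in range(len(docs)):
            total = sum(new_z_d[d]) or 1.0
            p_z_d[d] = [v / total for v in new_z_d[d]]
    return p_z_d, p_w_z


def top_labels(p_w_z, k=3):
    """Return the k most probable words per topic, used as concept labels."""
    return [sorted(topic, key=topic.get, reverse=True)[:k] for topic in p_w_z]


# Hypothetical toy documents, already tokenized.
docs = [
    ["embassy", "cable", "consulate"],
    ["embassy", "diplomat", "cable"],
    ["university", "academy", "culture"],
    ["university", "culture", "empowerment"],
]
p_z_d, p_w_z = plsa(docs, n_topics=2)
labels = top_labels(p_w_z)
```

Each row of `p_z_d` says how strongly a document belongs to each concept group, and `labels` holds the words that would be shown as that group’s label. EM only finds a local optimum, so different seeds can produce different groupings.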
The concept groups generated from the FOIA release are compelling. For example, one group includes documents about topics such as Bin Laden, Al Qaeda, Afghanistan, Pakistan and insurgents. If we were investigating terrorism or foreign policy in the Middle East, this would be a good place to start.
Meanwhile, a second group has documents with similar but conceptually distinct labels, covering topics such as embassy, cable, consulate, diplomat, incident and evacuate. It’s not hard to guess the subject of documents in that group: Benghazi.
As you’ll see, a third group is the most important to my investigation. It has hundreds of documents about topics such as university, academy, empowerment, civil society, culture, women, foreign service and public diplomacy. This group stands out because of its positive, aspirational language, so I investigated further.
In this concept group, I found something interesting and very personal to me: speech notes for my college graduation! Hillary Clinton spoke at my NYU commencement on May 13, 2009. (It’s worth noting that she incorporated these suggestions having received them just a day or two before speaking in front of 40,000 people – not something I could do.) What I find particularly interesting about this document is that it never says the words “university” or “academy,” yet the machine effectively understood based on the co-occurrence of words that NYU stands for New York University.
And, sure enough, the content of the speech notes closely aligns with those other labels that the machine identified – service, equality and empowerment. I know because I was there.
Lesson #3: Use communication maps to find hidden email and chat patterns
Visualizing sender and receiver patterns identifies anomalous behavior. Next I turned to the Axcelerate communications map, Hypergraph. This tool visualizes the relationships between domains and individuals for any type of document that has a sender and receiver field (emails and chats, for instance). You see not only what domains are most active, but also the individuals and how they interact. It’s far easier to understand the key players at a glance than to interpret a massive table of text logs. As they say, a picture is worth a thousand words (or, in this case, a thousand hours of FBI investigation).
When I clicked to visualize “email domain activity,” the findings aligned with the premise of the FBI investigation; much of the released email was routed through the “clintonemail.com” domain. But here’s the more intriguing bit: Even in this highly redacted, sanitized dataset, we still found instances of other personal email accounts (e.g., gmail.com).
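Under the hood, a domain activity view boils down to counting who sends to whom. Here is a rough Python sketch of that counting step, assuming each message has already been reduced to a sender and a list of recipients; the function names and the example.com addresses are made up for illustration and say nothing about how Hypergraph itself is built.

```python
from collections import Counter


def domain_of(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()


def domain_edges(messages):
    """Count messages flowing from each sender domain to each recipient domain."""
    edges = Counter()
    for msg in messages:
        src = domain_of(msg["sender"])
        for recipient in msg["recipients"]:
            edges[(src, domain_of(recipient))] += 1
    return edges


def domain_activity(edges):
    """Count how often each domain appears as either end of an edge."""
    activity = Counter()
    for (src, dst), n in edges.items():
        activity[src] += n
        activity[dst] += n
    return activity


# Hypothetical messages using reserved example domains.
messages = [
    {"sender": "alice@example.com", "recipients": ["bob@example.org", "carol@example.net"]},
    {"sender": "bob@example.org", "recipients": ["alice@example.com"]},
    {"sender": "alice@example.com", "recipients": ["bob@example.org"]},
]
edges = domain_edges(messages)
activity = domain_activity(edges)
busiest = activity.most_common(1)
```

The edge counts are what a graph layout would draw as lines of varying thickness, and the activity totals are what would size the circles.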
In the figure above, the left circle was a distribution of Hillary Clinton’s schedule for the day. Fairly mundane stuff, except that it listed a meeting with Archbishop Desmond Tutu at 10:30 a.m. And this document now resided on a Gmail server.
The second circle was more interesting in that the sender and receiver lines had been redacted. So why, then, did they show up in our Gmail search if there are no Gmail domains specified? I investigated further and discovered that the “unsubscribe line” at the bottom of the email had not been redacted!
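That kind of leftover is easy to hunt for systematically. The sketch below, again in Python and again only an illustration, flags documents whose body text still contains an address at a given domain even though the header fields do not; the document structure, the "[REDACTED]" placeholder and the sample text are assumptions.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def residual_addresses(doc, domain):
    """Return body addresses at `domain` that do not appear in the header fields."""
    header_text = " ".join(
        [doc.get("sender") or "", *(doc.get("recipients") or [])]
    ).lower()
    found = {address.lower() for address in EMAIL_RE.findall(doc.get("body", ""))}
    return sorted(
        address
        for address in found
        if address.endswith("@" + domain) and address not in header_text
    )


# Hypothetical document with redacted headers and an unredacted footer.
doc = {
    "sender": "[REDACTED]",
    "recipients": ["[REDACTED]"],
    "body": "Agenda attached.\nTo unsubscribe, reply to list-owner@example.org.",
}
leaks = residual_addresses(doc, "example.org")
```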
From there, my fact-finding journey took a new turn, as I now wanted to investigate this new “person of interest.” Even though she wasn’t a custodian, several of her documents were still contained in the collection, and I could run other searches to find more.
Lesson #4: Use phrase analysis to see how names and places appear in context
Keywords continue to be the building blocks of successful investigations, but context can completely change their impact. Names are notoriously tricky to search for. We often construct Boolean searches with proximities and expanders to catch all the different permutations of a name (e.g., Adam; Adam Kuhn; Adam Harrison Kuhn; A. Kuhn; akuhn; Kuhn, Adam). With phrase analysis, however, we can search all those proximities and variations at once, with a helpful preview.
I ran a search on my person of interest’s last name and, using phrase analysis, was able to easily select all the permutations of her name in the dataset. Looking at the results, I quickly discovered why her gmail.com domain appeared in the first place (see image below).
I also entered some other simple search terms for people I thought might be interesting. “Bill,” for instance, displayed phrase results for Bill Clinton, President Bill Clinton, Bill Burns, appropriations bill, healthcare bill, and other results that incorporate the keyword as part of a larger word, like “billion.”
A similar search for “Sean” turned up none other than Academy Award-winning actor Sean Penn! Apparently, he and Cheryl Mills are friends and exchanged emails after he won the Peace Summit Award for his disaster recovery work in Haiti.
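A crude approximation of that phrase preview is to collect the words on either side of every token containing the keyword and count the resulting phrases. The Python sketch below does just that; it is a stand-in for illustration rather than the product’s phrase analysis, and the function name, the `window` parameter and the sample sentences are assumptions.

```python
import re
from collections import Counter


def phrases_containing(keyword, texts, window=1):
    """Count short phrases around every token that contains `keyword`.

    Matching is a case-insensitive substring test, so "bill" also hits "billion".
    """
    counts = Counter()
    key = keyword.lower()
    for text in texts:
        tokens = re.findall(r"[A-Za-z']+", text)
        for i, token in enumerate(tokens):
            if key in token.lower():
                start = max(0, i - window)
                phrase = " ".join(tokens[start:i + window + 1]).lower()
                counts[phrase] += 1
    return counts


# Hypothetical sentences, not quotes from the dataset.
texts = [
    "Bill Clinton called.",
    "The appropriations bill passed.",
    "It cost a billion dollars.",
]
phrase_counts = phrases_containing("bill", texts)
```

Seeing "bill clinton," "appropriations bill passed" and "a billion dollars" side by side makes it obvious which hits to select and which to discard, which is the point of previewing a keyword in context.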
Lesson #5: Use supervised machine learning to find similar content and double-check human decisions
Most e-discovery applications today also include some form of supervised machine learning. This type of AI learns from human relevancy decisions, building a data model that looks for similar content. In effect, it’s not unlike Pandora Radio or Netflix; as you rank songs or movies, the algorithm is able to learn what you like and suggest similar media. Likewise, as a human investigator moves through the review using the methods discussed above, the algorithm is in the background – learning continuously and evaluating the other documents in the dataset for similarities. Predictive coding is neither complicated nor rigid, and I leveraged it on the fly in this dataset with no special preparation.
I wanted to find documents about Libya and the Libyan civil war, and I did that by using predictive coding and just one obvious keyword: Benghazi. To kick-start the training, I bulk-coded a couple hundred documents that simply had that keyword. In less than a minute, Axcelerate suggested a batch of potentially similar documents about Libya and Gaddafi. My workflow was rudimentary: I’d review only about 10 to 20 documents and then retrain the system (with a click) to find more. By my fifth iteration, the machine-generated suggestions were approaching 90 percent relevance to my topic of interest.
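The loop itself (seed with a keyword, let the machine suggest, review a few, retrain) can be sketched in a few lines of Python. The scoring model here is a simple smoothed log-odds weight per term, chosen because it is easy to read; it is an assumption for illustration and not the algorithm Axcelerate uses. The tiny corpus and the single hand-labeled "not relevant" example are also made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """Learn a log-odds weight per term from (text, is_relevant) pairs."""
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labeled:
        (relevant if is_relevant else irrelevant).update(set(tokenize(text)))
    n_rel = sum(1 for _, is_relevant in labeled if is_relevant)
    n_irr = len(labeled) - n_rel
    weights = {}
    for term in set(relevant) | set(irrelevant):
        # Add-one smoothing keeps unseen terms from producing log(0).
        p_rel = (relevant[term] + 1) / (n_rel + 2)
        p_irr = (irrelevant[term] + 1) / (n_irr + 2)
        weights[term] = math.log(p_rel / p_irr)
    return weights


def score(text, weights):
    """Sum the weights of the distinct terms in a document."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(text)))


def suggest(unreviewed, weights, batch_size=2):
    """Return the highest-scoring unreviewed documents for the next review batch."""
    return sorted(unreviewed, key=lambda t: score(t, weights), reverse=True)[:batch_size]


# Hypothetical corpus.
corpus = [
    "benghazi consulate security update",
    "libya transition council meeting benghazi",
    "lunch schedule for tuesday",
    "gaddafi forces near tripoli libya",
    "libya rebels advance",
    "holiday party schedule",
]
# Seed: bulk-code keyword hits as relevant, plus one assumed not-relevant example.
labeled = [(doc, True) for doc in corpus if "benghazi" in doc]
labeled.append(("lunch schedule for tuesday", False))
reviewed = {text for text, _ in labeled}
unreviewed = [doc for doc in corpus if doc not in reviewed]

weights = train(labeled)
batch = suggest(unreviewed, weights)
```

In a real review, the reviewer’s decisions on `batch` would be appended to `labeled` and `train` called again. The share of each suggested batch that the reviewer marks relevant is the number to watch from one iteration to the next.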
It is hard to imagine completing a successful e-discovery project or digital investigation in the age of big data without some form of AI assistance. In one function, it helps to passively organize and categorize large volumes of data. In another function, it actively scales human judgment, intuition and curiosity to meet the demands of contemporary data-driven investigations. And now machine learning is democratized to the point that it can be driven without an extended team of programmers and data scientists.
Having used this technology firsthand, I am willing to bet that it will become de rigueur and even a legal requirement for e-discovery within my lifetime.