THE PANTSUIT REPORT: Michael Wolraich’s Confession: ‘I Sorted Hillary Clinton’s Email’
Posted: March 19, 2015
Natural language — the way people ordinarily speak and write — is notoriously difficult to parse
Michael Wolraich writes: When Hillary Clinton released emails from her personal account last week, many assumed that her attorneys had personally reviewed the messages before sending them to the State Department, but that’s not what happened. As detailed in her press statement, the review team used keyword searches to automatically filter over 60,000 messages, flagging about half as work related.
“I have absolute confidence that everything that could be in any way connected to work is now in the possession of the State Department,” Clinton declared.
I’m afraid that I don’t share her confidence, and I speak from experience. Twenty years ago, I used the same method to sort the Clinton administration’s email communications, including those of First Lady Hillary Clinton. It failed miserably.
Email did not exist when Congress established the Freedom of Information Act in 1967, and government officials did not originally consider electronic communications to be public records that they had to preserve and disseminate. On the last day of Ronald Reagan’s presidency, a group of organizations representing archivists and libraries sued the White House to prevent the administration from deleting email relating to the Iran-Contra scandal. A temporary injunction was issued, and the case wound its way through the courts until 1993, when a federal judge ordered President Bill Clinton to preserve all electronic communication under the Freedom of Information Act.
In 1994, I was 22 years old, fresh out of college and working as a computer programmer for a company called Information Management Consultants. IMC was one of many three-letter-acronym corporations that ring Washington’s famous beltway and feed off government contracts. I dressed in a gray J.C. Penney suit and programmed three-letter-acronym computer languages (SQL, 4GL) for three-letter-acronym federal agencies (IRS, OPM, DOI, OMB, DOD). It was dull work, made duller by my company’s decision to block employee access to the “World Wide Web” so that we would not be distracted from our work.
One day a colleague invited me to join a mysterious new project for the Executive Office of the President (EOP). The White House had hired IMC to archive its email after the court ordered it to preserve electronic records. Few people had multiple email accounts back then and many federal employees used their work accounts for personal communication, so we had to figure out some way to distinguish work email from personal correspondence.
Those were heady days for a young government IT contractor. We had a special office in Arlington, Virginia, where we could dress casually while pursuing important, groundbreaking work for the President of the United States! We lounged around the conference table in our khakis and scrawled deep thoughts on the big whiteboard. Mostly, we wrote words: president, federal, treasury, treaty, China, Serbia, ambassador, military, classified, and so on. These were the keywords with which we hoped to flag all the work-related messages, or at least the vast majority. We included the names of federal officials, common misspellings, and, of course, numerous three-letter acronyms. Since I had a philosophy degree, the team leader asked me to design logic to make the search smarter, e.g., “white AND house.”
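For readers who want to see the idea concretely, here is a minimal sketch of a keyword filter of the kind described above. It is not IMC's actual code, which the article does not show. The keyword list is taken from the examples in the text, while the names `KEYWORDS`, `AND_RULES`, `tokenize`, and `is_work_related` are hypothetical, as is the choice to match on whole lowercase words.

```python
import re

# Single keywords named in the article; a real list would be far longer.
KEYWORDS = {"president", "federal", "treasury", "treaty", "china",
            "serbia", "ambassador", "military", "classified"}

# Hypothetical AND rules: every word in a group must appear in the message.
AND_RULES = [{"white", "house"}]


def tokenize(text):
    """Lowercase the text and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_work_related(text):
    """Flag a message if it contains any keyword or satisfies any AND rule."""
    words = tokenize(text)
    if words & KEYWORDS:
        return True
    return any(rule <= words for rule in AND_RULES)


# Illustrative messages (invented for this sketch, not from the real archive).
print(is_work_related("Meeting at the White House at noon"))            # True
print(is_work_related("My heart is a classified document, darling"))    # True
print(is_work_related("Can you send the draft to the Hill by Friday?")) # False
```

The second and third messages hint at the weakness: a love note that happens to use the word "classified" is flagged as work, and a work request that uses none of the listed words slips through.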
“To make sense of natural language, it’s not sufficient to recognize the words; you also need to understand grammar, appreciate nuance, interpret metaphors, grasp allusions…”
To test our algorithm, the administration gave us a batch of sample messages. They included official business, such as a debate about a public scandal in which an official traveled by federal helicopter to play golf, and less official business, such as a private love note between two staff members. We ran our algorithm and crossed our fingers.
The results were abysmal. Even after significant tweaking, I don’t recall achieving more than a 70 percent success rate, which is particularly poor when you consider that random sorting would yield 50 percent if the distribution were even. IMC ultimately scrapped our troubled sorting project in favor of a feature that allowed users to manually flag messages that should not be archived….
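The comparison with random sorting can be made concrete with a little arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the ten labels and predictions are invented toy values chosen to produce a 70 percent score, not data from the project. If a fraction p of messages is work related and a random sorter guesses "work" with probability q, its expected success rate is p*q + (1 - p)*(1 - q), which equals 50 percent whenever p is one half.

```python
def accuracy(predictions, labels):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


def expected_random_accuracy(p_work, q_guess_work):
    """Expected success rate of a sorter that guesses 'work' at random.

    p_work: fraction of messages that really are work related.
    q_guess_work: probability that the random sorter says 'work'.
    """
    return p_work * q_guess_work + (1 - p_work) * (1 - q_guess_work)


# Toy example: an even split of work (True) and personal (False) messages.
labels = [True] * 5 + [False] * 5
predictions = [True, True, True, True, False,
               False, False, False, True, True]

score = accuracy(predictions, labels)
baseline = expected_random_accuracy(0.5, 0.5)
print(f"filter: {score:.0%}, random: {baseline:.0%}")
```

On an even split, a 70 percent success rate is only 20 percentage points better than guessing, which is the sense in which the result was poor.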
Michael Wolraich is the author of “Unreasonable Men: Theodore Roosevelt and the Republican Rebels Who Created Progressive Politics” (Palgrave Macmillan).