Thursday, March 30, 2006

Massive multiplayer fact-checking

Julian Sanchez of Reason Magazine writes about the Washington Post's Ben Domenech, hired as a conservative blogger and then forced to resign days later after bloggers noticed he had plagiarized parts of several articles earlier in his career. Bloggers did what the Post's HR staff did not, or, as Sanchez argues, could not. The task is just too large for a few people, but not too large for thousands of energetic dorks avoiding their bosses' gaze:
Maybe someone saw a phrase they thought looked familiar and started Googling. Once the first instance of apparent plagiarism was spotted and blogged, thousands more began looking through that same body of writing, perhaps with each individual only checking a few pieces, a few phrases at a time. The same task would have taken a committed body of researchers days, but because the task was what Net theorist Yochai Benkler would call highly modular and granular—capable of being broken up into highly fine-grained microtasks—a distributed swarm of bloggers was able to accomplish it incredibly quickly, turning up many more instances in a matter of hours.
Sanchez notes, however, that this advantage does not make blogs better than newspapers, just different:
The blogosphere's virtues on this front are not necessarily the Post's defects, any more than it's a problem with the blogosphere per se that it's less well suited to producing intensive, sustained investigative reporting on stories that aren't similarly modular and granular. They're different kinds of information systems with different comparative advantages.
This comes at the same time as news that the US government is using the distributed skills of internet users to translate a huge trove of once-classified internal documents of Saddam Hussein's government:
The documents' value is uncertain--intelligence officials say that they are giving each one a quick review to remove anything sensitive. Skeptics of the war, suspicious of the Bush administration, believe that means the postings are either useless or cherry-picked to bolster arguments for the war.
There are up to 55,000 boxes, with possibly millions of pages. The documents are being posted a few at a time--so far, about 600--on a Pentagon Web site, often in Arabic with an English summary.
"The secret of the 21st century is attract a lot of smart people to focus on problems that you think are important," said Glenn Reynolds, the conservative blogger at ...
This trend owes a lot to Distributed Proofreading, an offshoot of public domain e-book library Project Gutenberg. Distributed Proofreading scans books that have fallen into the public domain, then uses optical character recognition software to generate best-guess text files from the scanned images. On their website, anyone can proofread the books, one page at a time.

From their front page:
When a proofreader elects to proofread a page of a particular book, the text and image file are displayed on a single web page. This allows the page text to be easily reviewed and compared to the image file, thus assisting the proofreading of the page text. The edited text is then submitted back to the site via the same web page that it was edited on. A second proofreader is then presented with the work of the first proofreader and the page image. Once they have verified the work of the first proofreader and corrected any additional errors the page text is again submitted back to the site. The book then progresses through two formatting rounds using the same web interface.

Once all pages for a particular book have been processed, a post-processor joins the pieces, properly formats them into a Project Gutenberg e-book and submits it to the Project Gutenberg archive.
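
The workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Distributed Proofreading's actual software: the round labels, the Page class and the submit and post_process functions are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field

# Illustrative labels for the four passes described above (two proofreading
# rounds, then two formatting rounds). These are not the site's real names.
ROUNDS = ["proof-1", "proof-2", "format-1", "format-2"]


@dataclass
class Page:
    number: int
    text: str  # starts out as the raw OCR best guess
    completed: list = field(default_factory=list)  # rounds finished so far

    def next_round(self):
        """Return the next round this page needs, or None if it is finished."""
        if len(self.completed) < len(ROUNDS):
            return ROUNDS[len(self.completed)]
        return None


def submit(page, corrected_text):
    """Record one volunteer's pass: store their text and advance one round."""
    round_name = page.next_round()
    if round_name is None:
        raise ValueError(f"page {page.number} has already finished every round")
    page.text = corrected_text
    page.completed.append(round_name)


def post_process(pages):
    """Join finished pages in page order; refuse if any page is unfinished."""
    unfinished = [p.number for p in pages if p.next_round() is not None]
    if unfinished:
        raise ValueError(f"pages still in progress: {unfinished}")
    return "\n".join(p.text for p in sorted(pages, key=lambda p: p.number))
```

The point of the sketch is that each call to submit touches exactly one page, so any volunteer can pick up any page that still has a round left, and nobody has to take on a whole book.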

It's going well: they just celebrated their 8,000th digitized book, W.E.B. Du Bois's The Suppression of the African Slave-Trade to the United States.

Distributed projects like this bear more than a passing resemblance to massively multiplayer online role-playing games (MMORPGs) like EverQuest, World of Warcraft and Second Life. There's even a page on Wikipedia's ever-fascinating Meta-Wikipedia site that lists reasons why Wikipedia might be considered an MMORPG:
  • Thousands of articles (magical items)...
  • Editors/players who seem addicted, unable to leave the site, who spend all their waking hours on the site, and whose "real" lives and work are suffering as a result...
  • The accumulation of experience points (edits) leading to higher levels (higher rankings)...
  • People with similar ideas and goals form guilds...
  • Player-killing, which is strongly discouraged, but nevertheless happens, taking several forms, among them edit warring, banning and blocking; player-killers may be taken before a magisterial court, the Arbitration Committee...
  • Trolls - controversial or unpopular people whose goal is to fight the dominant groupthink. Seen as enemies or bosses to fight.
Blogger Alice on Thu Mar 30, 11:05:00 AM:
This is a fascinating post, Ben! Have you read about the Britannica-Nature-Wikipedia bickering re: error identification and correction? It's very interesting.
Blogger Ben on Thu Mar 30, 12:27:00 PM:
Wow, what's it called when one blogging partner leaves the other a complimentary comment? Blog-cest? If we coin a catchy word for this, we can be #1 on for a few hours!

Thanks, Alice!