Microsoft shepherding spell-check


In the past two years, few names have become as recognizable as Barack Obama’s, a rise that continues tonight as he accepts the Democratic Party’s presidential nomination in Denver.

But until spring 2007, “Obama” was unknown to Microsoft’s spell-checker. The suggested correction was “Osama,” a name that differs by a single letter but carries lots of baggage.

Even though Osama is a common name around the world, it is inextricably linked to Osama bin Laden. And the association in the spell-checker was fodder for the ongoing, false rumors surrounding the candidate’s religion, which is Christian.

This example highlights the challenge Mike Calcagno and his team face in keeping up with the evolution of language.

The Microsoft Natural Language Group’s aim is to build tools that help people improve their writing and avoid embarrassing mistakes. The job is increasingly complicated as more writing is done electronically and people blindly trust the judgment of the spell-checker.

“The speller is looked at sometimes as an arbiter of language,” Calcagno said. “Once a word is in the Microsoft spell-checker, there’s a notion that the word is now official and the word is now important.”

Spell-checkers use logic to suggest corrections. In the case of Obama-Osama, the “edit distance” between the two words is just one: a single letter substituted for another. Without the political context, Osama would be a reasonable suggestion for Obama.

“There’s no amount of logic that we would ever build into the speller that would suggest that we wouldn’t do that,” Calcagno said.
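
For readers curious about the arithmetic, edit distance can be computed in a few lines. The sketch below uses the textbook Levenshtein method, which counts single-letter insertions, deletions and substitutions. It is an illustration only, not Microsoft's code; the function names and the three-word dictionary are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def suggest(word: str, dictionary: list[str], max_distance: int = 1) -> list[str]:
    """Hypothetical helper: dictionary words within max_distance of word,
    closest first. Comparison ignores case."""
    scored = [(edit_distance(word.lower(), w.lower()), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_distance]


print(edit_distance("Obama", "Osama"))                  # 1
print(suggest("Obama", ["Osama", "Alabama", "Omaha"]))  # ['Osama']
```

Nothing in the calculation knows what either word means, which is the point Calcagno makes: a distance of one looks equally harmless whatever baggage the suggestion carries.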

Microsoft added “Obama” to the spell-checkers in both Office 2003 and Office 2007 in the spring of 2007, and any word can be added to an individual’s custom dictionary. But the Obama-Osama suggestion persists as people encounter it on computers that have not been updated.

Spell-checkers have come a long way, correcting countless misspellings. But the number of errors they introduce is also substantial and well documented.

The phenomenon is known as the Cupertino effect. Ben Zimmer, executive producer of the Visual Thesaurus, has written extensively about it on the University of Pennsylvania’s online Language Log.

Writers and translators at the European Union came up with the name after they discovered that older spell-checkers did not recognize the correctly spelled word “cooperation,” without a hyphen, Zimmer said.

The suggested correction, which made it into several documents that can still be found online, was Cupertino, the town in California. That particular problem has long since been corrected, but the phenomenon persists.

“Most of the time the Cupertino effect happens because of proper names, which are really difficult for a dictionary to handle,” Zimmer said. “[Microsoft] could have made the decision early on that we’re not going to include any proper names and that might have spared them a lot of grief.”

Calcagno has other reasons why the spell-checker should not be looked to as the authority on language. Correctly spelled words are often left out on purpose.

Take “calender,” for example. According to the Merriam-Webster Online Dictionary, calender, with an -er, means “to press (as cloth, rubber, or paper) between rollers or plates in order to smooth and glaze or to thin into sheets.”

“You can find it in any dictionary,” Calcagno said.

But, the team asked itself, should “calender” be flagged with the red squiggly underline that indicates a misspelling? Yes, because letting it go through as correct “more often masks the really common spelling error that people make for calendar.”

“We basically ask that question across dozens of languages on a massive scale,” Calcagno said. “There are thousands and thousands of words that aren’t yet in our speller, which are infrequently used.”
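
The question the team asks can be reduced to a simple comparison. The sketch below is one guess at how such a rule might look, not a description of Microsoft's process. The function name is invented, and the counts are placeholder numbers chosen for illustration, not measured usage data.

```python
def should_flag(intended_uses: int, typo_uses: int) -> bool:
    """Flag a correctly spelled word if, in practice, it shows up more
    often as a misspelling of another word than as itself."""
    return typo_uses > intended_uses


# Placeholder counts (assumed, not real data): out of 1,000 appearances
# of the string "calender", how many meant the pressing process and how
# many were slips for "calendar"?
intended_uses = 20
typo_uses = 980

print(should_flag(intended_uses, typo_uses))  # True
```

Under these assumed numbers the rare word loses out: accepting “calender” would hide many slips for every legitimate use.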

So what impact has the spell-checker had on the evolution of language?

Jerrold Zar, a biologist and statistician at Northern Illinois University, wrote a poem in 1992, “Candidate for a Pullet Surprise,” to highlight how easily homonyms slip through the spell-checker’s net. He sees the proliferation of word-processing, e-mailing and spell-checking accelerating changes in language, and not always for the better.

“Electronic writing has a tendency to value speed over accuracy, consistency, and clarity,” he wrote in an e-mail.