
Tuesday, May 02, 2006

How many English words are there?

The Oxford English Dictionary's website discusses the questions of how many words there are in the English language, and whether English has the largest vocabulary of any language in the world:
The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries. Over half of these words are nouns, about a quarter adjectives, and about a seventh verbs; the rest is made up of interjections, conjunctions, prepositions, suffixes, etc.

That's a lot, and the OED agrees with the popular view that the large number is due to the promiscuous history of the formation of English, which involved England being conquered by so many different invaders, and America amalgamating so many different immigrant groups.

As for the exact number of words, that's impossible to say:

What about medical and scientific terms? Latin words used in law, French words used in cooking, German words used in academic writing, Japanese words used in martial arts? Do you count Scots dialect? Youth slang? Computing jargon?
Jesse Sheidlower in Slate recently complained about the slimy media manipulators who call themselves "Global Language Monitor", who slapped together a list of English words that just happened to stop 1% short of a million. Their website, of course, displays a counter, which increases every time the company takes another step towards the millionth word and tons of free press.

Sheidlower calls assembling a comprehensive count of English words a fool's errand:

What about Frizzie, "student of Ms. Frizzle" or busigator, "the Magic School Bus transformed into an alligator," in the books I'm reading to my daughter? What about Giant, "a player on the N.Y. Giants football team"? The most comprehensive abbreviations-dictionaries include about 500,000 entries, most of which wouldn't be found in standard dictionaries. The American Chemical Society has a registry of over 84 million named chemical substances, and there are about a million named species of insects alone; surely these must count as words?

What about obvious forms? Dictionaries include great-grandfather but not great-great-great-great-great-great-grandfather, which is real enough to get over 3,500 Google hits. Only the most basic numbers are typically included; Merriam-Webster, for example, includes twenty-one and twenty-two, but not twenty-three or thirty-one. In fact, if you were to count every number between 0 and 999,999 as a word, you'd have a cool million right there—and still have the rest of the English language to account for.
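Sheidlower's "cool million" is easy to check mechanically. Here is a minimal sketch that spells out every number from 0 through 999,999 and confirms that each spelling is distinct — the spelling style (hyphenated tens, no "and") is my own illustrative assumption, not any dictionary's convention:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    """Spell out 0 <= n <= 999,999 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + spell(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return spell(thousands) + " thousand" + (" " + spell(rest) if rest else "")

# Every number below one million gets its own distinct name.
names = {spell(n) for n in range(1_000_000)}
print(len(names))  # → 1000000
```

Whether each of those million strings counts as one "word" is, of course, exactly the definitional quicksand the column is pointing at.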

If our requirements for including words in the count are liberal, the OED further points out, we cannot justify ignoring "'agglutinative' languages such as Finnish, in which words can be stuck together in long strings of indefinite length, and which therefore have an almost infinite number of 'words'."


Anonymous Anonymous on Wed May 03, 03:39:00 PM:
Re: "Jesse Sheidlower in Slate recently complained about the slimy media manipulators who call themselves "Global Language Monitor", who slapped together a list of English words that just happened to stop 1% short of a million. Their website, of course, displays a counter, which increases every time the company takes another step towards the millionth word and tons of free press".

We at the Global Language Monitor began our count three years ago at 823,164, after a rigorous and continuing analysis that is described on our site. The fact that NPR and the New York Times picked it up at 986,120 does not render us 'shameless media manipulators'.

Reputable media, such as the New York Times, might disagree with Mr. Sheidlower's assessment of our efforts, but then they actually talked to us, checked our work, and worked with us over extended periods of time before publishing their articles.

In fact, we worked with the Times for more than two weeks on that article (on real-estate jargon, Jan. 29, 2006).

We, of course, would have preferred to receive a call from either Mr. Sheidlower or the Slate folks before they released his article and interview, which contained any number of errors. Don't the rules of journalism require as much?

Many of these falsehoods could have been corrected simply by reading the material on the site thoroughly, but we understand that having the right information would have undermined his central premise.

Why does the fact that it is difficult to estimate the number of words in the language mean that no one should even attempt to do so? And why the invective? Where does the anger come from?

Scientists routinely estimate the number of galaxies, stars, and even atomic particles in the universe (~10^72), the weight of the earth, the number of habitable planets in the galaxy, etc.

Do we think all these attempts are also "fool's errand(s)" in Mr. Sheidlower's words?

As with every quantitative analysis, we created and tested rules and guidelines. GLM then assigned a number to the rate of creation of new words and the adoption and absorption of foreign vocabulary into the language. The result, though an estimate, has been found to be quite useful as a starting point for discussion by lay persons, students, and scholars the world over.
Blogger Ben on Fri May 05, 10:20:00 AM:
Sorry to have called you slimy, but there is a strong whiff of snake oil in the GLM's claim to sophistication:

"GLM then created a proprietary algorithm, the Predictive Quantities Indicator (PQI) that attempts to measure the language as currently found in print (including technical and scientific journals), the electronic media (transcripts from radio and television), on the Internet and, increasingly, in web logs (blogs).

The Global Language Monitor's proprietary algorithm, the Predictive Quantities Indicator tracks the frequency of words and phrases in the global print and electronic media, on the Internet, throughout the Blogosphere, as well as accessing proprietary databases (Factiva, Lexis-Nexis, etc.)."

For reasons I think Sheidlower made very clear, no algorithm, no matter how precious, can do a good job of even vaguely estimating the number of words in English.