Member |
In our local coffee house they have a trivia question every day where you can get a dime off on your coffee (woo hoo!). Shu always gets it right, which is annoying. However, he wasn't with me today. The question was, what English letter is used the least? I guessed "x," but the guy said the answer is "q." I argued with him, reminding him of the common "q" words, like "quit" or "quiet," and then asked him how often he hears "x-ray." He countered with "exit," but still. I think "x" is used less than "q," don't you? He said that "someone" counted all the letters in the dictionary, and "q" lost. Has anyone here heard of that before?
|
Member |
Z. Letter frequencies were counted in the nineteenth century so that typesetters would have suitable numbers of each. This is the origin of the filler text ETAOIN SHRDLU that used to be seen in newspapers: those are the most common in order. Lists vary slightly but Z is always at the bottom. The list I memorized many years ago is ETAON RISHDLU CMFGY PWBV KXJQZ. Because of its importance in cryptography, there's been a large amount of text analysed by computer, so the figures in this list are probably robust. That gives Z (0.1%) distinctly down from J (0.2%), Q and K (0.3%), and X (0.5%). A large corpus like that would presumably be composed largely of texts that use Z in words like 'realize'. In '-ise' varieties such as Australian and journalistic British the frequency of Z would be even lower. | |||
|
Member |
Well, I don't know about going through all the words in a dictionary, but this site shows the results of the analysis of 18584 common base words, and of 45406 common words. Interestingly, "j" comes last in the common base words, and is second to last in the common words. Build a man a fire and he's warm for a day. Set a man on fire and he's warm for the rest of his life. | |||
|
Member |
A live experiment on a mini-corpus: the first thousand words of Pride and Prejudice:
X: fixed, next, extraordinary, vexing, mixture, experience
K: acknowledged, known, Park, know, taken, know, taken, take, week, know, thinking, talk, likely, like, thinking, think, think, know, like, quickness, take, mistake, know, quick, make, knowledge, like, know, likes
Q: quickness, quick
J: just, objection, Jane
Z: Lizzy, Lizzy, Lizzy, Lizzy, Elizabeth
And to avoid conversational words, the first thousand words of The Origin of Species:
X: exposed, excess, exposed, experiments, exception, exceptions, exotic, exact, extremely, exposed, exactly
K: look, strikes, think, think, Knight, make, remarkable, weak, sickly, taken, like, kept, remarked
Q: frequent, quite, quite
J: subject, just
Z: organization
|
Member |
I ran Pride and Prejudice through a simple histogram program. Q was the least frequent with 627, followed by X at 839, J with 873, and Z with 936. Here are the results:
a: 41684  b: 9086  c: 13457  d: 22295  e: 69346  f: 11994  g: 10029
h: 34055  i: 37809  j: 873  k: 3207  l: 21583  m: 14755  n: 37670
o: 40020  p: 8225  q: 627  r: 32289  s: 33101  t: 46621  u: 14975
v: 5723  w: 12296  x: 839  y: 12697  z: 936
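For anyone who wants to try this at home, here is a minimal sketch of a letter histogram in Python. It is not the program used for the figures in this post, just one plausible way to get the same kind of count. The function names `letter_histogram` and `rarest_letters` and the sample sentence are made up for illustration, and only the plain letters a to z are counted.

```python
from collections import Counter
import string


def letter_histogram(text: str) -> Counter:
    """Count occurrences of each letter a-z, ignoring case."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)


def rarest_letters(counts: Counter, n: int = 4) -> list:
    """Return the n least frequent letters that occur at all, rarest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Hypothetical sample text; for a whole book, read the file instead.
    sample = "Lizzy was amazed by the quickness of the exchange."
    hist = letter_histogram(sample)
    for letter in string.ascii_lowercase:
        print(f"{letter}: {hist[letter]}")
    print("rarest:", rarest_letters(hist))
```

To run it on a full text, replace `sample` with the contents of a plain-text file, for example a public domain edition of the book.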
|
<wordnerd> |
jheem: I ran Pride and Prejudice through a simple histogram program. OK, I'll bite. What's a histogram? Is 'simple histogram' an oxymoron?
Member |
OK, I'll bite. What's a histogram? A-H: "A bar graph of a frequency distribution in which the widths of the bars are proportional to the classes into which the variable has been divided and the heights of the bars are proportional to the class frequencies." More information here. I usually assign my intro programming students to implement a histogram, and to tabulate and chart the frequencies of letters in different public domain books (usually from Gutenberg). The next assignment is to count the occurrences of words in a text.
|
Member |
I make it 947 Z's, in the edition on the 'Republic of Pemberley', but close enough. But Z has such an unfair advantage here. Now take out the 633 mentions of Elizabeth and the 96 of Lizzy (two Z's each) and 24 of Eliza and we're down to 98. Take out 34 mentions of Colonel Fitzwilliam, and 3 of Fitzwilliam Darcy, and we're down to 61. Of these, 11 are forms of 'teaze', so no longer current: down to only 50 present-day dictionary words containing Z in the whole book, of which by the way 19 are forms of 'amaze'.
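The subtraction above can be checked with a short Python sketch. The counts are the ones quoted in this post; the only thing added is the number of Z's per name, which the code counts for itself (Lizzy has two, so its 96 mentions remove 192 Z's). The variable and function names are made up for illustration.

```python
# Total Z's reported for the 'Republic of Pemberley' edition.
total_z = 947

# Mentions of each proper name, as quoted in the post.
name_mentions = {
    "Elizabeth": 633,
    "Lizzy": 96,
    "Eliza": 24,
    "Colonel Fitzwilliam": 34,
    "Fitzwilliam Darcy": 3,
}


def z_count(word: str) -> int:
    """Number of Z's in a word or name, ignoring case."""
    return word.lower().count("z")


# Each mention removes as many Z's as the name contains.
z_in_names = sum(n * z_count(name) for name, n in name_mentions.items())
after_names = total_z - z_in_names

# Forms of the obsolete spelling 'teaze', one Z each.
teaze_forms = 11
present_day = after_names - teaze_forms

print(after_names, present_day)
```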
|
Member |
You're right, aput. I thought of proper nouns skewing the stats after I'd posted. The interesting thing about doing this is finding out stylistic tidbits, like the author's use of the "amaze" forms. Did you filter the X, Q, and J words, too?
|
Member |
Well Jane accounts for 284 of the J's, but you expect rather a lot of J names. It's only the Z's that are totally skewed here. | |||
|
Member |
Yes, I looked through the Q words, and there's quite a few: all with one or two occurrences. I think the Zs have it.
|
Member |
Well, either way, my "x" lost. I am surprised about "j" and would just love to see all those "x" words! jheem, I am dying to know what your avatar means. | |||
|
Member |
It's my screenname, jheem, written in Devanagari, the syllabary used to write Sanskrit and Hindi. I couldn't resist having a picture of a (non-)word. I guess I should've written avatar in Devanagari since it is a Sanskrit word.
|
Member |
jheem, what a creative avatar; it is especially appropriate on a word board! Now, aput says that etaoin are the most common letters in that order, though he acknowledges some variances. Yet, on arnie's site "t" comes in 5th or 7th. Is that because the 18584 Common Base Words or the 45406 Common Words just don't have that many "t's"? There, the most common letters are "eisar" or "aeirt."
|
Member |
That other site is a list counted by different words: that is, ignoring the fact that 'the' and other common words are repeated constantly. So the frequencies are not as they appear in text, and it's a rather odd measure. Compare initial letters: the most common in text are TAOSW in that order; but in a dictionary, counting each word only once, they're... well, whichever bits of the dictionary are thickest, but with a heavy bias towards little-used words beginning with 'pre-' or 'un-'. | |||
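The difference between the two ways of counting can be sketched in Python. This is only an illustration with a made-up sentence, not the method used by either site: `letter_counts_by_token` counts every occurrence of every word, as in running text, while `letter_counts_by_type` counts each distinct word once, as in a word list or dictionary. Both names are hypothetical, and the word splitter is deliberately crude (it treats anything that is not a to z as a separator, so "don't" becomes "don" and "t").

```python
import re
from collections import Counter


def words(text: str) -> list:
    """Split text into lowercase runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def letter_counts_by_token(text: str) -> Counter:
    """Letter counts over running text: repeated words count every time."""
    return Counter("".join(words(text)))


def letter_counts_by_type(text: str) -> Counter:
    """Letter counts over distinct words: each word counts only once."""
    return Counter("".join(set(words(text))))


if __name__ == "__main__":
    sample = "the cat and the hat and the bat"
    print("by token:", letter_counts_by_token(sample).most_common(3))
    print("by type: ", letter_counts_by_type(sample).most_common(3))
```

In the sample, the three repetitions of 'the' push T and H up in the by-token count but not in the by-type count, which is the effect described above.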
|
Member |
Yes, I am beginning to see the importance of what document you look at in order to see frequency of letters. The "Pride and Prejudice" example was another good one with the "Lizzys." BTW, aput, in our wordplay thread, under "The Bluffing Game" I have nominated you to be next up. All you have to do is post a word that you think no one will know. Then people will send you private topics with fake definitions. You then post all the answers and people guess. If you fool everyone, you get 3 points. If someone picks the right answer, he gets 2 points, and people get 1 point every time someone picks their fake answers. We'd love to have you play! | |||
|