Wordcraft Community Home Page
A brief study of letter frequency in English

This topic can be found at:
https://wordcraft.infopop.cc/eve/forums/a/tpc/f/932607094/m/3991045064

May 08, 2007, 16:31
BobKberg
OK, American English, but that's what I'm used to. This is a little project I did just for fun, based on a conversation with my dad (he of the dozens of dictionaries). I've done many other silly but amusing projects, but this is one of the few that I bothered to document. I thought some of you might find it interesting or amusing. I also thought it should put to rest any unwritten speculation that you've heard the last of me. However, my contributions here will occur, for the most part, when business is slow. Enjoy, Bob

Letter Frequency

The "dictionary" (actually, just a long word list) from which these counts are drawn (Red Hat Linux) contains 93,397 words. All proper names (such as November) have been converted to lower case to simplify the counting process. There are (to the best of my knowledge) no "commercial" names in this list. The count tables below read left to right, not downward.
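
For anyone who wants to try this at home, here is a minimal sketch of how counts like these might be produced. It is not the method used for the figures in this post, which is not described; it assumes Python, and the names position_counts and load_words, along with the path /usr/share/dict/words (where many Linux systems keep their word list), are illustrative assumptions.

```python
from collections import Counter


def position_counts(words, position):
    """Count which letter appears at a 0-based position across a word list.

    Words are lowercased first, so "November" counts toward "n".
    Words too short to have a letter at that position are skipped.
    """
    counts = Counter()
    for word in words:
        word = word.strip().lower()
        if len(word) > position and word[position].isalpha():
            counts[word[position]] += 1
    return counts


def load_words(path="/usr/share/dict/words"):
    """Read one word per line; the default path is an assumption."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.strip() for line in handle if line.strip()]


# A tiny made-up list stands in for the real 93,397-word file.
sample = ["November", "apple", "stone", "sand", "cat"]
first_letters = position_counts(sample, 0)

# Print in descending frequency, in the same "count letter" style as the tables.
for letter, count in first_letters.most_common():
    print(count, letter)
```

To run it against a real list, replace sample with load_words() and call position_counts with 0, 1, and 2 for first, second, and third letters.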


Descending Frequency Count for first letters of words in the English Language:

s, c, p, a, d, m, b, r, t, i, e, f, u, h, g, l, o, w, v, n, k, j, q, z, y, x.

11465 s; 7490 c; 6624 p; 6103 a; 5786 d;
5090 m; 5085 b; 5000 r; 4756 t; 4324 i;
4229 e; 3945 f; 3757 u; 3159 h; 2954 g;
2860 l; 2824 o; 1868 w; 1780 v; 1751 n;
791 k; 676 j; 594 q; 220 z; 209 y;
57 x.


Descending Frequency Count for second letters of words in the English Language:

e, a, o, i, n, r, u, l, h, t, p, c, m, y, x, v, s, b, d, w, q, f, g, k, z, j.

14479 e; 12364 a; 11282 o; 8827 i; 8533 n;
7354 r; 7313 u; 4206 l; 3269 h; 2251 t;
2169 p; 1817 c; 1735 m; 1414 y; 1118 x;
954 v; 924 s; 775 b; 738 d; 543 w;
362 q; 333 f; 312 g; 229 k; 43 z;
27 j.


Descending Frequency Count for third letters of words in the English Language:

r, a, n, t, s, e, l, i, o, c, p, m, u, d, b, g, f, v, h, w, y, x, k, z, j, q.


8984 r; 8081 a; 7170 n; 6794 t; 6551 s;
6179 e; 5755 l; 5719 i; 5509 o; 5164 c;
4175 p; 3862 m; 3722 u; 3136 d; 2524 b;
2488 g; 1654 f; 1516 v; 985 h; 860 w;
807 y; 443 x; 373 k; 320 z; 286 j;
264 q.


Note the wide discrepancy from “etaoin” (the conventional ordering of the most frequent letters in running English text) in first letters, the moderate similarity to “etaoin” in second letters, and the “in between” result for third letters. Also note the relative clustering of frequencies in third letters, versus the sharp drop-offs in frequency (at different points) for first and second letters. Some day this might make an interesting graph. If I do it, I'll post it (or a URL to it).
May 08, 2007, 17:34
Seanahan
I would think a better judge would be a corpus of English text. The letter "e" may not be the most frequent among the words in a list, but it is the most commonly used letter in running text.
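
A rough sketch of the difference, assuming Python: counting letters in running text lets a repeated word contribute every time it appears, while a word list counts each distinct word once. The sample sentence and the name letter_counts are made up for illustration.

```python
from collections import Counter


def letter_counts(text):
    """Count the letters a-z in a string, ignoring case and everything else."""
    return Counter(ch for ch in text.lower() if "a" <= ch <= "z")


# A made-up sentence standing in for a real corpus.
text = "the cat sat on the mat and the dog sat on the log"

# Running text: every occurrence of every word is counted.
running = letter_counts(text)

# Word-list style: each distinct word is counted only once.
distinct = letter_counts(" ".join(set(text.split())))

print("e in running text:", running["e"])
print("e in distinct words:", distinct["e"])
```

In this sample the only word containing "e" is "the", so its repetition is the whole difference between the two counts.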
May 09, 2007, 08:03
zmježd
This sort of thing is fun. For a data structures class I used to teach, I'd set a homework assignment to build a concordance that measured the frequency of words. No morphological analysis was possible in so short a project. The students had to choose a longish text from Project Gutenberg and compare results. Most chose to analyze English-language texts, even though over half of each class tended to be non-native speakers. I kept waiting for one of the Chinese students to run their program on Chinese texts. Another fun thing would be to measure phoneme frequency, using a dictionary that has phonological representations (i.e., pronunciation guides), to determine what a typical word (or better yet, syllable) looks like in English.
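
A minimal sketch of what such a word-frequency count might look like, assuming Python. The name word_frequencies and the sample sentence are hypothetical, and this is not the actual class assignment. As in the homework, there is no morphological analysis, so "cat" and "cats" are tallied as different words.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Tally each distinct word form in a text, ignoring case.

    A word is a run of letters, optionally followed by an apostrophe
    and more letters (so "don't" stays in one piece).
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


# A made-up sentence standing in for a longish Gutenberg text.
sample = "The cat saw the cats, and the cats saw the cat."
frequencies = word_frequencies(sample)

# List the most frequent word forms first.
for word, count in frequencies.most_common(3):
    print(count, word)
```

For a real text, read the file into a string and pass it to word_frequencies in place of sample.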


Ceci n'est pas un seing.
May 09, 2007, 21:29
BobKberg
Ooh! zmježd! Very cool idea!!!

I'll have to play with that idea a little bit, although it occurs to me that the period in which the original text was written would doubtless play a role in the results.

Bob
May 09, 2007, 23:34
BobKberg
I suspect, Seanahan, that you are referring to the repetition of articles, prepositions and such in regular usage.

If so, I don't doubt you for a moment.

I am simply having a little fun with the language, and the odd/interesting patterns one can encounter.

Bob