Literary Vocabulary: A Study of Ten Writers

Research into the Growth and Decay of Vocabulary in Novelists of the 19th and Early 20th Centuries


In a recent study of the changes in Henry James's style over his long career, I show that authorship attribution tools are very effective in grouping his novels by date of publication (see Fig. 1, below). One of my current research projects is to investigate changes in vocabulary in ten novelists with long careers to see if there are any general principles that apply to all or most authors.

One simple question is what happens to the frequencies of individual words over an author's career. It seems intuitively reasonable that the major change should be that authors continually add to their working vocabularies over time by learning new words, but research so far does not suggest that authors' vocabularies become richer or more varied over time. It seems important to examine this issue more closely.

If James's career is divided into early, middle, and late parts (see samples below the figures) on the basis of gaps in the publication of his novels, there are six possible patterns of increase and decrease in frequency for each word, one for each possible ordering of its early, middle, and late frequencies.
With six patterns, one would expect about 16.67% of James's words to show each of these patterns, but an examination of the 9,000 most frequent words in 19 of his major novels reveals that many more words than would be expected by chance decline gradually in frequency or increase gradually in frequency. Perhaps surprisingly, the patterns of increase and decrease are fairly consistent across the entire word frequency spectrum, from the most common words to those that occur only a few times in all 19 novels combined, as Fig. 2, below, shows. A quick check of James and other authors shows that the patterns are fairly diverse (see Fig. 3 and Fig. 4, below). And for some authors, the patterns are not consistent across the frequency spectrum: Fig. 5 shows that, for Trollope, the common words act very differently from the uncommon ones. Much more research will be needed before any conclusions are possible.
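The classification described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the function names (`relative_frequencies`, `classify_pattern`, `pattern_shares`), the labels such as `E<M<L`, the use of relative frequencies per period, and the decision to set aside words with tied frequencies are all assumptions made for the example. Under chance, each of the six labels would account for about one sixth of the classified words.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each word's share of all tokens in one period."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def classify_pattern(early, middle, late):
    """Label a word by the rank order of its three period frequencies.

    The label lists the periods from lowest to highest frequency, so
    'E<M<L' is a gradual increase and 'L<M<E' a gradual decline.
    Ties return None (an assumption; the source does not say how
    ties are handled).
    """
    freqs = {"E": early, "M": middle, "L": late}
    if len(set(freqs.values())) < 3:
        return None
    return "<".join(sorted(freqs, key=freqs.get))


def pattern_shares(early_tokens, middle_tokens, late_tokens, top_n=9000):
    """Share of the top_n most frequent words that shows each pattern."""
    token_lists = (early_tokens, middle_tokens, late_tokens)
    periods = [relative_frequencies(t) for t in token_lists]
    combined = Counter()
    for tokens in token_lists:
        combined.update(tokens)
    labels = Counter()
    for word, _ in combined.most_common(top_n):
        label = classify_pattern(*(p.get(word, 0.0) for p in periods))
        if label is not None:
            labels[label] += 1
    classified = sum(labels.values())
    return {label: n / classified for label, n in labels.items()}


# Toy data standing in for the early, middle, and late novels.
early = "a a a b c".split()
middle = "a a b b c".split()
late = "a b b b c c".split()
shares = pattern_shares(early, middle, late)
```

In the toy data, "a" declines steadily, "b" rises steadily, and "c" is tied between the early and middle periods and so is left unclassified. Applying the same functions to successive 1,000-word slices of the frequency list would give a rough equivalent of the strata shown in the figures.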

I would welcome collaboration from undergraduate or graduate students who are interested in 19th-century literature, style, statistics, and the study of memory. Some of the specific kinds of research that might be done are as follows:

  1. A student could help to determine which authors' texts are available electronically and are appropriate for the project (not revised heavily later, not radically altered by editors, and so forth). In this case the research would be specifically literary, and the student could propose authors in whom they are already interested.

  2. A student could help to prepare the texts for analysis: here the task would be to clean up and edit the texts to make them consistent and as accurate as possible; this sounds less interesting than it turns out to be, because it raises important questions about publishing practices, authorial quirks, editing, the origins of e-texts, linguistic change, and so forth. In some cases, especially where early or late texts by an author are not available, it might be desirable to create a few additional electronic texts during this process, using optical character recognition. 

  3. A student with a background in statistics could work on developing and revising appropriate statistical techniques for analyzing the data: this is an area in which quite a lot of statistical work is done, but often not at a very sophisticated level. 

  4. More speculatively, a student with an interest in visualization or graphing might do research on new ways of presenting the large amounts of data that will result from the project. How do we present evidence based on, say, the 9,000 most frequent words of 30 novels so that it is both clear and effective?

I am in the process of applying for funding to pay student collaborators, but do not currently have any funding. Some kinds of projects might result in an independent study. You may contact me at david.hoover@nyu.edu.

Figure 1--Principal Components Analysis of 19 Henry James Novels


19 James Novels: PCA

Figure 2--Patterns of Common and Uncommon Words in Henry James

James Patterns 1000-Word Strata

Figure 3--Patterns of Frequencies in Eight Authors

Frequency Patterns in Eight Authors

Figure 4--Patterns of Frequencies in Six Authors

James and Five Others

Figure 5--Patterns of Common and Uncommon Words in Anthony Trollope

Trollope Frequency Patterns--1000-word Strata

Early James

Newman looked at her a moment; he saw that she was pretty, but he was not in the least dazzled. He remembered poor M. Nioche's solicitude for her ‘innocence,’ and he laughed out again as his eyes met hers. Her face was the oddest mixture of youth and maturity, and beneath her candid brow her searching little smile seemed to contain a world of ambiguous intentions. She was pretty enough, certainly, to make her father nervous; but, as regards her innocence, Newman felt ready on the spot to affirm that she had never parted with it. She had simply never had any; she had been looking at the world since she was ten years old, and he would have been a wise man who could tell her any secrets.  The American (1877; 1879 ed.)

Middle James

The ladies, who were much the more numerous, wore their bonnets, like Miss Chancellor; the men were in the garb of toil, many of them in weary-looking overcoats. Two or three had retained their overshoes, and as you approached them the odour of the india-rubber was perceptible. It was not, however, that Miss Birdseye ever noticed anything of that sort; she neither knew what she smelled nor tasted what she ate. Most of her friends had an anxious, haggard look, though there were sundry exceptions -- half a dozen placid, florid faces. Basil Ransom wondered who they all were; he had a general idea they were mediums, communists, vegetarians.  The Bostonians (1886)

Late James

There were people on the ship with whom he had easily consorted -- so far as ease could up to now be imputed to him -- and who for the most part plunged straight into the current that set from the landing-stage to London; there were others who had invited him to a tryst at the inn and had even invoked his aid for a "look round" at the beauties of Liverpool; but he had stolen away from every one alike, had kept no appointment and renewed no acquaintance, had been indifferently aware of the number of persons who esteemed themselves fortunate in being, unlike himself, "met," and had even independently, unsociably, alone, without encounter or relapse and by mere quiet evasion, given his afternoon and evening to the immediate and the sensible. The Ambassadors (1903; NY ed., 1909)