Literary Vocabulary: A Study of Ten Writers
Research into the Growth and Decay of Vocabulary in Novelists of the 19th and Early 20th Centuries
In a recent study of the changes in Henry James's style over his long
career, I show that authorship attribution tools are very effective in
grouping his novels by date of publication (see Fig. 1, below). One of
my current research projects is to investigate changes in
vocabulary in ten novelists with long careers to see if there are any
general principles that apply to all or most authors.
One simple question is what happens to the frequencies of individual
words over an author's career. It seems intuitively reasonable that
the major change should be that authors continually add to their
working vocabularies over time by learning new words, but research so
far does not suggest that authors' vocabularies become richer or
more varied over time. It seems important to examine this issue more
closely.
If James's career is divided into early, middle, and late parts (see
samples below the figures) on the basis of gaps in the publication of his
novels, there are six possible patterns of increase and decrease
in frequency for each word:
- E > M > L decrease
- E > L > M
- M > E > L
- M > L > E
- L > E > M
- L > M > E increase
With six patterns, one would expect about 16.67% of James's words to
show each of these patterns, but an examination of the 9,000 most
frequent words in 19 of his major novels reveals that many
more words than would be expected by chance decline gradually in
frequency or increase gradually in frequency. Perhaps surprisingly, the
patterns of increase and decrease are fairly consistent across the
entire word frequency spectrum, from the most common words to those
that occur only a few times in all nineteen novels combined, as Fig. 2,
below, shows. A quick check of James and other authors shows that
the patterns differ considerably from one author to another
(see Fig. 3 and Fig. 4, below). And for some authors, the
patterns are not consistent across the frequency spectrum: Fig. 5 shows
that, for Trollope, the common words act very differently from the
uncommon ones. Much more research will be needed before
any conclusions are possible.
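As a rough illustration of the tally described above, the following Python sketch assigns each word to one of the six patterns and reports the share of words showing each pattern, which can then be compared with the one-in-six (about 16.67%) expected by chance. It is only a sketch under stated assumptions, not the procedure used in the study: the function names are hypothetical, frequencies are taken as each word's share of all tokens in a period, and words whose frequencies tie in any two periods are simply left out. The study's own handling of ties and normalization is not described here.

```python
from collections import Counter

# The six possible orderings of early (E), middle (M), and late (L) frequency.
PATTERNS = ("E>M>L", "E>L>M", "M>E>L", "M>L>E", "L>E>M", "L>M>E")


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one career period."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def classify(early, middle, late):
    """Return the ordering pattern for one word, or None if any two rates tie.

    Assumption: ties are excluded rather than assigned to a pattern.
    """
    rates = {"E": early, "M": middle, "L": late}
    if len(set(rates.values())) < 3:
        return None
    ordered = sorted(rates, key=rates.get, reverse=True)
    return ">".join(ordered)


def pattern_shares(early_tokens, middle_tokens, late_tokens):
    """Share of classified words that fall into each of the six patterns."""
    periods = [
        relative_frequencies(tokens)
        for tokens in (early_tokens, middle_tokens, late_tokens)
    ]
    vocabulary = set().union(*periods)
    tally = Counter()
    for word in vocabulary:
        pattern = classify(*(period.get(word, 0.0) for period in periods))
        if pattern is not None:
            tally[pattern] += 1
    classified = sum(tally.values())
    return {p: (tally[p] / classified if classified else 0.0) for p in PATTERNS}


# Toy data, invented purely for illustration (not drawn from any novel).
early = ["a", "a", "a", "b", "c", "c"]
middle = ["a", "a", "b", "b", "c", "c"]
late = ["a", "b", "b", "b", "c", "d"]

shares = pattern_shares(early, middle, late)
for pattern in PATTERNS:
    print(f"{pattern}: {shares[pattern]:.2%} (chance expectation {1 / 6:.2%})")
```

In the toy data, "a" declines steadily and "b" rises steadily, while "c" and "d" are dropped because of ties, so the two gradual patterns each account for half of the classified words. With real texts, the same tally could be run separately on common and uncommon words to compare behaviour across the frequency spectrum.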
I would welcome collaboration from undergraduate or graduate
students who are interested in 19th-century literature, style,
statistics, or the study of memory. Some of the specific kinds of
research that might be done are as follows:
- A student could help to determine which authors' texts are available
electronically and are appropriate for the project (not revised heavily
later, not radically altered by editors, and so forth). In this case the
research would be specifically literary, and the student could propose
authors in whom they are already interested.
- A student could help to prepare the texts for analysis: here the task
would be to clean up and edit the texts to make them consistent and as
accurate as possible; this sounds less interesting than it turns out to
be, because it raises important questions about publishing practices,
authorial quirks, editing, the origins of e-texts, linguistic change,
and so forth. In some cases, especially where early or late texts by an
author are not available, it might be desirable to create a few
additional electronic texts during this process, using optical character
recognition.
- A student with a background in statistics could work on developing
and revising appropriate statistical techniques for analyzing the data:
this is an area in which quite a lot of statistical work is done, but
often not at a very sophisticated level.
- More speculatively, a student with an interest in visualization or
graphing might do research on new ways of presenting the large amounts
of data that will result from the project. How do we present evidence
based on, say, the 9,000 most frequent words of 30 novels so that it is
both clear and effective?
I am in the process of applying for funding to pay
student collaborators, but do not currently have any funding. Some
kinds of projects might result in an independent study. You may
contact me at david.hoover@nyu.edu.
Figure 1--Principal Components Analysis of 19 Henry James Novels

Figure 2--Patterns of Common and Uncommon Words in Henry James

Figure 3--Patterns of Frequencies in Eight Authors

Figure 4--Patterns of Frequencies in Six Authors

Figure 5--Patterns of Common and Uncommon Words in Anthony Trollope

Early James
Newman looked at her a moment; he saw that she was
pretty, but he was not in the least dazzled. He remembered poor M.
Nioche's solicitude for her ‘innocence,’ and he laughed out
again as his eyes met hers. Her face was the oddest mixture of youth
and maturity, and beneath her candid brow her searching little smile
seemed to contain a world of ambiguous intentions. She was pretty
enough, certainly, to make her father nervous; but, as regards her
innocence, Newman felt ready on the spot to affirm that she had never
parted with it. She had simply never had any; she had been looking at
the world since she was ten years old, and he would have been a wise
man who could tell her any secrets. The American (1877; 1879 ed.)
Middle James
The ladies, who were much the more numerous, wore
their bonnets, like Miss Chancellor; the men were in the garb of toil,
many of them in weary-looking overcoats. Two or three had retained
their overshoes, and as you approached them the odour of the
india-rubber was perceptible. It was not, however, that Miss Birdseye
ever noticed anything of that sort; she neither knew what she smelled
nor tasted what she ate. Most of her friends had an anxious, haggard
look, though there were sundry exceptions -- half a dozen placid,
florid faces. Basil Ransom wondered who they all were; he had a general
idea they were mediums, communists, vegetarians. The Bostonians (1886)
Late James
There were people on the ship with whom he had easily
consorted -- so far as ease could up to now be imputed to him -- and
who for the most part plunged straight into the current that set from
the landing-stage to London; there were others who had invited him to a
tryst at the inn and had even invoked his aid for a "look round" at the
beauties of Liverpool; but he had stolen away from every one alike, had
kept no appointment and renewed no acquaintance, had been indifferently
aware of the number of persons who esteemed themselves fortunate in
being, unlike himself, "met," and had even independently, unsociably,
alone, without encounter or relapse and by mere quiet evasion, given
his afternoon and evening to the immediate and the sensible. The Ambassadors (1903; NY ed., 1909)