There’s been quite a bit of discussion of Matthew Jockers’ “Syuzhet” package since it was first released in February of this year. The discussion that I have seen has been entirely on scholarly blogs; I haven’t been to any DH conferences where this was discussed, so I may have missed some threads. The blog posts I have read about this, however, are remarkably smart, thoughtful, and respectful. When we talked about Syuzhet in our DH class this past week, one of my students noted that if this same back-and-forth had taken place in scholarly journals, it would have taken years for the conversation to play out. So: score one for scholarly blog conversation.
My goal here is *not* to add anything substantive or new to the discussion (besides a few graphs from six George Eliot novels below), but rather to describe my own learning curve and give others (including students) a few tips on how to get set up so they can play around further. In short, "Syuzhet for Dummies" -- where by "dummies" I mean "smart readers of literature who have simply never had any reason in the past to mess around with code."
I’ll start with a brief summary of the scholarly debate about Jockers’ work and method as I understand it, and then give readers a step-by-step account of how to use the R programming language to apply his approach to other texts.
* * *
The Debate About Syuzhet
(People who are already familiar with the Syuzhet debate might want to scroll down to "How To Try It Yourself")
Start by reading Jockers’ initial post on Syuzhet here:
And the follow-up here:
A basic point or two that it took me at least a little while to comprehend. First, Jockers is not inventing the idea of sentiment analysis out of the blue. He is borrowing the algorithms for analyzing sentences from four separate projects that were developed to do exactly that (some of them were, in fact, originally intended by the academics who made them to scrape Twitter in order to ascertain consumer responses to commercial products in the marketplace). Algorithms like the Bing algorithm (named after CS professor Bing Liu, not Microsoft's "Bing") are dictionary-based: they rely on a lexicon of words that have been assigned sentiment values (presumably a word like “adore” would have a positive sentiment value, while a word like “detest” would have a negative one). The “Stanford” natural language processing method is more complex; its developers use computational linguistics to parse whole sentences and score their sentiment. This ought to be a more sophisticated and accurate way of reading the sentiment of individual sentences, but in a subsequent test Jockers found that the Stanford method actually corresponds less well than dictionary-based methods like Bing to his own, hand-tagged reading of a large but limited set of trial sentences (the entire text of Joyce's Portrait of the Artist as a Young Man!). In most of his blog posts on sentiment analysis, Jockers bases his examples on the Bing algorithm.
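To make the dictionary-based idea concrete, here is a tiny sketch of what the package's get_sentiment function does with individual sentences (this assumes you have already installed and loaded syuzhet, as described in the how-to below; the exact numbers may differ depending on your version of the package):
library(syuzhet)
# each sentence is scored from the lexicon values of its words
get_sentiment("I adore this garden.", method = "bing")   # should come out positive
get_sentiment("I detest this weather.", method = "bing") # should come out negative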
Next, read Annie Swafford’s first critique here (there are several). There are two
prongs to the critique: one is mathematical (do these sentiment-processing algorithms read the emotional valence of individual sentences correctly?), and the other concerns the method by which Jockers creates his rather astonishing visualizations of novels.
Jockers responds to the mathematical questions Swafford raises by arguing that while dictionary-based algorithms do indeed have the potential to misread individual sentences, in most traditional literary works an ambiguous sentence will be surrounded by contextual sentences that clarify the mood. On the whole, Jockers suggests, the algorithms do a pretty good job of finding the average sentiment of larger chunks of text, even if they are prone to misread very small samples.
On the visualization question, the most damning point Swafford makes is that Jockers’ beautiful, hyper-simplified sentiment graphs are based on a method (Fourier transformation + a low pass filter) that is likely to introduce false artifacts into the visualization (she calls them "ringing artifacts"). So the problem isn't that these diagrams are too clean and simplified to be useful; the problem is, in fact, that they might not be true to the text.
Jockers initially pushed back against Swafford (and other
critics who made arguments similar to hers). But in April, in a blog post
called “Requiem for a Low Pass Filter,” he changed his mind about the Fourier
transformation he had been using and advocating. In effect, he acknowledged that the beautiful “foundation shapes” he had been hoping his “Syuzhet” package would allow him (and others) to create give us images that are actually too distorted to trust.
All we have is an R package, syuzhet, which does something I would call exploratory data analysis. And it’s hard to evaluate exploratory data analysis in the absence of a specific argument.
For instance, does syuzhet smooth plot arcs appropriately? I don’t know. Without a specific thesis we’re trying to test, how would we decide what scale of variation matters? In some novels it might be a scene-to-scene rhythm; in others it might be a long arc. Until I know what scale of variation matters for a particular question, I have no way of knowing what kind of smoothing is “too much” or “too little.”
How To Try It Yourself. For non-coders (really, if I could do it, you can do it too)
Before we start, keep in mind that what we are doing here is
already considered somewhat “obsolete” in the sense that Jockers has acknowledged
that the Fourier transformation function he built into his package might introduce more problems than it solves. As of
his last blog post on the subject, Jockers had embraced a different approach to transforming the data (for me, that’s going to be the next thing I try, though figuring out how to apply the "Loess" filter is beyond the scope of this blog post). That said, also keep
in mind that all of this is meant to be exploratory.
If we turn George Eliot’s novels into “sentiment vectors,” and then attempt to
visualize that, what do we get? I would not present these images as the “truth”
of Eliot’s plots. Rather, the question I would ask might be this: given our experience of having read Eliot’s novels closely, do these images show us anything we might not already have known?
Now for the step-by-step.
First up, you need to install R on your computer. I would install RStudio:
https://www.rstudio.com/products/rstudio/download/
R Studio has some features that make it easier to use than the plain R GUI you can get from the CRAN website. In particular, it is easier to install packages in R Studio than in the R GUI.
The first thing you need to do once you have R up and
running is install Jockers’ “Syuzhet” package.
In R Studio, all you have to do is click on "Packages" and then on "Install Packages." You'll have to set a mirror (meaning, the server from which you are actually downloading the files), and R Studio should do the rest.
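(If you'd rather type a command, installing from the R console should work just as well:)
install.packages("syuzhet")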
From there, you can basically follow the instructions
Jockers prepared at this site:
(What I am going to do below is give you a highly simplified version of what he describes at the link above. I'm going to color code the various variables we'll be defining to make it easier to see the relationship between one line and the next. Needless to say, anything I show in a color other than black you can change or personalize as needed.)
At the command line in R, type this to load the syuzhet
package into memory:
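library(syuzhet)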
Then, we need to call up a text file. For example purposes I
will use “Middlemarch.txt,” a file I created when I downloaded Middlemarch from
Project Gutenberg. (Ideally, you should open this in Wordpad or another text
editor and lop off the intro text and any concluding text added in by the
Gutenberg people.)
1. Type this in R:
middlemarch <- get_text_as_string("c:/middlemarch.txt")
This brings the text of the file into working memory in R as
an object called "middlemarch" from the root directory on my hard drive. From
this point on you don’t need to mess with traditional file names on your
computer. (FYI, it took me a while to get R to like the way I specified filenames. Note the forward slash!) If you have your text file located somewhere
else on your computer, you’ll have to figure out the exact path (which in current Windows is sometimes a little tricky). And if you are on a Mac, you’re a bit on
your own on this part, sorry.
Hint: if you’re doing a novel that has more than one word in the title, use underscore marks to separate words rather than spaces, and aim for abbreviated versions to keep things simple and reduce the likelihood of typos. Instead of tess_of_the_durbervilles, just call it tess.
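For instance, on a Mac (or Linux), something along these lines should work; the folder path here is just a made-up example, so substitute the location where you actually saved your file:
tess <- get_text_as_string("/Users/yourname/Documents/tess.txt")  # hypothetical path -- adjust to your own folders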
2. Type this in R:
s_v <- get_sentences(middlemarch)
The “get_sentences” function parses the text file into individual sentences, which needs to happen before you can create a sentiment vector (a numerical representation of the text). “s_v” holds those sentences, one per element, and is what we feed into the sentiment function in the next step.
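If you want to check that this step worked, you can inspect the result with ordinary base R commands (these aren't part of syuzhet):
length(s_v)   # how many sentences the parser found
head(s_v, 3)  # a peek at the first three sentences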
3. Type this in R:
sentiment_vector <- get_sentiment(s_v, method="bing")
This should create a new vector, corresponding to the
sentiment values. On the “Vignette” page linked to above, and in Jockers’
various blog posts, he has indicated that in fact there are three other method
choices you can try here (including the Stanford NLP method). “Bing,” again, is named after the researcher who created the lexicon, Bing Liu, not anything associated with Microsoft.
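If I understand the package documentation correctly, two of those alternative dictionary methods go by "afinn" and "nrc," so a quick comparison might look like this (the variable names are just my own; the numeric scales differ between methods, so compare shapes rather than raw numbers):
afinn_vector <- get_sentiment(s_v, method = "afinn")
nrc_vector <- get_sentiment(s_v, method = "nrc")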
4. Making a noisy graph!
Type this in R:
plot(sentiment_vector, type = "l", main = "Plot Trajectory", xlab = "Narrative Time", ylab = "Emotional Valence")
If everything worked the way it was supposed to, we should
now see a noisy graph that looks like the image below. If you are getting
errors, most likely it’s because you have a quotation mark in the wrong place. Also note that there’s no space after the command “plot” (and that it’s lowercase; R is case-sensitive).
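As an aside, one low-tech way to see the broad trend in that noisy graph, before we get to the Fourier approach in the next step, is to overlay an ordinary moving average. This is plain base R, not part of syuzhet, and the window of 101 sentences is an arbitrary choice of mine:
smoothed <- stats::filter(sentiment_vector, rep(1/101, 101), sides = 2)  # centered moving average
lines(smoothed, col = "blue", lwd = 2)  # draw it on top of the noisy plot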
5. Making a smoothed-out / low pass filter image. Again, there’s been much discussion of the Fourier transformation and the low pass filter Jockers set up in his package, which together lead to these pretty sinusoidal pictures. Bearing that in mind, we still want to see what it looks like for our particular novel, do we not?
Type this in R:
ft_values <- get_transformed_values(sentiment_vector, low_pass_size=3, x_reverse_len=100, scale_vals=TRUE, scale_range=FALSE)
That should apply the Fourier transformation to our noisy data, creating a smoother curve of sentiment values for the novel. (Incidentally, you can alter the “low_pass_size” parameter and try different values there to see what happens; see the short sketch just below.)
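Here is a minimal sketch of what that experiment might look like, simply re-running the transformation and the plot with a few different low_pass_size values (the particular values are arbitrary):
for (lps in c(2, 3, 5, 10)) {
  vals <- get_transformed_values(sentiment_vector, low_pass_size = lps,
                                 x_reverse_len = 100, scale_vals = TRUE, scale_range = FALSE)
  plot(vals, type = "h", main = paste("low_pass_size =", lps),
       xlab = "narrative time", ylab = "Emotional Valence")
}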
In order to see a pretty graph a la Jockers' “foundation
shape,” type this in R:
plot(ft_values, type = "h", main = "Eliot's Middlemarch transformed", xlab = "narrative time", ylab = "Emotional Valence", col = "red")
That should give you this:
(On the one hand, I recognize the danger of "ringing artifacts" here. On the other, I can't help but think, as I am in the middle of teaching this book this fall, that there's something of the, er, ring of truth to the shape above.)
I tried the same thing with five other Eliot novels, just to
see how they compare.
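Repeating the steps for several novels is mostly a matter of swapping in a different file, so a rough loop like the one below is one way to do it. (The file names and the c:/ location are placeholders for wherever you saved your own Gutenberg texts.)
titles <- c("adam_bede", "daniel_deronda", "mill_on_the_floss", "romola", "silas_marner")  # placeholder names
for (t in titles) {
  novel <- get_text_as_string(paste0("c:/", t, ".txt"))  # hypothetical file locations
  s_v <- get_sentences(novel)
  sentiment_vector <- get_sentiment(s_v, method = "bing")
  ft_values <- get_transformed_values(sentiment_vector, low_pass_size = 3,
                                      x_reverse_len = 100, scale_vals = TRUE, scale_range = FALSE)
  plot(ft_values, type = "h", main = paste(t, "transformed"),
       xlab = "narrative time", ylab = "Emotional Valence", col = "red")
}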
Here’s the noisy sentiment values chart for Eliot’s Adam Bede:
And here’s the Fourier Transformation / low pass filter
version:
Here’s the noisy graph for Daniel Deronda:
And here’s the transformed version:
Here’s the noisy Mill
on the Floss:
Here’s the noisy Romola:
And the transformed version of Romola:
And here’s the transformed Silas Marner:
What can we learn from these visualizations?
I’m not entirely sure yet; I also want to try the other, non-sinusoidal transformations mentioned above. I know Eliot’s works pretty well (and I’m teaching Middlemarch right now in my other
class), but I’m still a little surprised by these graphs.
Surprise #1 might be just how
different the graphs are from one another. The six novels actually follow
rather different trajectories. (Would this also be true if we tried this with
Dickens… or Henry James?)
Surprise #2 might be the
really unusual shape for Silas Marner.
Rather than “man in hole,” we have a plot that looks more like “man has some
setbacks, but then it’s all good for the second half of the book.”
Surprise #3 might be that novels like Mill on the Floss and Adam Bede do not end with sentiment values near 0. Mill on the Floss is way below 0 and Adam Bede is way above.