Syuzhet Sentiment Analysis: For Beginners & Non-Coders (Updated 2025)

This post was formerly titled "Syuzhet for Dummies." It was lightly updated in February 2025 and has been read about 5000 times as of the spring of 2025.


There was quite a bit of discussion of Matthew Jockers’ “Syuzhet” package on scholarly blogs when it was first released in February 2015. When we talked about Syuzhet in our DH class that fall, one of my students noted that if this same back-and-forth had taken place in scholarly journals, the conversation might have taken years to play out. So: score one for scholarly blog conversation.

My goal here is not to add anything substantive to the discussion, but rather to describe my own learning curve and give others, including students, a few tips on how to play around with these tools themselves. My assumption is that the reader is not experienced with R, so I will guide you through installing it and getting started. If you are already an R user, Jockers' documentation will probably be more useful to you than the blog post below.

I’ll start with a brief summary of the scholarly debate about Jockers’ work and method as I understand it, and then give readers a step-by-step account of how to use the R programming language to apply these methods to other texts.

* * *

The 2015 Debate About Syuzhet

(People who are already familiar with the Syuzhet debate might want to scroll down to "How To Try It Yourself")

Start by reading Matthew Jockers’ initial post on Syuzhet here (Feb 02 2015):


And the follow-up here (Feb 22, 2015): 


A basic point or two that might not be obvious to newbies. First, Jockers is not inventing the idea of sentiment analysis out of the blue. He is working with four separate, pre-existing methods that have been developed for assigning sentiment values to sentences.

These algorithms are currently used mostly to analyze social media posts and gauge consumer responses to commercial products in the marketplace; academics have been adapting them for very different purposes.

Sentiment analysis approaches like the Bing lexicon (named after CS professor Bing Liu, not Microsoft's "Bing") use a dictionary-based method, in which each word in a lexicon is assigned a sentiment value (presumably a word like “adore” would have a positive sentiment value, while a word like “detest” would have a negative one). The “Stanford” natural language processing method is more complex: the Stanford team uses computational linguistics to diagram whole sentences and plot their sentiment arc. This ought to be a more sophisticated and accurate way of reading the sentiment of individual sentences, but in a subsequent test, Jockers decided that the Stanford method actually corresponds less well than dictionary-based methods like Bing to his own hand-tagged reading of a large but limited set of trial sentences (the entire text of Joyce's Portrait of the Artist as a Young Man!). In most of his blog posts on sentiment analysis, Jockers bases his examples on the Bing algorithm.

[Update from 2025: at the time I first wrote this, I didn't know about the NRC emotion lexicon. In fact, with its more complex range of affects -- joy, disgust, anger, trust, sadness, surprise, anticipation, fear -- NRC might ultimately have the potential to be more useful to literary critics, though the simple +/- graphs below might not be relevant. Instead, we might need a more complex mode of visualizing an eight-point framework. In Jockers' updated documentation from 2023, scroll down to "get_nrc_sentiment" to learn how to use this.]
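
For the curious, here is a minimal sketch of what that looks like once you have parsed a text into sentences (see step 5 below); "nrc_values" is just an example variable name of my own:

nrc_values <- get_nrc_sentiment(s_v)   # one row per sentence, one column per emotion
head(nrc_values)                       # peek at the first few rows

Each row corresponds to a sentence, and the columns tally words associated with each of the eight emotions, plus overall negative and positive counts.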

Next, read Annie Swafford’s first critique here (there are several). There are two prongs of the critique, one mathematical (do these sentiment-processing algorithms read the emotional valence of individual sentences correctly?), and the other oriented to the method by which Jockers creates his rather astonishing visualizations of novels. 

Jockers responds to the mathematical questions Swafford raises by arguing that while dictionary-based algorithms do indeed have the potential to misread individual sentences, in traditional literary works ambiguous sentences will usually be surrounded by contextual sentences that clarify the mood. On average, Jockers suggests, the algorithms do a pretty good job of finding the average sentiment of larger chunks of text, even if they are prone to misread very small samples.

On the visualization question, the most damning point Swafford makes is that Jockers’ clear and striking sentiment graphs are based on a method (Fourier transformation + a low pass filter) that is likely to introduce false artifacts upon visualization (she calls them "ringing artifacts"). So the problem isn't that these diagrams are too clean and simplified to be useful; the problem is that they might not be true to the text.

Jockers initially pushed back against Swafford (and other critics who made arguments similar to hers). But in April 2015, in a blog post called “Requiem for a Low Pass Filter,” he changed his mind about the Fourier transformation he had been using and advocating. In effect, he acknowledged that the beautiful “foundation shapes” he had hoped his “Syuzhet” package would create (and allow others to create) give us images that are too distorted to trust.

I'm not going to take a strong position on the controversy around Syuzhet here. Ted Underwood made what I thought was a useful point about the issue in his blog post shortly after the controversy began:

All we have is an R package, syuzhet, which does something I would call exploratory data analysis. And it’s hard to evaluate exploratory data analysis in the absence of a specific argument.

For instance, does syuzhet smooth plot arcs appropriately? I don’t know. Without a specific thesis we’re trying to test, how would we decide what scale of variation matters? In some novels it might be a scene-to-scene rhythm; in others it might be a long arc. Until I know what scale of variation matters for a particular question, I have no way of knowing what kind of smoothing is “too much” or “too little.”

The key phrase for me is "exploratory." These methods lead to visualizations that might or might not be interesting; they might or might not tell us something new about novels we (presumably) have already read. I personally am not interested in using these techniques to make large-scale generalizations about the "basic shapes of stories," nor do I think the graphs give us access to fundamental truths about literary texts that supersede the understanding derived from actually reading them. At best, these methods piggyback on our existing close reading habits in a kind of hybrid formation.

I take all of this with a grain of salt; it still seems worth exploring. Perhaps we'll get a clearer picture of what we can do with sentiment analysis in literary texts as we continue to try different things. 



How To Try It Yourself (For Non-Coders)


Keep in mind that what we're doing below is meant to be exploratory. If we turn George Eliot’s novels into “sentiment vectors,” and then attempt to visualize them, what do we get? I would not present these images as the “truth” of Eliot’s plots. Rather, the question I would ask might be this: given our experience having read Eliot’s novels closely, do these images show us anything we might not already have known?

Now for the step-by-step.

1. First up, you need to install R on your computer.


First, install just plain "R" via a download from here:
https://cran.r-project.org/

Then, I would install RStudio, a graphical user interface that runs on top of R and is a little easier for non-coders to use.

https://posit.co/download/rstudio-desktop/

RStudio has some features that make it easier to use than other versions of R (including the R GUI version you can get from the CRAN website). In particular, it is easier to install packages in RStudio than in R GUI.

2. The first thing you need to do once you have R up and running is install Jockers’ “Syuzhet” package.

In RStudio, all you have to do is click on "Packages" and then on "Install Packages." You'll have to set a mirror (meaning, the server from which you are actually downloading the files), and RStudio should do the rest.
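
If you prefer typing to clicking, you can accomplish the same thing with a single command at the R console (this downloads from CRAN, and you only need to do it once):

install.packages("syuzhet")   # download and install the package from CRAN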

From there, you can basically follow the instructions Jockers prepared at this site:


(What I am going to do below is give a highly simplified version of what he describes at the link above. The variable names we'll be defining along the way are arbitrary placeholders; you can change or personalize them as needed.)

3. At the command line in R, type this to load the syuzhet package into memory:

library(syuzhet)

(Note: there is no space between "library" and the parenthesis above. Also note: nothing visible will happen; the package loads silently.)

Then, we need to load a text file or paste a text into memory.

If you have something short, you can paste it in directly as a quoted string, something like this (the variable name and the sentences here are just placeholder examples of my own):
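
my_text <- "I adore long walks. I detest rainy days."   # any short passage works; then skip ahead to step 5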


If you want to work with something longer, you'll want to load a text file. For example purposes, I will use “Middlemarch.txt,” a file I created when I downloaded Middlemarch from Project Gutenberg. (Ideally, you should open this in a text editor and lop off the intro text and any concluding text added by the Gutenberg people.)

4. Type this in R:

middlemarch <- get_text_as_string("c:/middlemarch.txt")

(Note for non-coders: the file path part of this can be a little tricky. On Windows, you can right-click on a text file in any folder and click "Copy as path," or click "Properties" and look for the full file path. Note that you may need to change the direction of the slashes from backslashes to forward slashes. On a Mac, press Command-I to get file path info for a given file.)

This brings the text of the file into working memory in R as an object called "middlemarch," derived from a text file called middlemarch.txt in the root directory of my hard drive. Happily, from this point on you don’t need to mess with file names or file paths on your computer.

(Hint: It took me a while to get R to like the way I specified filenames. Note again the forward slash!) 
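
For instance, a path copied on Windows might look like C:\Users\yourname\Documents\middlemarch.txt (a made-up example). R treats a single backslash inside a quoted string as an escape character, so you would type it with forward slashes instead:

middlemarch <- get_text_as_string("C:/Users/yourname/Documents/middlemarch.txt")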

Another hint: if you’re working with a novel that has more than one word in the title, use underscores to separate words rather than spaces, and aim for abbreviated versions to keep things simple and reduce the likelihood of typos. Instead of tess_of_the_durbervilles, you might just call it tess for the purposes of this exercise.

Yet another hint from 2025: even if you do everything correctly in terms of how you give the file path, you might still see a warning that looks like this:

Warning message:
In readLines(path_to_file) : 
  incomplete final line found on 'https://www.site.uottawa.ca/~lucia/courses/2131-02/A2/trythemsource.txt' 

You can actually just ignore this warning message! The subsequent steps will still work.


5. Type this in R to parse the text into sentences.

s_v <- get_sentences(middlemarch)

The “get_sentences” function parses the text into individual sentences, which needs to happen before you can create a sentiment vector (a mathematical interpretation of the file) in the next step. “s_v” holds those sentences, with each sentence as a separate element.
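
A quick sanity check at this point, using base R's length function (nothing Syuzhet-specific), tells you how many sentences were found:

length(s_v)   # the number of sentences Syuzhet detected

For a novel the size of Middlemarch this should be a number in the thousands; if it comes back as 1, the parsing step didn't work.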


6. Type this in R to get sentiment scores on each of those sentences.

sentiment_vector <- get_sentiment(s_v, method="bing")

This should create a new vector, corresponding to the sentiment values. On the “Vignette” page linked to above, and in Jockers’ various blog posts, he has indicated that in fact there are three other method choices you can try here (including the Stanford NLP method). “Bing” is named after the researcher who created it, not something associated with Microsoft.
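
To experiment with the other dictionary-based methods, you only need to change the method argument; for example (the variable names here are my own):

afinn_vector <- get_sentiment(s_v, method="afinn")   # AFINN lexicon
nrc_vector <- get_sentiment(s_v, method="nrc")       # NRC lexicon, scored +/- in this mode

The Stanford method is the exception: it requires a separate installation of the Stanford CoreNLP tools, so it isn't a one-liner in the same way.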


6a. Inspecting the data. Getting an overall sentiment score. (Added this section in 2025)


You might be curious to know how Syuzhet has parsed your text, sentence by sentence. You might also be curious about seeing an overall sentiment score for your text. (If you just want to start making graphs, skip to steps 7 and 8.)

To see just the sentences as parsed in step 5 above, you can simply type 

s_v

And you should see a list of all the sentences as interpreted by Syuzhet (if you're working with a long novel like Middlemarch, this will be very long, and I wouldn't exactly recommend it). 
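
If you just want to spot-check a handful of sentences rather than print thousands, base R's head function will show only the beginning:

head(s_v, 10)   # show just the first ten sentences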

When I have done this, I have sometimes been surprised at apparent glitches or odd choices made by the software along the way (e.g., ellipses can sometimes get counted as separate sentences). If you're working with a shorter text, this might be a moment to actually clean up the text a little so the parsing is more accurate, and then reload it into Syuzhet following the steps above.


If you want to see what sentiment score each individual sentence in your document received, just type this into R:

sentiment_vector

That will give you a long list of numbers. Each number in the list corresponds to a sentence in your document, in order. 
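
One fun trick here, using nothing but base R indexing: you can ask which sentence received the highest or lowest score and print it out.

s_v[which.max(sentiment_vector)]   # the most positive sentence, per the lexicon
s_v[which.min(sentiment_vector)]   # the most negative sentence

This is also a good reality check on whether the scores match your intuition as a reader.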

More advanced: To make a table of the sentences (s_v) above alongside sentiment scores (sentiment_vector), try this command: 

df <- data.frame(s_v = s_v, sentiment = sentiment_vector, stringsAsFactors = FALSE)

To then print that as a .csv file you can open in Excel or Google Sheets, try this:

write.csv(df, "sentiment.csv", row.names = FALSE)

If everything worked right, you should now have a CSV file called sentiment.csv on your computer, in R's current working directory (type getwd() to see where that is; it is often Documents). Column one should contain the sentences of your sample text, in order, and column two the sentiment score for each corresponding sentence.


To get an overall sentiment score, you can try:

summary(sentiment_vector)

That will give you a response that looks like this:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -3.00   -1.00    0.00    0.12    1.00    4.00 

Or just try

mean(sentiment_vector)

Here, it should just give you a single number. 

The "Mean" is probably a more relevant number than the median for most beginner use cases. 

The Mean could come in handy if your goal is to compare sentiment scores across groups of texts. (For example: Are children's books for younger kids generally more upbeat than books for older kids and young adults? To answer that, you would probably focus on just the average score or mean, and ignore the graphing below.)
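
Here is a minimal sketch of what that comparison might look like for two texts (the file names are hypothetical):

book_a <- get_text_as_string("c:/book_a.txt")
book_b <- get_text_as_string("c:/book_b.txt")
mean(get_sentiment(get_sentences(book_a), method="bing"))   # average sentiment, text A
mean(get_sentiment(get_sentences(book_b), method="bing"))   # average sentiment, text B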



7. Making a noisy graph!

Type this in R:

plot(sentiment_vector, type="l", main="Plot Trajectory", xlab = "Narrative Time", ylab= "Emotional Valence")

If everything worked the way it was supposed to, we should now see a noisy graph that looks like the image below. If you are getting errors, most likely it’s because you have a quotation mark in the wrong place. Also, note that there’s no space between the command “plot” and the opening parenthesis.




8. Making a smoothed graph. 

Again, there’s much discussion of the algorithms used to create these pretty sinusoidal graphs. Despite the debate, and with all caveats in mind, I tend to think that these are still the main reason we are doing all this -- smoothed visualizations are far more useful than noisy graphs from step 7 above. 


[Update from 2025: in the version of this package Jockers released in 2020, he indicated that there might be a better transform function we should be using -- the discrete cosine transformation (DCT), rather than the Fourier transformation. DCT is apparently a simpler algorithm; it is widely used in image compression, for example. The instructions below use the older method. See Jockers' documentation here.]
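
If you want to try the newer route, the updated package includes a get_dct_transform function that slots in roughly where get_transformed_values does below; here is a sketch (double-check the parameter names against Jockers' current documentation, since I am going from memory here):

dct_values <- get_dct_transform(sentiment_vector, low_pass_size=5, x_reverse_len=100, scale_vals=FALSE, scale_range=TRUE)
plot(dct_values, type="l", main="Middlemarch, DCT transformed", xlab="Narrative Time", ylab="Emotional Valence")

(Now, back to the older Fourier-based recipe.)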

Type this in R:

ft_values <- get_transformed_values(sentiment_vector, low_pass_size=3, x_reverse_len=100, scale_vals=TRUE, scale_range=FALSE)

That should apply the Fourier transformation to our noisy data, creating a smoother curve for the sentiment values in the novel. (Incidentally, you can alter the “low_pass_size” parameter and try different number values there to see what happens, as in the sketch below.)
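
For example, re-running the transformation with a larger low_pass_size retains more of the local ups and downs, while smaller values smooth more aggressively:

ft_values_10 <- get_transformed_values(sentiment_vector, low_pass_size=10, x_reverse_len=100, scale_vals=TRUE, scale_range=FALSE)
plot(ft_values_10, type="l", main="Middlemarch, low_pass_size = 10", xlab="narrative time", ylab="Emotional Valence")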

In order to see a pretty graph a la Jockers' “foundation shape,” type this in R:

plot(ft_values, type ="h", main ="Eliot’s Middlemarch transformed", xlab = "narrative time", ylab = "Emotional Valence", col = "red")

That should give you this:



(On the one hand, I recognize the danger of "ringing artifacts" here. On the other, I can't help but think, as I am in the middle of teaching this book this fall, that there's something of the, er, ring of truth to the shape above.)

I tried the same thing with five other Eliot novels, just to see how they compare.

Here’s the noisy sentiment values chart for Eliot’s Adam Bede:


And here’s the Fourier Transformation / low pass filter version:



Here’s the noisy graph for Daniel Deronda:


And here’s the transformed version:


Here’s the noisy Mill on the Floss:


And here’s the transformed version (notice how different it is from the others!):



Here’s the noisy Romola:



And the transformed version of Romola:


Finally, here’s the noisy Silas Marner:

And here’s the transformed Silas Marner:



What can we learn from these visualizations? I’m not entirely sure yet; I also want to try the other, non-sinusoidal transformations mentioned above. I know Eliot’s works pretty well (and I’m teaching Middlemarch right now in my other class), but I’m still a little surprised by these graphs.

Surprise #1 might be just how different the graphs are from one another. The six novels actually follow rather different trajectories. (Would this also be true if we tried this with Dickens… or Henry James?)

Surprise #2 might be the really unusual shape for Silas Marner. Rather than “man in hole,” we have a plot that looks more like “man has some setbacks, but then it’s all good for the second half of the book.”

Surprise #3 might be that novels like Mill on the Floss and Adam Bede do not end with sentiment values near 0. Mill on the Floss is way below 0 and Adam Bede is way above.