I wrote up the following as a primer for the students in my Digital Humanities seminar this fall, but I figured others might benefit from it as well. If you have favorite RegEx commands and tips, I would welcome them in the comments or hit me up on Twitter.
A lot of digital humanities work involves working with messy texts -- you get a PDF image file from Google Books, HathiTrust, Archive.org, or scans from old Microfilm, and you want to turn it into something you can work with, either for producing digital editions of texts or for quantitative analysis.
OCR (optical character recognition) is software that converts images of text in PDFs into plain text. It is built into some PDF software (the non-free version of Adobe Acrobat has OCR, for instance), and you can find various PDF-to-text converters online that will do it for you. Depending on the quality of the software and the quality of your page scan, OCR can be somewhere between 80 and 95% accurate. For most things (other than producing digital editions), 95% is pretty good. Still, I often find myself working hard to clean up the output of OCR to make sure it's useful for my various projects.
It's also worth mentioning that some image files are poor enough that it's not worth your while to use OCR at all -- there are so many mistakes that it might just be faster to retype the whole document, letter for letter.
Many digital humanities queries about literary texts require plain text files that don't have a lot of noise in them. If you are asking software to do word counts or study other features of the language inside a text, you want to make sure the file contains the author's own words and nothing else. If you have a folder full of text files from Project Gutenberg, you need to go through and cut out the header and footer text they attach to every text they publish online. If you have a text from HathiTrust that started as a PDF file, you probably want to cut out page numbers and page headers (Page 7, Page 8, etc.).
Below I give a few tips on how to do that type of clean-up work efficiently using Text Editing software.
1. Installing a text editor.
To get started, you need dedicated text editor software, and you probably need to plan to do this work on a computer rather than a tablet or phone. Note: you can't really do this work in Microsoft Word, Pages, or Google Docs! Those apps will keep trying to add material you don't want into your files -- formatting, bits of invisible code. They also lack the really sophisticated Find and Replace features ("Regular Expressions") we'll need later.
I use Notepad++ (free) on my Windows laptop and CotEditor (free) on the Mac in my office. Both are pretty small programs and won't take up a lot of space on your hard drive.
I would also make a sub-folder in your Documents folder dedicated to working with texts.
1a. Two tips for saving files from the internet.
Wherever possible, when downloading text files from the internet, make sure to save them as Plain Text (UTF-8). UTF-8 refers to a character encoding, and we can mostly not worry about it right now.
Mac users only: This can be a little more complicated on a Mac than on Windows. If you're using Safari on a Mac, you might have trouble figuring out how to save a plain text file from the internet (the default "Save As..." option will only give you the ".webarchive" format).
Try this: Control-click (or right-click) the link, and then select "Download Linked File As..." to save a file as plain text when using Safari on a Mac.
Also, when you save files from the internet, you should probably start getting in the habit of labeling them really specifically to help you figure out what you're looking at later. Don't just accept the file name and file type chosen by someone else. I typically use filenames like
"Pandita-Ramabai-The-High-Caste-Hindu-Woman-1888-nonfiction.txt"
It might seem like overkill if you just have three files. But when you have three hundred files to search through it will come in handy to know exactly what you're looking at.
Also: it would be good to get into the habit of creating filenames that don't have spaces. If and when your files are queried by other software (e.g., a script running in Python), those spaces can cause problems.
2. Find and Replace Function
Quite a lot of text processing can be done with advanced Find and Replace features in Text Editors. You don't need coding!
2a. Removing Numbers and Page Headers
If you copy and paste a text file from HathiTrust into a Text Editor, you might get something that looks like this:
Page Scan 13
A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities, Lady Kesters was enjoying a well-earned rest, reposing at full length on a luxurious Chesterfield, with cushions of old brocade piled at her back and a new French novel in her hand. Nevertheless, her attention wandered from Anatole France ; every few minutes she raised her head to listen intently, then, as a little silver clock chimed five thin strokes, she rose, went over to a window, and, with an impatient jerk, pulled aside the blind. She was looking down into Mount Street, W., and endeavouring to penetrate the gloom of a raw evening towards the end of March. It was evident that the lady was expecting some one, for there were two cups and saucers on a well-equipped tea-table, placed between the sofa and a cheerful log fire. As the mistress of the house peers eagerly at passers- by, we may avail ourselves of the opportunity to examine her surroundings. There is an agreeable feeling of ample space, softly shaded lights, and rich but subdued colours. The polished floor is strewn with ancient rugs ; bookcases and rare cabinets exhibit costly con» I
Page 2
2 A ROLLING STONE tents ; flowers arc in profusion ;
There are a few simple things we can do using Find and Replace automation to clean this up.
--First, make sure that the "Regular Expression" button is turned on in the Replace dialog box. (Regular expressions -- RegEx -- are little bits of code that help us automate certain tasks.)
To get rid of page numbers, use "Replace" (Ctrl-H in Notepad++ on Windows), and do the following three RegEx replace commands
Find: Page \d\d\d
Replace: [leave this blank]
(Hint: Make sure the "Regular Expressions" box is checked!) The "\d" in the Find stands for a numerical digit. If you put three of those in a row, the software looks specifically -- and only -- for three-digit numbers. If you do the above command and hit "Replace All," it should remove all of the "Page ..." labels with three-digit numbers (pages 100 and up).
If you then do this:
Find: Page \d\d
Replace: [leave this blank]
It will then do the page numbers in the 10-99 range. Then repeat again with Find: "Page \d" --> Replace: blank. And that does pages 1-9.
(Why do it in this order? Try it the other way and see what happens. You'll figure out why it's best to start with the hundreds pretty quickly...)
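If you eventually move this kind of clean-up into a script (say, in Python, which I mentioned above as the kind of software that might read these files), the same idea carries over. Here is a minimal sketch -- the filenames are made up, and I'm using \d+ ("one or more digits"), which isn't part of the editor recipe above but lets you do all three passes at once:

    import re

    # Read in the OCR output (made-up filename; substitute your own file).
    with open("rolling-stone-ocr.txt", encoding="utf-8") as f:
        text = f.read()

    # "Page " followed by one or more digits. Because \d+ matches 1, 42,
    # or 137 equally well, the hundreds/tens/ones order doesn't matter here.
    text = re.sub(r"Page \d+", "", text)

    with open("rolling-stone-clean.txt", "w", encoding="utf-8") as f:
        f.write(text)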
You might also notice that all of the semi-colons in the paragraph above are preceded by a space. To get rid of those, you can do this:
Find: [space];
Replace: ;
If there are spaces before punctuation throughout a document, another hack might be to use a Regex command like this:
Find: (.*) ([:,.?!;])
Replace: \1\2
This one is harder to explain. Essentially, you are asking the Text Editor to 'capture' all of the text before the space -- that's the (.*) -- then match the space, then 'capture' any common punctuation mark. Each parenthesized group becomes a captured 'string' that is held in memory and numbered. You then Replace with the two captured strings in sequence -- without the space in between them.
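If you are scripting it, you can do the same thing by capturing only the punctuation mark rather than the whole line. A quick Python sketch (the sample string is just a snippet from the passage above):

    import re

    sample = "bookcases and rare cabinets exhibit costly contents ; flowers arc in profusion ;"

    # Find a space followed by a captured punctuation mark, and write the
    # punctuation back without the space. \1 works just like it does in
    # the editor's Replace box.
    print(re.sub(r" ([:,.?!;])", r"\1", sample))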
2b. Getting rid of other page headers. In the sample of text above, you see this:
2 A ROLLING STONE
That is a page header. If you look at the rest of the book, a version of that is on every other page. Again, we can automate the removal of this using RegEx:
Find: \d\d\d A ROLLING STONE
Replace: [Leave this blank.]
Then do the tens and ones again.
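If you end up scripting this step instead, \d+ ("one or more digits") can catch the hundreds, tens, and ones in a single pass. A quick Python sketch, with the text assumed to already be sitting in a string:

    import re

    sample = "piled at her back\n2 A ROLLING STONE tents ; flowers arc in profusion ;"

    # One or more digits, a space, the running header, and an optional
    # trailing space so we don't leave a stray gap behind.
    print(re.sub(r"\d+ A ROLLING STONE ?", "", sample))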
2c. Putting Line Breaks Before New Chapters.
In the chunk above, you see this:
A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities,
In this book, it looks like new Chapters are not going to be clearly demarcated with line breaks, but we probably want them to make the text file readable for humans.
To make sure new chapters are easy to find, you can put them in using this command:
Find: CHAPTER
Replace: \n\n CHAPTER
The "\n" is for new line. If you do the command above, it will put two line breaks before each instance of CHAPTER (make sure the "Match Case" option is turned on, or it might do this when it randomly encounters the word "chapter." Most likely, the only time you'll see the word CHAPTER in all caps is the beginning of a new chapter).
2d. HathiTrust-specific clean-up.
Documents derived from HathiTrust often look a little like this:
## p. (#5) ##################################################
A TINY SPARK
BY
CHRISTINA MOODY
Washington, D. C.
MURRAY BROTHERS PRESS
1910
## p. (#6) ##################################################
To strip out those "## p. ..." marker lines, run:

Find: ##.*\n

Replace: [leave this blank]
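If you're working in Python instead, the same pattern can remove every one of those marker lines in one go (the sample string here is just a shortened version of the block above):

    import re

    sample = ("## p. (#5) ##################################################\n"
              "A TINY SPARK\n"
              "BY\n"
              "CHRISTINA MOODY\n"
              "## p. (#6) ##################################################\n")

    # ##, then everything to the end of the line, then the newline itself,
    # so the whole marker line disappears.
    print(re.sub(r"##.*\n", "", sample))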
3. Putting in line breaks in unformatted poetry.
Sometimes when you bring text in from an OCRed PDF file, you get all of the text of the poems, but none of the line breaks. This is from a book of poetry I've been working with, in a file derived from HathiTrust:
SONG OF THE HINDUSTANEE MINSTREL WITH surmah tinge thy black eye's fringe, 'Twill sparkle like a star; With roses dress each raven tress, My only loved Dildar! II Dildar ! there's many a valued pearl In richest Oman's sea; But none, my fair Cashmerian girl! O ! none can rival thee. Ill In Busrah there is many a rose Which many a maid may seek, But who shall find a flower which blows Like that upon thy cheek? IV In verdant realms, 'neath sunny skies, With witching minstrelsy, We'll favor find in all young eyes, And all shall welcome thee.
This is a challenging one! I haven't found a way to fully automate introducing line breaks using RegEx, though I have found a command that speeds things up reasonably well -- find capital letters, and insert a line break before each one.
For the above, do
Find: ([ABCDEFGHIJKLMNOPQRSTUVWXYZ])
Replace: \n\1
The square brackets around that group of letters tell the software you're looking for any one of these capital letters (make sure Match Case is on). This doesn't work perfectly, since the "I" will often catch the pronoun by itself (which might not be the beginning of a line). It will also catch randomly capitalized words and proper nouns (in the above, it will catch words like "Dildar"). So you can't just tell it to Replace All -- you have to go through and check each one. It's still faster than doing it completely without automation.
How it works: the parentheses around the brackets tell the software to "capture" the letter in question and keep it in memory for the Replace command.
The \1 in the replace command calls back the string we captured in the Find command, and tells the software to print the same letter again.
The above passage is particularly messy, but if you run the command above and make some judicious choices about likely line breaks using the Find/Replace dialog box (again, not using Replace All), you could end up with:
SONG OF THE HINDUSTANEE MINSTREL
I. WITH surmah tinge thy black eye's fringe,
'Twill sparkle like a star;
With roses dress each raven tress,
My only loved Dildar!
II. Dildar ! there's many a valued pearl
In richest Oman's sea;
But none, my fair Cashmerian girl!
O ! none can rival thee.
III. In Busrah there is many a rose
Which many a maid may seek,
But who shall find a flower which blows
Like that upon thy cheek?
IV. In verdant realms, 'neath sunny skies,
With witching minstrelsy,
We'll favor find in all young eyes,
And all shall welcome thee.
After the line breaks are in place, we might go through and use the hack above to clean up the "space before punctuation" problem.
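For what it's worth, here is the same capital-letter trick written out as a Python sketch. I'm using [A-Z], a standard shorthand for the long list of letters above; run blindly, it breaks the line before every capital (including the Roman numerals and names like "Dildar"), which is exactly why it's better to step through the matches in the editor rather than hitting Replace All:

    import re

    sample = "My only loved Dildar! II Dildar ! there's many a valued pearl In richest Oman's sea;"

    # [A-Z] matches any single capital letter; the parentheses capture it
    # so \1 can print it back after the inserted line break.
    print(re.sub(r"([A-Z])", r"\n\1", sample))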
4. A more advanced RegEx example: extracting a list of words from a tagged file.
RegEx is extremely sophisticated, and there are many more advanced commands I won't get into here (also: I am still very much a learner).
(It's basically a form of coding without actually writing programs... Interestingly, most full-fledged programming languages let you embed RegEx within them, so it might be worth your while to learn more of it...)
Here is a more advanced example that was shown to me by a software developer who works in the library at my institution (Rob Weidman). We have a file where we used the Stanford Named Entity Recognizer (NER) to tag every proper name and every location in a book (why we did that and how that works is a question for another day). The output it produces looks like this:
Major <PERSON>Carteret</PERSON>, though dressed in brown linen, had thrown off his coat for greater comfort. The stifling heat, in spite of the palm-leaf fan which he plied mechanically, was scarcely less oppressive than his own thoughts. Long ago, while yet a mere boy in years, he had come back from <LOCATION>Appomattox</LOCATION> to find his family, one of the oldest and proudest in the state, hopelessly impoverished by the war,--even their ancestral home swallowed up in the common ruin. His elder brother had sacrificed his life on the bloody altar of the lost cause, and his father, broken and chagrined, died not many years later, leaving the major the last of his line. He had tried in various pursuits to gain a foothold in the new life, but with indifferent success until he won the hand of <PERSON>Olivia Merkell</PERSON>, whom he had seen grow from a small girl to glorious womanhood.
Let's say we want a file with just the names of people referenced in this book -- the items the NER software has tagged as <PERSON>.
First, you want to make sure there aren't a lot of invisible line breaks in the text. It's OK to do a global Find/Replace where you Find \n and Replace it with a single space.
Then:
1) Replace <PERSON with \n<PERSON
This puts a line break before each PERSON tag.
2) Delete the first line of text (everything up to the first instance of a person tag).
3) Replace <PERSON>(.*)<\/PERSON>.* with $1
This gets rid of the tags and everything outside of the tags and replaces it with just the text within the tags.
This produces a list of just names of people tagged in the text. In the paragraph above, it would produce
Carteret
Olivia Merkell
What the various commands above are doing is complicated. The .* inside the parentheses means "any run of characters," and the parentheses capture that run to memory. The $1 calls back the string we just captured between the tags (it works like the \1 we used earlier). The .* after the closing PERSON tag grabs all of the text *outside* the tags -- which we're going to delete.
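If you wanted to pull the same list out with a script, Python's findall can collapse the three editor steps into one line. This is just a sketch of the idea, not the recipe Rob showed me; the .*? is the "non-greedy" version of .*, which matters here because the whole paragraph sits on one line (the editor recipe gets around that by first breaking the text so there's one PERSON tag per line):

    import re

    tagged = ("Major <PERSON>Carteret</PERSON>, though dressed in brown linen, had thrown "
              "off his coat ... until he won the hand of <PERSON>Olivia Merkell</PERSON>, "
              "whom he had seen grow from a small girl to glorious womanhood.")

    # findall returns just the captured group for every match, so this is
    # already the list of names with the tags stripped away.
    print(re.findall(r"<PERSON>(.*?)</PERSON>", tagged))  # ['Carteret', 'Olivia Merkell']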
For reference, ".*" is a really important RegEx command.
Also helpful is the ".+" command (one or more of any character)... And the "^", which, when placed inside square brackets, means "NOT" -- [^0-9], for instance, matches any character that is not a digit...
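A few quick illustrations of those, sketched in Python (the sample line is invented):

    import re

    line = "Page 42: flowers arc in profusion ; 1910"

    print(re.findall(r"\d+", line))       # ['42', '1910']  -- one or more digits
    print(re.sub(r"Page.*?:", "", line))  # .* and .*? match runs of any character
    print(re.findall(r"[^0-9 ]+", line))  # ^ inside [ ] means "not these characters"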
It goes on... I'll just recommend people look at the "RegEx Cheat Sheet" here.