A basic guide to using NLP for corpus analysis with R (Part 2): Processing text files

If you’re working with language data, you probably want to process text files rather than strings of words that you type into an R script. Here is how to deal with files. Refer to the previous post for setting up the tools if needed. Again, please see the pdf version to see the R script output. *Note: cleanNLP has gone through an update and its functions have been renamed. I have updated this post accordingly.

1. Processing text files

Place all text files that you want to process under the working directory. For example, my working directory is currently set as: C:/my/working/directory/. The .txt files that I will process are in a folder named corpus under this working directory: C:/my/working/directory/corpus. Before proceeding to the next part, load the cleanNLP and reticulate packages, and initialize spaCy by executing cnlp_init_spacy.

1.1 Annotate a single text

Let’s say the name of the text file I want to analyze is text_01.txt, and it’s in the corpus folder right under the working directory. Here is how to process this particular file:
single.text <- cnlp_annotate("corpus/text_01.txt")

It’s as simple as that. 

1.2 Annotate all files in a folder

Under the folder corpus, I actually have 50 text files and I’d like to process them all. Here is the code that will annotate all these files:
all.text <- cnlp_annotate("corpus/*.txt")

Again, simple! The package has annotated all .txt files under the folder corpus, and the results are saved in an object named all.text. Something you might want to check at this point is whether all files in your folder were analyzed. Type and execute:
cnlp_get_document(all.text)

You will see something like this on your console:
# A tibble: 50 x 5
id time version language uri
1 1 2018-01-10 03:22:11 2.0.5 texts/text_01.txt
2 2 2018-01-10 03:22:11 2.0.5 texts/text_02.txt
3 3 2018-01-10 03:22:11 2.0.5 texts/text_03.txt
4 4 2018-01-10 03:22:11 2.0.5 texts/text_04.txt
5 5 2018-01-10 03:22:11 2.0.5 texts/text_05.txt
6 6 2018-01-10 03:22:11 2.0.5 texts/text_06.txt
7 7 2018-01-10 03:22:11 2.0.5 texts/text_07.txt
8 8 2018-01-10 03:22:11 2.0.5 texts/text_08.txt
9 9 2018-01-10 03:22:11 2.0.5 texts/text_09.txt
10 10 2018-01-10 03:22:11 2.0.5 texts/text_10.txt
# ... with 40 more rows

This example shows the first 10 of the 50 files that I’ve just processed. It says there are 40 more rows, so I assume all 50 files have been successfully included in the analysis. However, it doesn’t show the whole list, so you might want to save this data table as a data frame to view the entire list.

texts.doc <- cnlp_get_document(all.text)
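
One way to confirm that nothing was skipped is to compare the uri column of the document table with the files that are actually in the folder. The sketch below is only an illustration: it uses a small made-up stand-in for texts.doc and a hard-coded vector of expected paths (both are assumptions), whereas in practice the expected paths would come from something like list.files("corpus", pattern = "\\.txt$", full.names = TRUE).

```r
# Made-up stand-in for the document table
# (in practice: texts.doc <- cnlp_get_document(all.text))
texts.doc <- data.frame(
  id = 1:3,
  uri = c("corpus/text_01.txt", "corpus/text_02.txt", "corpus/text_03.txt"),
  stringsAsFactors = FALSE
)

# Hard-coded so the example runs anywhere; in practice, list the folder instead
expected.files <- c("corpus/text_01.txt", "corpus/text_02.txt",
                    "corpus/text_03.txt", "corpus/text_04.txt")

# Files that are in the folder but absent from the annotation results
missing.files <- setdiff(expected.files, texts.doc$uri)
print(missing.files)
```

An empty result would mean that every file in the folder has a row in the document table.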

Then of course you have all the annotation tables (tokens, dependencies, entities) that you can retrieve as I discussed in the previous post.

You can export these data to your drive yourself, but better yet, if you add the output_dir argument when you first annotate the files, the results will automatically be saved as .csv files in the directory that you designate.

all.text <- cnlp_annotate("corpus/*.txt", output_dir = "corpus")

After running this line, I find four new files under the folder corpus: dependency.csv, document.csv, entity.csv, and token.csv.

This might be all you need, and you can take these files from here to analyze the results with any other software. While we have R running, I will briefly look at some descriptive statistics in R in the next section.

2. Describing data

Because we now have the language data under investigation as a data table with words, lemmas, and part-of-speech tags, we can easily describe this data in frequencies.

2.1 Frequency tables

The most interesting part of the data is included in the data frame token, so I’m saving it as a new data frame that I can easily access:
t.data <- as.data.frame(all.text$token)

There are many different ways to get the desired information, but I will just stick to using the table function for now. First, to see how many sentences there are in this data, recall that each sentence start is marked as “ROOT” when annotated. I’ll take advantage of that and type in:
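
The command itself does not appear here. Judging from the description that follows, it was presumably a table of a logical test on the lemma column, along the lines of table(t.data$lemma == "ROOT"). Here is a minimal sketch on a made-up token table (the words and lemmas are invented for illustration):

```r
# Made-up token table standing in for t.data:
# two sentences, each starting with a ROOT row
t.data <- data.frame(
  word  = c("ROOT", "Dogs", "bark", ".", "ROOT", "Cats", "sleep", "."),
  lemma = c("ROOT", "dog", "bark", ".", "ROOT", "cat", "sleep", "."),
  stringsAsFactors = FALSE
)

# FALSE counts every other row; TRUE counts the ROOT rows, i.e. sentences
root.counts <- table(t.data$lemma == "ROOT")
print(root.counts)
```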

What I see now is a table with labels FALSE and TRUE in the console. The number associated with FALSE is the number of lemmas other than ROOT (hence actual tokens but including punctuation marks, etc), and the one associated with TRUE is the number of ROOTs, in other words, sentences.
FALSE  TRUE
 5600   449

So here I have 449 sentences.

Use the following code to see the frequencies of universal POS:
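
The code is missing at this point. A plain table() call on the upos column would produce such counts, so here is a small sketch along those lines, again with an invented token table rather than the real data:

```r
# Invented token table standing in for t.data
t.data <- data.frame(
  lemma = c("dog", "bark", "loudly", ".", "lazy", "cat", "sleep", "."),
  upos  = c("NOUN", "VERB", "ADV", "PUNCT", "ADJ", "NOUN", "VERB", "PUNCT"),
  stringsAsFactors = FALSE
)

# Count how often each universal POS tag occurs
upos.counts <- table(t.data$upos)
print(upos.counts)

# The same counts, most frequent tag first
print(sort(upos.counts, decreasing = TRUE))
```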

After looking at the results, you will want to get rid of things tagged as “ROOT”, punctuation, spaces, etc., unless you are interested in these features. An easy way to do it is to use the filter function from dplyr. I’m creating a new data frame that only includes the following tags:
library(dplyr)
c.data <- filter(t.data, upos %in% c("ADJ", "ADV", "NOUN", "VERB"))

See the list of tags explained here.

The new data frame, c.data, should now include only words tagged as adjectives, adverbs, nouns, or verbs. Run the code for the frequency table again, replacing the name of the data frame, to see the changes.

To look at the percentages, use prop.table. For example, see the percentage of each category under universal POS:
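
The call is not shown here. In base R it can be written by wrapping the frequency table in prop.table(); multiplying by 100 turns the proportions into percentages. A sketch on invented data (the small c.data below is a stand-in, not the real data frame):

```r
# Invented stand-in for c.data
c.data <- data.frame(
  upos = c("NOUN", "VERB", "ADV", "ADJ", "NOUN", "VERB", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Proportions of each tag (they sum to 1)
upos.prop <- prop.table(table(c.data$upos))

# The same values as percentages, rounded to one decimal place
upos.pct <- round(100 * upos.prop, 1)
print(upos.pct)
```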

I will save two frequency tables as data frames, one for lemmas and one for universal POS tags.

freq.lem <- data.frame(table(c.data$lemma))
freq.upos <- data.frame(table(c.data$upos))

I will also sort them using the arrange function in dplyr, the lemmas by descending frequency and the POS tags alphabetically by tag:
freq.lem <- arrange(freq.lem, desc(Freq))
freq.upos <- arrange(freq.upos, Var1)

2.2 Basic visualization

I will make one graph here, using the base graphics included in R. With the frequency data, it would make sense to visualize it as a bar plot.

Enter the following code. The first argument should be height, or the y-axis value (here, frequencies):
barplot(freq.upos$Freq, names.arg = freq.upos$Var1, xlab = "Universal part-of-speech tags", ylab = "Observed frequencies")


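Here is another graph visualizing the top ten most frequent lemmas, using the package ggplot2.
library(ggplot2)
top.lem <- freq.lem[1:10, ]
names(top.lem)[1] <- "lemma"
# Convert the factor to character row by row; levels() would return the labels in alphabetical order and mismatch the frequencies
top.lem$lemma <- as.character(top.lem$lemma)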
ggplot(top.lem, aes(x = lemma, y = Freq, label = Freq)) + geom_bar(stat = "identity", fill = "black", size = 6) + geom_text(color = "white", size = 4, vjust = -1) + labs(x = "Lemma", y = "Frequency", title = "Top 10 most frequent lemmas") + coord_flip() + theme_minimal()


While the graph looks neat, one thing that bothers me is that spaCy lemmatizes all pronouns as “-PRON-”. Not only does this fail to distinguish between first, second, and third person pronouns, but the hyphens can also cause some issues with data handling. I would run some code to re-lemmatize all pronouns if I wanted to include them in my analysis.
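
One possible way to do that re-lemmatization is to fall back on the lower-cased word form wherever the lemma is “-PRON-”. This is only a sketch: it assumes the token table has a word column holding the original form next to lemma, and the toy data below are made up.

```r
# Made-up token table with spaCy-style "-PRON-" lemmas
t.data <- data.frame(
  word  = c("I", "saw", "them", "and", "She", "waved"),
  lemma = c("-PRON-", "see", "-PRON-", "and", "-PRON-", "wave"),
  stringsAsFactors = FALSE
)

# Rows whose lemma is the generic pronoun placeholder
is.pron <- t.data$lemma == "-PRON-"

# Replace the placeholder with the lower-cased original word
t.data$lemma[is.pron] <- tolower(t.data$word[is.pron])
print(t.data)
```

This keeps forms such as “they” and “them” as separate lemmas. Merging them into one base pronoun would need an additional lookup table.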

If you just need to tag texts for concordancing, it is much faster and more appropriate to use existing software such as TagAnt (Anthony, 2015). The tools I’ve used here are better suited to data analysis.

I’ll continue to post, for my own documentation purposes, tools and techniques related to NLP and corpus analysis.

