Working with XML-formatted text annotations in R

In this post, I document how to reformat the XML files output by the Stanford CoreNLP tool. This may not be the most elegant approach, but it works for me. I use R and the XML files produced in the previous step.

Creating tagged text

The files have already been tokenized and POS-tagged outside R with the Java-based tool. Here, I read in the annotated XML files and save them in a data frame with one row per token (each token node) and one column per tag (variable) describing it.

First, install (if necessary) and load the XML and plyr libraries:

library(XML)
library(plyr)

Next, I will read in the files. Before I start, I prefer to move the XML files (and only the XML files) to a new folder under the working directory of the current R session. I’ll call this folder xml. The following code reads the file names and then changes the working directory to the folder that contains the files. The code in the remainder of this post won’t work unless the working directory is the folder where the XML files are. Specify your own folder in the parentheses.

# Read the files
files <- list.files("xml")
setwd("xml")
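If changing the working directory is inconvenient, one possible alternative is to ask list.files() for full paths and keep only files ending in .xml. This is a minimal sketch, not what the rest of this post does; the temporary folder and file names are made up for illustration.

```r
# Build a throwaway folder with two XML files and one unrelated file
toy.dir <- file.path(tempdir(), "xml_demo")   # hypothetical folder name
dir.create(toy.dir, showWarnings = FALSE)
file.create(file.path(toy.dir, c("a.xml", "b.xml", "notes.txt")))

# pattern keeps only names ending in .xml; full.names returns usable paths
toy.files <- list.files(toy.dir, pattern = "\\.xml$", full.names = TRUE)
print(basename(toy.files))
```

With full paths, the files can be read from any working directory, so setwd() is not needed.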

Next is a function that 1) parses the XML and 2) extracts the values from the document. It is adapted from code I found in a Stack Overflow post:

tags_df <- function(file, expr){
  message("Loading ", file)
  # Parse the file into an internal XML tree and get its root node
  xfile <- xmlInternalTreeParse(file)
  xtop <- xmlRoot(xfile)
  # For every node matching expr, collect the text values of its children
  tags_list <- xpathSApply(xtop, expr, function(x) xmlSApply(x, xmlValue))
  # Transpose so that each token is a row, and return the data frame
  data.frame(t(tags_list), row.names = NULL)
}

If you remember the tree structure of these XML files, the information about each token is stored under a token node. Therefore, I will create a list that holds all values under these nodes (selected with the XPath expression //token).

# Retrieve the values under the token nodes
all.tags <- lapply(files, tags_df, "//token")

The tags_df function prints a loading message for each file as it is processed.
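To see what the extraction step returns, here is a small sketch that runs the same calls on an inline XML string instead of a file. The XML snippet is invented for illustration, and its element names (word, lemma, pos) are assumptions chosen to mirror the column names used in this post; your own files may name them differently.

```r
library(XML)

# Made-up, minimal XML with two token nodes (no whitespace between elements)
toy.xml <- paste0(
  "<root><tokens>",
  "<token id=\"1\"><word>I</word><lemma>I</lemma><pos>PRP</pos></token>",
  "<token id=\"2\"><word>like</word><lemma>like</lemma><pos>VBP</pos></token>",
  "</tokens></root>")

# asText = TRUE tells the parser that the input is XML content, not a file name
toy.doc <- xmlInternalTreeParse(toy.xml, asText = TRUE)

# One column per token, one row per child element; t() flips it to one row per token
toy.vals <- xpathSApply(xmlRoot(toy.doc), "//token",
                        function(x) xmlSApply(x, xmlValue))
toy.tags <- data.frame(t(toy.vals), row.names = NULL)
print(toy.tags)
```

Each token ends up as one row, and each child element of the token becomes a column.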

Currently, the information is stored in all.tags, a list, which is awkward to work with. I will create a data frame called xml.df and save the necessary token information there.

# Create an empty data frame to store information
xml.df <- data.frame()

# Append each element of the list to the data frame
# (seq_along() is safe even if all.tags is empty, unlike 1:length())
for (i in seq_along(all.tags)) {
  xml.d <- as.data.frame(all.tags[[i]])
  xml.df <- rbind(xml.df, xml.d)
}

Now the data frame xml.df should have one word per row, with the POS and lemma associated with it in columns.
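The same result can probably be had without a loop by handing the whole list to rbind() through do.call(). This is a sketch on toy data, assuming every element of the list is a data frame with the same columns; the values below are made up.

```r
# Toy stand-in for all.tags: one small data frame per file
toy.tags <- list(
  data.frame(word = c("I", "like"), lemma = c("I", "like"), pos = c("PRP", "VBP")),
  data.frame(word = "pizza", lemma = "pizza", pos = "NN")
)

# do.call() passes every list element to rbind() as a separate argument
toy.df <- do.call(rbind, toy.tags)
print(toy.df)
```

If the files do not all share the same columns, plain rbind() stops with an error; rbind.fill() from plyr pads the missing columns with NA instead.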

What I eventually want to do is to make everything a string so that I can search for sequences of words (e.g., may + verb infinitive) using regular expressions. The result I want is a list of sentences that include the specific sequence of words. To do this, I will need to separate sentences from the entire text.

The only way I know to do this is to label each token with its token ID. The token ID gives the position of the word in its sentence. Therefore, whenever I see token ID 1, I know a new sentence has started. I will come back to this idea later.

# Another function that retrieves the attribute information, which is the token ID
tid_df <- function(file, expr){
  message("Loading ", file)
  xfile <- xmlInternalTreeParse(file)
  xtop <- xmlRoot(xfile)
  # Read the "id" attribute of every node matching expr, in document order
  tids <- xpathSApply(xtop, expr, xmlGetAttr, "id")
  # Return the IDs as a plain vector (no global assignment with <<-)
  unlist(tids)
}

# Create a list of token IDs, using the same XPath as for the tags
# so that there is exactly one ID per row of xml.df
all.tids <- lapply(files, tid_df, "//token")
tid <- unlist(all.tids)

# Append the token IDs to the data frame xml.df
xml.df$tid <- tid
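The idea that token ID 1 starts a new sentence can be checked on a toy vector. This sketch turns the IDs into a running sentence number with cumsum(); the words and IDs are made up.

```r
# Toy token IDs for two sentences: "I like pizza" and "We agree"
toy.tid  <- c(1, 2, 3, 1, 2)
toy.word <- c("I", "like", "pizza", "We", "agree")

# toy.tid == 1 is TRUE at each sentence start; cumsum() counts the starts so far
toy.sentence <- cumsum(toy.tid == 1)

# Group the words by sentence number
print(split(toy.word, toy.sentence))
```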

Now that I have all the information I need in one place, I will save it as a single text that looks something like this: {t="1"}{pos="PRP" lem="I"}I {t="2"}{pos="VBP" lem="like"}like {t="3"}{pos="NN" lem="pizza"}… This looks confusing, but because angle brackets (<>) trigger HTML in blog posts, I replaced them with {}. The code is based on Gries’ (2017) Quantitative Corpus Linguistics with R.

# Every literal double quote inside a string must be escaped as \"
text <- paste0("{t=\"", xml.df$tid, "\"}", "{pos=\"", xml.df$pos,
               "\" lem=\"", xml.df$lem, "\"}", xml.df$word, collapse = " ")

The output is now saved as text. I will remove the token IDs, but first replace every {t="1"} with {s} to mark the beginning of each sentence. Be sure to replace all {} with <>!

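# Turn every token ID 1 into a line break plus a sentence marker
text <- gsub("\\{t=\"1\"\\}", "\n{s}", text, perl = TRUE)
# Remove the remaining token IDs; [0-9]+ keeps the match inside one tag
text <- gsub("\\{t=\"[0-9]+\"\\}", "", text, perl = TRUE)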
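A toy run of the two substitutions shows what they do to a short string. The sketch escapes the braces and matches the ID as digits only, which are defensive choices of mine; the input string is made up.

```r
# Two toy sentences: "I like" and "We"
toy <- paste0("{t=\"1\"}{pos=\"PRP\" lem=\"I\"}I ",
              "{t=\"2\"}{pos=\"VBP\" lem=\"like\"}like ",
              "{t=\"1\"}{pos=\"PRP\" lem=\"we\"}We")

# Step 1: each token ID 1 becomes a new line that starts with {s}
toy <- gsub("\\{t=\"1\"\\}", "\n{s}", toy, perl = TRUE)

# Step 2: drop the remaining token IDs; digits-only matching stops at the tag's end
toy <- gsub("\\{t=\"[0-9]+\"\\}", "", toy, perl = TRUE)

cat(toy, "\n")
```

A greedy pattern such as .+ would run on to the last double quote in the text and delete almost everything, which is why the ID is matched with [0-9]+ here.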

Save the resulting text as a text file named corpus:

write(text, "corpus.txt")

Now I have a single text file that combines all the texts I originally had, tagged with POS and lemma, with each sentence on its own line and marked with the tag {s}.

Example query and concordances

As an example, I will find sentences that include the grammatical structure “may + verb infinitive”.

# Read the text file in
corpus <- scan("corpus.txt", what = "char", sep = "\n", quiet = TRUE)

# Change the text to lower case and save as 'working.corpus'
working.corpus <- tolower(corpus)

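# Keep only the lines that hold a sentence, i.e. those marked with {s}
working.corpus <- grep("\\{s\\}", working.corpus, perl = TRUE, value = TRUE)

# Extract the sentences that include "may + verb":
# the modal "may" (Penn Treebank tag MD) followed by a base-form verb (VB);
# tags and lemmas are lower case here because of tolower() above
find.matches <- grep("\\{pos=\"md\" lem=\"may\"\\}[^{]* \\{pos=\"vb\" lem=\"[^\"]*\"\\}[^{]*", working.corpus, perl = TRUE, value = TRUE)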

# Remove the POS tags to get clean sentences
clean.matches <- gsub("{.*?}", "", find.matches, perl = TRUE)

# Remove the space before punctuation
clean.matches <- gsub(" (?=[.,!?])", "", clean.matches, perl = TRUE)
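The query and clean-up steps can be tried on two toy sentences. This is a sketch under the assumption that the corpus lines look like the tagged text described above, lower-cased; the sentences and the exact pattern are mine.

```r
toy.corpus <- c(
  paste0("{s}{pos=\"prp\" lem=\"you\"}you {pos=\"md\" lem=\"may\"}may ",
         "{pos=\"vb\" lem=\"go\"}go {pos=\".\" lem=\".\"}."),
  paste0("{s}{pos=\"prp\" lem=\"i\"}i {pos=\"vbp\" lem=\"like\"}like ",
         "{pos=\"nn\" lem=\"pizza\"}pizza {pos=\".\" lem=\".\"}.")
)

# The modal "may" followed directly by a token tagged vb (base-form verb)
toy.pattern <- "\\{pos=\"md\" lem=\"may\"\\}may \\{pos=\"vb\" lem=\"[^\"]*\"\\}"
toy.hits <- grep(toy.pattern, toy.corpus, perl = TRUE, value = TRUE)

# Strip every {...} tag, then the space left before punctuation
toy.clean <- gsub("\\{.*?\\}", "", toy.hits, perl = TRUE)
toy.clean <- gsub(" (?=[.,!?])", "", toy.clean, perl = TRUE)
print(toy.clean)
```

Only the first sentence matches, because the closing quote after vb keeps the pattern from also matching tags such as vbp.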

See the results and save them.

print(clean.matches)
write.csv(clean.matches, "matches.csv")

The results are not shown here. Unfortunately, I wasn’t able to create a markdown file for this post because I couldn’t get it to produce output when changing directories is involved. If you’re following this post and run into issues, feel free to contact me.
