This is Part 1 of a basic guide for setting up and using a natural language processing (NLP) tool with R. I specifically utilze the spaCy “industrial strength natural language processing” Python library, and an R wrapper called cleanNLP that provides tools for annotating texts and obtaining data tables. In this post, I will explain step-by-step how to get ready to work with text data using these two packages, and a snippet of what they can do.
You might want to see this post in a pdf I’ve created with markdown, which shows the codes and results more clearly. The content is exactly the same.
I decided to write this post because for someone like me who has no background in computer science but wants to get their feet wet with NLP. It could be difficult to get started with these tools because you don’t know where to start or what to exactly look for. I have not yet seen any webpages that describe what I’m trying to do all in one place. It took a while for me to hunt down the information that I needed to carry out the kind of analyses that I wanted to do, and took many failed attempts to finally get the computer programs working. SpaCy and cleanNLP utilized in R have so far been the easiest to learn and work with in my experience as a beginner.
In this post, I assume you have some experience with R, but you should be able to follow along (though might not be able to understand what’s going on exactly) even if you haven’t used it before.
In case you need to install R and RStudio:
R for Windows: https://cran.r-project.org/bin/windows/base/
Before you begin, check your computer system environment and compare it with mine. My guidelines might not work in your setting and you might experience issues that I don’t mention here (use Google search!). My system environment is: Windows 10 64-bit operating system.
1. Installing Python
If you have never installed Python on your computer, great. You start with a clean slate and it might work the best. If you already have Python installed (must be Python 2.7 and up) but not sure if it works, skip to the end of this section (1.3) to test it.
Here, I’m installing the latest version of Python, 3.6.4, which works with my OS. If you don’t use Python for anything else, just following these instructions will probably be your best bet. There are different ways to install Python, for example, using Anaconda which provides GUI and may be better in the long run, but I won’t use it here.
1.1 Download Python
First step, download Python from here: link for Windows. Make sure to get an installer that matches your operating system and R. In other words, if you have 32-bit R installed, download the 32-bit version (labelled “Windows x86”). With 64-bit R, download 64-bit Python (labelled “Windows x86-64”).
1.2 Install Python
After the download is complete, execute the installer. Two important things. First, check the box that says “add to PATH” when you see it. For Python 2.7.x, this option has a different appearance – not as a checkbox but an icon to click somewhere during the setup. If you have accidentally neglected to check the “add to PATH” box, you will need to manually add the path for Python to the system environment variables (but it might be difficult to find what works for your system so you might as well just uninstall, reboot, and reinstall, enabling the above-mentioned option). Second, I recommend just use the default directory provided as the location. In my case, Python did not work properly when I installed it under a different location.
1.3 Test if Python works
Open up a Command Prompt by searching “command …” or “cmd” from the windows start menu. Preferably, open the program by right-clicking the mouse and selecting “run as administrator”. If you don’t run it as admin, it’s fine for now but you have to run it as admin when you execute Section 2.2 Download language models.
Type “python” (without the quotation marks) and hit enter.
Your installation was successful if you see something like this:
Python 3.6:d48eceb, Dec 19 2017, 06:54:40 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
If you get a message that says “python is not recognized as internal or external command”, which I don’t think you will as long as you have followed the instructions exactly so far, it means trouble. If you don’t know well about these issues, the simplest solution is to uninstall, reboot, and reinstall. Hopefully you have not encountered any issues and have Python up and running now.
Now type in “exit()” (without the quotation marks) to move on to installing spaCy.
2. Installing NLP backend: spaCy
To put it very simply, spaCy is a library that processes and analyzes languages. CleanNLP utilizes this specific Python library. Another option with cleanNLP is coreNLP (Java). I find spaCy lighter and easier to set up than coreNLP (a point also noted by the creators). Here I only demonstrate installing spaCy. See the “Backends” section from here for more information related to this.
2.1 Install spaCy
In your Command Prompt, enter the following code:
pip install -U spacy
It can take several minutes to finish installing. If it fails to install with the error message that says Microsoft Visual C++ 14.0 is required, go download Visual C++ Build Tools and retry.
2.2 Download language models
Next step is to install the language models (as in Python packages) that you want. spaCy provides packages for a number of languages: see the models for languages other than English. In this example, I will install English models that I will use to process texts in English. One thing to note is that within English, different models are offered. Go to this page and scroll down to read about the models. The larger the model is, the more accurate it is, though the differences don’t seem to be that big.
Enter the following code to the command line:
python -m spacy download en
The default model is
en_core_web_sm. You can install multiple models if you wish. I installed the largest model:
python -m spacy download en_core_web_lg
If you see an error message saying that linking was unsuccessful, it could be because you didn’t run the Command as administrator.
After downloading and linking is complete, test if the installed modules work by trying the following codes, one line at a time:
nlp = spacy.load('en')
doc = nlp(u'This is a sentence.')
Did the last code return “This is a sentence.”? If yes, yay! – it’s probably safe to assume everything is running smoothly. Quit the Command Prompt and start RStudio. (Restart RStudio if you have had it open while installing spaCy.)
3. Getting ready with RStudio
3.1 Install all requirements
Create a new R script. The required packages for cleanNLP are:
#install packages: cleanNLP, reticulate
#load the packages
To get started with processing texts, you need to initialize spaCy:
#initalized the NLP backend
Now, if this doesn’t work, try the following code to see whether there’s any problems with Python configuration on your computer.
#check your Python configuration
If you want to use a specific model, for example the “en_core_web_lg” model, specify it:
cnlp_init_spaCy(model_name = "en_core_web_lg")
A little anecdote is that when I first tested the automatic annotation, I used a line from “The Stairway to Heaven”: “There’s a lady who’s sure all that glitters is gold, and she’s buying a stairway to heaven.” The default model, the small and medium sized models marked “glitters” as a noun. The large model marked the word correctly as a verb. After seeing this, I’ve been using the large model.
3.2 Processing a text string
This section will serve as a simple test and demonstration of the basic use of the cleanNLP package. It is based on the procedure described on the cleanNLP webpage.
#make up a text
text <- "Let me be the one you call. If you jump, I'll break your fall."
Now we have a text to analyze. Give R a command to automatically annotate the text.
#run the annotator
obj <- cnlp_run_annotate(text, as_strings = T)
#as_strings = T when the data given to input is a vector
Now you should have new data named “obj”. It;s populated with lists of various data and, though invisible, the command has produced 4 csv files with this data in a temporary location. Let’s view a data table that includes annotated objects such as lemmas and part-of-speech (pos).
#view the annotated data
The table returns rows consisted of information regarding each token. Here are some detailed descriptions of what each column means:
- id: Text id. Here 'id's are all 1 because there is one text source.
- sid: Sentence id (or sentence number) nested under 'id'. We can see that we have 2 sentences in this text.
- tid: Token id nested under 'sid'. Note that for each sentence, 'tid' starts from 0 to mark ‘ROOT’, which indicates the start of a sentence. The default code hides the 0s, but they are there, and will be useful for corpus analysis.
- word: The raw word as is in the source text.
- lemma: The basic form of the token with no morphological changes.
- upos: Universal part-of-speech, more general than 'pos'.
- pos: Language specific part-of-speech. For example, you can see tokens that are labelled as ‘VERB’ under the 'upos' column are further specified into verb infinitive (VB), verb present tense (VBP), and modal verbs (MD).
?cnlp_get_token for more information.
Here is another data table you might be interested in:
This data frame contains annotations produced by grammatically parsing the text. In other words, they describe what roles the elements serve in each sentence. Though it is valuable information, I’m skeptical as to how accurate this parser is with learner language, so I won’t be dealing with this feature as much.
This concludes Part 1 of the series. In Part 2, I will go through how to process text files - single file, multiple files, and all files under a specific directory. Please contact me if you have any questions, corrections, or suggestions!