Scoring texts for the presence of phrases

programming

Author: Patrick Baylis
Published: December 5, 2019

In my text analysis work, I frequently score texts for the presence or absence of various "keywords". Because I work with some large corpora (collections of texts), for example the billions of tweets in my job market paper, this can be a time-consuming task. I have previously done most of this in Python, but right now I'm also interested in doing it quickly in R for ad hoc analyses.

In this post I test two different methods for detecting words: a simpler `grepl`-based approach that uses a regex search to identify any texts containing at least one of the keywords, and a slightly more involved approach that first tokenizes the texts (i.e., breaks sentences up into their component word "tokens") and then searches through those tokens to see if any match the given keywords.

library(tidyverse)

# Load 50 sample tweets
example_tweets <- read_csv("tweet-examples.csv") %>% pull(tweet50) 
# Simulate a corpus of 250,000 tweets
corpus <- rep(example_tweets, 5000) 

# Create the vector of ten words to search for
keywords <- c("money", "beer", "friends", "fhdjasl", "dfjha", "dshjfsa", "fdjh", "vjhd", "cvmna", "dfhda")

# Score using grepl approach ----
# We only want to match whole words (not substrings), so we wrap the pattern in word boundaries and paste all of the keywords together with |
score_grepl <- function(texts, keywords) {
    sapply(texts, function(x) grepl(sprintf("\\b(%s)\\b", paste0(keywords, collapse = "|")), x), USE.NAMES = F)
}

system.time(score_grepl(corpus, keywords[1])) # About 2.85s
system.time(score_grepl(corpus, keywords)) # About 9.5s
system.time(score_grepl(corpus, rep(keywords, 10))) # About 120s
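# For intuition, here is the pattern that score_grepl() builds: all of the
# keywords pasted together with | and wrapped in word boundaries. (Shown with
# the first three keywords only; this line is illustrative and not timed.)
sprintf("\\b(%s)\\b", paste0(keywords[1:3], collapse = "|"))
# [1] "\\b(money|beer|friends)\\b"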

# Score with tokenizer approach using tokenizers package ----
library(tokenizers)
# Note that this is a bit more sophisticated than just using grepl, since we ask it to strip out punctuation and URLs.
score_tokenizers <- function(texts, keywords) {
    tkns <- tokenize_tweets(texts, strip_punct = TRUE, strip_url = TRUE)
    sapply(tkns, function(x) any(keywords %in% x), USE.NAMES = F)
}

system.time(score_tokenizers(corpus, keywords[1])) # About 5s
system.time(score_tokenizers(corpus, keywords)) # About 5s
system.time(score_tokenizers(corpus, rep(keywords, 10))) # About 6s
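# As a usage example (not part of the timings), the scores can be attached
# back to the texts for downstream analysis; the column names here are
# just for illustration.
scored <- tibble(text = corpus, has_keyword = score_tokenizers(corpus, keywords))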

What did I learn here? If I'm only interested in detecting a small number of keywords, a simple grepl-based approach is fine. But if I want to search for more than 5-10 keywords, tokenizing first is better (I also tested a tokenizing approach that uses the quanteda package, but did not see any notable differences in performance). The same logic applies when creating multiple scores from the same set of texts, since the tokenization only needs to happen once.
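For reference, a quanteda-based version of the tokenizer approach might look roughly like the sketch below (the function name and exact settings are just for illustration). quanteda::tokens() stands in for tokenize_tweets(); unlike tokenize_tweets(), it does not lowercase by default, so that step is added explicitly.

library(quanteda)
score_quanteda <- function(texts, keywords) {
    # tokens() does not lowercase by default, so lowercase to match tokenize_tweets()
    tkns <- tokens_tolower(tokens(texts, remove_punct = TRUE, remove_url = TRUE))
    sapply(as.list(tkns), function(x) any(keywords %in% x), USE.NAMES = F)
}

And since the tokenization is the expensive step, tokenizing once and reusing the tokens is the natural way to build multiple scores from the same texts. A minimal sketch, with made-up keyword sets for illustration:

# Tokenize once...
tkns <- tokenize_tweets(corpus, strip_punct = TRUE, strip_url = TRUE)
# ...then score against as many keyword sets as needed
keyword_sets <- list(
    money = c("money", "cash"),
    social = c("beer", "friends")
)
scores <- lapply(keyword_sets, function(kw) {
    sapply(tkns, function(x) any(kw %in% x), USE.NAMES = F)
})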