r ngram extraction with regex

Question

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).

Below I give a minimal example and the desired output. The problem with my attempt is twofold:

1. The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (e.g., I want `like` to be available for `like toast` after it has already been consumed in `I like`)
2. I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used `(?:\\s*)`). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with `"(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)"` for a 2-gram, but I want to extend the solution to n-grams. PS: I know about `\\w`, but I don't consider underscores and numbers to be word parts, and I do consider `'` a word part.

MWE:

library(stringi)

x <- "I like toast and jam."

stringi::stri_extract_all_regex(
    x,
    pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)

## [[1]]
## [1] "I like "    "toast and "

Desired Output:

## [[1]]
## [1] "I like"  "like toast"    "toast and"  "and jam"

Maybe the best approach to problem # 2 is: `"(\\b[A-Za-z']+\\s+){1}(\\b[A-Za-z']+)"` where you extend the regex by adjusting the 1 to `n-1` — Tyler Rinker, Jun 23 '15 at 13:07
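
The comment's suggestion can be sketched as below. `ngram_pattern` is a hypothetical helper name, not something from the thread. Note that this only addresses the trailing-space issue (problem 2): matches still do not overlap, so problem 1 remains.

```r
library(stringi)

# Hypothetical helper: n - 1 words each followed by whitespace, then a final
# word with nothing after it, so the match never ends in a space
ngram_pattern <- function(n) {
  sprintf("(\\b[A-Za-z']+\\s+){%d}(\\b[A-Za-z']+)", n - 1L)
}

x <- "I like toast and jam."
res <- stri_extract_all_regex(x, ngram_pattern(2))
res
## [[1]]
## [1] "I like"    "toast and"
```

The trailing space is gone, but `like` is still consumed by the first match and `jam` has no partner left, which is why the overlapping-match technique is needed.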

Matthew Plourde · Accepted Answer · 2015-06-23T13:21:44.410

8

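Here's one way using base R regex. This can be easily extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., `(?=(my_overlapping_pattern))`. A look-ahead consumes no characters, so after each match the engine moves on by a single position and every word is still available to start (or be part of) the next match.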

x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# The look-ahead itself matches zero characters, so copy the capture group's
# length into match.length before handing the result to regmatches()
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)

# [[1]]
# [1] "I like"     "like toast" "toast and"  "and jam"
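
One way the extension to arbitrary n might look, as a minimal sketch rather than the answerer's own code. `regex_ngrams` is a hypothetical helper name. It assumes words are separated by a single space, as in the pattern above. It also swaps `\\b` for a negative look-behind, because `\\b` matches between `'` and a letter: with the original pattern, input such as "don't like" would also yield "'t like" and "t like".

```r
# Hypothetical helper: overlapping n-grams via a capture group in a look-ahead
regex_ngrams <- function(x, n = 2) {
  word <- "[A-Za-z']+"
  # (?<!...) ensures a match can only start at the beginning of a word;
  # the look-ahead captures n words joined by single spaces
  pattern <- sprintf("(?<![A-Za-z'])(?=(%s(?: %s){%d}))", word, word, n - 1L)
  matches <- gregexpr(pattern, x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    m <- matches[[i]]
    if (m[1] == -1) return(character(0))  # no n-gram of this size
    start <- as.vector(attr(m, "capture.start")[, 1])
    len   <- as.vector(attr(m, "capture.length")[, 1])
    substring(x[i], start, start + len - 1)
  })
}

regex_ngrams("I like toast and jam.", 3)
## [[1]]
## [1] "I like toast"   "like toast and" "toast and jam"
```

Reading the capture positions directly with `substring()` avoids modifying the `match.length` attribute, which keeps the helper usable on character vectors with more than one element.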

edited Jun 23 '15 at 13:21

answered Jun 23 '15 at 13:14

Matthew Plourde


I think this is related for future searchers: http://stackoverflow.com/a/25800334/1000343 The relevant terminology that I was missing is *Overlapping matches* Thanks for the response. – Tyler Rinker Jun 23 '15 at 22:02
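
Since the question started from stringi, here is a minimal sketch of the same overlapping-match idea there. It relies on `stri_match_all_regex()` returning one character matrix per input string, with the whole match in column 1 and the first capture group in column 2. The pattern is the one from the accepted answer, so the same single-space assumption applies.

```r
library(stringi)

x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"

# Column 1 is the whole match (empty here, because a look-ahead consumes
# nothing); column 2 holds capture group 1, i.e. the 2-gram itself
m <- stri_match_all_regex(x, pattern)
bigrams <- lapply(m, function(mat) mat[, 2])
bigrams
```

If a string contains no match, the corresponding matrix holds `NA`, so the extracted element is `NA` rather than an empty vector.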

score 2 · Answer 2 · answered Jun 23 '15 at 15:04


Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.

install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like"     "like_toast" "toast_and"  "and_jam"   
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like"     "like toast" "toast and"  "and jam"

No painful regexes required!
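
The `ngrams()` call above reflects the package as it was in mid-2015. In later quanteda releases, to the best of my knowledge, n-grams are built from a tokens object using `tokens()` and `tokens_ngrams()`. The sketch below is written under that assumption, so check the argument names against your installed version.

```r
library(quanteda)

x <- "I like toast and jam."
# Tokenize first; remove_punct = TRUE drops the final "."
toks <- tokens(x, remove_punct = TRUE)
# Build 2-grams joined by a space instead of the default "_"
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
as.list(bigrams)
```

Case is preserved here because `tokens()` does not lowercase its input.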

answered Jun 23 '15 at 15:04

Ken Benoit


I yield, you're right about the question but hopefully someone searching for questions about word ngrams will find this useful! – Ken Benoit Jun 24 '15 at 11:05

Thanks, @Ken-Benoit, this looks like a useful package and I am looking forward to checking it out. – Peter Verbeet Jul 01 '15 at 07:36

I used this at work today. Nice upgrades to the package. +1 – Tyler Rinker Jul 18 '15 at 01:21
Thanks @Tyler Rinker. I hope to have time soon to write some extensions to make it easy to convert or use quanteda objects with qdap. I could start using the "tm-qdap" vignette as a basis. – Ken Benoit Jul 30 '15 at 12:30
