
I'm working on some text analysis and am at the processing phase, trying to stem and clean up words in the corpus.

The context is users who submit feedback via an online survey. As a result, I'm finding many, many typos throughout the data.

Most responses to the survey are short: at most 2 or 3 sentences, but typically just 2- to 5-word answers.

Each response is a document.

I have been using gsub() to replace what appear to be mistyped words, but I can't quite wrap my head around a couple of points. The following is an example of a line in a for() loop that goes through each document j and makes replacements with gsub():

docs[[j]] <- gsub("(^|\\s)anymor(\\s|$)", "\\sanymore\\s", docs[[j]])

So: replace any misspelt instances of "anymor" with "anymore".

My question:

I started the expression with (^|\\s) and ended it with (\\s|$) to search for instances of "anymor" as a free-standing word. So the expression would presumably return three components: the starting piece (a caret, i.e. start of string, or a space), the string itself, and the ending piece (a space or the end of the document). How does gsub() know which component to replace? I tested in the console and it replaced the middle one, which happens to be what I want.
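
For reference, I checked what the pattern itself matches on a made-up sample string:

doc <- "dont like this product anymor it broke"  # made-up sample response
regmatches(doc, gregexpr("(^|\\s)anymor(\\s|$)", doc))
# [[1]]
# [1] " anymor "

So the match itself includes the surrounding spaces, not just "anymor".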

Putting the question another way: I use the starting and ending components (^|\\s) and (\\s|$) just to isolate the correct string (so that, for example, I don't turn correctly spelt instances of "anymore" into "anymoree").

On either side of the second argument to gsub() (the replacement) I added \\s, intending it as a space, to ensure I maintain whole words. If I wanted to avoid having to do that, i.e. replace just the matched string rather than the surrounding expressions used to find the correct pattern, how would I specify that? In other words, how do I avoid replacing the spaces that are only there to help pinpoint the words to be replaced?
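
To make that concrete, on a made-up sample string, replacing with just "anymore" consumes the matched spaces:

# made-up sample; the spaces matched by (^|\\s) and (\\s|$) are consumed too
gsub("(^|\\s)anymor(\\s|$)", "anymore", "dont want it anymor thanks")
# [1] "dont want itanymorethanks"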

I've got a feeling I've been overthinking this. Any help understanding gsub() and the regex it uses would be appreciated.

Doug Fir
  • Why didn't you use word boundaries? `gsub("\\banymor\\b", "anymore", docs[[j]])`? Is there any reason to discard them? Do you have to deal with hyphenated words? If yes, I'd rather go with a PCRE regex like `gsub("(^|\\s)anymor(?!\\S)", "\\1anymore", docs[[j]], perl=TRUE)` to ensure consecutive (overlapping) matches. Another point: use backreferences in the replacement part, not regex patterns, to reinsert the captured text back to the translation result (see `\\1` in my example above). – Wiktor Stribiżew Jun 21 '16 at 07:29
  • Hi @WiktorStribiżew, thanks for the information and links to other answers. "Another point: use backreferences in the replacement part, not regex patterns, to reinsert the captured text back to the translation result (see \\1 in my example above)." I have not encountered \\1 before. Could you expand on this sentence if you have a chance? – Doug Fir Jun 21 '16 at 07:42
  • Oh, I think I get it: 1 as in the second element of a zero-based index of returned expressions? – Doug Fir Jun 21 '16 at 07:44
  • As for `\1`, it is a [**replacement backreference**](http://www.regular-expressions.info/replacebackref.html). Numbers correspond to the group order from left to right. However, unless you clarify, the question looks like a duplicate to me. – Wiktor Stribiżew Jun 21 '16 at 07:47
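
Edit: following Wiktor's comments above, here is my current understanding of backreferences, on a made-up sample string (corrections welcome):

doc <- "dont like this product anymor it broke"  # made-up sample

# \\1 and \\2 reinsert whatever (^|\\s) and (\\s|$) captured, so the
# surrounding whitespace survives; groups are numbered 1, 2, ... from
# left to right (not zero-based)
gsub("(^|\\s)anymor(\\s|$)", "\\1anymore\\2", doc)
# [1] "dont like this product anymore it broke"

# or sidestep the capture groups entirely with word boundaries
gsub("\\banymor\\b", "anymore", doc)
# [1] "dont like this product anymore it broke"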
