I'm working on some text analysis and am at the processing phase trying to stem and clean up words in the corpus.
The context is users who submit feedback via an online survey, so I'm finding many typos throughout the data.
Most responses to the survey are short: at most 2 or 3 sentences, but typically just 2 to 5 words.
Each response is a document.
I was using gsub() to replace what appear to be mistyped words but could not wrap my head around some points. The following is an example of a line in a for() loop that goes through each document j and replaces using gsub():
docs[[j]] <- gsub("(^|\\s)anymor(\\s|$)", "\\sanymore\\s", docs[[j]])
The intent is to replace any misspelt instance of "anymor" with "anymore".
My question:
I started the expression with (^|\\s) and ended it with (\\s|$) to search for any instances of "anymor" as a free-standing word. So the expression would presumably return 3 components: the starting piece (a caret or a whitespace character), the string itself, and the ending piece (a whitespace character or the end of the document). How does gsub know which component to replace? I tested in the console and it replaced the middle one, which happens to be what I want.
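To make the "components" part of the question concrete, here is a minimal sketch of how I think the pieces of a match can be inspected. The string x is a made-up example, not a real survey response, and I'm assuming regexec() treats the pattern the same way gsub() does by default.

```r
# Hypothetical one-line document, not taken from the real survey data
x <- "dont use it anymor thanks"
pattern <- "(^|\\s)anymor(\\s|$)"

# regexec() reports the full match first, then one entry per
# parenthesised group in the pattern
parts <- regmatches(x, regexec(pattern, x))[[1]]
print(parts)
```

If I read this correctly, the first element is the whole matched text including the surrounding whitespace, and the remaining elements are the two parenthesised groups. The word "anymor" itself is not in parentheses, so it does not show up as a separate component.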
Putting the question another way, I use the starting and ending components (^|\\s) and (\\s|$) just to isolate the correct string (so that, e.g., I don't replace correctly spelt instances of "anymore", which would otherwise result in "anymoree").
At either side of the second parameter in my gsub I added a space, \\s. This is to ensure I maintain whole words. If I wanted to avoid having to do that, and just replace the matched string rather than the expressions used to find the correct pattern, how would I specify that? In other words, I don't want to replace the spaces that only help pinpoint the words to be replaced.
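From reading ?regex, two approaches look like they might do this, sketched below on made-up strings. The first puts back whatever the two groups matched using the backreferences \\1 and \\2; the second drops the groups and uses the word-boundary assertion \\b, with perl = TRUE as a precaution. I'm assuming here that a word boundary is close enough to what I mean by "free-standing", even though it would also match "anymor" followed by punctuation.

```r
# Hypothetical documents, not real survey responses
x <- c("i dont shop here anymor", "anymor updates please", "not anymore thanks")

# Option 1: keep the surrounding groups by referring back to them
fix_backref <- gsub("(^|\\s)anymor(\\s|$)", "\\1anymore\\2", x)

# Option 2: word boundaries match a position, not a character,
# so nothing around the word is consumed or needs restoring
fix_boundary <- gsub("\\banymor\\b", "anymore", x, perl = TRUE)

print(fix_backref)
print(fix_boundary)
```

Both calls are applied to the whole vector at once, which may also mean the for() loop over documents is unnecessary, depending on what kind of object docs is.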
I've got a feeling I've been overthinking this. Any help on understanding gsub and the regex it uses would be appreciated.