I am having trouble with a regular expression in R. The goal is to parse a Markdown/reST/knitr report text file in R to remove my own custom comments. These comments are put in the following form:

Some sentence is about something <find a citation to this>.

As Markdown uses <> for HTML tags, I need to remove these comments (with my custom function) to avoid confusion. After I do that, the sentence takes the following form:

Some sentence is about something .

Note the space between the last word and the dot. That is easy to remove, but the text may also contain reST comments that embed R code (knitr), and these begin with ..:

.. {r chunk-name}
.. some R code 
.. ..

So basically I need to replace the " ." in the former case, but not in the latter. I thought I would achieve this using the repetition quantifier of R regexp atoms:

gsub(pattern=" \\.{1}",replacement=".",x="Something ..")
[1] "Something.."

I was expecting this expression to match a single space followed by exactly one dot, and no more. However, the string gets replaced regardless of whether there is one dot or two. I am a real newbie with this, so I am probably missing something obvious. Even so, any help will be greatly appreciated.

Regards, Maxim

Maxim.K

3 Answers

The match succeeds as soon as the pattern is satisfied: \\.{1} matches exactly one dot, but nothing checks what comes after it, so a second dot does not prevent the match. I'm not sure if it's general enough, but a negated character class works in the offered test cases:

> gsub(pattern=" \\.[^.]| \\.$",replacement=".",x="Something .")
[1] "Something."
> gsub(pattern=" \\.[^.]| \\.$",replacement=".",x="Something ..")
[1] "Something .."
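
A caveat worth checking before using this on running text: the first alternative also consumes the character after the dot, so that character is dropped by the replacement. The sketch below first shows that {1} adds nothing to a single atom, and then tries a variant that captures the following character and puts it back. The variant and the name fix_space are assumptions for illustration, not part of the answer above; perl = TRUE is used there only to keep the behaviour of the group containing $ unambiguous.

```r
# `\\.{1}` behaves like `\\.`: it says nothing about what follows the match.
with_quantifier    <- gsub(" \\.{1}", ".", "Something ..")
without_quantifier <- gsub(" \\.",    ".", "Something ..")
print(identical(with_quantifier, without_quantifier))  # TRUE

# The character class consumes the character after the dot,
# so the space before "Next" is lost.
eaten <- gsub(" \\.[^.]| \\.$", ".", "Something . Next")
print(eaten)  # "Something.Next"

# Hypothetical variant: capture whatever follows the dot (a non-dot
# character, or the end of the string) and re-insert it with \\1.
fix_space <- function(x) gsub(" \\.([^.]|$)", ".\\1", x, perl = TRUE)

kept <- fix_space(c("Something . Next", "Something .", "Something ..", ".. .."))
print(kept)
```

With this variant the reST lines starting with .. are left alone, because the space-dot pair in them is always followed by another dot.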
IRTFM
  • Thank you. First, this helps per se, and second it is useful for me as a beginner to look at more complex regexps applicable to some practical example at hand. – Maxim.K Mar 20 '13 at 15:08

You can remove everything from the last space before the < up to the end of the string, and then paste a . onto the end, no?

tt <- "Some sentence is about something <find a citation to this>."
# anything, followed by a single space, followed by anything up to
# a <, followed by anything until the end of the string
paste0(gsub("(.*)[ ].*<.*$", "\\1", tt), ".")
# [1] "Some sentence is about something."

That said, you should really read this.

Alternatively, if the markup occurs in the middle of a sentence and you just want to remove it together with the spaces in front of it, then:

# remove everything within <...> including < and >,
# and any spaces immediately before the <
gsub("[ ]*<.*?>", "", tt)
# [1] "Some sentence is about something."

# example:
tt <- ".. some sentences are wrong <bla bla>. But some are <bla bla> right."
gsub("[ ]*<.*?>", "", tt)
# [1] ".. some sentences are wrong. But some are right."

Note the difference between .*> and .*?>. The first one is "greedy" in the sense that it will match all characters up to the last > in the string, whereas the second one stops at the first > it reaches. The second is what you want here, since every occurrence should be removed separately.
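
To see the greedy and lazy behaviour side by side, here is a small sketch that extracts the matches instead of deleting them. It reuses the example string from above; the variable names and the negated-class alternative <[^>]*> are additions for illustration only.

```r
tt <- ".. some sentences are wrong <bla bla>. But some are <bla bla> right."

# Greedy: a single match running from the first "<" to the last ">".
greedy <- regmatches(tt, gregexpr("<.*>", tt))[[1]]

# Lazy: each match stops at the nearest ">".
lazy <- regmatches(tt, gregexpr("<.*?>", tt))[[1]]

# A negated character class finds the same matches without a lazy quantifier.
negated <- regmatches(tt, gregexpr("<[^>]*>", tt))[[1]]

print(greedy)   # "<bla bla>. But some are <bla bla>"
print(lazy)     # "<bla bla>" "<bla bla>"
print(identical(lazy, negated))  # TRUE
```

With the greedy pattern, everything between the two comments would be deleted along with them.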

Arun
  • Thank you. First, this helps per se, and second it is useful for me as a beginner to look at more complex regexps applicable to some practical example at hand. – Maxim.K Mar 20 '13 at 15:08
  • Sure. I'd prefer the 2nd (or last) version though. – Arun Mar 20 '13 at 15:12
  • Yes, the second option is more elegant indeed. – Maxim.K Mar 20 '13 at 15:34
  • By the way, what did you mean by the reference to the HTML parsing thread? That regex is not complex enough to parse natural language? I'll agree of course, but since I fully control the markup of my own text reports, there should not be any serious issues. Perhaps switching to something less ambivalent than <> for comments is in order though. – Maxim.K Mar 20 '13 at 15:54
  • If your text has *only* < and >, then it's okay. But if you're looking to parse an HTML file, then you should read that post. – Arun Mar 20 '13 at 15:55

You can accomplish what you want using a negative lookahead, which is available in Perl-style regular expressions (perl=TRUE). The construct (?!\\.) says: match the preceding pattern only if it is not immediately followed by a dot. The lookahead itself consumes no characters. A quick example:

> gsub(pattern=" \\.(?!\\.)",replacement=".",x="Something .", perl=TRUE)
[1] "Something."
> gsub(pattern=" \\.(?!\\.)",replacement=".",x="Something ..", perl=TRUE)
[1] "Something .."
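
Putting the two steps from the question together, a minimal sketch could look as follows. The function name strip_comments and the sample report lines are assumptions made up for this illustration; only gsub and the patterns discussed in this thread are used.

```r
# Hypothetical helper: strip <...> comments, then close up the " ." that is
# left behind, but only when the dot is not followed by another dot.
# The (?!\\.) lookahead requires perl = TRUE.
strip_comments <- function(x) {
  x <- gsub("<.*?>", "", x, perl = TRUE)
  gsub(" \\.(?!\\.)", ".", x, perl = TRUE)
}

# Assumed sample input: one commented sentence plus a reST/knitr chunk.
report <- c(
  "Some sentence is about something <find a citation to this>.",
  ".. {r chunk-name}",
  ".. some R code",
  ".. .."
)

cleaned <- strip_comments(report)
print(cleaned)
```

One thing to watch for: a chunk line that contains an R assignment followed by a greater-than sign, such as x <- a > b, would also match <.*?> and be mangled, which is another argument for a less ambiguous comment delimiter.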
Greg Snow