
Language: R, IDE: R Studio

I'm writing a script to extract and exclude specific information from a PDF file (a.k.a. a massive string). I used grep to split the string into the pages I want, and I'd like to slim the result down even further. My script for that is...

variablename <- grep("Additional Information:(?! )", AnyAdditionalInfoPages,   
     perl = TRUE, value = TRUE)

This works exactly how I want it. I'm new to R and regex, however, so I wanted to practice and I tried the following...

variablename <- grep("Additional Information:(?!\s)", AnyAdditionalInfoPages, 
    perl = TRUE, value = TRUE)

The result was - Error: '\s' is an unrecognized escape in character string starting ""Additional Information:(?!\s"

AND

variablename <- grep("Additional Information:(?!\\s)", AnyAdditionalInfoPages, 
    perl = TRUE, value = TRUE)

The result is an empty variable

> variablename
character(0)

What's going on? Why does a literal space " " work when the whitespace escape \s does not?

  • @MoeMichaelSmith It's kind of impossible to say anything other than whatever your input is doesn't get a match with your regular expression... – Dason Apr 12 '18 at 20:44
  • @Dason, my original one... grep("Additional Information:(?! )", does match what I want, exactly. I'm wondering why substituting the escape character for space in place of the real space in the parenthesis, doesn't work. Is there some fundamental difference between " " and \s? All the documentation I've seen says that a space " " should be included in \s. – MoeMichaelSmith Apr 12 '18 at 20:48
  • My comment was attempting to point out that you haven't provided a reproducible example. Try making a minimal reproducible example for us. In doing so I've found that a lot of times you might figure out the problem. If you don't then at the very least we will have actual code with actual data that will illustrate the problem. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Dason Apr 12 '18 at 21:55
  • @Dason, ah fair enough. I'll include an example, as close to the format I'm working with as I can. In this case, simplistic data was what was making things difficult, and the formatting from my more complex data was much different. – MoeMichaelSmith Apr 12 '18 at 22:37
  • Wiktor Strib. I took a look at your suggestion, and your so called "Exact Duplicate" is inaccurate. First of all, that question regards Oracle products not regex. While there happens to be similarity, the context of this question is vastly different. Additionally, the answer was not remotely close to the answer provided in the other context. Please take the time to read the question instead of assuming it's a duplicate. Thank you to Marcus Campbell for actually taking the time to treat my question with respect, instead of immediately dismissing it. – MoeMichaelSmith Apr 13 '18 at 20:01

1 Answer


Ah, this was a fun one to figure out.

The pattern "Additional Information:(?! )" rejects only strings in which a literal space immediately follows the ":", whereas (?!\\s) rejects strings in which any whitespace character follows it, including a tab, carriage return, or newline. One possible explanation is that the vector you are parsing contains "non-space" forms of whitespace right after the colon: the first pattern keeps those strings, and the second rejects all of them.

AnyAdditionalInfoPages <- c("Additional Information: page 20", # one space
                            "Additional Information:  page 7", # two spaces
                            "Additional Information:\tpage 50", # tab
                            "Additional Information:\npage 60") # newline

# Print vector to observe true formatting (one element per line)
cat(AnyAdditionalInfoPages, sep = "\n")

# Output:
Additional Information: page 20
Additional Information:  page 7
Additional Information:       page 50
Additional Information:
page 60


# Negative lookahead for spaces *only*
variablename <- grep("Additional Information:(?! )", AnyAdditionalInfoPages,
                     perl = TRUE, value = TRUE)
variablename
# Output
[1] "Additional Information:\tpage 50"  "Additional Information:\npage 60"

# Negative lookahead for *any* whitespace
variablename <- grep("Additional Information:(?!\\s)", AnyAdditionalInfoPages,
                     perl = TRUE, value = TRUE)
variablename
# Output
character(0)
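
On the first error in the question: the "unrecognized escape" message comes from R's string parser, which rejects "\s" before grep ever sees the pattern. The sketch below shows the two spellings that do reach the regex engine. It assumes R 4.0.0 or later for the raw-string form, and `pattern_escaped` and `pattern_raw` are illustrative names that do not appear in the original post.

```r
# "\\s" in R source is two characters, a backslash and an "s"; the regex
# engine then reads that pair as the whitespace class.
pattern_escaped <- "Additional Information:(?!\\s)"

# Raw strings (assumes R >= 4.0.0) pass backslashes through without doubling.
pattern_raw <- r"[Additional Information:(?!\s)]"

nchar("\\s")                             # 2
identical(pattern_escaped, pattern_raw)  # TRUE
```

Both variables hold exactly the same pattern text, so either can be passed to grep with perl = TRUE.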
Marcus Campbell
  • Marcus apologies for the weird asterisks. I was trying to bold the differences between my first one and my 2nd/3rd. Then someone suggested an edit and they came through as actual text rather than formatting. That's all gone now. The only thing I can tell that is different with yours is the 'c(' at the beginning of your variable declaration. What is this used for? Is it vital? – MoeMichaelSmith Apr 12 '18 at 21:47
  • Ah I see, your question makes much more sense now. Have you tried running your code again? I just tried it and using `\\s` worked fine. – Marcus Campbell Apr 12 '18 at 21:55
  • Also, `c()` stands for **c**oncatenate. It is simply a function call for constructing vectors, although it can also combine some other things; see https://stackoverflow.com/questions/11488820/why-use-c-to-define-vector – Marcus Campbell Apr 12 '18 at 21:59
  • See my updated answer. – Marcus Campbell Apr 12 '18 at 22:29
  • Oh cool. c() seems like a neat trick. I'll have to remember that. – MoeMichaelSmith Apr 12 '18 at 22:29
  • AH! That's wonderful! I do indeed have \\r\\n in my text after "Additional Information:". That totally makes sense now: because I was using \\s in a negative lookahead, I was negating all whitespace, not just a single space. – MoeMichaelSmith Apr 12 '18 at 22:35
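
The `\r\n` case from the last comment can be checked and worked around roughly as follows. This is only a sketch on made-up sample strings: `pages`, `after_colon`, `codes`, and `kept` are hypothetical names, and it assumes the goal is to reject a space or tab after the colon while still accepting a line break.

```r
# Hypothetical sample data, one entry per kind of whitespace
pages <- c("Additional Information: page 20",     # space
           "Additional Information:\tpage 50",    # tab
           "Additional Information:\r\npage 70")  # Windows line break

# Pull out the single character right after the colon ...
after_colon <- regmatches(
  pages,
  regexpr("(?<=Additional Information:)[\\s\\S]", pages, perl = TRUE)
)
# ... and show its code point: 32 = space, 9 = tab, 13 = carriage return
codes <- vapply(after_colon, utf8ToInt, integer(1), USE.NAMES = FALSE)
codes

# Reject only a space or tab, so entries followed by a line break are kept
kept <- grep("Additional Information:(?![ \\t])", pages,
             perl = TRUE, value = TRUE)
kept
```

Inspecting code points makes invisible characters such as a carriage return visible, which is usually quicker than guessing from printed output. The character class `[ \\t]` is one way to express "horizontal whitespace only"; other spellings exist.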