30

I'd like to extract elements beginning with digits from a character vector but there's something about POSIX regular expression syntax that I don't understand.

I would think that

vec <- c("012 foo", "305 bar", "other", "notIt 7")
grep(pattern="[:digit:]", x=vec)

would return 1 2 4 since they are the three elements that have digits somewhere in them. But in fact it returns 3 4.

Likewise, grep(pattern="^0", x=vec) returns 1, as I would expect, because element 1 starts with a zero. However, grep(pattern="^[:digit:]", x=vec) returns integer(0), whereas I would expect it to return 1 2, since those are the elements that start with digits.

How am I misunderstanding the syntax?

Drew Steen
  • 1
    Note that in *stringr* ICU regex patterns, you may use `[:digit:]` without extra brackets. However, it is advisable to keep them for cross-engine compatibility. – Wiktor Stribiżew Sep 06 '17 at 09:54

3 Answers

39

Try

grep(pattern="[[:digit:]]", x=vec)

instead, because a POSIX character class name such as [:digit:] is only recognised inside a bracket expression, so it needs its own enclosing pair of brackets.
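As a minimal sketch of the fix applied to the vector from the question (the variable names `any_digit` and `leading_digit` are only illustrative):

```r
vec <- c("012 foo", "305 bar", "other", "notIt 7")

# The outer brackets open a bracket expression; [:digit:] inside it names the class.
any_digit     <- grep(pattern = "[[:digit:]]", x = vec)
leading_digit <- grep(pattern = "^[[:digit:]]", x = vec)

print(any_digit)      # 1 2 4
print(leading_digit)  # 1 2

# value = TRUE returns the matching elements instead of their indices.
print(grep(pattern = "^[[:digit:]]", x = vec, value = TRUE))  # "012 foo" "305 bar"
```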

Dirk Eddelbuettel
  • That gives the answer I was looking for. But can you explain what pattern="[:digit:]" was doing? I can't make any sense out of the first result, where grep(pattern="[:digit:]", x=vec) gives 3 4. I think that points to a big issue that I don't understand. – Drew Steen Jul 17 '12 at 15:45
  • 11
    It looks for colon, d, i, g, or t. – tripleee Jul 17 '12 at 15:46
  • Yep - it all makes sense now. Thanks. – Drew Steen Jul 17 '12 at 15:49
  • Thanks @Dirk I spent an afternoon trying to figure that one out. – Phil Sep 28 '15 at 12:14
  • 2
    The (currently nine) upvotes for @tripleee's comment notwithstanding, it is entirely wrong. See e.g. [here](http://www.rdocumentation.org/packages/base/functions/regex.html) for documentation on this regular expression grammar defining `[[:digit:]]` as, well, a digit. – Dirk Eddelbuettel Sep 28 '15 at 12:31
  • 3
    I understand the question in the first comment to mean, what does `[:digit:]` -- as opposed to the correct `[[:digit:]]` -- actually do. – tripleee Sep 28 '15 at 12:32
  • My bad, and apologies. Read that way, and in response to the preceding comment, it makes perfect sense. – Dirk Eddelbuettel Sep 28 '15 at 12:38
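To illustrate the point made in the comments above, here is a small sketch, assuming R's default regex engine (`perl=FALSE`): the single-bracket pattern behaves like an ordinary bracket expression listing the characters `:`, `d`, `i`, `g` and `t`. The names `single` and `explicit` are only illustrative.

```r
vec <- c("012 foo", "305 bar", "other", "notIt 7")

# "[:digit:]" is read as "any one of the characters : d i g t".
single   <- grep(pattern = "[:digit:]", x = vec)

# The same character set written out explicitly, without repeats.
explicit <- grep(pattern = "[:digt]", x = vec)

# Both pick out "other" and "notIt 7", which contain a lowercase t;
# the digits in elements 1, 2 and 4 play no part in the match.
print(single)
print(explicit)
print(identical(single, explicit))
```

This is also why `^[:digit:]` returned `integer(0)` in the question: no element of `vec` starts with one of those five characters.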
15

Another solution

grep(pattern="\\d", x=vec)

Wojciech Sobala
  • However, the Perl-like shorthand character classes can't be used inside negated bracket expressions in the TRE patterns (those with `perl=FALSE`). – Wiktor Stribiżew Sep 06 '17 at 09:52
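A short sketch of the `\\d` shorthand on the question's vector, under both the default engine and `perl=TRUE`. For the negated case mentioned in the comment above, the example uses the POSIX class inside the bracket expression with the default engine and keeps `[^\\d]` for `perl=TRUE` only. The variable names are illustrative.

```r
vec <- c("012 foo", "305 bar", "other", "notIt 7")

# Outside a bracket expression, \\d works with both engines.
default_engine <- grep(pattern = "^\\d", x = vec)
perl_engine    <- grep(pattern = "^\\d", x = vec, perl = TRUE)
print(default_engine)  # 1 2
print(perl_engine)     # 1 2

# Stripping everything that is not a digit:
# POSIX class inside a negated bracket expression (default engine) ...
posix_only_digits <- gsub(pattern = "[^[:digit:]]", replacement = "", x = vec)
# ... or the shorthand inside the brackets, with perl = TRUE.
perl_only_digits  <- gsub(pattern = "[^\\d]", replacement = "", x = vec, perl = TRUE)
print(posix_only_digits)  # "012" "305" "" "7"
print(perl_only_digits)   # "012" "305" "" "7"
```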
6
From man 7 regex:

Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class. Standard character class names are:

         alnum       digit       punct
         alpha       graph       space
         blank       lower       upper
         cntrl       print       xdigit

Therefore a character class that is the sole member of a bracket expression will look like double-brackets, such as [[:digit:]]. As another example, consider that [[:alnum:]] is equivalent to [[:alpha:][:digit:]].
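
In other words, the form to use is [[:CLASS_NAME:]]. A small sketch in R checking the stated equivalence on a few ASCII strings (the `samples` vector and variable names are made up for illustration):

```r
samples <- c("abc", "123", "a1", "!?", " ", "")

# One class inside the bracket expression ...
alnum_match <- grepl(pattern = "[[:alnum:]]", x = samples)

# ... versus two classes listed side by side in a single bracket expression.
union_match <- grepl(pattern = "[[:alpha:][:digit:]]", x = samples)

print(alnum_match)  # TRUE TRUE TRUE FALSE FALSE FALSE
print(identical(alnum_match, union_match))  # TRUE
```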

Blue Magister
alinsoar
  • 1
    Is it possible that you're not referring to R syntax? Man is not a command in R. – Drew Steen Jul 17 '12 at 15:48
  • Within R the equivalent resource is `?regex` – Josh O'Brien Jul 17 '12 at 15:53
  • He asked about the POSIX standard, and that page is the POSIX definition. – alinsoar Jul 17 '12 at 16:43
  • 1
    He didn't ask for the definition, though, he asked "How am I misunderstanding the syntax?" Just telling him to RTFM is not really that helpful. – Matt Parker Jul 17 '12 at 17:04
  • 1
    And I explained the syntax to him very clearly, by quoting the definition. – alinsoar Jul 17 '12 at 17:07
  • But he's clearly using that syntax in his code and it's not working. That suggests to me that simply regurgitating the man page (and not even the help page for the language he's working in) is not going to be that useful. It's irrelevant if you technically correctly answered the question (which I don't believe you did); what's relevant is if you helped the person. – Matt Parker Jul 17 '12 at 17:16
  • Please also see [this meta discussion](http://meta.stackexchange.com/questions/98959/rtfm-like-answers-flag-them-or-allow-them) on RTFM questions. The first answer there gives an example of how to cite the manual in a useful, upvote-worthy way. – Matt Parker Jul 17 '12 at 17:17
  • 1
    I thought the manual's wording was self-evident: *Within a bracket expression, the name of a character class enclosed in "[:" and ":]"* -- that means [[:CLASS_NAME:]] – alinsoar Jul 17 '12 at 17:47
  • Ah, I see what you mean. That's a fairly subtle wording, at least to me. If you'd just add " -- that means [[:CLASS_NAME:]]" to the bottom of your answer, just to make it extra clear, I'd gladly upvote it. – Matt Parker Jul 17 '12 at 18:10
  • 1
    That's fine. I just thought that since I had critiqued your answer, I ought to be explicit about what I think would improve it. – Matt Parker Jul 17 '12 at 20:39