5

I want a regular expression that matches anything that is not a correctly formatted number. The list below is sample input for the regex:

1

1.7654

-2.5

2-

2.

m

2..3

2....233..6

2.2.8

2--5

6-4-9

So the first three (`1`, `1.7654` and `-2.5`) should not get selected and the rest should. This is closely related to another post, but because it asks for the negated match, it is different.

I'm using R, but any regular expression flavor will do, I guess. The following is the best attempt from the mentioned post:

a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
grep(pattern="(-?0[.]\\d+)|(-?[1-9]+\\d*([.]\\d+)?)|0$", x=a)

which outputs:

[1]  1  2  3  4  5  7  8  9 10 11
Community
Mehrad Mahmoudian
  • 7
    `a[is.na(as.numeric(a))]` comes pretty close except for the "2." – talat Jul 13 '15 at 15:14
  • 1
    Do you care about leading zeroes? Do you want "012" to match, or not? I guess "0.12" has to match. What about trailing zeroes, like "0.1200"? – Spacedman Jul 13 '15 at 15:16
  • @Spacedman I guess those are also mathematically correct numbers. 2==0002=2.0000=0002.000 – Mehrad Mahmoudian Jul 13 '15 at 15:25
  • @docendodiscimus 's seems best. – MichaelChirico Jul 13 '15 at 15:32
  • @MichaelChirico the issue is that it generates a warning (NAs introduced by coercion), which then has to be suppressed. Also, I'm not sure which way is faster, regex or is.na(as.numeric(x)) – Mehrad Mahmoudian Jul 13 '15 at 15:41
  • `suppressWarnings(a[is.na(as.numeric(a))])`, as inspired by [this](http://stackoverflow.com/questions/14984989/how-to-avoid-warning-when-introducing-nas-by-coercion) – MichaelChirico Jul 13 '15 at 15:52
  • 2
    There are also slightly more exotic number formats such as "1.2E05" (which is 120000) but they are mostly produced by computers. – Spacedman Jul 13 '15 at 15:55
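
The `as.numeric` idea from the comments above can be sketched as follows. This is a minimal illustration rather than a definitive solution; the helper name `not_numeric` is made up for this example.

```r
# Sample input from the question
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3",
       "2....233..6", "2.2.8", "2--5", "6-4-9")

# Hypothetical helper: TRUE where as.numeric() cannot parse the string.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
not_numeric <- function(x) suppressWarnings(is.na(as.numeric(x)))

# "2." is missing from the result, because as.numeric("2.") parses as 2
a[not_numeric(a)]

# as.numeric() also accepts scientific notation such as "1.2E05"
not_numeric("1.2E05")
```

So this approach only fits if a trailing dot and scientific notation are acceptable as numbers.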

6 Answers

4

You can use the following regex:

^(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*|[^\d]+$

See demo https://regex101.com/r/tZ3uH0/6

Note that your regex engine must support look-ahead. The multi-line flag is needed when all inputs are tested as lines of a single text, as in the demo. As mentioned in a comment, you can use perl=T to activate look-ahead in R.

This regex consists of two parts joined by an alternation (OR). The first part is:

(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*

This matches a run of digits that is followed either by something other than a dot or by two or more dots. All of this sits inside a capture group that can repeat. As an alternative to that group, you can have a digit followed by a dot, repeated two or more times (to match strings like 2.3.4.).

The second part is [^\d]+, which matches anything except digits.
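
As a rough sketch, the pattern above could be applied in R like this, assuming the sample vector `a` from the question and `perl = TRUE` to enable look-ahead. The names `pat` and `flagged` are only illustrative.

```r
# Sample input from the question
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3",
       "2....233..6", "2.2.8", "2--5", "6-4-9")

# Backslashes are doubled inside an R string literal
pat <- "^(?:((\\d+(?=[^.]+|\\.{2,})).)+|(\\d\\.){2,}).*|[^\\d]+$"

# perl = TRUE switches to the PCRE engine, which supports look-ahead
flagged <- grepl(pat, a, perl = TRUE)

# Elements the pattern treats as "not a number"
a[flagged]
```

Tracing the pattern by hand suggests that `[^.]+` inside the look-ahead can also match digits, so a multi-digit integer such as "12" may be flagged as well. It is worth testing inputs beyond the sample before relying on it.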

Mazdak
2

I think this should do the job:

re <- "^-?[0-9]+$|^-?[0-9]+\\.[0-9]+$"
a[!grepl(re, a)]
#[1] "2-"          "2."          "m"           "2..3"        "2....233..6" "2.2.8"       "2--5"       
#[8] "6-4-9" 
nrussell
2
a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)]

With a suggested edit from @Frank.

Speed Test

a <- rep(a, 1e4)
all.equal(a[is.na(as.numeric(a))], a[grep("^-?\\d+(\\.?\\d+)?$|^\\d+\\.$", a, invert=T)])
[1] TRUE

library(microbenchmark)
microbenchmark(dosc = a[is.na(as.numeric(a))],
           plafort = a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)])
# Unit: milliseconds
#     expr      min       lq     mean   median       uq      max neval
#     dosc 27.83477 28.32346 28.69970 28.51254 28.76202 31.24695   100
#  plafort 31.92118 32.14915 32.62036 32.33349 32.71107 35.12258   100
Pierre L
  • Well, if we accept .2 and 2., then the answer @docendodiscimus gave is the easiest and the most human-readable. So for the sake of the question, no, they should not get selected. (I wonder if using regex is faster than a[is.na(as.numeric(a))] ) – Mehrad Mahmoudian Jul 13 '15 at 15:38
  • 1
    Or `"^-?\\d+(\\.?\\d+)?$"` so you don't have to write the first part twice. – Frank Jul 13 '15 at 15:41
  • 1
    Thank you @Frank. I added a speed test. – Pierre L Jul 13 '15 at 15:46
  • Hm, your regex is an imperfect parallel to docendo's. Try it on `"-2."` for example. And, of course, it fails on `".2"`, which `as.numeric` captures. I guess this could be rectified with `"^-?\\d*(\\.?\\d*)$"` (not sure). As Spacedman mentioned, you'd also need to contend with `2E5` to mirror `as.numeric`... Anyway, seems kind of strange for the regex in the accepted answer to (deliberately and silently) produce the wrong output for the question (by allowing `2.`)... Maybe @MehradMahmoudian can alter the question so that this makes sense...? – Frank Jul 13 '15 at 16:09
  • 1
    They mentioned in the comments that they would like `2.` and `.2` to be considered numbers. – Pierre L Jul 13 '15 at 16:23
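
To make the edge cases from the comments above concrete, here is a small sketch comparing the two patterns with `as.numeric` on a few inputs. The vector `edge` and the names `strict` and `lenient` are chosen for this example only.

```r
# Edge cases raised in the comments, plus the empty string and a lone minus
edge <- c("-2.", ".2", "2E5", "", "-")

strict  <- "^-?\\d+(\\.?\\d+)?$"   # pattern from Frank's first comment
lenient <- "^-?\\d*(\\.?\\d*)$"    # pattern currently used in the answer

# One row per input: which method treats it as a number?
data.frame(
  input      = edge,
  strict     = grepl(strict, edge),
  lenient    = grepl(lenient, edge),
  as_numeric = suppressWarnings(!is.na(as.numeric(edge)))
)
```

Since every part of the lenient pattern is optional, it also accepts the empty string and a lone "-", which `as.numeric` does not parse. Neither pattern accepts `2E5`.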
0

The solution here is good. You only have to add the negative case [-] and invert the selection!

a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a[grep(pattern="(^[1-9]\\d*(\\.\\d+)?$)|(^[-][1-9]\\d*(\\.\\d+)?$)",invert=TRUE, x=a)]

[1] "2-"          "2."          "m"           "2..3"        "2....233..6"
[6] "2.2.8"       "2--5"        "6-4-9" 
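
The two alternatives in the pattern above differ only in the leading minus sign, so they could probably be folded into one with an optional `-?`. A sketch, with `two_branch` and `one_branch` as illustrative names:

```r
# Sample input from the question
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3",
       "2....233..6", "2.2.8", "2--5", "6-4-9")

two_branch <- "(^[1-9]\\d*(\\.\\d+)?$)|(^[-][1-9]\\d*(\\.\\d+)?$)"
one_branch <- "^-?[1-9]\\d*(\\.\\d+)?$"

# Check that both patterns select the same elements of the sample
identical(grep(two_branch, a, invert = TRUE),
          grep(one_branch, a, invert = TRUE))

# The leading [1-9] means a value such as "0.5" is not matched as a number
grepl(one_branch, "0.5")
```

If numbers with a leading zero should count as valid, the `[1-9]\\d*` part would need to be relaxed, for example to `\\d+`.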
Community
0

Try this:

a[!grepl("^-?\\d*\\.?\\d+$", a)]
Shenglin Chen
0

I like the simplicity of as.numeric(). This would be my suggestion:

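library(stringr)

a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a

# Replace every element ending in "." with a non-numeric placeholder,
# so that "2." is treated as invalid as well
a1 <- ifelse(str_sub(a, -1) == ".", "string filler", a)
a1

# TRUE marks the elements that are not valid numbers
outvect <- suppressWarnings(is.na(as.numeric(a1)))
outvect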
mef jons