
I am trying to use regular expressions in R to remove text - either an 'X' or an 'X.' - from the front of a number. I am new to regular expressions and having a hard time getting this to work. I have tried every combination of X and . with or without the escape character that I could think of, including:

  • str_replace("X.4.89294e-05", "X.",'') Result "4.89294e-05", but it fails for str_replace("X4.89294e-05", "X.",'') Result ".89294e-05"
  • str_replace("X.4.89294e-05", "[X.]",'') Result ".4.89294e-05"
  • str_replace("X.4.89294e-05", "[X/.?]",'') Result ".4.89294e-05"
  • str_replace("X.4.89294e-05", "[X//.?]",'') Result ".4.89294e-05"
  • str_replace('X.4.89294e-0','X/.{0,1}','') Result "X.4.89294e-0"
  • str_replace('X.4.89294e-0','[X/.{0,1}]','') Result ".4.89294e-0"

Any help would be greatly appreciated.

Ronak Shah
mrpargeter
  • A forward slash `/` is nothing special; the escape character is a backslash `\`. And generally, for regex in R, you need 2 backslashes (one for R itself, one for the regex). – Gregor Thomas Apr 20 '18 at 03:25
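
To make the comment above concrete, here is a minimal sketch of why the bracketed and forward-slash attempts from the question behave as they do. It uses base R `sub()` as a stand-in for `str_replace()` (both replace only the first match); the variable names are illustrative only.

```r
s <- "X.4.89294e-0"

# [X.] is a character class: it matches ONE character that is either X or a dot,
# so only the leading X is removed.
class_attempt <- sub("[X.]", "", s)

# A forward slash is an ordinary character, so this pattern looks for a literal "X/",
# which never occurs in the string; nothing is replaced.
slash_attempt <- sub("X/.{0,1}", "", s)

# With a backslash (doubled inside the R string) the dot is literal and optional.
backslash_attempt <- sub("X\\.{0,1}", "", s)

print(class_attempt)
print(slash_attempt)
print(backslash_attempt)
```

Only the last pattern removes both the X and the dot that follows it.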

3 Answers


The . must be escaped, because an unescaped . in a regex matches any character. In an R string, you do that by writing \\ before the .

Read on the need for \\ here: Escape with a double backslash

Like this:

library(stringr)
txt = c("X.4.89294e-0", "X4.89294e-0")
# ^ anchors the match at the start; (\\.)? makes the literal dot optional
str_replace(txt, "^X(\\.)?", "")

If the X or X. does not have to be at the very beginning, remove ^ from the example above to match its first occurrence anywhere in the string.
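
As a quick check of the escaping rule, the sketch below compares the unescaped and escaped patterns on both inputs. It uses base R `sub()` rather than `str_replace()` so that it runs without stringr installed; that substitution is an assumption made for the example, not part of the answer above.

```r
txt <- c("X.4.89294e-0", "X4.89294e-0")

# The R string "\\." holds two characters: one backslash and one dot.
pattern_length <- nchar("\\.")

# Unescaped dot: matches ANY character, so it also swallows the 4 in "X4...".
unescaped <- sub("^X.", "", txt)

# Escaped, optional dot: removes "X" or "X." and nothing else.
escaped <- sub("^X\\.?", "", txt)

print(pattern_length)
print(unescaped)
print(escaped)
```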

Deepak Rajendran

You mean remove 'X' or 'X.' from in front of the digits?
A literal . in a regex is written \., and inside an R string the backslash itself must be doubled, so try str_replace("X.4.89294e-05", "X\\.?", "") instead.

Alex.Fu

remove text - either an 'X' or an 'X.' - from the front of a number

Taking into account that all your test cases contain a single X or X. at the start of the string, you may use

sub("^X\\.?(\\d)", "\\1", x)
str_replace(x, "^X\\.?(\\d)", "\\1")

Note that on regex testing sites you need to type a single backslash (a literal backslash); inside an R string literal that same backslash is written as a double backslash.

Details

  • ^ - start of the string
  • X - an X char
  • \\.? - \. matches a literal dot, and ? is a quantifier making the regex engine match 1 or 0 consecutive occurrences of the . char
  • (\\d) - a capturing group #1 that matches and stores in a memory slot any digit (\d matches any digit)
  • \\1 - inside a replacement argument, the reference to the value stored in Group 1 memory slot.
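
The breakdown above can be tried out with a short sketch. The input vector `x` is made up for illustration: the first two values come from the question, and "Xylophone" is a hypothetical value added to show that an X not followed by a digit is left alone.

```r
x <- c("X.4.89294e-05", "X4.89294e-05", "Xylophone")

# Remove a leading "X" or "X." only when a digit follows; \\1 puts that digit back.
cleaned <- sub("^X\\.?(\\d)", "\\1", x)
print(cleaned)

# The cleaned strings from the question can then be parsed as numbers.
values <- as.numeric(cleaned[1:2])
print(values)
```

Requiring the digit in the pattern is what protects strings such as "Xylophone" from losing their first letter.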

You may even use a (?=\d) lookahead based solution to check for a digit immediately to the right of the current location:

sub("^X\\.?(?=\\d)", "", x, perl=TRUE)
str_replace(x, "^X\\.?(?=\\d)", "")

Then, there is no need to use \1 because the text matched with a lookahead is not put into the match and thus won't get removed during the sub/str_replace operation.
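
A small sketch comparing the two approaches, again with an illustrative input vector in which "Xylophone" is a hypothetical extra value. In base R the lookahead needs perl=TRUE, because the default regex engine does not support lookarounds.

```r
x <- c("X.4.89294e-05", "X4.89294e-05", "Xylophone")

# Capturing-group version: the digit is consumed by the match and restored with \\1.
with_group <- sub("^X\\.?(\\d)", "\\1", x)

# Lookahead version: the digit is only checked, never consumed, so the replacement is empty.
with_lookahead <- sub("^X\\.?(?=\\d)", "", x, perl = TRUE)

print(with_lookahead)
print(identical(with_group, with_lookahead))
```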

Wiktor Stribiżew