I want to remove the & separators and the standalone . placeholders from the following vector and extract only the numbers:

x = as.factor(c(".&.", "0.0119885482338&.&.", ".&2.25880593895", ".&.&.&.&.&.&.&.", ".&0.295142083575&.", "0.708323350364",".&.&0.193766679861",".&.&.&.&7.65239874523E-4&.&."))

I tried the following gsub() command:

gsub("[^0-9.E-]","",x)

The output:

".."                     "0.0119885482338.."      ".2.25880593895"         "........"
".0.295142083575."       "0.708323350364"         "..0.193766679861"       "....7.65239874523E-4.."

How can I update the gsub() command above so that the output looks like this?

"" "0.0119885482338" "2.25880593895" "" "0.295142083575" 
"0.708323350364" "0.193766679861" "7.65239874523E-4"  
Tim Biegeleisen
  • Note: you may need to adjust [the number pattern](https://stackoverflow.com/questions/638565/parsing-scientific-notation-sensibly); it may vary depending on the actual data and requirements. – Wiktor Stribiżew May 09 '21 at 10:18

4 Answers

You can use

> sub("^.*?(?:([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?).*|$)","\\1",x)
[1] ""                 "0.0119885482338"  "2.25880593895"    ""                 "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"

See the regex demo.

Details:

  • ^ - start of string
  • .*? - any text, as short as possible
  • (?: - start of a non-capturing group:
    • ([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?) - Group 1 (\1): a number pattern
    • .* - the rest of the string
  • |
    • $ - end of string
  • ) - end of the non-capturing group.

See an online R demo:

x=as.factor(c(".&.", "0.0119885482338&.&.", ".&2.25880593895", ".&.&.&.&.&.&.&.", ".&0.295142083575&.", "0.708323350364",".&.&0.193766679861",".&.&.&.&7.65239874523E-4&.&."))
sub("^.*?(?:([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?).*|$)","\\1",x)
## => [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
##    [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"
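
If the end goal is a numeric column, the character result can be passed through as.numeric(), which turns the empty strings into NA. This is only a sketch under the assumption that x and the pattern are the same as above; the names num_pattern, extracted and values are illustrative and do not come from the original answer.

```r
# Same input as in the question
x <- as.factor(c(".&.", "0.0119885482338&.&.", ".&2.25880593895",
                 ".&.&.&.&.&.&.&.", ".&0.295142083575&.", "0.708323350364",
                 ".&.&0.193766679861", ".&.&.&.&7.65239874523E-4&.&."))

# Pattern from the answer above, stored under an illustrative name
num_pattern <- "^.*?(?:([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?).*|$)"

# Keep the first number in each element, or "" when there is none
extracted <- sub(num_pattern, "\\1", x)

# Convert to numeric; the "" entries become NA, so the length is preserved
values <- as.numeric(extracted)
which(is.na(values))
## [1] 1 4
```

Because the result has the same length as x, it can be assigned straight back to a data frame column.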
Wiktor Stribiżew

Here is a base R approach using grepl followed by sub:

x <- x[grepl("\\d+", x)]
x <- sub("^.*?(\\d+(?:\\.\\d+)?(?:E[-+]\\d+)?).*$", "\\1", x)
x

[1] "0.0119885482338"  "2.25880593895"    "0.295142083575"   "0.708323350364"  
[5] "0.193766679861"   "7.65239874523E-4"
Tim Biegeleisen
  • The command mentioned by @Tim is removing the missing observations ("") from the array. – Apurba Shil May 09 '21 at 11:59
  • ...and why do you want those empty string entries anyway? – Tim Biegeleisen May 09 '21 at 12:08
  • Those empty string entries are generating NA's (which is desired) after I converted them to numeric values (using as.numeric()). The above array is a part of a column from a data frame and thus NA observations need to be kept for the downstream analysis. The solution you gave/suggested is helpful to me. – Apurba Shil May 09 '21 at 12:33

In the alternatives below, remove as.numeric at the end if you want the result to be character.

1) The following does not use regular expressions. The input shown in the question consists of &-separated fields, so this converts x from factor to character, splits it on &, drops any field that is just a dot, and converts the rest to numeric. No packages are used.

s <- unlist(strsplit(paste(x), "&", fixed = TRUE))
as.numeric(s[s != "."])
## [1] 0.0119885482 2.2588059390 0.2951420836 0.7083233504 0.1937666799
## [6] 0.0007652399

Alternatively, we could write it as a pipeline:

library(magrittr)

x %>%
  paste %>%
  strsplit("&", fixed = TRUE) %>%
  unlist %>%
  Filter(function(x) x != ".", .) %>%
  as.numeric
## [1] 0.0119885482 2.2588059390 0.2951420836 0.7083233504 0.1937666799
## [6] 0.0007652399

2) The approach in the question can work if we afterwards remove the leading and trailing dots, drop zero-length fields, and convert to numeric:

as.numeric(Filter(nzchar, trimws(gsub("[^0-9.E-]","",x), whitespace = "\\.")))
## [1] 0.0119885482 2.2588059390 0.2951420836 0.7083233504 0.1937666799
## [6] 0.0007652399

Update

In a comment it was mentioned that it is desired that the result be the same length as the input. Assuming that in that case we want character output we can shorten the above to the following:

L <- strsplit(paste(x), "&", fixed = TRUE)
sapply(L, function(x) c(x[x != "."], "")[1])
## [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
## [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"

x %>% paste %>% strsplit("&", fixed = TRUE) %>% sapply(function(x) c(x[x != "."], "")[1])
## [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
## [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"


trimws(gsub("[^0-9.E-]","",x), whitespace = "\\.")
## [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
## [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"
G. Grothendieck
  • Hi, the first two commands work well. The last command returns an error: **Error in trimws(gsub("[^0-9.E-]", "", x), whitespace = "\\.") : unused argument (whitespace = "\\.")**. – Apurba Shil May 10 '21 at 13:39
  • Update to the most recent version of R. whitespace= was added in one of the more recent versions. – G. Grothendieck May 10 '21 at 13:45
  • or if it is not feasible for you to upgrade use: `gsub("^\\.*|\\.*$", "", gsub("[^0-9.E-]","",x))` – G. Grothendieck May 10 '21 at 13:52

If . and & always occur together (as they do in your example), you can use the pattern \\.*&\\.*:

gsub("\\.*&\\.*", "", x)
#[1] ""                 "0.0119885482338"  "2.25880593895"    ""                
#[5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"
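
The "always together" assumption matters. As a rough illustration, the sketch below uses a hypothetical input y (not from the question) in which two numbers are joined by a bare &; the variable names merged and fields are likewise illustrative. Deleting the separator then fuses the two numbers, whereas splitting on & keeps them apart.

```r
# Hypothetical input (not from the question): two numbers joined by a bare "&"
y <- "1.5&2.5"

# Deleting the separator merges the two numbers into one string
merged <- gsub("\\.*&\\.*", "", y)
merged
## [1] "1.52.5"

# Splitting on "&" keeps the fields separate
fields <- strsplit(y, "&", fixed = TRUE)[[1]]
fields
## [1] "1.5" "2.5"
```

For data shaped like the question's, where each element holds at most one number, the one-line gsub() is sufficient.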
GKi