I'm pretty sure what I'm looking for is a regular expression in R for reading scientific notation. Below is what I have done, along with the specifics. I very much appreciate any help.

I have a text file in which some numbers are in scientific notation and some are plain decimals or integers. I'm trying to read them into R using regular expressions. I wrote a program to do this, and it worked as long as the numbers were neither negative nor in scientific notation.

The program I wrote was

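getBig <- function(fileName, rows, columns)
{
  # read the whole file into a single string
  dat <- readChar(fileName, file.info(fileName)$size)
  # locate each digit that is followed by one or more digits, '.' or '/' characters
  m <- gregexpr('[0-9][/.0-9]+', dat, perl = TRUE)

  s <- regmatches(dat, m)
  s <- s[[1]]
  s <- s[-1] # the first element is the list size
  # fill column-wise, then transpose so the values end up in row order
  S <- matrix(s, ncol = rows, nrow = columns)
  S <- t(S)
  return(S)
}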

I tried to extend this to negative numbers and scientific notation by replacing the regular expression in the above program with the one below, but was not successful. Does anyone have an idea where I am going wrong? Any help is appreciated, and I have included an example of the file format below as well.

 m <- gregexpr(' [-+]?[0-9]*(/.?[0-9]*([eE][-+]?[0-9]?))?',dat,perl = TRUE)

[-+]? + or -, optional

[0-9]* a digit 0-9, 0 or more times

( start an optional block; /.? optional; [0-9]* a digit 0-9, 0 or more times

( start another block; [eE] e or E; [-+]? + or -, optional; [0-9]* a digit 0-9, 0 or more times; )?)? close both blocks, each matching optionally

The file format below is rows,columns

where (rowN,rowN,rowN) refers to columns 1-3 of the Nth row, e.g.

[3,1]    ((1,1,-1),-2.542611418857958448210085379141884323299379672715620518130686999531487002844642281770330354890802745e-05,8.586192002176000052697976968885158408090751670240233300961472896241959822732337130019333683974778635e-05))
steve3051980

1 Answer

Based on the question "Regex for numbers on scientific notation?", the following could work in R:

Regex for scientific notation only:

only_sci_notation_numbers_regex <- "^(-?[0-9]*)\\.?[0-9]+[eE]?[-\\+]?[0-9]+$"

Regex for scientific notation and non-scientific notation decimals or integers:

all_numbers_regex <- "^(-?[0-9]*)((\\.?[0-9]+[eE]?[-\\+]?[0-9]+)|(\\.[0-9]+))*$"

Example of some of the patterns this matches and doesn't match:

examples_match <- c(
  "0", "1", "1.5", "0.2", "-0", "-1", "-1.5", "-0.2", ".1", "-.1",
  "1.05E+10", "1.05e+10", "-1.05E+10", "-1.05e+10", "1.05E-10", "1.05e-10", "-1.05E-10", "-1.05e-10",
  ".1e5", ".1E5", "-.1e5", "-.1E5")

examples_not_match <- c("1.", "1.e5", "1e5.")

# matches only numbers in scientific notation (so not examples 1-10)
lapply(examples_match, function(x) grepl(only_sci_notation_numbers_regex, x))

# matches numbers in scientific and non-scientific notation
lapply(examples_match, function(x) grepl(all_numbers_regex, x))

# doesn't match mis-formatted numbers
lapply(examples_not_match, function(x) grepl(only_sci_notation_numbers_regex, x))
lapply(examples_not_match, function(x) grepl(all_numbers_regex, x))
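
One caveat worth checking: because the exponent marker [eE]? is optional in the first expression, it also accepts plain multi-digit numbers such as "10", and the trailing * in the second expression lets it accept the empty string and inputs like "1.2.3". The sketch below shows one possible stricter variant, not the original author's patterns. The names mantissa, exponent, strict_sci_regex and strict_all_regex are hypothetical, and allowing a leading + is an assumption taken from the question.

```r
# The two patterns from the answer, repeated so this block runs on its own.
only_sci_notation_numbers_regex <- "^(-?[0-9]*)\\.?[0-9]+[eE]?[-\\+]?[0-9]+$"
all_numbers_regex <- "^(-?[0-9]*)((\\.?[0-9]+[eE]?[-\\+]?[0-9]+)|(\\.[0-9]+))*$"

grepl(only_sci_notation_numbers_regex, "10")  # TRUE, although "10" has no exponent
grepl(all_numbers_regex, c("", "1.2.3"))      # TRUE TRUE

# Hypothetical stricter building blocks (an assumption, not from the answer).
# Mantissa: optional sign, optional integer part, optional ".", at least one digit.
mantissa <- "[-+]?[0-9]*\\.?[0-9]+"
# Exponent: e or E, optional sign, at least one digit.
exponent <- "[eE][-+]?[0-9]+"

strict_sci_regex <- paste0("^", mantissa, exponent, "$")         # exponent required
strict_all_regex <- paste0("^", mantissa, "(", exponent, ")?$")  # exponent optional

plain <- c("0", "1.5", "-.1", "10", "+3")
sci   <- c("1.05E+10", "-1.05e-10", ".1e5", "1e5")
bad   <- c("1.", "1.e5", "1e5.", "1.2.3", "", "-")

grepl(strict_sci_regex, plain)           # all FALSE
grepl(strict_sci_regex, sci)             # all TRUE
grepl(strict_all_regex, c(plain, sci))   # all TRUE
grepl(strict_all_regex, bad)             # all FALSE
```

grepl is vectorised over its input, so the lapply wrapper is not needed when a plain logical vector is enough.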
  
  

These regular expressions assume that the full string represents your number. If you want to match a scientific or non-scientific number that makes up only part of a string (e.g. to extract it from a longer string via stringr::str_extract), remove the ^ at the beginning and the $ at the end of the respective expression.
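
As a rough illustration of the unanchored case, the sketch below pulls every number out of one line using base R's gregexpr and regmatches instead of stringr. The pattern number_regex is an assumption rather than one of the expressions above, and the input line is a hypothetical stand-in modelled on the question's sample, with the long mantissas shortened.

```r
# Unanchored pattern (an assumption, not taken from the answer):
# optional sign, mantissa with an optional decimal point, optional exponent.
number_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"

# Hypothetical line modelled on the question's sample, mantissas shortened.
line <- "[3,1]    ((1,1,-1),-2.5426e-05,8.5862e-05))"

# gregexpr() locates every match; regmatches() returns the matched text.
tokens <- regmatches(line, gregexpr(number_regex, line, perl = TRUE))[[1]]
values <- as.numeric(tokens)

tokens  # seven strings, from "3" through "8.5862e-05"
values  # the same seven entries converted to numeric
```

The leading tokens come from the [3,1] header and the index tuple, so a caller would presumably drop or reshape them before building a matrix, much as the question's getBig function discards its first element.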

Phil