
I have a regex that captures most floating-point numbers and integers in a text and skips most alphanumeric tokens.

Regex : /[+-]?\d*?[^a-zA-Z\n][^\s]/

But it fails on one of the test cases below.

Requirements:

1.) Capture all valid integers and decimal numbers (including ones with a positive + or negative - sign). 1, 1.0, -1.0, -1, .6, 0.7, 0, +.6, +.01 are all valid. 7. is not valid. With the regex above, .6 is not captured in the text below.

2.) Avoid tokens like 3E, 171A, etc. The regex handles everything except this case: it captures tokens like 11A and 17E (but NOT 9E or 8B). In the extract below, 10E is captured but 9W is not; 10E is not wanted either. Any string of the form "NUMBER followed by LETTERS" has to be skipped.

3.) Whitespace should not be captured. I don't want to keep trimming in the code [the dataset can be huge; I can use string.trim() in Java, but want to avoid it].

Any suggestions?

Sample text below

     la=    -0.8    -0.7    -1.3    -1.6    -0.2    -0.9    -0.6    -0.7    -0.4     0.0 
  9W t=  32.611  32.599  32.588  32.577  32.565  32.531  32.519  32.508  32.496  32.485
      a=    13.6    17.2    13.9    14.8    12.7    17.8    13.7    14.3    16.9    15.9 
      p=    16.2    17.9    17.7    16.5    14.8    20.3    16.7    17.1    21.1    17.8 
     la=     0.7     1.     0.7     0.8     0.6     0.9     1.0     2.0     1.8     0.9 
      t=  32.309  32.298  32.287  32.276  32.265  32.177  32.166  32.155  32.144  32.133
      a=    12.1    13.4    17.5    17.0     0.0    14.5    14.7    14.7    16.7    14.5 
      p=    15.2    14.6    18.4    18.5     0.0    15.1    15.9    17.1    17.5    17.0 
     la=     0.9     .6     1.3     0.5     0.0     0.3     0.9     0.9     0.9     0.6 

 10E t=  32.658  32.646  32.635  32.623  32.612  32.577  32.566  32.555  32.543  32.532
      a=    13.8    17.3    16.0    15.2    13.8    16.4    15.3    20.3    17.6    16.5 
      p=    15.2    18.0    17.4    17.1    15.6    17.7    18.0    23.2    19.1    18.8

Another regex, /([^\s][\d])+(.\d+)?[^a-zA-Z][^\s]/, does everything else but fails on 1, 0.9, etc.: it does not capture the first and last digits.

Any help is appreciated.

Don Woodward

2 Answers


You can use this: (?<!\S)[+-]?(?:\d+|\d*\.\d+)(?!\S)

Explanation:

  • (?<!\S) checks that the match is not preceded by a non-whitespace character. Equivalent to (?<=\s|^).
  • [+-]? is an optional sign.
  • (?:\d+|\d*\.\d+) matches either an integer or a decimal number with an optional integer part.
  • (?!\S), equivalent to (?=\s|$), requires that the number from the previous point be followed by a whitespace character (space, tab, or newline) or by the end of the input. Note that this character is checked but not included in the actual match.
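Since the question mentions Java, here is a minimal sketch of how the pattern above might be used there. The class name `NumberExtractor`, the method `extract`, and the sample string are made up for illustration; the only change to the regex is that backslashes are doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberExtractor {
    // Pattern from the answer, with backslashes escaped for a Java string literal.
    private static final Pattern NUMBER =
            Pattern.compile("(?<!\\S)[+-]?(?:\\d+|\\d*\\.\\d+)(?!\\S)");

    // Returns every match in order; matches never contain whitespace,
    // so no trim() is needed afterwards.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical line mixing valid numbers with tokens that must be skipped.
        String sample = "  9W t=  32.611   1.   .6  -0.8 +.01 10E 0 ";
        System.out.println(extract(sample)); // prints [32.611, .6, -0.8, +.01, 0]
    }
}
```

Here 9W, 10E, and 1. are skipped because the lookarounds require whitespace (or the start or end of the input) on both sides of a complete number.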


markalex
  • Thank you, Markalex. Looks promising. Testing in progress; will accept as answer soon. I did not understand the ?: in the group (?:\d+|\d*\.\d+). What is the question mark at the beginning of the group for? – Don Woodward Jul 15 '23 at 18:37
  • I think `(?=\s)` should be `(?=\s|$)` to catch the last number. – Poul Bak Jul 15 '23 at 18:45
  • In the text `# 12E t= 32.9+84 32.973 32.961 32.950 32.938 32.904`, 32.9 is not captured, but +84 is. I wish both were captured. – Don Woodward Jul 15 '23 at 18:55
  • @PoulBak, you are right. Replaced it with `(?!\S)` - more concise way for the same instruction. – markalex Jul 15 '23 at 18:58
  • 32.932.973: ignore it entirely. 32.9+84: ignore it entirely. Neither of these should be captured. – Don Woodward Jul 15 '23 at 19:12
  • @DonWoodward, `(?: )` is [a non-capturing group](https://stackoverflow.com/q/3512471). It works like a regular group but does not produce a separate capture when matching. – markalex Jul 15 '23 at 19:20
  • Yes, 32.9 and +84 should be captured (or ignored completely), but only when written like 32.9 <>+84, with spaces in the middle. Also, with your regex, in 32+.937 the 32 part is not captured; only +.937 is. Capturing both 32 and +.937 is also fine. I am using regex101 to test this. Link here: https://regex101.com/r/ZvXFtd/1 – Don Woodward Jul 15 '23 at 20:15
  • @DonWoodward, negative lookbehind should solve this. Updated the answer. – markalex Jul 15 '23 at 20:20
  • Thank you, marked as the answer. This is what I needed: either ignore it completely, as the regex you shared does, or capture both 32.9 and +.84 separately. Yours falls into the first category. – Don Woodward Jul 15 '23 at 20:56

I came up with the following:

(?<=(\s|^))[+-]?(\d+|\d*\.\d+)(?=(\s|$))

(?<=(\s|^)) and (?=(\s|$)) are custom word boundaries, which ensure that we avoid something like 1.2e5 or word123, where 1.2 and 123 would otherwise be our matches.
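As a rough check of the boundary behaviour described here, the sketch below runs this pattern in Java next to a variant that uses negative lookarounds and a non-capturing group instead. The class `BoundaryComparison`, its helper methods, and the sample string are hypothetical names chosen for the example. On this input both patterns should return the same matches; the only visible difference is that this one exposes three capturing groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryComparison {
    // The pattern from this answer: every parenthesised group captures.
    static final Pattern CAPTURING =
            Pattern.compile("(?<=(\\s|^))[+-]?(\\d+|\\d*\\.\\d+)(?=(\\s|$))");

    // Variant with negative lookarounds and a non-capturing group.
    static final Pattern NON_CAPTURING =
            Pattern.compile("(?<!\\S)[+-]?(?:\\d+|\\d*\\.\\d+)(?!\\S)");

    // Collects the full text of every match, in order.
    static List<String> matches(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Number of capturing groups the pattern defines.
    static int groupCount(Pattern p) {
        return p.matcher("").groupCount();
    }

    public static void main(String[] args) {
        String sample = "1.2e5 word123 -7 +.5 42";
        System.out.println(matches(CAPTURING, sample));     // prints [-7, +.5, 42]
        System.out.println(matches(NON_CAPTURING, sample)); // prints [-7, +.5, 42]
        System.out.println(groupCount(CAPTURING));          // prints 3
        System.out.println(groupCount(NON_CAPTURING));      // prints 0
    }
}
```

If only the whole match is needed, the extra groups are harmless but unnecessary.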

Etheilred