I need a regular expression that, given a text string, extracts only the numbers. These numbers are usually written in decimal notation, and for that case I have been using the following expression:

r'-?(?:\d+(?:\.\d*)?)'

This can be verified with the following example:

>>> import re
>>> text1 = u'''MULTIPOLYGON (((-0.026629449670668229 38.880267142395049,
                       -0.037640029706400797 38.887965291134428, 
                       -0.038243258379973236 38.886652370401961, 
                       -0.038324794358468445 38.886474904266947, 
                       -0.039081561703673183 38.885154939177824)))'''
>>> re.findall(r'-?(?:\d+(?:\.\d*)?)', text1)
[u'-0.026629449670668229', u'38.880267142395049', u'-0.037640029706400797', u'38.887965291134428', u'-0.038243258379973236', u'38.886652370401961', u'-0.038324794358468445', u'38.886474904266947', u'-0.039081561703673183', u'38.885154939177824']

However, in some cases that I did not initially anticipate, a number is written in scientific notation (AeN, e.g. -1.15e-05). The expression above does not handle that form and splits such a number into two matches, as the following example shows:

>>> text2 = u'''MULTIPOLYGON (((-1.1577490327131464e-05 38.865878133979862,
                       -0.037640029706400797 38.887965291134428, 
                       -0.038243258379973236 38.886652370401961, 
                       -0.038324794358468445 38.886474904266947, 
                       -0.039081561703673183 38.885154939177824)))'''
>>> re.findall(r'-?(?:\d+(?:\.\d*)?)', text2)
[u'-1.1577490327131464', u'-05', u'38.865878133979862', u'-0.037640029706400797', u'38.887965291134428', u'-0.038243258379973236', u'38.886652370401961', u'-0.038324794358468445', u'38.886474904266947', u'-0.039081561703673183', u'38.885154939177824']

Is there an expression that produces the following result for the previous example?

>>> re.findall(RE_EXPRESSION, text2)
[u'-1.1577490327131464e-05', u'38.865878133979862', u'-0.037640029706400797', u'38.887965291134428', u'-0.038243258379973236', u'38.886652370401961', u'-0.038324794358468445', u'38.886474904266947', u'-0.039081561703673183', u'38.885154939177824']

2 Answers

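Alternation may help: try the form with an exponent first, then fall back to the plain decimal form. Note that both branches require a decimal point, so bare integers such as 42 are not matched.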

re.findall(r'-?\d+\.\d+(?:[Ee][-+]?\d+)|-?\d+\.\d+',text2)

['-1.1577490327131464e-05', '38.865878133979862', '-0.037640029706400797', '38.887965291134428', '-0.038243258379973236', '38.886652370401961', '-0.038324794358468445', '38.886474904266947', '-0.039081561703673183', '38.885154939177824']
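
Since the two branches differ only in the exponent, the alternation can probably be folded into a single optional group. The following is a minimal sketch, assuming Python 3 and the standard `re` module; the names `ALTERNATION`, `FOLDED`, and `sample` are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical names for illustration; the first pattern is the one from this answer.
ALTERNATION = r'-?\d+\.\d+(?:[Ee][-+]?\d+)|-?\d+\.\d+'
# Same idea with the exponent made optional instead of using two branches.
FOLDED = r'-?\d+\.\d+(?:[Ee][-+]?\d+)?'

# Made-up input: two decimals (one with an exponent), a bare integer,
# and an uppercase exponent with an explicit plus sign.
sample = '-1.1577490327131464e-05 38.865878133979862 7 2.5E+3'

alt_matches = re.findall(ALTERNATION, sample)
folded_matches = re.findall(FOLDED, sample)

print(alt_matches)
print(folded_matches)
```

On this input both patterns should return the same three tokens, and both skip the bare integer `7` because a decimal point is mandatory in each of them.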
LetzerWille

Your regex is fine for plain decimal numbers. To also cover exponential forms such as e-123, e+123, or e123 (where the e may be an uppercase E), append (?:[eE][-+]?\d+)? to your existing expression:

-?(?:\d+(?:\.\d*)?)(?:[eE][-+]?\d+)?
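
As a quick check, the expression can be run against a shortened version of the text from the question. This is a sketch that assumes Python 3 (so the `u` prefixes are dropped); the names `NUMBER_RE`, `tokens`, and `values` are illustrative only.

```python
import re

# The original expression from the question plus an optional exponent part.
NUMBER_RE = re.compile(r'-?(?:\d+(?:\.\d*)?)(?:[eE][-+]?\d+)?')

# First two coordinate pairs from the question's second example.
text2 = '''MULTIPOLYGON (((-1.1577490327131464e-05 38.865878133979862,
                       -0.037640029706400797 38.887965291134428)))'''

# All groups are non-capturing, so findall returns the full matches.
tokens = NUMBER_RE.findall(text2)

# float() parses both plain decimal and scientific notation.
values = [float(t) for t in tokens]

print(tokens)
print(values)
```

The number in scientific notation should now come back as a single token rather than being split at the exponent, and each token can be passed straight to `float()`.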

Pushpesh Kumar Rajwanshi