I am trying to extract numbers written in scientific notation from lines in a text file.

Example:

str = 'Name of value 1.111E-11   Next Name 444.4'

Result:

[1.111E-11, 444.4]

I've tried solutions from other posts, but they seem to work only for integers:

>>> [int(s) for s in str.split() if s.isdigit()]
[]

float() would work, but it raises an error whenever a token is not a number.

>>> float(str.split()[3])
1.111e-11
>>> float(str.split()[2])
ValueError: could not convert string to float: value

Thanks in advance for your help!!

Josh Melson

3 Answers


This can be done with regular expressions:

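import re
s = 'Name of value 1.111E-11   Next Name 444.4'
# Raw string so the backslashes reach the regex engine unchanged
match_number = re.compile(r'-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?')
# Drop any matched spaces before converting, since float() rejects input like '- 2.3'
final_list = [float(x.replace(' ', '')) for x in match_number.findall(s)]
print(final_list)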

output:

[1.111e-11, 444.4]

Note that the pattern I wrote above requires at least one digit to the left of the decimal point, so a value written as .5 is picked up as 5.
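
To illustrate that limitation, here is a minimal sketch that compares the pattern with a looser variant. The alternation `(?:[0-9]+\.?[0-9]*|\.[0-9]+)`, the name `match_number_loose`, and the sample string are my own assumptions for illustration, not part of the original answer.

```python
import re

# Original pattern: requires at least one digit before the decimal point.
match_number = re.compile(r'-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?')

# Hypothetical variant: also accepts a bare leading decimal point such as ".5".
match_number_loose = re.compile(
    r'-?\ *(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[Ee]\ *-?\ *[0-9]+)?'
)

s = 'offset .5 scale 2.5E3'

# Spaces inside a match are removed before calling float().
strict = [float(x.replace(' ', '')) for x in match_number.findall(s)]
loose = [float(x.replace(' ', '')) for x in match_number_loose.findall(s)]

print(strict)  # the ".5" is read as 5.0
print(loose)   # the ".5" is read as 0.5
```

Whether the looser form is worth it depends on the data. If the file never writes numbers without a leading digit, the original pattern is simpler to read.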

EDIT:

Here's a tutorial and reference I found helpful for learning how to write regex patterns.

Since you asked for an explanation of the regex pattern:

'-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?'

One piece at a time:

-?        optionally matches a negative sign (zero or one negative signs)
\ *       matches any number of spaces (to allow for formatting variations like - 2.3 or -2.3)
[0-9]+    matches one or more digits
\.?       optionally matches a period (zero or one periods)
[0-9]*    matches any number of digits, including zero
(?: ... ) groups an expression, but without forming a "capturing group" (look it up)
[Ee]      matches either "e" or "E"
\ *       matches any number of spaces (to allow for formats like 2.3E5 or 2.3E 5)
-?        optionally matches a negative sign
\ *       matches any number of spaces
[0-9]+    matches one or more digits
?         makes the entire non-capturing group optional (to allow for the presence or absence of the exponent - 3000 or 3E3)

note: \d is a shortcut for [0-9], but I'm just used to using [0-9].
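
The breakdown above can also live inside the code itself. Below is a sketch using `re.VERBOSE`, which lets each piece of the same pattern carry its own comment. The second pattern, named `capturing` here, is a simplified hypothetical variant added only to show what changes when the group is a capturing one.

```python
import re

# Same pattern as above, written with re.VERBOSE. In verbose mode plain
# whitespace in the pattern is ignored, so the escaped "\ " is what matches
# a literal space.
match_number = re.compile(r'''
    -?          # optional negative sign
    \ *         # any number of spaces
    [0-9]+      # one or more digits
    \.?         # optional decimal point
    [0-9]*      # zero or more digits
    (?:         # non-capturing group for the exponent
        [Ee]    # "e" or "E"
        \ *     # any number of spaces
        -?      # optional negative sign
        \ *     # any number of spaces
        [0-9]+  # one or more digits
    )?          # the whole exponent is optional
''', re.VERBOSE)

s = 'Name of value 1.111E-11   Next Name 444.4'
matches = match_number.findall(s)
values = [float(m) for m in matches]
print(values)

# With a capturing group, findall() returns the group contents instead of
# the whole match, which is why the pattern uses (?: ... ).
capturing = re.compile(r'[0-9]+\.?[0-9]*([Ee]-?[0-9]+)?')
group_only = capturing.findall(s)
print(group_only)
```

Here `matches` keeps the leading space that `\ *` picked up, and `float()` tolerates surrounding whitespace, so the conversion still succeeds for this input.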

Brionius
  • This answer was incredibly useful. I would like to add however that if you modify the `-?` to read `[-+]?` then it will properly match scientific notation if the string puts the `+` for positive numbers or exponents. – Brian May 12 '19 at 11:39
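
A quick sketch of the change suggested in that comment, applied to both sign positions. The name `match_signed` and the sample string are assumptions made up for this illustration.

```python
import re

# Pattern from the answer: only a minus sign is recognized.
match_number = re.compile(r'-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?')

# Variant from the comment: accept "+" as well as "-" for the number
# and for the exponent.
match_signed = re.compile(r'[-+]?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *[-+]?\ *[0-9]+)?')

s = 'gain +2.5E+3 loss -1.0e-2'

# Spaces inside a match are removed before calling float().
original = [float(x.replace(' ', '')) for x in match_number.findall(s)]
signed = [float(x.replace(' ', '')) for x in match_signed.findall(s)]

print(original)  # "+2.5E+3" is split into 2.5 and 3.0
print(signed)    # "+2.5E+3" is read as one number
```

With the original pattern the `+` in the exponent stops the match early, so the exponent digits show up as a separate number.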

You could always just use a for loop and a try-except statement.

>>> string = 'Name of value 1.111E-11   Next Name 444.4'
>>> final_list = []
>>> for elem in string.split():
...     try:
...         final_list.append(float(elem))
...     except ValueError:
...         pass  # skip tokens that are not numbers
...
>>> final_list
[1.111e-11, 444.4]
Sukrit Kalra
  • Nice. +1 for dedication to duck-typing. – Brionius Aug 09 '13 at 17:59
  • I prefer this method over regex, but it comes with the assumption that there are spaces around the numbers. If there is a chance that there isn't a space around a number, the regex is a better solution. – SethMMorton Aug 09 '13 at 21:03
  • @SethMMorton: I assumed that was the case, since the OP's initial approach was to split the data on spaces and he knows more about his data than I do. :) – Sukrit Kalra Aug 10 '13 at 01:33
  • @SukritKalra I just wanted to make it clear for people who find this answer in the future and whose data doesn't necessarily meet this criterion. – SethMMorton Aug 11 '13 at 02:49
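
The trade-off discussed in these comments can be sketched as follows. The function names `floats_by_split` and `floats_by_regex`, the simplified pattern, and the `packed` sample line are assumptions made up for this comparison.

```python
import re


def floats_by_split(text):
    """Keep every whitespace-separated token that float() accepts."""
    values = []
    for token in text.split():
        try:
            values.append(float(token))
        except ValueError:
            pass  # not a number, skip it
    return values


def floats_by_regex(text):
    """Find numbers anywhere in the text, even without spaces around them."""
    pattern = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')
    return [float(m) for m in pattern.findall(text)]


spaced = 'Name of value 1.111E-11   Next Name 444.4'
packed = 'value=1.111E-11, next=444.4'  # no spaces around the numbers

print(floats_by_split(spaced))
print(floats_by_split(packed))  # every token has extra characters attached
print(floats_by_regex(packed))
```

On the whitespace-delimited line both approaches agree. On the packed line the split approach finds nothing, because tokens such as `value=1.111E-11,` are not valid input for `float()`.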

I'd use Regex:

import re
s = 'Name of value 1.111E-11   Next Name 444.4'
print([float(x) for x in re.findall(r"-?\d+\.?\d*(?:[Ee]-?\d+)?", s)])

output:

[1.111e-11, 444.4]
  • You need a `-?` at the beginning of the pattern to catch negative numbers. – Brionius Aug 09 '13 at 18:01
  • Thanks! This works great. I've read through the Python documentation on Regex, but I can't decipher "-?\d+.?\d*(?:[Ee]-\d+)?". Could you give a brief explanation, or point me to where I can find one? – Josh Melson Aug 09 '13 at 19:20
  • @JoshMelson - I added an explanation of the regex pattern I used at the bottom of my answer. It's very similar to iCodez's pattern. – Brionius Aug 09 '13 at 21:42