
I have a string containing variable names and values. There is no designated separator between the names and the values and the names may or may not contain underscores.

string1 = 'Height_A_B132width_top100.0lengthsimple0.00001'

I would like to get the variables into a dictionary:

# desired output: dict1 = {'Height_A_B': 132, 'width_top': 100.0, 'lengthsimple': 0.00001}

Trying the following itertools method

Input1:

from itertools import groupby
[''.join(g) for _, g in groupby(string1, str.isdigit)]

Output1:

['Height_A_B', '132', 'width_top', '100', '.', '0', 'lengthsimple', '0', '.', '00001']

The following should almost get there, but the IPython interpreter tells me this str attribute doesn't exist, even though it is in the docs. Anyway...

Input2:

[''.join(g) for _, g in groupby(string1, str.isnumeric)]

Output2:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-cf931a137f50> in <module>()
----> 1 [''.join(g) for _, g in groupby(string1, str.isnumeric)]

AttributeError: type object 'str' has no attribute 'isnumeric'

Anyway, what would happen if the number contained an exponent with a '+' or a '-' symbol?

string2 = 'Height_A132width_top100.0lengthsimple1.34e+003'
# desired output: dict2 = {'Height_A': 132, 'width_top': 100.0, 'lengthsimple': 1.34e+003}

Input3:

[''.join(g) for _, g in groupby(string2, str.isdigit)]

Output3:

['Height_A', '132', 'width_top', '100', '.', '0', 'lengthsimple', '1', '.', '34', 'e+', '003']

I wonder if someone has an elegant solution?

UPDATE: There is some discussion below about preserving the types of the numerical variables (e.g. int, float etc.). In fact the scientific notation in string2 turned out to be a bit of a red herring because if you create a variable

>>> a = 1.34e+003

you get

>>> print a
1340.0

anyway, so the chance of producing a string with 1.34e+003 in it is low.

So string2 is a more appropriate test case if we change it to, say

string2 = 'Height_A132width_top100.0lengthsimple1.34e+99'
feedMe

4 Answers


You can use the regex ([^\d.]+)(\d[\d.e+-]*):

  1. [^\d.] matches any character except a digit or a period.
  2. + means one or more of them.
  3. The second group needs at least one digit, followed by any run of digits, periods, e, + or -.

Group 1 is the key, group 2 is the value.
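To see what the two groups capture before any type conversion, here is a minimal sketch run against string2 from the question. The name raw_pairs is only illustrative.

```python
import re

# Example input taken from the question.
string2 = 'Height_A132width_top100.0lengthsimple1.34e+003'

# Group 1 captures the name, group 2 the still-unconverted number text.
# raw_pairs is a hypothetical name used for this illustration.
raw_pairs = re.findall(r'([^\d.]+)(\d[\d.e+-]*)', string2)
print(raw_pairs)
# [('Height_A', '132'), ('width_top', '100.0'), ('lengthsimple', '1.34e+003')]
```

Because e is part of the value's character class, a name that starts with e directly after a number (say 1.5extra2) would lose its first letter to the preceding value, so this probably only suits data where that cannot happen.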


Code:

import re
# Convert to float if the value has a decimal point or an exponent, otherwise to int.
vals = {x: float(y) if ('.' in y or 'e' in y) else int(y) for x, y in re.findall(r'([^\d.]+)(\d[\d.e+-]*)', string2)}

{'width_top': 100.0, 'Height_A': 132, 'lengthsimple': 1340.0}
Ali Nikneshan
  • This works great for string1, but doesn't seem to work for string2, because of the exponent in the last value. – feedMe Dec 17 '15 at 14:17
  • code and sample updated, still not strictly true. but you can expand the regex to accept only correct scientific numbers – Ali Nikneshan Dec 17 '15 at 14:22
  • Now working with string2, thanks! Great solution, but I wonder if there is another way without the regex wizardry? Nonetheless, I will go away and study regex now :) – feedMe Dec 17 '15 at 14:25
  • Although, the values in the dict are stored as strings. – feedMe Dec 17 '15 at 14:33
  • This is really great and concise, thanks, however the stated desired output is dict2 = {'Height_A_B': 132, 'width_top': 100.0, 'lengthsimple': 1.34e+003}, where the numerical variables preserve their type, e.g. int, scientific notation etc. – feedMe Dec 17 '15 at 17:18
  • this can be done for int and float but not scientific notation. – Ali Nikneshan Dec 17 '15 at 17:21
  • Yes I thought about the scientific notation part some more and updated my "string2" test case... please see the question update! – feedMe Dec 17 '15 at 17:27

Handling numbers in scientific notation makes this a little tricky, but it's possible with a carefully-written regex. Hopefully, my regex behaves correctly on all data. :)

import re

def parse_numstr(s):
    ''' Convert a numeric string to a number.
    Return an integer if the string is a valid representation of an integer,
    otherwise return a float if it's a valid representation of a float,
    otherwise return the original string. '''
    try:
        return int(s)
    except ValueError:

        try:
            return float(s)
        except ValueError:
            return s

pat = re.compile(r'([A-Z_]+)([-+]?[0-9.]+(?:e[-+]?[0-9]+)?)', re.I)

def extract(s):
    return dict((k, parse_numstr(v)) for k,v in pat.findall(s))

data = [
    'Height_A_B132width_top100.0lengthsimple0.00001',
    'Height_A132width_top100lengthsimple1.34e+003',
    'test_c4.2E1p-3q+5z123E-2e2.71828',
]

for s in data:
    print(extract(s))

output

{'Height_A_B': 132, 'width_top': 100.0, 'lengthsimple': 1.0000000000000001e-05}
{'width_top': 100, 'Height_A': 132, 'lengthsimple': 1340.0}
{'q': 5, 'p': -3, 'z': 1.23, 'test_c': 42.0, 'e': 2.71828}

Note that my regex will accept malformed numbers in scientific notation that contain multiple decimal points, which parse_numstr will just return as strings. That shouldn't be a problem if your data doesn't contain such malformed numbers.
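As a small illustration of that fallback, the sketch below feeds the same pattern a made-up input ('ok1.5bad1.2.3', not from the question) whose second value has two decimal points. parse_numstr is rewritten compactly here, but is meant to behave like the version above.

```python
import re

def parse_numstr(s):
    """Return an int if possible, else a float, else the original string."""
    for convert in (int, float):
        try:
            return convert(s)
        except ValueError:
            pass
    return s

# Same pattern as in the answer above.
pat = re.compile(r'([A-Z_]+)([-+]?[0-9.]+(?:e[-+]?[0-9]+)?)', re.I)

# Hypothetical input: 'bad' is followed by a malformed number.
result = dict((k, parse_numstr(v)) for k, v in pat.findall('ok1.5bad1.2.3'))
print(result)
# {'ok': 1.5, 'bad': '1.2.3'}
```

The malformed value survives as the string '1.2.3', so a caller can detect it with a type check if needed.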

Here's a slightly better regex. It only allows a single decimal point, but because every part of the number is optional it will also accept malformed numbers with no digits on either side of the decimal point, like . or .E1, and even an empty numeric part.

pat = re.compile(r'([A-Z_]+)([-+]?[0-9]*\.?[0-9]*(?:e[-+]?[0-9]+)?)', re.I)
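If matching digit-free "numbers" is a concern, one possible tightening is to require at least one digit in the mantissa. This is only a sketch under that assumption, not something tested beyond the sample strings; strict_pat is a hypothetical name.

```python
import re

# Mantissa must be digits with an optional fraction ('12', '12.', '12.5')
# or a leading dot followed by digits ('.5'); the exponent stays optional.
strict_pat = re.compile(
    r'([A-Z_]+)([-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:e[-+]?[0-9]+)?)',
    re.I,
)

print(strict_pat.findall('Height_A_B132width_top100.0lengthsimple0.00001'))
# [('Height_A_B', '132'), ('width_top', '100.0'), ('lengthsimple', '0.00001')]

print(strict_pat.findall('test_c4.2E1p-3q+5z123E-2e2.71828'))
# [('test_c', '4.2E1'), ('p', '-3'), ('q', '+5'), ('z', '123E-2'), ('e', '2.71828')]

print(strict_pat.findall('x.'))
# []
```

The captured strings can then be passed through parse_numstr as before.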

Also see this answer for a regex that captures numbers in scientific notation.

PM 2Ring
  • You could just `return float(s)`, it would work for either – Padraic Cunningham Dec 17 '15 at 14:58
  • @PadraicCunningham: True, but the OP's desired output has integers, eg 123, so I felt it was appropriate to return integers when possible. – PM 2Ring Dec 17 '15 at 15:04
  • Maybe but I think they are just using both types as that is what is in their string but essentially 132 == 132.0 – Padraic Cunningham Dec 17 '15 at 15:32
  • I personally think it is always worth preserving the numerical type where possible! This avoids confusion when comparing in conditionals etc. Thanks for your effort. – feedMe Dec 17 '15 at 17:13
  • @feedMe, unless you are using something like isinstance, where you are actually checking the type, it is not going to make a difference. Calling float is also not going to change `1.34e+99`, so I'm not sure why you think that is relevant – Padraic Cunningham Dec 17 '15 at 19:26
  • @PadraicCunningham here is a scenario where it makes a difference; I create a directory based on an int variable "folder72". Later, I access that variable somehow and it is now a float. I try to find "folder72.0", which doesn't exist. Perhaps you consider this contrived but these kinds of issues occur on a daily basis, and one way to avoid them, other than remembering the original type and manually recasting, is to handle them in a way that they preserve their type. – feedMe Dec 17 '15 at 19:28
  • @feedMe, why would you be casting to int or float if they were directory names? I still don't see how your edit regarding `1.34e+99` is relevant either, there is no `scientific("string")` function and `int("1.34e+9")` could never work – Padraic Cunningham Dec 17 '15 at 19:29
  • @PadraicCunningham Whether or not you can imagine a reason for doing this is irrelevant :) I have proposed a problem and asked for a solution. There are many examples in my own work where variables are shared through unorthodox ways e.g. communicating between two commercial software packages. These packages might independently use the variables and then try to write/access the same file/folder. This is unfortunately a reality. – feedMe Dec 18 '15 at 08:50
  • Re: the scientific("string") edit, my point was exactly that there is no scientific type, that my original value would be printed to screen without the exponent. Therefore, it is more relevant to use a test string (string2) where the value is included as python would add it using `"name_of_variable%s" % value`. This way the dict output will match the original string. – feedMe Dec 18 '15 at 08:52

Here you go:

import re
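p = re.compile(r'([a-zA-Z_]+)([0-9.]+)')  # A-Z rather than A-z; the underscore is listed explicitly
test_str = "Height_A_B132width_top100.0lengthsimple0.00001"

# Note: the values in the resulting dict are still strings.
print(dict(p.findall(test_str)))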
masnun

This simple regex will work:

[0-9.+e]+|\D+

To create your dicts:

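import re

def pairs(s):
    # Raw string, so that \D reaches the regex engine unchanged.
    mtch = re.finditer(r"[0-9.+e]+|\D+", s)
    # Matches alternate between name and value, so consume them two at a time.
    m1, m2 = next(mtch, ""), next(mtch, "")
    while m1:
        yield m1.group(), float(m2.group())
        m1, m2 = next(mtch, ""), next(mtch, "")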

Demo:

In [27]: s =  'Height_A_B132width_top100.0lengthsimple0.00001'

In [28]: print(dict(pairs(s)))
{'Height_A_B': 132.0, 'width_top': 100.0, 'lengthsimple': 1e-05}

In [29]: s = 'Height_A132width_top100.0lengthsimple1.34e+003'

In [30]: print(dict(pairs(s)))
{'width_top': 100.0, 'Height_A': 132.0, 'lengthsimple': 1340.0}

Or for a more general approach, you could use ast.literal_eval to parse the values to work for multiple types:

import re
from ast import literal_eval

def pairs(s):
    mtch = re.finditer(r"[0-9.+e]+|\D+", s)
    m1, m2 = next(mtch, ""), next(mtch, "")
    while m1:
        # literal_eval returns an int for '132' and a float for '100.0'.
        yield m1.group(), literal_eval(m2.group())
        m1, m2 = next(mtch, ""), next(mtch, "")

This keeps ints as ints and floats as floats, if you are really concerned about the difference:

In [31]: s = 'Height_A132width_top100.0lengthsimple1.34e+99'

In [32]: dict(pairs(s))
Out[32]: {'Height_A': 132, 'lengthsimple': 1.34e+99, 'width_top': 100.0}
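One caveat: the character class above does not include -, so a value with a negative exponent such as 1.34e-05 would be split into separate matches. A possible variant, sketched here with two explicit capture groups instead of alternating matches, might look like this. The pattern and the name PAIR are assumptions for illustration, and the input string is made up.

```python
import re
from ast import literal_eval

# Assumed pattern: the name is a run of characters that are neither digits
# nor dots; the value is a digit followed by digits, dots, 'e', '+' or '-'.
PAIR = re.compile(r"([^\d.]+)(\d[\d.e+-]*)")

def pairs(s):
    # Each findall item is a (name, value_text) tuple.
    for name, value in PAIR.findall(s):
        yield name, literal_eval(value)

parsed = dict(pairs('Height_A132width_top100.0lengthsimple1.34e-05'))
print(parsed)
# {'Height_A': 132, 'width_top': 100.0, 'lengthsimple': 1.34e-05}
```

Note that literal_eval follows Python's literal syntax, so an integer written with leading zeros (for example 007) would raise an error rather than parse.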
Padraic Cunningham