Python programmatically parse complex string

Question

I'm not sure this can be done programmatically. I've tried various examples found on SO but an expert may look and quickly say 'stop wasting your time'.

Example 2 <-- LOOK AT THIS ONE

Using Python
df containing: 21000 rows × 26 columns
1 column contains a "short description" as shown in the two examples below but there are many different sets of data.
Ideally, the items in blue (linked images) would become column names, and the data in white would be in the row. If any of the 2100 items have a value for a named column, it would populate correctly.
However, if I can only pick one df element and parse it into two columns (left column blue content, right column white content) that would be acceptable.

Parse:

"Connection=Cable with connector, M12x1-Male, 4-pin, 0.30 m, Version=Background light, Dimension=43 x 9.5 x 64.5 mm, Rated operating voltage Ue DC=24 V, Current draw max.=208 mA, Operating mode=Normal, Material=Aluminum anodized, black Glass PMMA, Illumination area=25 x 25 mm, Light type=LED Red light, Wave length=617 nm, Illuminence (0.1 m)=350 Lux, Beam angle=40 ° x 40 °, Ambient temperature=-10...55 °C, Approval/Conformity=CE; EAC; WEEE, IP rating=IP54"

or

Product Group=HF (13.56 MHz), Product name=WLAN, Dimension=100 x 51 x 265 mm, Antenna type=Rod, Supported data carrier types=DIN ISO 14443; DIN ISO 15693, Display=TFT Touchscreen-Display (color) 480x640 VGA resolution, Keypad=52 keys, alphanumeric, Operating voltage Ub=3.7 V DC Battery pack, Storage temperature=-40...60 °C, Ambient temperature=-10...50 °C, IP rating=IP65, Approval/Conformity=CE; WEEE

or...

Example 1

Excel file with additional "short descriptions"

If you can come up with an unambiguous algorithm to parse the data, then you can likely write a Python program to do the same. — AMC, Aug 06 '20 at 23:26
Perhaps I worded my post poorly; after trying several 'unambiguous algorithm's' I was unsuccessful though I am a novice. Does a pattern jump out to the trained eye on SO? I thank @das-g for giving it a shot..it's closer. — w100bxc, Aug 07 '20 at 01:19

score -1 · Accepted Answer · answered Aug 07 '20 at 13:36

Somewhat hacky solution: string manipulation by trail and error

Observation: description-value pairs are delimited by , .

So let's try to split the string at these delimeters:

line = "Connection=Cable with connector, M12x1-Male, 4-pin, 0.30
 m, Version=Background light, Dimension=43 x 9.5 x 64.5 mm, Rated operating volt
age Ue DC=24 V, Current draw max.=208 mA, Operating mode=Normal, Material=Alumin
um anodized, black Glass PMMA, Illumination area=25 x 25 mm, Light type=LED Red 
light, Wave length=617 nm, Illuminence (0.1 m)=350 Lux, Beam angle=40 ° x 40 °, 
Ambient temperature=-10...55 °C, Approval/Conformity=CE; EAC; WEEE, IP rating=IP
54"

line.split(', ')

Were there any commas (,) not followed by a space? Let's check whether the split result still contains any commas:

>>> any(',' in part for part in line.split(', '))
False

Alright.

Observation: description and value are separated by =.

Let's check whether all parts we identified contain a =:

>>> all('=' in x for x in line.split(', '))
False

Huh. What happened? Let's look at the complete result:

>>> line.split(', ')
['Connection=Cable with connector',
 'M12x1-Male',
 '4-pin',
 '0.30 m',
 'Version=Background light',
 'Dimension=43 x 9.5 x 64.5 mm',
 'Rated operating voltage Ue DC=24 V',
 'Current draw max.=208 mA',
 'Operating mode=Normal',
 'Material=Aluminum anodized',
 'black Glass PMMA',
 'Illumination area=25 x 25 mm',
 'Light type=LED Red light',
 'Wave length=617 nm',
 'Illuminence (0.1 m)=350 Lux',
 'Beam angle=40 ° x 40 °',
 'Ambient temperature=-10...55 °C',
 'Approval/Conformity=CE; EAC; WEEE',
 'IP rating=IP54']

Aha: Seems like there are values that contain , :

Cable with connector, M12x1-Male, 4-pin, 0.30 m
Aluminum anodized, black Glass PMMA

and these were also split.

Let's simply rejoin those:

fake_parts = line.split(', ')
real_parts = []

for part in fake_parts:
    if '=' in part:
        real_parts.append(part)
    else:
        real_parts[-1] += f', {part}'

How does that look?

>>> real_parts
['Connection=Cable with connector, M12x1-Male, 4-pin, 0.30 m',
 'Version=Background light',
 'Dimension=43 x 9.5 x 64.5 mm',
 'Rated operating voltage Ue DC=24 V',
 'Current draw max.=208 mA',
 'Operating mode=Normal',
 'Material=Aluminum anodized, black Glass PMMA',
 'Illumination area=25 x 25 mm',
 'Light type=LED Red light',
 'Wave length=617 nm',
 'Illuminence (0.1 m)=350 Lux',
 'Beam angle=40 ° x 40 °',
 'Ambient temperature=-10...55 °C',
 'Approval/Conformity=CE; EAC; WEEE',
 'IP rating=IP54']

>>> all('=' in part for part in real_parts)
True

Much better!

Do all parts now contain exactly one =? Let's try by splitting them up:

>>> all(len(part.split('=')) == 2 for part in real_parts)
True

Good. With that, we can form a dictionary:

>>> from collections import OrderedDict

>>> OrderedDict(part.split('=') for part in real_parts)
OrderedDict([('Connection', 'Cable with connector, M12x1-Male, 4-pin, 0.30 m'),
             ('Version', 'Background light'),
             ('Dimension', '43 x 9.5 x 64.5 mm'),
             ('Rated operating voltage Ue DC', '24 V'),
             ('Current draw max.', '208 mA'),
             ('Operating mode', 'Normal'),
             ('Material', 'Aluminum anodized, black Glass PMMA'),
             ('Illumination area', '25 x 25 mm'),
             ('Light type', 'LED Red light'),
             ('Wave length', '617 nm'),
             ('Illuminence (0.1 m)', '350 Lux'),
             ('Beam angle', '40 ° x 40 °'),
             ('Ambient temperature', '-10...55 °C'),
             ('Approval/Conformity', 'CE; EAC; WEEE'),
             ('IP rating', 'IP54')])

or simply

>>> dict(part.split('=') for part in real_parts)
{'Ambient temperature': '-10...55 °C',
 'Approval/Conformity': 'CE; EAC; WEEE',
 'Beam angle': '40 ° x 40 °',
 'Connection': 'Cable with connector, M12x1-Male, 4-pin, 0.30 m',
 'Current draw max.': '208 mA',
 'Dimension': '43 x 9.5 x 64.5 mm',
 'IP rating': 'IP54',
 'Illumination area': '25 x 25 mm',
 'Illuminence (0.1 m)': '350 Lux',
 'Light type': 'LED Red light',
 'Material': 'Aluminum anodized, black Glass PMMA',
 'Operating mode': 'Normal',
 'Rated operating voltage Ue DC': '24 V',
 'Version': 'Background light',
 'Wave length': '617 nm'}

Now that's something you can probably work with. However, this approach is fragile:

What, if some descriptions also contain , ?
What, if descriptions or values contain =?
What if the format contains specific escape sequences?

Proper solution: use a parser

To correctly interpret data encoded as text according to an elaborate (or not so elaborate) set of rules, use a parser library. See e.g. How best to parse a simple grammar? for options.

This though requires you to specify the exact set of rules (called "grammar") that govern the encoding, and thus to also know these rules. Whether and how well these rules can be derived by looking at the encoded data, depends on that data.

Thanks for your help; it looks like my long data set is too unreliable. I found there are some ',' inside the descriptions just like you warned. However I'm sure your comments will help others faced with the same problem. — w100bxc, Aug 07 '20 at 17:19

Python programmatically parse complex string

1 Answers1

Somewhat hacky solution: string manipulation by trail and error

Proper solution: use a parser