0

I am given a raw string that is a path, or "direction", to a string in JSON. I need the following string converted to a list containing dictionaries.

st = """data/policy/line[Type="BusinessOwners"]/risk/coverage[Type="FuelHeldForSale"]/id"""

The list should look like this

paths = ['data','policy','line',{'Type':'BusinessOwners'},'risk','coverage',{"Type":"FuelHeldForSale"},"id"]

I then iterate over this list to find the object in the JSON (which is in a Spark RDD)
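To make the goal concrete, here is a minimal sketch of that lookup on a plain Python dict (no Spark). The helper name `walk` and the `sample` document are made up for illustration, and the meaning of a dictionary step (pick the first list element whose fields match) is an assumption about the intended semantics.

```python
def walk(doc, paths):
    """Follow `paths` through nested dicts/lists (hypothetical helper)."""
    node = doc
    for step in paths:
        if isinstance(step, dict):
            # Assumption: a dict step selects the first list element whose
            # fields match every key/value pair in the step.
            node = next(
                elem for elem in node
                if all(elem.get(k) == v for k, v in step.items())
            )
        else:
            # A plain string step is an ordinary key lookup.
            node = node[step]
    return node

# Made-up sample document shaped to fit the example path.
sample = {"data": {"policy": {"line": [
    {"Type": "Auto", "risk": {}},
    {"Type": "BusinessOwners", "risk": {"coverage": [
        {"Type": "FuelHeldForSale", "id": "cov-1"},
    ]}},
]}}}

paths = ['data', 'policy', 'line', {'Type': 'BusinessOwners'},
         'risk', 'coverage', {'Type': 'FuelHeldForSale'}, 'id']
result = walk(sample, paths)
```

With this sample, `result` is the `id` of the matching coverage entry. If no element matches a dictionary step, `next` raises `StopIteration`.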

I attempted st.split('/'), which gave me

st.split('/')
Out[370]: 
['data',
 'policy',
 'line[Type="BusinessOwners"]',
 'risk',
 'coverage[Type="FuelHeldForSale"]',
 'id']

But how do I convert and split items like 'line[Type="BusinessOwners"]' into 'line', {'Type': 'BusinessOwners'}?

mdeonte001
  • Hi. Did you try using eval()? Can you try this out: st_new = eval(st), then print st_new. I hope this works! – Shrinivas Deshmukh Mar 16 '18 at 04:35
  • Hi! That did not work @ShrinivasDeshmukh data/policy/line[Type="BusinessOwners"]/risk/coverage[Type="FuelHeldForSale"]/id ^ SyntaxError: invalid syntax – mdeonte001 Mar 16 '18 at 04:37
  • Please refer to this link, a similar problem has been discussed here: https://stackoverflow.com/questions/36068779/how-to-convert-a-string-containing-a-list-of-dict-into-python-object – Shrinivas Deshmukh Mar 16 '18 at 04:43
  • @mdeonte001 --- You should be a lot more specific as to what you want if you want people to use their time to solve your problem. If you want a dictionary in your list then state it instead of leaving others to read your mind! – Michael Swartz Mar 16 '18 at 05:25
  • @MichaelSwartz please see above, i state 'list containing dictionaries..' and in my example i show a dictionary. – mdeonte001 Mar 16 '18 at 15:37

4 Answers

1

This would be more efficient if it weren't a one-liner (each matching word is split twice), but I'll let you figure that out from here. You'll probably want a more robust regex-based parser if your input varies more than the given schema, or just use a standardized data model like JSON.

import re

[word if '=' not in word else {word.split('=')[0]: word.split('=')[1]} for word in re.split(r'[/\[]', st.replace(']', '').replace('"', ''))]

['data', 'policy', 'line', {'Type': 'BusinessOwners'}, 'risk', 'coverage', {'Type': 'FuelHeldForSale'}, 'id']
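For reference, the same logic unrolled into a loop might look like the sketch below. The function name `parse_path` is hypothetical; it splits each `key=value` word only once, and it still assumes the exact schema shown in the question.

```python
import re

def parse_path(st):
    """Hypothetical helper: split a path into names and {key: value} filters."""
    # Drop closing brackets and quotes so only '/' and '[' remain as separators.
    cleaned = st.replace(']', '').replace('"', '')
    paths = []
    for word in re.split(r'[/\[]', cleaned):
        if '=' in word:
            key, value = word.split('=', 1)  # split once, reuse both halves
            paths.append({key: value})
        else:
            paths.append(word)
    return paths

st = 'data/policy/line[Type="BusinessOwners"]/risk/coverage[Type="FuelHeldForSale"]/id'
paths = parse_path(st)
```

Here `paths` is the same list as the one-liner's output above.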

TTT
1
import json

first_list = st.replace('[', '/{"').replace(']', '}').replace('="', '": "').split('/')
[item if "{" not in item else json.loads(item) for item in first_list]

or using ast.literal_eval

import ast

[item if "{" not in item else ast.literal_eval(item) for item in first_list]


out:
['data',
 'policy',
 'line',
 {'Type': 'BusinessOwners'},
 'risk',
 'coverage',
 {'Type': 'FuelHeldForSale'},
 'id']
Rahul
  • Hi, I ran this and I got the error AttributeError: 'str' object has no attribute 'literal_eval' - what is ast.literal_eval(item) should that be st.literal? – mdeonte001 Mar 16 '18 at 04:47
  • sorry. you need to import ast first. please check again. – Rahul Mar 16 '18 at 04:48
  • Not a huge fan of using literal_eval, but this is much better. – Mad Physicist Mar 16 '18 at 05:11
  • This works! Thank you. @MadPhysicist care to share why you dislike literal_eval? – mdeonte001 Mar 16 '18 at 05:13
  • It certainly does work. I am wary of literal_eval because of things like [this gist](https://gist.github.com/Aran-Fey/2667cef9420e930e57d80187a76e35e4). I am not 100% sure if it can be done exactly with literal_eval, but I would rather not take a chance. – Mad Physicist Mar 16 '18 at 05:18
  • Thank you both - i really appreciate it. @MadPhysicist and Rahul I'm going to benchmark each of these in Spark and see what performs fastest. – mdeonte001 Mar 16 '18 at 15:39
0

Regular expressions may be a good tool here. It looks like you want to transform elements that look like text1[text2="text3"] into text1, {text2: text3}. The regex would look something like this:

(\w+)\[(\w+)=\"(\w+)\"\]

You can modify this expression in any number of ways. For example, you could use something other than \w+ for the names, and insert \s* to allow optional whitespace wherever you want.
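As one possible variation (an illustration only, not a requirement of the question), the sketch below allows optional whitespace and accepts any characters except a double quote in the value. The name `relaxed` and the test string are made up.

```python
import re

# Strict pattern from above, for comparison.
strict = re.compile(r'(\w+)\[(\w+)=\"(\w+)\"\]')

# Assumed variation: \s* wherever whitespace may appear, [^"]* for the value.
relaxed = re.compile(r'\s*(\w+)\s*\[\s*(\w+)\s*=\s*"([^"]*)"\s*\]\s*')

item = 'coverage[ Type = "Fuel Held-For-Sale" ]'
strict_match = strict.fullmatch(item)    # None: spaces and '-' are not \w
relaxed_match = relaxed.fullmatch(item)  # groups: name, key, value
```

Note that a value containing '/' would still be broken up by an earlier st.split('/'), so this relaxation only helps for values without slashes.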

The next thing to keep in mind is that when you do find a match, you need to expand your list. The easiest way to do that would be to just create a new list and append/extend it:

import re

paths = []
pattern = re.compile(r'(\w+)\[(\w+)=\"(\w+)\"\]')
for item in st.split('/'):
    match = pattern.fullmatch(item)
    if match:
        paths.append(match.group(1))
        paths.append({match.group(2): match.group(3)})
    else:
        paths.append(item)

This produces a paths list of

['data', 'policy', 'line', {'Type': 'BusinessOwners'}, 'risk', 'coverage', {'Type': 'FuelHeldForSale'}, 'id']


I personally like to split the functionality of my code into pipelines of functions. In this case, I would have the main loop accumulate the paths list based on a function that returned replacements for the split elements:

def get_replacement(item):
    match = pattern.fullmatch(item)
    if match:
        return match.group(1), {match.group(2): match.group(3)}
    return item,

paths = []
for item in st.split('/'):
    paths.extend(get_replacement(item))

The comma in return item, is very important. It makes the return value into a tuple, so you can use extend on whatever the function returns.
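A small sketch of the difference, using two throwaway functions (`with_comma` and `without_comma` are hypothetical names):

```python
def with_comma(item):
    return item,   # trailing comma: returns a one-element tuple

def without_comma(item):
    return item    # returns the bare string

a, b = [], []
a.extend(with_comma('id'))     # extends with the single element 'id'
b.extend(without_comma('id'))  # iterates the string character by character
```

After this runs, `a` is `['id']` while `b` is `['i', 'd']`, because extend iterates over whatever it is given.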


Mad Physicist
0

Let's do it in one line:

import re

pattern=r'(?<=Type=)\"(\w+)'
data="""data/policy/line[Type="BusinessOwners"]/risk/coverage[Type="FuelHeldForSale"]/id"""


print([{'Type': re.search(pattern, i).group().replace('"', '')} if '=' in i else i for i in re.split(r'/|\[', data)])

output:

['data', 'policy', 'line', {'Type': 'BusinessOwners'}, 'risk', 'coverage', {'Type': 'FuelHeldForSale'}, 'id']
Aaditya Ura