I have this dataframe:

import pandas as pd

df = pd.DataFrame([
  { "name": "foo", "results": "{'lower_bound': '15000', 'upper_bound': '19999'}"},
  { "name": "bar", "results": "{'lower_bound': '10000', 'upper_bound': '14999'}"}
])

Currently it looks like this:

    name    results
0   foo     {'lower_bound': '15000', 'upper_bound': '19999'}
1   bar     {'lower_bound': '10000', 'upper_bound': '14999'}

I'd like to convert the results column to structured data, so that I can filter for rows where lower_bound is less than 13000.

How can I do this?

cs95
Richard
  • Untested, but I think you could use `from pandas.io.json import json_normalize` and then call the function on your column; `explode` is another function you could look at as well – Umar.H Dec 08 '19 at 18:46
  • Thanks. The first step might be to convert them from a string field though? – Richard Dec 08 '19 at 18:54
  • You were right, thanks! You inadvertently taught me something new. – Umar.H Dec 08 '19 at 19:43

5 Answers


This might not be the optimal or best way:

from ast import literal_eval
df['results'] = df['results'].apply(literal_eval)

This converts the string-formatted values in the 'results' column to dictionaries. Output:

  name                                           results
0  foo  {'lower_bound': '15000', 'upper_bound': '19999'}
1  bar  {'lower_bound': '10000', 'upper_bound': '14999'}

Then :

df['results'] = df['results'].apply(lambda i: {k: int(v) for k, v in i.items()})

This converts the dictionary values in that column from strings to integers. Output:

  name                                       results
0  foo  {'lower_bound': 15000, 'upper_bound': 19999}
1  bar  {'lower_bound': 10000, 'upper_bound': 14999}

Now, get the rows where the dictionary's lower_bound key has a value < 13000, handling the case where a row has no lower_bound key. Note the default passed to .get must be an int (not the string '0'), otherwise the comparison with 13000 raises a TypeError:

df.loc[[i.get('lower_bound', 0) < 13000 for i in df['results']]]

Output :

  name                                       results
1  bar  {'lower_bound': 10000, 'upper_bound': 14999}
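For reference, the three steps above can be chained into a single pass. A compact sketch of the same idea (my rewrite, using an integer default for missing keys):

```python
import pandas as pd
from ast import literal_eval

df = pd.DataFrame([
    {"name": "foo", "results": "{'lower_bound': '15000', 'upper_bound': '19999'}"},
    {"name": "bar", "results": "{'lower_bound': '10000', 'upper_bound': '14999'}"},
])

# Parse the strings into dicts, cast every value to int,
# then keep only the rows whose lower_bound is below 13000
parsed = df['results'].apply(literal_eval).apply(
    lambda d: {k: int(v) for k, v in d.items()})
out = df[[d.get('lower_bound', 0) < 13000 for d in parsed]]
print(out)
```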
Arkistarvh Kltzuonstev

If you don't want to modify your existing DataFrame, you can build a condition by converting "results" into a DataFrame, and then filter:

import ast

mask = (pd.DataFrame(df['results'].apply(ast.literal_eval).tolist())
          .astype(int)
          .eval('lower_bound < 13000'))
df[mask]

  name                                           results
1  bar  {'lower_bound': '10000', 'upper_bound': '14999'}

Another option would be to assign this back to the DataFrame via concat:

df = pd.concat([
      df, 
      pd.DataFrame(df.pop('results').apply(ast.literal_eval).tolist()).astype(int)
  ], 
  axis=1)
df[df['lower_bound'] < 13000]

  name lower_bound upper_bound
1  bar       10000       14999
cs95

Another method is to use `json_normalize` and `ast.literal_eval`:

from pandas.io.json import json_normalize
from ast import literal_eval

Then split out the JSON column and merge it back on the indices. Note that `.astype(int)` must be applied to the normalized frame, not to the series of dicts:

s = json_normalize(df['results'].apply(literal_eval)).astype(int)
df_new = pd.merge(df, s, right_index=True, left_index=True)
print(df_new)

  name                                           results  lower_bound  upper_bound
0  foo  {'lower_bound': '15000', 'upper_bound': '19999'}        15000        19999
1  bar  {'lower_bound': '10000', 'upper_bound': '14999'}        10000        14999

And for the dtypes:

print(df_new.dtypes)

name           object
results        object
lower_bound     int32
upper_bound     int32
dtype: object
Umar.H

You can convert the string expression to a JSON-compatible format by replacing the single quotes with double quotes, then parse the JSON into a dict:

import json
import pandas as pd

df = pd.DataFrame([
  { "name": "foo", "results": "{'lower_bound': '15000', 'upper_bound': '19999'}"},
  { "name": "bar", "results": "{'lower_bound': '10000', 'upper_bound': '14999'}"}
])
pd.concat([df, pd.DataFrame(df.pop('results').apply(lambda x: json.loads(x.replace("'", '"'))).tolist())], axis=1)

This is faster than using ast.literal_eval, and with many rows you will clearly see the difference in execution time. (Note that the quote replacement assumes the values themselves contain no single quotes.)
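A rough way to check that claim yourself, a micro-benchmark sketch on one of the question's strings (my addition, not the original poster's measurement):

```python
import ast
import json
import timeit

s = "{'lower_bound': '15000', 'upper_bound': '19999'}"

# Both parses should agree on this input
assert ast.literal_eval(s) == json.loads(s.replace("'", '"'))

# Time each approach over many repetitions
t_ast = timeit.timeit(lambda: ast.literal_eval(s), number=10_000)
t_json = timeit.timeit(lambda: json.loads(s.replace("'", '"')), number=10_000)
print(f"literal_eval: {t_ast:.4f}s, json.loads: {t_json:.4f}s")
```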

pakallis

You can try:

import pandas as pd
from ast import literal_eval

df = pd.DataFrame([
  { "name": "foo", "results": "{'lower_bound': '15000', 'upper_bound': '19999'}"},
  { "name": "bar", "results": "{'lower_bound': '10000', 'upper_bound': '14999'}"}
])
lower_bound = []
upper_bound = []
for index, row in df.iterrows():
    r = literal_eval(row['results'])  # safer than eval for untrusted strings
    lower_bound.append(r['lower_bound'])
    upper_bound.append(r['upper_bound'])
df['lower_bound'] = lower_bound
df['upper_bound'] = upper_bound
print(df[['name', 'lower_bound', 'upper_bound']])

result:

  name lower_bound upper_bound
0  foo       15000       19999
1  bar       10000       14999
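One caveat: the appended values are still strings, so a numeric filter on them would compare lexicographically. A possible follow-up step (my addition, not part of the answer above) casts them to int first:

```python
import pandas as pd

# The frame as produced by the loop above: bounds are still strings
df = pd.DataFrame({
    'name': ['foo', 'bar'],
    'lower_bound': ['15000', '10000'],
    'upper_bound': ['19999', '14999'],
})

# Cast both bound columns to int, then filter numerically
df[['lower_bound', 'upper_bound']] = df[['lower_bound', 'upper_bound']].astype(int)
filtered = df[df['lower_bound'] < 13000]
print(filtered)
```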
René