3

I've got a very large dataframe where one of the columns is a dictionary itself. (let's say column 12). In that dictionary is a part of a hyperlink, which I want to get.

In Jupyter, I want to display a table where I have column 0 and 2, as well as the completed hyperlink

I think I need to:

  • Extract that dictionary from the dataframe
  • Get a particular keyed value from it
  • Create the full hyperlink from the extracted value
  • Copy the dataframe and replace the column with the hyperlink created above

Let's just tackle step 1 and I'll make other questions for the next steps. How do I extract values from a dataframe into a variable I can play with?

import pytd
import pandas

client = pytd.Client(apikey=widget_api_key.value, database=widget_database.value)
results = client.query(query)
dataframe = pandas.DataFrame(**results)
dataframe
# Not sure what to do next
Christopher Pisz
  • 3,757
  • 4
  • 29
  • 65

1 Answers1

0

If you only want to extract one key from the dictionary and the dictionary is already stored as a dictionary in the column, you can do it as follows:

import numpy  as np
import pandas as pd

# assuming, your dicts are stored in column 'data'
# and you want to store the url in column 'url'
df['url']= df['data'].map(lambda d: d.get('url', np.NaN) if hasattr(d, 'get') else np.NaN)

# from there you can do your transformation on the url column

Testdata and results

df= pd.DataFrame({
        'col1': [1, 5, 6],
        'data': [{'url': 'http://foo.org', 'comment': 'not interesting'}, {'comment': 'great site about beer receipes, but forgot the url'}, np.NaN],
        'json': ['{"url": "http://foo.org", "comment": "not interesting"}', '{"comment": "great site about beer receipes, but forgot the url"}', np.NaN]
    }
)

# Result of the logic above:
   col1                                               data             url
0     1  {'url': 'http://foo.org', 'comment': 'not inte...  http://foo.org
1     5  {'comment': 'great site about beer receipes, b...             NaN
2     6                                                NaN             NaN

If you need to test, if your data is already stored in python dicts (rather than strings), you can do it as follows:

print(df['data'].map(type))

If your dicts are stored as strings, you can convert them to dicts first based on the following code:

import json

def get_url_from_json(document):
    if pd.isnull(document):
        url= np.NaN
    else:
        try:
            _dict= json.loads(document)
            url= _dict.get('url', np.NaN)
        except:
            url= np.NaN
    return url

df['url2']= df['json'].map(get_url_from_json)

# output:
print(df[['col1', 'url', 'url2']])   

   col1             url            url2
0     1  http://foo.org  http://foo.org
1     5             NaN             NaN
2     6             NaN             NaN
jottbe
  • 4,228
  • 1
  • 15
  • 31