0

First post...

Pardon the naivete but I'm struggling with what I thought was a simple problem based on this previous post Splitting dictionary/list inside a Pandas Column into Separate Columns

I'm trying to split a column in a dataframe from a CSV file into separate columns where the key becomes the column labels and the values are the indices.

Here are the first couple instances of said column.

                                             captionTime
0       {'startTimeMs': 113488, 'endTimeMs': 116700}
1       {'startTimeMs': 116742, 'endTimeMs': 121080}
2       {'startTimeMs': 121121, 'endTimeMs': 122706}
3       {'startTimeMs': 128462, 'endTimeMs': 129838}

When I run:

df2 = df['captionTime'].apply(pd.Series)

it only returns a series instead of two columns labelled 'startTimeMs' & 'endTimeMs'.

                                                   0
0       {'startTimeMs': 113488, 'endTimeMs': 116700}
1       {'startTimeMs': 116742, 'endTimeMs': 121080}
2       {'startTimeMs': 121121, 'endTimeMs': 122706}
3       {'startTimeMs': 128462, 'endTimeMs': 129838}

UPDATE

I was able to grab the original API code a colleague used to export the CSV file.

CSV Snippet:

captionTime,contentType,language,region,timedTextType
"{'startTimeMs': 5000, 'endTimeMs': 6708}",None,id,{},SUBS
"{'startTimeMs': 15875, 'endTimeMs': 19125}",None,id,{},SUBS
"{'startTimeMs': 19500, 'endTimeMs': 22875}",None,id,{},SUBS
"{'startTimeMs': 27791, 'endTimeMs': 30291}",None,id,{},SUBS

Out of curiousity, I tried my initial method of splitting the series before writing the data and it worked no problem. I know that r.json() returns a dictionary, so I'm assuming-- when pandas reads the CSV it's reading the captionTime column as a string, not a dictionary.

Input:

r = session.post("{}{}".format(endpoint, api), headers=headers, data=json.dumps(body), params=params)
r.raise_for_status()
rDict = r.json()

results = rDict['results']

df = pd.DataFrame(results)
df2 = df['captionTime'].apply(pd.Series)

print(df2)

Output:

     endTimeMs  startTimeMs
0         6708         5000
1        19125        15875
2        22875        19500
3        30291        27791

I may not always be able to pull the data myself or may receive files from other colleagues, how do I go about cleaning the file to properly split the dictionary?

D. Price
  • 37
  • 6
  • Can you do df.captionTime.apply(type), and show us the result – BENY Jan 17 '18 at 05:31
  • 2
    can you include the code to create the dataframe? e.g. `df = pd.DataFrame({'captionTime': [{'startTimeMs': 113488, 'endTimeMs': 116700}, {'startTimeMs': 113488, 'endTimeMs': 116700}]})` (note that this example splits correctly so there might be something fishy in your data) – maxymoo Jan 17 '18 at 05:34
  • The only way I can reproduce this behaviour with the dataframe given as above, is when the dicts are actually strings. That will still print as the first code block (i.e., no extra quotes), and thus look like a dict, but results in the single Series as given. Perhaps a bad read of JSON data or similar caused the strings to not be converted to dict? –  Jan 17 '18 at 05:37
  • @Wen `0 1 2 3 ` – D. Price Jan 17 '18 at 21:45
  • @Evert Updated the post-- but I believe you are correct, something funky in the read of the JSON. – D. Price Jan 17 '18 at 21:48

1 Answers1

0

The relevant entries are strings, not dictionaries. It looks like the JSON is not correctly formatted (*) (or there is a bug in the JSON parser).

I can suggest a few workarounds, but none are ideal.

Given the following data:

data = {'captionTime': [   "{'startTimeMs': 113488, 'endTimeMs': 116700}",
                           "{'startTimeMs': 116742, 'endTimeMs': 121080}",
                           "{'startTimeMs': 121121, 'endTimeMs': 122706}",
                           "{'startTimeMs': 128462, 'endTimeMs': 129838}",]}

and a DataFrame:

import pandas as pd
df = pd.DataFrame(data)

You can transform the relevant column entries into a dict in two ways (with caveats):

import ast
df['captionTime'] = df['captionTime'].apply(lambda x: ast.literal_eval(x))

The above relies on trust in the format of the entries. ast.literal_eval is certainly safer than simply eval, but may still cause problems.

An alternative is to parse it as JSON, but you'll need to turn the single quotes into double quotes(*):

import json
df['captionTime'] = df['captionTime'].apply(lambda x: json.loads(x.replace("'", '"')))

If there is a double quote inside two single quotes (i.e., for some reason, the key of a dictionary entry has a double quote), this fails.


(*) It may be that the JSON actually contains single quotes somewhere, where double quotes are expected, turning the data into a string instead of key-value pairs. Without knowing the actual data, that remains just a guess.