1

How to create a column out of list nested inside a dataframe column

I have a dataframe having values: Input data

I want to split the data inside a message column into sub columns such as

Output data

Till now i have split the data on the basis of comma

Work

As the data is not in json format i can't use json.loads on it. I split it using the code below

key = []
for i in df['text']:
i = i.replace("{", "")
i = i.replace("}", "")
for x in i.split(","):
    key.append(x.split(": "))
key[0]

With giving an output [' HashCode', '"650e4390:0"']

After this i am stuck how to creat a dataframe out of this as the pattern befor and after {} is not fixed ans changes.

Error while using ast and json

Error image

as string is not json as in some rows the } braces are missing

Community
  • 1
  • 1
pratha1995
  • 103
  • 1
  • 14

2 Answers2

2

Also you can use ast.literal_eval follwed by apply(pd.Series) and pd.concat :

import ast
df['x'] = df['x'].apply(lambda x : ast.literal_eval(x))
df1 = df['x'].apply(pd.Series)
res = pd.concat([df, df1], axis=1).drop('message', axis=1)
Sociopath
  • 13,068
  • 19
  • 47
  • 75
Loochie
  • 2,414
  • 13
  • 20
  • 1
    Solution is slow because bad performance of `apply(pd.Series)`, check [this](https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns/35491399#35491399) – jezrael Mar 28 '19 at 10:01
  • @jezrael: exactly, it's ~150 times slower than my solution and ~100 times slower than `json.loads` option that you suggested on a 20k column. I've added it to my benchmark – perl Mar 28 '19 at 10:05
0

The more efficient option is concatenating these values with str.cat and reading with read_json:

df = pd.DataFrame({'json': ['{"a": 1, "b": "foo"}', '{"a": 3, "b": "bar"}']})

x = pd.read_json(df['json'].str.cat(sep='\n'), lines=True)

print(x)

Output:

   a    b
0  1  foo
1  3  bar

Performance comparison with some other given solutions:

%%timeit
pd.read_json(df['json'].str.cat(sep='\n'), lines=True)
44.2 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.DataFrame([json.loads(x) for x in df['json']], index=df.index)
88.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.DataFrame([ast.literal_eval(x) for x in df['json']], index=df.index)
335 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['json'].apply(lambda x : ast.literal_eval(x)).apply(pd.Series)
6.83 s ± 212 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance tested on the following DataFrame:

df = pd.DataFrame({'json': ['{"a": 1, "b": 2}', '{"c": 3, "d": "z"}']*10000})
perl
  • 9,826
  • 1
  • 10
  • 22