How to create a subcolumns from a list specified inside a column

Question

How to create a column out of list nested inside a dataframe column

I have a dataframe having values:

I want to split the data inside a message column into sub columns such as

Till now i have split the data on the basis of comma

As the data is not in json format i can't use json.loads on it. I split it using the code below

key = []
for i in df['text']:
i = i.replace("{", "")
i = i.replace("}", "")
for x in i.split(","):
    key.append(x.split(": "))
key[0]

With giving an output [' HashCode', '"650e4390:0"']

After this i am stuck how to creat a dataframe out of this as the pattern befor and after {} is not fixed ans changes.

Error while using ast and json

as string is not json as in some rows the } braces are missing

How is created `DataFrame`? If from json, is possible add some sample json data to question? — jezrael, Mar 28 '19 at 09:35

score 2 · Answer 1 · edited Mar 28 '19 at 10:00

2

Also you can use ast.literal_eval follwed by apply(pd.Series) and pd.concat :

import ast
df['x'] = df['x'].apply(lambda x : ast.literal_eval(x))
df1 = df['x'].apply(pd.Series)
res = pd.concat([df, df1], axis=1).drop('message', axis=1)

edited Mar 28 '19 at 10:00

Sociopath

13,068
19
47
75

answered Mar 28 '19 at 09:39

Loochie

2,414
13
20

1

Solution is slow because bad performance of `apply(pd.Series)`, check [this](https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns/35491399#35491399) – jezrael Mar 28 '19 at 10:01
@jezrael: exactly, it's ~150 times slower than my solution and ~100 times slower than `json.loads` option that you suggested on a 20k column. I've added it to my benchmark – perl Mar 28 '19 at 10:05

perl · Answer 2 · 2019-03-28T10:02:48.733

The more efficient option is concatenating these values with str.cat and reading with read_json:

df = pd.DataFrame({'json': ['{"a": 1, "b": "foo"}', '{"a": 3, "b": "bar"}']})

x = pd.read_json(df['json'].str.cat(sep='\n'), lines=True)

print(x)

Output:

   a    b
0  1  foo
1  3  bar

Performance comparison with some other given solutions:

%%timeit
pd.read_json(df['json'].str.cat(sep='\n'), lines=True)
44.2 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.DataFrame([json.loads(x) for x in df['json']], index=df.index)
88.9 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.DataFrame([ast.literal_eval(x) for x in df['json']], index=df.index)
335 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['json'].apply(lambda x : ast.literal_eval(x)).apply(pd.Series)
6.83 s ± 212 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance tested on the following DataFrame:

df = pd.DataFrame({'json': ['{"a": 1, "b": 2}', '{"c": 3, "d": "z"}']*10000})

How to create a subcolumns from a list specified inside a column

2 Answers2