1

I have a dataframe like this:

artid  link     ner_label
1      url1     "{('blanqui', 'Person'): 6, ('walter benjamin', 'Person'): 2}"
2      url2     "{('john', 'Person'): 8, ('steven', 'Person'): 3}"

The type of each row of ner_label is string. I would like to have this:

artid   link     ner                label      score
    1   url1     'blanqui'         'Person'     6
    1   url1     'walter benjamin' 'Person'     2
    2   url2     'john'            'Person'     8
    2   url2     'steven'          'Person'     3   

How can I do this ? I really don't know how to do this.

jos97
  • 405
  • 6
  • 18
  • JSON does not support dictionary keys that are not strings. If you change it to a list of dicts or lists, you can use or adapt the following thread: https://stackoverflow.com/questions/25511765/pandas-expand-json-field-across-records – thshea Jan 07 '21 at 14:02

2 Answers2

1

Not the most efficient way but it will do the job for you

from ast import literal_eval

df['ner'] = df['ner_label'].apply(lambda x: list(literal_eval(x).keys()))
df['score'] = df['ner_label'].apply(lambda x: list(literal_eval(x).values()))

df = df.set_index(['artid', 'link', 'ner_label']).apply(pd.Series.explode).reset_index()

df['label'] = [i[1] for i in df['ner']]
df['ner'] = [i[0] for i in df['ner']]
df.drop(['ner_label'], axis=1, inplace=True)

Output:

  artid     link    ner             score   label
0   1      url1     blanqui            6    Person
1   1      url1     walter benjamin    2    Person
2   2      url2     john               8    Person
3   2      url2     steven             3    Person
Sociopath
  • 13,068
  • 19
  • 47
  • 75
  • Thank you, I tried with my data and I have this error: ValueError: malformed node or string: nan. For the the 2 lines: `df['ner'] = df['ner_label'].apply(lambda x: list(literal_eval(x).keys())) df['score'] = df['ner_label'].apply(lambda x: list(literal_eval(x).values()))` – jos97 Jan 07 '21 at 14:24
  • Try the above solution without `literal_eval` – Sociopath Jan 07 '21 at 17:18
1

Here is the solution with only pandas

df = df.assign(ner_label=df['ner_label'].str.split(', \(')).explode('ner_label')
df['ner_label']= df['ner_label'].str.replace('(','').str.replace('\)','').\
str.replace('\{','').str.replace('\}','').str.replace('\"','')

df[['ner','score']] = df.ner_label.str.split(':', expand=True)

df[['ner','label']] = df.ner.str.split(',', expand=True)

df.drop(columns='ner_label', inplace=True)

Output:

    artid   link    ner score   label
0   1   url1    'blanqui'   6   'Person'
0   1   url1    'walter benjamin'   2   'Person'
1   2   url2    'john'  8   'Person'
1   2   url2    'steven'    3   'Person'