0

I am trying to take out keys and values from json to separate rows in pandas

I have:

|---------------------|------------------|
|      session        |       scoring    |
|---------------------|------------------|
|      session1       | {id1:scoring1,   |
|                     |  id2:scoring2,   |
|                     |  id3:scoring3}   |   
|---------------------|------------------|
|      session2       |  {id4:scoring4,  |
|                     |   id5:scoring5}  |
|---------------------|------------------|

I would like to get:

|---------------------|------------------|---------------------|------------------|
|      session        |       scoring    |         id          |      score       |
|---------------------|------------------|---------------------|------------------|
|      session1       | {id1:scoring1,   |         id1         |      score1      |
|                     |  id2:scoring2,   |                     |                  |
|                     |  id3:scoring3}   |                     |                  |
|---------------------|------------------|---------------------|------------------|
|      session1       | {id1:scoring1,   |         id2         |      score2      |
|                     |  id2:scoring2,   |                     |                  |
|                     |  id3:scoring3}   |                     |                  |
|---------------------|------------------|---------------------|------------------|
|      session1       | {id1:scoring1,   |         id3         |      score3      |
|                     |  id2:scoring2,   |                     |                  |
|                     |  id3:scoring3}   |                     |                  |
|---------------------|------------------|---------------------|------------------|
|      session2       |  {id4:scoring4,  |         id4         |      score4      | 
|                     |   id5:scoring5}  |                     |                  |
|---------------------|------------------|---------------------|------------------|
|      session2       |  {id4:scoring4,  |         id5         |      score5      | 
|                     |   id5:scoring5}  |                     |                  |
|---------------------|------------------|---------------------|------------------|

The code i used: (Iterating the rows and jsons, if id is first in json then put it in adjacent cell else create new row and append to df)

append_index = df.shape[0]

for index, row in df.iterrows():
    append_now = False
    for key, val in row['scoring'].items():
        if append_now:
            row['id'] = key
            row['score'] = val
            df.loc[append_index] = row
            append_index += 1

        else:
            df.loc[index,'id'] = key
            df.loc[index, 'score'] = val
        append_now = True

The problem is that df consists of 6+ mlm rows and to iterate only 20rows it takes half an hour. But when I limit first 1k rows it works well

Evgeny
  • 3
  • 1
  • The **second** answer to this question [How to iterate over rows in a DataFrame in Pandas](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) might be helpful. – Timus Oct 28 '20 at 11:36

1 Answers1

0

Not sure if that works better, but you might want to give it a try:

Sample frame

data = [[{'id1': 'score1', 'id2': 'score2', 'id3': 'score3'}],
        [{'id4': 'score4', 'id5': 'score5'}]]
df = pd.DataFrame(data, index=['session1', 'session2'])

looks like

                                                           0
session1  {'id1': 'score1', 'id2': 'score2', 'id3': 'score3'}
session2  {'id4': 'score4', 'id5': 'score5'}

This

data_new = [[session, id, score]
            for session, scores in zip(df.index, df[0])
            for id, score in scores.items()]
df = pd.DataFrame(data_new)
df.set_index(0, inplace=True)

reproduces your result

            1       2
0                    
session1  id1  score1
session1  id2  score2
session1  id3  score3
session2  id4  score4
session2  id5  score5

but might perform better.

Timus
  • 10,974
  • 5
  • 14
  • 28
  • Thank you! It solved my task less than a minute. As far as I understand the problem was with 'iterrows()' which is too slow for huge DataFrame. – Evgeny Oct 28 '20 at 19:03
  • @Evgeny Thanks for the feedback, I'm glad I could help. I'm not completely sure that the use of `iterrows()` was the the only reason. I suspect that the continuous modification of an already large dataframe might be suboptimal too. – Timus Oct 28 '20 at 19:47