how to parse repeated list element into dataframe

Question

The example would be a list called 'main' looks like

main =  [('date', '2020-04-21'),  ('oldname', 'Tap'),  ('newname', 'Tapnew'),  ('icon_url',   '3'),  ('date', '2020-04-21'),  ('oldname', 'Nod'),  ('newname', 'Nodnew'),  ('icon_url','4'),  ('date', '2020-04-21'),  ('oldname', 'Mik'),  ('newname', 'Miknew'),  ('icon_url','5')]

I've tried to directly parse and transform using this.

df = pd.DataFrame(main)
test = df.T
test.columns = test.iloc[0]
a = test.drop(test.index[0])

However the result dataframe is still a long sparse form with repeated columns

 date      oldname     newname    icon_url     date      oldname     newname    icon_url    date      oldname     newname    icon_url 
2020-04-21    Tap      Tapnew        3       2020-04-21      Nod     Nodnew       4      2020-04-21       Mik     Miknew      5

The desired output would be

 date      oldname     newname    icon_url     
2020-04-21    Tap     Tapnew        3     
2020-04-21    Nod     Nodnew        4      
2020-04-21    Mik     Miknew        5

I've been struggle the whole day --- Can anyone shed some lights on this? Thanks in advance.

score 3 · Answer 1 · answered Apr 24 '20 at 17:28

Convert the list of tuple into dictionary.

In [62]: def tuple_to_dict(some_list):
    ...:     result = {}
    ...:     for k, v in some_list:
    ...:         result.setdefault(k, []).append(v)
    ...:
    ...:     return result
    ...:

In [63]: tuple_to_dict(main)
Out[63]:
{'date': ['2020-04-21', '2020-04-21', '2020-04-21'],
 'oldname': ['Tap', 'Nod', 'Mik'],
 'newname': ['Tapnew', 'Nodnew', 'Miknew'],
 'icon_url': ['3', '4', '5']}

In [64]: df = pd.DataFrame(tuple_to_dict(main))

In [65]: df
Out[65]:
         date oldname newname icon_url
0  2020-04-21     Tap  Tapnew        3
1  2020-04-21     Nod  Nodnew        4
2  2020-04-21     Mik  Miknew        5

score 1 · Answer 2 · answered Apr 24 '20 at 17:24

From df = pd.DataFrame(main) it's just pivot on two columns (more about pivot here):

(pd.DataFrame(main, columns=['col','val'])
   .assign(idx=lambda x: x.groupby('col').cumcount())
   .pivot('idx','col','val')
)

Output:

col        date icon_url newname oldname
idx                                     
0    2020-04-21        3  Tapnew     Tap
1    2020-04-21        4  Nodnew     Nod
2    2020-04-21        5  Miknew     Mik

score 1 · Answer 3 · edited Jan 09 '21 at 00:47

1

s=a.melt()
s['i']=s.groupby(0).cumcount()
s=s.pivot(index='i',columns=0,values='value')
0        date icon_url newname oldname
i                                     
0  2020-04-21        3  Tapnew     Tap
1  2020-04-21        4  Nodnew     Nod
2  2020-04-21        5  Miknew     Mik

edited Jan 09 '21 at 00:47

halfer

19,824
17
99
186

answered Apr 24 '20 at 17:41

BENY

317,841
20
164
234

score 0 · Answer 4 · answered Apr 24 '20 at 17:24

Read in the DataFrame as you have. Then create an index for groups of data by checking where the word is 'date' and taking the cumsum. At this point we just pivot

df = pd.DataFrame(main)
df['index'] = df[0].eq('date').cumsum()
df = df.pivot(index='index', columns=0, values=1).rename_axis(None, axis=1)

             date icon_url newname oldname
index                                     
1      2020-04-21        3  Tapnew     Tap
2      2020-04-21        4  Nodnew     Nod
3      2020-04-21        5  Miknew     Mik

score 0 · Answer 5 · answered Apr 24 '20 at 17:42

This solution uses Python and should be more efficient for large data. It makes use of the fact that Python lists are ordered and you are looking at parsing tuples in the group of 4 (variable n in the solution)

main =  [('date', '2020-04-21'),  ('oldname', 'Tap'),  ('newname', 'Tapnew'),  ('icon_url',   '3'),  ('date', '2020-04-21'),  ('oldname', 'Nod'),  ('newname', 'Nodnew'),  ('icon_url','4'),  ('date', '2020-04-21'),  ('oldname', 'Mik'),  ('newname', 'Miknew'),  ('icon_url','5')]

n = 4

# Extract column names
main_columns = [item[0] for item in main[:n]]
# Extract values
main_values = [item[1] for item in main]
# Reshape values to return list of lists
main_reshaped = [main_values[(i-1)*n:(i*n)] for i in range(1, len(main_values)//4+1)]

# Call DataFrame constructor
pd.DataFrame(main_reshaped, columns = main_columns)


    date        oldname newname icon_url
0   2020-04-21  Tap     Tapnew  3
1   2020-04-21  Nod     Nodnew  4
2   2020-04-21  Mik     Miknew  5

score 0 · Answer 6 · answered Apr 24 '20 at 18:20

u could use defaultdict to get a dictionary, then read into a dataframe:

from collections import defaultdict
d = defaultdict(list)
for k,v in main:
    d[k].append(v)

pd.DataFrame(d)

date    oldname newname icon_url
0   2020-04-21  Tap Tapnew  3
1   2020-04-21  Nod Nodnew  4
2   2020-04-21  Mik Miknew  5

how to parse repeated list element into dataframe

6 Answers6