3

So I have a pandas dataframe from csv file looks like this:

year,month,day,list
2017,09,01,"[('United States of America', 12345), (u'Germany', 54321), (u'Switzerland', 13524), (u'Netherlands', 24135), ... ]
2017,09,02,"[('United States of America', 6789), (u'Germany', 9876), (u'Switzerland', 6879), (u'Netherlands', 7968), ... ]

The number of country-count pairs in the 4th column of each row is not identical.
I want to expand the list in the 4th column, and transform the dataframe into something like this:

year,month,day,country,count
2017,09,01,'United States of America',12345
2017,09,01,'Germany',54321
2017,09,01,'Switzerland',13524
2017,09,01,'Netherlands',24135
...
2017,09,02,'United States of America',6789
2017,09,02,'Germany',9876
2017,09,02,'Switzerland',6879
2017,09,02,'Netherlands',7968
...

My thought was to generate 2 independent columns, then join them to the origin dataframe. Maybe Something like this:

country = df.apply(lambda x:[x['list'][0]]).stack().reset_index(level=1, drop=True)
count  = df.apply(lambda x:[x['list'][1]]).stack().reset_index(level=1, drop=True)
df.drop('list', axis=1).join(country).join(count)

The code above is definitely not working (I just hope it can help express my thought), and I have no idea how to expand the date columns as well.
Any help or suggestion is much appreciated.

Dan Lwo
  • 43
  • 6

3 Answers3

0

Probably, the easiest way to solve you problem is to iterate over the tuples contained in the dataframe and to create a new one. You can do that with two nested for loops.

df_new = []
for i in df.itertuples():
    for l in i.list:
        df_new.append([i.year, i.month, i.day, l[0], l[1]])

df_new = pd.DataFrame(df_new, columns=['year', 'month', 'day', 'country', 'count'])

If the fourth field of the list is not a list but a string (the double quotes in the dataframe example leave me some doubts), you can use the literal_eval function from the ast library: Converting a string representation of a list into an actual list object

Covix
  • 115
  • 1
  • 9
0

Use:

import ast
#convert strings to lists of tuples
df['list'] = df['list'].apply(ast.literal_eval)
#create reshaped df from column list
df1 =pd.DataFrame([dict(x) for x in df['list'].values.tolist()]).stack().reset_index(level=1)
df1.columns = ['country','count']
#join to original
df = df.drop('list', 1).join(df1).reset_index(drop=True)
print (df)
   year  month  day                   country  count
0  2017      9    1                   Germany  54321
1  2017      9    1               Netherlands  24135
2  2017      9    1               Switzerland  13524
3  2017      9    1  United States of America  12345
4  2017      9    2                   Germany   9876
5  2017      9    2               Netherlands   7968
6  2017      9    2               Switzerland   6879
7  2017      9    2  United States of America   6789
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

So, what you need is to cconvert a column with a list of values into multiple rows. One solution is to create a new dataframe and do a left join:

df = pd.DataFrame({'A':['a','b'],'B':['x','y'],
                   'C':[['a1', 'a2'],['b1', 'b2', 'b3']]})

df
#    A  B               C
# 0  a  x      [[a1, a2]]
# 1  b  y  [[b1, b2, b3]]

dfr=df['C'].apply(lambda k: pd.Series(k)).stack().reset_index(level=1, drop=True).to_frame('C')

dfr
#     C
# 0  a1
# 0  a2
# 1  b1
# 1  b2
# 1  b3

df[['A','B']].join(dfr, how='left')
#    A  B   C
# 0  a  x  a1
# 0  a  x  a2
# 1  b  y  b1
# 1  b  y  b2
# 1  b  y  b3

Finally, use reset_index()

df[['A','B']].join(dfr, how='left').reset_index(drop=1)
#    A  B   C
# 0  a  x  a1
# 1  a  x  a2
# 2  b  y  b1
# 3  b  y  b2
# 4  b  y  b3

Credit: https://stackoverflow.com/a/39955283/2314737

user2314737
  • 27,088
  • 20
  • 102
  • 114