0

I have a dataframe like that :

Text      Mail                            Phone
text_1    mail_1,mail_2,mail_3       ['phone_1', 'phone_2']
text_2    mail_4,mail_5              ['phone_3', 'phone_4']
text_3    mail_6, mail_7,mail_8      ['phone_5']
 .             .                      .
text_n    mail_x                     ['phone_y', 'phone_y+1']
 .             .                       .

I want to obtain a dataframe like that :

Text         Mail                            Phone
text1       mail_1                           phone_1
text1       mail_2                           phone_2
text1       mail_3                             ?
text2       mail_4                           phone_3
text2       mail_5                           phone_4
text3       mail_6                           phone_5
text3       mail_7                             ?
text3       mail_8                             ?              
text_n      mail_x                           phone_y 
text_n        ?                              phone_y+1

For each row of the initial dataframe, the number of mail(s) and phone(s) is variable and can be 0.

Regards,

dkk
  • 171
  • 3
  • 11
  • So what is preventing you from doing this? – MFisherKDX May 02 '19 at 23:50
  • I used a code, but I'm not able to use the 'text' attribute as an index – dkk May 02 '19 at 23:54
  • Do you have control over the DataFrame - are you constructing it that way? It might be easier to start over and make it the way you want? – wwii May 02 '19 at 23:54
  • I generate the columns 'Mail' and 'Phone' by extracting from the column 'Text' the e-mail adresses and the phone numbers. So my initial dataset has only one column (the column 'Text'). – dkk May 02 '19 at 23:59
  • 1
    Downvoting because you are showing zero effort. Here are my requirements, please give me the solution. – MFisherKDX May 03 '19 at 00:01
  • I searched, I have the feeling that I have to use the function stack(), but I don't obtain what I want to do. – dkk May 03 '19 at 00:06
  • 1
    This is actually a good question and quite hard to solve since the length of the values are different. [These](https://stackoverflow.com/q/53218931/9081267) answers will help you further – Erfan May 03 '19 at 01:42

1 Answers1

1

Use zip_longest() from itertools and reconstruct the DF:

from itertools import zip_longest

df_new = pd.DataFrame([ 
    [t, m, p] for t,M,P in df.values 
    for m,p in zip_longest(M.split(','),P) 
], columns=df.columns)

df_new.fillna('?', inplace=True)
#In [x]: df_new                                                                                                                    
#Out[x]: 
#     Text     Mail      Phone
#0  text_1   mail_1    phone_1
#1  text_1   mail_2    phone_2
#2  text_1   mail_3          ?
#3  text_2   mail_4    phone_3
#4  text_2   mail_5    phone_4
#5  text_3   mail_6    phone_5
#6  text_3   mail_7          ?
#7  text_3   mail_8          ?
#8  text_n   mail_x    phone_y
#9  text_n        ?  phone_y+1
jxc
  • 13,553
  • 4
  • 16
  • 34