I am trying to split the comma-separated values in a column of a pandas DataFrame into separate rows.

The original data:

[image: original pandas DataFrame]

The desired output:

[image: desired output]

I have tried several approaches, starting with the one from "Explode/stack a Series of strings":

newdf['Month'] = newdf['Month'].apply(list)

With the above code I get [j,a,n,,f,e,b]. I then used

pd.DataFrame({'Month':np.concatenate(newdf['Month'].values), 'cust no.':newdf['cust no.'].repeat(newdf['cust no.'].apply(len))})

In the output, each letter ends up in a separate row. As a result, the number of rows no longer matches "cust no." and I get an error.

I know there are several functions available, but I couldn't find one that efficiently breaks down the values.

Robert
Deya
  • You posted this question earlier today. It was, and still is, a duplicate. Either way, in the future, please post dataframes as text, not images – user3483203 Aug 20 '18 at 21:28
  • The below link has solved my problem. Very very useful. https://stackoverflow.com/questions/50082449/splitting-multiple-columns-on-a-delimiter-into-rows-in-pandas-dataframe – Deya Aug 22 '18 at 03:11

2 Answers

You can use a regex (regular expression) to capture the text before the first comma.

Assuming your original dataframe is called data, so that your months column is data['Months'], the regular expression r'(.+?),' captures everything before the first comma.

data['Months'] = data['Months'].str.extract(r'(.+?),', expand=True)

You can always test regex at https://pythex.org/. Try entering your months column in the test string box, and (.+?), as the regular expression.
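To make the behaviour of this pattern concrete, here is a minimal sketch on made-up data (the two sample strings are assumptions, not the asker's values). Because the quantifier in (.+?) is lazy, only the text before the first comma is kept, so the remaining values in each cell are dropped rather than moved to new rows.

```python
import pandas as pd

# Hypothetical sample data; the real column contents are not shown in the question.
data = pd.DataFrame({"Months": ["jan,feb", "mar,apr,may"]})

# expand=False returns a Series when the pattern has a single capture group.
first_only = data["Months"].str.extract(r"(.+?),", expand=False)

print(first_only.tolist())  # ['jan', 'mar']
```

If every value is needed, splitting on the comma is probably a better starting point than extracting with a regex.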

Joska

Setup

import pandas as pd

df = pd.DataFrame({'id': [1,2,3,4], 'month': ['Jan,Fev', 'Feb,July', 'Jun,Aug', 'July,Mar']})

    id  month
0   1   Jan,Fev
1   2   Feb,July
2   3   Jun,Aug
3   4   July,Mar

str.split + pd.DataFrame() + stack

df = df.set_index('id')
pd.DataFrame(df.month.str.split(',').to_dict()).T.stack().reset_index(level=0, name='month')

    level_0 month
0   1       Jan
1   1       Fev
0   2       Feb
1   2       July
0   3       Jun
1   3       Aug
0   4       July
1   4       Mar
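As a possible alternative, on pandas 0.25 or later (a version assumption; the approach above does not need it), the same reshaping can be sketched with str.split followed by DataFrame.explode. This reuses the setup data from this answer and keeps the id column name instead of producing level_0.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'month': ['Jan,Fev', 'Feb,July', 'Jun,Aug', 'July,Mar']})

# Turn each comma-separated string into a list, then give every list
# element its own row; the id value is repeated for each element.
out = (
    df.assign(month=df['month'].str.split(','))
      .explode('month')
      .reset_index(drop=True)
)

print(out)
```

The reset_index(drop=True) call only renumbers the rows; without it, the rows that came from the same original row share an index label.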
rafaelc
  • Thank you. I want to do this on all the columns at once; otherwise I get an error because the row counts do not match. So I am using the following code: pd.DataFrame(new.apply(lambda x: x.to_dict("series").str.split(",").T.stack().reset_index(),axis=1,raw = False)). However, I am getting this error: ("unsupported type: ", . Could you share your thoughts? – Deya Aug 21 '18 at 19:03
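
For the multi-column case raised in the comment above, one possible sketch is to split every affected column and explode them together. This assumes pandas 1.3 or later (where DataFrame.explode accepts a list of columns) and that each row has the same number of comma-separated items in every column being split. The frame below and its "Amount" column are hypothetical stand-ins, since the real data was only posted as an image.

```python
import pandas as pd

# Hypothetical data: "Month" and "Amount" hold the same number of items per row.
new = pd.DataFrame({
    "cust no.": [101, 102],
    "Month": ["jan,feb", "mar,apr"],
    "Amount": ["10,20", "30,40"],
})

cols = ["Month", "Amount"]

# Split each listed column into lists, then explode all of them in lockstep
# so the i-th item of every column lands on the same output row.
split = new.assign(**{c: new[c].str.split(",") for c in cols})
out = split.explode(cols, ignore_index=True)

print(out)
```

If the item counts differ between columns within a row, explode raises an error, so uneven data would need to be padded or handled column by column first.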