How to split a column data into other columns which is stored in a dataframe?

Question

The df is the dataframe which contain the following information.

 In [61]: df.head()
    Out[61]: 
       id  movie_id                  info
    0   1         1   Italy:1 January 1994
    1   2         2   USA:22 January 2006
    2   3         3   USA:12 February 2006
    3   4         4   USA:February 2006
    4   5         5   USA:2006

I want output like below:

In [61]: df.head()    
Out[61]: 
   id  movie_id    country Date    Month   Year
0   1         1    Italy    1     January  1994
1   2         2    USA      22    January  2006
2   3         3    USA      12    February 2006
3   4         4    USA      None  February 2006
4   5         5    USA      None  None     2006

The data is stored in dataframe and it must be overwrite into the dataframe.

Write a function(s) that will extract/separate and return the pieces then assign the return value of that function(s) to a new column - maybe use [Dataframe.apply or .applymap](http://pandas.pydata.org/pandas-docs/stable/api.html#id5). — wwii, Oct 16 '16 at 14:45

Psidom · Accepted Answer · 2016-10-17T13:19:23.777

2

You can use regex :|\s+ to split the column on either semicolon or white spaces and specify the expand parameter to be true so that the result will expand to columns:

df[["country","Date","Month","Year"]] = df['info'].str.split(':|\s+', expand = True)

Update:

To handle optional missing dates and months, you could try extract with regular expression:

(df[["country","Date","Month","Year"]] = 
     df['info'].str.extract('^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'))

^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$' contains four capture groups corresponding to country, Date, Month, Year respectively;
^ and $ denote the start and end of the string;
([A-Za-z]+) captures the country which is before : and consists of letters;
(\d{1,2}) captures Date which consists of one or two digits but optional(with ? after the group), i.e, could be missing;
([A-Za-z]+) captures Month which consists of letters and it's marked as optional with ?;
(\d{4}) captures the year which consists of four digits;

edited Oct 17 '16 at 13:19

answered Oct 16 '16 at 14:08

Psidom

209,562
33
339
356

some data are missing in info like info only contains **USA:2013**. This can be represented inside the country and year, but the **2013** is under **Date** column. Can you do it by reversing or by any other method. – Nishant Singh Oct 17 '16 at 04:31
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply( lambda x: pd.Series(x.split()[::-1])) – Nishant Singh Oct 17 '16 at 04:37
I tried above code to solve my problem but the country name and date come along in column. – Nishant Singh Oct 17 '16 at 04:38

score 1 · Answer 2 · answered Oct 16 '16 at 14:09

1

Using split string method.

In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split('\W+', expand=True)

In [164]: df
Out[164]:
   id  movie_id                  info country date     month  year
0   1         1  Italy:1 January 1994   Italy    1   January  1994
1   2         2   USA:22 January 2006     USA   22   January  2006
2   3         3  USA:12 February 2006     USA   12  February  2006
3   4         4  USA:19 February 2006     USA   19  February  2006
4   5         5   USA:22 January 2006     USA   22   January  2006

answered Oct 16 '16 at 14:09

Zero

74,117
18
147
154

This work very well but I have one problem more in this that is: – Nishant Singh Oct 16 '16 at 14:27
some data are missing from info that is: info contains only 'Italy:2013' after splitting the **2013** is shown under **date* column and **month,year is none**. How to fix this problem. – Nishant Singh Oct 16 '16 at 14:32
You'd need to write custom function. Hint: http://stackoverflow.com/a/40069615/2137255 – Zero Oct 16 '16 at 14:34
Can you make it in reverse form so that I can easily handle missing data. Example: year, month, date, country so that the missing value be none. – Nishant Singh Oct 17 '16 at 04:40

How to split a column data into other columns which is stored in a dataframe?

2 Answers2