1

The df is the dataframe which contain the following information.

 In [61]: df.head()
    Out[61]: 
       id  movie_id                  info
    0   1         1   Italy:1 January 1994
    1   2         2   USA:22 January 2006
    2   3         3   USA:12 February 2006
    3   4         4   USA:February 2006
    4   5         5   USA:2006

I want output like below:

In [61]: df.head()    
Out[61]: 
   id  movie_id    country Date    Month   Year
0   1         1    Italy    1     January  1994
1   2         2    USA      22    January  2006
2   3         3    USA      12    February 2006
3   4         4    USA      None  February 2006
4   5         5    USA      None  None     2006

The data is stored in dataframe and it must be overwrite into the dataframe.

Nishant Singh
  • 501
  • 3
  • 9
  • 20
  • 1
    show us what you tried – Steven G Oct 16 '16 at 13:56
  • Write a function(s) that will extract/separate and return the pieces then assign the return value of that function(s) to a new column - maybe use [Dataframe.apply or .applymap](http://pandas.pydata.org/pandas-docs/stable/api.html#id5). – wwii Oct 16 '16 at 14:45

2 Answers2

2

You can use regex :|\s+ to split the column on either semicolon or white spaces and specify the expand parameter to be true so that the result will expand to columns:

df[["country","Date","Month","Year"]] = df['info'].str.split(':|\s+', expand = True)

enter image description here

Update:

To handle optional missing dates and months, you could try extract with regular expression:

(df[["country","Date","Month","Year"]] = 
     df['info'].str.extract('^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'))
  • ^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$' contains four capture groups corresponding to country, Date, Month, Year respectively;
  • ^ and $ denote the start and end of the string;
  • ([A-Za-z]+) captures the country which is before : and consists of letters;
  • (\d{1,2}) captures Date which consists of one or two digits but optional(with ? after the group), i.e, could be missing;
  • ([A-Za-z]+) captures Month which consists of letters and it's marked as optional with ?;
  • (\d{4}) captures the year which consists of four digits;

enter image description here

Psidom
  • 209,562
  • 33
  • 339
  • 356
  • some data are missing in info like info only contains **USA:2013**. This can be represented inside the country and year, but the **2013** is under **Date** column. Can you do it by reversing or by any other method. – Nishant Singh Oct 17 '16 at 04:31
  • In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply( lambda x: pd.Series(x.split()[::-1])) – Nishant Singh Oct 17 '16 at 04:37
  • I tried above code to solve my problem but the country name and date come along in column. – Nishant Singh Oct 17 '16 at 04:38
1

Using split string method.

In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split('\W+', expand=True)

In [164]: df
Out[164]:
   id  movie_id                  info country date     month  year
0   1         1  Italy:1 January 1994   Italy    1   January  1994
1   2         2   USA:22 January 2006     USA   22   January  2006
2   3         3  USA:12 February 2006     USA   12  February  2006
3   4         4  USA:19 February 2006     USA   19  February  2006
4   5         5   USA:22 January 2006     USA   22   January  2006
Zero
  • 74,117
  • 18
  • 147
  • 154
  • This work very well but I have one problem more in this that is: – Nishant Singh Oct 16 '16 at 14:27
  • some data are missing from info that is: info contains only 'Italy:2013' after splitting the **2013** is shown under **date* column and **month,year is none**. How to fix this problem. – Nishant Singh Oct 16 '16 at 14:32
  • You'd need to write custom function. Hint: http://stackoverflow.com/a/40069615/2137255 – Zero Oct 16 '16 at 14:34
  • Can you make it in reverse form so that I can easily handle missing data. Example: year, month, date, country so that the missing value be none. – Nishant Singh Oct 17 '16 at 04:40