1

How do I quickly make new columns that hold the three chunks contained in the column 'File'?

recieved messy data like this

d = {   'File' : pd.Series(['firstname lastname                   05/31/1996                     9999999999  ', 'FN SometimesMiddileInitial. LN                    05/31/1996                 9999999999  ']), 
    'Status' : pd.Series([0., 0.]), 
    'Error' : pd.Series([2., 2.])}
df=pd.DataFrame(d)

UPDATE In reality, i'm starting from a very messy excel file and my data has '\xa0 \xa0' between string characters. so my first attempt looks like

from pandas import DataFrame, ExcelFile
import pandas as pd
location = r'c:/users/meinzerc/Desktop/table.xlsx'
xls = ExcelFile(location)
table = xls.parse('Sheet1')
splitdf = df['File'].str.split('\s*)

My attempt doesn't work at all. WHY?

Chet Meinzer
  • 1,691
  • 2
  • 21
  • 35
  • possible duplicate of [Pandas DataFrame, how do i split a column into two](http://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two) – sashkello Jan 22 '14 at 23:52

1 Answers1

4

You could use a regex to pick up at least two spaces:

In [11]: df.File.str.split('\s\s+')
Out[11]: 
0       [firstname lastname, 05/31/1996, 9999999999, ]
1    [FN SometimesMiddileInitial. LN, 05/31/1996, 9...
Name: File, dtype: object

Perhaps a better option is to use extract (and perhaps there is a neater regex!!):

In [12]: df.File.str.extract('\s*(?P<name>.*?)\s+(?P<date>\d+/\d+/\d+)\s+(?P<number>\w+)\s*')
Out[12]: 
                             name        date      number
0              firstname lastname  05/31/1996  9999999999
1  FN SometimesMiddileInitial. LN  05/31/1996  9999999999

[2 rows x 3 columns]
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Thank you Andy. simple and perfect.. in practice, but when i tried it on my excel df, df.values showed '\xa0 \xa0' instead of white space. I've never dealt with that before, so i tried df.File.str.split('\\xa0+\s*\\xa0*') and finally got it! Maybe i can edit my original question to elicit feedback for \xa0? also, i'll try the extract! looks great, i like extract because it is more intential! – Chet Meinzer Jan 23 '14 at 18:26
  • @ChetMeinzer that's a latin-1 codepoint for non-breaking whitespace. I'm afraid you'll get bitten by this kind of thing occasionally so it's a good idea to check the results a bit after, and see what bites next :). Another alternative might be to `.replace('\xa0', ' ')` first. – Andy Hayden Jan 23 '14 at 19:10
  • I think you are right. If i can get that out of there, then it will be more normal to adjust. Thanks again! – Chet Meinzer Jan 23 '14 at 19:17