2

I have a pandas column of strings. I want to break up the individual components of the string into their own columns using regex:

In [35]:

import re
import pandas as pd

In [36]:

data = {'raw': ['Baker 1 2009-11-17       1223.0',
                'Baker 1 2010-06-24       1122.7',
                'Baker 2 2009-07-24       2819.0',
                'Baker 2 2010-08-25       2971.6',
                'Baker 1 2011-01-05       1410.0',
                'Baker 2 2010-09-04       4671.6']}
df = pd.DataFrame(data, columns = ['raw'])
df

Out[36]:
     raw
0    Baker 1 2009-11-17 1223.0
1    Baker 1 2010-06-24 1122.7
2    Baker 2 2009-07-24 2819.0
3    Baker 2 2010-08-25 2971.6
4    Baker 1 2011-01-05 1410.0
5    Baker 2 2010-09-04 4671.6

This is what I want it to look like:

Out[41]:
     name    value   date          score
0    Baker   1       2010-06-24    1223.0
1    Baker   1       2009-07-24    1122.7
2    Baker   2       2009-07-24    2819.0
3    Baker   2       2010-08-25    2971.6
4    Baker   1       2011-01-05    1410.0
5    Baker   2       2010-09-04    4671.6

I've tried data.str.contains() but I can't seem to make it work. Any help would be appreciated.

Anton
  • 4,765
  • 12
  • 36
  • 50

2 Answers2

3

Based on this answer - Pandas DataFrame - how do I split a column

In [122]: pd.DataFrame(df['raw'].str.split().tolist(), columns=['name','value','date','score'])
Out[122]: 
    name value        date   score
0  Baker     1  2009-11-17  1223.0
1  Baker     1  2010-06-24  1122.7
2  Baker     2  2009-07-24  2819.0
3  Baker     2  2010-08-25  2971.6
4  Baker     1  2011-01-05  1410.0
5  Baker     2  2010-09-04  4671.6

[6 rows x 4 columns]
Community
  • 1
  • 1
chrisb
  • 49,833
  • 8
  • 70
  • 70
  • All of the columns here are object dtypes, which I doubt is what OP wanted. You can call `convert_objects(convert_numeric=True)` on the result to convert numbers, but that still won't convert `date` to `datetime64[ns]`. – Phillip Cloud Jun 07 '14 at 17:29
1

Is it a requirement that you use regular expressions? Using regexs here is too complicated, since you have structured data that are easily parsed by read_csv. That said, there are a couple of ways you can do this, in addition to @chrisb's answer:

StringIO + read_csv:

In [45]: data
Out[45]:
{'raw': ['Baker 1 2009-11-17       1223.0',
  'Baker 1 2010-06-24       1122.7',
  'Baker 2 2009-07-24       2819.0',
  'Baker 2 2010-08-25       2971.6',
  'Baker 1 2011-01-05       1410.0',
  'Baker 2 2010-09-04       4671.6']}

In [46]: text = '\n'.join(data['raw'])

In [47]: print(text)
Baker 1 2009-11-17       1223.0
Baker 1 2010-06-24       1122.7
Baker 2 2009-07-24       2819.0
Baker 2 2010-08-25       2971.6
Baker 1 2011-01-05       1410.0
Baker 2 2010-09-04       4671.6

In [48]: from StringIO import StringIO

In [49]: df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=[2], names=['name', 'value', 'date', 'score'])

In [50]: df
Out[50]:
    name  value       date   score
0  Baker      1 2009-11-17  1223.0
1  Baker      1 2010-06-24  1122.7
2  Baker      2 2009-07-24  2819.0
3  Baker      2 2010-08-25  2971.6
4  Baker      1 2011-01-05  1410.0
5  Baker      2 2010-09-04  4671.6

In [51]: df.dtypes
Out[51]:
name             object
value             int64
date     datetime64[ns]
score           float64
dtype: object

This allows you provide names and infers the dtypes. I would go for this above the others.

Series.str.extract():

Note: You probably shouldn't use the score regex to match arbitrary floats (e.g., it doesn't match negative numbers): look into tokenize.Floatnumber

In [29]: df
Out[29]:
                               raw
0  Baker 1 2009-11-17       1223.0
1  Baker 1 2010-06-24       1122.7
2  Baker 2 2009-07-24       2819.0
3  Baker 2 2010-08-25       2971.6
4  Baker 1 2011-01-05       1410.0
5  Baker 2 2010-09-04       4671.6

In [30]: raw = df.raw.str.extract(r'(?P<name>[a-zA-Z]+)\s+(?P<value>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d*\.\d*)')

In [31]: raw
Out[31]:
    name value        date   score
0  Baker     1  2009-11-17  1223.0
1  Baker     1  2010-06-24  1122.7
2  Baker     2  2009-07-24  2819.0
3  Baker     2  2010-08-25  2971.6
4  Baker     1  2011-01-05  1410.0
5  Baker     2  2010-09-04  4671.6

In [32]: raw.dtypes
Out[32]:
name     object
value    object
date     object
score    object
dtype: object

In [33]: r = raw.convert_objects(convert_numeric=True)

In [34]: r
Out[34]:
    name  value        date   score
0  Baker      1  2009-11-17  1223.0
1  Baker      1  2010-06-24  1122.7
2  Baker      2  2009-07-24  2819.0
3  Baker      2  2010-08-25  2971.6
4  Baker      1  2011-01-05  1410.0
5  Baker      2  2010-09-04  4671.6

In [35]: r.dtypes
Out[35]:
name      object
value      int64
date      object
score    float64
dtype: object

Note: This didn't convert the date column. Use pandas.to_datetime for that.

Phillip Cloud
  • 24,919
  • 11
  • 68
  • 88