Is it a requirement that you use regular expressions? Using regexs here is too complicated, since you have structured data that are easily parsed by read_csv
. That said, there are a couple of ways you can do this, in addition to @chrisb's answer:
StringIO
+ read_csv
:
In [45]: data
Out[45]:
{'raw': ['Baker 1 2009-11-17 1223.0',
'Baker 1 2010-06-24 1122.7',
'Baker 2 2009-07-24 2819.0',
'Baker 2 2010-08-25 2971.6',
'Baker 1 2011-01-05 1410.0',
'Baker 2 2010-09-04 4671.6']}
In [46]: text = '\n'.join(data['raw'])
In [47]: print(text)
Baker 1 2009-11-17 1223.0
Baker 1 2010-06-24 1122.7
Baker 2 2009-07-24 2819.0
Baker 2 2010-08-25 2971.6
Baker 1 2011-01-05 1410.0
Baker 2 2010-09-04 4671.6
In [48]: from StringIO import StringIO
In [49]: df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=[2], names=['name', 'value', 'date', 'score'])
In [50]: df
Out[50]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [51]: df.dtypes
Out[51]:
name object
value int64
date datetime64[ns]
score float64
dtype: object
This allows you provide names and infers the dtypes. I would go for this above the others.
Series.str.extract()
:
Note: You probably shouldn't use the score
regex to match arbitrary floats (e.g., it doesn't match negative numbers): look into tokenize.Floatnumber
In [29]: df
Out[29]:
raw
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [30]: raw = df.raw.str.extract(r'(?P<name>[a-zA-Z]+)\s+(?P<value>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d*\.\d*)')
In [31]: raw
Out[31]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [32]: raw.dtypes
Out[32]:
name object
value object
date object
score object
dtype: object
In [33]: r = raw.convert_objects(convert_numeric=True)
In [34]: r
Out[34]:
name value date score
0 Baker 1 2009-11-17 1223.0
1 Baker 1 2010-06-24 1122.7
2 Baker 2 2009-07-24 2819.0
3 Baker 2 2010-08-25 2971.6
4 Baker 1 2011-01-05 1410.0
5 Baker 2 2010-09-04 4671.6
In [35]: r.dtypes
Out[35]:
name object
value int64
date object
score float64
dtype: object
Note: This didn't convert the date
column. Use pandas.to_datetime
for that.