Convert a string to dataframe

Question

I have a string like this:

txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '

And want to obtain a DataFrame like this:

In [185]: pd.DataFrame({'col1':['A','AA'],'col2':['AGILENT TECH INC','ALCOA INC']})
Out[185]:
  col1              col2
0    A  AGILENT TECH INC
1   AA         ALCOA INC

I tried so far:

from StringIO import StringIO
import re

pd.DataFrame.from_csv(StringIO(re.sub(' +\n', ';', txt)), sep=';')

Out[204]:
Empty DataFrame
Columns: [AA     ALCOA INC                     ]
Index: []

But the result is not the one expected. It seems I do not handle all optionality of from_csv or StringIO.

It is certainly linked to this question.

score 5 · Accepted Answer · answered Sep 02 '15 at 15:50

Use read_fwf and pass the column widths:

In [15]:
import io
import pandas as pd    
col2
txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '
df = pd.read_fwf(io.StringIO(txt), header=None, widths=[7, 37], names=['col1', 'col2'])
df
Out[15]:
  col1              col2
0    A  AGILENT TECH INC
1   AA         ALCOA INC

score 0 · Answer 2 · edited Sep 02 '15 at 17:28

0

import re

txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '

result = {'col{0}'.format(i + 1): re.split(r'\s{2,}', x.strip()) for i, x in enumerate(txt.splitlines())}

#{'col1':['A','AA'],'col2':['AGILENT TECH INC','ALCOA INC']}

edited Sep 02 '15 at 17:28

Chris Montanaro

16,948
4
20
29

answered Sep 02 '15 at 15:52

Cody Bouche

945
5
10

Nader Hisham · Answer 3 · 2015-09-02T18:01:08.607

txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '
# First create a list , each element in the list represents new line
# at the same step replace the first occurrences of `spaces` with '__'
lines = [re.sub('\s+' , '__' , line.strip() , 1) for line in txt.split('\n')]
# 
Out[143]:
['A__AGILENT TECH INC', 'AA__ALCOA INC']
# then create a series of all resulting lines 
S = pd.Series(lines)

Out[144]:
0    A__AGILENT TECH INC
1          AA__ALCOA INC
dtype: object
# split on `__` which replaced the first occurrences of `spaces` before and then convert the series to a list
data = S.str.split('__').tolist()
Out[145]:
[['A', 'AGILENT TECH INC'], ['AA', 'ALCOA INC']]
pd.DataFrame( data, columns = ['col1' , 'col2'])
Out[142]:
col1    col2
0   A   AGILENT TECH INC
1   AA  ALCOA INC

Convert a string to dataframe

3 Answers3