2

I have a string like this:

txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '

And want to obtain a DataFrame like this:

In [185]: pd.DataFrame({'col1':['A','AA'],'col2':['AGILENT TECH INC','ALCOA INC']})
Out[185]:
  col1              col2
0    A  AGILENT TECH INC
1   AA         ALCOA INC

I tried so far:

from StringIO import StringIO
import re

pd.DataFrame.from_csv(StringIO(re.sub(' +\n', ';', txt)), sep=';')

Out[204]:
Empty DataFrame
Columns: [AA     ALCOA INC                     ]
Index: []

But the result is not the one expected. It seems I do not handle all optionality of from_csv or StringIO.

It is certainly linked to this question.

Community
  • 1
  • 1
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87

3 Answers3

5

Use read_fwf and pass the column widths:

In [15]:
import io
import pandas as pd    
col2
txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '
df = pd.read_fwf(io.StringIO(txt), header=None, widths=[7, 37], names=['col1', 'col2'])
df
Out[15]:
  col1              col2
0    A  AGILENT TECH INC
1   AA         ALCOA INC
EdChum
  • 376,765
  • 198
  • 813
  • 562
0
import re

txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '

result = {'col{0}'.format(i + 1): re.split(r'\s{2,}', x.strip()) for i, x in enumerate(txt.splitlines())}

#{'col1':['A','AA'],'col2':['AGILENT TECH INC','ALCOA INC']}
Chris Montanaro
  • 16,948
  • 4
  • 20
  • 29
Cody Bouche
  • 945
  • 5
  • 10
0
txt = 'A      AGILENT TECH INC              \nAA     ALCOA INC                     '
# First create a list , each element in the list represents new line
# at the same step replace the first occurrences of `spaces` with '__'
lines = [re.sub('\s+' , '__' , line.strip() , 1) for line in txt.split('\n')]
# 
Out[143]:
['A__AGILENT TECH INC', 'AA__ALCOA INC']
# then create a series of all resulting lines 
S = pd.Series(lines)

Out[144]:
0    A__AGILENT TECH INC
1          AA__ALCOA INC
dtype: object
# split on `__` which replaced the first occurrences of `spaces` before and then convert the series to a list
data = S.str.split('__').tolist()
Out[145]:
[['A', 'AGILENT TECH INC'], ['AA', 'ALCOA INC']]
pd.DataFrame( data, columns = ['col1' , 'col2'])
Out[142]:
col1    col2
0   A   AGILENT TECH INC
1   AA  ALCOA INC
Nader Hisham
  • 5,214
  • 4
  • 19
  • 35