1

I have a massive ASCII data table with several consecutive white spaces as separators and missing data as follows:

493802.2.38...9.................11...    
5221.8.23.............7.........1.    
1383782.51.............16.......1...   

I need to read it with pandas to be as follows, which consider the first white space as a separator while the second as a column missing data

493802 |2 |38| |9| | | | |  | | | |11| |
5221   |8 |23| | | | | | |7 | | | |  |1|
1383782|51|  | | | | | | |16| | | |1 | |

I have tried

df = pd.read_csv('file.txt', sep='\s+',header=None,engine='python')

but it considers any count of white spaces as the delimiter so it parses the file into only five columns

And tried

 df = pd.read_csv('file.txt', sep='\s',header=None,engine='python')

but it returns the error that expected fields are less than what pandas saw

Finally, I've even tried some other parameters such as quoting or line-terminator but no success. any help?

Mshendy
  • 303
  • 3
  • 11
  • On the second row there are 13 spaces between `23` and `7`. On the third row there are 13 spaces between `51` and `16`.Since the `23` and `51` are in different columns, shouldn't the `7` and `16` be in different columns as well? – unutbu Jun 10 '17 at 11:28
  • yes you are right, just a typo – Mshendy Jun 10 '17 at 13:01

2 Answers2

0

If I got this right, in your file the '.' indicates the separation and missing values.

You could replace all '..' with ', ' and then replace the remaining '.' with ',' and set the commas as separator; sep= ',' .

According to your example, this works for me :

import io
import pandas as pd  

with open('file.txt','r') as file:
     content = file.read()

s2=content.replace('..',', ').replace('.',',')
df = pd.read_table(io.StringIO(s2),sep=',',header=None)

Note: Io is necessary so that pandas reads the text as a file 
  • in the original file there are spaces not dots, however i've tried to replace all spaces with dots and then applied the answer but still says found rows are less than expected – Mshendy Jun 10 '17 at 13:02
0

If we use the regex pattern \s{2}|\s as the separator, then two spaces will be matched preferentially over one space if there are two spaces. So, for example, splitting the first line with \s{2}|\s would yield

In [33]: re.split(r'\s{2}|\s', '493802 2 38   9                 11   ')
Out[33]: ['493802', '2', '38', '', '9', '', '', '', '', '', '', '', '', '11', '', '']

Notice that three spaces get matched as two separators back-to-back. Since there is nothing between the two separators, Pandas will insert a NaN for the missing value:

import pandas as pd
try: from cStringIO import StringIO         # for Python2
except ImportError: from io import StringIO # for Python3

# so that there is no ambiguity about our input
data = '''\
493802 2 38   9                 11   
5221 8 23             7         1 
1383782 51             16       1   '''

df = pd.read_table(StringIO(data), sep=r'\s{2}|\s', header=None,  
                   engine='python').reset_index()
print(df)

yields

     index   0     1   2    3   4   5   6     7    8   9  10   11    12   13
0   493802   2  38.0 NaN  9.0 NaN NaN NaN   NaN  NaN NaN NaN  NaN  11.0  NaN
1     5221   8  23.0 NaN  NaN NaN NaN NaN   NaN  7.0 NaN NaN  NaN   NaN  1.0
2  1383782  51   NaN NaN  NaN NaN NaN NaN  16.0  NaN NaN NaN  1.0   NaN  NaN

Beware that the result is a little different than the one you posted. The first two rows match, but the third row is different. Nevertheless, I think \s{2}|\s might be the regex you are looking for.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • It keeps returning the error that expected rows are less than an what pandas saw. "Expected 30 fields in line 151, saw 31. Error could possibly be due to quotes being ignored when a multi-char delimiter is used." – Mshendy Jun 10 '17 at 13:00