Use list items as column seperators pd.read_fwf

Question

I have text files containing tables which I want to put into a dataframe. Per file the column headers are the same, but the width is different depending on the content (because they contain names of different lengths for example).

So far I managed to get the index of the first character of the header, so I know where the column separation should be. As all files (about 100) have different widths I want to be able to insert this list in order to create a dataframe with the correct column width.

This is what the first two rows look like:

!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name

What I tried so far:

(1)

pandas pd.read_fwf('file.txt', colspec(()) - This does work, but with colspec I have to put in the (from, start) indexes for each column. Not only would this be burdensome manually, but some files have 12 columns while others have 10.

(2)

pandas pd.read_fwf('file.txt', widhts(list)) - Here I can easily insert the list with locations of the column separations, but it does not seem to create a separation at those indexes. I do not exactly understand what is does.

Question:

I currently have a list of indexes of all the exclamation marks:

list = [0, 17, 30, 45, 58]

How can I use this list and separate the columns to convert the .txt file into a DataFrame?

Any other way to solve this issue is more than welcome!

Can you post some of the content of the txt file exactly how they look in the file? — XXavier, Jul 23 '22 at 13:38
better show what you have in file. It can better show what you try to do . And maybe there is better method to read data from file. — furas, Jul 23 '22 at 15:47
@furas Thanks for the tip! I have added the first two lines of text to my question. Do you know how I can solve it? — Not_a_Robot, Jul 24 '22 at 14:11
@XXavier I have added it to my question. Does this make it easier to understand? — Not_a_Robot, Jul 24 '22 at 14:11
you can use `zip()` to create pairs `pairs = list(zip(list, list[1:]))` but it may need to create more complex code to substract `-1` from `end` — furas, Jul 24 '22 at 14:22

Pieter Geelen · Accepted Answer · 2022-07-24T16:07:43.303

So what you can do is standardize the spacing with regex.

import re
string = "something    something  something  more"
results = re.sub("(\W+)", "|", string)
results

That returns

'something|something|something|more'

If you have standardized the delimiters, you can load it with fwf or just read_csv.

EDIT

In order to derive the span of the header that is delimited with a exclamation mark ! you can use the re library too. The logic of the pattern is that the sequence has to start with ! and then is followed up by many non-!. The next group would inherently start with a !. In code it would look something like this:

example_txt = """!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name"""

first_line = example_txt.split("\n")[0]

import re 

indexes = []
p = re.compile("![^!]*")
for m in p.finditer(first_line):
    indexes.append(m.span())

print(indexes)

Which returns

[(0, 16), (16, 29), (29, 44), (44, 57), (57, 83)]

This should bring you close to what you need for fwf method of pandas. Not that indexing in python starts at 0 and that if the end-index doesn't count. So if you index from [0:16] then you would get the 0th to 15th element (not including the 16th element), returning 16 elements in total. The index can therefore be directly applied.

Thanks for you help! However, I only have the separator in the line of the headers (I have added example text in my question). Do you know how I can use the exclamation marks (or rather, their indexes) as separators? — Not_a_Robot, Jul 24 '22 at 14:12
Hi @Not_a_Robot, I edited my answer to accomdate for the indexing that you are looking for. Let me know whether you can take it from here :) — Pieter Geelen, Jul 24 '22 at 16:08

furas · Answer 2 · 2022-07-24T14:39:45.143

You can use zip() to convert list [0, 17-1, 30-1, 45-1] to pairs [(0, 16), (16, 29), (29, 44), (44, 57)]

cols = [0, 17-1, 30-1, 45-1]

#pairs = [(a, b) for a, b in zip(cols, cols[1:])]
pairs = list(zip(cols, cols[1:]))

df = pd.read_fwf('file.txt', pairs)

Minimale working code.

I use io to simulate file in memory - so everyone can simply copy and test it - but you should use filename

text = '''!Column1        !Column2     !Column3       !Column4     !Column5                  
Company with a  $1,000,000   Yes            Jack, Hank   X
name            
Company with.   $2,000       No             Rita, Hannah X
another name'''

import pandas as pd
import io

#cols = [0, 17-1, 30-1, 45-1]

# --- search all `!` in first line ----

#with open('file.txt') as fh:
with io.StringIO(text) as fh:
   first = next(fh)

start = 0
cols = []
while True:
    pos = first.find('!', start)
    if pos < 0:
        break
    cols.append(pos)
    start = pos+1
print('cols:', cols)    
cols.append(1000)  # add some big value as the end of last column

# --- convert to pairs ---

#pairs = [(a, b) for a, b in zip(cols, cols[1:])]
pairs = list(zip(cols, cols[1:]))
print('pairs:', pairs)

# --- load file ---

#df = pd.read_fwf('file.txt', pairs)
df = pd.read_fwf(io.StringIO(text), pairs)

print(df)

Result:

         !Column1    !Column2 !Column3      !Column4 !Column5
0  Company with a  $1,000,000      Yes    Jack, Hank        X
1            name         NaN      NaN           NaN      NaN
2   Company with.      $2,000       No  Rita, Hannah        X
3    another name         NaN      NaN           NaN      NaN

But this shows other problem - first column has values in two rows but pandas treats it as separated rows.

Use list items as column seperators pd.read_fwf

2 Answers2

EDIT