1

I have a ".dat" file that looks like this:

enter image description here

You can download it here: https://filebin.net/hnhsv8sc3zvaoi8k

I would like to read this file into a pandas data frame. I have found this question and tried the solution:

with open('Output2.dat','r') as f:
    df = pd.DataFrame(l.rstrip().split() for l in f)

The result looks like that:

enter image description here

  • the row "Balance" was displaced, How to change that?
  • How can I make sure that the first row is added as pandas column titles?

Thank you very much.

EDIT: This is how the final Table should look like:

enter image description here

henry
  • 875
  • 1
  • 18
  • 48

1 Answers1

1

The "empty elements" are 0xa0 bytes; non-breaking spaces in Latin-1, but spaces nevertheless, therefore they are seen as whitespace by str.split.

Here's a hacky workaround that works:

with open("Output2.dat", "rb") as f:  # binary mode
    df = pd.DataFrame(
        [e.decode("latin-1") if e != b'\xa0' else None for e in l.strip().split()]
        for l in f
    )

Here I'm opening the file in binary mode, because bytes.split ignores that special space. When I detect it, I further replace it with None to not have some weird character in the empty cells.

I'm assuming a latin-1 encoding, because 0xa0 on its own is invalid UTF-8, but you may need to replace that with windows-1252 or some other encoding, depending on where this data is coming from.

L3viathan
  • 26,748
  • 2
  • 58
  • 81
  • That's very nice ! Thanks a lot ! Is there a way to add the first row as column names in the pandas dataframe ? – henry Nov 05 '19 at 11:57
  • Sure, you could do something like `columns = [element.decode("latin-1") for element in next(f).strip().split()]` before the line where you define the dataframe, and provide them as columns like `df = pd.DataFrame(([e.decode....] for l in f), columns=columns)` – L3viathan Nov 05 '19 at 12:15