I have a problem with data exported from SAP. Sometimes the posting text contains a line break, so what should be one line ends up split across two, and this results in a badly broken data frame. The most annoying part is that I can't make pandas aware of the problem: it simply reads the broken lines, even when their column count is smaller than the header's.
An example of a broken data.txt:
MANDT~BUKRS~BELNR~GJAHR
030~01~0100650326
~2016
030~01~0100758751~2017
As you can see, the first record has an unwanted line break after 0100650326; the 2016 on the following line belongs to that first record. The third data line is as it should be.
If I import this file:
import pandas as pd

data = pd.read_csv(
    path_to_file,
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
I get this, which is clearly wrong:
   MANDT  BUKRS        BELNR   GJAHR
0   30.0      1  100650326.0     NaN
1    NaN   2016          NaN     NaN
2   30.0      1  100758751.0  2017.0
Is it possible to repair the unwanted line break, or to tell pandas to ignore lines whose column count is smaller than the header's?
For completeness, this is what I want to get:
   MANDT  BUKRS      BELNR  GJAHR
0     30      1  100650326   2016
1     30      1  100758751   2017
I tried reading the file with open and replacing '\n' (the line break) with '' (nothing), but that removes every line break and collapses the whole file into a single line, which is not what I want.
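
To make the idea more concrete, here is a rough sketch of the kind of pre-processing I have in mind: instead of removing every line break, keep appending physical lines to a buffer until it contains as many '~' separators as the header does. This is only a sketch under assumptions, not a tested solution. It assumes the separator never occurs inside the posting text and that the header line itself is intact; rejoin_broken_lines is a name I made up for illustration.

```python
def rejoin_broken_lines(lines, sep="~"):
    """Merge physical lines until each record has as many separators as the header.

    Assumption: `sep` never appears inside a field, and the header is not broken.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\r\n")
    expected_seps = header.count(sep)
    yield header

    buffer = ""
    for line in lines:
        # Append the next physical line without its line break.
        buffer += line.rstrip("\r\n")
        if buffer.count(sep) >= expected_seps:
            # The buffer now holds a complete record.
            yield buffer
            buffer = ""
    if buffer:
        # Incomplete trailing record: pass it through unchanged.
        yield buffer


# The example data from above, one list entry per physical line.
raw_lines = [
    "MANDT~BUKRS~BELNR~GJAHR\n",
    "030~01~0100650326\n",
    "~2016\n",
    "030~01~0100758751~2017\n",
]

fixed_text = "\n".join(rejoin_broken_lines(raw_lines))
print(fixed_text)
```

The repaired text could then presumably be handed to pandas through io.StringIO(fixed_text) with sep='~' instead of the file path, but I don't know whether there is a cleaner way to do this inside read_csv itself.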