
I have a problem with data exported from SAP. Sometimes there is a line break in the posting text, so what should be one line ends up as two, and this results in a pretty bad data frame. The most annoying thing is that I am unable to make pandas aware of this problem; it just reads those broken lines even when the column count is smaller than the header's.

An example of a wrong data.txt:

MANDT~BUKRS~BELNR~GJAHR
030~01~0100650326
~2016
030~01~0100758751~2017

You can see that the first line has a wrong line break after 0100650326; the 2016 belongs to the first row. The third line is as it should be.

If I import this file:

data = pd.read_csv(
    path_to_file,
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)

I get this, which is pretty wrong:

   MANDT  BUKRS        BELNR   GJAHR
0   30.0      1  100650326.0     NaN
1    NaN   2016          NaN     NaN
2   30.0      1  100758751.0  2016.0

Is it possible to fix the wrong line break, or to tell pandas to ignore lines where the column count is smaller than the header's?

Just to make it complete, this is what I want to get:

   MANDT  BUKRS      BELNR  GJAHR
0     30      1  100650326   2016
1     30      1  100758751   2016

I tried to open the file and replace '\n' (the line break) with '' (nothing), but that results in a single-line file, which is not intended.
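One idea I considered instead of replacing every '\n': merge each short line with the following one(s) until the separator count matches the header. A sketch (the function name `fix_breaks` is mine, and it assumes a stray break never adds extra separators):

```python
import io
import pandas as pd

def fix_breaks(text, sep='~'):
    """Re-join records that were split by a stray line break.

    A line with fewer separators than the header line is glued onto
    the following line(s) until the separator count matches.
    """
    lines = text.splitlines()
    expected = lines[0].count(sep)   # header defines the field count
    fixed, buf = [], ''
    for line in lines:
        buf += line
        if buf.count(sep) >= expected:
            fixed.append(buf)
            buf = ''
    if buf:                          # keep a trailing partial record, if any
        fixed.append(buf)
    return '\n'.join(fixed)

raw = ("MANDT~BUKRS~BELNR~GJAHR\n"
       "030~01~0100650326\n"
       "~2016\n"
       "030~01~0100758751~2017")
data = pd.read_csv(io.StringIO(fix_breaks(raw)), sep='~')
```

This keeps the real line breaks and only removes the wrong ones, so the file does not collapse into a single line.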

Erik Steiner

2 Answers


You can do some pre-processing to get rid of the unwanted breaks. Below is an example which I tested.

import fileinput

# Pass 1: mark every line break with a sentinel character ('^').
with fileinput.FileInput('input.csv', inplace=True, backup='.orig.bak') as file:
    for line in file:
        print(line.replace('\n', '^'), end='')

# Pass 2: a sentinel directly before the delimiter marks a wrong break,
# so drop it to re-join the split record.
with fileinput.FileInput('input.csv', inplace=True, backup='.1.bak') as file:
    for line in file:
        print(line.replace('^~', '~'), end='')

# Pass 3: turn the remaining sentinels back into real line breaks.
with fileinput.FileInput('input.csv', inplace=True, backup='.2.bak') as file:
    for line in file:
        print(line.replace('^', '\n'), end='')
Hassan Voyeau
  • No, it will only break sometimes. I cannot see the reason for the wrong break, but I know that it occurs only in that column. – Erik Steiner Sep 17 '18 at 11:45
  • @ErikSteiner Modified my answer with actual code. Let me know if that helps you. – Hassan Voyeau Sep 17 '18 at 12:39
  • It works in my example, but if I have a different encoding I get an error. `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7094: character maps to ` I tried to follow [THIS](https://stackoverflow.com/questions/25203040/fileinput-inplace-filtering-encoding) suggestion, but I guess that write new line is a problem and not helpful, because it will not fix my lines.. – Erik Steiner Sep 19 '18 at 13:52
  • @ErikSteiner What encoding do you need to support and can you give 3 example lines that will give the above error? – Hassan Voyeau Sep 19 '18 at 14:12

The correct way would be to fix the file at creation time. If that is not possible, you can pre-process the file or use a wrapper.

Here is a solution using a byte-level wrapper that combines lines until they contain the correct number of delimiters. I use a byte-level wrapper in order to reuse the classes of the io module and add as little code of my own as possible: a RawIOBase reads lines from an underlying byte file object and combines them until the expected number of delimiters is reached (only readinto and readable are overridden).

import io

class csv_wrapper(io.RawIOBase):
    def __init__(self, base, delim):
        self.fd = base           # underlying (byte) file object
        self.nfields = None
        self.delim = ord(delim)  # code of the delimiter (passed as a character)
        self.numl = 0            # line number, for error reporting
        self._getline()          # load and process the header line

    def _nfields(self):
        # number of delimiters in the current line
        return len([c for c in self.line if c == self.delim])

    def _getline(self):
        while True:
            # load a new line into the internal buffer
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:           # store the number of delims if not known
                self.nfields = self._nfields()
            else:
                while self.nfields > self._nfields():  # optionally combine lines
                    self.line = self.line.rstrip() + next(self.fd)
                    self.numl += 1
            if self.nfields != self._nfields():        # too many here...
                print("Too many fields on line {}".format(self.numl))
                continue               # ignore the offending line and proceed
            self.index = 0                             # reset line pointers
            self.linesize = len(self.line)
            break

    def readinto(self, b):
        if len(b) == 0:
            return 0
        if self.index == self.linesize:            # if the current buffer is exhausted
            try:                                   # read a new one
                self._getline()
            except StopIteration:
                return 0
        n = 0                                      # count the bytes actually stored
        while n < len(b) and self.index < self.linesize:
            b[n] = self.line[self.index]
            self.index += 1
            n += 1
        return n                                   # number of bytes written into b

    def readable(self):
        return True

You can then change your code to:

data = pd.read_csv(
    csv_wrapper(open(path_to_file, 'rb'), '~'),
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
Serge Ballesta
  • What if I have this `030~01~The sign ~ is great~2017`? If BELNR is a text field and someone used the delimiter sign in that text field, with `error_bad_lines=True` pandas would skip this line, but what would happen with your wrapper? Would the wrapper try to combine more lines in order to reach the delimiter count? – Erik Steiner Sep 19 '18 at 14:07
  • `numl` in `raise Exception("Too much fields line {}".format(numl))` is not defined says PyCharm.. – Erik Steiner Sep 19 '18 at 14:54
  • @ErikSteiner: it should have been `self.numl`. Fixed. But this is not what your first comment wants, so the code has been changed to ignore lines with too many delimiters. – Serge Ballesta Sep 19 '18 at 15:10