
I have a CSV file that contains errors. The most common one is a premature line break.

Now I am not sure of the best way to remove it. If I read the file line by line

with open("test.csv", "r") as reader:
    test = reader.read().splitlines()

the wrong structure is already in my variable. Is this still the right approach? Should I use a for loop over test and create a copy, or can I manipulate the test variable directly while iterating over it?
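On the iterate-or-copy question: rebinding the loop variable never changes the underlying list, so building a new list (e.g. with a comprehension) is the usual approach. A quick illustration:

```python
test = ["a;\n", "b\n"]

for line in test:
    line = line.replace(";\n", ";")   # rebinds 'line' only; 'test' is untouched
print(test)      # unchanged: ['a;\n', 'b\n']

# building a new list keeps the changes
cleaned = [line.replace(";\n", ";") for line in test]
print(cleaned)   # ['a;', 'b\n']
```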

I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with one. So maybe counting them would be an alternative way to solve it?
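Counting separators can indeed work: if every complete row has the same number of semicolons, any line with fewer is a fragment that belongs to a neighbour. A minimal sketch of that idea (assuming four columns, i.e. three semicolons per complete row, as in the example further down):

```python
SEP = ";"
EXPECTED_SEPS = 3  # four columns -> three separators per complete row

def merge_fragments(lines):
    """Glue consecutive fragments together until a row has enough separators."""
    fixed, buffer = [], ""
    for line in lines:
        buffer += line.rstrip("\n")
        if buffer.count(SEP) >= EXPECTED_SEPS:
            fixed.append(buffer)
            buffer = ""
    if buffer:  # keep a trailing incomplete row rather than silently dropping it
        fixed.append(buffer)
    return fixed

rows = ["Col_a;Col_b;Col_c;Col_d\n", "2021;Foobar;Bla\n", ";Blub\n"]
print(merge_fragments(rows))  # ['Col_a;Col_b;Col_c;Col_d', '2021;Foobar;Bla;Blub']
```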

EDIT: I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows that end with a ;

cleaned = []
for line in lines:
    if "Foobar" in line:
        line = line.replace("Foobar", "")
    if ";\n" in line:
        line = line.replace(";\n", ";")
    cleaned.append(line)  # collect the result; rebinding 'line' alone does not change 'lines'

The only thing that remains is the rows that begin with a ; since for those I need to go back one entry in the list.

Example:

Col_a;Col_b;Col_c;Col_d 
2021;Foobar;Bla 
;Blub

Blub belongs in the row above.
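For this shape of breakage you do not actually need to go back one index: you can append any line that starts with a ; onto the last entry already collected. A short sketch using the example rows above:

```python
lines = ["Col_a;Col_b;Col_c;Col_d", "2021;Foobar;Bla", ";Blub"]

merged = []
for line in lines:
    if line.startswith(";") and merged:
        merged[-1] += line   # glue the fragment onto the previous row
    else:
        merged.append(line)

print(merged)  # ['Col_a;Col_b;Col_c;Col_d', '2021;Foobar;Bla;Blub']
```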

sa-biene
  • Take a look at `rstrip` https://stackoverflow.com/a/275025/15469537 – fravolt Sep 01 '21 at 11:55
  • 2
    Without an example it's hard to recommend anything in particular. Can you share 3-4 lines of CSV with an example of one or two each broken and proper records? – tripleee Sep 01 '21 at 11:55
  • 1
    In case it's not obvious, line breaks inside CSV records are permitted, as long as they are quoted. The Python `csv` module knows how to cope with this. – tripleee Sep 01 '21 at 11:56
  • You need to share the csv and explain how you can point to a "broken" line – balderman Sep 01 '21 at 11:56
  • `awk -F ';' 'NF+n <= 7 { printf "$0"; n+=$NF; next } { n=0 }1' broken.csv >fixed.csv` might do something useful if you have a semicolon-separated file which is supposed to have seven fields in each record. – tripleee Sep 01 '21 at 12:00
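As the comments note, line breaks inside quoted fields are legal CSV, and the standard csv module parses them without any pre-processing. A quick check of that claim (assuming a semicolon delimiter):

```python
import csv
import io

# a single logical record whose middle field contains a quoted line break
data = 'a;"multi\nline";c\n'
rows = list(csv.reader(io.StringIO(data), delimiter=";"))
print(rows)  # [['a', 'multi\nline', 'c']] - one record, three fields
```

So if the broken file happens to have its stray newlines inside quotes, no repair step is needed at all.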

3 Answers


This is how I deal with this. The function fixes a line if it has more columns than expected or if there is a line break in the middle.

Parameters of the function are:

  • message - content of the file - reader.read() in your case
  • columns - number of expected columns
  • filename - filename (I use it for logging)
def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''
    for line in message.splitlines():
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
        elif temp_line != '':
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message

Make sure you use the proper split character and the proper line-feed character.

Hubert

Here's a simple Python script to merge lines until you have the desired number of fields.

import sys

sep = ';'
fields = 4

collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(sep.join(collected))
    collected = []

This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost. The separator and the number of fields can be edited in the variables at the top; exposing them as command-line parameters is left as an exercise.

If you wanted to keep the newlines, it would not be too hard to strip the newline only from the last field, and use csv.writer to write the fields back out as properly quoted CSV.
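The csv.writer part could look roughly like this (a sketch, writing into an in-memory buffer so the quoting is easy to see; a real script would write to a file or sys.stdout):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", lineterminator="\n")
# a field containing the delimiter comes back quoted automatically
writer.writerow(["2021", "Foo;bar", "Bla", "Blub"])
print(buf.getvalue())  # 2021;"Foo;bar";Bla;Blub
```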

tripleee

I ended up using this post to create a solution: Replace CRLF with LF in Python 3.6. It helped me get over the hump and gave me an understanding of what was happening under the hood.

OldFile=r"c:\Test\input.csv"
NewFile=r"C:\Test\output.csv"

# reading it in as binary keeps the CR LF in Windows as is
with (
    open(OldFile, 'rb') as f_in,
    open(NewFile, 'wb') as f_out,
):
    FileContent = f_in.read()

    # removing all line breaks, including the ones after the carriage return
    oldLineFeed = b'\n'
    newLineFeed = b''
    FileContent = FileContent.replace(oldLineFeed, newLineFeed)

    # only a carriage return remains at the end of each true line; add the line break back
    oldLineFeed = b'\r'
    newLineFeed = b'\r\n'
    FileContent = FileContent.replace(oldLineFeed, newLineFeed)

    f_out.write(FileContent)
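The effect of the two replacements can be checked on a small byte string (hypothetical sample data): every \n is dropped first, then each bare \r is turned back into \r\n, which rejoins the prematurely broken row:

```python
broken = b"2021;Foobar;Bla\n;Blub\r\nCol;a;b;c\r\n"

step1 = broken.replace(b"\n", b"")     # remove all line feeds
fixed = step1.replace(b"\r", b"\r\n")  # restore CRLF after each carriage return

print(fixed)  # b'2021;Foobar;Bla;Blub\r\nCol;a;b;c\r\n'
```

Note this assumes the file is strictly CRLF-terminated; stray \n-only breaks are what get absorbed.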
Breadtruck