1

When opening my CSV file in Notepad++ it shows the encoding is ANSI, and the line breaks are showing as LF. I'm trying to remove these line breaks via python, but have not found a solution. A couple solutions I've tried:

with open(csvTable, 'r') as f:
    raw_csv = f.read()
    clean_csv = raw_csv.strip()

with open(csvTable2, 'w') as f:
    f.write(clean_csv)

And:

data_set = pd.read_csv(data_file,skip_blank_lines=True, encoding='ANSI')
data_set.to_csv(target_file,index=False)

Can someone point me in the right direction?

J. Skinner
  • 309
  • 1
  • 8
  • 17

2 Answers2

1

You can (and should) use the csv module for any CSV-related work; beyond allowing you to control the lineterminator, it will also preserve the internal line breaks that may exist inside a "cell".

The following:

  • opens your input CSV and creates an output CSV, both with the CP-1252 (Windows-1252 or "ANSI") encoding

  • the newline="" directive tells the file handlers created by open() not to do anything with newlines as the files are being read and written; the csv module's reader and writer will handle all newline characters

  • importantly, newlines inside a cell will be left as is, while the newline at the end of a row (the "line terminator") will be set CRLF with lineterminator="\r\n"

import csv

ANSI = "cp1252"

with (
    open("input.csv", encoding=ANSI, newline="") as f_in,
    open("output.csv", "w", encoding=ANSI, newline="") as f_out,
):
    reader = csv.reader(f_in)
    writer = csv.writer(f_out, lineterminator="\r\n")
    for row in reader:
        writer.writerow(row)

Here's my input.csv:

Col1,Col2
"à
ÿ",1
b,2
c,3

Looking at it from hexdump:

00000000  43 6f 6c 31 2c 43 6f 6c  32 0a 22 e0 0a ff 22 2c  |Col1,Col2."�.�",|
00000010  31 0a 62 2c 32 0a 63 2c  33                       |1.b,2.c,3|
00000019

we can see that we have LF (0a) for the line terminators, and the à and ÿ characters are encoded as e0 and ff, according to CP-1252.

Here's ouptut.csv:

Col1,Col2
"à
ÿ",1
b,2
c,3

Looks to the same to the naked eye; here's how it looks from hexdump:

00000000  43 6f 6c 31 2c 43 6f 6c  32 0d 0a 22 e0 0a ff 22  |Col1,Col2.."�.�"|
00000010  2c 31 0d 0a 62 2c 32 0d  0a 63 2c 33 0d 0a        |,1..b,2..c,3..|
0000001e

We can see:

  • the à and ÿ are still encoded as e0 and ff, the CP-1252 encoding was kept

  • the internal newline between à and ÿ was kept as 0a (LF)

  • the line terminators were converted to 0d 0a (CRLF)

Zach Young
  • 10,137
  • 4
  • 32
  • 53
0

I ended up using this post to create a solution: Replace CRLF with LF in Python 3.6 it also helped me get over the hump and provided an understanding of what was happening underneath the hood.

OldFile=r"c:\Test\input.csv"
NewFile=r"C:\Test\output.csv"

#reading it in as binary keeps the cr lf in windows as is
with (
    open(OldFile, 'rb') as f_in,
    open(NewFile, 'wb') as f_out,
):


FileContent = f_in.read()

#removing all line breaks including the ones after the carriage return
oldLineFeed = b'\n'
newLineFeed = b''

FileContent = FileContent.replace(oldLineFeed, newLineFeed)

#only have a carriage return now at the end of each true line, added back in the line break 
oldLineFeed = b'\r'
newLineFeed = b'\r\n'
FileContent = FileContent.replace(oldLineFeed, newLineFeed)

f_out.write(FileContent)

f_in.close()
f_out.close()
Breadtruck
  • 1,943
  • 5
  • 25
  • 39