How to remove newline within a column in delimited file?

Question

I have a file that looks like this:

1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
3333,CCC\nC,cccc\n
...

Where \n represents a newline.

When I read this line-by-line, it's read as:

1111,AAAA,aaaa\n
2222,BB\n
BB,bbbb\n
3333,CCC\n
C,cccc\n
...

This is a very large file. Is there a way to read a line until a specific number of delimiters, or remove the newline character within a column in Python?

There's the rstrip method as explained here: http://stackoverflow.com/questions/275018/how-can-i-remove-chomp-a-newline-in-python — westandy, Mar 31 '16 at 19:29
A newline is the line delimiter of a file. You don't have 3 lines with 3 fields each, you have 5 lines, some with three fields and some with two fields. Can you post the first few lines of your *actual* file? I'm curious to see if `BB\nBB` actually has quote characters around it. — Robᵩ, Mar 31 '16 at 19:38
What do you see when you `cat` (UNIX) or `type` (Windows) the file? Do you see the backslash-n sequence, or do you see line breaks? — Robᵩ, Mar 31 '16 at 19:49

score 2 · Accepted Answer · answered Mar 31 '16 at 19:35

After you read a line, count the commas in it with aStr.count(',').

While the count is below Num, the number of commas in a complete record (a field may contain more than one \n), read the next line and append it to the string:

while aStr.count(',') < Num:
    another = file.readline()
    if not another:  # end of file reached: stop instead of looping forever
        break
    aStr = aStr + another
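
The loop above can be wrapped in a generator so that a large file is still read one physical line at a time. This is a minimal sketch, not the answerer's code: it assumes every record has exactly three fields (two commas) and that no field contains a comma. The names `logical_records`, `expected_commas`, and `joiner` are illustrative only.

```python
import io


def logical_records(lines, expected_commas, joiner=""):
    """Yield one logical record per group of physical lines.

    Assumption: fields never contain the delimiter themselves.
    """
    buffer = ""
    for line in lines:
        buffer += line
        if buffer.count(",") < expected_commas:
            continue  # record is still incomplete, keep reading
        # Drop the record terminator, then replace the embedded newlines.
        yield buffer.rstrip("\n").replace("\n", joiner)
        buffer = ""
    if buffer:
        # Incomplete trailing record: hand it back rather than lose it.
        yield buffer.rstrip("\n").replace("\n", joiner)


# In-memory stand-in for the file in the question.
sample = io.StringIO("1111,AAAA,aaaa\n2222,BB\nBB,bbbb\n3333,CCC\nC,cccc\n")
records = list(logical_records(sample, expected_commas=2))
print(records)
```

Counting delimiters cannot detect a newline inside the last field, because the expected comma count is already reached before that field ends. If the last column may contain newlines, this approach is not sufficient.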

score 0 · Answer 2 · answered Mar 31 '16 at 19:29

1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n

According to your file, the \n here is not an actual newline character; it is plain text.

To strip actual newline characters, you could use strip() or variations such as rstrip() or lstrip().

If you work with large files, you don't need to load the full content into memory. You can iterate line by line and stop on a counter or any other condition.
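
One point worth illustrating: the strip family only removes characters at the ends of a string, so it does not touch a newline embedded in the middle of a field. A short sketch, assuming the newline in the question is a real line break and using a made-up sample string:

```python
text = "2222,BB\nBB,bbbb\n"

# rstrip removes only the trailing newline; the embedded one survives.
stripped = text.rstrip("\n")
print(repr(stripped))   # '2222,BB\nBB,bbbb'

# str.replace is needed to remove or substitute the embedded newline.
flattened = stripped.replace("\n", " ")
print(repr(flattened))  # '2222,BB BB,bbbb'
```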

answered by xiº

According to the post, he replaced the newline within the logical line with \n so that it can be visible to the readers. – Robert Jacobs Mar 31 '16 at 19:38

score 0 · Answer 3 · answered Mar 31 '16 at 19:45

I think perhaps you are parsing a CSV file that has embedded newlines in some of the text fields. Further, I suppose that the program that created the file put quotation marks (") around the fields.

That is, I suppose that your text file actually looks like this:

1111,AAAA,aaaa
2222,"BB
BB",bbbb
3333,"CCC
C",cccc

If that is the case, you might want to use code with better CSV support than just line.split(','). Consider this program:

import csv

# newline='' lets the csv module handle line breaks inside quoted fields itself
with open('foo.csv', newline='') as fp:
    reader = csv.reader(fp)
    for row in reader:
        print(row)

Which produces this output:

['1111', 'AAAA', 'aaaa']
['2222', 'BB\nBB', 'bbbb']
['3333', 'CCC\nC', 'cccc']

Notice that the five lines (delimited by newline characters) of the CSV file become three rows (some with embedded newline characters) in the CSV data structure.
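
To actually remove the embedded newlines, each field can be cleaned as the rows stream through the reader and then written back out. The following is a sketch under the same assumption as this answer (the affected fields are quoted); it uses in-memory buffers in place of real files, and replacing the newline with a space is an arbitrary choice.

```python
import csv
import io

# In-memory stand-in for the quoted CSV file shown above.
raw = '1111,AAAA,aaaa\n2222,"BB\nBB",bbbb\n3333,"CCC\nC",cccc\n'

source = io.StringIO(raw, newline="")
cleaned = io.StringIO(newline="")

reader = csv.reader(source)
writer = csv.writer(cleaned, lineterminator="\n")
for row in reader:
    # Replace newlines inside each field; rows are processed one at a time.
    writer.writerow([field.replace("\r", "").replace("\n", " ") for field in row])

print(cleaned.getvalue())
```

With real files, the two buffers would be replaced by `open(..., newline='')` handles for the input and output paths, and memory use stays bounded because only one row is held at a time.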
