7

I have a large CSV file with one column and line breaks in some of its rows. I want to read the content of each cell and write it to a text file but the CSV reader is splitting the cells with line breaks into multiple ones (multiple rows) and writing each one to a separate text file.

Using Python 3.6.2 on a MAC Sierra

Here is an example:

"content of row 1"
"content of row 2 
 continues here"
"content of row 3"

And here is how I am reading it:

with open(csvFileName, 'r') as csvfile:

    lines= csv.reader(csvfile)

    i=0
    for row in lines:
        i+=1
        content= row

        outFile= open("output"+str(i)+".txt", 'w')

        outFile.write(content)

        outFile.close()

This is creating 4 files instead of 3 for each row. Any suggestions on how to ignore the line break in the second row?

Labibah
  • 5,371
  • 6
  • 25
  • 23
  • 2
    That source CSV doesn't seem properly formatted as a CSV. Try using an editor like Microsoft Excel or Google Sheets. They'll output the CSV correctly, with cells containing special characters wrapped in quotation marks. See https://stackoverflow.com/questions/566052/ – andrewgu Sep 05 '17 at 18:42
  • 2
    is the row delimiter literally row#? how can you tell when something is not a 'new' row – EoinS Sep 05 '17 at 18:42
  • 1
    strip the row and check if it is equal to empty string before creating files. Like this: ```content=row.strip()``` – arshovon Sep 05 '17 at 18:43
  • @andrewgu thanks but when I look at the file using vim the cells all are in quotation marks "" – Labibah Sep 05 '17 at 18:48
  • @arsho thanks but this did not fix it – Labibah Sep 05 '17 at 18:48
  • @Jean-FrançoisFabre thanks. I tried adding newline= '' " but it did not work either – Labibah Sep 05 '17 at 18:56
  • it's not possible for the csv module to differentiate between a new row and a multi-line row because you're missing the quotes in your input file. – Jean-François Fabre Sep 05 '17 at 18:56
  • @Jean-FrançoisFabre the rows actually have quotes. I just edited the example in my question to show that. – Labibah Sep 05 '17 at 18:58
  • I just tested with a simple csv with quotes and it works fine. – Jean-François Fabre Sep 05 '17 at 18:58
  • Then I cannot reproduce. Can you 1) tell us your version of python 2) your operating system and 3) paste a raw sample (without the "row" prefixes) of your file, thus creating a [mcve] – Jean-François Fabre Sep 05 '17 at 18:59
  • also try adding `,quotechar='"'` to csv reader constructor (I doubt it will work) – Jean-François Fabre Sep 05 '17 at 19:00
  • @Jean-FrançoisFabre edited the question as requested to include python version and OS. – Labibah Sep 05 '17 at 19:03
  • @Jean-FrançoisFabre I tried adding quotechar= ' " ' but as you expected, it did not change anything – Labibah Sep 05 '17 at 19:04
  • 1
    it still works fine here. python 3.4 windows. Sorry, cannot reproduce. Maybe it's an issue with invisible characters. Can you edit the file using hex editor? Can you try with the input you posted (in a new file) to convince yourself that your original input file has a problem. Because it works fine, as I said – Jean-François Fabre Sep 05 '17 at 19:06
  • @Jean-FrançoisFabre you are right. It is working for the simple example for me too. There must be some different problem with my big CSV file – Labibah Sep 05 '17 at 19:27
  • 1
    macintosh has a strange way to terminate lines. check hex editor between your simple file and your big file. Create little extracts of your big files. Check if lines end with 0D or 0D 0A or 0A... all the help I can offer, sorry. – Jean-François Fabre Sep 05 '17 at 19:30
  • This is in act the kind of CSV files that Microsoft Excel produces. – vy32 Apr 07 '19 at 15:19

2 Answers2

2

You could define a regular expression pattern to help you iterate over the rows.

Read the entire file contents - if possible.

s = '''"content of row 1"
"content of row 2 
 continues here"
"content of row 3"'''

Pattern - double-quote, followed by anything that isn't a double-quote, followed by a double-quote.:

row_pattern = '''"[^"]*"'''
row = re.compile(row_pattern, flags = re.DOTALL | re.MULTILINE)

Iterate the rows:

for r in row.finditer(s):
    print r.group()
    print '******'

>>> 
"content of row 1"
******
"content of row 2 
 continues here"
******
"content of row 3"
******
>>>
wwii
  • 23,232
  • 7
  • 37
  • 77
0

The file you describe is NOT a CSV (comma separated values) file. A CSV file is a list of records one per line where each record is separated from the others by commas. There are various "flavors" of CSV which support various features for quoting fields (in case fields have embedded commas in them, for example).

I think your best bet would be to create an adapter class/instance which would pre-process the raw file, find and merge the continuation lines into records and them pass those to your instance of csv.reader. You could model your class after StringIO from the Python standard libraries.

The point is that you create something which processes data but behaves enough like a file object that it can be used, transparently, as the input source for something like csv.reader().

(Done properly you can even implement the Python context manager protocol. io.StringIO does support this protocol and could be used as a reference. This would allow you to use instances of this hypothetical "line merging" adapter class in a Python with statement just as you're doing with your open file() object in your example code).

from io import StringIO
import csv
data = u'1,"a,b",2\n2,ab,2.1\n'
with StringIO(data) as infile:
    reader = csv.reader(infile, quotechar='"')
    for rec in reader:
        print(rec[0], rec[2], rec[1])

That's just a simple example of using the io.StringIO in a with statement Note that io.StringIO requires Unicode data, io.BytesIO requires "bytes" or string data (at least in 2.7.x). Your adapter class can do whatever you like.

Jim Dennis
  • 17,054
  • 13
  • 68
  • 116
  • 4
    Yes it is a valid CSV file, check the spec at seciont 2.6 https://tools.ietf.org/html/rfc4180 – Thiago Dantas Nov 11 '20 at 11:20
  • 2
    It is very much valid CSV. CSV file can contain CR/LF characters inside the field (column) but it needs to be quoted in such case. As per [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values) you can see that: _Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)_ – Wojciech Jakubas Feb 16 '21 at 21:53