Parsing tab delimited string results in new-line character seen in unquoted field

Question

I have a tab-delimited text file whose contents are saved in a database field. When I try to parse that string (read from the database field), I keep getting the error "new-line character seen in unquoted field".

A lot of SO posts (here and here) deal with reading a file directly and specify with open(path, 'rb') as f or with open(path, 'rU'). However, I can't use with open(...) since I am reading the text/string value from a database record/field.

A simple example demonstrates my problem below.

import csv

s = """X\tY\tA\tB\tC\tD
E
F"""

list(csv.reader([s], delimiter='\t'))[0]  # raises csv.Error: new-line character seen in unquoted field

Conceptually, the line is X\tY\tA\tB\tC\tD\rE\rF\r\n.

What I would expect is ['X', 'Y', 'A', 'B', 'C', 'D\rE\rF'].

If the field is quoted, then everything works. But I have no control upstream over how this text is generated (it is impossible to change the process and re-export). Example below.

s = """X\tY\tA\tB\tC\t"D
E
F"
"""
list(csv.reader([s], delimiter='\t', quotechar='"'))[0]  # works: the quoted field keeps its newlines

Any ideas on how I can get this parsing to work?

Do all the rows have the same number of columns? If so you could split the text by newline, then each row by tab; rows with embedded newlines will be short, and so recognisable. Then write the lines into a csv so that they get quoted properly, if you need csv. — snakecharmerb, Sep 24 '19 at 12:50
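
The suggestion in the comment above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for the real data: it assumes every record has a known, fixed number of columns, that embedded newlines occur only in the last column, and that continuation lines contain fewer tab-separated fields than a full row. The function name merge_continuation_lines and its parameters are made up for this example.

```python
import csv
import io


def merge_continuation_lines(text, n_cols, delimiter="\t", joiner="\n"):
    """Split text into rows, folding short lines into the previous row's last field.

    Assumption: a line with exactly n_cols fields starts a new record;
    any other line is the continuation of the previous record's last field.
    """
    rows = []
    # str.splitlines() breaks on \n, \r and \r\n (and a few rarer separators).
    for line in text.splitlines():
        fields = line.split(delimiter)
        if len(fields) == n_cols or not rows:
            rows.append(fields)
        else:
            # Short line: re-attach it to the last field of the previous row.
            rows[-1][-1] += joiner + line
    return rows


s = "X\tY\tA\tB\tC\tD\rE\rF\r\n"
rows = merge_continuation_lines(s, n_cols=6)
print(rows)  # [['X', 'Y', 'A', 'B', 'C', 'D\nE\nF']]

# Optional: write the rows back out so the multi-line field gets quoted,
# then confirm that csv.reader can now parse the result.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter="\t").writerows(rows)
buf.seek(0)
round_tripped = list(csv.reader(buf, delimiter="\t"))
```

Passing joiner="\r" would keep the original carriage returns inside the field, which matches the expected output in the question. If embedded newlines can also occur in columns other than the last, this heuristic becomes ambiguous and would need a different rule.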
