5

I have a file containing multiple entries. Each entry is of the following form:

"field1","field2","field3","field4","field5"

All of the fields are guaranteed not to contain any quotes, but they can contain commas. The problem is that field4 can be split across multiple lines. So an example file can look like:

"john","male US","done","Some sample text
across multiple lines. There
can be many lines of this","foo bar baz"
"jane","female UK","done","fields can have , in them","abc xyz"

I want to extract the fields using Python. If the field were not split across multiple lines, this would be simple (see "Extract string from between quotations"). But I can't seem to find a simple way to do this in the presence of multiline fields.

EDIT: There are actually five fields. Sorry about the confusion if any. The question has been edited to reflect this.

Subhasis Das

4 Answers

6

I think that the csv module can solve this problem. It splits correctly with newlines:

import csv

# newline='' lets the csv module handle line breaks inside quoted fields itself
with open('infile', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        for field in row:
            print('-- {}'.format(field))

It yields:

-- john
-- male US
-- done
-- Some sample text
across multiple lines. There
can be many lines of this
-- foo bar baz
-- jane
-- female UK
-- done
-- fields can have , in them
-- abc xyz
Birei
  • I thought about json, but it does not like \n in values. – eri Aug 31 '13 at 22:52
  • This solution might be good enough without the `newline` argument to `open` and it would work directly in Python 2. I understand why you used it, just thought it was worth noting. – Paulo Almeida Aug 31 '13 at 22:58
1

The answer from the question you linked worked for me:

import re

with open("test.txt") as f:
    text = f.read()

# Capture everything between each pair of double quotes, newlines included
string_list = re.findall(r'"([^"]*)"', text)

At this point, string_list contains your strings. Now, these strings can have line breaks in them, but you can use

new_strings = [s.replace("\n", " ") for s in string_list]

to clean that up.
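
Since every entry has exactly five fields and no field contains a quote, the flat list of matches can be grouped back into entries by taking five at a time. Here is a minimal sketch of that idea, assuming the file is well-formed; the names `parse_records` and `FIELDS_PER_ENTRY` are made up for illustration:

```python
import re

FIELDS_PER_ENTRY = 5  # from the question: each entry has five quoted fields


def parse_records(text):
    """Group quoted strings into five-field records (hypothetical helper)."""
    # [^"]* also matches newlines, so multiline fields are captured whole.
    fields = re.findall(r'"([^"]*)"', text)
    if len(fields) % FIELDS_PER_ENTRY != 0:
        raise ValueError("number of quoted fields is not a multiple of five")
    return [
        fields[i:i + FIELDS_PER_ENTRY]
        for i in range(0, len(fields), FIELDS_PER_ENTRY)
    ]


# Sample data taken from the question.
sample = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)

records = parse_records(sample)
for record in records:
    print(record)
```

This only works because of the guarantee that fields never contain quotes; if that ever changes, the csv module is the safer choice.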

Mark R. Wilkins
  • alecxe: Sure, and in that case one might need to do something more sophisticated, like reading the file in chunks and parsing them. – Mark R. Wilkins Aug 31 '13 at 22:50
0

Try:

awk 'BEGIN {FS=","} /pattern if needed/ {print $0}' fname
Vivek
  • Did you mean to use RS instead of OFS? Even then this does not work, since a field itself can contain `,` and the fields are multiline. Awk only reads a file line by line. – Subhasis Das Aug 31 '13 at 22:42
0

If you control the input to this file, sanitize it beforehand by replacing \n with a placeholder (e.g. [\n]) before putting the values into a comma-separated list.

Or, instead of saving plain strings, save them as r-strings.

Then use the csv module to parse the file quickly with a predefined separator, encoding, and quotechar.
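
As a rough sketch of the sanitizing idea, assuming you control both the writing and the reading side: the placeholder `[\n]` and the helpers `write_entries` and `read_entries` are made up for illustration, and the placeholder has to be a string that never occurs in the real data.

```python
import csv
import io

PLACEHOLDER = "[\\n]"  # hypothetical marker; must never appear in real data


def write_entries(out, entries):
    """Write rows with every field quoted and newlines replaced."""
    writer = csv.writer(out, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    for entry in entries:
        writer.writerow([field.replace("\n", PLACEHOLDER) for field in entry])


def read_entries(src):
    """Parse rows and restore the original newlines."""
    reader = csv.reader(src, delimiter=",", quotechar='"')
    return [[field.replace(PLACEHOLDER, "\n") for field in row]
            for row in reader]


entries = [
    ["john", "male US", "done",
     "Some sample text\nacross multiple lines.", "foo bar baz"],
    ["jane", "female UK", "done", "fields can have , in them", "abc xyz"],
]

# An in-memory buffer stands in for the real file in this sketch.
buffer = io.StringIO()
write_entries(buffer, entries)
sanitized = buffer.getvalue()  # one physical line per entry

restored = read_entries(io.StringIO(sanitized))
```

With a real file you would pass `newline=''` and the desired `encoding` to `open` instead of using `io.StringIO`.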

blakev