Regex to remove doubled double quotes from CSV

Question

I have an excel sheet that has a lot of data in it in one column in the form of a python dictionary from a sql database. I don't have access to the original database and I can't import the CSV back into sql with the local infile command due to the fact that the keys/values on each row of the CSV are not in the same order. When I export the excel sheet to CSV I get:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

What is the best way to remove the " before and after the curly brackets as well as the extra " around the keys/values?

I also need to leave the integers alone that don't have quotes around them.

I am trying to then import this into python with the json module so that I can print specific keys but I can't import them with the doubled double quotes. I ultimately need the data saved in a file that looks like:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Any help is most appreciated!

That's not CSV format. It looks like you're looking for JSON. — maček, Aug 24 '13 at 22:51
The bottom data line in my post is what is in the column of the excel sheet for each row. There are around 13k rows. When I save it to CSV, the top is what I get. I think I can use the JSON module but I need to get rid of the doubled double quotes. Due to the CSV format, excel is taking my existing quotes and doubling them when I save this to CSV. — Christopher, Aug 24 '13 at 22:52
Thanks for the quick suggestions! I have to run out but will check these out when I return and select an answer. I appreciate the help! — Christopher, Aug 24 '13 at 23:04

ridgerunner · Answer 1 · 2013-08-25T00:05:39.590

Easy:

`text = re.sub(r'"(?!")', '', text)`

Given the input file: TEST.TXT:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

The script:

import re
f = open("TEST.TXT","r")
text_in = f.read()
text_out = re.sub(r'"(?!")', '', text_in)
print(text_out)

produces the following output:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

score 2 · Answer 2 · answered Aug 25 '13 at 00:12

2

This should do it:

with open('old.csv') as old, open('new.csv', 'w') as new:
    new.writelines(re.sub(r'"(?!")', '', line) for line in old)

answered Aug 25 '13 at 00:12

elyase

39,479
12
112
119

score 1 · Answer 3 · edited May 23 '17 at 11:57

I think you are overthinking the problem, why don't replace data?

l = list()
with open('foo.txt') as f:
    for line in f:
        l.append(line.replace('""','"').replace('"{','{').replace('}"','}'))
s = ''.join(l)

print s # or save it to file

It generates:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Use a list to store intermediate lines and then invoke .join for improving performance as explained in Good way to append to a string

Mario Rossi · Accepted Answer · 2013-08-24T23:34:35.063

If the input file is just as shown, and of the small size you mention, you can load the whole file in memory, make the substitutions, and then save it. IMHO, you don't need a RegEx to do this. The easiest to read code that does this is:

with open(filename) as f:
    input= f.read()
input= str.replace('""','"')
input= str.replace('"{','{')
input= str.replace('}"','}')
with open(filename, "w") as f:
    f.write(input)

I tested it with the sample input and it produces:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Which is exactly what you want.

If you want, you can also pack the code and write

with open(inputFilename) as if:
    with open(outputFilename, "w") as of:
        of.write(if.read().replace('""','"').replace('"{','{').replace('}"','}'))

but I think the first one is much clearer and both do exactly the same.

@jabaldonedo `str` is **not** a reserved word. Otherwise my program would have not compiled and run. `str` **is** a built-in function that I can redefine if I want as I am not using it. But you are right, it can be confusing. I'm changing the code. Thanks for pointing it out and helping making the answer a better answer. — Mario Rossi, Aug 24 '13 at 23:18

score 1 · Answer 5 · answered Aug 24 '13 at 23:36

You can actual use the csv module and regex to do this:

st='''\
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\
'''

import csv, re

data=[]
reader=csv.reader(st, dialect='excel')
for line in reader:
    data.extend(line)

s=re.sub(r'(\w+)',r'"\1"',''.join(data))
s=re.sub(r'({[^}]+})',r'\1\n',s).strip()
print s

Prints

{"first_name":"John","last_name":"Smith","age":"30"}
{"first_name":"Tim","last_name":"Johnson","age":"34"}

Regex to remove doubled double quotes from CSV

5 Answers5

text = re.sub(r'"(?!")', '', text)

`text = re.sub(r'"(?!")', '', text)`