
I need to read a CSV file into pandas that has data in the following format (doubled 'double quotes' around one of the fields):

"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"

I expect the correctly parsed dataframe to look like this:

column1  column2  column3   column4
10       AB       ABCD      abcd
11       "CD,E"   CDEF      abcd
12       WER      "DEF,31"  abcd

I tried using

df= pd.read_csv('sample.txt',quotechar='""', quoting=csv.QUOTE_ALL)

and

df= pd.read_csv('sample.txt',quotechar='"', quoting=csv.QUOTE_ALL)

but I'm getting

TypeError: "quotechar" must be a 1-character string

and

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

Is there a way for me to read this file as is without having to preprocess and remove the double 'double quotes' in the data?

When column2 has no commas, I'm able to read the data, just with some extra quotes that I can strip in a later processing step. The parsing fails only when column2 contains a comma.
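
To make the failure concrete, here is roughly how the standard csv reader tokenizes the two kinds of rows (a minimal sketch; the embedded comma in the second row splits the field):

import csv

rows = ['"10",""AB"","ABCD","abcd"',
        '"11",""CD,E"","CDEF","abcd"']
for fields in csv.reader(rows):
    print(len(fields), fields)

# prints roughly:
# 4 ['10', 'AB""', 'ABCD', 'abcd']
# 5 ['11', 'CD', 'E""', 'CDEF', 'abcd']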

Vinay

4 Answers


I'm not sure pandas can do this by itself, since you have both unescaped separators and unescaped quotes in the data.

However, you should be able to parse it after preprocessing with a regex that escapes the quotes belonging to the field values.

import re
from io import StringIO

import pandas as pd

data = """\
"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"
"""

# backslash-escape every quote that belongs to the data rather than to the
# CSV structure, then let pandas undo the escaping via escapechar
data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', data, flags=re.M)

df = pd.read_csv(StringIO(data), escapechar='\\')

If you are reading from a file, then:

with open('path/to/csv', 'r') as f:
    data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', f.read(), flags=re.M)
    df = pd.read_csv(StringIO(data), escapechar='\\')

The regex looks for quotes that are not at the start ((?<!^)) or end ((?!$)) of a line, and that are not the opening ((?<!,")) or closing ((?!,")) quote of a field; only those data quotes get escaped.
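
For example, applied to the second data row, the pattern escapes only the two inner quotes (a quick sanity check reusing the regex above):

line = '"11",""CD,E"","CDEF","abcd"'
print(re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', line, flags=re.M))
# "11","\"CD,E\"","CDEF","abcd"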

kayoz

Pre-processing is not necessarily a bad thing. Done through a system pipe, it adds little overhead (compared with the in-memory StringIO approach above), and it can save a lot of time chasing known issues (e.g. DOS carriage returns, no EOL at EOF, NUL bytes in the middle of the file, etc., depending on your system).

The approach below handles this case plus some other common issues: extra quotes, stray spaces and commas, and embedded commas. While pandas does have a doublequote parameter, it's not flexible enough here.

Using a system pipe, which should be efficient for large files on Linux:

import os

import pandas as pd

# sed: strip leading whitespace, empty/comment-only lines and trailing
# comments; drop whitespace before commas; squeeze doubled "" down to "
df = pd.read_csv(
    os.popen('sed -r "s/^\s+|(^[,[:space:]]*|\s*)(#.*)?$//g; s/\s+,/,/g; s/\\"\\"/\\"/g" %s' % fname),
    quotechar='"', skipinitialspace=True)

Or, using a Python pipe:

import re
from io import StringIO

import pandas as pd

with open(fname) as f:
    # same cleanup as the sed pipeline above, in pure Python
    data = re.sub('""', '"', re.sub('[ \t]+,', ',',
        re.sub('^[ \t]+|(^[ \t,]*|[ \t]*)(#.*)?$', '', f.read(), flags=re.M)))
    df = pd.read_csv(StringIO(data), quotechar='"', skipinitialspace=True)

Input file with comments and issues:

a, b, c, d         # header w/ trailing spaces
,, ,,,,            # commas + spaces, no data
# extra space before data
  1, 2, 3.5, 4k     
3, " 5 " , 7.6, "n, m" # extra spaces, comma inside
10, "20" , 30.5, w z
40, 60, 75, ""x, q""   # double quoting

Output:

    a   b     c     d
0   1   2   3.5    4k
1   3   5   7.6  n, m
2  10  20  30.5   w z
3  40  60  75.0  x, q

It's now clean and properly formatted:

a int64
b int64
c float64
d object

list(df['d']):  ['4k', 'n, m', 'w z', 'x, q']
nightshift

This should do the trick for you:

import csv

import pandas as pd

# the two-character sep '",' makes pandas fall back to the python engine
df = pd.read_csv("so.txt", encoding='utf-8',
                 names=["column1", "column2", "column3", "column4"],
                 sep='",', header=0, quoting=csv.QUOTE_ALL)
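
As the comment below notes, this separator leaves stray quotes on each field. A possible cleanup, as a sketch (assuming all four columns come through as strings):

df = df.apply(lambda col: col.str.strip('"'))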
nithin
  • Thanks! Although I'm able to read the file, using the separator mentioned is causing the columns to be read with extra quotes for each field. Writing the df to a CSV file gives the following output, the first row being like this: ```"""10","""""AB""","""ABCD","""abcd"""``` – Vinay Oct 14 '19 at 09:43

I tried the other answers, but one didn't work and another mangled my text, so I ended up creating my own solution. I figured out that a good separator would be ", (a quote followed by a comma), but I couldn't get pandas to use it directly.

So this code reads the file line by line, splits each line on the ", sequence, and wraps the pieces in a pd.Series, since I could only append rows to the DataFrame as a Series.

import pandas as pd

df = pd.DataFrame()

with open("teste.csv", 'r') as arquivo:
    for i, linha in enumerate(arquivo, start=1):
        print(i)  # progress counter
        colunas = pd.Series(linha.split('",'))
        # note: DataFrame.append is deprecated in recent pandas versions
        df = df.append(colunas, ignore_index=True)

Note that this code took some time to process 70,000 lines, but it worked perfectly for me and on the OP's example.
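
The resulting frame still has the header as its first data row and stray quotes and whitespace around each field; a possible cleanup, as a sketch under those assumptions:

df = df.apply(lambda col: col.str.strip().str.strip('"'))
df.columns = df.iloc[0]                  # promote the first row to header
df = df.iloc[1:].reset_index(drop=True)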

Rhuan Barros