1

Using the following CSV file:

"aa"!#"2811"!#"Location"!#"11"!#"67000"!#"ZZ"
"bb"!#2812!#"Location"!#"22"!#"67540"!#"XX"
"cc"!#"2813"!#Location!#"33"!#"67117"!#"YY"
"dd"!#"2452"!#"location"!#"44"!#"67000"!#"ZZ"

And using the following python code :

import pandas
import csv
pandas.read_csv("test.csv", sep="!#", header=None, quotechar='"')

Give the following result:

      0      0       1           2     3        4     5
0  "aa"  "2811"  "Location"  "11"  "67000"  "ZZ"
1  "bb"    2812  "Location"  "22"  "67540"  "XX"
2  "cc"  "2813"    Location  "33"  "67117"  "YY"
3  "dd"  "2452"  "location"  "44"  "67000"  "ZZ"

However, as I specified quotechar='"', the result should be

    0     1         2   3      4   5
0  aa  2811  Location  11  67000  ZZ
1  bb  2812  Location  22  67540  XX
2  cc  2813  Location  33  67117  YY
3  dd  2452  location  44  67000  ZZ

Am I missing anything ?

Edit: Replacing all !# by , makes it work, so apparently, quotechar isn't interpreted as sep is more than 1 char. So I'm looking for a solution without any str.replace() (I can't change the !#, and " are important as !# can be found within a column.

Blusky
  • 3,470
  • 1
  • 19
  • 35
  • 1
    quotechar is a quotechar, not a way to strip quotes. If you want to do type conversion or other processing, setting the quote char isn't going to automagically do that for you. – pvg Aug 18 '17 at 15:57
  • @tdube: Not really, this issue is a misformed CSV – Blusky Aug 18 '17 at 15:58
  • @pvg: It's not "Striping", the quotechar is there in case the separator is present in the CSV, the quotechar should not be present in the data (unless using `doublequote` or `escapechar`). – Blusky Aug 18 '17 at 15:59

6 Answers6

3

Please see here: python pandas read_csv quotechar does not work

The quotechar does not work if the separator is more than one character. I tried with a comma as a separator and it works.

lomizod
  • 56
  • 5
  • Indeed, replacing all `!#` by `,` makes it works, but I cannot change the CSV files. Any idea to make it work without any dirty `.replace()` ? – Blusky Aug 18 '17 at 16:03
2

Here is a workaround:

import re

data = []
with open(filename, 'r') as f:
    for row in f:
        data.append([re.sub(r'^"|"$', '', item).strip() for item in row.split('!#')])
df = pd.DataFrame(data)
>>> print(df)
    0     1         2   3      4   5
0  aa  2811  Location  11  67000  ZZ
1  bb  2812  Location  22  67540  XX
2  cc  2813  Location  33  67117  YY
3  dd  2452  location  44  67000  ZZ

For an explanation of the regular expression, it removes double quotations found either at the start or end of each parsed string.

  • ^" asserts position of the quotation at the start of the string.
  • "$ asserts positions of the quotation at the end of the string.
  • | matches either assertion above.

[re.sub(r'^"|"$', '', item).strip() for item in row.split('!#')] is a list comprehension that removes starting and ending quotations from each parsed element of the row and strips all whitespace.

Alexander
  • 105,104
  • 32
  • 201
  • 196
0
df = pd.read_csv("test.csv", delimiter="!#", header=None)
df = df.apply(lambda x: x.str.strip('"'))

output

    0     1         2   3      4   5
0  aa  2811  Location  11  67000  ZZ
1  bb  2812  Location  22  67540  XX
2  cc  2813  Location  33  67117  YY
3  dd  2452  location  44  67000  ZZ
Mr_U4913
  • 1,294
  • 8
  • 12
0

You could also use the replace function after reading the file

data = pd.read_csv('/Users/a70286/Desktop/temp.csv',sep="!#", header=None)
data.apply(lambda x: x.str.replace('"',""))
Gayatri
  • 2,197
  • 4
  • 23
  • 35
0

We know that using a multi-character delimiter with pd.read_csv will invoke the regex engine and using a regex delimiter is prone to ignoring quoted data -- at least, I don't see how it can be done, particularly if we wish !# to be treated as a non-delimiter when inside a pair of quotes.

So if we are going to use pd.read_csv, we'll have to use a single-character delimiter. What shall it be? If we use sep='!' then some columns will begin with # -- which will obscure the meaning of initial quote marks. Cleaning up the quote marks is an unwanted headache (notice the double quotes around X were not parsed correctly):

In [218]: pd.read_csv(StringIO(content), sep='!', quotechar='"', header=None)
Out[218]: 
      0        1                  2      3         4     5
0  a!#a  #"2811"  #"Location ""X"""  #"11"  #"67000"  #"1"
1    bb    #2812        #"Location"  #"22"  #"67540"  #"2"
2    cc  #"2813"          #Location  #"33"  #"67117"  #"3"
3    dd  #"2452"        #"location"  #"44"  #"67000"  #"4"

Therefore it is better to use sep='#'. So our hand is basically forced: we need to use sep='#', and quotechar='"':

In [219]: pd.read_csv(StringIO(content), sep='#', quotechar='"', header=None)
Out[219]: 
       0      1              2    3       4  5
0  a!#a!  2811!  Location "X"!  11!  67000!  1
1    bb!  2812!      Location!  22!  67540!  2
2    cc!  2813!      Location!  33!  67117!  3
3    dd!  2452!      location!  44!  67000!  4

Still, we'll need to clean up the !s after-the-fact, but at least quotes are properly respected (notice the !# inside the a!#a! was not interpreted as a delimiter).

The code below shows how to remove the !s and convert numeric-strings into numerals:

import pandas as pd
StringIO = pd.io.common.StringIO

content = '''\
"a!#a"!#"2811"!#"Location ""X"""!#11!!#"67000"!#"1"
"bb"!#2812!#"Location"!#"22"!#"67540"!#"2"
"cc"!#"2813"!#Location!#"33"!#"67117"!#"3"
"dd"!#"2452"!#"location"!#"44"!#"67000"!#"4"
'''
df = pd.read_csv(StringIO(content), sep='#', quotechar='"', header=None)
for col in df.columns[:-1]:
    try: df[col] = df[col].str[:-1]
    except AttributeError: pass
    try: df[col] = pd.to_numeric(df[col], errors='raise')
    except ValueError: pass
print(df)

yields

      0     1             2    3      4  5
0  a!#a  2811  Location "X"  11!  67000  1
1    bb  2812      Location   22  67540  2
2    cc  2813      Location   33  67117  3
3    dd  2452      location   44  67000  4

I modified your "CSV" data a bit to make sure !# inside quotes were not modified and that double quotes are properly parsed as single quotes.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
0

Using quoting = 3 in read_csv will work