7

I'm trying to remove all commas that are inside double quotes (") with Python:

'please,remove all the commas between quotes,"like in here, here, here!"'
                                                          ^     ^

I tried this, but it only removes the first comma inside the quotes:

re.sub(r'(".*?),(.*?")',r'\1\2','please,remove all the commas between quotes,"like in here, here, here!"')

Output:

'please,remove all the commas between quotes,"like in here here, here!"'

How can I make it remove all the commas inside the quotes?

carloabelli

5 Answers

20

Assuming you don't have unbalanced or escaped quotes, you can use this regex based on negative lookahead:

>>> import re
>>> s = r'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '', s)
'foo,bar,"foobar barfoo foobarfoobar"'

This regex matches a comma only when it is inside double quotes. The negative lookahead asserts that the comma is NOT followed by an even number of quotes; an odd number of quotes ahead means the comma lies inside a quoted string.

Note about the lookahead (?!...):

  • ([^"]*"){2} matches a pair of quotes
  • (([^"]*"){2})* matches 0 or more pairs of quotes
  • [^"]*$ makes sure there are no more quotes after the last matched quote
  • So (?!...) asserts that an even number of quotes does not lie ahead, which means only commas inside a quoted string are matched.
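
The breakdown above can be sketched with the same pattern written in verbose mode, one piece per line. This is only an illustration: the names COMMA_INSIDE_QUOTES and strip_quoted_commas are made up for this example, the capturing groups are swapped for non-capturing ones, and the quotes are still assumed to be balanced and unescaped.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be
# commented. Non-capturing groups (?:...) replace the capturing ones; the
# set of matched commas is unchanged.
COMMA_INSIDE_QUOTES = re.compile(r'''
    (?!                 # negative lookahead: reject this position if the rest is
      (?:
        (?:[^"]*"){2}   #   a pair of quotes, each preceded by non-quote text,
      )*                #   repeated zero or more times (an even count),
      [^"]*$            #   followed by no further quotes up to the end
    )
    ,                   # the comma itself
''', re.VERBOSE)


def strip_quoted_commas(text):
    """Remove commas that have an odd number of quotes after them."""
    return COMMA_INSIDE_QUOTES.sub('', text)


print(strip_quoted_commas('foo,bar,"foobar, barfoo, foobarfoobar"'))
print(strip_quoted_commas('a,"b,c",d,"e,f"'))
```

The second call shows that commas between two quoted sections are kept, because an even number of quotes follows them.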
anubhava
  • 1
    Good news is all my quotes are balanced! Thanks! – carloabelli Jul 12 '16 at 18:45
  • 2
    it seems to work for me with multiple commas ... this is `re.magic` I hate regex in general ... but you sir are a genius with it – Joran Beasley Jul 12 '16 at 18:48
  • If you have time I'd be fascinated to know how on earth it works haha – carloabelli Jul 12 '16 at 18:49
  • 1
    Here's a [regex101](https://regex101.com/r/sF6nX0/1) with this. – Brendan Abel Jul 12 '16 at 18:53
  • 1
    @anubhava Thanks for the explanation as well. – carloabelli Jul 12 '16 at 18:55
  • Still will always be the wrong answer. To assume balanced quotes is ridiculous. Worse still, it takes 10 seconds to do just 55 lines of the sample. Given you are looking ahead to the end of file at every character position, it's exponentially like backtracking. This is probably the worst way to do this. –  Jul 12 '16 at 20:00
  • 1
    Thank you! This is brilliant for dealing with InfluxDB's "inconsistent" quoting of values. – user2460464 Jul 07 '17 at 16:14
  • 1
    wow this works like a charm. Although i don't like regex as it is very confusing but those who like and very fond of it. Thanks for help. – user3341078 Dec 23 '18 at 18:51
  • 1
    @anubhava, Thank you so much sir, your solution really helped me. – Pyd Mar 03 '22 at 18:08
3

You can pass a function as the repl argument instead of a replacement string. Just get the entire quoted string and do a simple string replace on the commas.

>>> s = 'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), s)
'foo,bar,"foobar barfoo foobarfoobar"'
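
As a rough sketch of the same idea with a named callback instead of a lambda (QUOTED, drop_commas and strip_quoted_commas are names invented for this example, and unescaped, balanced quotes are still assumed):

```python
import re

# Matches one complete quoted section: an opening quote, any non-quote
# characters, and the closing quote.
QUOTED = re.compile(r'"[^"]*"')


def drop_commas(match):
    """Return the matched quoted section with its commas removed."""
    return match.group(0).replace(',', '')


def strip_quoted_commas(text):
    # re.sub calls drop_commas once per quoted section and substitutes
    # whatever string it returns.
    return QUOTED.sub(drop_commas, text)


print(strip_quoted_commas('a,"b,c",d,"e,f",g'))
```

Each quoted section is matched once and handled locally, so nothing needs to look ahead to the end of the string.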
Brendan Abel
1

Here is another option I came up with if you don't want to use regex.

input_str = 'please,remove all the commas between quotes,"like in here, here, here!"'

def noCommas(string):
    quotes = False  # True while we are inside a quoted section
    output = ''
    for char in string:
        if char == '"':
            # Toggle on every quote so a closing quote ends the section
            quotes = not quotes
        if char == ',' and quotes:
            continue  # drop commas only while inside quotes
        output += char
    return output

print(noCommas(input_str))
albydarned
0

What about doing it without regex?

input_str = '...'

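# After splitting on '"', the quoted sections sit at the odd indices
# (assuming the quotes are balanced).
slices = input_str.split('"')

for i in range(1, len(slices), 2):
    slices[i] = slices[i].replace(',', '')

# Join on '"' to put back the quotes that split() removed.
result = '"'.join(slices)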
Dan
0

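The answer above that loops through the string character by character is very slow if you want to apply it to a 5 MB CSV file.

The split-based approach below seems reasonably fast and gives the same result as the for loop. Note that in this example the field delimiter is a semicolon, and each semicolon inside quotes is replaced with a comma: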

#!/usr/bin/env python3

data = 'hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma\n "ka ku"; "ki; ko"\n "ko;ma"; "ki ma"\n"ehe;";koko'

first_split = data.split('"')
# Even-indexed pieces lie outside quotes, odd-indexed pieces lie inside.
split01 = first_split[0::2]
split02 = [slc.replace(';', ',') for slc in first_split[1::2]]

# Interleave the two lists back into their original order.
resultlist = [item for sublist in zip(split01, split02) for item in sublist]
# zip() stops at the shorter list, so append the leftover final piece.
if len(split01) > len(split02):
    resultlist.append(split01[-1])
if len(split01) < len(split02):
    resultlist.append(split02[-1])

result = '"'.join(resultlist)
print(data)
print(split01)
print(split02)
print(result)

Results in:

hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma
 "ka ku"; "ki; ko"
 "ko;ma"; "ki ma"
"ehe;";koko
['hoko foko; moko soko; ', '; ', '; ', '; ', '; ', '; ehe mo; ', '; ko ma\n ', '; ', '\n ', '; ', '\n', ';koko']
['aaa mo, bia', 'ee mo', 'eka koka', 'koni, masa', 'co co', 'bi, ko', 'ka ku', 'ki, ko', 'ko,ma', 'ki ma', 'ehe,']
hoko foko; moko soko; "aaa mo, bia"; "ee mo"; "eka koka"; "koni, masa"; "co co"; ehe mo; "bi, ko"; ko ma
 "ka ku"; "ki, ko"
 "ko,ma"; "ki ma"
"ehe,";koko
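
To check the speed claim on your own data, a small timing harness along these lines may help. It is only a sketch: remove_with_loop and remove_with_split are hypothetical names, both remove commas inside quotes (as in the original question) rather than replacing semicolons, and the sample rows are synthetic, so the numbers will differ from a real CSV file.

```python
import timeit


def remove_with_loop(text):
    """Character-by-character version that tracks the quote state."""
    in_quotes = False
    out = []
    for char in text:
        if char == '"':
            in_quotes = not in_quotes
        if char == ',' and in_quotes:
            continue
        out.append(char)
    return ''.join(out)


def remove_with_split(text):
    """Split-based version: odd-indexed pieces are the quoted sections."""
    parts = text.split('"')
    parts[1::2] = [p.replace(',', '') for p in parts[1::2]]
    return '"'.join(parts)


# Synthetic sample data, made up for this comparison.
sample = 'id,"last, first",note,"a, b, c"\n' * 5000

# Both versions should agree before their timings are compared.
assert remove_with_loop(sample) == remove_with_split(sample)

for func in (remove_with_loop, remove_with_split):
    seconds = timeit.timeit(lambda: func(sample), number=5)
    print(f'{func.__name__}: {seconds:.3f} s for 5 runs')
```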
amirzolal