
I want to open a CSV file with open() and read it line by line. For certain reasons, I'm not using Pandas.

I want to replace each comma , with _XXX_, but leave commas inside double quotes " untouched, because those commas are not separators. So I can't use:

string_ = string_.replace(',', '_XXX_')

How can I do this? Maybe with a regex?

I've found "replace comma inside quotation" and "Python regex: find and replace commas between quotation marks", but I need to replace commas OUTSIDE quotation marks.

Wiktor Stribiżew
Hjin
  • what's the reason you're not using `pandas`? – lenik May 31 '20 at 01:57
  • I do not see how [Regex to pick characters outside of pair of quotes](https://stackoverflow.com/questions/632475/regex-to-pick-characters-outside-of-pair-of-quotes) can help, most solutions are very inefficient (those with lookaheads, they must be avoided by all means), and the one worth attention is for PCRE only, and it requires specific Python knowledge to make it work in Python. – Wiktor Stribiżew Jul 18 '20 at 17:13

1 Answer

You may use re.sub with a simple "[^"]*" regex (or (?s)"[^"\\]*(?:\\.[^"\\]*)*" if you also need to handle escape sequences between double quotes) to match strings between double quotes, capture that pattern into Group 1, and match a comma in all other contexts. Then pass a callable as the replacement argument: it receives the match object, returns Group 1 unchanged when a quoted string matched, and otherwise returns the replacement for the comma. The examples here replace the comma with an empty string; return "_XXX_" instead to get the result asked for in the question.

import re
print( re.sub(r'("[^"]*")|,', 
    lambda x: x.group(1) if x.group(1) else x.group().replace(",", ""),
    '1,2,"test,3,7","4, 5,6, ... "') )
    # => 12"test,3,7""4, 5,6, ... "

print( re.sub(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|,', 
    lambda x: x.group(1) if x.group(1) else x.group().replace(",", ""),
    r'1,2,"test, \"a,b,c\" ,03","4, 5,6, ... "') )
    # => 12"test, \"a,b,c\" ,03""4, 5,6, ... "

See the Python demo.
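As a minimal sketch of how this applies to the question, the same simple pattern can be wrapped in a helper that inserts _XXX_ and is called once per line. The helper name replace_outside_quotes and the io.StringIO stand-in for the file returned by open() are assumptions made for illustration; they are not part of the original answer.

```python
import io
import re

# Group 1 captures a double-quoted string; the second alternative is a bare comma.
PATTERN = re.compile(r'("[^"]*")|,')

def replace_outside_quotes(line, repl="_XXX_"):
    """Replace commas outside double quotes (hypothetical helper name)."""
    # Return the quoted string unchanged when Group 1 matched, else the replacement.
    return PATTERN.sub(
        lambda m: m.group(1) if m.group(1) is not None else repl,
        line,
    )

# io.StringIO stands in for the file object returned by open("file.csv").
fake_file = io.StringIO('1,2,"test,3,7","4, 5,6, ... "\na,"b,c",d\n')
converted = [replace_outside_quotes(line.rstrip("\n")) for line in fake_file]

for row in converted:
    print(row)
```

Note that this works line by line, so it assumes no quoted field contains a line break.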

Regex details

  • ("[^"]*")|,:
    • ("[^"]*") - Capturing group 1: a ", then any 0 or more chars other than " and then a "
    • | - or
    • , - a comma

The second pattern is:

  • (?s) - the inline version of a re.S / re.DOTALL flag
  • ("[^"\\]*(?:\\.[^"\\]*)*") - Group 1: a ", then any 0 or more chars other than " and \, then 0 or more sequences of a \ and any one char followed by 0 or more chars other than " and \, and then a "
  • | - or
  • , - comma.
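The difference between the two patterns shows up only when the quoted text contains backslash-escaped quotes. The sketch below runs both on the same input; the names SIMPLE, ESCAPED and sub_commas are made up for this illustration, and _XXX_ is the replacement from the question.

```python
import re

# Simple pattern: a quoted string ends at the very next double quote.
SIMPLE = re.compile(r'("[^"]*")|,')
# Escape-aware pattern: a backslash plus any one char is consumed as a unit,
# so \" does not end the quoted string.
ESCAPED = re.compile(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|,')

def sub_commas(pattern, text, repl="_XXX_"):
    """Replace commas that fall outside Group 1 (hypothetical helper name)."""
    return pattern.sub(
        lambda m: m.group(1) if m.group(1) is not None else repl,
        text,
    )

text = r'1,2,"test, \"a,b,c\" ,03","4, 5,6, ... "'

simple_result = sub_commas(SIMPLE, text)
escaped_result = sub_commas(ESCAPED, text)

print(simple_result)   # commas in a,b,c are replaced: \" was treated as a closing quote
print(escaped_result)  # commas in a,b,c are kept: \" stays inside the quoted string
```

If the data never uses backslash escapes inside quotes, the simple pattern is enough.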
Wiktor Stribiżew