Is this possible using regular expression

Question

I am using Python 2.7 and I am fairly familiar with regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with semicolons. The problem is that data wrapped in double quotes should retain its embedded commas. Here is an example:

Before:

"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"

After:

"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"

Is there a single regex that can do this?

possible duplicate of [Python parse csv file - replace commas with colons](http://stackoverflow.com/questions/6630170/python-parse-csv-file-replace-commas-with-colons) — thegrinner, Jan 22 '15 at 20:50
Not a direct duplicate, but assuming you want to change the delimiter of a CSV that's the question you want to look at. — thegrinner, Jan 22 '15 at 20:50
You can use Python's `csv` package with strings. That will take care of the comma-within-quote issues for you. — , Jan 22 '15 at 21:14
not the same question, since I am asking for a regex solution — panofish, Jan 22 '15 at 21:23
Thanks Jack Maney, but the csv package doesn't support unicode and that's a killer for me. :( — panofish, Jan 22 '15 at 21:37
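For readers who are not tied to Python 2.7, here is a minimal sketch of the `csv` route suggested in the comments. It assumes Python 3, where the `csv` module works on `str` (Unicode) directly. The helper name `change_delimiter` is made up for illustration.

```python
import csv
import io

def change_delimiter(line, old=",", new=";"):
    """Re-emit one CSV line with a different delimiter (Python 3 only)."""
    rows = csv.reader(io.StringIO(line), delimiter=old)
    out = io.StringIO()
    # QUOTE_MINIMAL only quotes fields that need it under the NEW delimiter.
    writer = csv.writer(out, delimiter=new, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    # Drop the line terminator that the writer appends.
    return out.getvalue().rstrip("\r\n")

line = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
print(change_delimiter(line))
```

Note that this does not reproduce the "After" line from the question exactly: the reader strips the original quotes and the writer only adds quotes back where the new delimiter requires them, so the embedded commas survive but the surrounding quotes do not.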

score 1 · Answer 1 · answered Jan 22 '15 at 20:50

# Python 2.7
import re

text = '''
  "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()

print "Before: " + text
print "After:  " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))

This produces the following output:

Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:  "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"

You can tinker with this here if you need more customization.
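A small sketch of the same `findall` approach wrapped in a function, written so it also runs on Python 3. The name `join_fields` is hypothetical. The second call shows a limitation worth knowing about: because both alternatives require at least one character, empty fields are silently dropped.

```python
import re

# Either a quoted item, or a run of characters that are not commas.
FIELD = re.compile(r'"[^"]+"|[^,]+')

def join_fields(line, new_delim=";"):
    """Collect the fields with findall and glue them back together."""
    return new_delim.join(FIELD.findall(line))

print(join_fields('"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'))
# An empty field between two commas disappears from the result:
print(join_fields('a,,b'))
```

If empty fields matter in your data, one of the substitution-based answers is probably a safer starting point.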

I like the repl.it website! — panofish, Jan 22 '15 at 21:26

score 1 · Answer 2 · answered Jan 22 '15 at 20:53

You can use:

>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"

This matches a comma only when it is outside quotes, by requiring an even number of quotes between that comma and the end of the string.
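To make the even-number-of-quotes idea concrete, here is a sketch that puts the regex next to a plain loop doing the same parity check by counting. The function names are made up for illustration, and both versions assume the quotes in the line are balanced.

```python
import re

# The regex from this answer: a comma followed by an even number of quotes.
OUTSIDE_COMMA = re.compile(r'(?=(([^"]*"){2})*[^"]*$),')

def swap_delimiter(line, new=";"):
    return OUTSIDE_COMMA.sub(new, line)

def swap_by_counting(line, new=";"):
    """Same logic without a regex: count the quotes after each comma."""
    out = []
    for i, ch in enumerate(line):
        # Even number of quotes to the right means the comma is outside quotes.
        if ch == "," and line.count('"', i + 1) % 2 == 0:
            out.append(new)
        else:
            out.append(ch)
    return "".join(out)

s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
print(swap_delimiter(s))
print(swap_by_counting(s))
```

Both versions rescan the rest of the line for every comma, so the work grows quickly on long lines.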

Casimir et Hippolyte · Accepted Answer · 2015-01-23T22:48:45.677

This is another approach that avoids scanning the rest of the string with a lookahead for every comma. It is more or less an emulation of the \G anchor, which the re module lacks. Instead of testing what comes after the comma, this pattern matches the item before the comma (and the comma itself), and is written so that each match starts exactly where the previous one ended.

re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)

details:

(?:          # ensures that matches are contiguous
    (?<=,)        # preceded by a comma (the one that ended the previous match)
  |             # OR
    ^             # at the start of the string
)
(?= # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
    (                        # capture the item before the comma in group 1
        "(?:"")*(?:[^"]+(?:"")*)*"  # an item between quotes
      |
        [^",]*               # an item without quotes
    )
) \1  # back-reference to capture group 1
,

The advantage of this approach is that it reduces the number of steps needed to obtain a match, and that number stays nearly constant whatever the preceding item is (see the regex101 debugger). The reason is that every character is matched/tested only once. So even though the pattern is longer, it is more efficient (and the gain grows with long lines).

The atomic group trick is only here to reduce the number of steps before failing for the last item (that is not followed by a comma).

Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo
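A quick sketch to check the escaped-quote behaviour on the sample above. The function name `swap_contiguous` is hypothetical, and the replacement is done with a function instead of the `r'\1;'` template purely as a matter of taste.

```python
import re

# Pattern from this answer: match "item," pairs that follow one another.
ITEM_THEN_COMMA = re.compile(
    r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,'
)

def swap_contiguous(line, new=";"):
    # Put the captured item back and swap the trailing comma.
    return ITEM_THEN_COMMA.sub(lambda m: m.group(1) + new, line)

print(swap_contiguous('"abcd""efgh""ijkl","123""456""789",foo'))
print(swap_contiguous('"""abc"" def",x'))
```

The doubled quotes stay inside their items and only the separating commas are replaced.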

How can I get the match count? I tried using findall and finditer, but they returned a count that is 1 less than actual. findall does not return the last match? — panofish, Jan 23 '15 at 18:04
@panofish: normal, only items before a comma are matched, so the last item can not be matched! — Casimir et Hippolyte, Jan 23 '15 at 20:51
@panofish: sorry I have forgotten the case where the quoted item begins with an escaped quote `"""abc"" def"`. It is corrected. — Casimir et Hippolyte, Jan 23 '15 at 22:48
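On the match-count question in the comments: since the pattern only matches items that are followed by a comma, the number of fields is the number of matches plus one. One way to get both the new string and the count in a single pass is `re.subn`. This is a sketch, and the name `swap_and_count` is made up.

```python
import re

ITEM_THEN_COMMA = re.compile(
    r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,'
)

def swap_and_count(line, new=";"):
    """Return (converted line, number of fields)."""
    new_line, n_matches = ITEM_THEN_COMMA.subn(
        lambda m: m.group(1) + new, line
    )
    # The last item has no trailing comma, so it is never matched: add 1.
    return new_line, n_matches + 1

line = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
print(swap_and_count(line))
```

Whether an empty line should count as one field or zero depends on your data. This sketch reports one.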

score 0 · Mazdak · Answer 4 · 2015-01-22T21:10:34.283

You can split with a regex and then join the pieces:

>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'

edited Jan 22 '15 at 21:10

answered Jan 22 '15 at 20:46


score 0 · Answer 5 · edited May 23 '17 at 11:56

This regex seems to do the job

,(?=(?:[^"]*"[^"]*")*[^"]*\Z)

Adapted from: How to match something with regex that is not between two special characters?

And tested with http://pythex.org/


answered Jan 22 '15 at 20:56

martintama
