
I am using Python 2.7 and I am fairly familiar with regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with semicolons. The problem is that data wrapped in double quotes should retain its embedded commas. Here is an example:

Before:

"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"

After:

"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"

Is there a single regex that can do this?

panofish

5 Answers

1
# Python 2.7
import re

text = '''
  "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()

print "Before: " + text
print "After:  " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))

This produces the following output:

Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:  "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"

You can tinker with this here if you need more customization.
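As a rough sketch of what the `findall` call above is doing, the snippet below prints the intermediate list of fields before joining them. The name `FIELD` and the `'a,,b'` input are made up for illustration and are not part of the original answer. It also shows one edge case worth checking against your own data: because both alternatives require at least one character, an empty field is not matched at all.

```python
import re

# Same pattern as in the answer: a quoted item, or a run of non-comma characters.
FIELD = re.compile(r'(?:"[^"]+"|[^,]+)')

line = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
fields = FIELD.findall(line)   # one list entry per field
joined = ";".join(fields)
print(fields)
print(joined)

# Hypothetical input with an empty middle field: findall skips it,
# so the joined result would have one field fewer than the input.
with_empty = FIELD.findall('a,,b')
print(with_empty)
```

If empty fields can occur in the input, one of the `re.sub` based answers may be a safer fit, since they replace the delimiters in place instead of rebuilding the line from matched fields.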

rchang
1

You can use:

>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"


This matches a comma only if it is outside quotes: the lookahead requires that an even number of quotes follows the comma, up to the end of the string.
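To make the "even number of quotes" rule concrete, here is a minimal sketch that applies the same rule with an explicit count and compares it with the regex. The helper `replace_by_parity` is a hypothetical name introduced for illustration only, and the sketch assumes quotes are always balanced in the input.

```python
import re

# Pattern from the answer: a comma followed by an even number of quotes.
OUTSIDE_QUOTES = re.compile(r'(?=(([^"]*"){2})*[^"]*$),')

def replace_by_parity(s, old=',', new=';'):
    """Hypothetical helper: the same rule as the regex, written as a loop."""
    out = []
    for i, ch in enumerate(s):
        # s.count('"', i) counts the quotes from position i to the end.
        if ch == old and s.count('"', i) % 2 == 0:
            out.append(new)   # even count: the comma is outside quotes
        else:
            out.append(ch)    # odd count: the comma is inside a quoted item
    return ''.join(out)

s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
regex_result = OUTSIDE_QUOTES.sub(';', s)
loop_result = replace_by_parity(s)
print(regex_result)
print(regex_result == loop_result)
```

Both versions rescan the rest of the string for every comma, so the work grows quickly on very long lines.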

anubhava
1

This is another way, which avoids testing the whole string up to the end with a lookahead for each occurrence. It is (more or less) an emulation of the \G feature for the re module. Instead of testing what comes after the comma, this pattern finds the item before the comma (and the comma itself), and it is written so that each whole match is contiguous with the previous one.

re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)


details:

(?:          # ensures that matches are contiguous
    (?<=,)        # preceded by a comma (the one that ended the previous match)
  |             # OR
    ^             # at the start of the string
)
(?= # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
    (                        # capture the preceding item in group 1
        "(?:"")*(?:[^"]+(?:"")*)*"  # an item between quotes
      |
        [^",]*               # an item without quotes
    )
) \1  # back-reference to capture group 1
,     # the delimiter to replace

The advantage of this approach is that it reduces the number of steps needed to obtain a match, and the number of steps stays nearly constant whatever the preceding item is (see the regex101 debugger). The reason is that every character is matched/tested only once. So even though the pattern is longer, it is more efficient, and the gain grows with long lines in particular.

The atomic group trick is there only to reduce the number of steps before failing on the last item (which is not followed by a comma).

Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo
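A minimal sketch of the pattern in use, on both the question's sample line and the escaped-quote example above. The wrapper name `to_semicolons` is hypothetical, chosen only for this illustration.

```python
import re

# Pattern from this answer: match one whole item plus its trailing comma,
# starting either at the beginning of the string or right after a comma.
CONTIGUOUS = re.compile(
    r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,'
)

def to_semicolons(line):
    """Hypothetical wrapper: swap each delimiter comma for a semicolon."""
    return CONTIGUOUS.sub(r'\1;', line)

sample = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
escaped = '"abcd""efgh""ijkl","123""456""789",foo'

converted_sample = to_semicolons(sample)
converted_escaped = to_semicolons(escaped)
print(converted_sample)
print(converted_escaped)
```

One thing to check on real data: the last item is never consumed by a match, so the search continues inside it. If that last item is quoted and contains two or more commas, a later inner comma appears to qualify as "preceded by a comma" and may be replaced, so testing that case before relying on the pattern seems prudent.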

Casimir et Hippolyte
0

You can split with a regex and then join the pieces:

>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'
Mazdak
0

This regex seems to do the job

,(?=(?:[^"]*"[^"]*")*[^"]*\Z)

Adapted from: How to match something with regex that is not between two special characters?

And tested with http://pythex.org/
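For completeness, a short sketch of how this pattern might be applied from Python, using the sample line from the question. The variable names are illustrative only. Because the pattern contains no capturing groups, the same compiled regex can also be used with `split` to get the individual fields.

```python
import re

# Pattern from this answer: a comma followed by an even number of quotes
# (complete pairs only) up to the very end of the string.
DELIMITER = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*\Z)')

line = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'

replaced = DELIMITER.sub(';', line)   # swap only the delimiter commas
fields = DELIMITER.split(line)        # or split into fields on them
print(replaced)
print(fields)
```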

martintama