I have the following string:

s = '"XIDJIJFHD8","Gothika","a0KU000000JMYCrMAP","USA","English","Sub & Audio","VOD","SD","01/01/2011 00:00:00.000000","12/31/2049 00:00:00.000000",,"Confirmed",,,,"Feature",,"2003-11-21","2004-03-23",,"R","for violence, brief language and nudity.","2024863","6000008953",,,"10.5240/A6FC-02AE-8093-3B05-6240-T","10.5240/D052-B470-0D01-25DF-DA91-4","2024863_6000008953","idwb:2024863_6000008953","CA-0000950613"'

I need to convert it to 'pipe-separated'. Fields are enclosed in double quotes ("), though an empty field has nothing between its commas. The number of | in the final output should be 31. Here is what I have so far:

re.sub(r'(\,|\")(,)(,|\")', '|', s)

However, the output of the above contains only 23 pipes. What would the correct regex be?

Or, even better, maybe I could just do it directly in the csv module. Something like:

string_with_pipes = csv.write(s, delimiter="|")

Note that I just want to get a modified string, not actually save a file.

David542
  • The regex doesn't work because successive commas are replaced by one pipe. They should in fact be replaced with a pipe for each one – ruaridhw Jan 03 '18 at 03:44
  • @ruaridhw right, so how would that be done? – David542 Jan 03 '18 at 03:46
  • See my answer below :) It's just a small tweak to your original regular expression: add `(?<=` to the start of the first group and `(?=` to the start of the third. You also don't need to escape , or " characters, and the | alternation can be replaced with a character class. `(\,|\")` becomes `(?<=[,"])` – ruaridhw Jan 03 '18 at 03:56

2 Answers


There is no need for regular expressions. You can do it with a combination of csv.reader() and csv.writer() using a temporary buffer for which we'll use StringIO:

import csv
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


s = '"XIDJIJFHD8","Gothika","a0KU000000JMYCrMAP","USA","English","Sub & Audio","VOD","SD","01/01/2011 00:00:00.000000","12/31/2049 00:00:00.000000",,"Confirmed",,,,"Feature",,"2003-11-21","2004-03-23",,"R","for violence, brief language and nudity.","2024863","6000008953",,,"10.5240/A6FC-02AE-8093-3B05-6240-T","10.5240/D052-B470-0D01-25DF-DA91-4","2024863_6000008953","idwb:2024863_6000008953","CA-0000950613"'

reader = csv.reader([s])

buffer = StringIO()
writer = csv.writer(buffer, delimiter="|")
writer.writerows(reader)

buffer.seek(0)
print(buffer.getvalue())

Prints:

XIDJIJFHD8|Gothika|a0KU000000JMYCrMAP|USA|English|Sub & Audio|VOD|SD|01/01/2011 00:00:00.000000|12/31/2049 00:00:00.000000||Confirmed||||Feature||2003-11-21|2004-03-23||R|for violence, brief language and nudity.|2024863|6000008953|||10.5240/A6FC-02AE-8093-3B05-6240-T|10.5240/D052-B470-0D01-25DF-DA91-4|2024863_6000008953|idwb:2024863_6000008953|CA-0000950613
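
The same approach can be wrapped in a small function that returns the converted string instead of printing it. This is only a sketch for Python 3: the name `csv_line_to_pipes` is made up for illustration, and it strips the trailing "\r\n" that csv.writer appends by default.

```python
import csv
from io import StringIO


def csv_line_to_pipes(line):
    """Re-emit one CSV line with '|' as the delimiter (hypothetical helper)."""
    buffer = StringIO()
    writer = csv.writer(buffer, delimiter="|")
    # csv.reader accepts any iterable of lines, so wrap the string in a list
    for row in csv.reader([line]):
        writer.writerow(row)
    # csv.writer terminates each row with "\r\n" by default; drop it
    return buffer.getvalue().rstrip("\r\n")


sample = '"a","b, c",,"d"'
print(csv_line_to_pipes(sample))  # a|b, c||d
```

Because the csv module parses the quoting, the comma inside "b, c" is left alone and only the field separators become pipes.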
alecxe
  • great, I like this approach the best. However when I do `writerows` I get an error: TypeError: unicode argument expected, got 'str'. Any idea? – David542 Jan 03 '18 at 03:49
  • Doesn't look like this can be done directly in python2.7 -- https://stackoverflow.com/a/13120279/651174. Would you be able to modify your answer to write to bytes? – David542 Jan 03 '18 at 03:58
  • @David542 yeah, was looking for a single Python 2 and 3 compatible way. Updated with a separate "BytesIO" based Python 2.x solution. Thanks. – alecxe Jan 03 '18 at 03:59
  • cool -- it actually works also if you just change the import statement `from StringIO import StringIO` – David542 Jan 03 '18 at 04:01
  • 1
    I updated your import statement in the first one and now it works for both python 2 and 3, so no need for the second answer :) – David542 Jan 03 '18 at 22:21

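The successive commas are being included in a single match: your pattern consumes the character on each side of a comma, so a neighbouring comma is swallowed by the match and never gets replaced itself.

You want a regex that checks those neighbouring characters are present without including them in the match: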

re.sub(r'(?<=[,"])(,)(?=[,"])', '|', s)

This uses lookahead and lookbehind assertions to check that the , or " is present without replacing it.

  1. (,) Match a comma
  2. (?<=[,"]) Immediately preceded by either a comma or double quote
  3. (?=[,"]) Immediately followed by either a comma or double quote

The (?<= and (?= at the start of the first and third groups make them zero-width assertions, so the characters they check are not consumed by the match and are not replaced.
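
As a quick check of the difference, here is a minimal sketch that runs both patterns on a shortened sample string (the sample is made up for illustration, not taken from the question):

```python
import re

# Shortened, made-up sample with runs of empty fields
sample = '"a",,,"b",,"c"'

# Pattern from the question: consumes the characters on both sides of the comma
original = re.sub(r'(\,|\")(,)(,|\")', '|', sample)

# Lookaround version: only the comma itself is matched and replaced
fixed = re.sub(r'(?<=[,"])(,)(?=[,"])', '|', sample)

print(original)  # "a|,"b|"c"
print(fixed)     # "a"|||"b"||"c"
```

Note that this keeps the surrounding double quotes in the output, and that it would also replace a comma inside a quoted field if that comma happened to sit next to a quote or another comma. The sample string in the question has no such field, but arbitrary CSV data might.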

ruaridhw