3

I want to replace following string

comments={ts=2010-02-09T04:05:20.777+0000, comment_id=529590|2886|LOL|Baoping Wu|529360}

in

comments={ts=2010-02-09T04:05:20.777+0000, comment_id=529590, user_id = 2886, comment='LOL', user= 'Baoping Wu', post_commented=529360}

My approach is comment_id=.([0-9])* for the first replace Its difficult for me for the other replaces. Can anyone help me?

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563

2 Answers2

0

You can perform all these transformations with one search and replace operation. Use the following regex that has capturing groups:

(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)

Replace with $1$2, user_id = $3, comment='$4', user= '$5', post_commented=$6

See the regex demo

  • (comment_id=) - Group 1, a literal character sequence
  • (\d+) - Group 2: one or more digits
  • \| - a literal pipe symbol
  • (\d+) - Group 3 matching another portion of digits
  • \| - again, a pipe
  • ([^|]+) - Group 4 capturing one or more symbols other than |
  • \| - again, a pipe
  • ([^|]+) - Group 5 capturing one or more symbols other than|`
  • \| - another pipe
  • (\d+) - Group 6 matching another portion of digits

In the replacement string, $n are backreferences to the captured groups.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks:). My dataset is too big to test it on this website. Do you know how i can test it with a python script – Okan Albayrak Feb 24 '16 at 15:55
  • At [regex101.com](https://regex101.com/r/fP0fS9/3), you can get that Python code: go to the *code generator* section (see the left bottom pane). Python code sample is at the bottom. Note that Python backreferences are of the ``\`` + `digit` form. The replacement will be `\1\2, user_id = \3, comment='\4', user= '\5', post_commented=\6`. – Wiktor Stribiżew Feb 24 '16 at 16:00
  • I had try it on IDLE. I got a syntax Error as follows: p = re.compile(ur'(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)') SyntaxError: invalid syntax Because of this symbol ' – Okan Albayrak Feb 24 '16 at 16:14
  • Sorry, right now I am on a bus and can't check but the syntax is correct. Write `import re`, then declare a `s = "your string"`, then check with `print(re.sub(r'(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)', r"\1\2, user_id = \3, comment='\4', user= '\5', post_commented=\6", s))`. In real life, youyou'd want to read the file line by line though and write modified lines to a separate file. – Wiktor Stribiżew Feb 24 '16 at 16:19
  • Note I have put the pattern between single quotes and the replacement pattern between double quotes. – Wiktor Stribiżew Feb 24 '16 at 16:22
  • Thanks, now I want to use it on multiple lines with open('') as f: for line in f: p = re.compile(ur'(comment_id=)(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(\d+)') subst = u"\1\2, user_id = \3, comment='\4', user= '\5', post_commented=\6" But it doesnt work...Can you help me – Okan Albayrak Feb 24 '16 at 16:45
  • Please see [this SO post](http://stackoverflow.com/questions/12755587/using-python-to-write-specific-lines-from-one-file-to-another-file0) it should help. If not, I will help you once my kids go to bed in a couple of hours. – Wiktor Stribiżew Feb 24 '16 at 16:49
  • Since you got help with Python in [another SO question](http://stackoverflow.com/questions/35609601/optimization-of-parsing-of-python-scripts) please consider finalizing this question. – Wiktor Stribiżew Feb 24 '16 at 21:14
0

try this instead:

comment_id=.*?(?=,)

example

bmbigbang
  • 1,318
  • 1
  • 10
  • 15