Replace only Even occurrences of re.sub() - Python Regex

Question

I am scraping a website with a really poor HTML structure, and I am getting text like this:

Example:

Creator:
\r\r
My Name
\r\r
Date created:
\r\r
123123
<br><br>
Title:
\r\r
Title here
\r\r

I want it to look like this:

Creator: My Name
\r\r
Date created:123123
Title:Title here
\r\r

I have this regex: `_str = re.sub('\r+','',_str)`, but I know it's wrong because it removes every \r.

Is there any way to iterate over re.sub()? Or do you have any other idea how I can achieve my goal?

Try _str = re.sub('([^\r]+)\r\r([^\r]+\r\r)', '\\1\\2', _str) — Skycc, Nov 09 '16 at 14:09
Check this relevant post http://stackoverflow.com/a/1732454/131057 — Luka Rahne, Nov 09 '16 at 14:13

score 3 · Answer 1 · answered Nov 09 '16 at 14:06


You should try something like replacing:

:
\r\r

with:

:


Thomas Blanquet


What you want is to remove the `\r` characters that follow a `:`. `re.sub('[:]\r+',':',_str)` should do this:
– Thomas Blanquet Nov 09 '16 at 14:17
Sorry, I messed up the previous comment. That should change `Name:\r\rMy_Name\r\r` to `Name:My_Name\r\r` – Thomas Blanquet Nov 09 '16 at 14:19
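
A minimal sketch of the colon-anchored substitution from the comment above, run on the question's sample. It assumes every label ends with `:` and that the sample is held in one Python string; `raw` and `join_labels` are illustrative names, not from the original post.

```python
import re

# The question's sample written as a single Python string (an assumption
# about how the scraped text is actually stored).
raw = 'Creator:\r\rMy Name\r\rDate created:\r\r123123<br><br>Title:\r\rTitle here\r\r'


def join_labels(text):
    """Drop the run of carriage returns that directly follows a colon."""
    return re.sub(r':\r+', ':', text)


cleaned = join_labels(raw)
print(repr(cleaned))
# 'Creator:My Name\r\rDate created:123123<br><br>Title:Title here\r\r'
```

This only touches separators that follow a `:`, so it does not depend on counting occurrences, but it would also join lines if a value itself ended with a colon.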

score 2 · Answer 2 · answered Nov 09 '16 at 14:10


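You can replace each \r\r run plus the group that follows it (including that group's trailing \r\r) with just the following group:

re.sub('\r+([^\r]+\r+)',r'\1',_str)

(I would have liked to do it with a lookahead, but here you have to consume the following pattern; otherwise the \r\r you want to keep would itself be matched and removed as the start of the next match.)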
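
Another way to skip every other separator is to pass a function as the replacement and count the matches, which is a different technique from the substitution above. The sketch below assumes that `<br>` runs should count as separators too; `drop_odd_separators` and `sep_pattern` are hypothetical names used only for illustration.

```python
import re
from itertools import count

# The question's sample written as a single Python string (an assumption).
raw = 'Creator:\r\rMy Name\r\rDate created:\r\r123123<br><br>Title:\r\rTitle here\r\r'


def drop_odd_separators(text, sep_pattern=r'\r+'):
    """Remove the 1st, 3rd, 5th... separator run and keep the 2nd, 4th..."""
    counter = count(1)

    def replace(match):
        # Odd-numbered match: delete it. Even-numbered match: keep it unchanged.
        return '' if next(counter) % 2 else match.group(0)

    return re.sub(sep_pattern, replace, text)


# Count both carriage-return runs and <br> runs as separators.
by_counter = drop_odd_separators(raw, r'(?:\r|<br>)+')
print(repr(by_counter))
# 'Creator:My Name\r\rDate created:123123<br><br>Title:Title here\r\r'
```

With the default `\r+` pattern the `<br><br>` run is not counted, so the pairing shifts after `123123` and the `\r\r` after `Title:` is kept; the group-consuming substitution behaves the same way on this sample.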


Jean-François Fabre


score 1 · Answer 3 · answered Nov 09 '16 at 14:07

Does it have to be regex?

s1 = 'Creator:\r\rMy Name\r\rDate created:\r\r123123<br><br>Title:\r\rTitle here\r\r'
s2 = ''.join(l + '\r\r' * (n % 3 == 1) for n, l in enumerate(s1.split('\r\r')))
# s2 == 'Creator:My Name\r\rDate created:123123<br><br>Title:Title here\r\r'
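
The `n % 3 == 1` test is tuned to this sample: `<br><br>` is not a split point, so `123123<br><br>Title:` stays in one piece and the pieces happen to fall into groups of three. A more explicit sketch, assuming the fields strictly alternate label, value and that `<br><br>` separates records the same way `\r\r` does, pairs the pieces up directly. `pair_fields` and its parameters are hypothetical names.

```python
# The question's sample written as a single Python string (an assumption).
raw = 'Creator:\r\rMy Name\r\rDate created:\r\r123123<br><br>Title:\r\rTitle here\r\r'


def pair_fields(text, sep='\r\r', alt_sep='<br><br>'):
    """Return 'label:value' strings, assuming label and value pieces alternate."""
    # Normalise the alternative separator, split, and drop empty pieces.
    parts = [p for p in text.replace(alt_sep, sep).split(sep) if p]
    # Even positions hold labels, odd positions hold the matching values.
    return [label + value for label, value in zip(parts[0::2], parts[1::2])]


fields = pair_fields(raw)
print(fields)
# ['Creator:My Name', 'Date created:123123', 'Title:Title here']
```

Joining `fields` with whatever separator is wanted then gives the final text, and a piece left without a partner is silently dropped by `zip`.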
