1

I am scraping a website with a really poor HTML structure, and I am getting text like this:

Example:

Creator:
\r\r
My Name
\r\r
Date created:
\r\r
123123
<br><br>
Title:
\r\r
Title here
\r\r

I want it to look like this:

Creator: My Name
\r\r
Date created:123123
Title:Title here
\r\r

I have this regex: _str = re.sub('\r+','',_str). But I know it's wrong because it removes every \r.

Is there any way to iterate over re.sub()? Or do you have any idea how I can achieve my goal?

Umair Ayub

3 Answers

3

You should try replacing each colon that is followed by \r\r, that is:

:
\r\r

with just the colon:

:
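The colon-based replacement could be sketched in Python as below. This is only an illustration: the helper name `join_labels` is made up for the example, and it assumes every label ends with a colon and that no value ends with one.

```python
import re

def join_labels(text):
    # Hypothetical helper: drop any run of carriage returns that directly
    # follows a colon, so "Creator:\r\rMy Name" becomes "Creator:My Name".
    # Carriage returns that do not follow a colon are left untouched.
    return re.sub(r':\r+', ':', text)

sample = ('Creator:\r\rMy Name\r\rDate created:\r\r123123'
          '<br><br>Title:\r\rTitle here\r\r')
result = join_labels(sample)
print(repr(result))
# 'Creator:My Name\r\rDate created:123123<br><br>Title:Title here\r\r'
```

For a fixed two-character separator, `text.replace(':\r\r', ':')` should behave the same way without a regex.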

Thomas Blanquet
2

You can replace each \r\r separator plus the group that follows it (including the \r\r that ends that group) with only the second part, which drops the first separator of each pair.

re.sub('\r+([^\r]+\r+)',r'\1',_str)

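(I would have liked to do it with a lookahead, but here you have to consume the following pattern; otherwise its trailing \r\r would be matched again as the start of the next pair and removed as well.)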
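A small sketch of how this substitution behaves, assuming the input is separated uniformly by \r\r. The function name `merge_pairs` and both sample strings are illustrative only.

```python
import re

def merge_pairs(text):
    # Match: a separator, then a chunk without '\r', then the next separator.
    # Keep only the chunk and its trailing separator (group 1), so the
    # first separator of each pair is removed.
    return re.sub(r'\r+([^\r]+\r+)', r'\1', text)

# Assumed input where every field is separated by '\r\r'.
uniform = 'Creator:\r\rMy Name\r\rDate created:\r\r123123\r\rTitle:\r\rTitle here\r\r'
print(repr(merge_pairs(uniform)))
# 'Creator:My Name\r\rDate created:123123\r\rTitle:Title here\r\r'

# The question's literal sample, where '<br><br>' replaces one separator.
literal = 'Creator:\r\rMy Name\r\rDate created:\r\r123123<br><br>Title:\r\rTitle here\r\r'
print(repr(merge_pairs(literal)))
# 'Creator:My Name\r\rDate created:123123<br><br>Title:\r\rTitle here\r\r'
```

The pattern relies on separators strictly alternating between "remove" and "keep". In the literal sample the `<br><br>` puts the pairs out of step, so `Title:` stays separated from its value. Converting `<br><br>` to `\r\r` first may be one way around that.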

Jean-François Fabre
1

Does it have to be regex?

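s1 = 'Creator:\r\rMy Name\r\rDate created:\r\r123123<br><br>Title:\r\rTitle here\r\r'
# Split on the separator, then re-append '\r\r' only after the pieces at
# index 1 and 4 (n % 3 == 1); True/False multiply the string as 1/0.
s2 = ''.join(l + '\r\r' * (n % 3 == 1) for n, l in enumerate(s1.split('\r\r')))
# s2 == 'Creator:My Name\r\rDate created:123123<br><br>Title:Title here\r\r'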
Dekel