replace some part of a word with regex

Question

How do you delete the text inside <ref> *some text*</ref> together with the ref tags themselves?

in '...and so on<ref>Oxford University Press</ref>.'

re.sub(r'<ref>.+</ref>', '', string) only removes the <ref> if <ref> is followed by whitespace

EDIT: it has something to do with word boundaries I guess...or?

EDIT 2: What I need is for it to match the closing </ref> even if it is on a new line.

Vegar Westerlund · Accepted Answer · 2010-11-10T22:27:58.563

3

I don't really see your problem, because the code you pasted will remove the <ref>...</ref> part of the string. But if what you mean is that an empty ref tag is not removed:

re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')

Then what you need to do is replace the .+ with .*

A + means one or more, while * means zero or more.
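
As a small sketch of that difference, the two patterns can be run against an empty and a non-empty tag. The variable names here are only illustrative, and the sample strings are adapted from the question:

```python
import re

empty = '...and so on<ref></ref>.'
filled = '...and so on<ref>Oxford University Press</ref>.'

# .+ needs at least one character between the tags, so the empty tag survives.
plus_on_empty = re.sub(r'<ref>.+</ref>', '', empty)
# .* also accepts zero characters, so both forms are removed.
star_on_empty = re.sub(r'<ref>.*</ref>', '', empty)
star_on_filled = re.sub(r'<ref>.*</ref>', '', filled)

print(plus_on_empty)   # ...and so on<ref></ref>.
print(star_on_empty)   # ...and so on.
print(star_on_filled)  # ...and so on.
```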

From http://docs.python.org/library/re.html:

'.' (Dot.) In the default mode, this matches any character except a newline.
    If the DOTALL flag has been specified, this matches any character including
    a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
    RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
    followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
    RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
    not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    ab? will match either ‘a’ or ‘ab’.

edited Nov 10 '10 at 22:27

answered Nov 10 '10 at 22:04

Vegar Westerlund


and what if the closing `</ref>` is on a newline? how do I handle it? – Gusto Nov 10 '10 at 22:12
apparently there is a flag (re.DOTALL) which makes '.' match all characters _including_ newline. But that doesn't seem to work with the re module I've got in python2.6. *Update:* Looking at docs.python.org/library/re.html it says re.sub: Changed in version 2.7,3.1: Added the optional flags argument. – Vegar Westerlund Nov 10 '10 at 22:23
I've tried that (re.DOTALL) `re.sub(r'(?s)<ref>.*</ref>'` but it's losing control and removes too much, more than half of the text - this is absolutely wrong. any other ideas? – Gusto Nov 10 '10 at 22:34
Again from http://docs.python.org/library/re.html: "The '*', '+', and '?' qualifiers are all greedy". Which means it will match from the first <ref> to the _last_ </ref>. You can change this by adding a ? to the * (re.sub(r'(?s)<ref>.*?</ref>'). Try it out – Vegar Westerlund Nov 10 '10 at 22:41
it looks like `re.sub(r'(?s)<ref>.*?</ref>'` using re.DOTALL and `r'<ref>[^<]*</ref>'` using `[^<]` by @erkmene do the same thing – Gusto Nov 10 '10 at 22:53
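
The points raised in these comments can be checked with a short sketch. It assumes a made-up sample string with a closing tag on its own line, and it compiles the pattern so that the flag can be passed even where re.sub() has no flags argument:

```python
import re

text = 'first<ref>Oxford\nUniversity Press\n</ref> middle<ref>XX</ref> last'

# DOTALL lets '.' cross newlines; '*?' stops at the nearest closing tag.
pattern = re.compile(r'<ref>.*?</ref>', re.DOTALL)
cleaned = pattern.sub('', text)
print(cleaned)  # first middle last

# The inline (?s) flag is equivalent to DOTALL, but the greedy '.*'
# runs on to the last closing tag and removes the text in between.
greedy = re.sub(r'(?s)<ref>.*</ref>', '', text)
print(greedy)  # first last
```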

erkmene · Answer 2 · 2010-11-10T22:57:08.727

You might want to be careful not to remove a whole lot of text just because there is more than one closing </ref>. The regex below would be more accurate, in my opinion:

r'<ref>[^<]*</ref>'

This keeps the 'greedy' match from running on to a later </ref>, because [^<]* cannot cross a '<'. Unlike '.', it also matches newlines.
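
A minimal sketch of how this pattern behaves, using made-up sample strings. The last case is an assumption about input the question does not show: a ref that itself contains another tag is left alone, since [^<]* stops at the inner '<'.

```python
import re

pattern = r'<ref>[^<]*</ref>'

# Two separate refs, one spanning a newline: both are removed.
two_refs = 'a<ref>SDD</ref>b<ref>X\nX</ref>c'
two_refs_cleaned = re.sub(pattern, '', two_refs)
print(two_refs_cleaned)  # abc

# A ref containing another tag does not match, so nothing is removed.
nested = 'a<ref>see <i>OUP</i></ref>b'
nested_cleaned = re.sub(pattern, '', nested)
print(nested_cleaned)  # a<ref>see <i>OUP</i></ref>b
```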

BTW: There is a great tool called The Regex Coach to analyze and test your regexes. You can find it at: http://www.weitz.de/regex-coach/


score 1 · Answer 3 · answered Nov 10 '10 at 22:40

You could make a fancy regex to do just what you intend, but you need to use DOTALL and non-greedy search, and you need to understand how regexes work in general, which you don't.

Your best option is to use string methods rather than regexes, which is more pythonic anyway:

while '<ref>' in string:
    begin, end = string.split('<ref>', 1)
    trash, end = end.split('</ref>', 1)  # raises ValueError if a tag is never closed
    string = begin + end
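
The same idea can also be sketched as a function built on str.partition, which reports whether the separator was found instead of raising. The name strip_refs and its parameters are hypothetical, and leaving an unclosed tag untouched is an assumption about the desired behaviour:

```python
def strip_refs(text, open_tag='<ref>', close_tag='</ref>'):
    """Remove every open_tag...close_tag span using only string methods."""
    parts = []
    while True:
        before, found_open, rest = text.partition(open_tag)
        if not found_open:
            # No more opening tags: keep what is left and stop.
            parts.append(before)
            break
        _, found_close, after = rest.partition(close_tag)
        if not found_close:
            # Unclosed tag: keep the remainder untouched instead of raising.
            parts.append(before + found_open + rest)
            break
        parts.append(before)
        text = after
    return ''.join(parts)


print(strip_refs('a<ref>x</ref>b<ref>y\nz</ref>c'))  # abc
print(strip_refs('a<ref>oops'))                      # a<ref>oops
```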

If you want to be very generic, allowing strange capitalization of the tags, or whitespace and attributes inside the tags, you shouldn't do this either, but invest in learning an HTML/XML parsing library. lxml currently seems to be widely recommended and well-supported.

score 0 · Answer 4 · edited May 23 '17 at 12:29


If you try to do this with regular expressions you're in for a world of trouble. You're effectively trying to parse something but your parser isn't up to the task.

Matching greedily across strings probably eats up too much, as in this example:

<ref>SDD</ref>...<ref>XX</ref>

You'd end up clearing out the entire middle.
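
To see what the greedy pattern actually captures on that example, a short sketch with re.findall lists the matches. The non-greedy pattern is included only for contrast:

```python
import re

s = '<ref>SDD</ref>...<ref>XX</ref>'

# Greedy: one match running from the first <ref> to the last </ref>.
greedy_matches = re.findall(r'<ref>.*</ref>', s)
print(greedy_matches)  # ['<ref>SDD</ref>...<ref>XX</ref>']

# Non-greedy: each tag pair is matched on its own.
lazy_matches = re.findall(r'<ref>.*?</ref>', s)
print(lazy_matches)  # ['<ref>SDD</ref>', '<ref>XX</ref>']
```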

You really want a parser, something like Beautiful Soup.

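from bs4 import BeautifulSoup  # bs4 is the current package; the original used the BeautifulSoup 3 import
s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s, "html.parser")
for tag in soup.find_all("ref"):
    tag.replace_with("!")  # replace each <ref> element, contents included
print(soup)  # <a>sfsdf</a> ! || !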

edited May 23 '17 at 12:29

Community


answered Nov 10 '10 at 22:49

Paul Rubel


I know it would be more practical to stay away from cleaning HTML with regex, but still...for the sake of the exercise I have to use it. – Gusto Nov 10 '10 at 22:59
While this is almost always the right way to go, particularly if you're scraping, in my experience it introduces unneeded complexity for small find & replace scripts. If tested carefully, the regex method I describe above would solve most of the problems quickly. – erkmene Nov 10 '10 at 23:00
