How can I use a regex to search for a word in an HTML string while ignoring occurrences of that word inside HTML tags? For example, in `<a href="foo">foo</a>`, the first `foo` (in the `href` attribute) should be ignored; the second `foo` is the text to search for.
Viewed 287 times
1

Tmx
- Are you looking for a bullet-proof solution? Or one that works with just that specific string? – 4castle Aug 12 '16 at 05:16
- Instead of regex, you should use an HTML parser. Try `beautifulsoup`. – juanpa.arrivillaga Aug 12 '16 at 05:19
- The `foo` could be a regular expression. – Tmx Aug 12 '16 at 05:24
- @juanpa.arrivillaga Thanks. I will give it a try. – Tmx Aug 12 '16 at 05:25
- An approach that works in many cases is `<[^>]+>|(foo)`. From there you can find the matches that have a first capture group. It breaks, though, when the attributes contain `>`. – 4castle Aug 12 '16 at 05:35
- [`re.sub(r'<[^<]+?>|(foo)', lambda m: "{}".format(m.group(1)) if m.group(1) else m.group(), s)`](https://ideone.com/6Z2Zr0) will work in any Python version. – Wiktor Stribiżew Aug 12 '16 at 06:59
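The alternation trick from the comments above can be sketched roughly as follows. The function names and the `hit` group name are made up for illustration, and the `<mark>` wrapper is an assumption borrowed from the first answer below. As 4castle notes, this still breaks when an attribute value contains `>`.

```python
import re


def compile_outside_tags(word_pattern):
    # First alternative swallows a whole tag; second captures the word.
    # A named group keeps working even if word_pattern has its own groups.
    return re.compile(r'<[^>]+>|(?P<hit>' + word_pattern + r')')


def find_outside_tags(html, word_pattern):
    """Return (start_index, matched_text) for matches outside tags."""
    pattern = compile_outside_tags(word_pattern)
    return [
        (m.start('hit'), m.group('hit'))
        for m in pattern.finditer(html)
        if m.group('hit') is not None  # None means the tag branch matched
    ]


def highlight_outside_tags(html, word_pattern):
    """Wrap matches outside tags in <mark>; tags are copied unchanged."""
    pattern = compile_outside_tags(word_pattern)

    def repl(m):
        if m.group('hit') is None:
            return m.group(0)
        return '<mark>' + m.group('hit') + '</mark>'

    return pattern.sub(repl, html)


s = '<a href="foo">foo</a>'
print(find_outside_tags(s, 'foo'))
print(highlight_outside_tags(s, 'foo'))
```

Because a tag is always tried first at each position, a `foo` inside `href="foo"` is consumed as part of the tag and never reaches the capturing branch.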
3 Answers
1
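An example using BeautifulSoup combined with regex instead:

```python
from bs4 import BeautifulSoup
import re

string = '''
<a class='fooo123'>foo on its own</a>
<a class='123foo'>only foo</a>
'''

soup = BeautifulSoup(string, "lxml")
# Match text nodes that start with "foo"; attribute values are never searched.
# `string=` is the current name of the older `text=` argument (bs4 >= 4.4.0).
foo_links = soup.find_all(string=re.compile("^foo"))
print(foo_links)
# ['foo on its own']
```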
To wrap the found links in, e.g., a `<mark>` tag, you can do the following:

```python
from bs4 import BeautifulSoup
import re

string = '''
<a class='fooo123'>foo on its own</a>
<a class='123foo'>only foo</a>
'''

soup = BeautifulSoup(string, "lxml")
# Find <a> tags whose text starts with "foo".
foo_links = soup.find_all('a', string=re.compile("^foo"))
for a in foo_links:
    # Wrap each matching link in a newly created <mark> element.
    mark = soup.new_tag('mark')
    a.wrap(mark)
print(soup.prettify())
```
As well as the mandatory Tony the Pony link...
- Thanks for the example. The search works. Do you have any idea how to highlight the matched item as well? Basically, replace the `foo` with `foo`. I also found that BeautifulSoup is much slower than `regex`: the search with `beautifulsoup` takes about 1.5 seconds, while `regex` takes only 5 ms (`re.compile("foo", string)`). Maybe that is because I do not have the correct search pattern? I also tried the regex `re.compile("(?:<.*>.*)(foo)")`, which takes even longer, at 3.830 seconds. – Tmx Aug 12 '16 at 05:58
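For the highlighting question in the comment above, one possible sketch swaps BeautifulSoup for the standard library's `html.parser` and rewrites only the text nodes. This is not the answerer's method: the `Highlighter` class and `highlight_text_nodes` helper are hypothetical names, and the `<mark>` wrapper is an assumption. Comments and doctype declarations are dropped in this simplified version, and text inside `<script>` or `<style>` is treated like any other text.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Re-emit HTML, wrapping pattern matches in text nodes with <mark>."""

    def __init__(self, pattern):
        # convert_charrefs=False so entities such as &amp; are passed through
        # to the entity handlers and can be re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the tag exactly as written, attributes included.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Only text nodes reach this method, so attributes are never touched.
        self.out.append(
            self.pattern.sub(lambda m: "<mark>{}</mark>".format(m.group()), data)
        )

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))


def highlight_text_nodes(html, pattern):
    parser = Highlighter(pattern)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(highlight_text_nodes('<a href="foo" title="a > b">foo bar</a>', "foo"))
```

Unlike a pure-regex approach, this copes with a `>` inside a quoted attribute value, at the cost of a full parse of the document.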
1
This program should find all the text content between tags.

```python
import re

html = '''<h3>
<a href="//stackexchange.com/users/838793061/?accounts">yourcommunities</a>
</h3>
<a href="#" id="edit-pinned-sites">edit</a>
<a href="#" id="cancel-pinned-sites"style="display:none;">cancel</a>'''

# Capture every run of characters that sits between a '>' and the next '<'.
pattern = re.compile(r'>([^<>]+)<')
# Names changed from `str` and `all` to avoid shadowing built-ins.
matches = pattern.findall(html)
for text in matches:
    print(text)
```
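The pattern above also returns whitespace-only chunks (such as the newline after `<h3>`), and it does not yet look for a particular word. A minimal sketch of combining it with the asker's search might look like this; `text_segments_matching` is a hypothetical helper name, and the sample HTML is shortened for illustration.

```python
import re

# Same idea as the answer above: text that sits between '>' and '<'.
TEXT_BETWEEN_TAGS = re.compile(r'>([^<>]+)<')


def text_segments_matching(html, word_pattern):
    """Return stripped text segments between tags that contain word_pattern."""
    word = re.compile(word_pattern)
    return [
        segment.strip()
        for segment in TEXT_BETWEEN_TAGS.findall(html)
        if word.search(segment)  # whitespace-only segments never match a word
    ]


sample = (
    '<h3>\n'
    '<a href="//example.com/foo">foo communities</a>\n'
    '</h3>\n'
    '<a href="#" id="edit">edit</a>'
)
print(text_segments_matching(sample, 'foo'))
```

Text before the first tag or after the last one is not seen by this pattern, since it requires a `>` on the left and a `<` on the right.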

Arijit Ghosh
0
What if the content contains spaces?

I propose the following regex, which also strips the surrounding whitespace from the match:

```python
import re

# With spaces:
line = '<a href="foo"> foo </a>'
print(re.findall(r'>\s*(\w*)\s*<', line))
# ['foo']

# No spaces:
line = '<a href="foo">foo</a>'
print(re.findall(r'>\s*(\w*)\s*<', line))
# ['foo']
```

Arturo Ruiz Mañas