1

How can I use a regex to search for a word in an HTML string while ignoring occurrences of that word inside HTML tags? For example, in <a href="foo">foo</a>, the first foo (inside the tag) should be ignored, and the second foo is the one to match.

Tmx

3 Answers

1

An example using BeautifulSoup combined with regex instead:

from bs4 import BeautifulSoup
import re

string = '''
<a class='fooo123'>foo on its own</a>
<a class='123foo'>only foo</a>
'''

soup = BeautifulSoup(string, "lxml")
foo_links = soup.find_all(text=re.compile("^foo"))
print(foo_links)
# ['foo on its own']
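The key idea in the snippet above is that only text nodes are searched, so attribute values such as class='123foo' can never match. As a rough, dependency-free sketch of the same idea, the standard library's html.parser can collect the text nodes. The names TextCollector and text_nodes_matching are made up for this illustration, and the pattern is \bfoo\b (the word anywhere in the text) rather than the ^foo used above:

```python
from html.parser import HTMLParser
import re

class TextCollector(HTMLParser):
    """Collect text nodes only; tags and attributes never reach handle_data."""
    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        self.texts.append(data)

def text_nodes_matching(html, pattern):
    """Return the text nodes of `html` in which `pattern` is found."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered at the end of input
    return [t for t in collector.texts if re.search(pattern, t)]

html = '''
<a class='fooo123'>foo on its own</a>
<a class='123foo'>only foo</a>
'''
print(text_nodes_matching(html, r"\bfoo\b"))
# ['foo on its own', 'only foo']
```

Note that html.parser also reports the contents of script and style elements as data, so those may need to be filtered out separately.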

To wrap the found links with e.g. mark, you can do the following:

from bs4 import BeautifulSoup
import re

string = '''
<a class='fooo123'>foo on its own</a>
<a class='123foo'>only foo</a>
'''

soup = BeautifulSoup(string, "lxml")
foo_links = soup.find_all('a', text=re.compile("^foo"))
for a in foo_links:
    mark = soup.new_tag('mark')
    a.wrap(mark)

print(soup.prettify())

As well as the mandatory Tony the Pony link...
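If only the matched word should be highlighted (rather than the whole link), and a parser is too slow for the use case, one regex-only sketch is to match either a complete tag or the word, copy tags through untouched, and rewrite only the word. This is an illustration under a stated assumption, not a robust HTML solution: it assumes no literal ">" appears inside attribute values, and it does not handle comments or script blocks. The names TAG_OR_WORD and highlight are hypothetical:

```python
import re

# Match either a complete tag (group 1) or the whole word "foo".
# Assumption: attribute values never contain a literal ">".
TAG_OR_WORD = re.compile(r"(<[^>]*>)|\bfoo\b")

def highlight(html):
    """Wrap each "foo" found outside of tags in <mark>...</mark>."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # a tag: copy it through unchanged
        return "<mark>" + match.group(0) + "</mark>"
    return TAG_OR_WORD.sub(repl, html)

print(highlight('<a href="foo">foo</a>'))
# <a href="foo"><mark>foo</mark></a>
```

Because the tag alternative comes first, any foo inside a tag is consumed as part of that tag and never reaches the word alternative.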

Jan
  • Thanks for the example. The search works. Do you have any idea to highlight the matched item as well? Basically, replace the `foo` with `foo`. I also found that BeautifulSoup is much slower than `regex`, the search with `beautifulsoup` takes about 1.5 seconds while `regex` takes only 5ms (`re.compile("foo", string)`). Maybe because I do not have the correct search pattern? I also tried regex `re.compile("(?:<.*>.*)(foo)")`, it is even longer with 3.830 seconds. – Tmx Aug 12 '16 at 05:58
  • @Tmx: Updated the answer. – Jan Aug 12 '16 at 08:07
1

This program should be able to find all the contents between tags.

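import re

# Named html rather than str, to avoid shadowing the built-in.
html = '''<h3>
            <a href="//stackexchange.com/users/838793061/?accounts">yourcommunities</a>
    </h3>

        <a href="#" id="edit-pinned-sites">edit</a>
        <a href="#" id="cancel-pinned-sites"style="display:none;">cancel</a>'''

# Capture every run of characters between a '>' and the next '<'.
# Note: whitespace-only runs between tags are captured as well.
pattern = re.compile(r'>([^<>]+)<')
matches = pattern.findall(html)

for match in matches:
    print(match)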
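Since the pattern above also returns the whitespace between tags, a possible follow-up step is to strip and filter the runs and then look for the word inside them. The sketch below is one way to do that; text_runs and runs_containing are hypothetical helper names, and the sample HTML is made up for the example:

```python
import re

def text_runs(html):
    """Return the non-blank text runs found between '>' and '<', stripped."""
    runs = re.findall(r'>([^<>]+)<', html)
    return [run.strip() for run in runs if run.strip()]

def runs_containing(html, word):
    """Return the text runs that contain `word` as a whole word."""
    word_re = re.compile(r'\b' + re.escape(word) + r'\b')
    return [run for run in text_runs(html) if word_re.search(run)]

sample = '<h3>\n  <a href="#foo">foo bar</a>\n</h3>\n<a href="#" id="foo">edit</a>'
print(text_runs(sample))               # ['foo bar', 'edit']
print(runs_containing(sample, 'foo'))  # ['foo bar']
```

Text before the first tag or after the last tag is not captured, because the pattern requires a '>' on the left and a '<' on the right.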
Arijit Ghosh
0

What if the content contains spaces?

I propose the following regex, which also strips the surrounding spaces from the match:

import re

# With spaces:
line = '<a href="foo">     foo       </a>'
re.findall(r'>\s*(\w*)\s*<', line)
# ['foo']

# No spaces:
line = '<a href="foo">foo</a>'
re.findall(r'>\s*(\w*)\s*<', line)
# ['foo']
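One limitation worth noting: \w* cannot cross a space, so the pattern above returns nothing when the content has more than one word. A possible variation, sketched here with the hypothetical helper names single_word and any_text, uses a lazy [^<>]*? so that inner spaces are kept while the surrounding padding is still stripped:

```python
import re

def single_word(line):
    """The pattern from the answer: captures one \\w* token only."""
    return re.findall(r'>\s*(\w*)\s*<', line)

def any_text(line):
    """Variation: lazily capture any non-tag text, minus outer whitespace."""
    return re.findall(r'>\s*([^<>]*?)\s*<', line)

line = '<a href="foo">   foo bar   </a>'
print(single_word(line))  # []
print(any_text(line))     # ['foo bar']
```

Both patterns can also produce an empty string for an element with no content, so the results may need filtering.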