I'm trying to mark up an HTML file (literally wrapping strings in "mark" tags) using Python and BeautifulSoup. The problem is basically as follows.

Say I have my original html document:

test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"

I want to do a case-insensitive search for a string in this document (ignoring HTML) and wrap it in "mark" tags. So let's say I want to find "here is some silly text" in the html (ignoring the bold tags). I'd like to take the matching html and wrap it in "mark" tags.

For example, if I want to search for "here is some silly text" in test, the desired output is:

"<h1>oh hey</h1><div><mark>here is some <b>SILLY</b> text</mark></div>"

Any ideas? If it's more appropriate to use lxml or regular expressions, I'm open to those solutions as well.

follyroof
  • Close to a duplicate of another question. http://stackoverflow.com/questions/8936030/using-beautifulsoup-to-search-html-for-string – melwil May 28 '13 at 19:50
  • @melwil - it's close, but only covers the (case-sensitive) text retrieval portion of the question. how do you use a similar search string, find case-insensitive matches in the html, and wrap those matches in mark tags? – follyroof May 28 '13 at 19:53
  • Once you find the text, just insert it into the parent enclosed in a mark tag. – melwil May 28 '13 at 19:55
  • How do you want to handle `

    here is some

    silly text

    `? If you turn it into `

    here is some

    silly text

    `, then the `mark` tag spans between two sibling elements. Is that OK?
    – Kevin May 28 '13 at 20:01
  • @Kevin - yup! i'm fine with that – follyroof May 28 '13 at 20:03
  • @Kevin: Depending on how you write things, that may not match because of the extra spaces… but if it does, it's going to match the parent of those two tags (which, depending on which parser BS chose, may be, e.g., a `body` tag), so you will end up with the `<mark>` spanning both tags, which is a perfectly valid thing to do. There's no easy way to trick BS4 into generating the invalid tag soup you suggested, so it won't be a problem. – abarnert May 28 '13 at 20:48

1 Answer

>>> import bs4
>>> soup = bs4.BeautifulSoup(test)
>>> matches = soup.find_all(lambda x: x.text.lower() == 'here is some silly text')
>>> for match in matches:
...     match.wrap(soup.new_tag('mark'))
>>> soup
<html><body><h1>oh hey</h1><mark><div>here is some <b>SILLY</b> text</div></mark></body></html>

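I had to pass a function as the name to find_all that compares x.text.lower(), instead of just using the text argument with a function that compares x.lower(), because the text argument is matched against individual strings. The content you want is split between the div and its child b tag, so no single string equals the whole phrase, whereas x.text on the div includes the text of its children.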
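To make the difference concrete, here is a minimal sketch comparing the two approaches on the question's `test` string. It assumes a recent bs4 where the string filter is spelled `string=` (older releases call it `text=`) and it picks the built-in `html.parser` explicitly, which is my choice rather than something the answer specifies.

```python
from bs4 import BeautifulSoup

test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"
target = "here is some silly text"

# Assumption: html.parser is used so the example needs no extra parser package.
soup = BeautifulSoup(test, "html.parser")

# Filtering on strings: each text node is checked on its own, and the phrase
# is split into "here is some ", "SILLY" and " text", so nothing matches.
by_string = soup.find_all(string=lambda s: s.lower() == target)

# Filtering on tags: tag.text joins the text of all descendants, so the div
# as a whole compares equal to the target.
by_tag = soup.find_all(lambda tag: tag.text.lower() == target)

print(by_string)
print([tag.name for tag in by_tag])
```

With this parser the string filter should come back empty while the tag filter returns the `div`.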

The wrap function may not work this way in some versions (it seemed to work in 4.1.3 but not in 4.0.5). If it doesn't, you will have to instead enumerate(matches), and set matches[i] = match.wrap(soup.new_tag('mark')). (You can't use replace_with to replace a tag with a new tag that references itself.)

Also note that if your intended use case allows any non-ASCII string to ever match 'here is some silly text' (or if you want to broaden the code to handle non-ASCII search strings), the code above using lower() may be incorrect. You may want to call str.casefold() and/or locale.strxfrm(s) and/or use locale.strcoll(s, t) instead of using ==, but you'll have to understand what you want and how to get it to pick the right answer.
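As a rough illustration of the non-ASCII caveat, here is a sketch assuming Python 3.3+ (where `str.casefold()` exists). The helper name `caseless_equal` is hypothetical, and folding followed by NFKD normalization is only one reasonable choice, not necessarily the right one for your use case.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings ignoring case.

    Uses Unicode case folding, then NFKD normalization so that composed
    and decomposed forms of the same character compare equal.
    """
    def normalize(s: str) -> str:
        return unicodedata.normalize("NFKD", s.casefold())

    return normalize(a) == normalize(b)


# lower() leaves the German sharp s alone, so the two spellings differ.
print("STRASSE".lower() == "straße".lower())  # False

# casefold() maps the sharp s to "ss", so the two spellings match.
print(caseless_equal("STRASSE", "straße"))  # True
```

In the answer's code this would mean comparing `x.text` against the search phrase with such a helper instead of `x.text.lower() == ...`.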

Ryan M
abarnert
  • thanks a lot! it looks like you can get away with just `for match in matches: match.wrap(soup.new_tag('mark'))` – follyroof May 28 '13 at 20:43
  • @MuradSalahi: I was just testing that. It seems to work in 4.1.3, but not in 4.0.5. I'm not sure why, but I'll change it to the simple version, with an explanation for anyone who it doesn't work for. Thanks! – abarnert May 28 '13 at 20:44
  • [`.lower()` might not be enough for a case-insensitive search in a Unicode text](http://stackoverflow.com/a/9030564/4279). – jfs May 28 '13 at 20:56
  • @J.F.Sebastian: True. But without knowing the use case, it's impossible to describe what the _right_ thing is. I'll update the answer to cover this a bit. – abarnert May 28 '13 at 21:05
  • What if what I want to mark is "silly text"? It crosses 2 tags, so you can't find it by iterating over all tags. – Harry Lee Aug 16 '14 at 14:58
  • @HarryLee: The OP's question, `"Here is some silly text"` already crosses multiple tags, so you aren't adding anything. Of course you want to find a substring rather than an exact match, but that just means you use `'silly text' in x.text.lower()` instead of `'Here is some silly text' == x.text.lower()`. If that isn't obvious enough, create a new question. – abarnert Aug 18 '14 at 02:09
  • @abarnert `soup.find_all` iterates over tags, and your lambda accepts `tag` as a param. But in 'silly text', 'silly' belongs to tag `<b>`, 'text' is part of tag `div`, so you cannot get tag.text == 'silly text'. Am I right? – Harry Lee Aug 18 '14 at 03:21
  • @HarryLee: Did you read my reply? You're not going to get `tag.text.lower() == 'silly text'`, but you are going to get `'silly text' in tag.text.lower()`. The `text` of the `div` tag is `'here is some SILLY text'`—it includes the text of the child tag. That's exactly how the code works on the existing problem, and it will work the same way on yours. (Of course the _mark_ part will be more complicated for your problem, but that's not what you asked…). And again, if you really need help with this, create a new question, don't comment on an answer to someone else's year-old question. – abarnert Aug 18 '14 at 17:42
  • how can I replace `here is some SILLY text` with `here is some great text` without wrapping? – Timo Jan 26 '21 at 20:38
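
The substring variant discussed in the comments above (`'silly text' in tag.text.lower()`) also matches every ancestor of the tag that holds the phrase, since each ancestor's text includes it. Here is a hedged sketch of one way to keep only the innermost match. The `contains` helper, the explicit `html`/`body` wrapper around the question's markup, and the choice of `html.parser` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# The question's markup, wrapped in html/body so that ancestors exist.
html = (
    "<html><body><h1>oh hey</h1>"
    "<div>here is some <b>SILLY</b> text</div></body></html>"
)
needle = "silly text"

soup = BeautifulSoup(html, "html.parser")


def contains(tag):
    """Hypothetical filter: True if the tag's combined text holds the needle."""
    return needle in tag.text.lower()


# Every tag whose text contains the phrase, ancestors included.
matches = soup.find_all(contains)

# Keep only tags with no matching descendant, i.e. the innermost ones.
innermost = [tag for tag in matches if tag.find(contains) is None]

print([tag.name for tag in matches])
print([tag.name for tag in innermost])
```

This only locates the smallest enclosing tag. Wrapping exactly the words "silly text", which start inside the `b` tag and end outside it, takes more work than a single `wrap` call.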