1
<div>random contents without < or > , but has ( )  <div>

I just need to fix the closing div tag

so that it looks like <div>random contents</div>

I need to do it in Python with a regex.

The input is exactly like the first line; there will be no < or > in the random contents.

user469652
  • Err... why do you want to do that? And why do you have to use regexes? – thejh Nov 30 '10 at 21:19
  • How about `
    random contents
    ` ?
    – Bart Kiers Nov 30 '10 at 21:20
  • What? What does your typical input look like? And what about nested divs - they're quite common... If you *know* it has "html close tag" it should be easy enough. – Kobi Nov 30 '10 at 21:22
  • Without more information (can you have nested divs, is it at the end of the string, things like that) it's impossible to answer properly. We need data. – Chris Morgan Nov 30 '10 at 21:26
  • Updated: The input is exactly like
    fsdfsdf
    , no other tag involved. Thanks!
    – user469652 Nov 30 '10 at 21:32
  • Not exactly a duplicate, but I think [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) might be worth a read... – Daniel Pryden Nov 30 '10 at 21:32
  • @Bart: **My** HTML-targeted regexes *never* fail with such simplistic exploits as that one. :) – tchrist Nov 30 '10 at 21:33
  • @tchrist, I know yours don't, perhaps that's why you're not the one asking the question :). My comments (hopefully) serve two purposes: 1) to raise awareness with the one asking the question that such things *can* happen, and 2) hoping to get a better problem description. – Bart Kiers Nov 30 '10 at 21:42
  • @Bart: I was just teasing, you know. – tchrist Nov 30 '10 at 23:17
  • @tchrist, yeah, I knew that (hence my smiley). My (elaborate) reply was also meant for others to make clear I didn't mean to say that regex can't be used to perform certain operations on html (or some other language). – Bart Kiers Dec 01 '10 at 07:47

3 Answers

2

I wouldn't recommend a regex - use something like tidy (which is a Python wrapper around HTML Tidy).

Andrew Hare
  • Need to include a third party library, which is not very good in my situation. – user469652 Nov 30 '10 at 21:29
  • @user469652: The problem is that virtually anybody who needs help with regexes on HTML is not at a sufficient level of competence *qua* expertise *qua* wizardry with regexes as to have any hope of doing a good job at it. Those of us who are, don’t ask such questions. It’s one of those catch-22 problems. It is too hard to get right for 99.98% of the programmers out there ever to attempt save in very constrained circumstances. And even there they usually blow it and come here for help. It will be years if ever before you can manage this, and it will never be fun. – tchrist Nov 30 '10 at 21:37
  • Understood, thanks very much. I will consider including an efficient HTML processing library in my project. – user469652 Nov 30 '10 at 21:41
  • @tchrist, in response to your comment: HTML _cannot be parsed_ by regex! Seriously, it can't. HTML allows arbitrary levels of nesting. Regex doesn't. QED. Tony the pony, he comes! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Chinmay Kanchi Nov 30 '10 at 23:05
  • @Chinmay: It is trivial to write recursive regular expressions with full grammars. Therefore you are wrong. QED. If I ever see that idiotic posting supplied as an answer, I vote it down every time just as it deserves. You do not know what you are talking about. – tchrist Nov 30 '10 at 23:11
2

Avoid using regular expressions for dealing with HTML.

This is how it would be parsed in a DOM tree as it currently is:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup('<div>random contents<div>')
<div>random contents<div></div></div>

Or do you want to turn the second <div> into </div> (which a browser certainly would not do)?
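
If a third-party library is not an option, the same point can be seen with the standard library alone. The following is a minimal sketch, assuming Python 3 and its built-in `html.parser` module (not BeautifulSoup); `EventLogger` and `parse_events` are hypothetical names made up for this illustration. It only records what the tokenizer sees and does not build a tree.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parse event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.events


print(parse_events("<div>random contents<div>"))
```

For the input in the question this should report two start events for `div` and no end event at all, which is why a parser treats the second tag as a nested opening div and not as a mistyped closing tag.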

Chris Morgan
2

replace

(<div>[^<]*<)(div>)

with

$1/$2

Note: This is bad practice, don't do it unless it's absolutely necessary!
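
Since the question asks for Python, here is a minimal sketch of the replacement above using the standard `re` module. Note that Python writes group references in the replacement string as `\1` and `\2` (or `\g<1>`), not `$1` and `$2`. The function name `fix_closing_div` is made up for this example, and the code assumes the input has exactly the shape described in the question: no `<` or `>` inside the contents and no other tags.

```python
import re

# Group 1: the opening <div>, contents with no '<', and the '<' of the broken tag.
# Group 2: the "div>" that should have been "/div>".
PATTERN = re.compile(r'(<div>[^<]*<)(div>)')


def fix_closing_div(text):
    # Insert the missing '/' between the two captured groups.
    return PATTERN.sub(r'\1/\2', text)


print(fix_closing_div('<div>random contents without ( ) <div>'))
# prints: <div>random contents without ( ) </div>
```

A string that already ends in `</div>` is left unchanged, because `[^<]*<` must be followed directly by `div>` for the pattern to match.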

thejh