I have a simple requirement: split the text of an HTML document into parts. Suppose the HTML is:

<h1>hello</h1> ... <img moduleType="calendar" /> ...<h2>bye</h2> 

I want to split it into three parts:

<h1>hello</h1> 
<img moduleType="calendar" />
<h2>bye</h2> 

The aim is to split the text into two categories: plain HTML, and special tags of the form <img moduleType="calendar" />.

Dennis Williamson
Fred Yang
  • /me sigh... another "how to parse html with regex" question... – maček Apr 22 '10 at 19:11
  • What language are you coding in? There's likely a better solution than regular expressions; many languages have DOM parsers. Also, you might want to accept answers on some of your other questions to improve the quality/quantity of future answers. – Andy E Apr 22 '10 at 19:12
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – ire_and_curses Apr 22 '10 at 19:16
  • [Check the answers](http://stackoverflow.com/questions/tagged/html+regex). – BalusC Apr 23 '10 at 00:06
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – George Apr 23 '10 at 14:25

3 Answers

Don't do that with regular expressions; HTML can be broken in many beautiful ways. Use an HTML parser such as Beautiful Soup instead.
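
As a rough illustration of the parser-based route, here is a minimal sketch. It uses Python's standard-library html.parser rather than Beautiful Soup, only so that it runs without installing anything. The names ModuleSplitter and split_modules are made up for this example, and treating "any img tag that carries a moduleType attribute" as special is an assumption based on the question.

```python
from html.parser import HTMLParser


class ModuleSplitter(HTMLParser):
    """Split HTML into plain chunks and special <img moduleType=...> tags."""

    def __init__(self):
        # Keep entities such as &amp; untouched so plain chunks are not altered.
        super().__init__(convert_charrefs=False)
        self.parts = []   # list of (kind, text); kind is "html" or "module"
        self._plain = []  # pieces of the plain chunk being collected

    def flush(self):
        """Close the current plain chunk, dropping whitespace-only chunks."""
        text = "".join(self._plain).strip()
        if text:
            self.parts.append(("html", text))
        self._plain = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        # HTMLParser lower-cases attribute names: moduleType arrives as moduletype.
        if tag == "img" and "moduletype" in dict(attrs):
            self.flush()
            self.parts.append(("module", raw))
        else:
            self._plain.append(raw)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> land here; no end tag follows.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self._plain.append(f"</{tag}>")

    def handle_data(self, data):
        self._plain.append(data)

    def handle_entityref(self, name):
        self._plain.append(f"&{name};")

    def handle_charref(self, name):
        self._plain.append(f"&#{name};")


def split_modules(html):
    """Return the (kind, text) parts of html in document order."""
    parser = ModuleSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.parts
```

Calling split_modules on the question's sample should yield three parts in order: the h1 chunk, the calendar img tag, and the h2 chunk. End tags are rebuilt from the lower-cased tag name, and comments and doctype declarations are dropped, so this is a starting point rather than a faithful round-trip.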

florin

It depends on the language and context you are using. I do something similar in my CMS: my approach is to find the tags first and then their attributes.

Get tags

"<img (.*?)/>"

Then I search through the result for specific attributes

'title="(.*?)"'

If you want to find all attributes, you could replace the explicit title with a pattern such as [a-z]+ (or \S+ for any run of non-whitespace characters) and then loop through those results as well.
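
The two passes described above could be sketched like this in Python. The function name img_attributes and the attribute pattern are assumptions for illustration, not what my CMS actually runs. Known limitations: it only sees self-closing img tags written with a space after img, it only handles double-quoted attribute values, and a "/>" inside an attribute value will cut the tag short.

```python
import re

# Pass 1: capture everything between "<img " and the closing "/>".
TAG_RE = re.compile(r"<img (.*?)/>", re.IGNORECASE | re.DOTALL)
# Pass 2: inside that capture, pick out name="value" pairs.
ATTR_RE = re.compile(r'([\w-]+)="(.*?)"', re.DOTALL)


def img_attributes(html):
    """Return one dict of attribute name -> value per <img ... /> tag."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in TAG_RE.finditer(html)]
```

Each dict can then be checked for a moduleType key to decide whether the tag is one of the special ones.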

Owen Allen
  • Fighting against the downvotes you'll get -- Welcome to SO ;-) Include known problems/limitations in your answer. HTML parsing with regular expressions is almost always stomped on. –  Apr 22 '10 at 20:02
I am actually trying to do something similar to what the ASP.NET compiler does when it compiles markup into a server control tree; the ASP.NET compiler itself relies heavily on regular expressions. I have a temporary solution. It is not pretty, but it seems to work.

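using System;
using System.Text.RegularExpressions;

//string source = "<h1>hello</h1>";
string source = "<h1>hello<img moduleType=\"calendar\" /></h1> <p> <img moduleType=\"calendar\" /> </p> <h2>bye</h2> <img moduleType=\"calendar\" /> <p>sss</p>";
// Group 1: the (possibly empty) plain HTML before the next <img ... /> tag.
// Group 2: the <img ... /> tag itself.
// Singleline lets '.' match newlines, so the text may span several lines.
Regex exImg = new Regex("(.*?)(<img.*?/>)", RegexOptions.Singleline);

var match = exImg.Match(source);
int lastEnd = 0;
while (match.Success)
{
    // Group 1 is empty when the source starts with <img or two tags are adjacent.
    if (match.Groups[1].Length > 0)
        Console.WriteLine(match.Groups[1].Value);
    Console.WriteLine(match.Groups[2].Value);
    lastEnd = match.Index + match.Length;
    match = match.NextMatch();
}
// Whatever follows the last <img ... /> tag is plain HTML.
Console.WriteLine(source.Substring(lastEnd));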


Fred Yang