How could I extract everything between </h1> and <hr /> with a regex?

String string = "<h1>1st header</h1>" + "<h2>second header</h2>" +
"<p>some text</p>" + "<hr />";

Pattern p = Pattern.compile("</h1>(\\S+)<hr />", Pattern.MULTILINE);

The output is empty, but why?

membersound
  • Oh, dear! I hear hoof-beats! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jonathan M May 15 '12 at 21:50

2 Answers

The output is empty because the characters between </h1> and <hr /> include spaces. Your \S+ will fail as soon as it encounters a space.
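
A quick way to confirm this is to run the pattern from the question against its own string. This is only a sketch: the class and method names (`NonSpaceDemo`, `found`) and the second, space-free test string are made up for illustration.

```java
import java.util.regex.Pattern;

class NonSpaceDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(input).find();
    }

    public static void main(String[] args) {
        String string = "<h1>1st header</h1>" + "<h2>second header</h2>" +
                "<p>some text</p>" + "<hr />";
        // \S+ stops at the space in "second header", so <hr /> is never reached.
        System.out.println(found("</h1>(\\S+)<hr />", string)); // false
        // With no spaces between the two tags, the same pattern does match.
        System.out.println(found("</h1>(\\S+)<hr />", "<h1>a</h1><p>text</p><hr />")); // true
    }
}
```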

If you replace \\S+ with, say, .+, it will capture everything between the two tags in your highly specific example string. However, if you'd like to do this "right" and be able to match arbitrary HTML that doesn't perfectly fit your example, use an HTML parser, such as the HTML Agility Pack on .NET or jsoup on Java. A parser-based version will be easy, correct, and won't endanger your sanity and/or the universe.
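
For the sample string only, the regex fix might look like the sketch below. The class and method names (`BetweenTags`, `extract`) are hypothetical, and swapping MULTILINE for DOTALL is my own assumption rather than something the question asked for: MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern, while DOTALL lets . cross line breaks if the real input has any.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenTags {
    // Returns the text between </h1> and <hr />, or null when nothing matches.
    static String extract(String html) {
        // .+ accepts spaces, unlike \S+; DOTALL also lets '.' match newlines.
        Pattern p = Pattern.compile("</h1>(.+)<hr />", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String string = "<h1>1st header</h1>" + "<h2>second header</h2>" +
                "<p>some text</p>" + "<hr />";
        // prints <h2>second header</h2><p>some text</p>
        System.out.println(extract(string));
    }
}
```

Because .+ is greedy, an input with several <hr /> tags would be captured up to the last one, which is one more reason this is only safe on input you control.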

Justin Morgan - On strike
  • Though you don't *have* to jump to an HTML parser like a bull at a gate if using a regex genuinely serves your purposes and you're careful about the expression that you use. – Neil Coffey May 16 '12 at 00:01
  • @NeilCoffey, "...and you're careful...", *and* if you control the HTML you're parsing. If others control it, they will always be able to come up with a legit tag that the regex can't match. That's the main reason to not use regex. – Jonathan M May 16 '12 at 14:52
  • Well, maybe... if you're operating in an environment where somebody is deliberately trying to break your HTML parsing for some reason then that's obviously a different scenario to the case of parsing some HTML documents 'as they are'. I don't disagree that there are scenarios where you need to be wary of using regex to parse HTML. But there are scenarios where regex provides a succinct, working solution and there's really no need to be paranoid about Angering The God Of HTML Parsers if you opt for the simple solution in such cases. But yes, you need to be aware of the issues as you point out. – Neil Coffey May 16 '12 at 15:03
  • @NeilCoffey, it's really not about angering anyone, or even someone deliberately breaking something. It's just that HTML is widely varied, and if you're trying to scrape, you can't count on anything being consistent. Also, DOM-based solutions are pretty easy to implement these days with good libraries such as mentioned in this answer. It's too easy to do it right to mess with regex. – Jonathan M May 16 '12 at 16:42
  • @NeilCoffey - You're right that regex can be the quickest, easiest fix in certain (limited) tasks involving HTML/XML. I'm urging a parser because a) his sample input gives very few clues as to what he's going to be working with, and b) it sounds to me like he's looking for a robust solution. The `.+` suggestion will work with his sample string, but a parser is the safe way to go. – Justin Morgan - On strike May 16 '12 at 17:41

The regex \S+ will not match the space in "some text" (or the one in "second header"). Also, don't use regex to parse HTML if you value your sanity.

Chris Nava