How could I extract the following with a regex?
String string = "<h1>1st header</h1>" + "<h2>second header</h2>" +
"<p>some text</p>" + "<hr />";
Pattern p = Pattern.compile("</h1>(\\S+)<hr />", Pattern.MULTILINE);
The output is empty, but why?
The output is empty because the characters between </h1> and <hr /> include spaces. Your \S+ will fail as soon as it encounters a space.
If you replace \\S+ with, say, .+, it should catch everything in your highly specific example string. However, if you'd like to do this "right", and be able to match arbitrary HTML that doesn't perfectly fit your example, use an HTML parser such as jsoup (for Java) or the HTML Agility Pack (for .NET). A parser-based version will be easy, correct, and won't endanger your sanity and/or the universe.
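For the regex route, here is a minimal sketch of the difference, using the example string from the question. The class name ExtractBetweenTags and the helper extract are made up for illustration. I've used the reluctant .+? rather than .+ so the match stops at the first <hr />, and swapped Pattern.MULTILINE for Pattern.DOTALL, since MULTILINE only changes how ^ and $ behave while DOTALL lets the dot cross line breaks. Both of those are my own choices, not something the question requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractBetweenTags {
    // Pattern from the question: \S+ cannot cross a space, so it never reaches <hr />.
    static final Pattern ORIGINAL = Pattern.compile("</h1>(\\S+)<hr />");

    // Relaxed pattern (an assumption about what is wanted): .+? accepts any
    // characters, including spaces, and stops at the first <hr /> it can reach.
    static final Pattern RELAXED = Pattern.compile("</h1>(.+?)<hr />", Pattern.DOTALL);

    // Hypothetical helper: returns capture group 1, or null when nothing matches.
    static String extract(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String string = "<h1>1st header</h1>" + "<h2>second header</h2>"
                + "<p>some text</p>" + "<hr />";

        // No match: the space in "second header" stops \S+.
        System.out.println(extract(ORIGINAL, string));

        // Matches everything between </h1> and <hr />.
        System.out.println(extract(RELAXED, string));
    }
}
```

With the example string, the first call finds nothing and the second captures <h2>second header</h2><p>some text</p>. This only holds for input shaped like the example; for arbitrary HTML the parser advice above still applies.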
The regex \S+ will not match the spaces in "second header" and "some text". Also, don't use regex to parse HTML if you value your sanity.