Java regex help -- white space is killing me

Question

Alright, so I have the following output:

<p style="margin-top: 0">

</p>

that I want to be replaced with <br />. I have the following code:

string.replaceAll("<p([^>]*)></p>","<br/>");

What do I need to put between the > and the < so that only paragraph tags containing nothing but white space are replaced? That is, no letters or numbers between them.

Thanks

Obligatory reference to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – David Gelhar, Nov 19 '10 at 04:47
The summary of what David linked to: Don't use Regex on HTML/XML – Andrew Barber, Nov 19 '10 at 04:52
@Andrew: Don’t use a nuclear warhead when a slingshot will do. Heresy it may be, but it is perfectly acceptable and indeed advisable to use regex for HTML and XML **provided** that they are well-defined subsets under your control. For example, if you’ve generated them yourself so that you know you don’t have arbitrarily complex craziness. For small problems that meet a certain sort of predictability, regexes are absolutely 100% fine. Of course, somebody who doesn’t even know how to match Unicode whitespace (which is *hard* in Java, dang it!) should never attempt such a thing. – tchrist, Nov 19 '10 at 04:59
@tchrist: Maybe you meant to post your comment as a reply to David Gelhar, who posted the link. I was merely summarizing it. I didn't say I agree with it. (nor that I disagree with it, for that matter... my belief is closer to yours than the 'purists' on this one) – Andrew Barber, Nov 19 '10 at 05:02
@David: This is for a very simple project that will never see anything even close to production. I know any and all input that would go in, and just needed to clear this bit up to clean up the output without going through too much trouble. So, although I wouldn't suggest anyone do something like this in an application that would be used in the real world, it fits my application perfectly. I do thank you for the concern, however. – Samsquanch, Nov 19 '10 at 05:05
@user485418: Given that restricted domain, regexes seem the right answer. Just be careful with Java and Unicode: its charclass aliases like `\w` and `\s` only work on ASCII. It’s really lame. – tchrist, Nov 19 '10 at 05:07
See my comment to you below my answer. Using \p{Zs} will handle Unicode whitespace. – laz, Nov 19 '10 at 18:23

score 1 · Answer 1 · edited Mar 23 '12 at 18:55

Use this method:

string.replaceAll("<p([^>]*)>\\s*</p>", "<br/>");

edited Mar 23 '12 at 18:55

Somnath Muluk

answered Nov 19 '10 at 04:46

Reese Moore


That’s a pretty naïve regex. You have to be [a whole lot more careful](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags/4045840#4045840) to stand any chance of doing it correctly. – tchrist Nov 19 '10 at 04:53
@user485418: last I checked, Java’s `\s` was only good for ASCII data. Hope that’s what you’ve got. – tchrist Nov 19 '10 at 05:00

score 1 · Accepted Answer · answered Nov 19 '10 at 04:53

string.replaceAll("<p([^>]*)>\\s+?</p>","<br/>");

That should handle most scenarios. It is a non-greedy repetition where at least one whitespace character is required. Of course, using an HTML parser would yield more consistent results.
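
For illustration, here is a minimal, self-contained sketch of this kind of replacement. The class and method names (EmptyParagraphs, collapse) are made up for the example and are not from the original post, and it assumes the HTML is simple and under your control, as described in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named only for this example.
final class EmptyParagraphs {

    // An opening <p ...> tag, one or more whitespace characters, then </p>.
    // Without extra flags, \s covers only [ \t\n\x0B\f\r].
    private static final Pattern WHITESPACE_ONLY_P =
            Pattern.compile("<p[^>]*>\\s+</p>");

    // Strings are immutable, so the replaced text must be returned
    // (or assigned back) rather than expected to change in place.
    static String collapse(String html) {
        return WHITESPACE_ONLY_P.matcher(html).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        String html = "<p style=\"margin-top: 0\">\n\n</p><p>text</p><p></p>";
        // Only the first paragraph contains whitespace and nothing else.
        System.out.println(collapse(html));
    }
}
```

Because nothing but whitespace can sit between the > and the </p> in this pattern, the greedy \s+ used here should match the same text as the non-greedy \s+? above. If a completely empty <p></p> should be replaced as well, \s* would be the quantifier to use instead.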

answered Nov 19 '10 at 04:53

laz


I cannot possibly see what a minimal vs a maximal ASCII whitespace match is going to buy you. Also, my HTML pages are constantly full of `\x85` and `\xA0` characters. Java’s `\s` implementation is busted, you know. – tchrist Nov 19 '10 at 05:03
“Safer”? I think I’m going to go join the Cthulhu crowd. Sheesh! – tchrist Nov 19 '10 at 05:04
Perhaps the original poster controls the HTML and just needs to perform a transformation on the files? I agree that regular expressions usually are the solution for XML parsing. But for simple text munging, I can't tell you how many times I've fired up vim and used search and replace to transform legacy data quickly. – laz Nov 19 '10 at 14:15
Also, if the requirement is to match all Unicode whitespace characters, \\s+? can be replaced with \\p{Zs}+? wouldn't you agree? – laz Nov 19 '10 at 16:46
Just noticed my above comment says "regular expressions usually are the solution for XML parsing". That "are" should have been "aren't"! – laz May 12 '11 at 14:11
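
The Unicode whitespace point raised in these comments can be checked with a small sketch. It assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag is available (it postdates this thread); the class and method names are hypothetical. It compares the default \s, \p{Zs} on its own, and \s with that flag set.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of three ways to match "whitespace" in Java.
final class UnicodeWhitespaceParagraphs {

    // Default \s: ASCII whitespace only.
    static final Pattern ASCII_WS =
            Pattern.compile("<p[^>]*>\\s+</p>");

    // \p{Zs}: Unicode space separators (e.g. U+0020, U+00A0),
    // which excludes tabs, newlines and U+0085.
    static final Pattern ZS_ONLY =
            Pattern.compile("<p[^>]*>\\p{Zs}+</p>");

    // \s under UNICODE_CHARACTER_CLASS: the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("<p[^>]*>\\s+</p>", Pattern.UNICODE_CHARACTER_CLASS);

    static String collapse(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        String newlineOnly = "<p>\n</p>";            // as in the question
        String nbspAndNel = "<p>\u00A0\u0085</p>";   // no-break space + next line

        for (Pattern p : new Pattern[] {ASCII_WS, ZS_ONLY, UNICODE_WS}) {
            System.out.println(p.pattern() + " flags=" + p.flags());
            System.out.println("  newline only: " + collapse(p, newlineOnly).equals("<br/>"));
            System.out.println("  NBSP + NEL:   " + collapse(p, nbspAndNel).equals("<br/>"));
        }
    }
}
```

In this sketch \p{Zs} alone would not match the newline-only paragraph from the question, since line breaks and tabs are not space separators, so it is not a drop-in replacement for \s. Setting the flag (or writing the inline form (?U)) keeps \s but widens it to Unicode whitespace.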
