
Alright, so I have the following output:

<p style="margin-top: 0">

</p>

that I want to be replaced with <br />. I have the following code:

string.replaceAll("<p([^>]*)></p>","<br/>");

What would I need to put between the > and the < in order to replace only paragraph tags that contain nothing but white space? That is, no letters or digits between them.

Thanks

Samsquanch
  • Is this ASCII data or UTF-8 data? – tchrist Nov 19 '10 at 04:44
  • Obligatory reference to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – David Gelhar Nov 19 '10 at 04:47
  • The summary of what David linked to: Don't use Regex on HTML/XML – Andrew Barber Nov 19 '10 at 04:52
  • @Andrew: Don’t use a nuclear warhead when a slingshot will do. Heresy it may be, but it is perfectly acceptable and indeed advisable to use regex for HTML and XML **provided** that they are well-defined subsets under your control. For example, if you’ve generated them yourself so that you know you don’t have arbitrarily complex craziness. For small problems that meet a certain sort of predictability, regexes are absolutely 100% fine. Of course, somebody who doesn’t even know how to match Unicode whitespace (which is *hard* in Java, dang it!) should never attempt such a thing. – tchrist Nov 19 '10 at 04:59
  • @tchrist: Maybe you meant to post your comment as a reply to David Gelhar, which is who posted the link. I was merely summarizing it. I didn't say I agree with it. (nor that I disagree with it, for that matter... my belief is closer to yours than the 'purists' on this one) – Andrew Barber Nov 19 '10 at 05:02
  • @David: This is for a very simple project that will never see anything even close to production. I know any and all input that would go in, and just needed to clear this bit up to clean up the output without going through too much trouble. So, although I wouldn't suggest that anyone do something like this in an application that would be used in the real world, it fits my application perfectly. I do thank you for the concern, however. – Samsquanch Nov 19 '10 at 05:05
  • @user485418: Given that restricted domain, regexes seem the right answer. Just be careful with Java and Unicode: its charclass aliases like `\w` and `\s` only work on ASCII. It’s really lame. – tchrist Nov 19 '10 at 05:07
  • See my comment to you below my answer. using \p{Zs} will handle Unicode whitespace. – laz Nov 19 '10 at 18:23

2 Answers


Use this method:

string.replaceAll("<p([^>]*)>\\s*</p>", "<br/>");
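As a minimal sketch of how that call behaves on the markup from the question (the class name `EmptyParagraphs` and the helper `collapse` are made up for illustration, not part of the original answer): `\\s*` allows zero or more whitespace characters between the tags, and because Java strings are immutable the result of `replaceAll` has to be captured rather than discarded.

```java
class EmptyParagraphs {
    // Hypothetical helper: replaces <p ...> tags that contain only
    // (ASCII) whitespace, or nothing at all, with <br/>.
    static String collapse(String html) {
        // replaceAll returns a new String; the original is left unchanged.
        return html.replaceAll("<p([^>]*)>\\s*</p>", "<br/>");
    }

    public static void main(String[] args) {
        // The paragraph from the question, with blank lines inside it.
        String input = "<p style=\"margin-top: 0\">\n\n</p>";
        System.out.println(collapse(input));

        // A paragraph with real content is left alone.
        System.out.println(collapse("<p>text</p>"));
    }
}
```

Note that `<p([^>]*)>` would also accept other tag names starting with `p`, such as `<pre>`, which is harmless only if the input is as predictable as described in the comments above.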
Somnath Muluk
Reese Moore
  • That’s a pretty naïve regex. You have to be [a whole lot more careful](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags/4045840#4045840) to stand any chance of doing it correctly. – tchrist Nov 19 '10 at 04:53
  • @user485418: last I checked, Java’s `\s` was only good for ASCII data. Hope that’s what you’ve got. – tchrist Nov 19 '10 at 05:00
string.replaceAll("<p([^>]*)>\\s+?</p>","<br/>");

That should handle most scenarios. It is a non-greedy repetition where at least one whitespace character is required. Of course, using an HTML parser would yield more consistent results.
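To make the "at least one whitespace character" point concrete, here is a small sketch comparing `\\s+?` with `\\s*` (the class and method names are invented for this illustration). Since `</p>` must follow immediately, the lazy `?` does not change which strings match; the practical difference is that `+` rejects a completely empty `<p></p>`.

```java
class QuantifierComparison {
    // Zero or more whitespace characters between the tags.
    static String withStar(String html) {
        return html.replaceAll("<p([^>]*)>\\s*</p>", "<br/>");
    }

    // One or more whitespace characters, matched lazily.
    static String withLazyPlus(String html) {
        return html.replaceAll("<p([^>]*)>\\s+?</p>", "<br/>");
    }

    public static void main(String[] args) {
        String empty = "<p></p>";
        String blank = "<p> \n </p>";

        // Only the * version rewrites the completely empty paragraph.
        System.out.println(withStar(empty));
        System.out.println(withLazyPlus(empty));

        // Both versions rewrite the whitespace-only paragraph.
        System.out.println(withStar(blank));
        System.out.println(withLazyPlus(blank));
    }
}
```

Which behaviour is wanted depends on whether an empty `<p></p>` should also become a `<br/>`.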

laz
  • I cannot possibly see what a minimal vs a maximal ASCII whitespace match is going to buy you. Also, my HTML pages are constantly full of `\x85` and `\xA0` characters. Java’s `\s` implementation is busted, you know. – tchrist Nov 19 '10 at 05:03
  • “Safer”? I think I’m going to go join the Cthulhu crowd. Sheesh! – tchrist Nov 19 '10 at 05:04
  • Perhaps the original poster controls the HTML and just needs to perform a transformation on the files? I agree that regular expressions usually are the solution for XML parsing. But for simple text munging, I can't tell you how many times I've fired up vim and used search and replace to transform legacy data quickly. – laz Nov 19 '10 at 14:15
  • Also, if the requirement is to match all Unicode whitespace characters, \\s+? can be replaced with \\p{Zs}+? wouldn't you agree? – laz Nov 19 '10 at 16:46
  • Just noticed my above comment says "regular expressions usually are the solution for XML parsing". That "are" should have been "aren't"! – laz May 12 '11 at 14:11
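The whitespace question raised in these comments can be sketched as follows. This is an illustration only, with invented class and method names, and it assumes Java 7 or later for the `(?U)` flag (`UNICODE_CHARACTER_CLASS`), which postdates this thread. By default `\s` covers only ASCII whitespace; `\p{Zs}` covers Unicode space separators such as U+00A0 but not tabs or line breaks, so swapping it in for `\s` would stop matching the multi-line paragraph from the question.

```java
class UnicodeWhitespace {
    // Default \s: only space, \t, \n, \x0B, \f and \r.
    static String asciiOnly(String html) {
        return html.replaceAll("<p([^>]*)>\\s*</p>", "<br/>");
    }

    // \p{Zs}: Unicode space separators only (no tabs or newlines).
    static String zsOnly(String html) {
        return html.replaceAll("<p([^>]*)>\\p{Zs}*</p>", "<br/>");
    }

    // (?U) makes \s follow the Unicode White_Space property (Java 7+).
    static String unicodeAware(String html) {
        return html.replaceAll("(?U)<p([^>]*)>\\s*</p>", "<br/>");
    }

    public static void main(String[] args) {
        String nbsp = "<p>\u00A0</p>";   // no-break space
        String newline = "<p>\n</p>";    // plain line feed

        System.out.println(asciiOnly(nbsp));       // left unchanged
        System.out.println(zsOnly(nbsp));          // replaced
        System.out.println(zsOnly(newline));       // left unchanged
        System.out.println(unicodeAware(nbsp));    // replaced
        System.out.println(unicodeAware(newline)); // replaced
    }
}
```

On a runtime without `(?U)`, a combined character class such as `[\\s\\p{Zs}]` is one possible substitute, though it would still miss U+0085.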