
I normally only read on Stack Overflow and pick up a few programming tips, but today I have a question about regex.

I've parsed HTML code with JSoup and used a regex to remove all whitespace before a < and after a >. The problem is that the whitespace after a </u> tag (and </b> and </i>) or before a <u>, <b> or <i> tag is removed as well.

What can I add to my regex so that the whitespace after a closing tag (only italic, bold and underline) or before an opening tag is not removed (or so that exactly one whitespace character is left)?

My regex:

newHtml.select(UpgradeOldHtmlTags.BODY.toString()).html()
            .replace("\n", "")
            .replaceAll("\\s*<\\s*", "<")   // strips whitespace on both sides of every '<'
            .replaceAll("\\s*>\\s*", ">")   // strips whitespace on both sides of every '>'

Part of the outcome:

und &nbsp;<u>Schadstofffreisetzung</u>bei Reinigungs-

Outcome that I want:

und &nbsp; <u>Schadstofffreisetzung</u> bei Reinigungs-

Thank you very much for your help.

Edit:

After parsing with JSoup:

<p><br></p> <ol>  <li><font color="#007b00"><span style="font-size: 18px;"><b><u>Sicherheitsdatenblatt </u></b></span></font>auf Anfrage erhältlich. (EUH210)</li> </ol> <p> www.google.de </p> <p><u>Keimbesiedelung</u> in Kanälen, Filtern und ggf. Befeuchterwasser der Anlage:&nbsp; </p>

After my regex:

<p><br></p><ol><li><font color="#007b00"><span style="font-size: 18px;"><b><u>Sicherheitsdatenblatt</u></b></span></font>auf Anfrage erhältlich. (EUH210)</li></ol><p>www.google.de</p><p><u>Keimbesiedelung</u>in Kanälen, Filtern und ggf. Befeuchterwasser der Anlage:&nbsp;</p>

For example, the whitespace between the word "Sicherheitsdatenblatt" and the </u> tag should not be deleted.

Best regards from Bavaria
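The requirement described above can be sketched in Java with lookarounds. This is only one possible approach under stated assumptions, not necessarily the solution the poster ended up with: the class name `TagWhitespace` and the method `stripExceptInline` are hypothetical, only the bare tags `<u>`, `<b>`, `<i>` (without attributes) are treated as inline, and collapsing every whitespace run to a single space is assumed to be acceptable for the whole input.

```java
class TagWhitespace {
    // Hypothetical helper: removes whitespace next to tags, but keeps one
    // space where it touches <u>, <b>, <i> or their closing tags.
    static String stripExceptInline(String html) {
        return html
                // collapse every whitespace run (including newlines) to one space
                .replaceAll("\\s+", " ")
                // drop whitespace before a tag unless that tag is (/)u, (/)b or (/)i
                .replaceAll("\\s+(?=<(?!/?[ubi]>))", "")
                // drop whitespace after a tag unless that tag is (/)u, (/)b or (/)i
                .replaceAll("(?<=>)(?<!</?[ubi]>)\\s+", "");
    }

    public static void main(String[] args) {
        String html = "<ol>  <li><b><u>Sicherheitsdatenblatt </u></b>auf Anfrage</li> </ol>"
                + " <p> www.google.de </p> <p><u>Keimbesiedelung</u> in Filtern </p>";
        System.out.println(stripExceptInline(html));
    }
}
```

Whitespace inside attribute values or plain text is only collapsed, never removed, because the two removal patterns match only directly next to a `<` or `>`.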

Sebastian


2 Answers


I know you wanted this in Java, but I was only able to do it in JavaScript. See if the regex helps.

Here is the match on regex101: https://regex101.com/r/5rt9he/1

And here is the replace call in JavaScript:

 let str = "und &nbsp;<u>Schadstofffreisetzung</u>bei Reinigungs-";
 // Wrap the first <u>...</u> element in single spaces
 let result = str.replace(/(<u>)(.*?)(<\/u>)/, " $1$2$3 ");
 console.log(result);
 // -> und &nbsp; <u>Schadstofffreisetzung</u> bei Reinigungs-
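Since the question asks for Java, the same replacement can be ported more or less directly. The following is a minimal sketch, not a tested drop-in for the poster's code; the class name `UnderlinePadding` and the method `padFirstUnderline` are hypothetical. `String.replaceFirst` mirrors the JavaScript call without the `g` flag, which replaces only the first match.

```java
class UnderlinePadding {
    // Hypothetical helper: surrounds the first <u>...</u> element with single spaces.
    static String padFirstUnderline(String str) {
        // Same three groups as the JavaScript regex; '/' needs no escaping in Java
        return str.replaceFirst("(<u>)(.*?)(</u>)", " $1$2$3 ");
    }

    public static void main(String[] args) {
        String str = "und &nbsp;<u>Schadstofffreisetzung</u>bei Reinigungs-";
        System.out.println(padFirstUnderline(str));
        // prints: und &nbsp; <u>Schadstofffreisetzung</u> bei Reinigungs-
    }
}
```

Using `replaceAll` instead of `replaceFirst` would pad every `<u>` element in the string. Note that the spaces are added unconditionally, even where the original HTML had none.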
JBone
  • Thanks for your answer, but I need it in Java. I only have access to the backend part :( But I will try this regex tomorrow, thanks for your help – Sebastian Sep 04 '17 at 20:45

I've figured it out myself. Thanks for your help.

To the commenters: Read more than the title next time! You will see that I haven't used regex to parse HTML. And don't post links that have nothing to do with the topic.

That way you won't scare off new posters who just need a little bit of help and who may later help other new posters as well ...

Sebastian