I normally only read on Stack Overflow to pick up a few programming tips, but today I have a question about regex.
I parsed HTML with JSoup and then used a regex to remove all whitespace before a < and after a >. The problem is that the whitespace after a </u> tag (and </b> and </i>) or before a <u>, <b> or <i> tag is removed as well.
What can I add to my regex so that the whitespace after a closing tag (only italic, bold and underline) or before an opening tag is not removed (or so that only a single whitespace character is left)?
My regex:
newHtml.select(UpgradeOldHtmlTags.BODY.toString()).html().replace("\n", "").replaceAll("\\s*<\\s*", "<")
.replaceAll("\\s*>\\s*", ">")
Part of the outcome:
und <u>Schadstofffreisetzung</u>bei Reinigungs-
Outcome that I want:
und <u>Schadstofffreisetzung</u> bei Reinigungs-
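To reproduce this without JSoup, here is a minimal sketch. The class name, the method name and the shortened input string (with ordinary spaces only) are made up for this example; the replace chain is the one from my code.

```java
public class WhitespaceRepro {

    // Same replace chain as in my code, applied to a plain string
    // instead of the JSoup output (hypothetical helper for this example).
    static String stripAroundTags(String html) {
        return html.replace("\n", "")
                .replaceAll("\\s*<\\s*", "<")   // drops whitespace on both sides of '<'
                .replaceAll("\\s*>\\s*", ">");  // drops whitespace on both sides of '>'
    }

    public static void main(String[] args) {
        // Shortened sample: the space after </u> is the one I want to keep.
        String input = "<p><u>Keimbesiedelung</u> in Filtern</p>";
        System.out.println(stripAroundTags(input));
    }
}
```

The second replaceAll is the one that removes the space after </u>, because it treats every > the same way regardless of which tag it closes.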
Thank you very much for your help.
Edit:
After parsing with JSoup:
<p><br></p> <ol> <li><font color="#007b00"><span style="font-size: 18px;"><b><u>Sicherheitsdatenblatt </u></b></span></font>auf Anfrage erhältlich. (EUH210)</li> </ol> <p> www.google.de </p> <p><u>Keimbesiedelung</u> in Kanälen, Filtern und ggf. Befeuchterwasser der Anlage: </p>
After my regex:
<p><br></p><ol><li><font color="#007b00"><span style="font-size: 18px;"><b><u>Sicherheitsdatenblatt</u></b></span></font>auf Anfrage erhältlich. (EUH210)</li></ol><p>www.google.de</p><p><u>Keimbesiedelung</u>in Kanälen, Filtern und ggf. Befeuchterwasser der Anlage: </p>
For example, the whitespace between the word "Sicherheitsdatenblatt" and the </u> tag should not be deleted.
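One direction I have been considering is negative lookahead and lookbehind. It is only a sketch and I am not sure it is the right approach. It assumes that the inline tags are written exactly as <b>, <i>, <u> and their closing forms, without attributes, and that at most one space should survive directly next to them. The class and method names are made up for this example.

```java
public class KeepInlineTagSpaces {

    // Sketch: strip whitespace around tags, but leave a single space
    // directly before or after <b>, <i>, <u>, </b>, </i>, </u>.
    // Assumption: these tags carry no attributes.
    static String stripAroundTags(String html) {
        return html.replace("\n", "")
                // collapse every whitespace run to one space first
                .replaceAll("\\s+", " ")
                // remove whitespace before '<' unless the tag is b, i or u
                .replaceAll("\\s+<(?!/?[biu]>)", "<")
                // remove whitespace after '>' unless the tag is b, i or u
                .replaceAll("(?<!<[biu])(?<!</[biu])>\\s+", ">");
    }

    public static void main(String[] args) {
        // Shortened piece of the JSoup output, ordinary spaces only.
        String input = "<li><b><u>Sicherheitsdatenblatt </u></b>auf Anfrage</li> </ol>"
                + " <p> www.google.de </p> <p><u>Keimbesiedelung</u> in Filtern</p>";
        System.out.println(stripAroundTags(input));
    }
}
```

Note that the first replaceAll also collapses whitespace inside normal text and attribute values, which my original chain does not do, so this changes more than just the handling around the three inline tags.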
Best regards from Bavaria
Sebastian