Java Splitting string including patterns into an Array

Question

I'm trying to pseudo-translate the text embedded within HTML in a string. I don't want to touch the actual HTML tags or their attributes, just the content.

So for example, if I have something like:

<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>

I want this to be eventually modified into

<td colspan='2'><a>Thìs ís à Tèxt îñ <b>Bòlð</b></a></td>

1) I can't use any third-party libraries, so I'm using standard regex to parse the HTML.

2) I tried both pattern.match() and pattern.split(), but both seem to have limitations. pattern.split() splits the string on a regex pattern, but I lose the matched pattern itself in the process. pattern.match() retains the pattern, but I can't guarantee the markup.

So ideally I would want something to take the string with HTML and break it into an array like

array[0]: HTML Tag
array[1]: Plain Text
array[2]: HTML Tag
array[3]: Plain Text
array[4]: HTML Tag
array[5]: Plain Text
array[6]: HTML Tag

Any ideas?

I'd look into an HTML parsing lib like [Jsoup](http://jsoup.org/) to do what you want — Ryan J, Nov 30 '15 at 21:48
@RyanJ in his question he states: `1) I can't use any third party libraries, so I'm using standard regex to parse html` — Mike Elofson, Nov 30 '15 at 21:49
Forcing you to use regex for parsing any html/xml style language is lunacy. The first rule of regex re:html is `don't parse html with regex`. http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — KevinDTimm, Nov 30 '15 at 21:52
Possible duplicate of [Question about parsing HTML using Regex and Java](http://stackoverflow.com/questions/2394457/question-about-parsing-html-using-regex-and-java) — KevinDTimm, Nov 30 '15 at 21:54

score 0 · Answer 1 · edited May 23 '17 at 12:22

As a regex, you could use this one:

(?<=>)[^>]+(?=<)

I'm assuming here that you have a replace function that can take the matched text and mangle it:

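String str = "<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>";
// Strings are immutable, so keep the returned value; "" is only a placeholder for the transformed text
str = str.replaceAll("(?<=>)[^>]+(?=<)", "");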
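
To transform each matched run instead of substituting one fixed string, one option is a `Matcher` loop with `appendReplacement`. This is only a sketch under assumptions: the class name, the `accent` helper and its vowel mapping are made up for illustration, since the question does not say how the pseudo-translation should work. The pattern uses `[^<>]+`, a slight tightening of the regex above, so that a match can never contain a `<`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PseudoTranslateSketch {
    // Text runs that sit between a closing '>' and the next '<'.
    private static final Pattern TEXT_RUN = Pattern.compile("(?<=>)[^<>]+(?=<)");

    // Hypothetical mapping, chosen only for illustration: put a grave accent
    // on the lowercase vowels (written as Unicode escapes to avoid source
    // encoding issues).
    static String accent(String text) {
        return text.replace('a', '\u00e0')
                   .replace('e', '\u00e8')
                   .replace('i', '\u00ec')
                   .replace('o', '\u00f2')
                   .replace('u', '\u00f9');
    }

    static String pseudoTranslate(String html) {
        Matcher m = TEXT_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement stops '$' and '\' in the text from being
            // interpreted as group references in the replacement string.
            m.appendReplacement(out, Matcher.quoteReplacement(accent(m.group())));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String str = "<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>";
        System.out.println(pseudoTranslate(str));
    }
}
```

For the sample string this leaves every tag and attribute as it was and changes only "This is a Text in " and "Bold". Text that is not enclosed between a `>` and a `<` (for example at the very start or end of the input) is not matched by this pattern.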

However, without knowing how you intend to "pseudotranslate" a string, we can't really help you further. For custom replacement methods, this answer may be useful.
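If the array of tags and text pieces described in the question is still wanted, it could be built with `java.util.regex` alone by matching tokens instead of splitting on them. The following is a rough sketch, not a real HTML parser: the class and method names are invented here, and it assumes no `>` inside attribute values and no comments or script blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlSplitSketch {
    // A token is either one whole tag ("<...>") or one run of text without '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    // Returns the tags and text runs in document order.
    static List<String> split(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // True when the token came from the tag branch of the pattern.
    static boolean isTag(String token) {
        return token.startsWith("<") && token.endsWith(">");
    }

    public static void main(String[] args) {
        String str = "<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>";
        for (String token : split(str)) {
            System.out.println((isTag(token) ? "TAG : " : "TEXT: ") + token);
        }
    }
}
```

Tags and text do not strictly alternate: adjacent tags such as `<td colspan='2'><a>` produce consecutive tag entries, so checking each element with `isTag` is safer than relying on even and odd indexes. Joining the tokens back together reproduces the sample string, so only the text entries need to be transformed before concatenating.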
