
I'm trying to pseudo-translate the text embedded within HTML in a string. I don't want to touch the actual HTML tags or their attributes, just the content.

So for example, if I have something like:

<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>

I want this to be eventually modified into

<td colspan='2'><a>Thìs ís à Tèxt îñ <b>Bòlð</b></a></td>

1) I can't use any third-party libraries, so I'm using standard regex to parse the HTML.

2) I tried both pattern.match() and pattern.split(), but both seem to have limitations. pattern.split() splits the string on a regex pattern, but the text that matched the pattern (the tags) is lost in the process. pattern.match() retains the matched pattern, but I can't guarantee the structure of the markup.

So ideally I would want something to take the string with HTML and break it into an array like

array[0]: HTML Tag
array[1]: Plain Text
array[2]: HTML Tag
array[3]: Plain Text
array[4]: HTML Tag
array[5]: Plain Text
array[6]: HTML Tag

Any ideas?

Silican
  • I'd look into an HTML parsing lib like [Jsoup](http://jsoup.org/) to do what you want – Ryan J Nov 30 '15 at 21:48
  • @RyanJ in his question he states: `1) I can't use any third party libraries, so I'm using standard regex to parse html` – Mike Elofson Nov 30 '15 at 21:49
  • Show us the code that you are using at the moment. – RealSkeptic Nov 30 '15 at 21:50
  • Forcing you to use regex for parsing any html/xml style language is lunacy. The first rule of regex re:html is `don't parse html with regex`. http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – KevinDTimm Nov 30 '15 at 21:52
  • Possible duplicate of [Question about parsing HTML using Regex and Java](http://stackoverflow.com/questions/2394457/question-about-parsing-html-using-regex-and-java) – KevinDTimm Nov 30 '15 at 21:54

1 Answer


You could use this regex, which matches the text between a closing `>` and the next opening `<`:

(?<=>)[^>]+(?=<)

I'm assuming here that you have a replace function that can take the matched text and mangle it:

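String str = "<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>";
// Strings are immutable, so keep the returned value; "" is only a placeholder replacement
String result = str.replaceAll("(?<=>)[^>]+(?=<)", "");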

However, without knowing how you intend to "pseudotranslate" a string, we can't really help you further. For custom replacement methods, this answer may be useful.
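
As a rough sketch of such a custom replacement using only the standard library, the loop below runs `Matcher.find()` over the text between tags and rewrites each match with `appendReplacement`. The `PseudoTranslator` class, the `accent` helper and the vowel mapping are hypothetical stand-ins for whatever pseudo-translation you actually need. The pattern uses `[^<>]+` instead of `[^>]+` so a match can never include a `<`. It assumes the text sits between two tags and that no attribute value contains a `>`, so treat it as a starting point rather than a general HTML parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PseudoTranslator {
    // Text that sits between a closing '>' and the next opening '<'
    private static final Pattern TEXT = Pattern.compile("(?<=>)[^<>]+(?=<)");

    // Hypothetical mapping: lowercase vowels to grave-accented variants
    private static final Map<Character, Character> ACCENTS = new HashMap<>();
    static {
        ACCENTS.put('a', '\u00e0'); // a with grave
        ACCENTS.put('e', '\u00e8'); // e with grave
        ACCENTS.put('i', '\u00ec'); // i with grave
        ACCENTS.put('o', '\u00f2'); // o with grave
        ACCENTS.put('u', '\u00f9'); // u with grave
    }

    // Replace every mapped character; leave everything else untouched
    static String accent(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            Character mapped = ACCENTS.get(c);
            sb.append(mapped != null ? mapped : c);
        }
        return sb.toString();
    }

    // Rewrite only the text between tags; tags and attributes are copied as-is
    static String pseudoTranslate(String html) {
        Matcher m = TEXT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the text from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(accent(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<td colspan='2'><a>This is a Text in <b>Bold</b></a></td>";
        System.out.println(pseudoTranslate(html));
    }
}
```

Because `appendReplacement` copies everything between matches unchanged, there is no need to build the alternating tag/text array first; only the matched text is passed through `accent`.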

Community
Bram Vanroy