0

I have a String with a fixed format and want to extract two parts of it. The String has the format

"<html><div style=\"text-align:center;\"><b>****</b><br><i>Aula: </i><b>****</b></div></html>"

where each **** marks a part of the String that I want to extract. How can I do this? I'm using Java, and the String contains HTML.

Note that both interesting parts of the String are delimited by <b> and </b>.

Jose Luis
Bernheart

2 Answers

5

If that is exactly the form of your HTML String, you can use the substring method with the positions of <b> and </b>. (If your HTML can change, you should use an HTML parser instead.)

String s = "<html><div style=\"text-align:center;\"><b>first</b><br><i>Aula: </i><b>second</b></div></html>";
int start = s.indexOf("<b>");
int end = s.indexOf("</b>");
String firstMatch = s.substring(start + "<b>".length(), end);

// Now look for the next <b>, starting from the position where we found the first </b>
start = s.indexOf("<b>", end);
// and look for the </b> that follows the <b> we just found
end = s.indexOf("</b>", start);
String secondMatch = s.substring(start + "<b>".length(), end);

System.out.println(firstMatch);
System.out.println(secondMatch);

output:

first
second
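
If the String may hold more than two <b> sections, the same indexOf/substring idea can be put in a loop. This is only a minimal sketch under that assumption; the class name BoldTextExtractor and the method extractBold are made up for illustration, and it still assumes plain, non-nested <b> tags without attributes.

```java
import java.util.ArrayList;
import java.util.List;

class BoldTextExtractor {

    // Hypothetical helper: collects the text between every <b> ... </b> pair,
    // scanning left to right with indexOf exactly as in the snippet above.
    public static List<String> extractBold(String html) {
        final String open = "<b>";
        final String close = "</b>";
        List<String> matches = new ArrayList<>();

        int start = html.indexOf(open);
        while (start >= 0) {
            int end = html.indexOf(close, start + open.length());
            if (end < 0) {
                break; // unmatched <b>: stop instead of throwing from substring
            }
            matches.add(html.substring(start + open.length(), end));
            // continue searching after the closing tag we just consumed
            start = html.indexOf(open, end + close.length());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "<html><div style=\"text-align:center;\"><b>first</b><br><i>Aula: </i><b>second</b></div></html>";
        System.out.println(extractBold(s)); // prints [first, second]
    }
}
```

Unlike the two-step version, this returns an empty list when no <b> is present rather than failing with a StringIndexOutOfBoundsException.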
Pshemo
  • Thanks, that will be good for the first interesting word. And how can I take the second? The second one, in fact, also begins and ends with <b> and </b>. – Bernheart Sep 13 '13 at 19:00
  • @Bernheart Sorry I didn't notice that there are two parts that need to be extracted. Will edit. – Pshemo Sep 13 '13 at 19:02
  • Thank you for the explanation. That is what I wanted! – Bernheart Sep 13 '13 at 19:09
  • @Bernheart, you can also use `lastIndexOf()` for the second `<b>`. It's a matter of taste, but something you should read up on. You never know when it might come in handy. – Ravi K Thapliyal Sep 13 '13 at 19:34
4

You have a few options. The most obvious, but probably not the best, is to use a regex. Look at String.replaceAll for that.
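
For what it's worth, here is a rough sketch of the regex route. Note that it swaps in java.util.regex.Pattern and Matcher with a capturing group instead of String.replaceAll, since that pulls the values out directly. The class and method names are made up for illustration, and it assumes simple, non-nested <b> tags with no attributes and no line breaks inside them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoldRegexExtractor {

    // Reluctant quantifier (.*?) so each match stops at the nearest </b>.
    private static final Pattern BOLD = Pattern.compile("<b>(.*?)</b>");

    // Hypothetical helper: returns the captured text of every <b>...</b> match.
    public static List<String> extractBold(String html) {
        List<String> matches = new ArrayList<>();
        Matcher m = BOLD.matcher(html);
        while (m.find()) {
            matches.add(m.group(1)); // group 1 is the text between the tags
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "<html><div style=\"text-align:center;\"><b>first</b><br><i>Aula: </i><b>second</b></div></html>";
        System.out.println(extractBold(s)); // prints [first, second]
    }
}
```

This only holds up while the markup stays this regular; anything less predictable is a job for a real parser.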

A better option is to use an HTML parser. An example of that is JSoup.

Daniel Kaplan
  • You shouldn't use a regex to parse HTML. http://stackoverflow.com/a/1732454/1864167 – Jeroen Vannevel Sep 13 '13 at 19:00
  • You shouldn't be suggesting `replaceAll()` when OP clearly wants to parse data out of the string. I wonder if people have stopped reading answers before voting it up. – Ravi K Thapliyal Sep 13 '13 at 19:03
  • @RaviThapliyal no need to be rude. You can use `replaceAll` to do that. – Daniel Kaplan Sep 13 '13 at 19:04
  • @tieTYT, please add an illustration. It would help me as well. – Ravi K Thapliyal Sep 13 '13 at 19:07
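  • @RaviThapliyal `System.out.println("<html><div style=\"text-align:center;\"><b>****</b><br><i>Aula: </i><b>****</b></div></html>".replaceAll("<html><div style=\"text-align:center;\"><b>", "").replaceAll("</b></div></html>", ""));` – Daniel Kaplan Sep 13 '13 at 19:12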
  • 1
    @tieTYT, first of all, there are two `<b>` values that need to be parsed. Your solution leaves `****</b><br><i>Aula: </i><b>****` as output, which is incorrect. Secondly, you would almost always parse something known out of something unknown. Your solution of passing a known header and footer is just plain hacky and impractical.
    – Ravi K Thapliyal Sep 13 '13 at 19:19
  • My mistake. Just add another `replaceAll("</b><br><i>Aula: </i><b>", "")` on the end. Yes, it's hacky. That's why I said "probably not the best", and why I said a **better** option is to use JSoup.
    – Daniel Kaplan Sep 13 '13 at 20:10