I need to parse a string like this one:

"<img src=\"some_link\" height=\"200\" width=\"auto\" /><br><br\>\"Lorem ipsum dolor si amet...\" Name<br>address<br>www.google.com<br>01 42 42 42 42"

I need everything after the img tag, but with each piece separate: the lorem ipsum part, the name, the web link, and the phone number.

I'm not really looking for a code example, but rather for methods and techniques to do this. At first I wanted to just delete the img part and replace the br tags with \n, but it would be better to have each piece of information separate so that I can work with them.

EDIT: I used Jsoup as mentioned below and it works fine! Thanks

Jonathan Aurry

3 Answers

I agree with Rishabh Gupta that regexps are the easiest way to go. Before elaborating on that, I want to point out that parsing HTML with regexps is error-prone; however, for simple tasks (where a small number of defects is acceptable) it takes less effort. An example:

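import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img src=\"some_link\" height=\"200\" width=\"auto\" />";
// Group 1 captures the src value, group 2 the height value.
Pattern p = Pattern.compile("<img src=\"([^\"]+)\" height=\"([^\"]+)\"");
Matcher m = p.matcher(s);
if (m.find()) {
    String link = m.group(1);   // "some_link"
    String height = m.group(2); // "200"
}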

In the above pattern I use capturing groups "()" and character sets "[]". For example, '([^\"]+)' means "one or more consecutive characters that are not a quote"; the first such group is the first capturing group, returned by m.group(1).

The above makes sense if the order of the attributes is fixed, i.e. you know in advance that the image tag will always have "src=" followed by "height=", etc. For random order, you could first find everything inside the image tag (regexp: "<img[^>]+>") and then use another regexp to extract the attribute pairs.
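
The two-step idea for attributes in random order could look roughly like the sketch below. The class name ImgAttributes, the method parse and both pattern constants are made up for illustration, and the sketch assumes double-quoted attribute values and looks only at the first img tag in the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class ImgAttributes {
    // Step 1: the whole <img ...> tag; group 1 is everything between "<img" and ">".
    private static final Pattern IMG_TAG = Pattern.compile("<img\\b([^>]*)>");
    // Step 2: name="value" pairs inside that tag (assumes double-quoted values).
    private static final Pattern ATTRIBUTE = Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    static Map<String, String> parse(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = IMG_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                attributes.put(attr.group(1), attr.group(2));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        // The attribute order differs from the fixed-order example above.
        String s = "<img height=\"200\" width=\"auto\" src=\"some_link\" />";
        Map<String, String> attributes = parse(s);
        System.out.println(attributes.get("src"));
        System.out.println(attributes.get("height"));
    }
}
```

Because the pairs are collected into a map, the caller looks values up by attribute name instead of relying on group positions. The same caveat applies: this is fine for simple, predictable input, but it is not a general HTML parser.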

  • Please don't suggest using regex with HTML. It's a terrible idea: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – kaqqao Oct 24 '16 at 11:05

You can split the given string on the <br> tag and store the result in a string array:

String[] strArr = givenString.split("<br>");

Then use the values from the string array as required.
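
Applied to the string from the question, the split approach might look like the sketch below. The class name BrSplitter and the method fields are hypothetical. The sketch also assumes that the img tag should simply be dropped and that the malformed <br\> in the sample counts as a line break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class BrSplitter {
    static List<String> fields(String html) {
        // Drop the first <img ...> tag; only the text after it is wanted.
        String withoutImg = html.replaceFirst("<img\\b[^>]*>", "");
        List<String> fields = new ArrayList<>();
        // Split on <br>, <br/>, <br /> and the malformed <br\> from the sample.
        for (String part : withoutImg.split("<br\\s*[/\\\\]?>")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) { // consecutive <br> tags yield empty pieces
                fields.add(trimmed);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String givenString = "<img src=\"some_link\" height=\"200\" width=\"auto\" /><br><br\\>"
                + "\"Lorem ipsum dolor si amet...\" Name<br>address<br>www.google.com<br>01 42 42 42 42";
        for (String field : fields(givenString)) {
            System.out.println(field);
        }
    }
}
```

Note that in the sample string the quoted lorem ipsum text and the name are not separated by a <br>, so they end up in the same field and would need a further split, for example on the closing quote.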

Mr Lister
Pallavi

Because this is not just any string, but HTML, you should use an HTML parser (never ever attempt parsing HTML with regex).

jsoup is the best choice in Java:

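    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.TextNode;

    String html = "<img src=\"some_link\" height=\"200\" width=\"auto\" /><br><br\\>\"Lorem ipsum dolor si amet...\" Name<br>address<br>www.google.com<br>01 42 42 42 42";
    Document doc = Jsoup.parse(html);

    // Visit every element and print the text nodes directly beneath it.
    for (Element e : doc.select("*")) {
        for (TextNode tn : e.textNodes()) {
            System.out.println(tn.text());
        }
    }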
Community
kaqqao