Extract texts from a string

Question

I have the following string, which is HTML:

<html>
    <head>
        <title>Repository</title>
    </head>
    <body>
        <h2>Subversion</h2>
        <ul>
            <li>
                <a href="../">..</a>
            </li>
            <li>
                <a href="branch_A/">branch_A</a>
            </li>
            <li>
                <a href="branch_B/">branch_B</a>
            </li>
        </ul>
    </body>
</html>

From this I want to get the labels of the li tags, which are branch_A and branch_B. The number of li elements can vary, and I want to get all of them. How can I parse this string and get those values?

NOTE I could have used jsoup library to achieve same, but considering our project restriction, I cannot use it.

I'm sure there are HTML parsers in Java. Don't use RegEx for that. See [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Cid, Jun 11 '20 at 09:14
Use an HTML parser like [jsoup](https://jsoup.org/) for this. — kshitij86, Jun 11 '20 at 09:15
Yes, that is available, but because of a restriction on using external libraries, I cannot make use of it. Let me add this to the question as well. — Alpha, Jun 11 '20 at 09:16
Using regex and strings will be pesky for this, but if you have to do it, [check this out](https://www.oreilly.com/library/view/java-cookbook-3rd/9781449338794/ch04.html) — kshitij86, Jun 11 '20 at 09:21
Just out of curiosity, is there any reason you should [reinvent the square wheel](https://exceptionnotfound.net/reinventing-the-square-wheel-the-daily-software-anti-pattern/)? It could be because it's for school or a coding challenge website (not bad reasons), or it could be because of non-technical managers who imagined great ideas about something they clearly lack knowledge of. — Cid, Jun 11 '20 at 09:25

score 0 · Answer 1 · answered Jun 11 '20 at 09:29

You can use an HTML parser for this. The code below uses jsoup (https://www.baeldung.com/java-with-jsoup), which is quick and easy.

   // url and cssQuery are placeholders: the page to fetch and the selector for the tag you want
   Document doc = Jsoup.connect(url).get();
   doc.select(cssQuery).forEach(System.out::println);

Other tools are discussed here: https://tomassetti.me/parsing-html/

Kehinde

From question : *"NOTE I could have used jsoup library to achieve same, but considering our project restriction, I cannot use it."* – Cid Jun 11 '20 at 11:14

Salem AlHarbi · Answer 2 · 2020-06-11T09:53:03.233

Using streams (note that String.lines() requires Java 11 or later, not Java 8):

    String html = "<html>\n" +
    "    <head>\n" +
    "        <title>Repository</title>\n" +
    "    </head>\n" +
    "    <body>\n" +
    "        <h2>Subversion</h2>\n" +
    "        <ul>\n" +
    "            <li>\n" +
    "                <a href=\"../\">..</a>\n" +
    "            </li>\n" +
    "            <li>\n" +
    "                <a href=\"branch_A/\">branch_A</a>\n" +
    "            </li>\n" +
    "            <li>\n" +
    "                <a href=\"branch_B/\">branch_B</a>\n" +
    "            </li>\n" +
    "        </ul>\n" +
    "    </body>\n" +
    "</html>";

    html.lines()
        .filter(line -> line.contains("<a href"))
        .forEach(System.out::println);

Output:

            <a href="../">..</a>
            <a href="branch_A/">branch_A</a>
            <a href="branch_B/">branch_B</a>

Keep in mind that you can run the stream in parallel if you have a huge HTML file.
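One caveat if you do go parallel: forEach on a parallel stream does not guarantee order, so it is safer to collect into a list (or use forEachOrdered). Below is a minimal sketch of that idea, assuming Java 11+ for String.lines() and strip(); the class name ParallelLinks and the method anchorLines are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class ParallelLinks {
    // Returns the lines that contain an anchor tag, in document order.
    // Collecting an ordered parallel stream into a list keeps encounter order.
    static List<String> anchorLines(String html) {
        return html.lines()
                .parallel()
                .filter(line -> line.contains("<a href"))
                .map(String::strip) // drop the indentation around each line
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String html = "<ul>\n"
                + "    <li>\n"
                + "        <a href=\"branch_A/\">branch_A</a>\n"
                + "    </li>\n"
                + "    <li>\n"
                + "        <a href=\"branch_B/\">branch_B</a>\n"
                + "    </li>\n"
                + "</ul>";
        anchorLines(html).forEach(System.out::println);
    }
}
```

For a page as small as the one in the question, the sequential stream is likely just as fast; parallel only starts to pay off on much larger inputs.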

You can also strip the HTML tags using map:

    html.lines()
        .filter(line -> line.contains("<a href"))
        .map(line -> line.replaceAll("<[^>]*>", ""))
        .forEach(System.out::println);

Output:

            ..
            branch_A
            branch_B
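The output still carries the leading whitespace and the parent-directory entry (".."). As a rough sketch, the same pipeline can trim each label, drop "..", and collect the rest into a list. This assumes Java 11+ and that each link sits on a single line, as in the question; the class name BranchLabels and the method extract are hypothetical names for illustration. It does not decode HTML entities.

```java
import java.util.List;
import java.util.stream.Collectors;

class BranchLabels {
    // Extracts the link text of every <a href=...> line, skipping the ".." entry.
    static List<String> extract(String html) {
        return html.lines()
                .filter(line -> line.contains("<a href"))
                .map(line -> line.replaceAll("<[^>]*>", "").strip()) // remove tags, then indentation
                .filter(label -> !label.equals(".."))                // drop the parent-directory link
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String html = "<ul>\n"
                + "    <li>\n"
                + "        <a href=\"../\">..</a>\n"
                + "    </li>\n"
                + "    <li>\n"
                + "        <a href=\"branch_A/\">branch_A</a>\n"
                + "    </li>\n"
                + "    <li>\n"
                + "        <a href=\"branch_B/\">branch_B</a>\n"
                + "    </li>\n"
                + "</ul>";
        System.out.println(extract(html)); // prints [branch_A, branch_B]
    }
}
```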

I would be surprised if I found a new line between an HTML tag and its attributes in the HTML. It could happen, but it is not common. This solution is meant to solve a specific problem; if you want a more generic solution, you can use Jsoup. — Salem AlHarbi, Jun 11 '20 at 11:55
You can use the filter function to drop the elements you don't want in the final result, based on the predicate you provide. — Salem AlHarbi, Jun 11 '20 at 11:56
