
I have the following string, which is HTML:

<html>
    <head>
        <title>Repository</title>
    </head>
    <body>
        <h2>Subversion</h2>
        <ul>
            <li>
                <a href="../">..</a>
            </li>
            <li>
                <a href="branch_A/">branch_A</a>
            </li>
            <li>
                <a href="branch_B/">branch_B</a>
            </li>
        </ul>
    </body>
</html>

From this I want to get the labels inside the li tags, which here are branch_A and branch_B. The number of li elements can vary, and I want to get all of them. Can you please help me parse this String and get those values?

NOTE: I could have used the jsoup library to achieve this, but because of a project restriction, I cannot use it.

Alpha
  • I'm sure there are HTML parsers in Java. Don't use regex for that. See [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Cid Jun 11 '20 at 09:14
  • Use an HTML parser like [jsoup](https://jsoup.org/) for this. – kshitij86 Jun 11 '20 at 09:15
  • Yes, that is available, but because of a restriction on using external libraries, I cannot make use of it. Let me add this to the question as well. – Alpha Jun 11 '20 at 09:16
  • Using regex and strings will be pesky for this, but if you have to do it, [check this out](https://www.oreilly.com/library/view/java-cookbook-3rd/9781449338794/ch04.html) – kshitij86 Jun 11 '20 at 09:21
  • Just out of curiosity, is there any reason you should [reinvent the square wheel](https://exceptionnotfound.net/reinventing-the-square-wheel-the-daily-software-anti-pattern/)? It could be because it's for school or a coding challenge website (not bad reasons), or it could be because some non-technical managers had great ideas about something they clearly lack knowledge of – Cid Jun 11 '20 at 09:25

2 Answers


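You can use an HTML parser for this. The code below uses jsoup (https://www.baeldung.com/java-with-jsoup), which is quick and easy. The URL and the selector are placeholders that you need to fill in.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    Document doc = Jsoup.connect("<your URL here>").get();       // fetch and parse the page
    doc.select("<tag you want>").forEach(System.out::println);  // print each matching element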

Other tools are discussed here: https://tomassetti.me/parsing-html/

Kehinde
  • From question : *"NOTE I could have used jsoup library to achieve same, but considering our project restriction, I cannot use it."* – Cid Jun 11 '20 at 11:14

Using Java streams (note that `String.lines()` requires Java 11 or later, not Java 8):

    String html = "<html>\n" +
    "    <head>\n" +
    "        <title>Repository</title>\n" +
    "    </head>\n" +
    "    <body>\n" +
    "        <h2>Subversion</h2>\n" +
    "        <ul>\n" +
    "            <li>\n" +
    "                <a href=\"../\">..</a>\n" +
    "            </li>\n" +
    "            <li>\n" +
    "                <a href=\"branch_A/\">branch_A</a>\n" +
    "            </li>\n" +
    "            <li>\n" +
    "                <a href=\"branch_B/\">branch_B</a>\n" +
    "            </li>\n" +
    "        </ul>\n" +
    "    </body>\n" +
    "</html>";

    html.lines().filter(line -> line.contains("<a href")).forEach(System.out::println);

Output:

            <a href="../">..</a>
            <a href="branch_A/">branch_A</a>
            <a href="branch_B/">branch_B</a>

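Keep in mind that you can run the stream in parallel if you have a huge HTML file. Note that `forEach` on a parallel stream does not guarantee the output order; use `forEachOrdered` if the order matters.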
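As a rough illustration of the parallel variant, the sketch below filters the anchor lines on a parallel stream and collects them into a list, which keeps the original line order. The class name `ParallelAnchorLines` and the helper `anchorLines` are made up for this example, and it assumes Java 11 or later (for `String.lines()` and `String.strip()`).

```java
import java.util.List;
import java.util.stream.Collectors;

class ParallelAnchorLines {

    // Hypothetical helper: returns the lines containing an anchor tag, with indentation removed.
    static List<String> anchorLines(String html) {
        return html.lines()                               // Java 11+
                .parallel()                               // process lines in parallel
                .filter(line -> line.contains("<a href")) // keep only anchor lines
                .map(String::strip)                       // drop leading/trailing whitespace
                .collect(Collectors.toList());            // collecting preserves encounter order
    }

    public static void main(String[] args) {
        String html = "<li>\n"
                + "    <a href=\"branch_A/\">branch_A</a>\n"
                + "</li>\n"
                + "<li>\n"
                + "    <a href=\"branch_B/\">branch_B</a>\n"
                + "</li>";
        anchorLines(html).forEach(System.out::println);
    }
}
```

For a page as small as the one in the question, a sequential stream is likely to be just as fast; parallelism only starts to pay off on large inputs.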

You can also strip the HTML tags using map:

    html.lines().filter(line -> line.contains("<a href")).map(line -> line.replaceAll("<[^>]*>","")).forEach(System.out::println);

Output:

            ..
            branch_A
            branch_B
  • What if the string contains ` – Cid Jun 11 '20 at 11:12
  • I would be surprised to find a newline between an HTML tag and its attributes in the HTML. It could happen, but it is not common. This solution is meant to solve this specific problem; if you want a more generic solution, you can use jsoup. – Salem AlHarbi Jun 11 '20 at 11:55
  • You can use the filter function to drop the elements you don't want in the final result, based on the predicate you provide. – Salem AlHarbi Jun 11 '20 at 11:56
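
To make the last comment concrete, here is a minimal sketch that combines the steps from this answer (filter, strip tags) with a trim and a second filter that drops the parent-directory link `..`. The names `BranchLabels` and `extractLabels` are invented for the example. It assumes Java 11 or later and that each `<a href=...>` element sits on a single line, as in the question's HTML; it is not a general HTML parser.

```java
import java.util.List;
import java.util.stream.Collectors;

class BranchLabels {

    // Hypothetical helper: returns the link labels, skipping the ".." parent link.
    static List<String> extractLabels(String html) {
        return html.lines()                                   // Java 11+
                .filter(line -> line.contains("<a href"))     // keep only anchor lines
                .map(line -> line.replaceAll("<[^>]*>", ""))  // strip the tags
                .map(String::strip)                           // drop the indentation
                .filter(label -> !label.equals(".."))         // drop the parent link
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String html = "<ul>\n"
                + "    <li>\n"
                + "        <a href=\"../\">..</a>\n"
                + "    </li>\n"
                + "    <li>\n"
                + "        <a href=\"branch_A/\">branch_A</a>\n"
                + "    </li>\n"
                + "    <li>\n"
                + "        <a href=\"branch_B/\">branch_B</a>\n"
                + "    </li>\n"
                + "</ul>";
        System.out.println(extractLabels(html)); // prints [branch_A, branch_B]
    }
}
```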