0

I'm trying to get links from html of a site but unable to do so using Jsoup.

This is the HTML:

<div class="anime_muti_link">
    <ul>
  <li><div class="doamin">Domain</div><div class="link">Link</div></li>
  <li class="anime">
    <a href="#" class="active" rel="1" data-video="example.com" ><div class="server m1">Server m1</div><span>Watch This Link</span></a>
  </li>
    
  <li class="anime">
    <a href="#" rel="1" data-video="example.com" ><div class="server m1">Server m2</div><span>Watch This Link</span></a>
  </li>
  
              <li class="xstreamcdn">
      <a href="#" rel="29" data-video="example.com">Xstreamcdn</div><span>Watch This Link</span></a>
    </li>
          <li class="mixdrop">
      <a href="#" rel="7" data-video="example.com"><div class="server mixdrop">Mixdrop</div><span>Watch This Link</span></a>
    </li>
          <li class="streamsb">
      <a href="#" rel="13" data-video="example.com">StreamSB</div><span>Watch This Link</span></a>
    </li>
          <li class="doodstream">
      <a href="#" rel="14" data-video="example.com">Doodstream</div><span>Watch This Link</span></a>
    </li>
  
</ul>
</div>

This is the android code that I wrote which doesn't seem to work:

try {
                Document doc = Jsoup.connect(URL).get();
                Elements content = doc.getElementsByClass("anime_muti_link");
                Elements links = content.select("a");

                String[] urls = new String[links.size()];
                for (int i = 0; i < links.size(); i++) {
                    urls[i] = links.get(i).attr("data-video");
                    if (!urls[i].startsWith("https://")) {
                        urls[i] = "https:" + urls[i];
                    }
                }
                arrayList.addAll(Arrays.asList(urls));
                Log.d("CALLING_URL", "Links: " + Arrays.toString(urls));

            } catch (IOException e) {
                e.getMessage();
            }

Can someone please help me with this? Thanks

Edit: Basically I'm trying to get those 6 links and add them to my list to use it within the app.

Edit 2:

So I found another HTML that can seems better:

<div class="heading-servers">
     <span><i class="fa fa-signal"></i> Servers</span>
     <ul class="servers">
      <li data-vs="https://example.com" class="server server-active" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Netu</li>
      <li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">VideoVard</li>
      <li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Doodstream</li>
      <li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Okstream</li>
     </ul>
    </div>
Meggan Sam
  • 319
  • 1
  • 4
  • 17

1 Answers1

2

As you can see, in this li definition you are including a nested div:

<li class="xstreamcdn">
      <a href="#" rel="29" data-video="example.com">Xstreamcdn</div><span>Watch This Link</span></a>
    </li>

This is causing that the variable content, the HTML fragment with class anime_muti_link, to look like:

<div class="anime_muti_link"> 
 <ul> 
  <li>
   <div class="doamin">
    Domain
   </div>
   <div class="link">
    Link
   </div></li> 
  <li class="anime"> <a href="#" class="active" rel="1" data-video="example.com">
    <div class="server m1">
     Server m1
    </div><span>Watch This Link</span></a> </li> 
  <li class="anime"> <a href="#" rel="1" data-video="example.com">
    <div class="server m1">
     Server m2
    </div><span>Watch This Link</span></a> </li> 
  <li class="xstreamcdn"> <a href="#" rel="29" data-video="example.com">Xstreamcdn</a></li>
 </ul>
</div>

A similar result will be obtained even if you tidy your HTML. I used this code from one of my previous answers:

Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.setIndentContent(true);
tidy.setPrintBodyOnly(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setSmartIndent(true);
tidy.setShowWarnings(false);
tidy.setQuiet(true);
tidy.setTidyMark(false);

org.w3c.dom.Document htmlDOM = tidy.parseDOM(new ByteArrayInputStream(html.getBytes()), null);

OutputStream out = new ByteArrayOutputStream();
tidy.pprint(htmlDOM, out);
String tidiedHtml = out.toString();
// System.out.println(tidiedHtml);

Document document = Jsoup.parse(tidiedHtml);
Elements content = document.getElementsByClass("anime_muti_link");
System.out.println(content);

And this is why you are finding only three anchors.

Please, try correcting your HTML or selecting the anchor tag as the document level instead:

Document document = Jsoup.parse(html);
// Elements content = document.getElementsByClass("anime_muti_link");
// System.out.println(content);
Elements links = document.select("a");

String[] urls = new String[links.size()];
for (int i = 0; i < links.size(); i++) {
  urls[i] = links.get(i).attr("data-video");
  if (!urls[i].startsWith("https://")) {
    urls[i] = "https://" + urls[i];
  }
}
System.out.println(Arrays.asList(urls));

If the result obtained contains undesired links, perhaps you can try narrowing the selector used, something like:

document.select(".anime_muti_link a")

If this doesn't work, another possible alternative could be selecting the anchor elements with a data-video attribute, a[data-video]:

Document document = Jsoup.parse(html);
Elements videoLinks = document.select("a[data-video]");

String[] urls = new String[videoLinks.size()];
for (int i = 0; i < videoLinks.size(); i++) {
  urls[i] = videoLinks.get(i).attr("data-video");
  if (!urls[i].startsWith("https://")) {
    urls[i] = "https://" + urls[i];
  }
}
System.out.println(Arrays.asList(urls));

With your new test case, you can obtain the desired information with a very similar code:

String html = "<div class=\"heading-servers\">\n" +
    "     <span><i class=\"fa fa-signal\"></i> Servers</span>\n" +
    "     <ul class=\"servers\">\n" +
    "      <li data-vs=\"https://example.com\" class=\"server server-active\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Netu</li>\n" +
    "      <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">VideoVard</li>\n" +
    "      <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Doodstream</li>\n" +
    "      <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Okstream</li>\n" +
    "     </ul>\n" +
    "    </div>";

Document document = Jsoup.parse(html);
Elements videoLinks = document.select("div.heading-servers ul.servers li.server");

String[] urls = new String[videoLinks.size()];
for (int i = 0; i < videoLinks.size(); i++) {
  urls[i] = videoLinks.get(i).attr("data-vs");
  if (!urls[i].startsWith("https://")) {
    urls[i] = "https://" + urls[i];
  }
}

System.out.println(Arrays.asList(urls));

The most important part is the definition of the selector that should be applied to the parsed document, div.heading-servers ul.servers li.server in our case.

I provided a selector with many fragments, but depending on the actual use HTML it could be simplified with ul.servers li.server or even li.server.

jccampanero
  • 50,989
  • 3
  • 20
  • 49
  • I can't change the HTML as that's not my website. I'll try your other solution, thanks! – Meggan Sam Nov 27 '21 at 13:01
  • 1
    You are welcome @MegganSam, I hope it helps. I updated the answer to provide feedback about how to possibly narrow the selector in case you obtain undesired links. I haven't tested this last update. I hope it helps. – jccampanero Nov 27 '21 at 14:13
  • @MegganSam Were you able to test the proposed solution? Did it work? – jccampanero Nov 30 '21 at 19:12
  • I did try, yes. Unfortunately, it doesn't work :( – Meggan Sam Nov 30 '21 at 20:14
  • 1
    I am sorry to hear that. I suppose it has to do with processing the whole HTML document. I updated the answer trying providing another alternative. Please, could you try? I hope it helps. – jccampanero Nov 30 '21 at 22:32
  • Thanks for trying to help but seems like the HTML itself isn't written well so I've edited my question with another one that has child classes too so can you maybe help me with that? I tried you solution on the new one but that doesn't work either sadly – Meggan Sam Dec 01 '21 at 19:28
  • 1
    You are welcome. Yes, I agree with you Meggan, probably there should be something wrong in the HTML. Regarding your edition, I updated my answer with a possible solution. Please, could you try? – jccampanero Dec 01 '21 at 22:23
  • It worked, omg, thanks a ton!!! – Meggan Sam Dec 01 '21 at 22:29
  • 1
    That is great Meggan!! I am very happy to hear that it worked properly!! – jccampanero Dec 01 '21 at 22:34