I am trying to extract the id of a specific captcha image using the Jsoup API. The HTML image tag looks like this:

<img id="wlspispHIPBimg03256465465dsd5456" style="display: inline; width: 200px; height: 100px;" aria-hidden="true" src="https://users/hip/data/rnd=435cb60d0a6b63ef4">

This is my code to obtain the attribute id="wlspispHIPBimg03256465465dsd5456":

doc = Jsoup.connect("http://go.microsoft.com/fwlink/?LinkID=614866&clcid")
                .timeout(0).get();

Elements images = doc.select("img[src~=(?i)]");
for (Element image : images) {
    System.out.println(image.attr("id"));
}

The problem is that I can't get the id of the captcha image.

Moony
  • This code works fine for me. Please [edit] your question and post [short but complete example which will let us reproduce your problem](http://sscce.org/). – Pshemo Nov 07 '15 at 16:16
  • Also, `doc.select("img[src~=(?i)");` is the same as `doc.select("img[src]");`, since `(?i)` is just a flag that makes the regex case-insensitive, but there is no regex there to begin with, and your selector wasn't even closed with `]`. – Pshemo Nov 07 '15 at 16:22
  • Thanks for your answer, Pshemo. I also tried `img[^id="wlspispHIPBimg"]`, but it's not working. – Moony Nov 07 '15 at 16:42

2 Answers

You need to find something in the HTML that distinguishes this img tag from every other tag in the document. That can't be deduced from the code you posted, so I am guessing here:

Element imageEl = doc.select("img[src*=rnd]").first();

This relies on the image's src containing "rnd" in its path. To find the most reliable selector, you will have to inspect the page yourself. It also helps a lot to learn the CSS selectors that Jsoup supports.
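
As a rough sketch of how such selectors behave, the snippet below parses a static HTML string built around the img tag from the question. The wrapper page, the decoy `logo` image, its `example.invalid` URL, and the class and method names are assumptions made up for illustration. Note that in Jsoup's attribute-prefix selector the `^` goes directly before the `=`, as in `img[id^=wlspispHIPBimg]`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CaptchaIdDemo {
    // The img tag from the question inside a minimal page.
    // The first img is a hypothetical decoy that the selectors must skip.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<img id=\"logo\" src=\"https://example.invalid/logo.png\">"
        + "<img id=\"wlspispHIPBimg03256465465dsd5456\" aria-hidden=\"true\""
        + " src=\"https://users/hip/data/rnd=435cb60d0a6b63ef4\">"
        + "</body></html>";

    /** Returns the id of the first img whose id starts with the prefix, or null if none matches. */
    static String findIdByPrefix(String html, String prefix) {
        Document doc = Jsoup.parse(html);
        // [attr^=value] matches attribute values that start with the given text.
        Element img = doc.select("img[id^=" + prefix + "]").first();
        return img == null ? null : img.id();
    }

    /** Returns the id of the first img whose src contains the fragment, or null if none matches. */
    static String findIdBySrcFragment(String html, String fragment) {
        Document doc = Jsoup.parse(html);
        // [attr*=value] matches attribute values that contain the given text.
        Element img = doc.select("img[src*=" + fragment + "]").first();
        return img == null ? null : img.id();
    }

    public static void main(String[] args) {
        System.out.println(findIdByPrefix(SAMPLE_HTML, "wlspispHIPBimg"));
        System.out.println(findIdBySrcFragment(SAMPLE_HTML, "rnd"));
    }
}
```

Both helpers return null when nothing matches, so a null result on the live page suggests the img tag is simply not present in the HTML that Jsoup received.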

luksch

I think you simply can't accomplish this using only Jsoup: the DOM is modified at runtime with JavaScript, and Jsoup does not execute JavaScript.
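
A small sketch of that limitation, assuming a hypothetical page where an inline script injects the captcha image (the markup, ids, and class name here are invented for illustration and are not taken from the real Microsoft page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticDomDemo {
    // Hypothetical page: the img would only exist after a browser runs the inline script.
    static final String SCRIPTED_HTML =
        "<html><body><div id=\"hip\"></div>"
        + "<script>"
        + "document.getElementById('hip').innerHTML ="
        + " '<img id=\"wlspispHIPBimg0\" src=\"captcha.png\">';"
        + "</script></body></html>";

    /** Counts the img elements Jsoup sees in the parsed (never executed) HTML. */
    static int countImages(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("img").size();
    }

    /** Counts the script elements, to show the script itself was parsed. */
    static int countScripts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("script").size();
    }

    public static void main(String[] args) {
        // Script content is kept as raw data, so the img inside it never becomes an element.
        System.out.println("img elements: " + countImages(SCRIPTED_HTML));
        System.out.println("script elements: " + countScripts(SCRIPTED_HTML));
    }
}
```

If the real page behaves like this, the image id has to come from a tool that actually runs the page's JavaScript, which Jsoup alone does not do.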

See also this other question.

Giuseppe Ricupero