
I am currently trying to parse a Craigslist page using JSoup for an Android application. Here is the URL to a page that I am trying to parse:

http://seattle.craigslist.org/search/sss?query=ford&sort=rel

When I inspect the elements using Chrome, I can see that the HTML structure for an ad is as follows:

<p class="row" data-pid="4711759405"> 
    <a href="/see/ctd/4711759405.html" class="i" data-id="0:00U0U_d4iR9oMNMBY">
        <img alt="" src="http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg">
    </a> 
    <span class="txt"> 
        <span class="star v" title="save this post in your favorites list"></span> 
        <span class="pl">
    ....

Using JSoup, I am able to parse everything EXCEPT for the img tag. Here is how I am making the HTTP request:

document = Jsoup.connect(url).get();
Elements images = document.select("img");

This finds only 2 images, neither of which is an ad image. I also replicated the HTTP GET request with the Chrome plugin POSTMAN, and the response contains no img tags for any of the ads. Why is this happening, and how can I retrieve the src URL of each ad's img tag?

Note that I am able to retrieve everything else except the img tags.

Jonathan Hedley
user1927638

2 Answers


The ad images on the URL you gave are inserted by JavaScript after the page has loaded. That is why the initial HTML source, which is all JSoup and POSTMAN receive, does not contain any img tags for the ads.

However, there is a mapping between the data-id property of the a element in the HTML structure you posted, and the src property of the generated img tag. For example, let us consider the following element:

<a href="/see/ctd/4711759405.html" class="i" data-id="0:00U0U_d4iR9oMNMBY">

Just retrieve the data-id property from the a element, remove the part up to and including the colon, append _300x300.jpg, and you get the name of the image file. The full URL then becomes:

http://images.craigslist.org/00U0U_d4iR9oMNMBY_300x300.jpg

So, instead of selecting img elements with JSoup, select the a elements and construct the image URLs from their data-id attributes.
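The mapping described above can be sketched in plain Java. This is only an illustration under assumptions: the class and method names (CraigslistImages, imageUrl, imageUrls) are hypothetical, and the host and _300x300.jpg suffix are taken from the single example in this answer, not from any documented Craigslist API, so the pattern may change without notice.

```java
import java.util.ArrayList;
import java.util.List;

public class CraigslistImages {
    // Assumed from the one example URL in this answer; not a documented API.
    private static final String IMAGE_HOST = "http://images.craigslist.org/";
    private static final String SIZE_SUFFIX = "_300x300.jpg";

    /**
     * Builds an image URL from a data-id value such as "0:00U0U_d4iR9oMNMBY".
     * Returns null when the value is missing or has no image ID after the colon.
     */
    public static String imageUrl(String dataId) {
        if (dataId == null) {
            return null;
        }
        // Drop everything up to and including the first colon.
        // If there is no colon, indexOf returns -1 and the whole value is kept.
        String imageId = dataId.substring(dataId.indexOf(':') + 1).trim();
        if (imageId.isEmpty()) {
            return null;
        }
        return IMAGE_HOST + imageId + SIZE_SUFFIX;
    }

    /** Converts a list of data-id values, skipping ads that have no image. */
    public static List<String> imageUrls(List<String> dataIds) {
        List<String> urls = new ArrayList<>();
        for (String dataId : dataIds) {
            String url = imageUrl(dataId);
            if (url != null) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Uses the data-id from the element quoted in the question.
        System.out.println(imageUrl("0:00U0U_d4iR9oMNMBY"));
    }
}
```

With JSoup, the data-id values would come from something like document.select("a.i[data-id]"), calling attr("data-id") on each matched element. The a.i selector is inferred from the class="i" attribute in the posted markup.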

Another solution would be to load the page in a WebView so that the JavaScript gets executed, but I strongly discourage this because of the performance cost.

Yasa Akbulut
  • I can't believe I missed that pattern before. I did look at the data-id at one point, but I must have missed the correlation. Thanks! – user1927638 Oct 13 '14 at 15:48

I'm not 100% sure, but it looks like they might be denying the requests server-side to stop people from doing what you're doing. I'm seeing the same result in POSTMAN that you are.

As a workaround, you could load the page in a WebView and then inject JavaScript to return the entire <html> node. Here is a link to another SO question that also includes alternative methods: how to get html content from a webview?

Community
soundsofpolaris