Fetch urls from html with jsoup

Question

I am trying to fetch a URL with jsoup in order to download an image from that URL, but for some reason it doesn't work.

First I am trying to find where div class="rg_di" appears in the HTML file for the first time, and then to fetch the URL that comes right after it:

a href="http://www.google.co.il/imgres?imgurl=http://michellepicker.files.wordpress.com/2011/03/grilled-chicken-mexican-style.jpg&amp;imgrefurl=http://michellepicker.wordpress.com/2011/04/25/grilled-chicken-mexican-style-black-beans-guacamole/&amp;h=522&amp;w=700&amp;tbnid=4hXCtCfljxmJXM:&amp;zoom=1&amp;docid=ajIrwZMUrP5_GM&amp;ei=iVOqVPmDDYrnaJzYgIAM&amp;tbm=isch"

This is the URL of the HTML:

view-source:https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955

Here is the code I tried:

try 
        {
            doc = Jsoup.connect(url).get();
            Element link = doc.select("div.rg_di").first();
            Element link2 = link.select("a").first();
            String relHref = link2.attr("href"); // == "/"
            String absHref = link2.attr("abs:href"); // read the href from the <a> element, not from the enclosing <div>
            tmpResult = absHref;



        } 
        catch (Exception e) 
        {
            Log.e("Error", e.getMessage());
            e.printStackTrace();
        }

Full activity code:

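package com.androidbegin.parselogintutorial;

// Imports the code below needs but the original listing left out
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import android.app.Activity;
import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import android.os.AsyncTask;
import android.os.Bundle;
import android.util.Log;
import android.widget.ImageView;
import android.widget.TextView;

import com.androidbegin.parselogintutorial.SingleRecipe.urlTask;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.koushikdutta.urlimageviewhelper.sample.UrlImageViewHelperSample;
import com.parse.GetCallback;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import com.parse.ParseUser;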
public class Bla extends Activity
{
    ImageView iv,bm;
    TextView recipeTitle;
    String urlForImage = "";
    @Override
    protected void onCreate(Bundle savedInstanceState) 
    {
        // TODO Auto-generated method stub
        super.onCreate(savedInstanceState);
        setContentView(R.layout.bla_layout);
        new urlTask("grilled mexican chicken").execute("grilled mexican chicken");
        //new DownloadImageTask((ImageView)findViewById(R.id.RecipeImage)).execute(urlForImage);
    }
    public class DownloadImageTask extends AsyncTask<String, Void, Bitmap> 
    {
        ImageView bmImage;
        public DownloadImageTask(ImageView bmImage) {
            this.bmImage = bmImage;
        }
        protected Bitmap doInBackground(String... urls) 
        {
            String urldisplay = urls[0];
            Bitmap mIcon11 = null;
            try 
            {
                InputStream in = new java.net.URL(urldisplay).openStream();
                mIcon11 = BitmapFactory.decodeStream(in);
                in.close();
            } 
            catch (Exception e) 
            {
                Log.e("Error", e.getMessage());
                e.printStackTrace();
            }
            return mIcon11;
        }
        protected void onPostExecute(Bitmap result) 
        {
            bmImage.setImageBitmap(result);
        }   
    }
    public class urlTask extends AsyncTask<String, Void, String> 
    {
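        String str;
        String tmpResult;
        Document doc;
        public urlTask(String str)
        {
            this.str = str;
            // Assign here: a field initializer runs before this.str is set and would leave tmpResult null
            this.tmpResult = str;
        }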
        protected String doInBackground(String... urls) 
        {
            String urldisplay = urls[0];
            String url = "https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955";
            WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); // Chrome not working
            HtmlPage page = null;
            try 
            {
                page = webClient.getPage(url);
            } catch (FailingHttpStatusCodeException e1) 
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            catch (MalformedURLException e1) 
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            catch (IOException e1) 
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            } 
            try 
            {
                Document doc = Jsoup.parse(page.asXml());
                Elements divs = doc.select(".rg_di");
                for(Element div : divs)
                {
                    Element img = div.select("a").get(0);
                    String link  = img.attr("href");
                    System.out.println(link);
                }

            }
            catch (Exception e) 
            {
                 e.printStackTrace();
            }
            return tmpResult;
        }
        protected void onPostExecute(String result) 
        {
            result = tmpResult;
            urlForImage = tmpResult;
        }   
    }
}

Thanks for any help.

score 4 · Answer 1 · edited May 23 '17 at 12:16

I edited your code to get rid of the 403 error.

Instead of this:

doc = Jsoup.connect(url).get();

write this:

doc = Jsoup.connect(url).userAgent("Mozilla").get();
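
For context, userAgent(...) sets the User-Agent request header, and some servers reject requests that arrive with the default Java agent string. The following is a minimal sketch of the same idea using only the JDK's HttpURLConnection, assuming the 403 is caused by that header. The class name UserAgentDemo and the method open are made up for illustration, and no request is sent until the connection is actually used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class UserAgentDemo {

    // Hypothetical helper: prepares (but does not send) a request that carries
    // the given User-Agent header.
    public static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = open("https://www.google.co.il/search?q=grilled+mexican+chicken&tbm=isch", "Mozilla");
        // Reads back the header that will be sent; nothing has gone over the network yet.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```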

It seems that the link is generated dynamically. Jsoup fetches HTML that doesn't contain the .rg_di class, so

doc.select("div.rg_di").first();

returns null and we get a NullPointerException.

This is the HTML snippet downloaded by jsoup:

<img height="104" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcT-pctOxpuUcdq118aFU3s2miRfUa6Ev8eF-UsxARHV-vbcOUV8byEtt2YT" width="140">

The best we can do is to select every img tag and iterate over them, which gives us a list of thumbnail links:

Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements imgs = doc.select("img");
for(Element img : imgs){
    String link  = img.attr("src");
    System.out.println(link);
}

/textinputassistant/tia.png
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcT-pctOxpuUcdq118aFU3s2miRfUa6Ev8eF-UsxARHV-vbcOUV8byEtt2YT
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQMq354p43ddqPcpV9-q_05YkmY7XUPgv6Sl2oQLqFxQ5-IkpGAAuFTLMM
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTW-RinkkW_fBdlHzTJn6vNmR85TR58geQgfjQnEJmOqzjq0Oi-z-8zXjg
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRUXLzKi3UyQ6mF9JD20Z1jYNhVxQz7tkhJIEGOL3kua8ptoQrvo8-Nco_X
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTverQlzF_hauCabscWF4wHLb_q7g9M_UDKO6LaldSRHhsTj7CxtVF2yvc
...

There are numerous solutions for parsing dynamic content (link).

EDIT 1

I used HtmlUnit to render the page:

import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;


public class Main {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        String url = "https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955";
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); // Chrome not working
        HtmlPage page = webClient.getPage(url); 
        try {
            Document doc = Jsoup.parse(page.asXml());
            Elements divs = doc.select(".rg_di");
            for(Element div : divs){
                Element img = div.select("a").get(0);
                String link  = img.attr("href");
                System.out.println(link);
            }
        } catch (Exception e) {
             e.printStackTrace();
        }
    }
}

HtmlUnit has its own HTML parsing API, but I'll stick with the more intuitive jsoup.
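
The hrefs printed by the loop above are Google redirect links rather than image files. Judging by the example link in the question, the address of the full-size image travels in the imgurl query parameter, so one possible way to recover it is to split the query string and URL-decode that parameter. This is a minimal sketch under that assumption: the class name ImgUrlExtractor and the method extractImgUrl are made up for illustration, and Google may change the link format at any time.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ImgUrlExtractor {

    // Hypothetical helper: returns the decoded value of the "imgurl" query
    // parameter, or null if the link has no such parameter.
    public static String extractImgUrl(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return null;
        }
        String query = href.substring(queryStart + 1);
        // jsoup unescapes "&amp;" in attribute values, but handle it anyway
        // in case the link was copied from raw page source.
        for (String pair : query.replace("&amp;", "&").split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("imgurl")) {
                try {
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is always available, so this should not happen
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Shortened version of the link quoted in the question
        String href = "http://www.google.co.il/imgres?imgurl=http://michellepicker.files.wordpress.com/2011/03/grilled-chicken-mexican-style.jpg&amp;h=522&amp;w=700";
        System.out.println(extractImgUrl(href));
    }
}
```

The returned string could then be passed to something like the question's DownloadImageTask instead of the redirect link.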

EDIT 2

If your goal is to render and parse an HTML page on an Android device, HtmlUnit isn't a good option (source):

HtmlUnit uses Java classes that are not available on Android. On top of that, HtmlUnit uses a bunch of other libraries, some of which may have their own dependencies on these libraries. So, as awesome as HtmlUnit is, I think getting it to run on Android may not be an easy task.

You can try this kind of solution. Or
you can torture yourself and try this solution (you'd better not). Or,
if you take this guy's experience into consideration, it will be better to redesign your software architecture:
1. Create a Java server that renders the webpage and parses it (HtmlUnit + jsoup).
2. Save the parsed data in the server's file system in JSON format (Gson).
3. Create a servlet that sends the JSON file when the Android app requests it.
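
Steps 2 and 3 above can be sketched roughly as follows. To stay dependency-free, this sketch swaps in the JDK's built-in com.sun.net.httpserver.HttpServer for a servlet container and a small hand-written JSON encoder for Gson, and a fixed list stands in for the links that HtmlUnit + jsoup would produce in step 1. The class name ImageLinkServer, the /images path, port 8080, and the example.com links are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ImageLinkServer {

    // Hand-rolled JSON array of strings (stand-in for Gson). It escapes only
    // backslashes and double quotes, which is enough for ordinary URLs.
    public static String toJsonArray(List<String> links) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < links.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"')
              .append(links.get(i).replace("\\", "\\\\").replace("\"", "\\\""))
              .append('"');
        }
        return sb.append(']').toString();
    }

    // Starts an HTTP server that answers GET /images with the links as JSON.
    public static HttpServer start(int port, List<String> links) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/images", exchange -> {
            byte[] body = toJsonArray(links).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder data; a real server would fill this from the parsed page.
        List<String> links = Arrays.asList(
                "http://example.com/a.jpg",
                "http://example.com/b.jpg");
        start(8080, links);
        System.out.println("Serving http://localhost:8080/images");
    }
}
```

The Android app would then only need a plain HTTP GET and a JSON parser, with no page rendering on the device.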

Thanks very much for your answer! I tried to look at all the answers from the link. It is all very complicated; even with HtmlUnit it is not easy to understand how to implement it. Is there any chance you can post a working sample with HtmlUnit? — maor, Jan 07 '15 at 09:00
@maor I'll figure out how to use htmlunit as soon as I get free time — gkiko, Jan 07 '15 at 11:46
First of all, thanks a lot! Second, I tried it and it crashes: "Exception Ljava/lang/NoClassDefFoundError; thrown while initializing Lcom/gargoylesoftware/htmlunit/WebClient". Any idea why? — maor, Jan 11 '15 at 09:06
Have you added the HtmlUnit jar to your classpath? [download](http://sourceforge.net/projects/htmlunit/files/htmlunit/2.15/htmlunit-2.15-bin.zip/download) — gkiko, Jan 11 '15 at 09:21
You mean, did I add it to the libs folder of my project? If so, then yes: I added htmlunit-2.15, htmlunit-core-js-2.15, httpclient-4.3.3, httpcore-4.3.2, and httpmime-4.3.3. — maor, Jan 11 '15 at 09:28
