
I need to extract both the ID and the mfg. name (manufacturer name) of every food listed on the following page: https://ndb.nal.usda.gov/ndb/search/list

I am using Jsoup and am pretty new to it.

Here is the HTML source from which I have to extract the ID and name of each food: [image]

And here is my source code in Java:

    try {
        Document doc = Jsoup.connect("https://ndb.nal.usda.gov/ndb/search/list?maxsteps=6&format=&count=&max=50&sort=fd_s&fgcd=&manu=&lfacet=&qlookup=&ds=&qt=&qp=&qa=&qn=&q=&ing=&offset=0&order=asc").userAgent("mozilla/17.0").get();
        Elements temp = doc.select("div.list-left");

        int i = 0;
        for (Element food : temp) {
            i++;
            System.out.println(i + "" + food.getElementsByTag("table").first().text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

With this I get all the information from the first page, but I need to extract the IDs and mfg. names from all pages.

Any help will be appreciated.

R.fred

1 Answer


Try this.

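    // Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
    // org.jsoup.select.Elements, java.io.IOException
    try {
        int maxPage = 3681;  // number of result pages to fetch
        int i = 0;           // running row counter across all pages
        for (int page = 0; page < maxPage; ++page) {
            // Each page holds 50 rows (max=50), so the offset advances by 50 per page.
            Document doc = Jsoup.connect(
                "https://ndb.nal.usda.gov/ndb/search/list"
                + "?maxsteps=6&format=&count=&max=50"
                + "&sort=fd_s&fgcd=&manu=&lfacet=&qlookup=&ds="
                + "&qt=&qp=&qa=&qn=&q=&ing=&offset=" + (page * 50)
                + "&order=asc")
                .userAgent("mozilla/17.0").get();
            // One <tr> per food in the result table.
            Elements rows = doc.select("div.list-left table tbody tr");
            for (Element row : rows) {
                ++i;
                System.out.print("No." + i);
                // :eq(n) is zero-based: cell index 1 holds the ID, index 3 the manufacturer.
                System.out.print(" ID=" + row.select("td:eq(1) a").text());
                System.out.println(" Manufacturer=" + row.select("td:eq(3) a").text());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }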
  • Thank you so much! It works! The only thing is that I need to extract the data from all 3681 pages, and this code only extracts the data from the first page. Do you have any recommendation? Thank you again! – R.fred Mar 17 '17 at 19:19
  • Change `offset=0` to `offset=50`, `offset=100`, ... in the URL. –  Mar 17 '17 at 22:20
  • Appreciate your help! – R.fred Mar 18 '17 at 22:29
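
To make the offset suggestion in the comments concrete, here is a minimal sketch that only builds the page URLs and does no fetching. The class `PagedUrls` and its methods are hypothetical helper names, not part of Jsoup. The query string is copied from the question, and the page size of 50 is assumed from `max=50` in that URL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Jsoup.
class PagedUrls {
    // Query copied from the question's URL, with the offset and order parameters left off.
    static final String BASE =
        "https://ndb.nal.usda.gov/ndb/search/list"
        + "?maxsteps=6&format=&count=&max=50"
        + "&sort=fd_s&fgcd=&manu=&lfacet=&qlookup=&ds="
        + "&qt=&qp=&qa=&qn=&q=&ing=";

    // Assumed to match max=50 in the query above.
    static final int PAGE_SIZE = 50;

    // Offset of the first row on a zero-based page: 0, 50, 100, ...
    static int offsetFor(int page) {
        return page * PAGE_SIZE;
    }

    // Full URL for a zero-based page number.
    static String urlFor(int page) {
        return BASE + "&offset=" + offsetFor(page) + "&order=asc";
    }

    // URLs for pages 0 .. pageCount - 1, in order.
    static List<String> firstUrls(int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int page = 0; page < pageCount; page++) {
            urls.add(urlFor(page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the first three page URLs to check the offsets by eye.
        for (String url : firstUrls(3)) {
            System.out.println(url);
        }
    }
}
```

Each URL from `urlFor(page)` can then be passed to `Jsoup.connect(...)` in the same way as in the answer. Keeping the URL construction separate makes it easy to check the offsets before sending thousands of requests.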