10

So I'm building an app that displays an imageboard from a website I visit in a more user-friendly interface. There are a lot of problems with it at the moment, but the biggest one right now is fetching the images to display them.

The way I have it right now, the images are displayed in a GridView of size 12, mirroring the number of images on each page of the imageboard. I'm using Jsoup to scrape the page for the thumbnail image URLs to display in the GridView, as well as getting the URLs for the full size images to display when a user clicks on the thumbnail.

The problem right now is that it takes 8-12 seconds on average for Jsoup to get the HTML page to scrape. I find this unacceptable, and I was wondering whether there is any way to make it faster or whether this is an inherent bottleneck that I can't do anything about.

Here's the code I'm using to fetch the page to scrape:

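try {
    // Fetch and parse the page (this is the network call)
    Document doc = Jsoup.connect(url).get();
    // Thumbnails are the <img> tags whose src contains "/alt2/"
    Elements links = doc.select("img[src*=/alt2/]");
    for (Element link : links) {
        thumbURL = link.attr("src");
        // Derive the full-size image URL from the thumbnail URL
        linkURL = thumbURL.replace("/alt2/", "/").replace("s.jpg", ".jpg");
        imgSrc.add(new Pair<String, String>(thumbURL, linkURL));
    }
}
catch (IOException e) {
    e.printStackTrace();
}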
seraphzero

4 Answers

8

I used Jsoup for a TLFN scraper and I had no issues with speed. You should narrow down the bottleneck. I presume it's your scraping that is causing the speed issue. Try timing your selector and your network traffic separately and see which is to blame. If your selector is to blame, consider another approach for querying and benchmark the results.

For quicker, rough testing you can always run Jsoup from a plain Java project, and once you feel you have improved it, put it back on a device and see whether it shows similar performance improvements.
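
To make the "time each part separately" advice concrete, here is a minimal sketch in plain Java. `PhaseTimer` and `ThrowingSupplier` are hypothetical names made up for this illustration (they are not part of Jsoup), and the lambdas in `main` are stand-ins; in the app they would wrap `Jsoup.connect(url).get()` and `doc.select("img[src*=/alt2/]")`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Jsoup): records how long each named phase takes.
class PhaseTimer {
    // Like Supplier, but allowed to throw checked exceptions such as IOException.
    interface ThrowingSupplier<T> {
        T get() throws Exception;
    }

    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    // Runs the work, adds its wall-clock duration to the label's total, returns its result.
    <T> T time(String label, ThrowingSupplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } catch (Exception e) {
            throw new RuntimeException(label + " failed", e);
        } finally {
            elapsedNanos.merge(label, System.nanoTime() - start, Long::sum);
        }
    }

    // Total milliseconds recorded for a phase, or 0 if it never ran.
    long millis(String label) {
        return elapsedNanos.getOrDefault(label, 0L) / 1_000_000;
    }

    Map<String, Long> nanosByPhase() {
        return elapsedNanos;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Stand-ins for the real fetch and select calls.
        String html = timer.time("fetch", () -> "<img src=\"/alt2/1s.jpg\">");
        Integer hits = timer.time("select", () -> html.contains("/alt2/") ? 1 : 0);
        System.out.println("fetch ms: " + timer.millis("fetch"));
        System.out.println("select ms: " + timer.millis("select") + ", hits: " + hits);
    }
}
```

Keeping the two phases under separate labels means one run tells you whether the network call or the selector dominates, on the desktop and on the device alike.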

EDIT

Not that this is your issue but be aware that using iterators 'can' cause quite a bit of garbage collection to trigger. Typically this is not a concern although if you use them in many places with much repetition, they can cause some devices to take a noticeable performance hit.

not great

for (Element link : links)

better

int i;
Element tempLink;
for (i = 0; i < links.size(); i++) {
    tempLink = links.get(i);
    // use tempLink here
}

EDIT 2

If the image URLs start with /alt2/, you may be able to use ^= instead of *=, which could potentially make the search faster. Additionally, depending on the amount of HTML, you may be wasting a lot of time looking in completely the wrong place for these images. Check whether the images are wrapped inside an identifiable container, something like <div class="posts">. If you can narrow down the amount of HTML to sift through, it may improve performance.
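
One caveat on ^= versus *=, sketched in plain Java rather than with Jsoup itself: `[src^=/alt2/]` only matches when the attribute value begins with that text, while `[src*=/alt2/]` matches it anywhere in the value. The class name and the sample src values below are made up for illustration; `startsWith` and `contains` simply mimic the two selector forms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the two attribute selectors with String methods.
class SrcMatchDemo {
    // Mimics [src^=prefix]: the value must begin with the text.
    static List<String> startingWith(List<String> srcs, String prefix) {
        List<String> out = new ArrayList<>();
        for (String src : srcs) {
            if (src.startsWith(prefix)) {
                out.add(src);
            }
        }
        return out;
    }

    // Mimics [src*=text]: the text may appear anywhere in the value.
    static List<String> containing(List<String> srcs, String text) {
        List<String> out = new ArrayList<>();
        for (String src : srcs) {
            if (src.contains(text)) {
                out.add(src);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample values, not taken from the real site.
        List<String> srcs = Arrays.asList(
                "/alt2/1s.jpg",
                "http://example.com/alt2/2s.jpg",
                "/banner.jpg");
        System.out.println("prefix match:    " + startingWith(srcs, "/alt2/"));
        System.out.println("substring match: " + containing(srcs, "/alt2/"));
    }
}
```

So switching to ^= is only safe if the page really uses src values that begin with /alt2/; with absolute URLs it would silently drop images.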

ian.shaun.thomas
  • Well the thing is that I timed the get() call which is where I got the 8-12 second delay. I'll take a look into the selector though. – seraphzero Apr 24 '12 at 13:58
  • It may be faster to select all images then loop through them manually picking the correct images. – ian.shaun.thomas Apr 24 '12 at 15:02
  • Running more timing tests, I'm most certain it's the get() call that is taking up all the time. On a regular Java project, the get() call is taking around 1-2 seconds and the select() call about 0.05 seconds. Running the same code on the Android emulator, it's taking the aforementioned 8-12 seconds for get() and around 0.7 seconds for select(). – seraphzero Apr 24 '12 at 19:15
  • 1
    AHHH Emulator! Yes, I would be extremely hesitant to trust the results on the emulator. For more realistic results, try following some instructions for setting up an x86 based emulator. It is a night and day difference in performance. Nothing will top using a physical device for development and testing though. – ian.shaun.thomas Apr 24 '12 at 20:10
  • OK, I'll look into setting up an x86 based emulator. I had a friend test the app for a bit on his phone and he said it was slow but I don't remember which part of the app he said was slow. – seraphzero Apr 25 '12 at 01:51
  • Alright, running on the x86 based emulator, it's ranging from 1-6 seconds. Much better, now it's time to tackle the other problems in my app. Thanks a lot. – seraphzero Apr 25 '12 at 03:12
3

Though slightly different, this question has the same answer as Scraping dynamically generated html inside Android app.

In short, you should offload the "download & parse" part to a remote web service. See Web Scraping from Android for a discussion.
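
As a rough sketch of what the app side could look like under that approach: assume a hypothetical service does the download and parsing and returns only the URL pairs, one per line, as "thumbUrl<TAB>fullUrl". That response format and the class name are assumptions made for this illustration, not an existing API. The device then only has to split a small text payload.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical client-side parser: the response format is an assumption,
// not something an existing service is known to return.
class ThumbListParser {
    // Each non-blank line is "thumbUrl<TAB>fullUrl"; malformed lines are skipped.
    static List<Map.Entry<String, String>> parse(String body) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : body.split("\n")) {
            String[] parts = line.trim().split("\t");
            if (parts.length == 2 && !parts[0].isEmpty() && !parts[1].isEmpty()) {
                pairs.add(new SimpleImmutableEntry<>(parts[0], parts[1]));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample payload for illustration.
        String body = "/alt2/1s.jpg\t/1.jpg\n/alt2/2s.jpg\t/2.jpg\n";
        for (Map.Entry<String, String> pair : parse(body)) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```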

Dave Jarvis
Yevgeniy
2

I ran into the very same issue:

The Logcat output from my HTC One S clearly shows that getting the connection response takes only the first 4 seconds (3 connections in parallel). The parsing takes almost 30-40 seconds, which is a huge amount of time. Note that the HTC One S has a very fast dual-core CPU at 1.4 GHz, so the problem is clearly not tied to the emulator.

02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:55.278: DEBUG/MyActivity(10735): =c>
02-27 14:11:59.002: DEBUG/MyActivity(10735): <r=
02-27 14:11:59.012: DEBUG/MyActivity(10735): <r=
02-27 14:11:59.422: DEBUG/MyActivity(10735): <r=
02-27 14:12:33.949: DEBUG/MyActivity(10735): <d=
02-27 14:12:37.463: DEBUG/MyActivity(10735): <d=
02-27 14:12:38.294: DEBUG/MyActivity(10735): <d=
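
To put numbers on the log above, here is a small plain-Java sketch (desktop Java with java.time, not Android code) that subtracts the timestamps. It assumes the first =c>, <r= and <d= lines belong to the same connection, which the log itself does not confirm; the class and method names are made up for this illustration.

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustration only: computes phase durations from Logcat time-of-day stamps.
class LogPhaseDurations {
    // Milliseconds between two stamps such as "14:11:55.278".
    static long millisBetween(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).toMillis();
    }

    public static void main(String[] args) {
        // First connection, assuming the log lines pair up in order.
        long response = millisBetween("14:11:55.278", "14:11:59.002");
        long parsing = millisBetween("14:11:59.002", "14:12:33.949");
        System.out.println("connect -> response: " + response + " ms");   // 3724 ms
        System.out.println("response -> document: " + parsing + " ms");   // 34947 ms
    }
}
```

Under that pairing, the response arrives after about 3.7 seconds and parsing accounts for roughly 35 seconds.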

This is my code:

// Jsoup-Connection
Connection c = Jsoup.connect(urls[0]);
// Request timeout in ms
c.timeout(5000);
Connection.Response r = c.execute();
Log.d("MyActivity","<r= doInBackground ("+urls[0]+")");

// Get the actual Document
Document doc = r.parse();
Log.d("MyActivity","<d= doInBackground ("+urls[0]+")");

Update:

02-27 20:38:25.649: INFO/MyActivity(18253): !=c> 
02-27 20:38:27.511: INFO/MyActivity(18253): !<r= 
02-27 20:38:28.873: INFO/MyActivity(18253): !#d=

I got some new results. The previous ones were from running my app on Android in debugging mode; the results posted now are from running without debugging (from the IntelliJ IDE). Any explanation why debugging makes Jsoup so slow?

Running in debug mode on my i5 desktop machine, I got no performance penalty.

The culprit for my code being so slow on Android is definitely debug mode: it slows Jsoup down by a factor of 100.

cimba007
  • I think it's something related to the huge number of calls to very small methods. I suppose the Java VM/debugger sets internal hooks on each method entry/exit or something like that. Anyway, Jsoup is an incredible and elegant library, and this problem exists only in debug mode. I use Eclipse and disconnect the debugger when I'm tired of waiting. – WindRider Apr 21 '13 at 17:42
0

Can you identify the content you want more precisely? There is only one thing in your code that can slow down its execution:

select("img[src*=/alt2/]")

Do the images you want to get share a common "class"?

ChristopheCVB