Fastest way to load text from URLs using Java without concurrency

Question

Odd request, I know, but as a learning exercise I'm working on a program that takes a .txt file containing a list of URLs pointing to text files on the web. It then hashes each word in each text and lets the user search.

I'm building the program twice, once without concurrency and once with. I'm just about done with the hashing part of the non-concurrent version, and my timings show that the running time scales fairly linearly with the number of URLs in the original file.

The slowest part of the process, though, is actually retrieving the files from the web. Currently I am doing it like so:

URL url = new URL(revURL);
Scanner revScanner = new Scanner(url.openStream());

where revURL is a string passed to the method from main. Is there a faster way to retrieve those files, or is this about as quick as it will get without breaking into concurrency?
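Before looking for a faster call, it may help to confirm how much of the time really goes to the download rather than to scanning and hashing. The following is a minimal sketch, not a drop-in replacement: the class name SequentialFetch, the timeout values, and the UTF-8 charset are all assumptions. It pulls each file fully into memory first so the fetch can be timed on its own. It is unlikely to beat the Scanner version by much, since the network round trip dominates either way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class SequentialFetch {
    /** Downloads the whole resource into a String before any parsing happens. */
    public static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(5_000);   // assumed limits; tune for your links
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            // readAllBytes needs Java 9+; UTF-8 is an assumption about the files
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument plays the role of revURL from the question.
        for (String revURL : args) {
            long start = System.nanoTime();
            String text = fetch(URI.create(revURL).toURL());
            long fetchMs = (System.nanoTime() - start) / 1_000_000;
            String trimmed = text.trim();
            int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            System.out.printf("%s: %d words, fetched in %d ms%n", revURL, words, fetchMs);
        }
    }
}
```

If the fetch time dwarfs the hashing time for every URL, the remaining options are mostly the ones raised in the comments: a faster link, a caching proxy, or overlapping the downloads.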

Get a faster internet connection? Also, look at NIO and selectors, although multiple threads will be a lot easier in your case. — vanza, Jan 07 '14 at 02:53
@vanza - nio does not make networks "faster" it makes programs more scalable by requiring fewer threads to handle more connections. in the single thread case, nio won't change anything. — jtahlborn, Jan 07 '14 at 03:02
You might try an HTTP proxy... but it's unlikely to help, besides caching after the first query. — Elliott Frisch, Jan 07 '14 at 03:13
@jtahlborn: nio is a way to handle several connections "concurrently" without requiring multiple threads. So instead of fetching each URL sequentially, several can be read at the same time, so yes, it can be used to speed things up. For `n` URLs, the total time could be just the time to download from the slowest link, instead of the sum of downloading all `n` URLs. But I'm sure you know that (right?). — vanza, Jan 07 '14 at 17:34
@vanza - how exactly would you get fully parallel downloading using nio when you are only using a single thread? if that were the case, i could build the awesomest, fastest website in the world using a single thread. — jtahlborn, Jan 07 '14 at 18:15
@jtahlborn: did you see the quotes around "concurrently"? You can read/write from multiple sockets as data is available / buffer space is available in each. So while not really concurrent, it allows handling multiple sockets in a non-sequential manner, without having to block when one of the sockets is waiting for data to arrive / go over the pipe, which can lead to the speed ups I explained. Looks like you need to play more with asynchronous i/o. A comment really is sort of short to explain all that, you know... — vanza, Jan 07 '14 at 18:56
@vanza - i understand very well how the nio library works. your original, vague assertion was that nio makes network code faster. i replied that nio is about scalability, not speed. yes, if the endpoints are slow, using nio with a single thread may be a bit faster than serial downloads. however, if the endpoints are close to the speed of your connection, then nio won't gain you anything. — jtahlborn, Jan 07 '14 at 19:09
@jtahlborn: I never said what you say I did, and even now you agree that nio can make the code run faster than just serializing everything (even if each individual download will not be faster, duh, that's pretty obvious isn't it?), so what is your point really? Anyway, this is beyond the scope of the question. — vanza, Jan 07 '14 at 19:11
@vanza - don't understand what you mean about each individual download. i said it _might_ be faster overall in certain situations and it might not be in others (more likely). sorry if i misinterpreted your first comment, i interpreted it to mean that you thought nio would make things faster (since that's what the original question was about). — jtahlborn, Jan 07 '14 at 23:02
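
To make the NIO point in the comments concrete, here is a rough single-threaded sketch of several downloads multiplexed over one Selector. Everything in it is an assumption for illustration: the class name SingleThreadFetcher, the {host, port, path} target format, plain HTTP/1.0 without TLS, and UTF-8 decoding. It returns raw responses with headers still attached. Note that hostname resolution in InetSocketAddress still blocks, and whether this beats serial downloads depends on how slow the endpoints are relative to your own connection, as discussed above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SingleThreadFetcher {
    /** Per-connection state: the request still to send and the bytes received so far. */
    private static final class Download {
        final ByteBuffer request;
        final ByteArrayOutputStream response = new ByteArrayOutputStream();

        Download(String host, String path) {
            // HTTP/1.0 so the server closes the connection when the body ends
            String req = "GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
            request = ByteBuffer.wrap(req.getBytes(StandardCharsets.US_ASCII));
        }
    }

    /** Each target is {host, port, path}. Returns raw responses in input order. */
    public static List<String> fetchAll(List<String[]> targets) throws IOException {
        List<Download> downloads = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            try {
                for (String[] t : targets) {
                    Download d = new Download(t[0], t[2]);
                    downloads.add(d);
                    SocketChannel ch = SocketChannel.open();
                    ch.configureBlocking(false);
                    SelectionKey key = ch.register(selector, SelectionKey.OP_CONNECT, d);
                    // connect() may complete immediately, e.g. for local addresses
                    if (ch.connect(new InetSocketAddress(t[0], Integer.parseInt(t[1])))) {
                        key.interestOps(SelectionKey.OP_WRITE);
                    }
                }
                ByteBuffer buf = ByteBuffer.allocate(8192);
                int open = downloads.size();
                while (open > 0) {
                    selector.select();  // blocks until at least one channel is ready
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove();
                        SocketChannel ch = (SocketChannel) key.channel();
                        Download d = (Download) key.attachment();
                        if (key.isConnectable() && ch.finishConnect()) {
                            key.interestOps(SelectionKey.OP_WRITE);
                        } else if (key.isWritable()) {
                            ch.write(d.request);
                            if (!d.request.hasRemaining()) {
                                key.interestOps(SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            buf.clear();
                            int n = ch.read(buf);
                            if (n < 0) {        // end of stream: this download is done
                                ch.close();     // closing also cancels the key
                                open--;
                            } else {
                                d.response.write(buf.array(), 0, n);
                            }
                        }
                    }
                }
            } finally {
                // Close any channels left open if an error interrupted the loop
                for (SelectionKey key : new ArrayList<>(selector.keys())) {
                    key.channel().close();
                }
            }
        }
        List<String> results = new ArrayList<>();
        for (Download d : downloads) {
            results.add(new String(d.response.toByteArray(), StandardCharsets.UTF_8));
        }
        return results;
    }
}
```

A call might look like `SingleThreadFetcher.fetchAll(List.of(new String[]{"example.com", "80", "/"}))`, where the host is only a placeholder. One failed connection aborts the whole batch in this sketch; real code would handle errors per download and strip the HTTP headers before hashing.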

0 Answers