I need to download a web page for each of a large number of entities (>700). The code is structured so that one function takes the name of a single entity, downloads its resource page, processes it to extract some attributes, and puts them in a global HashMap. See the code below:

Global data structures used while processing each entity:

static HashMap<String, HashMap<String, ArrayList<String> > > Table = new HashMap<String, HashMap<String, ArrayList<String>>>();
static ArrayList<String> allColumns = new ArrayList<String>();

Single Threaded Code:

BufferedReader br = new BufferedReader(new FileReader(filePath_ListOfEntities));
String entityURL;
while ((entityURL = br.readLine()) != null) {
    String entityID = entityURL.replace("http://dbpedia.org/resource/", "");
    try {
        GetRowForEntityURL(entityID); // Downloads page, processes it and updates the global DSs
    } catch (Exception e) {
        System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
    }
}
PrintTable(); // prints the global hashmap

When the resource pages have already been downloaded, the processing is much faster, which means the rate-limiting operation is downloading the resource page. The work for one entity is independent of the others, but a page can only be processed once it is available. Hence, I tried to run GetRowForEntityURL(entityID) in a separate thread for each entity. Below is the multithreaded code, which took even more time than the single-threaded code:

Multithreaded Code:

BufferedReader br = new BufferedReader(new FileReader(filePath_ListOfEntities));
String entityURL;
ArrayList<Thread> threads = new ArrayList<>();
while ((entityURL = br.readLine()) != null) {
    String entityID = entityURL.replace("http://dbpedia.org/resource/", "");

    Thread t = new Thread(new Runnable() {
        public void run()
        {
            try{
                GetRowForEntityURL(entityID); // Downloads page, processes it and updates the global DSs
            }catch(Exception e)
            {
                System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
            }
        }
    });
    t.run();
    threads.add(t);
}
for(int i = 0; i < threads.size(); i++)
    threads.get(i).join();
System.out.println("***********Threads Joined******************");
PrintTable(); // prints the global hashmap

Why is the multithreaded code not faster, given that the entities should be processed in parallel and hence the downloads should happen in parallel? It should have been much faster than the single-threaded version.

EDIT:

It is now clear that even after using t.start(), the downloading happens on a single connection. I need to improve the download code so that it actually takes advantage of the multiple threads. Here is my download code, in which I tried to create a new connection in each call (and so in each thread), but I guess that is not working.

public static void downloadFile(String entityID) throws IOException {
    String fileURL = "http://dbpedia.org/data/" + entityID + ".rdf";
    String saveDir = inputFolder;
    URL url = new URL(fileURL);
    HttpURLConnection httpConn;// = (HttpURLConnection) url.openConnection();
    int responseCode;// = httpConn.getResponseCode();
    do{
        httpConn = (HttpURLConnection) url.openConnection();
        responseCode = httpConn.getResponseCode();
    }
    while(responseCode != HttpURLConnection.HTTP_OK);

    // always check HTTP response code first
    if (responseCode == HttpURLConnection.HTTP_OK) {
        System.out.println("Downloading for: "+entityID);
        String fileName = "";
        String disposition = httpConn.getHeaderField("Content-Disposition");

        if (disposition != null) {
            // extracts file name from header field
            int index = disposition.indexOf("filename=");
            if (index > 0) {
                fileName = disposition.substring(index + 10,
                        disposition.length() - 1);
            }
        } else {
            // extracts file name from URL
            fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1,
                    fileURL.length());
        }

        // opens input stream from the HTTP connection
        InputStream inputStream = httpConn.getInputStream();
        String saveFilePath = saveDir + File.separator + fileName;

        // opens an output stream to save into file
        //saveFilePath.replace(".rdf", ".txt");
        String downloadAt = inputFolder + entityID + ".txt";
        FileOutputStream outputStream = new FileOutputStream(downloadAt);

        int bytesRead = -1;
        byte[] buffer = new byte[4096];
        while ((bytesRead = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, bytesRead);
        }

        outputStream.close();
        inputStream.close();

        //System.out.println("File downloaded");
    } else {
        System.out.println("Download Failed. Server replied HTTP code: " + responseCode);
    }
    httpConn.disconnect();
}
Bit Manipulator
  • How many threads are you creating? You should try an `Executor` with a fixed threadpool size instead. Then you can easily tune the amount of threads and see how it affects the time. – Kayaman Nov 23 '14 at 08:35
  • One thread per entity, and there are more than 500 entities. – Bit Manipulator Nov 23 '14 at 08:44
  • There are several fallacies here. For one thing, there is only one network between you and the target, which isn't multi-threaded, so there is little actual reason to believe that multi-threading the downloads will actually improve total download time. – user207421 Nov 23 '14 at 08:50
  • @EJP, please explain "one network between you and the target". I have different files for each entity. Is it like - the server makes only one connection for one client (ipaddress) even if the requests are for different files? – Bit Manipulator Nov 23 '14 at 09:01

3 Answers

Use an Executor with a thread pool size of 1 to measure the single-threaded speed. Then increase the pool size to see how it affects the time.

You will then notice how performance actually degrades with a pool size of 500, due to all the context switching that's happening.
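A minimal sketch of that experiment follows. It is not the asker's code: `fetchRow` is a hypothetical placeholder that merely sleeps in place of GetRowForEntityURL, and a `ConcurrentHashMap` stands in for the global table, because a plain HashMap is not safe to update from several threads at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizeExperiment {

    // Thread-safe stand-in for the question's global table (assumption).
    static final Map<String, String> table = new ConcurrentHashMap<>();

    // Hypothetical placeholder for GetRowForEntityURL: simulates a slow download.
    static void fetchRow(String entityID) throws InterruptedException {
        Thread.sleep(50); // pretend network latency
        table.put(entityID, "row for " + entityID);
    }

    // Runs every entity on a pool of the given size; returns elapsed milliseconds.
    static long runWithPool(List<String> entityIDs, int poolSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (String entityID : entityIDs) {
            pool.submit(() -> {
                try {
                    fetchRow(entityID);
                } catch (Exception e) {
                    System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow(); // give up on stragglers after the timeout
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            ids.add("Entity_" + i);
        }
        for (int poolSize : new int[] {1, 4, 16}) {
            table.clear();
            long ms = runWithPool(ids, poolSize);
            System.out.println("pool=" + poolSize + " took " + ms + " ms, rows=" + table.size());
        }
    }
}
```

With the simulated latency, larger pools should finish sooner. Against a real server the picture can differ, since bandwidth and server-side limits also play a role, which is why measuring several pool sizes is worthwhile.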

Kayaman

Something is wrong here, and I can't see clearly what it is, since we only have a small piece of the puzzle.

As @EJP suggested, you may only have a single connection [modem?] to the internet. When thread 1 connects and waits for a response, threads 2-n wait on the connection. So what you are doing is essentially single-threaded.

If you can multiplex somehow, then maybe you can speed things up, the way a browser does with new tabs: open in new tab, open in new tab, and so on. Every request goes out, but the browser doesn't wait for each reply; it handles the replies asynchronously.

That won’t work if the destination is not multiplexed as well.

edharned

As the other answers suggested, I checked whether new threads were even being created, by printing a message at the start of each thread's task, as shown:

Thread t = new Thread(new Runnable() {
    public void run()
    {
        System.out.println("New Thread Started");
        try{
            GetRowForEntityURL(entityID); // Downloads page, processes it and updates the global DSs
        }catch(Exception e)
        {
            System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
        }
    }
});

It turned out to be single-threaded. This is not a rigorous test, but had the tasks been running in parallel, the message for each thread would have been printed almost instantly. Instead, one message was printed, followed by a long pause (presumably the processing time), and only then was the second message printed. So it was clear that the program was running on a single thread.

I re-checked how to write multithreaded code and realized the mistake: instead of calling t.run(), we need to call t.start(). This made the code multithreaded, which I could verify from the messages being printed instantly. But now, as noted by @EJP and @edharned, the server is responding with error codes; that is a different problem.

Correct code for calling a method in a separate, parallel thread:

// Keep references to the threads so that the main thread can join them later,
// i.e. wait for the parallel threads to complete the tasks assigned to them.
ArrayList<Thread> threads = new ArrayList<>();
while ((entityURL = br.readLine()) != null) {
    String entityID = entityURL.replace("http://dbpedia.org/resource/", "");

    Thread t = new Thread(new Runnable() {
        public void run()
        {
            try{
                // Call any Function that you want to be executed in the parallel thread
                GetRowForEntityURL(entityID); // Downloads page, processes it and updates the global DSs
            }catch(Exception e)
            {
                System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
            }
        }
    });
    t.start(); // NOTE: This was the mistake. Call t.start() and not t.run()
    threads.add(t); // remember the thread so that it can be joined later
}

//Join the threads i.e. wait till all the created threads have finished their task
for(int i = 0; i < threads.size(); i++)
    threads.get(i).join();

t.run() does not spawn a new thread; it simply calls run() in the current thread, whereas t.start() spawns a new thread. More on the difference between Thread.start() and Thread.run() can be found in this stackoverflow answer.
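The difference can be observed directly by recording which thread executes the Runnable. The sketch below is illustrative only; the class name, the helper `executingThreadName`, and the thread name `download-worker` are made up for the example.

```java
public class RunVsStart {

    // Returns the name of the thread that actually executed the Runnable.
    static String executingThreadName(boolean useStart) throws InterruptedException {
        final String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = Thread.currentThread().getName(), "download-worker");
        if (useStart) {
            t.start(); // run() executes on the new thread named "download-worker"
            t.join();  // wait for it; join() also makes the write to seen[0] visible
        } else {
            t.run();   // plain method call: run() executes on the caller's thread
        }
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("run():   " + executingThreadName(false));
        System.out.println("start(): " + executingThreadName(true));
    }
}
```

When launched through `main`, the first line reports the `main` thread and the second reports `download-worker`, which matches the behaviour seen with the print statements earlier.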

Bit Manipulator
  • So is it working? Did you answer your own question? The "other" problem you mention is your real problem. When are you going to address that? – edharned Nov 24 '14 at 15:27
  • Multiple threads are being created, but only 1/4 of the pages are actually downloaded and processed. For the rest I get the error `Download Failed. Server replied HTTP code: 503`. I am trying to resolve that and will add that part to my answer, and only then mark it as accepted. Meanwhile, I wrote the answer to correct the fact that multiple threads were not even being created earlier. – Bit Manipulator Nov 25 '14 at 06:04
  • It also gives the error `Error: Connection timed out: connect` and keeps waiting if I keep polling till Server replies `HttpURLConnection.HTTP_OK` code. – Bit Manipulator Nov 25 '14 at 06:35
  • I have edited the question to focus on the specific problem now i.e. dynamic (different) connections for each thread. Have I identified it correctly? – Bit Manipulator Nov 25 '14 at 06:42
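
The 503 responses and the endless polling described in these comments suggest capping the number of retries and pausing between them, rather than looping until the server answers HTTP_OK. The following is only a sketch under assumptions: `Attempt` is a hypothetical interface introduced so the retry logic can run without a network, and the delays are arbitrary. In real code the attempt would open a fresh HttpURLConnection, ideally with connect and read timeouts set.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

public class BoundedRetry {

    // Hypothetical abstraction: one try at fetching, yielding an HTTP status code.
    interface Attempt {
        int responseCode() throws IOException;
    }

    // Tries until HTTP 200 or until maxAttempts is reached, doubling the pause each time.
    // Returns the last status code seen, or -1 if the last try threw an IOException.
    static int retry(Attempt attempt, int maxAttempts, long initialDelayMs)
            throws InterruptedException {
        long delay = initialDelayMs;
        int last = -1;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                last = attempt.responseCode();
                if (last == HttpURLConnection.HTTP_OK) {
                    return last;
                }
            } catch (IOException e) {
                last = -1; // e.g. "Connection timed out: connect"
            }
            if (i < maxAttempts) {
                Thread.sleep(delay); // back off before the next try
                delay *= 2;
            }
        }
        return last;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated server: answers 503 twice, then 200.
        int code = retry(
                () -> ++calls[0] < 3 ? HttpURLConnection.HTTP_UNAVAILABLE : HttpURLConnection.HTTP_OK,
                5, 10);
        System.out.println("final code " + code + " after " + calls[0] + " attempts");
    }
}
```

Combined with a small fixed thread pool instead of one thread per entity, a bounded retry like this keeps the client from hammering a server that is already refusing requests.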