I have a task in which I need to download a web page for each of a number of entities (>700). The code is designed so that one function takes the name of an entity, downloads its resource page, processes it to mine some attributes, and puts them in a global HashMap. See the code below:
Global data structures used while processing each entity:
static HashMap<String, HashMap<String, ArrayList<String> > > Table = new HashMap<String, HashMap<String, ArrayList<String>>>();
static ArrayList<String> allColumns = new ArrayList<String>();
Single Threaded Code:
BufferedReader br = new BufferedReader(new FileReader(filePath_ListOfEntities));
String entityURL;
while ((entityURL = br.readLine()) != null) {
String entityID = entityURL.replace("http://dbpedia.org/resource/", "");
try{
GetRowForEntityURL(entityID); // Downloads page, processes it and updates the global DSs
}catch(Exception e)
{
System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
}
}
PrintTable(); // prints the global hashmap
Once the resource page for each entity has already been downloaded, the processing becomes much faster, which means that the rate-determining operation is downloading the resource page. Note that the operation for one entity is independent of the others, but a page can only be processed after it becomes available. Hence, I tried to create a separate thread for each call to GetRowForEntityURL(entityID). The following is the multithreaded code, which took more time than the single-threaded code:
Multithreaded Code:
BufferedReader br = new BufferedReader(new FileReader(filePath_ListOfEntities));
String entityURL;
ArrayList<Thread> threads = new ArrayList<>();
while ((entityURL = br.readLine()) != null) {
String entityID = entityURL.replace("http://dbpedia.org/resource/", "");
Thread t = new Thread(new Runnable() {
public void run()
{
try{
GetRowForEntityURL(entityID); // Downloads page, processes it and updates the global DSs
}catch(Exception e)
{
System.out.println("Ignored: " + entityID + " Error: " + e.getMessage());
}
}
});
t.run();
threads.add(t);
}
for(int i = 0; i < threads.size(); i++)
threads.get(i).join();
System.out.println("***********Threads Joined******************");
PrintTable(); // prints the global hashmap
Why is the multithreaded code not faster, given that each entity should be processed in parallel and hence the downloads should happen in parallel? It should have been much faster than the single-threaded version.
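One detail in the multithreaded snippet that seems worth isolating is the call to t.run(). As far as I understand, Thread.run() executes the Runnable body on the calling thread, whereas Thread.start() spawns a new thread for it. The following is a minimal sketch of that difference only, not the original program; the class name RunVsStartDemo, the method executingThreadName, and the thread name "worker" are made up for illustration.

```java
class RunVsStartDemo {
    // Returns the name of the thread that actually executed the Runnable body.
    static String executingThreadName(boolean useStart) {
        final String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = Thread.currentThread().getName(), "worker");
        if (useStart) {
            t.start(); // body runs on the new thread named "worker"
        } else {
            t.run();   // body runs synchronously on the caller's thread
        }
        try {
            t.join();  // returns immediately if the thread was never started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("run():   executed on " + executingThreadName(false));
        System.out.println("start(): executed on " + executingThreadName(true));
    }
}
```

With run(), each loop iteration would finish its download before the next Thread object is even created, so the work stays sequential and only adds the cost of constructing the threads.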
EDIT:
It is now clear that even after using t.start(), the downloading happens on a single connection. I need to improve the download code to actually take advantage of the multiple threads. Here is my download code, in which I tried to create a new connection on each call (and so in each thread), but I guess that is not working:
public static void downloadFile(String entityID) throws IOException {
String fileURL = "http://dbpedia.org/data/" + entityID + ".rdf";
String saveDir = inputFolder;
URL url = new URL(fileURL);
HttpURLConnection httpConn;// = (HttpURLConnection) url.openConnection();
int responseCode;// = httpConn.getResponseCode();
do{
httpConn = (HttpURLConnection) url.openConnection();
responseCode = httpConn.getResponseCode();
}
while(responseCode != HttpURLConnection.HTTP_OK);
// always check HTTP response code first
if (responseCode == HttpURLConnection.HTTP_OK) {
System.out.println("Downloading for: "+entityID);
String fileName = "";
String disposition = httpConn.getHeaderField("Content-Disposition");
if (disposition != null) {
// extracts file name from header field
int index = disposition.indexOf("filename=");
if (index > 0) {
fileName = disposition.substring(index + 10,
disposition.length() - 1);
}
} else {
// extracts file name from URL
fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1,
fileURL.length());
}
// opens input stream from the HTTP connection
InputStream inputStream = httpConn.getInputStream();
String saveFilePath = saveDir + File.separator + fileName;
// opens an output stream to save into file
//saveFilePath.replace(".rdf", ".txt");
String downloadAt = inputFolder + entityID + ".txt";
FileOutputStream outputStream = new FileOutputStream(downloadAt);
int bytesRead = -1;
byte[] buffer = new byte[4096];
while ((bytesRead = inputStream.read(buffer)) != -1) {
outputStream.write(buffer, 0, bytesRead);
}
outputStream.close();
inputStream.close();
//System.out.println("File downloaded");
} else {
System.out.println("Download Failed. Server replied HTTP code: " + responseCode);
}
httpConn.disconnect();
}
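One direction I am considering, sketched under assumptions, is to replace the one-thread-per-entity loop with a fixed-size thread pool and to make the shared table thread-safe, since a plain HashMap is not safe for concurrent writes. In the sketch below, processEntity is a hypothetical stand-in for GetRowForEntityURL that only simulates latency with Thread.sleep, the class name PooledFetchSketch is made up, and the pool size is an arbitrary guess rather than a tuned value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PooledFetchSketch {
    // Thread-safe stand-in for the global table (assumption: one row per entity).
    static final Map<String, List<String>> table = new ConcurrentHashMap<>();

    // Hypothetical placeholder for GetRowForEntityURL: simulated download, then store a row.
    static void processEntity(String entityID) {
        try {
            Thread.sleep(50); // pretend network latency; no real HTTP request is made
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        List<String> row = new ArrayList<>();
        row.add("attributes-of-" + entityID);
        table.put(entityID, row);
    }

    // Submits one task per entity to a bounded pool, waits for completion,
    // and returns the number of rows in the shared table.
    static int processAll(List<String> entityIDs, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (String id : entityIDs) {
            pool.execute(() -> {
                try {
                    processEntity(id);
                } catch (Exception e) {
                    System.out.println("Ignored: " + id + " Error: " + e.getMessage());
                }
            });
        }
        pool.shutdown(); // accept no new tasks; already queued tasks still run
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return table.size();
    }

    public static void main(String[] args) {
        List<String> ids = List.of("A", "B", "C", "D");
        System.out.println("Rows stored: " + processAll(ids, 4));
    }
}
```

A bounded pool also limits how many requests hit the server at once, which may matter if the server throttles or rejects bursts of connections. Whether that is what is happening here is something I have not verified.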