
Hi guys, I have a client that would like to check URL variations on their website. They have 5 million URLs to check. If I were to send requests/pings synchronously, it would take me 23 days. So I'm looking for a multithreaded solution. I originally started on this problem in Python, but didn't see much improvement and couldn't scale it well, so here I am in Java, and if this fails too, I'll try Go before throwing in the towel.

The issue is I'm not seeing any improvements at all with multithreading. Perhaps I'm implementing it wrong, could anyone please help me?

Edits:

I'll just be making edits here, and newcomers can look at the history of this post to see how I've progressed through the problem.

This is the socket suggestion; it fails when I try to run it in a thread, and I'm unsure what I'm doing wrong here too.

Main Class:

package com.company;

import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.TimeUnit;

public class Main extends Thread {
    public static void main(String[] args) throws IOException {
        long startTime = System.nanoTime();
        Helpers.get("www.google.com", 80); // works here
        String path = "test.txt";
        boolean append = true;
        for (int x = 0; x < 1; x++) {
            ArrayList<String> urls = new ArrayList<String>();
            // when x = 0, y = 0..10; when x = 1, y = 10..20
            for (int y = x * 10; y < ((x + 1) * 10); y++) {
                urls.add(String.format("www.google%d.com/", y)); // doesn't work here
            }
            Thread thread = new Thread(new Helpers(path, append, urls, 80));
            thread.start();
            thread.interrupt();
        }
        long endTime = System.nanoTime();
        long duration = TimeUnit.NANOSECONDS.toMillis(endTime - startTime);
        System.out.println(duration + " ms");
    }
}

Helpers Class:

package com.company;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.*;
import java.util.ArrayList;

public class Helpers extends Thread {
    public Helpers(String path, boolean append, ArrayList<String> urls, int port) throws IOException {
        this.run(path, append, urls, port);
    }

    public void run(String path, boolean append, ArrayList<String> urls, int port) throws IOException {
        for (String url : urls) {
            String status = Helpers.get(url, port);
            Helpers.writeToFile(path, append, status);
            System.out.println(status);
        }
    }

    public static String get(String url, int port) throws IOException {
        try {
            Socket conn = new Socket(url, port);
            conn.close();
            return url + " | Success";
        } catch (UnknownHostException error) {
            return url + " | Failed";
        }
    }

    // writeToFile(path, append, status) is referenced above but its
    // definition was not included in the post.
}
Lafftar
  • You're leaking connections. You need to at least close the input stream of the `HttpURLConnection`, if you manage to get it. – user207421 Aug 05 '20 at 01:20
  • @MarquisofLorne I'm sorry, I don't really understand, could you please give an example? – Lafftar Aug 05 '20 at 04:45
  • Example of using [HttpURLConnection](https://www.journaldev.com/7148/java-httpurlconnection-example-java-http-request-get-post), and a simpler [one](https://stackoverflow.com/questions/4767553/safe-use-of-httpurlconnection) – Lebecca Aug 06 '20 at 16:18
  • Use `ForkJoinPool`/`CompletableFuture` to take advantage of your cpu's cores (people tend to have more now), and then you need to write your tasks as `Runnable` objects. From that point forward, you need to consider the design from the standpoint of tasks running without bumping into each other; shared resources need to be atomic or locked, and the way you process info may be different. For example, maybe you have 4 worker threads whose sole purpose is to chug out as many http connections as possible for your cpu (they get a request, add to their queue, and feed a result) – Rogue Aug 06 '20 at 16:54
  • @Lebecca thank you for the example, I think I implemented it the way they did, but I don't think it changed the performance very much. – Lafftar Aug 07 '20 at 14:50
  • @Rogue I'll see if that helps and report back, thank you for the suggestion. – Lafftar Aug 07 '20 at 14:51
  • The examples are not for enhancing performance, but for releasing connections correctly. – Lebecca Aug 07 '20 at 17:10
  • @Lebecca Oh I see, I think the connections got released fine then? I'm unsure how to check. – Lafftar Aug 08 '20 at 04:04
  • You got no performance improvement because you put all URL accesses in one helper instance and run that helper in a single thread, which leads to accessing the URLs synchronously, one by one. You need to put every URL access in a separate task so they can run concurrently. On top of that, you don't need to create millions of threads; use a thread pool instead. For example, with a thread pool containing 100 available threads, you submit your tasks into it and get roughly a 100x speedup over your current solution. – Lebecca Aug 08 '20 at 04:14
  • @Lebecca, you're referring to this:? `Thread thread = new Thread(new Helpers(path, append, urls));` What I'm trying to do is create a helper object with a group of URLs, then start each object in its own thread. Is that not what I'm doing with that block? You can check the edits for the old code. I would really appreciate any suggestions to make it work as intended. By million-level threads, are you referring to user or kernel? – Lafftar Aug 08 '20 at 05:13
  • You actually created one thread. – Lebecca Aug 08 '20 at 08:16
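Putting these comments together, here is a hedged sketch of the thread-pool approach Lebecca and Rogue describe: one task per URL, submitted to a fixed pool that bounds concurrency. The class and method names are my own, and the 2-second connect timeout is an assumption, not something from the thread.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledChecker {
    // Submit one task per host; the fixed pool caps how many
    // connections are attempted at once instead of spawning a
    // thread per URL.
    public static List<String> checkAll(List<String> hosts, int port, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<>();
        for (String host : hosts) {
            futures.add(pool.submit(() -> check(host, port)));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get()); // blocks until that task finishes
        }
        pool.shutdown();
        return results;
    }

    // Same success/failure format as the question's Helpers.get.
    static String check(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000); // 2 s connect timeout
            return host + " | Success";
        } catch (Exception e) {
            return host + " | Failed";
        }
    }
}
```

With 100 threads and mostly I/O-bound work, throughput should scale roughly with the pool size until the network or the remote servers become the bottleneck.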

2 Answers


You are implementing it wrong. Your `Helpers` class should extend `Thread` or implement `Runnable`. Pass everything it needs through the constructor, for example the URL list, the file path, etc.

In your main class, create a `Helpers` object and then run it as a thread.
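For illustration, a minimal sketch of what this answer describes: a `Runnable` whose constructor only stores state and whose `run()` does the work. The file-writing helper below is a simplified stand-in for the `writeToFile` referenced in the question, not its actual definition.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.List;

public class Helper implements Runnable {
    private final String path;
    private final boolean append;
    private final List<String> urls;
    private final int port;

    // The constructor only stores state; the work happens in run(),
    // which the Thread calls on its own thread after start().
    public Helper(String path, boolean append, List<String> urls, int port) {
        this.path = path;
        this.append = append;
        this.urls = urls;
        this.port = port;
    }

    @Override
    public void run() {
        for (String url : urls) {
            String status;
            try (Socket conn = new Socket(url, port)) {
                status = url + " | Success";
            } catch (IOException e) {
                status = url + " | Failed";
            }
            writeToFile(status);
        }
    }

    // Simplified stand-in for writeToFile from the question. If several
    // threads share one output file, their writes need coordination.
    private void writeToFile(String line) {
        try (PrintWriter out = new PrintWriter(new FileWriter(path, append))) {
            out.println(line);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

It would be started with `new Thread(new Helper("test.txt", true, urls, 80)).start();`. Note that `run()` only executes on a separate thread if you call `start()`; calling `run()` directly (or from the constructor) runs it on the calling thread.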

J.J
  • Thank you for taking the time to help, I've implemented it as you suggested, but am still not seeing any improvements. Is there something I'm doing wrong? – Lafftar Aug 06 '20 at 16:16
  • I am not sure what improvements you are hoping to see. You need to check how long each thread takes to finish. You could also go more granular and see how long it takes to write to the file. In the code, all threads write to the same file and that could be a problem. You are also not closing the HTTP URL connection. You need to close that in a finally block. – J.J Aug 06 '20 at 16:48
  • I just don't think it's running concurrently, because it's printing out in order from 1-100, not interleaved like 1 - 13 - 2 - 23 - 4 - 3, you know? I did try it with closing the URL connection, didn't help. – Lafftar Aug 07 '20 at 14:49
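For reference, a hedged sketch of the "close it in a finally block" pattern the comment above suggests for `HttpURLConnection`. The method name and the 2-second timeouts are my own choices, not from the thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StatusCheck {
    // Issue a HEAD request and report the status code, making sure the
    // response stream and connection are released whatever happens.
    public static String head(String urlString) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(urlString).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            int code = conn.getResponseCode();
            // Close whichever stream the server gave us so the socket
            // can be reused by the keep-alive cache instead of leaking.
            InputStream s = (code >= 400) ? conn.getErrorStream() : conn.getInputStream();
            if (s != null) {
                s.close();
            }
            return urlString + " | " + code;
        } catch (IOException e) {
            return urlString + " | Failed";
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

Closing the stream (rather than only calling `disconnect()`) is what lets the JVM return the underlying connection to its pool; `disconnect()` is kept here in the finally block as a belt-and-braces cleanup.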

You can try a different approach. Instead of making an `HttpURLConnection` for every call, you can create one socket connection to the web server and then make multiple requests (GET/HEAD) to different URLs over it.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.List;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/**
 * Makes multiple HEAD requests over one persistent TLS connection.
 *
 * @param hostname hostname of the webserver, e.g. www.w3.org
 * @param urlList  paths to request on that host
 * @throws IOException
 */
public static void makingHTTPCall(String hostname, List<String> urlList) throws IOException {

    SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
    SSLSocket socket = (SSLSocket) factory.createSocket(hostname, 443);

    BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
    PrintWriter out = new PrintWriter(socket.getOutputStream(), true);

    /**
     * if required, create different url lists and pass each list to a
     * separate thread (with its own socket) for better performance
     */
    for (String url : urlList) {
        System.out.println("Making call to url /" + url);
        // HTTP/1.1 requires a Host header; the blank line ends the headers.
        out.print("HEAD " + url + " HTTP/1.1\r\n");
        out.print("Host: " + hostname + "\r\n");
        out.print("\r\n");
        out.flush();

        try {
            // Read the status line and drain the remaining response
            // headers so the next request starts from a clean stream.
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                System.out.println("Response " + line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    try {
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    out.close();
    socket.close();
}

I have tried this with an SSL socket; you can change it according to your configuration.

  • Thank you for taking the time to reply. I tried this, works synchronously but doesn't work with threading. – Lafftar Aug 07 '20 at 14:46
  • Creating connections is a heavy operation especially when you are facing million-level requests. The answer tends to decrease the connection creation expense. – Lebecca Aug 07 '20 at 17:32
  • @Lebecca Okay so that's the reason why we use Socket instead of Http connection class? The socket simply tries to establish the connection, then quits, whereas the HTTP connections are geared more towards transferring data and quitting? – Lafftar Aug 08 '20 at 04:06
  • If all the URLs are on the same webserver: create the Socket once in your main function, then pass that socket to the Helper class, which will make a separate HTTP call for each URL. Use some kind of thread pooling to make the HTTP calls. E.g. for the URL www.google.com/abc.html, make the socket connection for www.google.com and "/abc.html" will be part of urlList. If the URLs are on different hosts, then use a connection pool. – biswajitray Aug 08 '20 at 04:51
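A small sketch of the partitioning step this last comment describes: splitting full URLs into a host and a list of paths, so each host group can be handed to one worker that reuses a single socket. The class name is my own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UrlPartitioner {
    // Split URLs like "www.example.com/abc.html" into host -> list of
    // paths; each host's group can then be served over one connection.
    public static Map<String, List<String>> byHost(List<String> urls) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String url : urls) {
            int slash = url.indexOf('/');
            String host = (slash < 0) ? url : url.substring(0, slash);
            String path = (slash < 0) ? "/" : url.substring(slash);
            groups.computeIfAbsent(host, k -> new ArrayList<>()).add(path);
        }
        return groups;
    }
}
```

Each entry of the resulting map could then become one task for a thread pool, with the worker opening a single socket to the host and issuing HEAD requests for every path in its list.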