I need to run multiple cURL commands from Java and have been looking at different ways to do this. One option is ProcessBuilder. The code I have written is given below:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.SocketTimeoutException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

private void performCurl() throws IOException, InterruptedException, ExecutionException {
    List<String> curlArgs = getMyCurlArgs(); // curl -k -v https://www.amazon.com -H <and so on>
    ProcessBuilder processBuilder = new ProcessBuilder(curlArgs);
    processBuilder.redirectErrorStream(true); // merge stderr into stdout
    Process proc = processBuilder.start();

    ExecutorService fixedThreadPool = Executors.newFixedThreadPool(poolSize);
    Future<String> futureOpt;
    try {
        // Drain the output on another thread so curl cannot block on a full pipe.
        futureOpt = fixedThreadPool.submit(() -> {
            // try-with-resources closes the reader and the underlying stream.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
                // Join with "\n" so line breaks in the output are preserved.
                return br.lines().collect(Collectors.joining("\n"));
            }
        });
        boolean terminatedNormally = proc.waitFor(15, TimeUnit.SECONDS);
        if (!terminatedNormally)
            throw new SocketTimeoutException("Timed Out");
    } finally {
        fixedThreadPool.shutdown();
        proc.destroy();
    }
    String content = futureOpt.get(); // This content is what I use.
}

The code above works as expected: cURL fetches the website and returns the HTML content. The problem is that running cURL through ProcessBuilder is extremely CPU intensive, especially since ProcessBuilder starts a separate operating system process for every request.

My question now is:

a) Can I use ProcessBuilder in a more efficient manner?

b) Or, are there other mechanisms to trigger parallel cURL requests from Java?

Are there any other alternatives to run cURL requests in parallel?

mang4521
  • The best alternative is to not use cURL at all. Instead, use a native HTTP client. You can use `HttpURLConnection` but that's pretty low-level. Instead, use Java's own `java.net.http.HttpClient` or a third-party library like Apache's HttpClient. – Rob Spoor Mar 10 '23 at 08:48
  • I was planning to, but it appears to me that Apache HttpClient is usually detected as a bot pretty easily. Are there some good examples of Apache HttpClient usage? – mang4521 Mar 10 '23 at 11:44

1 Answer

There are at least two other options that might perform better.

Firstly, you are running each external curl command individually. You could create a single script (e.g. a shell script) that runs all the commands in parallel, and then execute a single shell command to run it. That would reduce the overhead of all the ProcessBuilders.
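A rough sketch of the single-script idea, assuming a POSIX `sh` and `curl` on the PATH. The class name `BatchCurl`, the inline script, the numbered output files and the 60-second limit are all hypothetical choices for illustration, not something from the question. The URLs are passed as positional arguments so they never need shell quoting.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class BatchCurl {
    // Hypothetical inline script: $1 is the output directory, the remaining
    // arguments are URLs. Each curl runs in the background; "wait" blocks
    // until all of them have finished.
    static final String SCRIPT =
        "dir=$1; shift; i=0; "
      + "for u in \"$@\"; do i=$((i+1)); curl -s -k -o \"$dir/$i.html\" \"$u\" & done; wait";

    // Builds: sh -c SCRIPT sh <outDir> <url1> <url2> ...
    // The second "sh" becomes $0 inside the script.
    static List<String> buildCommand(String outDir, List<String> urls) {
        List<String> cmd = new ArrayList<>(List.of("sh", "-c", SCRIPT, "sh", outDir));
        cmd.addAll(urls);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        Path outDir = Files.createTempDirectory("curl-batch");
        List<String> urls = List.of("https://www.amazon.com", "https://example.com");

        // One ProcessBuilder call for the whole batch.
        Process p = new ProcessBuilder(buildCommand(outDir.toString(), urls))
                .inheritIO()
                .start();
        if (!p.waitFor(60, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // batch took too long
        }
        System.out.println("Responses written to " + outDir);
    }
}
```

Note that this still starts one curl process per URL; what it saves is the per-request `ProcessBuilder`, thread pool and stream handling on the Java side.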

Secondly, you could use common Java HTTP libraries to connect to the server and pull down the content. That would eliminate the overhead of launching an external OS command.
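As an illustration of the second option, here is a minimal sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+) to fetch several URLs concurrently. The class and method names (`ParallelFetch`, `buildRequest`, `fetchAll`) are made up for this example, the 15-second timeout simply mirrors the `waitFor` in the question, and turning a failed request into an empty string is an assumption you would probably want to replace with real error handling.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class ParallelFetch {
    // Builds a GET request with a 15-second overall timeout.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))
                .GET()
                .build();
    }

    // Starts all requests without blocking, then waits for every result.
    // Bodies are returned in the same order as the input URLs.
    static List<String> fetchAll(HttpClient client, List<String> urls) {
        List<CompletableFuture<String>> futures = urls.stream()
                .map(url -> client
                        .sendAsync(buildRequest(url), HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body)
                        .exceptionally(ex -> "")) // assumption: failures become ""
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One client instance is shared by all requests.
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(15))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        List<String> bodies = fetchAll(client, List.of("https://www.amazon.com"));
        bodies.forEach(body -> System.out.println(body.length() + " characters"));
    }
}
```

Because `sendAsync` returns immediately, all requests are in flight at once and no extra OS process or per-request thread pool is needed.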

JohnXF
  • For approach (1), what if I am streaming the URLs to be scraped and have no control over what the URLs are beforehand? When you say common HTTP libraries, are you referring to non-cURL solutions such as Apache HttpClient? What if I would still like to keep cURL as the solution? – mang4521 Mar 10 '23 at 08:08
  • If you absolutely must use the OS cURL then I am not sure you have many options other than `ProcessBuilder` or similar classes for invoking OS commands. However, even if you are streaming the URLs into your class, you could batch them up and run a script via `ProcessBuilder` every N commands, or perhaps you could construct a script that can read input, execute that script once and then feed it the URLs as you get them. – JohnXF Mar 10 '23 at 08:14
  • Any good examples for the alternatives? – mang4521 Mar 10 '23 at 11:45
  • I don't, but a bit of searching should find some hints. You are looking to write a simple script that reads from input line by line (e.g. https://stackoverflow.com/questions/10929453/read-a-file-line-by-line-assigning-the-value-to-a-variable/10929511#10929511) where each line is a URL and the script executes a `curl` command on each. Then use `ProcessBuilder` in your Java code to run that script, connecting to the input stream of the script and sending the URLs to the stream with a newline between each, e.g. https://stackoverflow.com/questions/11573457/java-processbuilder-input-output-stream – JohnXF Mar 10 '23 at 12:38
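The approach described in the last comment could look roughly like the sketch below: one long-lived shell loop started once through `ProcessBuilder`, with URLs written to its standard input as they arrive. It assumes a POSIX `sh` and `curl` on the PATH; the class name `CurlFeeder`, the inline script and the helper `feed` are hypothetical names invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class CurlFeeder {
    // Hypothetical inline script: read one URL per line from stdin and run
    // curl on each. The loop ends when stdin reaches end-of-file.
    static final String SCRIPT = "while IFS= read -r url; do curl -s -k \"$url\"; done";

    static List<String> buildCommand() {
        return List.of("sh", "-c", SCRIPT);
    }

    // Writes each URL followed by a newline and flushes, so the script
    // sees the URL straight away.
    static void feed(PrintWriter out, Iterable<String> urls) {
        for (String url : urls) {
            out.print(url);
            out.print('\n');
            out.flush();
        }
    }

    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder(buildCommand())
                .redirectErrorStream(true)
                .start();

        // Drain the script's output on its own thread so it never blocks on a full pipe.
        Thread reader = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                br.lines().forEach(System.out::println);
            } catch (IOException | UncheckedIOException e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // Closing the writer closes the script's stdin, which ends its loop.
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8))) {
            feed(out, List.of("https://www.amazon.com", "https://example.com"));
        }
        p.waitFor();
        reader.join();
    }
}
```

As written, the script handles the URLs one after another and all responses arrive on a single stream, so you would still need some way to tell them apart, for example by writing each response to its own file with `-o`.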