
I have a piece of code which uses ProcessBuilder to run a shell command [cURL], and the response from the command is pretty large [webpage content].

I am using a BufferedReader to read the response from the ProcessBuilder, as shown below:

```java
StringBuilder sb = new StringBuilder();
ProcessBuilder processBuilder = new ProcessBuilder();
List<String> args = getArgs("url_to_be_passed");  // getting my ProcessBuilder args here
processBuilder.command(args);
processBuilder.redirectErrorStream(true);         // merge stderr into stdout
Process proc = processBuilder.start();
BufferedReader br = null;
try {
    InputStream inpc = proc.getInputStream();

    /* Beginning of CPU-intensive section */
    br = new BufferedReader(new InputStreamReader(inpc));
    String line;
    while ((line = br.readLine()) != null)
        sb.append(line);   // note: line breaks are dropped here
    /* End of CPU-intensive section */

    boolean terminated = proc.waitFor(15, TimeUnit.SECONDS);
    if (!terminated)
        throw new SocketTimeoutException("Socket timed out");
} finally {
    if (br != null)
        br.close();
    proc.destroy();
}
String data = sb.toString();   // convert the ProcessBuilder output into a String
```
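For comparison, the pipe can also be drained with a fixed byte buffer into a `ByteArrayOutputStream`, deferring all character decoding to one final call; in the comment discussion below this came out somewhat faster than `BufferedReader#readLine()`. A minimal, self-contained sketch (the class name is arbitrary, and `echo` stands in here for the real curl command):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class ProcessOutputReader {

    // Drain a process's stdout into memory with a fixed byte buffer,
    // deferring all character decoding to a single toString() call.
    static String readAll(Process proc) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream(1 << 16);
        byte[] buf = new byte[8192];
        try (InputStream in = proc.getInputStream()) {
            int n;
            while ((n = in.read(buf)) != -1)
                bos.write(buf, 0, n);
        }
        proc.waitFor();                 // stream is exhausted, so this returns promptly
        return bos.toString("UTF-8");   // one decode instead of per-line decoding
    }

    public static void main(String[] args) throws Exception {
        // "echo" stands in here for the real curl argument list
        Process p = new ProcessBuilder("echo", "hello").start();
        System.out.println(readAll(p).trim());  // prints hello
    }
}
```

Unlike the line-by-line loop above, this variant also preserves the original line breaks, since the bytes are copied through untouched.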

The getArgs() method is as follows:

```java
private List<String> getArgs(String url) {
    List<String> args = new LinkedList<>();
    args.add("curl");
    args.add("-L");                         // follow redirects
    args.add("--silent");
    args.add("--write-out");
    args.add("HTTPSTATUS:%{http_code}");    // append the status code to the output
    args.add(url);
    args.add("-XGET");
    args.add("--compressed");
    args.add("--connect-timeout");
    args.add("10");
    return args;
}
```
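A two-step variant discussed in the comments is to let `ProcessBuilder` redirect curl's output straight to a file, so the JVM never copies the stream through the pipe itself, and then bulk-read the file afterwards. A hedged sketch sticking to Java 8 APIs (class name arbitrary; `echo` again stands in for the curl argument list):

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class RedirectToFile {

    // Run a command, redirect its stdout to a temp file, then bulk-read the file.
    static String run(List<String> command) throws Exception {
        File out = File.createTempFile("curl-out", ".txt");
        out.deleteOnExit();

        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Redirect.to truncates the file; Redirect.appendTo would keep stale data around
        pb.redirectOutput(ProcessBuilder.Redirect.to(out));

        Process proc = pb.start();
        if (!proc.waitFor(15, TimeUnit.SECONDS)) {
            proc.destroy();
            throw new IllegalStateException("command did not finish in time");
        }
        // One bulk read; on Java 11+, Files.readString(out.toPath()) is more direct
        return new String(Files.readAllBytes(out.toPath()), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // "echo" stands in for the curl argument list built by getArgs()
        System.out.println(run(Arrays.asList("echo", "hello")).trim());  // prints hello
    }
}
```

`Redirect.to` truncates the target file on each run, which avoids re-reading leftover data from a previous request.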

I have profiled this piece of code using VisualVM, and the screenshot of the CPU-intensive section is shown below:

[VisualVM profiler screenshot]

My queries are as follows:

  • What is the best way to convert the response from ProcessBuilder into a String?
  • If BufferedReader is indeed a good way to read the response from ProcessBuilder, how can I make it more CPU-friendly?
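For reference, the same fetch can be done without a subprocess at all, using the JDK's own `URLConnection` (this is the direction suggested in the comments). A minimal Java 8 sketch; the timeout values mirror the curl flags, the URL is a placeholder, and the class name is arbitrary:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class DirectFetch {

    // Fetch a URL with the JDK's own HTTP stack instead of a curl subprocess.
    static String fetch(String url) throws Exception {
        URLConnection conn = new URL(url).openConnection();
        conn.setConnectTimeout(10_000);  // mirrors curl's --connect-timeout 10
        conn.setReadTimeout(15_000);     // mirrors the 15-second waitFor above
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try (InputStream in = conn.getInputStream()) {
            int n;
            while ((n = in.read(buf)) != -1)
                bos.write(buf, 0, n);
        }
        return bos.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // placeholder URL; requires network access when run as-is
        System.out.println(fetch("https://example.com").length());
    }
}
```

For HTTP-specific behaviour such as redirect handling (curl's `-L`), the connection can be cast to `HttpURLConnection`.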
mang4521
  • *Which* shell command? – g00se Nov 08 '22 at 14:16
  • @g00se **cURL** to get the response of a webpage. – mang4521 Nov 08 '22 at 14:17
  • Why not read it directly into a string in Java? (Oversimplification): `String data = new String(new URL(site).openStream().readAllBytes());` – g00se Nov 08 '22 at 14:21
  • I am making a curl call with *ProcessBuilder*. Once I have the webpage response as an *InputStream*, I am trying to convert the response into a String using *BufferedReader*. This process is CPU intensive [I have commented the code snippet that is CPU intensive above]. – mang4521 Nov 08 '22 at 14:22
  • See above for (simplified) code – g00se Nov 08 '22 at 14:24
  • @g00se reading the url directly is not possible. My requirement is to get the webpage response using cURL. – mang4521 Nov 08 '22 at 14:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249427/discussion-between-g00se-and-mang4521). – g00se Nov 08 '22 at 14:28
  • But what do you expect? I do not quite understand your question... It is supposed to use all the CPU! – ACV Nov 08 '22 at 14:46
  • @ACV I would like to make the code more efficient where the resource [CPU, memory, time etc] requirement is minimal. – mang4521 Nov 08 '22 at 15:26
  • @mang4521 - the CPU is supposed to be used to max by a process. It is up to the OS to allocate it to different processes. But I definitely want my process to use all the CPU available to complete the task as soon as possible. Memory is another thing altogether. The only way to control CPU time is to introduce sleeps in your processing thread. – ACV Nov 09 '22 at 09:03
  • @ACV I understand that, but the code snippet mentioned above is resource hungry. There ideally should be more efficient ways of handling the same. – mang4521 Nov 09 '22 at 09:35
  • @mang4521 - agree. In that case what you want to do is maybe separate into a thread of its own and let it read and persist the data it got and when it's done to notify the main thread. You're reading a large data stream - it is normal for it to consume CPU. Another approach is to somehow split this data stream into chunks and process them in batches or in parallel (but then need to think how to assemble them back)... Anyway, there is not much you can do here – ACV Nov 09 '22 at 10:16
  • @ACV I already am running the *complete* code snippet shared above as separate threads for multiple incoming requests. I would therefore wish to avoid running the resource hungry snippet as a separate thread. What I would like to do is find out a more efficient way to read from ```ProcessBuilder``` and then convert it to a ```String``` if possible. – mang4521 Nov 09 '22 at 10:25
  • You should move away from ProcessBuilder - it is not sustainable. Find another way as others suggested of fetching data from the internet. – ACV Nov 10 '22 at 16:39
  • @ACV I have scoured many webpages online to find other options apart from ProcessBuilder [for Java], but haven't found many. Are you by any chance aware of any [to make a curl call using Java]? – mang4521 Nov 10 '22 at 16:45
  • @mang4521 as others pointed out, java has its own "curl". You can use that. However! that would not resolve your "problem". Your problem is that reading a large response from a website takes long. So if you don't want that to hold off other "processes" in your application, you must rethink the architecture. Make it a separate app running all the time and persisting the website responses to a database or something in a form which is easy for the main app to consume. – ACV Nov 11 '22 at 09:05
  • @ACV What I do not understand is this. I have utilised, say, Apache HTTPClient to scrape a website. The CPU usage to obtain the response is around 20%. But the same with cURL is 100%. Even if we account for efficient utilisation of resources on Apache's part, the difference shouldn't really be this big. – mang4521 Nov 11 '22 at 10:26
  • Probably something to do with how `Process` is implemented.. try increasing the wrapping `BufferedReader` buffer size in your code – ACV Nov 12 '22 at 11:42
  • @ACV By increasing the buffer size, are you referring to this? ```BufferedReader(Reader in, int size)``` – mang4521 Nov 14 '22 at 14:00
  • What is `FileUtils.readFileToString`? And what actual file/string size are we talking about? – Holger Jan 03 '23 at 15:52
  • @Holger The first code snippet you see in the description is my solution. ```FileUtils.readFileToString``` is a line of code from one of the solutions suggested below [to read the stream into a file and then convert the file content into a string]. This solution did not help. – mang4521 Jan 03 '23 at 16:44
  • Well, since `FileUtils.readFileToString` is not a standard method, it’s pointless to say that the method, whose implementation we don’t know, did not solve the problem. You also forgot to mention the sizes. And while we are at it, naming the total time would also help to judge the CPU utilization. Further things to clarify: how is the CPU utilization when using the same curl command without Java? And when you split the work into running the subprocess and reading the string, which part is then consuming the CPU? – Holger Jan 03 '23 at 16:48
  • I tried running cURL with python (os) and the CPU utilisation was much lower. As far as splitting the subprocess is concerned, the process to extract the data from the "stream" object into a string is the most time consuming process. – mang4521 Jan 03 '23 at 17:07
  • Can you provide a [mre], so that we may profile the code ourselves? – Slaw Jan 06 '23 at 07:12
  • Just to be clear. You’re saying that when you use `redirectOutput(ProcessBuilder.Redirect.appendTo(contentFile));`, followed by `FileUtils.readFileToString(contentFile)`, the subprocess is fast and the CPU time consumption is at the `readFileToString` call? You did verify this? Did you try `Redirect.to(contentFile)` instead of `Redirect.appendTo(contentFile)`, to be sure that you didn’t read more data than necessary? – Holger Jan 06 '23 at 15:31
  • @Holger I have updated the description [made it more compact]. The code snippet shared in the description spends most of the CPU resources while retrieving the cURL response data and during the process of converting the response stream into a string. – mang4521 Jan 07 '23 at 11:25
  • @Slaw the code snippet shared in the description should be sufficient for you to run it successfully. The getArgs() method requires you to pass a URL to scrape. Adding imports [if required] is going to be tough here, as there are other imports which are exclusive to the problem. – mang4521 Jan 07 '23 at 11:35
  • These are the results for the case that you are reading from the pipe. I’d like to see the results for the other case, when you redirect `curl` to a file and read afterwards. That’s important for narrowing down the problem. – Holger Jan 07 '23 at 13:38
  • @mang4521 would you mind describing what is present on "profiler screenshot"? Is it a `Sampler` or `Profiler` tab? How many executions of your scrape method it has captured during session? What is Java version? – Andrey B. Panfilov Jan 07 '23 at 15:49
  • @AndreyB.Panfilov This is a profiler tab. I have profiled the service using VisualVM. It has captured over 30K executions. I am on java 8. – mang4521 Jan 09 '23 at 13:14
  • @Holger redirecting the curl response into a file and then reading the file did not improve the CPU usage. The CPU usage remained the same as described with my original solution. – mang4521 Jan 09 '23 at 13:16
  • That’s understood. You said it multiple times. You keep ignoring the question: *which part* is consuming the CPU time when you perform the operation in two steps? Is it still the execution of the `curl` command or is it the reading of the result, which can only happen after the completion of the `curl` command in that case? – Holger Jan 09 '23 at 13:24
  • @Holger It is the process of reading the curl response as a stream which is the most CPU consuming process. Here the curl response is received as a stream and reading the stream takes up the most resource. Hope this answers your question. – mang4521 Jan 09 '23 at 13:32
  • So it boils down to “how to read a file into a string efficiently”. In your first example, you are reading line by line and appending to a `StringBuilder`, which will result in a string without the line breaks. Is this a requirement, or would an operation that reads the file as-is at once, keeping the line breaks, also work for you? – Holger Jan 09 '23 at 14:09
  • @Holger Actually no. Once a cURL request is made, the response has to be read as a ```Stream```. Now reading the response itself from the stream is a CPU intensive process [irrespective of whether it is converted into a string or not]. I have also tested the process where the line breaks are not removed. No change in the CPU usage observed there. – mang4521 Jan 09 '23 at 14:44
  • @mang4521 using `StringBuilderWriter/WriterOutputStream` from commons-io seems to be bit faster than `BufferedReader#readLine()`, writing to `ByteArrayOutputStream` and then calling `ByteArrayOutputStream.toString()` is faster than `StringBuilderWriter/WriterOutputStream` (I would think about pooling `ByteArrayOutputStream` objects), btw, what is your throughput expectations if it currently spends 81 secs for 30K executions (more that 300 rps)? – Andrey B. Panfilov Jan 09 '23 at 15:13
  • @AndreyB.Panfilov to be honest, speed isn't my primary concern here but the usage of CPU. My bandwidth would be around 25TPS per instance. There aren't expectations in terms of throughput atm. – mang4521 Jan 09 '23 at 15:27
  • Please stop interpreting everything I write as a question whether it changes the CPU usage. You don’t know what I would suggest, so you can’t know whether it will change the CPU usage. I asked about your requirements regarding the result. Your code example reads into a `StringBuilder`, eventually producing a single `String`. Now you write you need a `Stream`. The most efficient way to read a file into a single string is [`readString`](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/file/Files.html#readString(java.nio.file.Path)), by the way. Unfortunately not for Java 8. – Holger Jan 09 '23 at 15:34
  • @Holger I am looking out for certain text in the response string. This I can accomplish either by removing the line breaks or by keeping the response string as is. Either case I can devise a ```java.util.regex.Matcher``` search regex. – mang4521 Jan 09 '23 at 16:14
  • That’s important information. For example, you don’t need to call `toString()` on the `StringBuilder`, as the regex engine operates on the `CharSequence` interface, so you can search the `StringBuilder` directly. But if the pattern can not span multiple lines, you can simply perform a match attempt on each line right after reading it, instead of assembling the lines to a big string. You can also try `Scanner`. For the other approach, using the temporary file, you can also check [this answer](https://stackoverflow.com/a/52062570/2711488) for efficient pattern matching options. – Holger Jan 09 '23 at 18:48
  • @Holger Certainly let me take a look. What if I still want to turn the response into a string [as a future reference]? Are there efficient ways of handling this? – mang4521 Jan 10 '23 at 06:34
  • As said, the most efficient solution is to upgrade to a recent Java version and use [`Files.readString`](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/file/Files.html#readString(java.nio.file.Path)). In the best case, it does a single read and no copy operations, as it can access internals when converting the buffer to a string, which no other solution can. This depends on the chosen charset encoding and actual contents. Which has been discussed in [this answer](https://stackoverflow.com/a/70258672/2711488), for example. – Holger Jan 10 '23 at 07:54
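The per-line matching idea from the thread above can be sketched as follows, assuming the pattern never spans a line break (the `HTTPSTATUS` pattern corresponds to the marker that `getArgs()` appends via `--write-out`; the class name is arbitrary):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerLineMatcher {

    // Pattern for the HTTPSTATUS marker that the --write-out option
    // appends to the curl output.
    static final Pattern STATUS = Pattern.compile("HTTPSTATUS:(\\d{3})");

    // Scan line by line; no StringBuilder and no big intermediate String.
    static String findStatus(BufferedReader br) throws Exception {
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = STATUS.matcher(line);
            if (m.find())
                return m.group(1);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // a StringReader stands in for the process output stream
        BufferedReader br = new BufferedReader(
                new StringReader("<html>body</html>HTTPSTATUS:200"));
        System.out.println(findStatus(br));  // prints 200
    }
}
```

This avoids assembling the whole response in memory at all, which sidesteps both the `StringBuilder` growth and the final `toString()` copy.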

0 Answers