
The java.util.regex.Matcher methods replaceFirst(...) and replaceAll(...) return Strings, which (with the default heap size) may well cause an OOME for inputs as large as 20-50M characters. These two methods can easily be rewritten to write to a Writer rather than construct strings, effectively eliminating one point of failure.
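
A minimal sketch of what such a rewrite might look like, not a drop-in for the JDK methods. The class and method names (StreamingReplace, replaceAll, copy) are made up for illustration, and the replacement is supplied as a function of the match rather than as a "$1"-style replacement string, which is a simplification. Text between matches is copied in fixed-size chunks so that no large intermediate String is built:

```java
import java.io.IOException;
import java.io.Writer;
import java.util.function.Function;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names here are illustrative, not part of the JDK.
public class StreamingReplace {
    private static final int CHUNK = 8192;

    /** Copies input[start, end) to out in fixed-size chunks, never building a large String. */
    private static void copy(CharSequence input, int start, int end, Writer out) throws IOException {
        char[] buf = new char[CHUNK];
        int pos = start;
        while (pos < end) {
            int n = Math.min(CHUNK, end - pos);
            for (int i = 0; i < n; i++) {
                buf[i] = input.charAt(pos + i);
            }
            out.write(buf, 0, n);
            pos += n;
        }
    }

    /** Writes input to out, replacing every match with whatever replacer returns for it. */
    public static void replaceAll(Pattern pattern, CharSequence input,
                                  Function<MatchResult, String> replacer, Writer out) throws IOException {
        Matcher m = pattern.matcher(input);
        int last = 0;
        while (m.find()) {
            copy(input, last, m.start(), out);   // unmatched text before this match
            out.write(replacer.apply(m));        // Matcher implements MatchResult
            last = m.end();
        }
        copy(input, last, input.length(), out);  // tail after the last match
    }
}
```

This only removes the output String; the input still has to be available as a CharSequence.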

The Matcher's factory method, however, only accepts a CharSequence, so I am still likely to hit an OOME if I pass it a String/StringBuffer/StringBuilder holding the whole input.

How do I wrap a java.io.Reader so that it implements the CharSequence interface (given that my regexps may contain backreferences)? Is there any other solution which can perform regexp replacement in files and is not OOME-prone on large inputs?

In other words, how do I implement functionality similar to that of GNU sed in Java (as sed is known to tackle files as large as a couple of terabytes, while featuring the same support for extended regular expressions)?

NeronLeVelu
Bass
  • Do you only need to replace a single line at a time, or support "whole file in one go" replacements? – Jon Skeet Jun 10 '15 at 10:52
  • `Pattern.matcher()` doesn't create new String. The `Matcher` object created just hold a reference to the CharSequence passed in. – nhahtdh Jun 10 '15 at 10:55
  • `sed` handles files on a line-by-line basis, which is the reason it doesn't require a lot of memory for large files (unless the file has *very* long lines or the code instructs it to remember a lot of stuff). If you do the same in Java (i.e., read a line, work on it, print it, read the next line, rinse, repeat), you'll require similar amounts of memory. By the way, you may be interested in [Unix4j](https://code.google.com/p/unix4j/). – Wintermute Jun 10 '15 at 11:31
  • What about using CharBuffer that implements CharSequence - see http://stackoverflow.com/questions/25484164/charbuffer-on-top-of-a-memory-mapped-bytebuffer-without-using-lots-of-heap-space – JiriS Jun 10 '15 at 12:20
  • @nhahtdh `Pattern.matcher()` doesn't create a new string, but it needs a `CharSequence`. So in most cases I would read the whole file into a `String`/`StringBuilder` (potential OOME), then feed it to `Pattern.matcher()`. *That* was my point. – Bass Jun 10 '15 at 12:43
  • Ah, OK. You wanted to perform regex matching and replacement on character stream. – nhahtdh Jun 10 '15 at 13:29
  • Maybe this [answer](http://stackoverflow.com/a/22018249/2751621) could help you. There is also this [repository](https://github.com/fge/largetext) created by the author of that answer. – emartinelli Jun 10 '15 at 14:18
  • @emartinelli Thanks, this was exactly what I wanted -- a custom `CharSequence` implementation. Just had little idea of how to do it. – Bass Jun 10 '15 at 14:33
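
To make the custom `CharSequence` idea from the comments concrete, here is a rough sketch that exposes a (possibly memory-mapped) ByteBuffer as a CharSequence. It is not the code from the linked answer or repository. The class name ByteBufferCharSequence and its map helper are hypothetical, and it assumes a single-byte encoding (ISO-8859-1), so that character index equals byte index; multi-byte encodings such as UTF-8 need a real decoder. A single mapping is also limited to Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical wrapper; assumes ISO-8859-1 content (one byte per char).
public final class ByteBufferCharSequence implements CharSequence {
    private final ByteBuffer bytes;
    private final int start;
    private final int end;

    public ByteBufferCharSequence(ByteBuffer bytes) {
        this(bytes, 0, bytes.limit());
    }

    private ByteBufferCharSequence(ByteBuffer bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    /** Maps a file read-only; the mapping lives off-heap and stays valid after the channel closes. */
    public static ByteBufferCharSequence map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return new ByteBufferCharSequence(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) (bytes.get(start + index) & 0xFF); // absolute get, no position change
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new ByteBufferCharSequence(bytes, start + from, start + to); // a view, no copy
    }

    @Override
    public String toString() {
        // Copies this (sub)sequence onto the heap; cheap for small groups, costly for the whole file.
        byte[] copy = new byte[length()];
        for (int i = 0; i < copy.length; i++) {
            copy[i] = bytes.get(start + i);
        }
        return new String(copy, StandardCharsets.ISO_8859_1);
    }
}
```

Passing such an object to `Pattern.matcher()` keeps the input off the heap, and backreferences work because the matcher only calls `charAt`; `replaceAll` would still build the result as a String.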

1 Answer

Since what you need is actually sed's behaviour, you can run sed itself from Java with something like this:

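// bash performs the "> output" redirection; sed itself streams the input line by line
String[] cmdArray = {"bash", "-c", "sed 's/YourRegex/YourReplaceStr/' inputfile > output"};
Process runCmd = Runtime.getRuntime().exec(cmdArray);
int exitCode = runCmd.waitFor(); // wait for sed to finish; 0 means success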

I used bash in the example, but if you want to run it on Windows you can install sed through Cygwin and execute the same command, or install the native Windows build of sed, which you can download from here:

http://gnuwin32.sourceforge.net/packages/sed.htm

For Windows you could use:

String[] cmdArray = {"call", "sed 's/YourRegex/YourReplaceStr/' inputfile > output"};
Process runCmd = Runtime.getRuntime().exec(cmdArray);

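I don't have Windows, so I cannot test the command above. Since call is a cmd built-in rather than a standalone program, you will probably have to remove it, either invoking sed directly or going through cmd. Quoting rules under cmd also differ from bash, so the single quotes may need to become double quotes. Another alternative you can try is: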

String[] cmdArray = {"cmd", "/c", "sed 's/YourRegex/YourReplaceStr/' inputfile > output"};
Process runCmd = Runtime.getRuntime().exec(cmdArray);
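
All of the variants above rely on a shell for the output redirection. A possible way to sidestep the shell and its quoting rules, sketched here under the assumption that a sed executable is on the PATH, is to use ProcessBuilder and let Java do the redirection. The class name SedRunner and its methods are hypothetical:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; assumes a "sed" executable can be found on the PATH.
public class SedRunner {
    /** Builds a sed invocation that does not go through bash or cmd. */
    public static ProcessBuilder sedCommand(String script, File input, File output) {
        // Each argument is passed as-is, so the script needs no shell quoting.
        ProcessBuilder pb = new ProcessBuilder("sed", "-e", script, input.getPath());
        pb.redirectOutput(output);                          // takes the place of "> output"
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // show sed's error messages
        return pb;
    }

    /** Runs sed and returns its exit code (0 on success). */
    public static int run(String script, File input, File output)
            throws IOException, InterruptedException {
        return sedCommand(script, input, output).start().waitFor();
    }
}
```

For example, `SedRunner.run("s/YourRegex/YourReplaceStr/", new File("inputfile"), new File("output"))` would correspond to the bash command shown earlier.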

In this link you can find an example of running dir from Java, which you can adapt to use sed.
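
If launching an external process is not an option, the line-by-line processing that keeps sed's memory use low can be approximated in plain Java. The following is only a sketch: LineSed and substitute are made-up names, it mimics s/regex/replacement/ without the g flag (first match per line), it uses Java's regex and "$1" replacement syntax rather than sed's, and it always writes "\n" after each line, so the original line terminators are not preserved.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing library class.
public class LineSed {
    /** Replaces the first match of pattern on every line read from in, writing the result to out. */
    public static void substitute(Reader in, Writer out, Pattern pattern, String replacement)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        // Only one line is held in memory at a time.
        while ((line = reader.readLine()) != null) {
            writer.write(pattern.matcher(line).replaceFirst(replacement));
            writer.write('\n');
        }
        writer.flush(); // the caller remains responsible for closing in and out
    }
}
```

With files, the Reader and Writer could come from `Files.newBufferedReader(...)` and `Files.newBufferedWriter(...)`. Memory use then depends on the longest line rather than on the file size, but patterns that need to match across line boundaries will not work this way.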

Federico Piazza