
I am trying to find an efficient way to do a pattern match on a ByteArrayOutputStream whose size exceeds String's max size.

Doing a pattern match on a ByteArrayOutputStream that fits into a single String is trivial:

private boolean doesStreamContainPattern(Pattern pattern, ByteArrayOutputStream baos) throws IOException {

    /*
     *  Append external source String to output stream...
     */

    if (pattern != null) {
        String out = new String(baos.toByteArray(), "UTF-8");
        if (pattern.matcher(out).matches()) {
            return true;
        }
    }

    /*
     *  Some other processing if no pattern match
     */
    return false;
}

But if the size of baos exceeds String max size, the problem turns into:

  1. Feeding baos into multiple Strings.
  2. "Sliding" the pattern matching over the concatenation of those multiple Strings (i.e. the original baos content).

Step 2 looks more challenging than Step 1, but I know that utilities like Unix sed do just that on a file.

What is the right way to accomplish that?
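For illustration, the two steps above could be sketched roughly as follows: read the data in fixed-size chunks and keep a tail of the previous chunks so that a match straddling a chunk boundary is still visible. This is only a sketch under stated assumptions. The class name `SlidingWindowSearch` and the parameters `chunkSize` and `maxMatchLength` are hypothetical, and the approach only holds if no match is longer than `maxMatchLength`; patterns with anchors (`^`, `$`) or lookbehind may behave differently than on the full text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class SlidingWindowSearch {

    /**
     * Returns true if the pattern is found anywhere in the reader's content.
     * Assumption (hypothetical parameter): no match is longer than maxMatchLength chars.
     */
    static boolean containsPattern(Reader reader, Pattern pattern,
                                   int chunkSize, int maxMatchLength) throws IOException {
        if (chunkSize < 1 || maxMatchLength < 1) {
            throw new IllegalArgumentException("chunkSize and maxMatchLength must be >= 1");
        }
        StringBuilder window = new StringBuilder(chunkSize + maxMatchLength);
        char[] buffer = new char[chunkSize];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            // Step 1: feed the next chunk into the window.
            window.append(buffer, 0, read);
            // Step 2: search the retained tail plus the new chunk.
            if (pattern.matcher(window).find()) {
                return true;
            }
            // Keep only the last (maxMatchLength - 1) chars: enough to cover any
            // match that starts before the boundary and ends in the next chunk.
            int keep = Math.min(window.length(), maxMatchLength - 1);
            window.delete(0, window.length() - keep);
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "aaaaaneedle42bbbb".getBytes(StandardCharsets.UTF_8);
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(containsPattern(reader, Pattern.compile("needle\\d+"), 4, 16));
    }
}
```

Memory use stays bounded by roughly `chunkSize + maxMatchLength` chars regardless of the input size, at the cost of re-scanning the retained tail for every chunk.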

datsb
  • I don't think it makes much sense to match against an `OutputStream`, no? You're not writing to the stream, you're reading it - so it should be an `InputStream`. – daniu Nov 04 '19 at 09:29

1 Answer


You can write a simple wrapper class to implement CharSequence from the stream:

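import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteArrayCharSequence implements CharSequence {
    private final byte[] array;

    public ByteArrayCharSequence(byte[] input) {
        array = input;
    }

    @Override
    public char charAt(int index) {
        // Each byte is treated as one char (ISO-8859-1 semantics); multi-byte
        // UTF-8 sequences are not decoded. The mask avoids sign extension.
        return (char) (array[index] & 0xFF);
    }

    @Override
    public int length() {
        return array.length;
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new ByteArrayCharSequence(Arrays.copyOfRange(array, start, end));
    }

    @Override
    public String toString() {
        // maybe test whether we exceeded max String length
        return new String(array, StandardCharsets.ISO_8859_1);
    }
}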

and then match by

private boolean doesStreamContainPattern(Pattern pattern, ByteArrayOutputStream baos) throws IOException {
    if (pattern != null) {
        CharSequence seq = new ByteArrayCharSequence(baos.toByteArray());
        if (pattern.matcher(seq).matches()) {
            return true;
        }
    }

    /*
     *  Some other processing if no pattern match
     */
    return false;
}

It's obviously rough around the edges with the cast to char, and using copyOfRange, but it should work for most cases and you can adjust for those where it doesn't.
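As one possible adjustment for the `copyOfRange` part, a variant could keep an offset and length into the shared array so that `subSequence` returns a view instead of a copy. This is a minimal sketch, not a drop-in replacement: the class name `ByteSliceCharSequence` is hypothetical, and it still assumes one byte per char (ISO-8859-1), so multi-byte UTF-8 content would not be decoded correctly.

```java
import java.nio.charset.StandardCharsets;

final class ByteSliceCharSequence implements CharSequence {
    private final byte[] array;
    private final int offset;
    private final int length;

    ByteSliceCharSequence(byte[] array) {
        this(array, 0, array.length);
    }

    private ByteSliceCharSequence(byte[] array, int offset, int length) {
        this.array = array;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // Mask to avoid sign extension of bytes >= 0x80.
        return (char) (array[offset + index] & 0xFF);
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start: " + start + ", end: " + end);
        }
        // Shares the backing array; no bytes are copied.
        return new ByteSliceCharSequence(array, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Assumes the slice fits into a String.
        return new String(array, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

Because the view shares the backing array, later changes to that array are visible through every slice; the array returned by `baos.toByteArray()` is already a fresh copy, so that should not matter here.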

daniu
  • Thank you and +1 for your help. Please note that your solution is also limited by the inherent size/length limit of CharSequence (Integer.MAX_VALUE). I now believe the solution should be more in the direction of using [Scanner](https://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html#findWithinHorizon%28java.lang.String,%20int%29) along with [FileOutputStream](https://docs.oracle.com/javase/7/docs/api/java/io/FileOutputStream.html) – datsb Nov 04 '19 at 12:07
  • Well, `Integer.MAX_VALUE` can address up to 2 GB of data, but I guess either way, `Scanner` is a good idea. You'll find it accepts an `InputStream` rather than an `OutputStream`, as I pointed out makes more sense in my previous comment to the question. – daniu Nov 04 '19 at 13:06
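
The `Scanner` direction mentioned in these comments could look roughly like the sketch below, which searches an `InputStream` with `Scanner.findWithinHorizon`. The class name `ScannerSearch` and the UTF-8 charset are assumptions for illustration. A horizon of `0` means the search is unbounded, and in that case the `Scanner` may buffer all of the input while looking for a match, so memory use is not guaranteed to stay small.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerSearch {

    /** Returns true if the pattern occurs anywhere in the stream (closes the stream). */
    static boolean containsPattern(InputStream in, Pattern pattern) {
        try (Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
            // Horizon 0: search without bound, ignoring delimiters.
            return scanner.findWithinHorizon(pattern, 0) != null;
        }
    }

    public static void main(String[] args) {
        byte[] data = "xxx needle42 yyy".getBytes(StandardCharsets.UTF_8);
        System.out.println(containsPattern(new ByteArrayInputStream(data), Pattern.compile("needle\\d+")));
    }
}
```

Passing a positive horizon instead of `0` limits how far ahead the search looks, which bounds the buffering but also means matches beyond the horizon are not found.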