
I have a large String that is compressed with GZIP and stored as a BLOB in a database. When reading it back from the DB, I can recover the string like this:

        try (
             ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
             BufferedInputStream bufis = new BufferedInputStream(new GZIPInputStream(bis));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()
        ) {
            byte[] buf = new byte[4096];
            int len;
            // read() returns -1 at end of stream
            while ((len = bufis.read(buf)) != -1) {
                bos.write(buf, 0, len);
            }
            // Note: toString() with no argument uses the platform default charset
            retval = bos.toString();
        }

My problem is that for some input records the BLOB is very large, while I only need to grep 5-6 lines from it. I have to process these records in bulk, which drives up the memory footprint.

Is there a way to extract the content from the GZIP in chunks, so that I can discard the remaining chunks if I find those lines in the initial part?

Thanks for the help in advance.

pankaj_ar
    You are already extracting the GZIP in chunks of roughly 4k each (maybe not exactly, but that could be made exact). Could you be more specific about what you expect? It seems you want to deal with lines of text, in which case I suggest you wrap your GZIPInputStream in an InputStreamReader (specifying the charset), and then wrap that in a BufferedReader, which gives you the `readLine()` method. You'll then be able to process the content one text line at a time. See https://stackoverflow.com/questions/34954630/java-read-line-using-inputstream – GPI Sep 18 '20 at 13:36

1 Answer


Don’t read all the bytes from the BLOB into memory at once. Read your BLOB as an InputStream.

Use a BufferedReader to read and check one line at a time.

A BufferedReader wraps another Reader. To translate your decompressing InputStream into a Reader, use InputStreamReader. It is very important that you specify the charset of the text you’re decompressing; you do not want to rely on the default charset of whatever computer you happen to be running on, since it could be different depending on where you run it.

So it would look something like this:

List<String> matchingLines = new ArrayList<>();
String targetToMatch = "pankaj";

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (line.contains(targetToMatch)) {
            matchingLines.add(line);
        }
    }
}
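
The loop above scans the whole stream. Since the question asks about discarding the rest once the needed lines are found, here is a minimal sketch of stopping early. The class `GzipLineSearch`, the method `findFirstMatches`, the `maxMatches` parameter and the `gzip` test helper are hypothetical names introduced for illustration. It assumes the text is UTF-8 and that you know how many matching lines you need.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLineSearch {

    // Hypothetical helper: returns at most maxMatches lines containing target.
    // The loop stops reading (and therefore stops decompressing) as soon as
    // enough lines have been collected; try-with-resources closes the stream.
    static List<String> findFirstMatches(InputStream compressed, String target, int maxMatches)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader lines = new BufferedReader(
                new InputStreamReader(
                    new GZIPInputStream(compressed),
                    StandardCharsets.UTF_8))) {   // assumption: content is UTF-8
            String line;
            while (matches.size() < maxMatches && (line = lines.readLine()) != null) {
                if (line.contains(target)) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    // Demo-only helper that gzips a string in memory, standing in for the BLOB.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] blobBytes = gzip("alpha\nfoo pankaj 1\nbeta\nbar pankaj 2\npankaj 3\n");
        // With a real BLOB you would pass blob.getBinaryStream() instead.
        System.out.println(
            findFirstMatches(new ByteArrayInputStream(blobBytes), "pankaj", 2));
        // prints [foo pankaj 1, bar pankaj 2]
    }
}
```

Only one line plus the reader's internal buffers are held at a time, so memory use should not grow with the size of the decompressed text. Whether the BLOB itself is fetched lazily from the database depends on the JDBC driver.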

Since you mention grep, you can also use a regular expression to match lines, though I would prefer String.contains over a regular expression for performance reasons, unless you really need a regular expression.

List<String> matchingLines = new ArrayList<>();
Matcher matcher = Pattern.compile("(?i)pankaj.*ar").matcher("");

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (matcher.reset(line).find()) {
            matchingLines.add(line);
        }
    }
}
VGR