
I'm working with Node.js and GCP Data Loss Prevention to attempt to redact sensitive data from PDFs before I display them. GCP has great documentation on this.

Essentially you pull in the Node.js library and run this:

const DLP = require('@google-cloud/dlp');
const fs = require('fs');
const dlp = new DLP.DlpServiceClient();

// fs.readFileSync already returns a Buffer, so it can be base64-encoded directly
const fileBytes = fs.readFileSync(filepath).toString('base64');

// Construct image redaction request
const request = {
  parent: `projects/${projectId}/locations/global`,
  byteItem: {
    type: fileTypeConstant,
    data: fileBytes,
  },
  inspectConfig: {
    minLikelihood: minLikelihood,
    infoTypes: infoTypes,
  },
  imageRedactionConfigs: imageRedactionConfigs,
};

// Run image redaction request
const [response] = await dlp.redactImage(request);
const image = response.redactedImage;
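The snippet above leaves several variables undefined. A minimal sketch of what they might look like, assuming hypothetical values (the project ID, chosen info types, and redaction color are illustrative, not from the original):

```javascript
// Hypothetical example values for the variables the snippet above assumes.
const projectId = 'my-project';             // your GCP project ID (assumption)
const fileTypeConstant = 'IMAGE';           // ByteContentItem type enum value
const minLikelihood = 'LIKELIHOOD_UNSPECIFIED';
const infoTypes = [{name: 'EMAIL_ADDRESS'}, {name: 'PHONE_NUMBER'}];

// Redact every matching finding with a solid black box
const imageRedactionConfigs = infoTypes.map(infoType => ({
  infoType,
  redactionColor: {red: 0, green: 0, blue: 0},
}));
```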

So normally I'd get the file as a buffer, then pass it to the DLP function like the above. But I'm no longer getting our files as buffers. Since many files are very large, we now get them from FilesStorage as streams, like so:

return FilesStorage.getFileStream(metaFileInfo1, metaFileInfo2, metaFileInfo3, fileId)
      .then(stream => {
        return {fileInfo, stream};
      })

The question is: is it possible to perform DLP image redaction on a stream instead of a buffer? If so, how? I've found some other questions that say you can stream with ByteContentItem, and GCP's own documentation mentions "streams". But I've tried passing the returned stream from .getFileStream into the above byteItem['data'] property, and it doesn't work.

singmotor
  • @singmotor Are you directly passing the returned stream or converting that stream to fileBytes like you are doing it for the buffer in the 1st line of your code? – Abhijith Chitrapu Mar 24 '22 at 10:54
  • @AbhijithChitrapu I know I can pass a Buffer to DLP, but the problem is for big files that Buffer is too large for the DLP api. So, I'm trying to find out if I can pass a stream directly. – singmotor Mar 24 '22 at 23:38
  • @singmotor PDF files are hard to process by 3rd party applications or libraries and it's best to use Acrobat SDK to split stream files and send them for [buffering](https://developer.adobe.com/document-services/homepage). Then the incoming files can be directed to DLP and then processed and joined together. If each page in a PDF file is too large for DLP to process then you should try running Regex and splitting each page. – Abhijith Chitrapu Mar 30 '22 at 04:17

1 Answer


Chunking the stream up into buffers of an appropriate size is going to work best here. There are a number of approaches you can use to build buffers from a stream.

Potentially relevant: Convert stream into buffer?
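A minimal sketch of one such approach, assuming a standard Node.js Readable stream (such as the one returned by FilesStorage.getFileStream): collect the chunks into a single Buffer, then base64-encode it for DLP as before. The redactFromStream wrapper is hypothetical glue code, not part of the DLP API:

```javascript
// Collect a Readable stream's chunks into a single Buffer.
function streamToBuffer(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', chunk => chunks.push(chunk));
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks)));
  });
}

// Hypothetical glue: buffer the stream, then run the redaction request.
async function redactFromStream(dlp, request, stream) {
  const buffer = await streamToBuffer(stream);
  request.byteItem.data = buffer.toString('base64');
  const [response] = await dlp.redactImage(request);
  return response.redactedImage;
}
```

Note this still materializes the whole file in memory, so for files above the DLP request size limit you would need to split the input into independently valid pieces before sending each one.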

(A native stream interface is a good feature request, just not yet there.)

Jordanna Chord
  • Fingers crossed this becomes one of those answers that's updated in a year with: "now DLP natively supports streams"! – singmotor Apr 06 '22 at 00:09