Replace String in PDF with node.js | manipulating datastreams/buffers of pdfs

Question

I am trying to replace a placeholder string in a PDF programmatically (simple example: I want to change the string "SEI" to "1"). I can currently access the content of the PDF, convert it to a stream and/or buffer, and convert that buffer back to a PDF, but since I am not yet able to manipulate that stream/buffer correctly, I am basically only copying the PDF right now. When I use buffer.toString() and run a string replace on the result to turn "SEI" into "1", the buffer does hold the value "1" where "SEI" was previously, but it does not display correctly in the PDF (it only shows a ? character in a square), probably because I am not manipulating the buffer correctly.

I am using hummus.js to access the PDF data. The font of the relevant placeholder is "Frutiger Next Pro Bold" (if that matters).

Code:

const hummus = require('hummus')

async function replacetext(filePath) {
    const modPdfWriter = hummus.createWriterToModify(filePath, {modifiedFilePath: `${filePath}-modified.pdf`, compress: false})
    const numPages = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser().getPagesCount()

    for (let page = 0; page < numPages; page++) {
        const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile()
        const objectsContext = modPdfWriter.getObjectsContext()

        const pageObject = copyingContext.getSourceDocumentParser().parsePage(page)
        const textStream = copyingContext.getSourceDocumentParser().queryDictionaryObject(pageObject.getDictionary(), 'Contents')
        const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID()

        let data = []
        const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream)
        while (readStream.notEnded()) {
            const readData = readStream.read(10000)
            data = data.concat(readData)

        }   

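        // Buffer.from is a factory function, so it is called without `new`
        const redactedPdfPageAsString = Buffer.from(data).toString();

        // var replacedBuffer = redactedPdfPageAsString.replace("SEI", "1");
        // replace() and strToByteArray() are helper functions not shown here
        const replacedBuffer = replace(redactedPdfPageAsString, "SEI", "1");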
          
        objectsContext.startModifiedIndirectObject(textObjectID)

        const stream = objectsContext.startUnfilteredPDFStream();
        stream.getWriteStream().write(strToByteArray(replacedBuffer));
        objectsContext.endPDFStream(stream);

        objectsContext.endIndirectObject();
    }
    
    modPdfWriter.end()

    hummus.recrypt(`${filePath}-modified.pdf`, filePath)
    
}

I also tried node packages like stream-replace or buffer-replace but they were not working.

This is an excerpt of the buffer that contains the string "SEI":

/Span <</Lang (de-DE)/MCID 0 >>BDC BT 0 0 0 1 k /GS0 gs /T1_0 1 Tf 10 0 0 10 25.5118 814.9606 Tm (SEI)Tj ET EMC /Span <</Lang (de-DE)/MCID 1 >>BDC BT 10 0 0 10 39.5317 814.9606 Tm (-)Tj ET EMC /Span <</Lang (de-DE)/MCID 2 >>BDC BT 0 1 1 0 k /
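A side note on the buffer round trip described above: content streams can contain bytes outside the ASCII range, and decoding them with the default UTF-8 encoding of buffer.toString() is not guaranteed to be reversible. The following is only a minimal sketch, assuming the stream bytes are held in a Node.js Buffer; the sample bytes and the helper name replaceBytes are made up for illustration. It shows that a latin1 round trip preserves every byte. It does not address whether the font can actually display the replacement text.

```javascript
// Hypothetical sample: a text-showing operator followed by two bytes >= 0x80
const original = Buffer.concat([
  Buffer.from('(SEI)Tj ', 'latin1'),
  Buffer.from([0xe4, 0xff]),
]);

// latin1 maps every byte to exactly one code unit, so the round trip is lossless
const viaLatin1 = Buffer.from(original.toString('latin1'), 'latin1');

// The default utf8 decoding turns invalid sequences into U+FFFD,
// which re-encodes to different bytes
const viaUtf8 = Buffer.from(original.toString(), 'utf8');

console.log(viaLatin1.equals(original)); // true
console.log(viaUtf8.equals(original));   // false

// Hypothetical helper: replace every occurrence of `search` without touching other bytes
function replaceBytes(buf, search, replacement) {
  const text = buf.toString('latin1');
  return Buffer.from(text.split(search).join(replacement), 'latin1');
}

const replaced = replaceBytes(original, 'SEI', '1');
console.log(replaced.toString('latin1').slice(0, 6)); // (1)Tj
```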

Have you checked whether the font resource **T1_0** does support the character code '1'? (In case of embedded fonts usually only a subset is supported, the subset of character codes actually used.) — mkl, Oct 27 '20 at 17:35
For that specific file you can use a PDF internals browser like iText RUPS, PDFBox PDFDebugger, or the "Browse Internal PDF Structure" option of Adobe Acrobat Preflight. There may be javascript tools for that, too, but I don't know them. Alternatively share the file in question. — mkl, Oct 28 '20 at 09:39
Here is an example pdf, where i want to change "SEITENPLATZHALTER" to digits (between 1 and ~100) https://www.file-upload.net/download-14338588/example.pdf.html — KevGi_AWSi, Oct 28 '20 at 10:28
Your SEITENPLATZHALTER is drawn using a font named **T1_0** on the first two pages and **T1_1** on the remaining two pages; in either case the respective name resolves to the same PDF font object, and this font object is only subset embedded. More exactly it only offers the glyphs for the characters */space/comma/A/E/H/I/J/L/N/P/R/S/T/Z/a/d/e/i/l/m/n/o/p/r/s/t/u*. Thus, a simple search&replace in the content stream will not allow to replace that placeholder text by numbers. Another complication is that your SEITENPLATZHALTER is drawn in four parts: `[(SEITENPL)12 (A)48 (TZHAL)108.1 (TER)]TJ`. — mkl, Oct 28 '20 at 12:01
Do you have any ideas how I could solve that problem? The problem that "SEITENPLATZHALTER" is drawn in four parts can probably be solved by shortening the placeholder keyword to something simpler that still does not exist in German or English (e.g. AXZ or something like that). Then it is probably drawn in only one part. But for the other problem I currently have no idea how to work around it. Maybe it helps if I share the file with a number in that same font: https://www.file-upload.net/download-14338724/example2.pdf.html — KevGi_AWSi, Oct 28 '20 at 12:40
What you can try is to add on some page invisibly a string containing all the characters (in particular the digits) you need using the font in question. You can make it invisible by drawing it somewhere outside the page area or by drawing white on white in a minute font size. If you're lucky, that causes all those characters to be included in the embedded subset font in question. — mkl, Oct 28 '20 at 14:22
Other than that please be aware that PDF is explicitly not a format meant for editing but instead meant as a final format for documents looking identically on many devices. Thus, only try editing the page content if you know the document to support your editing actions. In particular don't expect to be able to replace text in arbitrary documents from external sources. — mkl, Oct 28 '20 at 14:24
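The two obstacles raised in the comments (the placeholder being split across a TJ array, and the subset font lacking digits) can be checked before attempting any replacement. The following is a rough sketch, not a general PDF parser: the function names textOfTJArray and missingGlyphs are hypothetical, the regular expression assumes simple literal strings without escapes or nested parentheses, and it assumes character codes correspond directly to the glyph names listed in the comment above.

```javascript
// Characters reported in the comments as available in the subset-embedded font
const SUBSET_CHARS = new Set(' ,AEHIJLNPRSTZadeilmnoprstu');

// Join the string operands of a TJ array, ignoring the kerning numbers between them.
// Assumption: literal strings only, no escapes or nested parentheses.
function textOfTJArray(operand) {
  const parts = operand.match(/\(([^()\\]*)\)/g) || [];
  return parts.map((p) => p.slice(1, -1)).join('');
}

// List the distinct characters of `text` that the subset font cannot draw
function missingGlyphs(text, available = SUBSET_CHARS) {
  return [...new Set(text)].filter((ch) => !available.has(ch));
}

const operand = '[(SEITENPL)12 (A)48 (TZHAL)108.1 (TER)]';
console.log(textOfTJArray(operand));                // SEITENPLATZHALTER
console.log(missingGlyphs(textOfTJArray(operand))); // []
console.log(missingGlyphs('42'));                   // [ '4', '2' ]
```

A non-empty result from missingGlyphs indicates that a plain search and replace in the content stream would produce text the embedded font cannot render.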
