I need to parse multiple PDF files in one of the folders in my Google Drive and write the parsed information to a Google Sheet. (I have already worked with parsing Gmail messages, so I don't think that part will be a problem for me.)

However, my research indicates that I first need to import a library that can parse PDF files into my script editor.

I am trying to import the PDF.js library, but I cannot find its script ID. Instead, I am trying to copy the code into a script file that can then be added as a library to other scripts.

I have downloaded the ZIP file from the repository on GitHub: https://github.com/mozilla/pdf.js

However, I am not sure which file to copy into the script editor. Should it be the file called "Builder.js"?

Sorry, this is the first time I am interacting with GitHub.

EDIT: My current script looks something like this. Unlike with email, I cannot retrieve the contents of the PDF file as text, so I cannot pull out the information I need.

function getPDFfiles() {
  const pdfFolder = DriveApp.getFolderById("myfolderid");
  const files = pdfFolder.getFilesByType(MimeType.PDF);

  let pdfNames = [];

  while (files.hasNext()) {
    const file = files.next();
    const fileName = file.getName();
    pdfNames.push([fileName]);
  }

  generator.getRange(2, 1, pdfNames.length, pdfNames[0].length).setValues(pdfNames);
} // getPDFfiles function ends
Borgher
  • In the current stage, I'm worried that PDF.js might not be able to be directly used with Google Apps Script. How do you want to use this with Google Apps Script? And also, can I ask you about the detail of `I need to parse through multiple PDF files in one of the folders in my Google Drive, and return the parsed information into a Google Sheet.`? First, I would like to correctly understand your question and your possible direction. – Tanaike Jun 01 '23 at 00:27
  • Hi Tanaike, I just updated my question and added my script. Right now, I can only retrieve the file name, instead of the contents of the PDF – Borgher Jun 01 '23 at 00:39
  • Thank you for replying. From your updated question, can I ask you about the detail of `retrieve the contents of the PDF file in text form so that I can pull out information I need`? For example, do you want to just convert the PDF data to text data in this question? – Tanaike Jun 01 '23 at 00:49
  • Hi Tanaike, that is correct. These PDFs are all invoices, so I was thinking of retrieving the complete text and then using javascript to splice in specific areas to get strings such as date, invoice amount, etc. Hopefully that makes sense – Borgher Jun 01 '23 at 00:57
  • Thank you for replying. From your reply, I proposed a modified script as an answer. Please confirm it. If that was not useful, I apologize. – Tanaike Jun 01 '23 at 01:04

1 Answer

I believe your goal is as follows.

  • You want to convert PDF files to text data using Google Apps Script.

At the current stage, I'm worried that PDF.js might not be directly usable with Google Apps Script. So, in this case, I would like to propose a method that does not use PDF.js. How about modifying your script as follows?

Modified script:

This modified script uses the Drive API to convert each PDF to a Google Document, so please enable the Drive API under Advanced Google services.

function getPDFfiles() {
  const pdfFolder = DriveApp.getFolderById("myfolderid");
  const files = pdfFolder.getFilesByType(MimeType.PDF);
  const res = [];
  while (files.hasNext()) {
    const file = files.next();
    // Copy the PDF as a temporary Google Document so that its text can be read.
    const tempId = Drive.Files.copy({ mimeType: MimeType.GOOGLE_DOCS }, file.getId(), { supportsAllDrives: true }).id;
    const text = DocumentApp.openById(tempId).getBody().getText();
    DriveApp.getFileById(tempId).setTrashed(true); // or Drive.Files.remove(tempId);
    const fileName = file.getName();
    res.push([fileName, text]);
  }
  if (res.length === 0) return; // No PDF files were found, so there is nothing to write.

  const generator = SpreadsheetApp.getActiveSheet(); // Please set your sheet.
  generator.getRange(2, 1, res.length, res[0].length).setValues(res);
}
  • When this script is run, each PDF is converted to a temporary Google Document, the text is retrieved from that document, and the temporary document is moved to the trash. Then the file names and the converted text are put into the spreadsheet.
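
One thing to watch when putting whole documents into cells: as far as I know, a single Google Sheets cell accepts at most 50,000 characters, so `setValues` may fail for a long PDF. The following is only a minimal sketch of a guard under that assumption. `toSheetRows` and `MAX_CELL_CHARS` are hypothetical names of mine and are not part of Apps Script.

```javascript
// Hypothetical helper, not part of Apps Script.
// Assumption: a single Sheets cell holds at most 50,000 characters.
const MAX_CELL_CHARS = 50000;

// Takes [fileName, text] pairs and returns rows whose text fits in one cell.
function toSheetRows(entries) {
  return entries.map(([fileName, text]) => [
    fileName,
    text.length > MAX_CELL_CHARS ? text.slice(0, MAX_CELL_CHARS) : text,
  ]);
}
```

If you use it, you could pass `toSheetRows(res)` to `setValues` instead of `res`. Text beyond the limit is simply cut off, so this only suits cases where the values you need appear early in the document.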

Note:

  • This sample script retrieves the text data from the PDF data. If you want to retrieve specific text from that text data, I think that can also be achieved from the Google Document.
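
For picking out values such as the date and the invoice amount from the retrieved text, a regular expression per field might be enough. The following is only a sketch: I don't know the layout of your invoices, so the labels "Invoice Date:" and "Total Due:" and the names `INVOICE_PATTERNS` and `extractInvoiceFields` are assumptions of mine. Please adjust the patterns to your actual PDFs.

```javascript
// Hypothetical patterns: the labels below are assumptions, not taken from your invoices.
const INVOICE_PATTERNS = {
  date: /Invoice Date:\s*(\d{4}-\d{2}-\d{2})/i,
  amount: /Total(?: Due)?:\s*\$?([\d,]+\.\d{2})/i,
};

// Returns an object with one entry per pattern; a missing field becomes "".
function extractInvoiceFields(text) {
  const fields = {};
  for (const [name, pattern] of Object.entries(INVOICE_PATTERNS)) {
    const match = text.match(pattern);
    fields[name] = match ? match[1] : "";
  }
  return fields;
}
```

In the loop of `getPDFfiles`, you could then call `extractInvoiceFields(text)` and push `[fileName, fields.date, fields.amount]` instead of the full text. Because the conversion from PDF to Google Document can change spacing and line breaks, it is worth checking the raw text of one or two invoices before settling on the patterns.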

Tanaike