Using Drive API / DriveApp to convert from PDFs to Google Documents

Question

This problem has been successfully resolved. I am editing my post to document my experience for posterity and future reference.

The Task

I have 117 PDF files (average size ~238 KB) uploaded to Google Drive. I want to convert them all to Google Docs and keep them in a different Drive folder.

The Problem

I attempted to convert the files using Drive.Files.insert. However, under most circumstances, only 5 files could be converted this way before the function expires prematurely with this error

Limit Exceeded: DriveApp. (line #, file "Code")

where the line referenced above is when the insert function is called. After calling this function for the first time, subsequent calls typically failed immediately with no additional google doc created.

Approach

I used 3 main ways to achieve my goal. One was using the Drive.Files.insert, as mentioned above. The other two involved using Drive.Files.copy and sending a batch of HTTP requests. These last two methods were suggested by Tanaike, and I recommend reading his answer below for more information. The insert and copy functions are from Google Drive REST v2 API, while batching multiple HTTP requests is from Drive REST v3.

With Drive.Files.insert, I experienced issues dealing with execution limitations (explained in the Problem section above). One solution was to run the functions multiple times. And for that, I needed a way to keep track of which files were converted. I had two options for this: using a spreadsheet and a continuation token. Therefore, I had 4 different methods to test: the two mentioned in this paragraph, batching HTTP requests, and calling Drive.Files.copy.

Because team drives behave differently from regular drives, I felt it necessary to try each of those methods twice, one in which the folder containing the PDFs is a regular non-Team Drive folder and one in which that folder is under a Team Drive. In total, this means I had 8 different methods to test.

These are the exact functions I used. Each of these was used twice, with the only variations being the ID of the source and destination folders (for reasons stated above):

Method A: Using Drive.Files.insert and a spreadsheet

function toDocs() {
  var sheet = SpreadsheetApp.openById(/* spreadsheet id*/).getSheets()[0];
  var range = sheet.getRange("A2:E118");
  var table = range.getValues();
  var len = table.length;
  var resources = {
    title: null,
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [{id: /* destination folder id */}]
  };
  var count = 0;
  var files = DriveApp.getFolderById(/* source folder id */).getFiles();
  while (files.hasNext()) {
    var blob = files.next().getBlob();
    var blobName = blob.getName();
    for (var i=0; i<len; i++) {
      if (table[i][0] === blobName.slice(5, 18)) {
        if (table[i][4])
          break;
        resources.title = blobName;
        Drive.Files.insert(resources, blob);  // Limit Exceeded: DriveApp. (line 51, file "Code")
        table[i][4] = "yes";
      }
    }

    if (++count === 10) {
      range.setValues(table);
      Logger.log("time's up");
    }
  }
}

Method B: Using Drive.Files.insert and a continuation token

function toDocs() {
  var folder = DriveApp.getFolderById(/* source folder id */);
  var sprop = PropertiesService.getScriptProperties();
  var contToken = sprop.getProperty("contToken");
  var files = contToken ? DriveApp.continueFileIterator(contToken) : folder.getFiles();
  var options = {
    ocr: true
  };
  var resource = {
    title: null,
    mimeType: null,
    parents: [{id: /* destination folder id */}]
  };

  while (files.hasNext()) {
    var blob = files.next().getBlob();
    resource.title = blob.getName();
    resource.mimeType = blob.getContentType();
    Drive.Files.insert(resource, blob, options);  // Limit Exceeded: DriveApp. (line 113, file "Code")
    sprop.setProperty("contToken", files.getContinuationToken());
  }
}

Method C: Using Drive.Files.copy

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
    Logger.log(res) 
  }
}

Method D: Sending batches of HTTP requests

Credit for this function goes to Tanaike -- see his answer below for more details.

function toDocs() {
  var sourceFolderId = /* source folder id */;
  var destinationFolderId = /* destination folder id */;

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 20; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

What Worked and What Didn't

None of the functions using Drive.Files.insert worked. Every function using insert for conversion failed with this error

Limit Exceeded: DriveApp. (line #, file "Code")

(line number replaced with generic symbol). No further details or description of the error could be found. A notable variation was one in which I used a spreadsheet and the PDFs were in a team drive folder; while all other methods failed instantly without converting a single file, this one converted 5 before failing. However, when considering why this variation did better than the others, I think it was more of a fluke than any reason related to the use of particular resources (spreadsheet, team drive, etc.)
Using Drive.Files.copy and batch HTTP requests worked only when the source folder was a personal (non-Team Drive) folder.
Attempting to use the copy function while reading from a Team Drive folder fails with this error:

File not found: 1RAGxe9a_-euRpWm3ePrbaGaX5brpmGXu (line #, file "Code")

(line number replaced with generic symbol). The line being referenced is
```
var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
```
Using batch HTTP requests while reading from a Team Drive folder does nothing -- no doc files are created and no errors are thrown. Function silently terminates without having accomplished anything.

Conclusion

If you wish to convert a large number of PDFs to google docs or text files, then use Drive.Files.copy or send batches of HTTP requests and make sure that the PDFs are stored in a personal drive rather than a Team Drive.

Special thanks to @tehhowch for taking such an avid interest in my question and for repeatedly coming back to provide feedback, and to @Tanaike for providing code along with explanations that successfully solved my problem (with a caveat, read above for details).

I believe you are exceeding the API quota as mentioned [here](https://stackoverflow.com/questions/10311969/what-is-the-limit-on-google-drive-api-usage). Not sure if this is the current quota or it has changed `1000 request per 100 seconds` — Jack Brown, Mar 16 '18 at 05:11
You should be storing the continuation token after each file / set of files, or take an approach like I do in [this answer](https://stackoverflow.com/a/49265306/9337071) and make a "to-do" list to ensure you do not re-convert files. — tehhowch, Mar 16 '18 at 05:26
please include the full error so we can see which limit has exceeded. — Linda Lawton - DaImTo, Mar 16 '18 at 08:07
I don't think that Drive REST API has these limitations. Try getting rid of DriveApp altogether and substituting it with querying Drive API directly — Anton Dementiev, Mar 16 '18 at 14:30
@DaImTo Hi DalmTo, I quoted the full error in my post. That's the full error, I don't get anything more. — Manuel, Mar 16 '18 at 14:33
@AntonDementiev by that, do you mean I should also get rid of files.next()? — Manuel, Mar 16 '18 at 14:34
@DaImTo I haven't noticed a more complete description in either the log or in a prompt anywhere. Could you tell me where to look for it? — Manuel, Mar 16 '18 at 15:47
@Manuel What happens if you turn OCR off? I wonder if you are exceeding a hidden OCR rate limit. Also, what kinds of files are in the iterated folder? You never check it, and always "convert" back to the source mimetype (vs something like `MimeType.GOOGLE_DOCS`) — tehhowch, Mar 16 '18 at 22:05
@tehhowch the same problem occurs whether I turn OCR on or off. Turning it off and using google docs mimetype has the same effect as turning it on and using source mimetype (both produce doc files). Therefore, I don't think OCR is the cause. More importantly, it's something about the FileIterator that keeps stopping my function after 5 doc files are created. If I continue where I left off using the continuation token, it stops immediately, as noted in my post. But if I start from the beginning (by deleting the "contToken" property in my code), then it works for the same first 5 files. — Manuel, Mar 16 '18 at 22:26
@Manuel What happens if you comment out the conversion - can you iterate over the collection? i.e. `var files = []; while (files.hasNext()) { var file = files.next(); files.push([file.getName(), file.getId()]); Utilities.sleep(101); } Logger.log(files.length);` collects every file in the original folder? — tehhowch, Mar 16 '18 at 22:35
@tehhowch If I take out everything related to file conversion, and simply push the file name and id to an array, then yes. It works exactly as expected. It goes through all 117 PDFs, and this is the number that is shown in the log. I should add that it works even without calling the sleep function from Utilities. It is very fast. — Manuel, Mar 16 '18 at 22:47
I cannot confirm this with my own set of PDFs... I was able to use your exact code and convert my puny collection of 24 PDFs into docs :) I am suspecting this is due to some issue with your data being on a **Team Drive**. Can you reproduce this with a source folder that is not on a Team Drive? — tehhowch, Mar 16 '18 at 23:39
@tehhowch that is very strange and frustrating. I moved all the PDFs to a subfolder under My Drive and tested out the function. It went a little further this time -- 12 docs created before the function stopped with the same error. What did your PDFs have in them? My files are quite lengthy... 4 pages each. They are standardized forms. — Manuel, Mar 17 '18 at 00:29
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/167003/discussion-between-tehhowch-and-manuel). — tehhowch, Mar 17 '18 at 01:04
I have a proposal that I want to try for your situation. Can I propose it? — Tanaike, Mar 17 '18 at 01:23
@AntonDementiev I just noticed your comment again. I'm not familiar with the Drive API, only DriveApp (and other "App" services like SpreadsheetApp, DocumentApp, etc). But I will look into how to do it with the Drive REST API. If you can share a link to an article where it would be best for a beginner like me to start, that would be wonderful. — Manuel, Mar 17 '18 at 02:30
Thank you for giving a chance. I posted it. Please confirm it. — Tanaike, Mar 17 '18 at 02:38
@Tanaike your suggestion worked! All PDFs were successfully converted to google docs with no issues. The only condition was that the PDFs must be stored in a personal, non-Team Drive folder. Otherwise, there are errors (see my post for more details). — Manuel, Mar 18 '18 at 00:24
@Manuel note that Team Drives require special handling in the `optionalArgs` part of Drive API requests. By default, files within them are not searched. https://developers.google.com/drive/v2/web/enable-teamdrives#including_team_drive_content_fileslist — tehhowch, Mar 18 '18 at 21:54

score 5 · Accepted Answer · answered Mar 17 '18 at 02:37

You want to convert from PDF files in the folder to Google Documents. PDF files are in a folder of team drive. You want to import converted them to a folder of your Google Drive. If my understanding is correct, how about this method?

For the conversion from PDF to Google Document, it can convert using not only Drive.Files.insert(), but also Drive.Files.copy(). The advantage of use of Drive.Files.copy() is

Although Drive.Files.insert() has the size limitation of 5 MB, Drive.Files.copy() can use over the size of 5 MB.
In my envoronment, the process speed was faster than Drive.Files.insert().

For this method, I would like to propose the following 2 patterns.

Pattern 1 : Using Drive API v2

In this case, Drive API v2 of Advanced Google Services is used for converting files.

function myFunction() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* dest folder id */";
  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  while (files.hasNext()) {
    var res = Drive.Files.copy({parents: [{id: destinationFolderId}]}, files.next().getId(), {convert: true, ocr: true});
//    Logger.log(res) // If you use this, please remove the comment.
  }
}

Pattern 2 : Using Drive API v3

In this case, Drive API v3 is used for converting files. And here, I used the batch requests for this situation. Because the batch requests can use 100 API calls by one API call. By this, the issue of API quota can be removed.

function myFunction() {
  var sourceFolderId = "/* source folder id */";
  var destinationFolderId = "/* dest folder id */";

  var files = DriveApp.getFolderById(sourceFolderId).getFiles();
  var rBody = [];
  while (files.hasNext()) {
    rBody.push({
      method: "POST",
      endpoint: "https://www.googleapis.com/drive/v3/files/" + files.next().getId() + "/copy",
      requestBody: {
        mimeType: "application/vnd.google-apps.document",
        parents: [destinationFolderId]
      }
    });
  }
  var cycle = 100; // Number of API calls at 1 batch request.
  for (var i = 0; i < Math.ceil(rBody.length / cycle); i++) {
    var offset = i * cycle;
    var body = rBody.slice(offset, offset + cycle);
    var boundary = "xxxxxxxxxx";
    var contentId = 0;
    var data = "--" + boundary + "\r\n";
    body.forEach(function(e){
      data += "Content-Type: application/http\r\n";
      data += "Content-ID: " + ++contentId + "\r\n\r\n";
      data += e.method + " " + e.endpoint + "\r\n";
      data += e.requestBody ? "Content-Type: application/json; charset=utf-8\r\n\r\n" : "\r\n";
      data += e.requestBody ? JSON.stringify(e.requestBody) + "\r\n" : "";
      data += "--" + boundary + "\r\n";
    });
    var options = {
      method: "post",
      contentType: "multipart/mixed; boundary=" + boundary,
      payload: Utilities.newBlob(data).getBytes(),
      headers: {'Authorization': 'Bearer ' + ScriptApp.getOAuthToken()},
      muteHttpExceptions: true,
    };
    var res = UrlFetchApp.fetch("https://www.googleapis.com/batch", options).getContentText();
//    Logger.log(res); // If you use this, please remove the comment.
  }
}

Note :

If the number of API calls at 1 batch request is large (the current value is 100), please modify var cycle = 100.
If Drive API v3 cannot be used for team drive, please tell me. I can convert it for Drive API v2.
If the team drive is the reason of issue for your situation, can you try this after it copied PDF files to your Google Drive?

Reference :

Batching Requests

If these are not useful for you, I'm sorry.

This worked wonderfully! But only when the PDFs were stored in a personal drive rather than a team drive. Thank you, @Tanaike — Manuel, Mar 18 '18 at 00:19
@Manuel Thank you for your additional information. I could study from your question. Thank you, too. And I'm glad your issue was solved. — Tanaike, Mar 18 '18 at 08:46

score 1 · Answer 2 · answered Mar 16 '18 at 06:31

1

You can first of all fetch and store id of all files in a google sheet. Then you can proceed with processing each file normally by using it's id. Then after you have processed them mark that file as processed. And before processing a file check if that file is already processed.

If there are several files then you can also store the row number till where you have processed, next time continue after that.

Then at last create a trigger to execute your function every 10 minutes or so.

By this you can overcome execution time limit for single execution. API request quota and all will not be by-passed by this method.

answered Mar 16 '18 at 06:31

Umair Mohammad

4,489
2
20
34

This sounds like a good idea. I will try this later when I have access to my computer. I will let you know what comes of it. – Manuel Mar 16 '18 at 14:34
I tried something slightly different, actually. I edited my post to show the new code. I thought that, in theory, it should work very similarly to you method, and yet the new method I tried continued to fail. I am not so hopeful of using a spreadsheet to keep a tally of converted files, but I will try it nevertheless. – Manuel Mar 16 '18 at 22:02
Just leaving a comment to report my [lack of] progress so far. I used exactly this method -- after a doc file is created, I set a value in a 2D array corresponding to the spreadsheet tracking the conversions to indicate that the current file was converted. After every 10 loops (function initially stopped after 12), I saved the array to the range to update the sheet with the new marks. The first run was okay enough -- it created 12 files and then stopped. Subsequent runs ended in immediate failure with the error noted in my post. – Manuel Mar 17 '18 at 02:20
Ok, can you please post some screenshots , code snippet to demonstrate what you're doing exactly ? – Umair Mohammad Mar 17 '18 at 14:11
I actually managed to get it working. See my original post (edited) for more details. – Manuel Mar 18 '18 at 00:22