
I am trying to parse a CSV in Apps Script which has 52 columns and 126543 rows. The data size is 72 MB.

After parsing it with the Utilities.parseCsv method, the resulting data has 95104 rows, however the expectation is 126543, i.e. the same as the original row count.

After removing a few columns from the CSV and repeating the step above, the number of rows after parsing increased, but it still did not match the original row count.

Is there a size limitation with Utilities.parseCsv? I could not find any documentation mentioning such a limit, if it exists.

Kindly shed some light on this. Is there an alternative solution that does not have size constraints?

  • How much data do you have in the cells? I ran [this test](https://gist.github.com/benronkin/9b76c67d91fa062aef892b5edd1244a7) using your dimensions and a short string in every cell, and got the same number of rows that I'd started with. – Ben Sep 12 '22 at 05:53
  • @Ben - The total size of the CSV is 72 MB. I have updated it in the question as well now – ab.it.gcp Sep 12 '22 at 06:36
  • What is the source of the CSV file? Should the solution be limited to Google Apps Script built-in services, or could you upload the file, e.g., to BigQuery? What will you do with the file once it's parsed? – Rubén Sep 12 '22 at 14:49

1 Answer

If it had a size limitation, it would throw an error. The most probable reason for the reduced row count is an incorrectly formatted csv (improper double quotes) or a corrupt csv. Try checking each row's length:

const csvChecker_ = csv => {
  const arr = Utilities.parseCsv(csv);
  const len = arr[0].length; // expected column count, taken from the first row
  // Log every row whose column count differs from the first row
  arr.forEach((row, i) => row.length === len ||
    console.error(`Incorrect length at row ${i + 1}. Row parsed incorrectly: ${JSON.stringify(row)}`));
}
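
For example, a minimal sketch of running this check on a CSV fetched with UrlFetchApp, as described in the comments below (the URL here is a placeholder):

const checkRemoteCsv = () => {
  // Placeholder URL; substitute the real export link passed to UrlFetchApp.fetch
  const response = UrlFetchApp.fetch('https://example.com/export.csv');
  csvChecker_(response.getContentText());
}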

You can manually inspect that row to figure out the problem, or build a custom csv parser for your csv.
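
As a rough sketch of such a custom parser (assuming, as noted further down in the comments, that the file contains no quoted fields with embedded commas or line breaks):

// Minimal sketch, assuming no quoted fields; quoted commas or embedded
// newlines would break this and would need a proper state-machine parser.
const simpleCsvParse_ = csv =>
  csv.split(/\r?\n/)
     .filter(line => line !== '') // drop a possible trailing empty line
     .map(line => line.split(','));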

Related:

Importing CSV from URL that has line breaks within one of the fields

  • Assuming it's due to incorrect format, how can I identify which rows are invalid? The method in your answer will only return the valid rows, as Utilities.parseCsv probably filters out the invalid ones. – ab.it.gcp Sep 12 '22 at 09:44
  • @ab.it.gcp No, the method will return invalid rows. `Utilities.parseCsv` won't filter out invalid ones. – TheMaster Sep 12 '22 at 10:42
  • Then the issue is probably not due to invalid records, because as I mentioned in the initial post there are 126543 records in the CSV, but after performing Utilities.parseCsv the number of records is 95104. So many records are lost after parsing. I manually checked one of the records which is in the CSV but is not present after parsing, and I don't see anything invalid in it. – ab.it.gcp Sep 12 '22 at 10:49
  • @ab.it.gcp It's a matrix. You're only looking at one dimension. If the csv was `1,2\n3,4\n5,6` (a 3x2 matrix), it could be parsed as `[[1,2],[3,4],[5,6]]` (a 3x2 matrix), which is correct. But if there are errors in the csv, it may be parsed as `[[1,2,3,4],[5,6]]` (2 rows with lengths [4,2]) or `[["1,2","3,4"],[5,6]]` (still a 2x2 matrix). Looking only at one dimension, the number of records (3 vs 2), you can't say the data disappeared. You have to look at the other dimension. – TheMaster Sep 12 '22 at 10:53
  • `but is not present after parsing`. I'd ask how you confirmed it "is not present after parsing"? – TheMaster Sep 12 '22 at 10:57
  • Agreed. Ran this code to find the invalid rows, but no invalid rows were returned; nothing was printed: `const arr = Utilities.parseCsv(response); len = arr[0].length; for(var i=0; i` – ab.it.gcp Sep 12 '22 at 11:17
  • Hence, it seems like the records are lost. – ab.it.gcp Sep 12 '22 at 11:20
  • I'm not convinced that data disappeared. If you know that a record is clearly missing and it has a unique id, try `arr.flat().join(',').match(/15186.*/g)` where `arr` is the parsed `csv`, and `csv.match(/15186.*/g)`: both should return the correct id back. – TheMaster Sep 12 '22 at 11:24
  • @ab.it.gcp In any case, implementing a custom csv parser is easy: `arr=csv.split("\n").map(row => row.split(","))`, assuming you don't have double quotes in your csv. You can also implement this as a for loop and debug each step. For example, the first step, determining the number of rows: `const numRecords=csv.split("\n").length`. Related: https://stackoverflow.com/questions/57530297 You can also `setValues()` to a Google Sheets range to visualize the issue. – TheMaster Sep 12 '22 at 11:43
  • Tried the step from your second-to-last comment. It turns out that the value I am searching for is not present in the CSV itself (fetched via Apps Script). I am fetching the CSV through a link using UrlFetchApp.fetch. When I manually download the CSV from the same URL in a browser and search for the value, it is present. But when the same is done in Apps Script using UrlFetch, the records seem to have been lost. So it looks like the issue is not in parsing but in fetching the CSV itself. Apologies if this should be a separate question, but please let me know if you have any idea about it. Thanks. – ab.it.gcp Sep 12 '22 at 12:08
  • @ab.it.gcp I was thinking that may be the issue (if the csv parsing was correct). This is why you should start practicing [mcve]. In any case, it should be a separate question. If the csv itself has data missing, that's an issue on the server providing the csv. You need to contact them. – TheMaster Sep 12 '22 at 12:23
  • Noted. And sincere thanks for all the help here! However, the data is not missing in the original CSV; it is only when it is fetched via Apps Script using UrlFetchApp.fetch that data is lost. I'll post a separate question for that. – ab.it.gcp Sep 12 '22 at 12:28
  • @ab.it.gcp I'm sure that statement is wrong as well. If data were somehow lost during fetching, the data wouldn't be a perfectly parseable csv string. This is definitely your server. – TheMaster Sep 12 '22 at 12:32
  • Ok. When I manually hit the same URL (which is passed to UrlFetchApp) in my browser and download the CSV, it has all the records. This makes me think that the URL and CSV are fine. If the server had an issue, shouldn't it fail when done manually as well? – ab.it.gcp Sep 12 '22 at 12:37
  • @ab.it.gcp Even if what you're saying is true (that's a big if), that doesn't prove data is lost during the fetch. It only proves that the server acts differently depending on how the request is made, i.e., it purposely discriminates against the Google Apps Script user agent, if what you're saying is true. Do ask a new question, but stick to the facts and don't make any assumptions/interpretations... Others might have other ideas. – TheMaster Sep 12 '22 at 12:40
  • @ab.it.gcp [Quotas](https://developers.google.com/apps-script/guides/services/quotas?hl=en) might also be an issue: 50 MB per call. Usually servers compress data, even if the uncompressed size is larger. So, unless the compressed fetch size itself exceeds 50 MB, this shouldn't be an issue. – TheMaster Sep 12 '22 at 12:49
  • Yes, thank you. It looks like large data cannot be fetched via this method. I would need an alternative. – ab.it.gcp Sep 12 '22 at 13:31
  • @ab.it.gcp Look into the `Range` HTTP header and see if the server supports partial downloads. – TheMaster Sep 12 '22 at 13:37
  • Thanks. Yes, the server does support partial downloads, and I was able to use ranges to fetch the entire data by calling the service multiple times in a loop and concatenating the results (a sketch of this approach appears at the end of this thread). But the next step was to upload it to BigQuery, where I am facing an error. That's a different issue though; a separate question has been raised for that - https://stackoverflow.com/questions/73717307/big-query-insert-job-failing-with-empty-response-error-in-apps-script – ab.it.gcp Sep 14 '22 at 13:13
  • @ab.it.gcp Kindly add an answer to your previous questions and accept it. Secondly, why exactly do you want to upload to BigQuery? – TheMaster Sep 14 '22 at 14:07
  • Added an answer to the previous question. I am using BigQuery as a central source for many other reports. That is why I am uploading all the data (unfiltered) to BigQuery; from there, many customized reports will be generated by picking and choosing columns and filters depending on the report. – ab.it.gcp Sep 17 '22 at 14:33
  • Currently stuck at this - https://stackoverflow.com/questions/73717307/big-query-insert-job-failing-with-empty-response-error-in-apps-script Also logged an issue here for the same - https://issuetracker.google.com/issues/247052933 – ab.it.gcp Sep 18 '22 at 05:06
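
For reference, a minimal sketch of the partial-download approach mentioned above, assuming the server honours the `Range` header and reports the total size in a `Content-Range` header; the chunk size, the header-name casing, and the overall structure are assumptions, not a definitive implementation:

// Sketch only: fetches a large CSV in chunks via HTTP Range requests.
// Assumes the server returns 206 Partial Content with a Content-Range header.
const fetchLargeCsv_ = url => {
  const chunkSize = 10 * 1024 * 1024; // 10 MB per request (placeholder size)
  let csv = '';
  let start = 0;
  let total = Infinity;
  while (start < total) {
    const response = UrlFetchApp.fetch(url, {
      headers: { Range: `bytes=${start}-${start + chunkSize - 1}` }
    });
    csv += response.getContentText();
    // Content-Range looks like "bytes 0-10485759/72000000"; the figure after "/" is the total size.
    // Header-name casing may vary by server.
    const contentRange = response.getHeaders()['Content-Range'];
    if (!contentRange) break; // server ignored the Range header and sent the whole file
    total = Number(contentRange.split('/')[1]);
    start += chunkSize;
  }
  return csv;
}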