I'm importing a text dataset to Google Vertex AI and got the following error:

Hello Vertex AI Customer,

Due to an error, Vertex AI was unable to import data into 
dataset [dataset_name].
Additional Details:
Operation State: Failed with errors
Resource Name: [resource_link]
Error Messages: There are too many rows in the jsonl/csv file. Currently we 
only support 1000000 lines. Please cut your files to smaller size and run 
multiple import data pipelines to import.

I checked the dataset, which I generated with pandas, and the actual CSV file: it has only 600k lines.

Has anyone run into a similar error?

ML_noob
  • Is it possible to share your dataset which you are trying to import? – Vishal K Nov 25 '21 at 12:06
  • It is the company's proprietary dataset, I'm afraid. After digging deeper, I think Vertex AI also has a 10MB dataset limit on top of the 1M-line limit: https://cloud.google.com/natural-language/automl/quotas – ML_noob Nov 26 '21 at 07:44
  • Hi, GCP support here. We would like to look at sample data from your input CSV file to investigate further. So, [can you raise a private thread in the issue tracker (referencing this question, as stated in the template) with the project ID, job ID and some sample data from your input CSV file (not the entire file or any PII)?](https://issuetracker.google.com/issues/new?component=1132178&template=1639003) After you've created the thread, please share the issue ID here so we can follow up. Note that issues in that component will only be accessible to you and GCP support. – Vishal K Dec 07 '21 at 13:36
  • Hi @VishalK, thanks for your comment. I've resolved the issue; it turned out to be a problem with my CSV formatting. I'll post an answer to this question. – ML_noob Dec 09 '21 at 05:07

1 Answer

It turned out to be an error in my CSV formatting.

I forgot to trim newlines and extra whitespace in my text dataset. Trimming them fixed the 1M-line error. But after doing that, I got another error telling me I had too many labels, even though there were only 2:

Error Messages: There are too many AnnotationSpecs in the dataset. Up to 
5000 AnnotationSpecs are allowed in one Dataset.
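
For the first fix (trimming newlines and extra whitespace), here is a minimal sketch of the kind of cleanup I mean. The column name `text`, the helper `clean_text_column`, and the sample rows are made up for illustration. My guess is that the importer counts physical lines, so a newline inside a text field inflates the count, but that is an assumption on my part and not something Vertex AI documents.

```python
import pandas as pd


def clean_text_column(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    """Return a copy with newlines and repeated whitespace collapsed in one column."""
    cleaned = df.copy()
    cleaned[column] = (
        cleaned[column]
        .str.replace(r"\s+", " ", regex=True)  # newlines, tabs, runs of spaces -> one space
        .str.strip()                            # drop leading/trailing whitespace
    )
    return cleaned


# Hypothetical sample data: the first row contains an embedded newline.
raw = pd.DataFrame(
    {
        "text": ["  first line\nsecond line  ", "tabs\tand   spaces"],
        "label": [0, 1],
    }
)
cleaned = clean_text_column(raw)

# Compare physical line counts of the CSV output before and after cleaning.
raw_lines = raw.to_csv(index=False, header=False).splitlines()
cleaned_lines = cleaned.to_csv(index=False, header=False).splitlines()
print(len(raw), len(raw_lines), len(cleaned_lines))
```

The DataFrame has 2 rows in both cases, but the uncleaned CSV spans more physical lines than that because of the embedded newline. That would explain a 600k-row dataset being reported as more than 1M lines.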

This second error occurred because I created the text dataset with the pandas DataFrame.to_csv() method. By default, to_csv() quotes a field only when it has to, for example when the text contains a "," (comma character). So the CSV file looks like:

"this is a sentence, with a comma", 0
this is a sentence without a comma, 1

Meanwhile, Vertex AutoML Text expects the CSV to look like this:

"this is a sentence, with a comma", 0
"this is a sentence without a comma", 1

i.e. the text has to be quoted on every line.

You can achieve this by writing your own CSV formatter or, if you want to keep using pandas to_csv(), by passing csv.QUOTE_ALL as the quoting parameter (note that this quotes every field, including the label). It looks like this:

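import csv

# df is the DataFrame holding the text and label columns.
# Quote every field, and leave out the index and the header row.
df.to_csv("file.csv", index=False, quoting=csv.QUOTE_ALL, header=False)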
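
To see the difference between the two quoting modes without writing a file, here is a small sketch. The DataFrame contents and the `count_records` helper are only illustrative. Calling to_csv() without a path returns the CSV as a string, and reading it back with the standard csv module is one way to check how many records a CSV parser actually sees.

```python
import csv
import io

import pandas as pd

# Illustrative data mirroring the two example rows above.
df = pd.DataFrame(
    {
        "text": [
            "this is a sentence, with a comma",
            "this is a sentence without a comma",
        ],
        "label": [0, 1],
    }
)

# Default quoting (csv.QUOTE_MINIMAL): only fields that need quotes get them.
default_csv = df.to_csv(index=False, header=False)

# csv.QUOTE_ALL: every field is quoted, including the label column.
quoted_csv = df.to_csv(index=False, header=False, quoting=csv.QUOTE_ALL)

print(default_csv)
print(quoted_csv)


def count_records(csv_text: str) -> int:
    """Count parsed CSV records, which can differ from the physical line count."""
    return sum(1 for _ in csv.reader(io.StringIO(csv_text)))


print(count_records(default_csv), count_records(quoted_csv))
```

Both strings parse to the same number of records. The only difference is whether every line starts with a quoted text field, which is what the import seemed to care about in my case.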
ML_noob