I want to start a Vertex AI AutoML Text Entity Extraction Batch Prediction Job, but from my own experience the texts (the "content" field in the JSONL structure) must also satisfy the following two requirements (a sample input line is shown right after this list):
- Every text's size must be between 10 and 10000 bytes: DONE
- Every text must be UTF-8 encoded: UNKNOWN
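To be clear, based on the two fields my query selects, each line of the batch prediction input JSONL would look roughly like this (the posting text here is just a made-up example):

{"content": "Experienced in SQL, BigQuery and Vertex AI", "mimeType": "text"}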
My original data is stored in BigQuery, so I'll have to export it to Google Cloud Storage for the batch prediction later. To take advantage of BigQuery's optimization, I want to accomplish the two previous tasks in the BigQuery data source table itself. I have checked Google's official documentation, and the closest related information I could find is this; however, it is not exactly what I need. BTW, the query looks as follows:
WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting) > 10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting) < 10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
    END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source` -- Modified for data protection
)
SELECT
  posting AS content, -- Something needs to be done here
  "text" AS mimeType
FROM mydata
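For the export step, my rough plan is to wrap that query in an EXPORT DATA statement so BigQuery writes the result straight to Cloud Storage as newline-delimited JSON (just a sketch; the bucket and path are placeholders, and the inner SELECT would of course carry the CASE logic from above):

EXPORT DATA
  OPTIONS (
    uri = 'gs://my-bucket/vertex-batch-input/*.jsonl', -- placeholder bucket and path
    format = 'JSON', -- newline-delimited JSON, one object per row
    overwrite = true
  )
AS
SELECT
  posting AS content, -- in practice, the CASE expression from the query above
  "text" AS mimeType
FROM `my-project.Machine_Learning_Datasets.sample-data-source`;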
And the schema of `my-project.Machine_Learning_Datasets.sample-data-source` looks as follows:
Field name | Type | Mode | Records |
---|---|---|---|
posting | STRING | NULLABLE | 100M |
Any ideas?