
I want to start a Vertex AI AutoML Text Entity Extraction batch prediction job, but in my own experience the texts (the "content" field in the JSONL structure) must also satisfy the following two requirements:

  • Every text must be between 10 and 10000 bytes in size: DONE (a quick check is sketched right after this list)
  • Every text must be UTF-8 encoded: UNKNOWN
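
For reference, the byte-size window can be verified directly in BigQuery with something along these lines (just a minimal sketch, reusing the table and column described further down):

SELECT
  COUNTIF(BYTE_LENGTH(posting) NOT BETWEEN 10 AND 10000) AS postings_out_of_range,  -- rows violating the 10-10000 byte window
  COUNT(*) AS total_postings
FROM
  `my-project.Machine_Learning_Datasets.sample-data-source`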

My original data is stored in BigQuery, so I'll have to export it to Google Cloud Storage for the batch prediction later (a sketch of the export step is included further below). To take advantage of BigQuery's optimizations, I want to accomplish the two previous tasks in the BigQuery source table itself. I have checked Google's official documentation, and the closest related information I have found is this, but it isn't exactly what I need. BTW, the query looks as follows:

WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting)>10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting)<10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
      END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`  -- Modified for data protection
)
SELECT
  posting AS content,  -- Something needs to be done here
  "text" AS mimeType
FROM mydata

And the `my-project.Machine_Learning_Datasets.sample-data-source` schema looks as follows:

Field name   Type     Mode       Records
posting      STRING   NULLABLE   100M
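
Regarding the export to GCS mentioned above, a rough sketch using BigQuery's EXPORT DATA statement could look like the following (the bucket path is only a placeholder, and the final SELECT would be replaced with the full query above):

EXPORT DATA OPTIONS (
  uri = 'gs://my-hypothetical-bucket/batch-input/*.jsonl',  -- placeholder path; a single '*' wildcard is required
  format = 'JSON',                                          -- newline-delimited JSON, one object per row
  overwrite = TRUE
) AS
SELECT
  posting AS content,
  "text" AS mimeType
FROM
  `my-project.Machine_Learning_Datasets.sample-data-source`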

Any ideas?

David Espinosa
  • Hi @David Espinosa, BigQuery supports UTF-8 encoding for both nested or repeated and flat data. For your requirement, you can use the encoding flag `--encoding=UTF-8` (or `-E=UTF-8`) as mentioned in this [documentation](https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_load). There is also a string function `SAFE_CONVERT_BYTES_TO_STRING(value)` which can be used to convert bytes into a string; it replaces any invalid UTF-8 characters with the Unicode replacement character, U+FFFD. Let me know if that helps. – Shipra Sarkar Nov 14 '22 at 12:22
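
As a quick illustration of `SAFE_CONVERT_BYTES_TO_STRING` (the bytes literal below is made up), an invalid UTF-8 byte such as 0xFF comes back as the replacement character:

-- 0xFF is not valid UTF-8, so it is replaced with U+FFFD; the result is "abc" plus the replacement character
SELECT SAFE_CONVERT_BYTES_TO_STRING(b'abc\xff') AS sanitized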

1 Answer


The following query did the job, FYI:

WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting)>10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting)<10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
      END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`
)
SELECT
  REGEXP_REPLACE(posting, r'[^\x00-\x7F]+', '') AS content,
  "text/plain" AS mimeType
FROM mydata
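
In other words, the REGEXP_REPLACE above simply strips every non-ASCII character, so whatever remains is plain ASCII and therefore valid UTF-8. A quick illustration with a made-up value:

-- 'café ☕' loses the 'é' and the emoji, leaving 'caf ' (the plain space survives)
SELECT REGEXP_REPLACE('café ☕', r'[^\x00-\x7F]+', '') AS content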

UPDATE: This case has been considered for an improved workaround.

Thanks!

David Espinosa