
I want to start a Vertex AI AutoML Text Entity Extraction batch prediction job, but in my own experience the texts (the "content" field in the JSONL structure) must also satisfy the following two requirements:

  • Every text must be between 10 and 10000 bytes in size: DONE (a quick check is sketched right after this list)
  • Every text must be UTF-8 encoded: UNKNOWN
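
For reference, the byte-size window can be verified directly in BigQuery with something along these lines (just a minimal sketch, reusing the table and column described further down):

SELECT
  COUNTIF(BYTE_LENGTH(posting) NOT BETWEEN 10 AND 10000) AS postings_out_of_range,  -- rows violating the 10-10000 byte window
  COUNT(*) AS total_postings
FROM
  `my-project.Machine_Learning_Datasets.sample-data-source`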

My original data is stored in BigQuery, so I'll have to export it to Google Cloud Storage for the batch prediction later (a sketch of the export step is included further below). To take advantage of BigQuery's optimizations, I want to accomplish the two previous tasks in the BigQuery source table itself. I have checked Google's official documentation, and the closest related information I have found is this, but it isn't exactly what I need. BTW, the query looks as follows:

WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting)>10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting)<10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
      END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`  -- Modified for data protection
)
SELECT
  posting AS content,  -- Something needs to be done here
  "text" AS mimeType
FROM mydata

And the `my-project.Machine_Learning_Datasets.sample-data-source` schema looks as follows:

Field name   Type     Mode       Records
posting      STRING   NULLABLE   100M
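
Regarding the export to GCS mentioned above, a rough sketch using BigQuery's EXPORT DATA statement could look like the following (the bucket path is only a placeholder, and the final SELECT would be replaced with the full query above):

EXPORT DATA OPTIONS (
  uri = 'gs://my-hypothetical-bucket/batch-input/*.jsonl',  -- placeholder path; a single '*' wildcard is required
  format = 'JSON',                                          -- newline-delimited JSON, one object per row
  overwrite = TRUE
) AS
SELECT
  posting AS content,
  "text" AS mimeType
FROM
  `my-project.Machine_Learning_Datasets.sample-data-source`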

Any ideas?

David Espinosa
  • Hi @David Espinosa, BigQuery supports UTF-8 encoding for both nested or repeated and flat data. For your requirement, you can use the encoding flag `--encoding=UTF-8` (or `-E=UTF-8`) as mentioned in this [documentation](https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_load). There is also a string function `SAFE_CONVERT_BYTES_TO_STRING(value)` which can be used to convert bytes into a string; it replaces any invalid UTF-8 characters with the Unicode replacement character, U+FFFD. Let me know if that helps. – Shipra Sarkar Nov 14 '22 at 12:22
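
As a quick illustration of `SAFE_CONVERT_BYTES_TO_STRING` (the bytes literal below is made up), an invalid UTF-8 byte such as 0xFF comes back as the replacement character:

-- 0xFF is not valid UTF-8, so it is replaced with U+FFFD; the result is "abc" plus the replacement character
SELECT SAFE_CONVERT_BYTES_TO_STRING(b'abc\xff') AS sanitized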

1 Answer


The following query did the job, FYI:

WITH mydata AS (
  SELECT
    CASE
      WHEN BYTE_LENGTH(posting)>10000 THEN LEFT(posting, 9950)
      WHEN BYTE_LENGTH(posting)<10 THEN CONCAT(posting, " is possibly a skill")
      ELSE posting
      END AS posting
  FROM `my-project.Machine_Learning_Datasets.sample-data-source`
)
SELECT
  REGEXP_REPLACE(posting, r'[^\x00-\x7F]+', '') AS content,
  "text/plain" AS mimeType
FROM mydata
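
In other words, the REGEXP_REPLACE above simply strips every non-ASCII character, so whatever remains is plain ASCII and therefore valid UTF-8. A quick illustration with a made-up value:

-- 'café ☕' loses the 'é' and the emoji, leaving 'caf ' (the plain space survives)
SELECT REGEXP_REPLACE('café ☕', r'[^\x00-\x7F]+', '') AS content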

UPDATE: This case has been considered for an improved workaround.

Thanks!

David Espinosa