0

I have loaded an uncompressed CSV file in BigQuery using Cloud Dataprep. When executing a query for analysis, the data it is not returning values properly.

For example, I have data in a table tab_status as shown here. When I do:

SELECT * FROM tab_status

It is showing all values. However, if I query on values:

SELECT * FROM tab_status trim(lower(Status)) like '%active%'

it is returning null and not giving any result.

I have checked the length which is more than the string length (shown here):

SELECT DISTINCT trim(lower(Status)), length(trim(lower(Status))) FROM tab_status

Now when I export the data from BigQuery and open it in Sublime text editor, it is showing a different format as below:

6630 5f0a 0061 0063 0074 0069 0076 0065
000a 0068 0065 006c 0064 000a 0062 006f
0075 006e 0063 0065 0064 000a 0075 006e
0073 0075 0062 0073 0063 0072 0069 0062
0065 0064 000a 

Can anyone suggest how to deal with those data in BigQuery, since it's showing correctly in the preview. How do I query on value? Are there any transformation / functions available?

Fernando C.
  • 201
  • 1
  • 11
  • Most probably you loaded some kind of unicode or bytestring, and not clear utf-8 text. – Pentium10 Feb 21 '20 at 17:48
  • yeah I guess so. Since it is loaded properly in bigquery , is there any way to read / work on the dat Ain bigquery? Any inbuilt function/transformation? – Prithwiraj Samanta Feb 21 '20 at 18:06
  • One update. In old UI I didn't find the issue in data but in new I can find some junk data as.�a�c�t�i�v�e� . I tried to remove � with bigquery REGEXP_REPLACE(status, r'([^\p{ASCII}]+)', '') but it didn't work. If I use select REGEXP_REPLACE('�a�c�t�i�v�e�', r'([^\p{ASCII}]+)', '') then it gives me active but select distinct REGEXP_REPLACE(status, r'([^\p{ASCII}]+)', '') from tab_status didn't work – Prithwiraj Samanta Feb 21 '20 at 21:03
  • Can you show the full schema for `tab_status` including the data types and mode for each column? – Nathan Griffiths Feb 24 '20 at 03:42

1 Answers1

0

It seems that your data is encoded in unicode. BigQuery expects UTF-8 bytes. Non-UTF8 bytes are basically decoded using UTF-8 and therefore do not convey the true meaning. You should convert your data to UTF-8 bytes before loading them to Bigquery.

I am not sure how you generated your data but here is a Stackoverflow thread that explains how to convert your data into UTF-8 using a Python script.

halfer
  • 19,824
  • 17
  • 99
  • 186
Fernando C.
  • 201
  • 1
  • 11