I have a parquet dataset where I saved a byte_array.

I am using Apache Drill to query the dataset:

SELECT id, x, y FROM `dfs.root`.`./data`

This gives me:

+--------------------------------------+-------------+-------------+
|                  ID                  |      X      |      Y      |
+--------------------------------------+-------------+-------------+
| 0A3D27D8-DEC5-54D6-6A8E-8FD5CF721E1C | [B@654e7f63 | [B@39a668e8 |
+--------------------------------------+-------------+-------------+

How do I convert the binary object ID to an actual Python byte_array when querying with PyDrill?

asynts
user1302023
  • STRING_BINARY or one of the CONVERT_FROM functions could help, but to say exactly which, please give more details. Parquet has two byte array types: binary and fixed_len_byte_array. To be interpreted correctly they are usually annotated with a logical data type, and many different data types can be represented as a byte array. Which logical data type do you have? Could you provide the schema of your Parquet file using parquet-tools? What was the original data (int, string, decimal, date)? – Vitalii Diravka Oct 03 '18 at 12:23

1 Answer

SELECT id, CONVERT_FROM(x, 'UTF8') as x, CONVERT_FROM(y, 'UTF8') as y FROM `dfs.root`.`./data`

You can find this info in Apache Drill documentation:
https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from

I think you mean fixed_len_byte_array. It is a primitive Parquet data type that can be used for the INTERVAL and DECIMAL logical data types. It looks like Drill supports both of them out of the box. If you didn't specify a logical data type for your fixed_len_byte_array, it is not clear how this data should be interpreted.
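If the column holds raw bytes with no logical type, one possible workaround is the STRING_BINARY function mentioned in the comment on the question, e.g. SELECT id, STRING_BINARY(x) AS x, STRING_BINARY(y) AS y FROM `dfs.root`.`./data`, and then undoing the escaping on the Python side. The sketch below is not tested against a Drill instance. It assumes STRING_BINARY passes printable ASCII through unchanged and writes every other byte, including the backslash itself, as \xHH. The helper name decode_string_binary and the sample value are made up for illustration.

```python
import re

# Matches one escape of the form \xHH (two hex digits).
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_string_binary(text: str) -> bytes:
    # Hypothetical helper: rebuild raw bytes from a STRING_BINARY-style string.
    # Assumption: anything that is not a \xHH escape is plain ASCII.
    out = bytearray()
    pos = 0
    for match in _HEX_ESCAPE.finditer(text):
        # Copy the literal ASCII run that precedes this escape.
        out += text[pos:match.start()].encode("ascii")
        # Convert the two hex digits back into a single byte.
        out.append(int(match.group(1), 16))
        pos = match.end()
    # Copy whatever literal text follows the last escape.
    out += text[pos:].encode("ascii")
    return bytes(out)


# Illustrative value only, not real query output.
sample = r"\x80\x03]q\x00(K\x01"
raw = decode_string_binary(sample)
```

With PyDrill you would apply the helper to the x and y fields of each returned row. If the escaping assumption does not hold for your Drill version, reading the Parquet files directly (as with fastparquet) avoids the round trip through text altogether.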

Vitalii Diravka
  • This doesn't work: I did not encode the data as a 'UTF8' string. It is a plain byte array, and I want the bytes, i.e. the actual binary that `[B@654e7f63` represents. This query returns: "�]q(}q(M��M�M��KM��KM��KM��KM��KM��K" – user1302023 Oct 02 '18 at 14:40
  • I have added info to the answer. Please provide schema info from parquet-tools to be sure. https://github.com/apache/parquet-mr/tree/master/parquet-tools#parquet-tools – Vitalii Diravka Oct 02 '18 at 22:13
  • It is NOT a fixed byte array, it is a variable byte array: https://github.com/apache/drill/blob/master/common/src/main/java/org/apache/drill/common/types/Types.java#L167 The Parquet file is not invalid, and fastparquet returns the byte array as, well, bytes – user1302023 Oct 03 '18 at 02:18