0

I'm reading a parquet file generated from a MySQL table using AWS DMS service. This table has a field with the type Point (WKB). When I read this parquet file, Spark recognizes it as binary type, as per the code below:

file_dataframe = sparkSession.read.format('parquet')\
    .option('encoding', 'UTF-8')\
    .option('mode', 'FAILFAST')\
    .load(file_path)
file_dataframe.schema

And this is the result:

StructType(List(StructField(id,LongType,true),...,StructField(coordinate,BinaryType,true),...))

I tried casting the column to string, but this is what I get:

file_dataframe = file_dataframe.withColumn('coordinates_str', file_dataframe.coordinate.astype("string"))
file_dataframe.select('coordinates_str').show()

+--------------------+
|     coordinates_str|
+--------------------+
|U�...|
|U�...|
|U�...|
|U�...|
|@G
U�...|
|@G
U�...|
|@G
U�...|
| G
U�...|
| G
U�...|
| G
U�...|
| G
U�...|

This field in MySQL looks like this. If I right click the BLOB I can see its value in the pop-up window.

enter image description here

What I'm interested in doing is getting the POINT (-84.1370849609375 9.982019424438477) that I see in the MySQL viewer as a string column in a Spark Dataframe. Is this possible? I've been Googling about it, but haven't been able to find something that gets me in the right track.

2 Answers2

0

try this:

file_dataframe.withColumn('coordinates_str', decode(col('coordinate'), 'US-ASCII'))
jayrythium
  • 679
  • 4
  • 11
0

After Googling some more and reading the MySQL documentation on the Point datatype, I found this documentation page that mentions that points are stored as WKB or WKT, which are standard for saving geographical locations.

Many engines have support for this format and one can easily get the text, as well as many other geographical operations, with a specific set of functions.

Nevertheless, after some research I didn't find that Spark has these functions built-in. It seems to require some further configuration in order to expose these functions, so I ended up using AWS Athena which does have built-in support. Athena geospatial documentation can be found here.