Get text of geographical binary PySpark

Question

I'm reading a parquet file generated from a MySQL table using AWS DMS service. This table has a field with the type Point (WKB). When I read this parquet file, Spark recognizes it as binary type, as per the code below:

file_dataframe = sparkSession.read.format('parquet')\
    .option('encoding', 'UTF-8')\
    .option('mode', 'FAILFAST')\
    .load(file_path)
file_dataframe.schema

And this is the result:

StructType(List(StructField(id,LongType,true),...,StructField(coordinate,BinaryType,true),...))

I tried casting the column to string, but this is what I get:

file_dataframe = file_dataframe.withColumn('coordinates_str', file_dataframe.coordinate.astype("string"))
file_dataframe.select('coordinates_str').show()

+--------------------+
|     coordinates_str|
+--------------------+
|U�...|
|U�...|
|U�...|
|U�...|
|@G
U�...|
|@G
U�...|
|@G
U�...|
| G
U�...|
| G
U�...|
| G
U�...|
| G
U�...|

This field in MySQL looks like this. If I right click the BLOB I can see its value in the pop-up window.

What I'm interested in doing is getting the POINT (-84.1370849609375 9.982019424438477) that I see in the MySQL viewer as a string column in a Spark Dataframe. Is this possible? I've been Googling about it, but haven't been able to find something that gets me in the right track.

score 0 · Answer 1 · answered Sep 21 '20 at 20:24

0

try this:

file_dataframe.withColumn('coordinates_str', decode(col('coordinate'), 'US-ASCII'))

answered Sep 21 '20 at 20:24

jayrythium

679
4
11

Hi, Jay, thanks for your question. Unfortunately I got the same result, but I found out about WKB and geospatial functions, which I'll answer below – Jose Daniel Chacón Bogarín Sep 22 '20 at 19:17
Sorry that didn't help, glad to hear that you've found a workaround. – jayrythium Sep 22 '20 at 19:25
take a look at this [link](https://stackoverflow.com/questions/63270475/read-a-binary-column-in-spark-using-java-language) – jayrythium Sep 22 '20 at 20:02

score 0 · Accepted Answer · answered Sep 22 '20 at 19:53

After Googling some more and reading the MySQL documentation on the Point datatype, I found this documentation page that mentions that points are stored as WKB or WKT, which are standard for saving geographical locations.

Many engines have support for this format and one can easily get the text, as well as many other geographical operations, with a specific set of functions.

Nevertheless, after some research I didn't find that Spark has these functions built-in. It seems to require some further configuration in order to expose these functions, so I ended up using AWS Athena which does have built-in support. Athena geospatial documentation can be found here.

Get text of geographical binary PySpark

2 Answers2