
I'm trying to create a column of tuples from two other columns in a Spark DataFrame.

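# Assumes an active SparkSession named `spark`
from pyspark.sql import functions as F

data = [('A', 4, 5),
        ('B', 6, 9)]
columns = ["id", "val1", "val2"]
sdf = spark.createDataFrame(data=data, schema=columns)

# Combine val1 and val2 into a single struct column
sdf.withColumn('values', F.struct(F.col('val1'), F.col('val2'))).show()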

What I got is:

[Screenshot of the `show()` output, in which the `values` column displays as {4, 5} and {6, 9}]

I need the `values` column to hold tuples. So instead of {4,5} {6,9}, I want (4,5) (6,9). Does anyone know what I did wrong? Thanks a lot.

zesla
  • This might be helpful - https://stackoverflow.com/questions/36840563/how-to-return-a-tuple-type-in-a-udf-in-pyspark – Vaebhav Aug 02 '21 at 07:35
  • Why do you need tuples ? – Steven Aug 02 '21 at 07:39
  • Does this answer your question? [Calculate the minimum distance to destinations for each origin in pyspark](https://stackoverflow.com/questions/68614421/calculate-the-minimum-distance-to-destinations-for-each-origin-in-pyspark) – Steven Aug 02 '21 at 15:06

1 Answer

That's not how Spark works.

Spark is a framework developed in Scala that runs on the Java Virtual Machine (JVM). It is not Python.
PySpark is a set of APIs that call the Scala methods to execute Spark from the Python language.

Therefore, Python types such as tuple do not exist in Spark. You have to use either:

  • Struct, which is close to a Python dict
  • Array, which is the equivalent of a list (probably what you need if you want something close to a tuple).

The real question is: why do you need tuples?


EDIT: According to your comment, you need tuples because you want to use haversine. But if you use a list (or a Spark Array), for example, it works perfectly fine:

# Use the haversine doc example but with lists
from haversine import haversine

lyon = [45.7597, 4.8422]
paris = [48.8567, 2.3508]

haversine(lyon, paris)
> 392.2172595594006
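
To illustrate why the container type should not matter here, below is a minimal sketch of a haversine-style distance written with only the standard library. `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names introduced for this example, not part of the `haversine` package, and the radius value is an assumption. Because the function only unpacks each point into two numbers, a list, a tuple, or any other two-element sequence behaves the same.

```python
import math

# Assumed mean Earth radius in kilometres (illustrative constant for this sketch)
EARTH_RADIUS_KM = 6371.0088


def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees.

    p1 and p2 may be lists, tuples, or any two-element sequence.
    """
    # Unpacking works for any iterable of length 2, so no tuple is required
    lat1, lon1 = (math.radians(v) for v in p1)
    lat2, lon2 = (math.radians(v) for v in p2)

    # Haversine formula
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


lyon = [45.7597, 4.8422]    # list
paris = (48.8567, 2.3508)   # tuple

# Mixing a list and a tuple is fine; both are just unpacked into lat/lon
distance = haversine_km(lyon, paris)
print(distance)
```

If a function really does insist on tuples, calling `tuple(values)` on a list read back from a Spark Array column is enough to convert it.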
Steven
  • Thank you very much for your response. The reason I need tuples is that in my real case I want to use the haversine function (from the haversine module) to calculate the distance between two locations, which requires the input to be a pair of tuples with lat and lon, e.g. `haversine ( (45.7597, 4.8422), (46.7431, 5.8422) )`. I actually have two `values` columns, between which I need to calculate the distance. Could you advise what I should do? – zesla Aug 02 '21 at 13:47
  • @zesla Can you update your question with [a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example)? Show what you did, what is going wrong, with code and error message. – Steven Aug 02 '21 at 14:22
  • @zesla I added an edit with array/list working perfectly fine. So update your question with your actual code/problem, that'd avoid [XY_problems](https://en.wikipedia.org/wiki/XY_problem) – Steven Aug 02 '21 at 14:38