
What is the issue with this code in PySpark?

 raw_data = ["James,Smith,36636,M,3000",
    "Michael,Rose,40288,M,4000",
    "Robert,Williams,42114,M,4000",
    "Maria,Anne,39192,F,4000",
    "Jen,Mary,899,F,-1"
    ]

The code below throws an error: unresolved reference m

 dataRDD = spark.sparkContext.parallelize(raw_data)
 mappedRDD = dataRDD.map(lambda m: \
                        arr=m.split(",") \
                        (arr[0],arr[1]))

 print(mappedRDD.collect())

I rewrote the same logic in the style below and it works:

 dataRDD = spark.sparkContext.parallelize(raw_data)
 mappedRDD = dataRDD.map(lambda m: (m.split(",")[0],m.split(",")[1]))

1 Answer

It's not possible, because a lambda function accepts only a single expression. You tried to define an arr variable inside the lambda, i.e. an assignment statement, which is why it threw an error. Your second approach skips that definition, so the code works.

You can read more on that, e.g., here.
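If you do need the intermediate arr variable, a minimal sketch of the usual workaround is to move the logic into a regular def function, which may contain statements, and pass it to map (the function name here is illustrative):

    # A named function, unlike a lambda, may contain statements
    # such as the `arr` assignment (name split_to_name is illustrative)
    def split_to_name(record):
        arr = record.split(",")
        return (arr[0], arr[1])

    mappedRDD = dataRDD.map(split_to_name)

On Python 3.8+ you could also keep the lambda by using an assignment expression, e.g. lambda m: ((arr := m.split(","))[0], arr[1]), since := is an expression rather than a statement.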

Also, if you only need the first two elements of each record after splitting, you can streamline your code even further:

mappedRDD = dataRDD.map(lambda m: m.split(",")[:2])
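Note that with this slicing version, collect() returns two-element lists rather than tuples, e.g. [['James', 'Smith'], ['Michael', 'Rose'], ...].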
  • So, can I use mappedRDD = dataRDD.map(lambda elem: map_record_to_tuple(elem))? Here map_record_to_tuple is a user-defined function which accepts a string and returns a tuple – Surender Raja May 15 '22 at 13:26
  • Yup! But only if you work with RDDs. When switching to DataFrames you need to remember to register the UDF (see the sketch below). The same applies to the Spark SQL engine. You can find more on that, e.g., [here](https://stackoverflow.com/questions/70425427/what-is-the-difference-between-udf-from-sparksession-and-udf-from-pyspark-sql-fu) – mckraqs May 15 '22 at 15:16
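A minimal sketch of the DataFrame-side UDF registration mentioned in the comment above, assuming the same comma-separated strings are loaded into a single-column DataFrame (the column name value and the schema are illustrative):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, StringType

    # The return type of a UDF must be declared explicitly
    name_schema = StructType([
        StructField("first", StringType()),
        StructField("last", StringType()),
    ])

    # Wrap the plain Python function as a UDF so the DataFrame
    # engine can apply it column-wise
    map_record_udf = udf(lambda s: tuple(s.split(",")[:2]), name_schema)

    df = dataRDD.map(lambda s: (s,)).toDF(["value"])
    df.select(map_record_udf("value").alias("name")).show(truncate=False)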