I've been struggling with this for a long time, and I'd be happy if someone could help me resolve the following issue.
I have this table:
+-----+-------------+
|index|       Salary|
+-----+-------------+
|    1|200 - 300 PA.|
|    2|      400 PA.|
|    3|100 - 200 PA.|
|    4|700 - 800 PA.|
+-----+-------------+
The Salary column is of String type. I want to replace each string in Salary with the average of the range it contains (or, if there is no range, just the number), so that the data is numeric rather than String. I want to produce this table:
+-----+------+
|index|Salary|
+-----+------+
|    1|   250|
|    2|   400|
|    3|   150|
|    4|   750|
+-----+------+
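To make the rule concrete, here is roughly what I mean for a single value, as a plain-Python sketch (no Spark involved). The function name salary_midpoint is just something I made up for illustration, and it assumes each value contains one or two whole numbers. What I can't work out is how to apply this kind of logic to the DataFrame column.

```python
import re

def salary_midpoint(text):
    """Return the average of the whole numbers found in a salary string."""
    # Pull out every run of digits, e.g. "200 - 300 PA." -> [200, 300]
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return None  # no number found in the string
    # A single number averages to itself; two numbers give the midpoint
    return sum(numbers) / len(numbers)

for s in ["200 - 300 PA.", "400 PA.", "100 - 200 PA.", "700 - 800 PA."]:
    print(s, "->", salary_midpoint(s))
```

For the four rows above this gives 250.0, 400.0, 150.0 and 750.0.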
I tried doing it by first splitting each Salary value into an array, so that it looks like this:
["100", "-", "300", "PA."]
so that I could extract the numbers from the string. I tried the following, but it looks bad and it's not working:
curr = outDF.rdd.map(lambda rec: rec[:]).map(lambda rec : rec[0])
curr = curr.map(lambda t : (t[1], t[3])).toDF()
new_df = curr.withColumn("_1", custProdSpending["_1"].cast(IntegerType()))