I have an RDD which looks like this:

[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]

I want to calculate the average and the maximum of the last value in each row (1000, 2000, etc.) separately for each distinct value of the second entry (City/Metro). I am using the following code to collect the "City" values:

rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()

However, I get an error, most likely because of the non-numeric strings (e.g. "Unkown") in the series.
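The failure is easy to reproduce: float() raises a ValueError for any string that is not a valid number, and a single bad value aborts the whole collect:

float("1000")    # 1000.0
float("Unkown")  # ValueError: could not convert string to float: 'Unkown'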

How can I filter out the rows whose last value is a non-numeric string or null (i.e. keep only the values convertible to numbers), and then calculate the max and average?

mah65

1 Answer

Try this.

# strip stray double quotes from every field
rdd = rdd.map(lambda l: [x.replace('"', '') for x in l])
# keep rows with a 4th field, a known category, and a digit-only value; emit (category, value)
rdd = rdd.filter(lambda l: len(l) > 3) \
    .filter(lambda l: l[1] in ['City', 'Metro']) \
    .filter(lambda l: l[3].isdigit()) \
    .map(lambda l: (l[1], int(l[3])))
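Note that str.isdigit() only matches strings made entirely of decimal digits, so values such as "1000.5" or "-200" would also be dropped. If those can occur, the .isdigit() step can be swapped for a try/except parser (a sketch; to_float is a hypothetical helper, not part of the original answer):

def to_float(s):
    # return the parsed float, or None when the value is missing or not numeric
    try:
        return float(s)
    except (TypeError, ValueError):
        return None

rdd = rdd.filter(lambda l: len(l) > 3) \
    .filter(lambda l: l[1] in ['City', 'Metro']) \
    .map(lambda l: (l[1], to_float(l[3]))) \
    .filter(lambda kv: kv[1] is not None)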

# aggregateByKey keeps a (sum, count) pair per key; the combiner must merge the pairs element-wise
rdd_avg = rdd.aggregateByKey((0, 0), lambda a, v: (a[0] + v, a[1] + 1),
                             lambda a, b: (a[0] + b[0], a[1] + b[1])).mapValues(lambda x: x[0] / x[1])
rdd_max = rdd.reduceByKey(max)

print(rdd_avg.collect())
print(rdd_max.collect())

[('Metro', 1333.3333333333333), ('City', 2000.0)]
[('Metro', 2000), ('City', 2000)]
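If you prefer a single shuffle, both statistics can be folded into one aggregateByKey accumulator (a sketch, not part of the original answer):

# accumulator is (sum, count, max); seqOp folds one value in, combOp merges two accumulators
stats = rdd.aggregateByKey(
    (0, 0, float('-inf')),
    lambda a, v: (a[0] + v, a[1] + 1, max(a[2], v)),
    lambda a, b: (a[0] + b[0], a[1] + b[1], max(a[2], b[2])),
).mapValues(lambda t: (t[0] / t[1], t[2]))
print(stats.collect())  # [(key, (avg, max)), ...]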
Lamanus
  • Thanks. That was great! There may be some other string values like "Unknown". How can I exclude all of them and keep only those that can be converted to digits? Also, you calculated the summation; how do I get the average and max? – mah65 Aug 23 '20 at 09:34
  • check this again. – Lamanus Aug 23 '20 at 09:48
  • Thanks. rdd.reduceByKey(lambda a, b: a + b) gives me summation, not average. – mah65 Aug 23 '20 at 11:01
  • Used one of the options in here for averaging: https://stackoverflow.com/questions/29930110/calculating-the-averages-for-each-key-in-a-pairwise-k-v-rdd-in-spark-with-pyth – mah65 Aug 23 '20 at 11:10
  • Oh, I missed that. sorry. – Lamanus Aug 23 '20 at 11:22