I have an RDD which looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
I want to calculate the average and max of the last value in each row (1000, 2000, etc.), separately for each distinct value of the second entry (City/Metro). I am using the following code to collect the "City" values:
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
However, I get an error, probably because float() cannot convert the non-numeric strings (e.g. "Unkown", "XXXX") or the missing value in the last row.
How can I filter out rows whose last value is a non-numeric string or null (i.e. keep only those convertible to a number), and then calculate the max and average?
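To make the goal concrete, here is a minimal sketch of the kind of thing I am after. It assumes each row is a Python list and that a missing last value shows up either as a shorter row, None, or an empty string. The helper names (to_float, last_values, max_and_mean) are my own, not part of PySpark; only filter, map, and stats are RDD methods.

```python
import math


def to_float(value):
    """Return value as a finite float, or None if it cannot be converted."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # TypeError covers None, ValueError covers "Unkown", "XXXX", ""
        return None
    # float() also accepts "nan" and "inf"; treat those as invalid here
    return number if math.isfinite(number) else None


def last_values(rdd, category):
    """Keep rows of one category and return their numeric last values."""
    return (
        rdd.filter(lambda row: len(row) > 3 and row[1] == category)
        .map(lambda row: to_float(row[3]))
        .filter(lambda value: value is not None)
    )


def max_and_mean(rdd, category):
    """Return (max, mean) for one category, or None if nothing is numeric."""
    stats = last_values(rdd, category).stats()  # single pass over the data
    if stats.count() == 0:
        return None
    return stats.max(), stats.mean()
```

With the sample data above, max_and_mean(rdd, 'City') should give (2000.0, 2000.0), since 2000 is the only City value that converts to a number.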