I started off familiarizing myself with Spark by fiddling around with RDDs, but now I'm practicing with DataFrames, and I'm not quite sure what does and doesn't work with the DataFrame filter function.
For example, I want to filter my DataFrame by checking whether a column contains any of the strings in a list of strings.
With RDDs, this is what I did:
dataRDD.filter(lambda s: any(substring in s['DATE'] for substring in dateList))
And this seems to work fine. But when I try something similar with a DataFrame:
lambdaDF = df.filter(any(substring in df.DATE for substring in hours1))
I end up getting this error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I guess I'm not fully understanding the difference between the RDD filter function and the DataFrame filter function, and what I can and can't do with each.
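From the error message, my guess is that filter wants a single Column expression built with | rather than Python's any(). Something like this is what I have in mind (a rough, untested sketch that assumes Column.contains does the same substring check as my RDD version, and reuses the hours1 list from above):

from functools import reduce
from pyspark.sql import functions as F

# OR together one substring check per entry in hours1,
# since Python's any() can't combine Column expressions
condition = reduce(lambda a, b: a | b,
                   [F.col('DATE').contains(substring) for substring in hours1])
lambdaDF = df.filter(condition)

And if I wanted exact equality against the list instead of substring matching, I'm guessing df.filter(df.DATE.isin(dateList)) would be the equivalent. Is that the right direction?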
Any help or advice is appreciated. Thanks!