I have a PySpark RDD and am trying to convert it into a DataFrame using a custom sampling ratio. Sometimes, though, I get the error below, which suggests that an empty RDD cannot be used to create the DataFrame:
ValueError: Can not reduce() empty RDD
Below is my code. As I said, it does not fail every time, only sometimes.
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
myrdd = sc.parallelize([
    (1, 638.55),
    (2, 638.55),
    (3, 638.55),
    (4, 638.55),
    (5, 638.55)
])
# Repeat the conversion many times to reproduce the intermittent failure
for i in range(100):
    print(i)
    df2 = sqlContext.createDataFrame(myrdd, samplingRatio=0.4)
When I set the sampling ratio to 1, it does not fail. I don't understand why the behavior is inconsistent. Am I missing something about how the sampling ratio works?
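One possible explanation, offered as an assumption rather than confirmed behavior: if createDataFrame infers the schema from a random sample in which each row is kept independently with probability samplingRatio, then a 5-row RDD will occasionally yield an empty sample, and reducing an empty sample would raise the error above. The plain-Python simulation below does not call Spark. The helper empty_sample_rate is hypothetical and only estimates how often such a sample would be empty.

```python
import random


def empty_sample_rate(n_rows, ratio, trials=100_000, seed=0):
    """Estimate how often a per-row random sample of n_rows keeps no rows.

    Assumption: each row is kept independently with probability `ratio`.
    This is a stand-in for Spark's sampling, not a call into Spark.
    """
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        # Count the rows that survive sampling in this trial
        kept = sum(1 for _ in range(n_rows) if rng.random() < ratio)
        if kept == 0:
            empty += 1
    return empty / trials


# Closed-form probability that all 5 rows are dropped at ratio 0.4
analytic = (1 - 0.4) ** 5
# Simulated estimate of the same probability
simulated = empty_sample_rate(5, 0.4)
print(analytic, simulated)
```

Under that assumption, about 8 in 100 loop iterations would hit an empty sample. That would be consistent with the failures being intermittent, and with a ratio of 1 never failing, since no rows would be dropped.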