RDD skip headers - Pyspark

Question

I want to read an RDD that has a header line and skip that header. I found a similar question, "How do I skip a header from CSV files in Spark?", but its answer (in Scala) is not working for me in PySpark:

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }

so I tried

def f(idx, iter): 
    if idx==0:
        iter.drop(1)
    else:
        yield list(iterator)
rdd2 = rdd.mapPartitionsWithIndex(f)

but it fails with AttributeError: 'generator' object has no attribute 'drop'.

Any help?

I found a simple way by collecting the header and filtering it out, but I want to learn more about how mapPartitions works. — Yong Hyun Kwon, Oct 31 '17 at 09:10

score 0 · Accepted Answer · answered Oct 31 '17 at 09:38

Try something like this:

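def f(idx, iter):
    # Materialize the partition's records into a list.
    # Note: the parameter name `iter` shadows the Python builtin.
    output = list(iter)
    if idx > 0:
        # Partitions other than the first are returned unchanged.
        return output
    else:
        # The first partition starts with the header, so skip one record.
        return output[1:]

rdd2 = rdd.mapPartitionsWithIndex(f)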
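
On why the original attempt fails: in PySpark the partition is handed to your function as a plain Python iterator, and Python iterators have no drop method (that one belongs to Scala's Iterator). If you would rather not build a list for the whole partition, a lazy variant is possible with itertools.islice. This is only a sketch: the name skip_header is made up for illustration, and it assumes the header is the first record of partition 0, which should hold for a single file read with textFile but not necessarily for several files.

```python
from itertools import islice

def skip_header(idx, iterator):
    # Assumption: the header is the first record of partition 0.
    if idx == 0:
        # Lazily skip one record and pass the rest through.
        return islice(iterator, 1, None)
    # All other partitions are returned untouched.
    return iterator

# With a real RDD (assumed to exist as `rdd`) the call would be:
#   rdd2 = rdd.mapPartitionsWithIndex(skip_header)

# Local check without Spark, using plain iterators as stand-in partitions.
first = list(skip_header(0, iter(["name,age", "a,1", "b,2"])))
other = list(skip_header(1, iter(["c,3", "d,4"])))
print(first)  # ['a,1', 'b,2']
print(other)  # ['c,3', 'd,4']
```

The function passed to mapPartitionsWithIndex only has to return something iterable, so both the list version above and this iterator version fit; the difference is that islice does not hold the whole partition in memory.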

ags29
