
I have a variable rawData of type DataFrame in my Spark/Scala code.

I would like to drop the first element, something like this:

rawData.drop(1)

However, the drop function is not available.

What's the simplest way of dropping the first element?

bsky
  • How do you know what's the first one? An RDD is distributed among the nodes. – Nikita Jul 12 '16 at 20:04
  • Because I'm assuming that every `Row` has an `id`. In my case, I read the data from a `csv` file so I'm assuming the first row of that file will become the `Row` with the smallest `id`. – bsky Jul 12 '16 at 20:06
  • This one is very close to yours: http://stackoverflow.com/questions/27854919/how-to-skip-header-from-csv-files-in-spark – Nikita Jul 12 '16 at 20:10
  • No it's not. That question refers to an `RDD`; I have a `DataFrame`. – bsky Jul 12 '16 at 20:12

1 Answer


To answer the question, we first have to clarify what exactly the first element of a DataFrame is. We are not talking about an ordered collection that lives on a single machine, but about a distributed collection with no particular order between partitions, so the answer is not obvious.

If you want to drop the first element from every partition, you can use:

df.mapPartitions(iterator => iterator.drop(1))
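To see which rows this removes, here is a plain-Scala sketch (no Spark required) that simulates a DataFrame's partitions as a sequence of collections; the partition layout and row values are assumptions for illustration:

```scala
// Simulate a DataFrame split into two partitions; each partition is a Seq of rows.
// mapPartitions(it => it.drop(1)) removes the first element of EVERY partition.
val partitions = Seq(
  Seq("header", "row1", "row2"), // partition 0
  Seq("row3", "row4")            // partition 1
)

val afterDrop = partitions.map(p => p.iterator.drop(1).toSeq)
// afterDrop == Seq(Seq("row1", "row2"), Seq("row4"))
```

Note that "row3" is lost as well, which is usually not what you want when skipping a CSV header.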

If you want to drop the first element from the first partition only, you can use:

val rdd = df.rdd.mapPartitionsWithIndex {
  case (index, iterator) => if (index == 0) iterator.drop(1) else iterator
}
sqlContext.createDataFrame(rdd, df.schema)
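The same plain-Scala simulation shows the difference: with the index check, only partition 0 loses its first element (again, the partition layout is an assumption for illustration):

```scala
// Simulate mapPartitionsWithIndex: drop the first element only from partition 0.
val partitions = Seq(
  Seq("header", "row1", "row2"), // partition 0
  Seq("row3", "row4")            // partition 1
)

val result = partitions.zipWithIndex.map {
  case (p, index) => if (index == 0) p.iterator.drop(1).toSeq else p
}
// result == Seq(Seq("row1", "row2"), Seq("row3", "row4"))
```

This only works as a header-skipping strategy if the header is guaranteed to land in partition 0, which holds for a single-file text source read in order but is not guaranteed in general.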

Neither solution is very graceful, and both feel like bad practice. It would be interesting to know the complete use case; there may be a better approach.
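For the CSV-header case mentioned in the comments, a common alternative is to filter the header out by value rather than by position (in Spark: take the first row with `rdd.first()` and filter rows equal to it). A plain-Scala sketch of the idea, with the row values as assumptions for illustration:

```scala
// Filter out the header row by value instead of by position.
// The Spark equivalent: val first = rdd.first(); rdd.filter(_ != first)
val rows = Seq("id,name", "1,alice", "2,bob")
val header = rows.head
val data = rows.filter(_ != header)
// data == Seq("1,alice", "2,bob")
```

This removes any row identical to the header, wherever it lands, so it does not depend on partition ordering.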

Michael Kopaniov