
I know DataFrames are supposed to be immutable, and I know it's not a great idea to try to change them. However, the file I'm receiving has a useless header row of 4 columns (the whole file has 50+ columns). So what I'm trying to do is just get rid of the very top row, because it throws everything off.

I've tried a number of different solutions (mostly found on here) like using .filter() and map replacements, but haven't gotten anything to work.

Here's an example of how the data looks:

H | 300 | 23098234 | N
D | 399 | 54598755 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 654 | 65465465 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 198 | 02982093 | Y | 09983 | 09823 | 02983 | ... | 0987098

Any ideas?

David Schuler
    Possible duplicate of [How to skip header from csv files in Spark?](http://stackoverflow.com/questions/27854919/how-to-skip-header-from-csv-files-in-spark) – zero323 Sep 23 '16 at 00:23

1 Answer


The cleanest way I've seen so far is to filter out the first row:

csv_rows = sc.textFile('path_to_csv')
# Grab the header line, then keep every row that doesn't match it
skippable_first_row = csv_rows.first()
useful_csv_rows = csv_rows.filter(lambda row: row != skippable_first_row)
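
One caveat: this compares each row against the header as a whole string, so it would also drop any later row that happens to be identical to the header, and first() triggers an extra job. A sketch of an alternative that drops only the physical first line, using mapPartitionsWithIndex to skip the first element of the first partition (variable names here are just for illustration):

from itertools import islice

csv_rows = sc.textFile('path_to_csv')
# Skip exactly one line, but only in partition 0 where the header lives
useful_csv_rows = csv_rows.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)

And since your sample data flags record types in the first column, filtering on that marker may be simplest of all, assuming every header row starts with 'H':

useful_csv_rows = csv_rows.filter(lambda row: not row.startswith('H'))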
blr