
I know DataFrames are supposed to be immutable, and I know it's not a great idea to try to change them. However, the file I'm receiving has a useless header row of 4 columns (the whole file has 50+ columns). So what I'm trying to do is just get rid of the very top row, because it throws everything off.

I've tried a number of different solutions (mostly found on here) like using .filter() and map replacements, but haven't gotten anything to work.

Here's an example of how the data looks:

H | 300 | 23098234 | N
D | 399 | 54598755 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 654 | 65465465 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 198 | 02982093 | Y | 09983 | 09823 | 02983 | ... | 0987098

Any ideas?

David Schuler
    Possible duplicate of [How to skip header from csv files in Spark?](http://stackoverflow.com/questions/27854919/how-to-skip-header-from-csv-files-in-spark) – zero323 Sep 23 '16 at 00:23

1 Answer


The cleanest way I've seen so far is to filter out the first row:

csv_rows = sc.textFile('path_to_csv')
# Grab the header line, then keep every row that doesn't match it
skippable_first_row = csv_rows.first()
useful_csv_rows = csv_rows.filter(lambda row: row != skippable_first_row)
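
One caveat: this compares each row against the header as a whole string, so it would also drop any later row that happens to be identical to the header, and first() triggers an extra job. A sketch of an alternative that drops only the physical first line, using mapPartitionsWithIndex to skip the first element of the first partition (variable names here are just for illustration):

from itertools import islice

csv_rows = sc.textFile('path_to_csv')
# Skip exactly one line, but only in partition 0 where the header lives
useful_csv_rows = csv_rows.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)

And since your sample data flags record types in the first column, filtering on that marker may be simplest of all, assuming every header row starts with 'H':

useful_csv_rows = csv_rows.filter(lambda row: not row.startswith('H'))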
blr