
Here is some data I want to process with Scala. The data looks like this:

    userId,movieId
    1,1172
    1,1405
    1,2193
    1,2968
    2,52
    2,144
    2,248

First I want to skip the first line (the header), then split each line on `","` and map it to `(userId, movieId)`.

This is my first time trying Scala, and everything is driving me insane. I wrote this code to skip the first line and split:

    rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1)
      else iter
    }.flatMap(line => line.split(","))

But the result is something like this:

    1
    1172
    1
    1405
    1
    2193
    1
    2968
    2
    52

I guess it's because of `mapPartitionsWithIndex`. Is there any way to correctly skip the header without changing the structure?

Lance Chuang
  • Possible duplicate of [How to skip header from csv files in Spark?](http://stackoverflow.com/questions/27854919/how-to-skip-header-from-csv-files-in-spark) – stholzm Mar 08 '17 at 07:41
  • I use the same way as it, but what I want is generating (userid, movieid) – Lance Chuang Mar 08 '17 at 07:44
  • This question is quite misleading. It is actually about the `flatMap` part. A better title would be "How to split CSV lines into tuples with Spark Scala". – stholzm Mar 08 '17 at 08:30

1 Answer


Ah, your question is not about the header, but about how to split the lines into `(userid, movieid)`? The problem is that `flatMap` flattens each `Array("1", "1172")` produced by `split` into individual elements, which is why your pairs fall apart. Instead of `.flatMap(line => line.split(","))`, try this:

    .map(line => line.split(",") match { case Array(userid, movieid) => (userid, movieid) })
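For reference, the `split` + pattern-match step can be tried in plain Scala without Spark; this is a minimal sketch using a hard-coded `Seq` in place of the RDD, with the same header-skipping and tuple-building logic:

```scala
object SplitDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the RDD: header line followed by data rows.
    val lines = Seq("userId,movieId", "1,1172", "1,1405", "2,52")

    // drop(1) skips the header; map keeps one tuple per line.
    // (flatMap would flatten each Array("1", "1172") into separate
    // elements, which is what produced the question's broken output.)
    val pairs = lines.drop(1).map { line =>
      line.split(",") match {
        case Array(userId, movieId) => (userId, movieId)
      }
    }

    pairs.foreach(println)  // prints (1,1172), (1,1405), (2,52)
  }
}
```

Note that the `match` will throw a `MatchError` on any line that does not have exactly two comma-separated fields, so malformed rows fail loudly rather than silently.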
stholzm