
Here is some data I want to process with Scala. The data looks like this:

    userId,movieId
    1,1172
    1,1405
    1,2193
    1,2968
    2,52
    2,144
    2,248

First I want to skip the first line (the header), then split each line on `","` and map it to `(userId, movieId)`.

This is my first time trying Scala, and everything is driving me insane. I wrote this code to skip the first line and split:

    rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1)
      else iter
    }.flatMap(line => line.split(","))

But the result is something like this:

    1
    1172
    1
    1405
    1
    2193
    1
    2968
    2
    52

I guess it's because of `mapPartitionsWithIndex`. Is there any way to correctly skip the header without changing the structure?

Lance Chuang
  • Possible duplicate of [How to skip header from csv files in Spark?](http://stackoverflow.com/questions/27854919/how-to-skip-header-from-csv-files-in-spark) – stholzm Mar 08 '17 at 07:41
  • I use the same way as it, but what I want is generating (userid, movieid) – Lance Chuang Mar 08 '17 at 07:44
  • This question is quite misleading. It is actually about the `flatMap` part. A better title would be "How to split CSV lines into tuples with Spark Scala". – stholzm Mar 08 '17 at 08:30

1 Answer


Ah, your question is not about the header, but about how to split the lines into `(userid, movieid)`? The problem is that `flatMap` flattens each `Array("1", "1172")` produced by `split` into individual elements, which is why your pairs fall apart. Instead of `.flatMap(line => line.split(","))`, try this:

    .map(line => line.split(",") match { case Array(userid, movieid) => (userid, movieid) })
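For reference, the `split` + pattern-match step can be tried in plain Scala without Spark; this is a minimal sketch using a hard-coded `Seq` in place of the RDD, with the same header-skipping and tuple-building logic:

```scala
object SplitDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the RDD: header line followed by data rows.
    val lines = Seq("userId,movieId", "1,1172", "1,1405", "2,52")

    // drop(1) skips the header; map keeps one tuple per line.
    // (flatMap would flatten each Array("1", "1172") into separate
    // elements, which is what produced the question's broken output.)
    val pairs = lines.drop(1).map { line =>
      line.split(",") match {
        case Array(userId, movieId) => (userId, movieId)
      }
    }

    pairs.foreach(println)  // prints (1,1172), (1,1405), (2,52)
  }
}
```

Note that the `match` will throw a `MatchError` on any line that does not have exactly two comma-separated fields, so malformed rows fail loudly rather than silently.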
stholzm