I'm new to Spark and I'm trying to read a pipe-delimited file. My file looks like this:
user1|acct01|A|Fairfax|VA
user1|acct02|B|Gettysburg|PA
user1|acct03|C|York|PA
user2|acct21|A|Reston|VA
user2|acct42|C|Fairfax|VA
user3|acct66|A|Reston|VA
and I do the following in Scala:
scala> case class Accounts (usr: String, acct: String, prodCd: String, city: String, state: String)
defined class Accounts
scala> val accts = sc.textFile("accts.csv").map(_.split("\\|")).map(
| a => (a(0), Accounts(a(0), a(1), a(2), a(3), a(4)))
| )
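A note on the delimiter: `String.split` with a `String` argument treats it as a regular expression, and an unescaped `|` is regex alternation rather than a literal pipe. The following is a minimal plain-Scala sketch (no Spark needed) comparing the two forms on one line of the sample file; the value names are just for illustration.

```scala
// One record from the sample file.
val line = "user1|acct01|A|Fairfax|VA"

// Unescaped: "|" is an alternation of two empty patterns, so it matches
// between characters and the record is broken up character by character.
val unescaped = line.split("|")

// Escaped regex, or a Char separator, splits on the literal pipe.
val escaped = line.split("\\|")
val byChar  = line.split('|')

println(s"unescaped pieces: ${unescaped.length}")
println(s"escaped pieces:   ${escaped.length}")
println(escaped.mkString(", "))
```

With the escaped form, `a(0)` through `a(4)` line up with the five fields of the `Accounts` case class.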
I then try to group the key-value pairs by key, but I'm not sure I'm doing this right. Is this how I do it?
scala> accts.groupByKey(2)
res0: org.apache.spark.rdd.RDD[(String, Iterable[Accounts])] = ShuffledRDD[4] at groupByKey at <console>:26
I thought the (2) would give me the first two results back, but I don't seem to get anything back at the console...
If I run a distinct, I get an error too:
scala> accts.distinct(1).collect(1)
<console>:26: error: type mismatch;
found : Int(1)
required: PartialFunction[(String, Accounts),?]
accts.distinct(1).collect(1)
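For reference, the integer passed to `groupByKey` and `distinct` is the number of partitions, not a row limit, and both are lazy transformations, so nothing is computed until an action runs. `collect()` takes no arguments (the one-argument overload expects a `PartialFunction`, hence the type mismatch), while `take(n)` returns the first n elements. A minimal sketch, assuming a spark-shell session where `sc` is already defined and using a small made-up `pairs` RDD in place of the file:

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is already defined.
// `pairs` is illustrative data standing in for the parsed file.
val pairs = sc.parallelize(Seq(
  ("user1", "acct01"), ("user1", "acct02"), ("user2", "acct21")
))

// The argument is the number of partitions of the result, not a limit.
val grouped = pairs.groupByKey(2)

// Actions trigger the computation and return data to the driver.
val firstTwo   = grouped.take(2)    // at most two (key, values) pairs
val everything = grouped.collect()  // every pair, as a local Array

firstTwo.foreach(println)
```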
EDIT: Essentially I'm trying to get to a nested key-value mapping. For example, user1 would look like this:
user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}
I'm trying to learn this step by step, so I thought I'd break it down into chunks to understand.
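One possible way to get that nested shape, sketched under some assumptions: it runs in spark-shell where `sc` is defined, it uses a few of the sample lines inline instead of reading `accts.csv`, and `AccountInfo` is a hypothetical helper class (not part of the code above) that holds the non-key fields. The idea is to key each record by user, carry an (account, details) pair as the value, group by user, and turn each group into a `Map`.

```scala
// Assumes spark-shell, where `sc` is already defined.
// AccountInfo is a hypothetical helper holding the per-account details.
case class AccountInfo(prodCd: String, city: String, state: String)

// A few lines from the sample file, inlined so the sketch needs no input file.
val lines = sc.parallelize(Seq(
  "user1|acct01|A|Fairfax|VA",
  "user1|acct02|B|Gettysburg|PA",
  "user2|acct21|A|Reston|VA"
))

val nested = lines
  .map(_.split("\\|"))                                      // escape the pipe
  .map(a => (a(0), (a(1), AccountInfo(a(2), a(3), a(4)))))  // (usr, (acct, info))
  .groupByKey()                                             // usr -> Iterable[(acct, info)]
  .mapValues(_.toMap)                                       // usr -> Map(acct -> info)

// Bring the (small) result back to the driver as a local Map.
val result = nested.collect().toMap
result.foreach(println)
```

`groupByKey` shuffles every value for a key to one place, which is fine for a small file like this; for large data an aggregating operation may be a better fit.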