I have the below data and I want to group by the first element. I am trying with PySpark core (NOT Spark SQL).

(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'), 
(u'THEFT', u'HZ257172', u'011', u'27'), 
(u'ASSAULT', u'HY266148', u'019', u'6'), 
(u'WEAPONS VIOLATION', u'HY299741', u'010', u'29'), 
(u'CRIM SEXUAL ASSAULT', u'HY469211', u'025', u'19'), 
(u'NARCOTICS', u'HY313819', u'016', u'11'), 
(u'NARCOTICS', u'HY215976', u'003', u'42'), 
(u'NARCOTICS', u'HY360910', u'011', u'27'), 
(u'NARCOTICS', u'HY381916', u'015', u'25') 

I tried

file.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

but this didn't work.


2 Answers


It shouldn't work: groupByKey can be called only on an RDD of key-value pairs (see How to determine if object is a valid key-value pair in PySpark), and a tuple of arbitrary length is not one.

Decide which value is the key and apply map or keyBy first. For example:

rdd.map(lambda x: (x[0], x[1:])).groupByKey()
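
As a minimal runnable sketch, assuming a live SparkContext sc and using a few of the sample rows from the question:

data = [(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'),
        (u'THEFT', u'HZ257172', u'011', u'27'),
        (u'NARCOTICS', u'HY313819', u'016', u'11'),
        (u'NARCOTICS', u'HY215976', u'003', u'42')]
rdd = sc.parallelize(data)
# reshape each 4-tuple into a (key, value) pair before grouping
pairs = rdd.map(lambda x: (x[0], x[1:]))
grouped = pairs.groupByKey()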
Adding to the above: I had to wrap the grouped values in list(), otherwise they come back as pyspark.resultiterable.ResultIterable objects. Thanks for the response. – Sachin Sukumaran Dec 06 '16 at 10:26
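
To illustrate that point (continuing the pairs RDD sketched above), mapValues(list) is an equivalent way to materialize the lazy groups:

# groupByKey yields lazy ResultIterable values; turn them into plain lists
result = pairs.groupByKey().mapValues(list).collect()
# e.g. [(u'NARCOTICS', [(u'HY313819', u'016', u'11'), (u'HY215976', u'003', u'42')]), ...]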

Got this working with the below code:

from pyspark import SparkContext

sc = SparkContext()

def chicagofile(line):
    # pull the relevant columns out of each CSV row
    sLine = line.split(",")
    cNum = sLine[1]        # case number
    cDist = sLine[11]      # district
    cType = sLine[5]       # primary crime type
    cCommArea = sLine[13]  # community area
    return (cType, cNum, cDist, cCommArea)

cFile = sc.textFile("/user/sachinkerala6174/inData/ChicagoCrime15/crimes2015.csv")
getFile = cFile.map(chicagofile)
# reshape each 4-tuple into a (key, value) pair so groupByKey can be applied
mapCType = getFile.map(lambda x: (x[0], (x[1], x[2], x[3])))
# group by crime type and materialize the grouped values as lists
grp = mapCType.groupByKey().map(lambda x: (x[0], list(x[1])))
grp.saveAsTextFile("/user/sachinkerala6174/inData/ChicagoCrime15/res1")
print grp.collect()
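
For reference, with the sample rows shown in the question, each record in grp would have this shape (a sketch based on that sample, not actual job output):

(u'NARCOTICS', [(u'HY313819', u'016', u'11'), (u'HY215976', u'003', u'42'),
                (u'HY360910', u'011', u'27'), (u'HY381916', u'015', u'25')])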