I have the below data and I want to group by the first element. I am trying with PySpark core (NOT Spark SQL).

(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'), 
(u'THEFT', u'HZ257172', u'011', u'27'), 
(u'ASSAULT', u'HY266148', u'019', u'6'), 
(u'WEAPONS VIOLATION', u'HY299741', u'010', u'29'), 
(u'CRIM SEXUAL ASSAULT', u'HY469211', u'025', u'19'), 
(u'NARCOTICS', u'HY313819', u'016', u'11'), 
(u'NARCOTICS', u'HY215976', u'003', u'42'), 
(u'NARCOTICS', u'HY360910', u'011', u'27'), 
(u'NARCOTICS', u'HY381916', u'015', u'25') 

I tried

file.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

but this didn't work.


2 Answers


It shouldn't work: groupByKey can be called only on an RDD of key-value pairs (see How to determine if object is a valid key-value pair in PySpark), and a tuple of arbitrary length is not one.

Decide which value is the key and apply map or keyBy first. For example:

rdd.map(lambda x: (x[0], x[1:])).groupByKey()
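
As a minimal runnable sketch, assuming a live SparkContext sc and using a few of the sample rows from the question:

data = [(u'CRIM SEXUAL ASSAULT', u'HZ256372', u'003', u'43'),
        (u'THEFT', u'HZ257172', u'011', u'27'),
        (u'NARCOTICS', u'HY313819', u'016', u'11'),
        (u'NARCOTICS', u'HY215976', u'003', u'42')]
rdd = sc.parallelize(data)
# reshape each 4-tuple into a (key, value) pair before grouping
pairs = rdd.map(lambda x: (x[0], x[1:]))
grouped = pairs.groupByKey()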
Adding to the above: I had to wrap the grouped values in list(), otherwise they come back as pyspark.resultiterable.ResultIterable objects. Thanks for the response. – Sachin Sukumaran Dec 06 '16 at 10:26
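
To illustrate that point (continuing the pairs RDD sketched above), mapValues(list) is an equivalent way to materialize the lazy groups:

# groupByKey yields lazy ResultIterable values; turn them into plain lists
result = pairs.groupByKey().mapValues(list).collect()
# e.g. [(u'NARCOTICS', [(u'HY313819', u'016', u'11'), (u'HY215976', u'003', u'42')]), ...]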

Got this working with the below code:

from pyspark import SparkContext

sc = SparkContext()

def chicagofile(line):
    # pull the relevant columns out of each CSV row
    sLine = line.split(",")
    cNum = sLine[1]        # case number
    cDist = sLine[11]      # district
    cType = sLine[5]       # primary crime type
    cCommArea = sLine[13]  # community area
    return (cType, cNum, cDist, cCommArea)

cFile = sc.textFile("/user/sachinkerala6174/inData/ChicagoCrime15/crimes2015.csv")
getFile = cFile.map(chicagofile)
# reshape each 4-tuple into a (key, value) pair so groupByKey can be applied
mapCType = getFile.map(lambda x: (x[0], (x[1], x[2], x[3])))
# group by crime type and materialize the grouped values as lists
grp = mapCType.groupByKey().map(lambda x: (x[0], list(x[1])))
grp.saveAsTextFile("/user/sachinkerala6174/inData/ChicagoCrime15/res1")
print grp.collect()
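
For reference, with the sample rows shown in the question, each record in grp would have this shape (a sketch based on that sample, not actual job output):

(u'NARCOTICS', [(u'HY313819', u'016', u'11'), (u'HY215976', u'003', u'42'),
                (u'HY360910', u'011', u'27'), (u'HY381916', u'015', u'25')])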