I have used this code:
def process_row(row):
words = row.replace('"', '').split(' ')
for i in range(len(words)):
#if we find ‘-’ we will replace it with ‘0’
if(words[-1]=='-'):
words[i]='0'
return words
return [words(0),words(1), words(2), words(3), words(4), int(words(5))]
nasa = (
nasa_raw.flatMap(process_row)
)
nasa.persist()
for row in nasa.take(10):
print(row)
to transform this data:
in24.inetnebr.com [01/Aug/1995:00:00:01] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt" 200
1839
uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/ksclogo-medium.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/MOSAIC-logosmall.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/USA-logosmall.gif" 304 0
ix-esc-ca2-07.ix.netcom.com [01/Aug/1995:00:00:09] "GET /images/launch-logo.gif" 200 1713
uplherc.upl.com [01/Aug/1995:00:00:10] "GET /images/WORLD-logosmall.gif" 304 0
slppp6.intermind.net [01/Aug/1995:00:00:10] "GET /history/skylab/skylab.html" 200 1687
piweba4y.prodigy.com [01/Aug/1995:00:00:10] "GET /images/launchmedium.gif" 200 11853
slppp6.intermind.net [01/Aug/1995:00:00:11] "GET /history/skylab/skylab-small.gif" 200 9202
into this pipelined rdd:
in24.inetnebr.com
[01/Aug/1995:00:00:01]
GET
/shuttle/missions/sts-68/news/sts-68-mcc-05.txt
200
1839
uplherc.upl.com
[01/Aug/1995:00:00:07]
GET
/
I want to create frequencies of adresses like : uplherc.upl.com by using pairs:
pairs = nasa.map(lambda x: (x , 1))
count_by_resource = pairs.reduceByKey(lambda x, y : x + y)
count_by_resource = count_by_resource.takeOrdered(10, key = lambda x: -x[1])
spark.createDataFrame(count_by_resource, ['Resource_location','Count']).show(10)
but the result is something of every element frequency:
--------------------+-------+
| Resource_location| Count|
+--------------------+-------+
| GET|1551681|
| 200|1398910|
| 0| 225418|
How should I refer to my element of interest?