
I have used this code:

def process_row(row):
    words = row.replace('"', '').split(' ')
    for i in range(len(words)):
        # if we find '-' we will replace it with '0'
        if words[i] == '-':
            words[i] = '0'
    return words
    # unreachable dead code from an earlier attempt:
    # return [words[0], words[1], words[2], words[3], words[4], int(words[5])]

nasa = (
    nasa_raw.flatMap(process_row)
)
nasa.persist()
for row in nasa.take(10):
    print(row)

to transform this data:

in24.inetnebr.com [01/Aug/1995:00:00:01] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt" 200 1839
 uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/ksclogo-medium.gif" 304 0
uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/MOSAIC-logosmall.gif" 304 0
 uplherc.upl.com [01/Aug/1995:00:00:08] "GET /images/USA-logosmall.gif" 304 0
ix-esc-ca2-07.ix.netcom.com [01/Aug/1995:00:00:09] "GET /images/launch-logo.gif" 200 1713
uplherc.upl.com [01/Aug/1995:00:00:10] "GET /images/WORLD-logosmall.gif" 304 0
slppp6.intermind.net [01/Aug/1995:00:00:10] "GET /history/skylab/skylab.html" 200 1687
piweba4y.prodigy.com [01/Aug/1995:00:00:10] "GET /images/launchmedium.gif" 200 11853
slppp6.intermind.net [01/Aug/1995:00:00:11] "GET /history/skylab/skylab-small.gif" 200 9202

into this pipelined rdd:

in24.inetnebr.com
[01/Aug/1995:00:00:01]
 GET
 /shuttle/missions/sts-68/news/sts-68-mcc-05.txt
 200
 1839
 uplherc.upl.com
 [01/Aug/1995:00:00:07]
 GET
 /

I want to create frequencies of addresses like uplherc.upl.com by using pairs:

pairs = nasa.map(lambda x: (x, 1))
count_by_resource = pairs.reduceByKey(lambda x, y: x + y)
count_by_resource = count_by_resource.takeOrdered(10, key=lambda x: -x[1])
spark.createDataFrame(count_by_resource, ['Resource_location', 'Count']).show(10)

but the result is the frequency of every element, not just the addresses:

+--------------------+-------+
|   Resource_location|  Count|
+--------------------+-------+
|                 GET|1551681|
|                 200|1398910|
|                   0| 225418|

How should I refer to my element of interest?

user1997567

1 Answer


Splitting each line by spaces and then flatMapping all of the resulting tokens into a single RDD creates unnecessary work, and definitely additional overhead and processing, when you are primarily interested in a count of the domains.
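
To see the difference, here is a minimal sketch (assuming the SparkContext sc from your session; the sample lines are abbreviated):

lines = sc.parallelize([
    'in24.inetnebr.com [01/Aug/1995:00:00:01] "GET /x" 200 1839',
    'uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0',
])

# flatMap flattens the token lists: every token becomes its own RDD element
lines.flatMap(lambda r: r.split(' ')).take(3)
# ['in24.inetnebr.com', '[01/Aug/1995:00:00:01]', '"GET']

# map keeps one element per line, so the domain stays addressable
lines.map(lambda r: r.split(' ')[0]).collect()
# ['in24.inetnebr.com', 'uplherc.upl.com']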

Based on the sample data provided, the domain is the first item on each line. I have also noted that some of your lines begin with a space, which produces an empty string as the first token after splitting. You may consider using the strip function to trim each line before processing it.
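
For example, in plain Python (using one of your log lines with its leading space):

line = ' uplherc.upl.com [01/Aug/1995:00:00:07] "GET /" 304 0'

line.split(' ')[0]          # '' -- the leading space yields an empty first token
line.strip().split(' ')[0]  # 'uplherc.upl.com'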

You may consider modifying process_row to return only the first token, or creating another map operation which does:


def extract_domain_from_row(row):
    # if row is a string
    domain = row.strip().split(' ')[0]
    # if you send a list instead, you could extract the first item as the domain name
    # domain = row[0]
    return domain.lower()

# intermediary rdd
nasa_domains = nasa_raw.map(extract_domain_from_row)

# continue operations as desired with `nasa_domains`
pairs = nasa_domains.map(lambda x: (x, 1))
count_by_resource = pairs.reduceByKey(lambda x, y: x + y)
count_by_resource = count_by_resource.takeOrdered(10, key=lambda x: -x[1])
spark.createDataFrame(count_by_resource, ['Resource_location', 'Count']).show(10)

Outputs

+--------------------+-----+
|   Resource_location|Count|
+--------------------+-----+
|     uplherc.upl.com|    5|
|slppp6.intermind.net|    2|
|   in24.inetnebr.com|    1|
|ix-esc-ca2-07.ix....|    1|
|piweba4y.prodigy.com|    1|
+--------------------+-----+
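
As an aside, since takeOrdered already brings the results back to the driver, countByValue is a shorter route to the same numbers (a sketch; it returns a plain dict of value-to-count rather than an RDD, so only use it when the number of distinct domains is modest):

counts = nasa_domains.countByValue()
top10 = sorted(counts.items(), key=lambda kv: -kv[1])[:10]
spark.createDataFrame(top10, ['Resource_location', 'Count']).show(10)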

If the first item is not always a domain, you may want to filter your collection using a pattern that matches domains; see the domain regex suggestions here.
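
A sketch of such a filter (the pattern below is a deliberately simple hostname check, not a full domain validator; adapt it from the linked suggestions):

import re

# dot-separated labels of lowercase letters, digits and hyphens
DOMAIN_RE = re.compile(r'^[a-z0-9-]+(\.[a-z0-9-]+)+$')

valid_domains = nasa_domains.filter(lambda d: DOMAIN_RE.match(d) is not None)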

ggordon