First, let me generate some random data to start working with.
import random
number_of_rows = int(1e6)
line_error = "error line"
text = []
for i in range(number_of_rows):
choice = random.choice([1,2,3,4])
if choice == 1:
line = line_error
elif choice == 2:
line = "1 2 3 4 5 6 7 8 9_1"
elif choice == 3:
line = "1 2 3 4 5 6 7 8 9_2"
elif choice == 4:
line = "1 2 3 4 5 6 7 8 9_3"
text.append(line)
Now I have a string text
looks like
1 2 3 4 5 6 7 8 9_2
error line
1 2 3 4 5 6 7 8 9_3
1 2 3 4 5 6 7 8 9_2
1 2 3 4 5 6 7 8 9_3
1 2 3 4 5 6 7 8 9_1
error line
1 2 3 4 5 6 7 8 9_2
....
Your solution:
def wrapException(a):
try:
return a[8]
except:
return 'error'
log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()
#[('9_3', 250885), ('9_1', 249307), ('9_2', 249772)]
Here is my solution:
from operator import add
def myfunction(l):
try:
return (l.split(' ')[8],1)
except:
return ('MYERROR', 1)
log.map(myfunction).reduceByKey(add).collect()
#[('9_3', 250885), ('9_1', 249307), ('MYERROR', 250036), ('9_2', 249772)]
Comment:
(1) I highly recommend also calculating the lines with "error" because it won't add too much overhead, and also can be used for sanity check, for example, all the counts should add up to the total number of rows in the log, if you filter out those lines, you have no idea those are truly bad lines or something went wrong in your coding logic.
(2) I will try to package all the line level operations in one function to avoid chaining of map
, filter
functions, so it is more readable.
(3) From performance perspective, I generated a sample of 1M records and my code finished in 3 seconds and yours in 2 seconds, it is not a fair comparasion since the data is so small and my cluster is pretty beefy, I would recommend you generate a bigger file (1e12?) and do a benchmark on yours.