
I'm trying to learn a little bit of MapReduce in combination with Python.

Now I have the following code running from a tutorial I'm doing.

from mrjob.job import MRJob

class SpendByCustomer(MRJob):

    def mapper(self, _, line):
        (customerID, itemID, orderAmount)  = line.split(',')
        yield customerID, float(orderAmount)

    def reducer(self, customerID, orders):
        yield customerID, sum(orders)


if __name__ == '__main__':
  SpendByCustomer.run()

It should do the following.

When I run !python SpendByCustomers.py customer-orders.xls > test.txt, it should read in the .xls file, map and reduce it, and write the output to test.txt.

It all works fine and I mostly understand it. However, I would like more insight into the following:

  • In

    def mapper(self, _, line):
    

    What is the _ doing here?

  • In

    if __name__ == '__main__':
      SpendByCustomer.run()
    

    What is this function exactly doing?

das-g
John Dwyer
  • The `__name__ == '__main__'` part of your question is a clear duplicate. Have you tried to research it? See this: http://stackoverflow.com/a/419185/4996248 – John Coleman Feb 21 '16 at 12:48

2 Answers


In Python, letters, digits, and underscores can be used in variable names. (A variable name cannot start with a digit.) The underscore is a commonly used variable name for values that will not be used. For example, if I wanted to print "hello" 5 times, I could do the following:

for _ in range(5):
    print("hello")

In that example, the _ isn't really doing anything; it is there because we have to put some variable name there, and using that name tells other programmers reading the code that the variable will not be used. In the case of your function, the mapper method is called with three arguments (self, a key, and a value), but your mapper method never uses the key, so that parameter is named _.

As for the second question, SpendByCustomer.run() is a method defined in the mrjob.job.MRJob class. As you might have guessed from its name, it runs the MapReduce job, using your mapper() and reducer() methods to do it.
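To see roughly what "running the job" involves, here is a minimal sketch in plain Python that does not import mrjob. The run_locally function is a made-up stand-in, not mrjob's actual implementation, and the sample lines are invented for illustration. It calls the mapper once per line with None as the key, groups the mapped values by customer, and passes each group to the reducer.

```python
from collections import defaultdict


def mapper(_, line):
    # `_` receives the key; this sketch never reads it.
    customer_id, _item_id, order_amount = line.split(',')
    yield customer_id, float(order_amount)


def reducer(customer_id, orders):
    yield customer_id, sum(orders)


def run_locally(lines):
    # Hypothetical stand-in for a MapReduce runner (an assumption,
    # not how mrjob is implemented).
    grouped = defaultdict(list)
    for line in lines:
        # Map step: the key passed in is None, as with raw text input.
        for key, value in mapper(None, line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        # Reduce step: one call per distinct key.
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Invented sample data in the customerID,itemID,orderAmount layout.
sample_lines = ["44,8602,37.19", "35,5368,65.89", "44,3391,40.64"]
totals = run_locally(sample_lines)
```

The real library also handles input files, output formatting, and running on a cluster, none of which this sketch attempts.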

zondo

In

def mapper(self, _, line):

What is the _ doing here?

MRJob.mapper has the method signature MRJob.mapper(key, value).

If you redefine it in your derived class (as you do), you must keep a compatible signature, so you have to accept two arguments. Your function will be called with some key as the first argument and some value as the second argument. The documentation states that "if you don’t mess with Protocols" the key you will be called with will be None, so it isn't particularly interesting. Consequently, your redefinition of the method doesn't do anything with it.

In Python, it's conventional to use the name _ for arguments one must have in the signature to stay compatible with interfaces. Technically, there's nothing special about this argument name. It's just a single-character name using one of the characters allowed for identifiers. It's chosen mainly for its visual appearance (looking somewhat like an unfilled form field) to tell human readers of your code: I'm being passed a value here, but I don't care what it is (and won't use it in my implementation).
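The point about keeping a compatible signature can be sketched without mrjob. In the snippet below, BaseJob, feed, and LineLength are made-up names for illustration only. The base class always calls mapper with two arguments, so the subclass has to accept both even though it ignores the first one. The last function shows that _ is an ordinary name that still holds whatever was passed in.

```python
class BaseJob:
    """Hypothetical base class; it always calls mapper(key, value)."""

    def mapper(self, key, value):
        yield key, value

    def feed(self, values):
        out = []
        for value in values:
            # The caller decides the argument count: a key (None here) and a value.
            out.extend(self.mapper(None, value))
        return out


class LineLength(BaseJob):
    def mapper(self, _, line):
        # Same number of parameters as the base class; the key is ignored.
        yield line, len(line)


def underscore_is_a_normal_name(_, value):
    # Reading `_` works like reading any other parameter.
    return _, value


pairs = LineLength().feed(["ab", "cde"])
echoed = underscore_is_a_normal_name("some key", 1)
```

If LineLength.mapper accepted only line, the call inside feed would raise a TypeError, which is why the placeholder parameter is needed.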

In

if __name__ == '__main__':
  SpendByCustomer.run()

What is this function exactly doing?

For the if __name__ == '__main__': part, see What does if __name__ == "__main__": do?

For the .run() part, refer to the MRJob documentation.
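As a quick illustration of the guard itself, here is a minimal sketch. The run_job function is a made-up stand-in for SpendByCustomer.run() and only returns a message. Python sets __name__ to '__main__' when a file is executed directly, and to the module's own name when the file is imported, so the guarded call happens only in the first case.

```python
def run_job():
    # Hypothetical stand-in for SpendByCustomer.run().
    return "job started"


# True when this file is run as a script (python thisfile.py),
# False when another module does `import thisfile`.
if __name__ == '__main__':
    print(run_job())
```

This lets the same file be imported (for example by a test or another script) without starting the job as a side effect.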

Community
das-g