
I want to implement a simple k-means with Hadoop MapReduce and Python.

The mapper gets points and maps each point to its nearest center.
The reducer gets a center as the key and its points as values, and calculates a new center from those points.

But now I need to gather all the new centers from the reducers and pass them, in some way, to the mappers in the next round.

How can I do that? I need a global array of centers available to each map task.

What is the right way of doing it?


Answer

For info on how to encode a global constant see this question.

Mapper

Accepts

  • data
  • global constant representing the list of centers

Computes

  • the nearest center for each data instance

Emits

  • nearest centers (key) and points (value).
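
A minimal Hadoop Streaming mapper along these lines might look like the sketch below. The file name `centers.txt` and the whitespace-separated point format are assumptions; the file would be shipped to every map task with Streaming's `-files` option, which is one way to expose the centers as the "global constant" mentioned above.

```python
#!/usr/bin/env python
# mapper.py - a sketch, not the asker's actual code.
import sys

def load_centers(path="centers.txt"):
    # "centers.txt" is assumed to be shipped alongside the job with -files,
    # so it appears in the task's working directory.
    centers = []
    with open(path) as f:
        for line in f:
            if line.strip():
                centers.append([float(x) for x in line.split()])
    return centers

centers = load_centers()

for line in sys.stdin:
    if not line.strip():
        continue
    point = [float(x) for x in line.split()]
    # index of the nearest center by squared Euclidean distance
    nearest = min(range(len(centers)),
                  key=lambda i: sum((p - c) ** 2
                                    for p, c in zip(point, centers[i])))
    # key = center index, value = the point itself
    print("%d\t%s" % (nearest, " ".join(map(str, point))))
```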

Reducer

Accepts

  • center instance / coordinate (key)
  • points (value)

Computes

  • the new centers based on clusters

Emits

  • new centers
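
A matching reducer sketch, under the same assumptions: Hadoop Streaming sorts the mapper output by key, so all points assigned to one center arrive consecutively on stdin, and the reducer only needs a running sum per key.

```python
#!/usr/bin/env python
# reducer.py - a sketch matching the mapper above.
import sys

current_key = None
sums = []
count = 0

def emit_center(sums, count):
    # the new center is the mean of all points assigned to this key
    print(" ".join(str(s / count) for s in sums))

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    point = [float(x) for x in value.split()]
    if key != current_key:
        if current_key is not None:
            emit_center(sums, count)
        current_key = key
        sums = [0.0] * len(point)
        count = 0
    sums = [s + p for s, p in zip(sums, point)]
    count += 1

if current_key is not None:
    emit_center(sums, count)
```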

You will provide the next epoch of K-Means with:

  1. the same data from your initial epoch
  2. the centers emitted from the reducer as global constants

Repeat until your stopping criteria are met.
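
The iteration itself is usually driven from outside Hadoop, e.g. a small script that launches one Streaming job per epoch and copies the emitted centers back into the local centers file before the next run. A rough sketch, in which the streaming jar location, HDFS paths, and fixed epoch count are all assumptions:

```python
#!/usr/bin/env python
# driver.py - a sketch of the outer loop; every path here is an assumption.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed location
INPUT = "kmeans/points"                                           # assumed HDFS input

for epoch in range(8):            # or stop when the centers no longer move
    out_dir = "kmeans/out_%d" % epoch
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-files", "mapper.py,reducer.py,centers.txt",  # ships the current centers to every task
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", INPUT,
        "-output", out_dir,
    ])
    # overwrite the local centers file with the reducer output,
    # so the next epoch's mappers see the updated centers
    with open("centers.txt", "w") as f:
        subprocess.check_call(["hadoop", "fs", "-cat", out_dir + "/part-*"],
                              stdout=f)
```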

  • You should take a look at [this answer](http://stackoverflow.com/questions/2499585/chaining-multiple-mapreduce-jobs-in-hadoop). I would imagine that if your data lends itself to clustering, you won't have very many iterations anyway. Manually calling this ~6-8 times will get you good results and isn't too onerous. You could always write your own script, though. – carpenter Aug 10 '15 at 19:36