
I want to implement a simple k-means with Hadoop MapReduce and Python.

The mapper gets points and maps each point to its nearest center.
The reducer gets a center as the key and its points as values, and calculates a new center from those points.

But now I need to gather all the new centers from the reducers and pass them, in some way, to the mappers in the next round.

How can I do that? I need a global array of centers available to each map task.

What is the right way of doing it?


Answer

For info on how to encode a global constant see this question.

Mapper

Accepts

  • data
  • global constant representing the list of centers

Computes

  • the nearest center for each data instance

Emits

  • nearest centers (key) and points (value).
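
A minimal Hadoop Streaming mapper along these lines might look like the sketch below. The file name `centers.txt` and the whitespace-separated point format are assumptions; the file would be shipped to every map task with Streaming's `-files` option, which is one way to expose the centers as the "global constant" mentioned above.

```python
#!/usr/bin/env python
# mapper.py - a sketch, not the asker's actual code.
import sys

def load_centers(path="centers.txt"):
    # "centers.txt" is assumed to be shipped alongside the job with -files,
    # so it appears in the task's working directory.
    centers = []
    with open(path) as f:
        for line in f:
            if line.strip():
                centers.append([float(x) for x in line.split()])
    return centers

centers = load_centers()

for line in sys.stdin:
    if not line.strip():
        continue
    point = [float(x) for x in line.split()]
    # index of the nearest center by squared Euclidean distance
    nearest = min(range(len(centers)),
                  key=lambda i: sum((p - c) ** 2
                                    for p, c in zip(point, centers[i])))
    # key = center index, value = the point itself
    print("%d\t%s" % (nearest, " ".join(map(str, point))))
```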

Reducer

Accepts

  • center instance / coordinate (key)
  • points (value)

Computes

  • the new centers based on clusters

Emits

  • new centers
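
A matching reducer sketch, under the same assumptions: Hadoop Streaming sorts the mapper output by key, so all points assigned to one center arrive consecutively on stdin, and the reducer only needs a running sum per key.

```python
#!/usr/bin/env python
# reducer.py - a sketch matching the mapper above.
import sys

current_key = None
sums = []
count = 0

def emit_center(sums, count):
    # the new center is the mean of all points assigned to this key
    print(" ".join(str(s / count) for s in sums))

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    point = [float(x) for x in value.split()]
    if key != current_key:
        if current_key is not None:
            emit_center(sums, count)
        current_key = key
        sums = [0.0] * len(point)
        count = 0
    sums = [s + p for s, p in zip(sums, point)]
    count += 1

if current_key is not None:
    emit_center(sums, count)
```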

You will provide the next epoch of K-Means with:

  1. the same data from your initial epoch
  2. the centers emitted from the reducer as global constants

Repeat until your stopping criteria are met.
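
The iteration itself is usually driven from outside Hadoop, e.g. a small script that launches one Streaming job per epoch and copies the emitted centers back into the local centers file before the next run. A rough sketch, in which the streaming jar location, HDFS paths, and fixed epoch count are all assumptions:

```python
#!/usr/bin/env python
# driver.py - a sketch of the outer loop; every path here is an assumption.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed location
INPUT = "kmeans/points"                                           # assumed HDFS input

for epoch in range(8):            # or stop when the centers no longer move
    out_dir = "kmeans/out_%d" % epoch
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-files", "mapper.py,reducer.py,centers.txt",  # ships the current centers to every task
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", INPUT,
        "-output", out_dir,
    ])
    # overwrite the local centers file with the reducer output,
    # so the next epoch's mappers see the updated centers
    with open("centers.txt", "w") as f:
        subprocess.check_call(["hadoop", "fs", "-cat", out_dir + "/part-*"],
                              stdout=f)
```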

  • You should take a look at [this answer](http://stackoverflow.com/questions/2499585/chaining-multiple-mapreduce-jobs-in-hadoop). I would imagine that if your data lends itself to clustering, you won't have very many iterations anyway. Manually calling this ~6-8 times will get you good results and isn't too onerous. You could always write your own script, though. – carpenter Aug 10 '15 at 19:36