How to convert the text below to a sequence file, which will in turn be converted to vectors for Mahout k-means?

Question

Good afternoon to you all,

My data is in the following format:

ID : VALUE(tags assigned by users)

0001: "PC, THINKPAD, T500"

0002: "PHONE, CELLPHONE, IPHONE, APPLE, IPHONE5"

.......and so on.

How can I write code to:

1) first, convert these records into a sequence file in key:value format.

2) then, convert that sequence file to vectors that will be used for k-means clustering?

I am looking at SequenceFilesFromDirectory and SparseVectorsFromSequenceFiles, but they seem a little complicated and hard to read right now.

So I wonder if anyone here could give me some simple sample code showing how to do the two conversions above?

Thank you very much!

score 0 · Answer 1 · answered Aug 13 '12 at 09:08

Those two processes do exactly what you want to do. After that, it's just a matter of making the output human-readable rather than leaving it as sequence files, for which you would use the seqdumper functionality.

If you need a clearer picture, have a look here; it's a very nice intro.

Julian Ortega


seqdirectory converts a directory structure into a sequence file, while all my data is in one file. Anyway, I already wrote some code that puts the data into the sequence file in key:value format, then used seq2sparse and kmeans, which successfully did the rest. Thank you very much for your response! – phoenixbai Aug 13 '12 at 12:09
You can also check these two examples, which show and partly explain how to use the Sequence File API: [here](http://stackoverflow.com/questions/11645294/how-can-i-use-mahouts-sequencefile-api-code/11645430#11645430) and [here](http://stackoverflow.com/questions/11479600/runing-a-simple-mahout-program) – Julian Ortega Aug 13 '12 at 15:28
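
For readers following the same route as in the comment above (one input file, one record per line), here is a minimal sketch of the parsing half only: turning each `ID: "tag, tag, ..."` line into a key/value pair. The class name `TagLineParser` and the method `parseLine` are hypothetical names made up for this illustration, and the choice to join tags with spaces is an assumption (it is meant to leave tokenizing to seq2sparse). The sketch uses only the Java standard library; writing the pairs out would still need Hadoop, typically by wrapping key and value in `Text` and calling `append` on a `SequenceFile.Writer`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Mahout or Hadoop.
public class TagLineParser {

    /**
     * Parses one line such as: 0001: "PC, THINKPAD, T500"
     * Returns (ID, space-separated tags), or empty if the line has no
     * colon, no ID, or no tags.
     */
    public static Optional<Map.Entry<String, String>> parseLine(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            return Optional.empty();
        }
        String key = line.substring(0, colon).trim();
        String raw = line.substring(colon + 1).trim();

        // Drop the surrounding double quotes if both are present.
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            raw = raw.substring(1, raw.length() - 1);
        }

        // Split on commas, trim each tag, skip empty pieces.
        List<String> tags = new ArrayList<>();
        for (String piece : raw.split(",")) {
            String tag = piece.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        if (key.isEmpty() || tags.isEmpty()) {
            return Optional.empty();
        }

        Map.Entry<String, String> entry = new SimpleEntry<>(key, String.join(" ", tags));
        return Optional.of(entry);
    }

    public static void main(String[] args) {
        String[] lines = {
            "0001: \"PC, THINKPAD, T500\"",
            "0002: \"PHONE, CELLPHONE, IPHONE, APPLE, IPHONE5\""
        };
        for (String line : lines) {
            // In a real job, each pair would be appended to the sequence file here.
            parseLine(line).ifPresent(e ->
                System.out.println(e.getKey() + "\t" + e.getValue()));
        }
    }
}
```

Each resulting pair maps naturally onto one sequence-file record, with the ID as the key and the tag text as the value, which is the shape seq2sparse reads before kmeans is run on the vectors.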
