
Good afternoon to you all,

My data is in the following format:

ID: VALUE (tags assigned by users)

0001: "PC, THINKPAD, T500"

0002: "PHONE, CELLPHONE, IPHONE, APPLE, IPHONE5"

... and so on.

How can I write code to:

1) first, convert this data into a sequence file in key:value format;

2) then, convert that sequence file into vectors that can be used for k-means clustering?

I have been looking at SequenceFilesFromDirectory and SparseVectorsFromSequenceFiles, but they seem a little complicated and hard to read right now.

So I wonder if anyone here could give me some simple sample code showing how to do the two conversions above.

Thank you very much!

phoenixbai

1 Answer


Those two processes do exactly what you want to do. After that, it's just a matter of making the output human-readable instead of sequence files, for which you would use the seqdumper utility.

If you need a clearer picture, have a look here; it's a very nice intro.
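To give a rough idea of what the vectorization step produces, here is a minimal plain-Java sketch that turns comma-separated tags into sparse term-count vectors. It is only a conceptual illustration, not Mahout's implementation: the class `TagVectorizer` and its methods are hypothetical names, and the real seq2sparse job also tokenizes the text and can apply TF-IDF weighting.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Mahout.
class TagVectorizer {

    // Maps each distinct tag to a column index, in order of first appearance.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Converts a string such as "PC, THINKPAD, T500" into a sparse vector,
    // represented here as (column index -> number of occurrences).
    Map<Integer, Double> vectorize(String tags) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String raw : tags.split(",")) {
            String tag = raw.trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            Integer index = dictionary.get(tag);
            if (index == null) {
                index = dictionary.size(); // next free column
                dictionary.put(tag, index);
            }
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    // Exposes the tag -> column index mapping built so far.
    Map<String, Integer> dictionary() {
        return dictionary;
    }
}
```

Each record becomes one such vector, and k-means then clusters records whose vectors are close to each other.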

Julian Ortega
  • seqdirectory converts a directory structure into a sequence file, while all my data is in one file. Anyway, I already wrote some code that puts the data into the sequence file in key:value format, and then used seq2sparse and kmeans to do the rest successfully. Thank you very much for your response! – phoenixbai Aug 13 '12 at 12:09
  • You can also check these two examples, which show and explain how to use the SequenceFile API: [here](http://stackoverflow.com/questions/11645294/how-can-i-use-mahouts-sequencefile-api-code/11645430#11645430) and [here](http://stackoverflow.com/questions/11479600/runing-a-simple-mahout-program) – Julian Ortega Aug 13 '12 at 15:28
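
For readers whose data also sits in a single file, the parsing half of the approach described in the comment above might look like the following sketch. It uses only the Java standard library, and the class `TagRecordParser` and its methods are hypothetical names, not part of Mahout or Hadoop. It assumes each record has the form `ID: "TAG, TAG, ..."` as shown in the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; it is not part of Mahout or Hadoop.
class TagRecordParser {

    // Splits one line such as  0001: "PC, THINKPAD, T500"  into (ID, tags).
    // Returns null for blank lines and for lines without an ID before a colon.
    static Map.Entry<String, String> parseLine(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            return null;
        }
        String id = line.substring(0, colon).trim();
        if (id.isEmpty()) {
            return null;
        }
        String tags = line.substring(colon + 1).trim();
        // Drop the surrounding double quotes when both are present.
        if (tags.length() >= 2 && tags.startsWith("\"") && tags.endsWith("\"")) {
            tags = tags.substring(1, tags.length() - 1);
        }
        return new SimpleEntry<>(id, tags);
    }

    // Parses every line and keeps only the well-formed records.
    static List<Map.Entry<String, String>> parseAll(List<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, String> record = parseLine(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }
}
```

In an actual job, each (ID, tags) pair would then be appended to a Hadoop `SequenceFile.Writer` as a `Text` key and a `Text` value, which, as far as I know, is the input format seq2sparse expects.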