Questions tagged [mahout]

Apache Mahout is an open source, scalable machine learning project.

This topic covers questions related to Apache Mahout, a scalable machine learning project written in Java and largely based on Apache Hadoop, with implementations of algorithms for:

1171 questions
57
votes
2 answers

What is the difference between Apache Mahout and Apache Spark's MLlib?

Consider a MySQL product database with 10 million products for an e-commerce website. I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop. I wanted to use Mahout over…
eliasah
44
votes
4 answers

Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating the options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as…
Karussell
34
votes
3 answers

Large scale machine learning - Python or Java?

I am currently embarking on a project that will involve crawling and processing huge amounts of data (hundreds of gigs), and also mining them for extracting structured data, named entity recognition, deduplication, classification etc. I'm familiar…
jeffreyveon
28
votes
2 answers

What's the difference between item-based and content-based collaborative filtering?

I am puzzled about what the item-based recommendation is, as described in the book "Mahout in Action". There is the algorithm in the book: for every item i that u has no preference for yet for every item j that u has a preference for compute a…
cstur4
27
votes
1 answer

Clustering (fkmeans) with Mahout using Clojure

I am trying to write a short script to cluster my data via Clojure (calling Mahout classes, though). I have my input data in this format (which is the output of a PHP script): (tag) (image) (frequency) tag_sit image_a 0 tag_sit image_b…
Jeffrey04
27
votes
2 answers

Using machine learning to de-duplicate data

I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case. I have a data set of around a hundred million records containing customer data including names, addresses,…
25
votes
5 answers

Java's Mahout equivalent in Python

The goal of the Java-based Mahout is to build scalable machine learning libraries. Are there any equivalent libraries in Python?
Srikar Appalaraju
20
votes
4 answers

Support Vector Machine for Java?

I'd like to write a "smart monitor" in Java that sends out an alert any time it detects oncoming performance issues. My Java app is writing data in a structured format to a log file: | | So, for…
IAmYourFaja
18
votes
4 answers

Hadoop, Mahout real-time processing alternative

I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead connected with starting a job. I'm looking for a solution which could be used this way - jobs…
mmatloka
16
votes
3 answers

How to start development for mahout

After installing Mahout from (http://girlincomputerscience.blogspot.com/2010/11/apache-mahout.html), how do I run a Mahout algorithm, and where can I find a popular, easy tutorial for Mahout beginners? Thanks in advance.
Vignesh Prajapati
16
votes
2 answers

What is the path to directory within Hadoop filesystem?

I recently started learning Hadoop and Mahout. I want to know the path to a directory within the Hadoop filesystem. In hadoop-1.2.1/conf/core-site.xml, I have specified: hadoop.tmp.dir
Li'
15
votes
1 answer

HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

I'm working on a large text classification project and we have our text data (simple messages) stored in HBase. We have two problems. First, we would like to use HBase as the source for Mahout classifiers, namely Bayes and Random Forests. Second,…
NightWolf
15
votes
3 answers

Why can vector normalization improve the accuracy of clustering and classification?

Mahout in Action says that normalization can slightly improve accuracy. Can anyone explain the reason? Thanks!
Meng Zhang
15
votes
3 answers

Recommendation Systems using Solr and Mahout

I've been reading about using Solr and Mahout for developing recommendation systems. As I understand it, they handle two different problems. Since Solr is a search engine+classification system, it is used mostly for recommendations like "more like…
Ashika Umanga Umagiliya
14
votes
2 answers

Is it possible to use Apache Mahout without a Hadoop dependency?

Is it possible to use Apache Mahout without any dependency on Hadoop? I would like to use the Mahout algorithms on a single computer by only including the Mahout library inside my Java project, but I don't want to use Hadoop at all since I will be…
skyde