9

Can somebody explain the main pros and cons of the best-known open-source data mining tools?

Everywhere I read that RapidMiner, Weka, Orange, and KNIME are the best ones. Look at this blog post.

Can somebody give a quick technical comparison as a short bullet list?

My needs are the following:

  • It should support classification algorithms (Naive Bayes, SVM, C4.5, kNN).
  • It should be easy to implement in Java.
  • It should have understandable documentation.
  • It should have reference production projects or use cases built on it.
  • Some additional benchmark comparison, if possible.

Thanks!

user2670818

4 Answers

7

First, each tool on your list has its pros and cons. From personal experience, though, I would suggest Weka: it is very simple to embed in your own Java application using the Weka jar file, and it also ships with its own self-contained tools for data mining.

RapidMiner seems to be a commercial product offering an end-to-end solution. However, most of the examples I have seen of integrating it with external code use Python or R scripts, not Java.

Orange offers tools that seem to be aimed primarily at people who have less need to embed them in their own software but want easier user interaction. It is written in Python, the source is available, and user add-ons are supported.

KNIME is another commercial platform offering end-to-end solutions for data mining and analysis, with all the required tools. It has various good reviews around the internet, but I haven't used it enough to advise you or anyone on its pros and cons.

See here for KNIME vs Weka

Best data mining tools

As I said, Weka is my personal favorite as a software developer, but I'm sure other people have their own reasons and opinions for choosing one over another. I hope you find the right solution for you.

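Also, per your requirements, Weka supports the following:

  • Naive Bayes (class weka.classifiers.bayes.NaiveBayes)
  • SVM (class weka.classifiers.functions.SMO)
  • C4.5 (class weka.classifiers.trees.J48, Weka's C4.5 implementation)
  • kNN (class weka.classifiers.lazy.IBk)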
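To give a feel for what the simplest of these algorithms does, here is a minimal plain-Java sketch of kNN classification. It is an illustration only, not Weka's implementation: the class name TinyKnn, its methods, and the toy data are all hypothetical, and it uses squared Euclidean distance with a simple majority vote.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; NOT part of Weka or any other library.
final class TinyKnn {
    private final double[][] points; // training feature vectors
    private final String[] labels;   // class label for each training vector
    private final int k;             // number of neighbours that vote

    TinyKnn(double[][] points, String[] labels, int k) {
        if (points.length != labels.length || k < 1 || k > points.length) {
            throw new IllegalArgumentException("need one label per point and 1 <= k <= n");
        }
        this.points = points;
        this.labels = labels;
        this.k = k;
    }

    // Squared Euclidean distance; the square root is not needed for ranking.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the majority label among the k training points nearest to the query.
    // On a tie, the label that first reaches the top count (nearest first) wins.
    String classify(double[] query) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices by distance to the query, nearest first.
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(points[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int rank = 0; rank < k; rank++) {
            String label = labels[order[rank]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {5, 5}, {6, 5}}; // toy data, made up for the example
        String[] y = {"a", "a", "b", "b"};
        TinyKnn knn = new TinyKnn(x, y, 3);
        System.out.println(knn.classify(new double[] {0.2, 0.3}));
        System.out.println(knn.classify(new double[] {5.5, 5.0}));
    }
}
```

In a real project you would use the library's classifier rather than something like this; the sketch only shows how little logic the basic algorithm needs.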

D3181
  • Yes, great, thank you! I personally use Weka too, but I am not sure how to prove that it is better than the others. That's why I was interested in whether anyone has compared the performance, or the differences between the algorithm implementations and the development APIs. – user2670818 Jul 25 '16 at 12:13
  • After reading around to try to answer your question, I found it really hard to find a clear and concise performance breakdown across all of these data mining tools and platforms, which would be really useful for many reasons. Hopefully more services will provide one in future. I did find http://www.predictiveanalyticstoday.com, which was marginally helpful: if you search through it, it gives very rough reviews, but that is better than nothing, I guess. Anyway, in my opinion, if you have used Weka and have experience with it, it is probably easiest to stick with it until you find a reason – D3181 Jul 25 '16 at 15:45
3

I have tried Orange and Weka with a database of 15K records and found problems with memory management in Weka: it needed more than 16 GB of RAM, while Orange could manage the same database without using that much. Once Weka reaches its maximum amount of memory it crashes, even if you raise the limit in the ini file that tells the Java virtual machine how much memory to use.
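If you hit this, one thing worth checking is whether the JVM actually picked up the larger limit. The following is a minimal sketch using only the standard java.lang.Runtime API; the class name HeapCheck and its helper methods are made up for this example and have nothing to do with Weka itself.

```java
// Hypothetical helper class; uses only the standard java.lang.Runtime API.
final class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    // Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when it is set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently occupied by objects, in MiB.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Running it as `java -Xmx4g HeapCheck` should report a maximum close to 4096 MiB. If the value printed inside your own application is much lower than what you configured, the setting is probably not reaching the JVM that runs the tool.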

1

I recently evaluated many open source projects, comparing and contrasting them with regard to the decision tree machine learning algorithm. Weka and KNIME were included in that evaluation. I covered the differences in algorithm, UX, accuracy, and model inspection. You might choose one or the other depending on which features you value most.

Glenn
0

I have had positive experience with RapidMiner:

  • a large set of machine learning algorithms
  • machine learning tools - feature selection, parameter grid search, data partitioning, cross validation, metrics
  • a large set of data manipulation algorithms - input, transformation, output
  • applicable to many domains - finance, web crawling and scraping, NLP, images (very basic)
  • extensible - one can send data to and receive data from other technologies: R, Python, Groovy, shell
  • portable - can be run as a Java process
  • developer friendly (to some extent, could use some improvements) - logging, debugging, breakpoints, macros

I would have liked to see something like RapidMiner in terms of user experience, but with an underlying engine based on Python technologies: pandas, scikit-learn, spaCy, etc. Preferably, something that would allow moving back and forth between GUI and code.

Amnon
  • You should have a look at Orange (https://orange.biolab.si/). RapidMiner Studio is only free up to 10,000 data rows; beyond that it becomes very costly. – asmaier Jun 12 '19 at 12:12