1

For machine learning and data mining we need to work with data, which seems to mean learning Hadoop, whose MapReduce implementation is written in Java (correct me if I am wrong). Hadoop also provides a streaming API to support other languages (like Python). Most grad students and researchers I know solve ML problems in Python, yet we very often see job posts asking for the Hadoop and Java combination.

In my observation, Java and Python are the most widely used languages in this domain.

My question is: what is the most popular language for working in this domain? What factors are involved in deciding which language/framework one should choose?

I know both Java and Python, but I am always unsure:

  • whether to start programming in Java (because of the Hadoop implementation)
  • whether to start programming in Python (because it's easier and quicker to write)

This is a very open-ended question, but I am sure the advice will help me and other people who have the same doubt.

TylerH
daydreamer
  • You might check http://stackoverflow.com/questions/1482282/java-vs-python-on-hadoop for a performance comparison between python and java on hadoop. – petrichor Jun 22 '11 at 06:35

5 Answers

2

Unfortunately, it seems to me that the reigning language is MATLAB. I say unfortunately because I neither like nor use it; I'm much more likely to program in C++ or Java. But the data miners and machine learning people around me tend to stick to MATLAB.

Edit: I've just read a really interesting line on Wikipedia's page on R:

According to Rexer's Annual Data Miner Survey in 2010, R has become the data mining tool used by more data miners (43%) than any other.

B. Decoster
1

I'm not experienced in Java and Hadoop, but I have used both Python and MATLAB for machine learning, and I use MATLAB more often now. The important factors in my case are as follows:

  • Almost all of my colleagues use MATLAB and C++, and very few of them use Python. Their Python usage is limited to general scripting, not machine learning in particular. So when I use Python, the only way to get help is the web, and we have trouble sharing code within the lab.
  • MATLAB's IDE and its extensive documentation make it powerful for my case.
  • You can handle large data sets in MATLAB. link 1 link2
  • There are many machine learning/data mining libraries written in MATLAB, and most of the libraries written in C++/Java have MATLAB wrappers.

Some of these points are also true for Python. But as I mentioned, the community I work in plays an important role in deciding on the language.

petrichor
1

R is an excellent candidate for data mining (certainly) and machine learning as well.

(Generalizations, of course.)

Java and Hadoop are really meaningful in the context of seriously big data and/or scaling requirements. Java gives you the libraries and an army of programmers. Hadoop gives you fairly painless distribution and a growing knowledge base of mapping various algorithms to the framework.
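To give a feel for what "mapping an algorithm to the framework" means, here is a minimal sketch of a word count in the style of a Hadoop streaming job, where the mapper and reducer exchange tab-separated key/value lines. It is plain Python run on one machine, not Hadoop's own API: the function names `mapper`, `reducer` and `run_local` are made up for illustration, and the `sorted()` call stands in for the framework's shuffle/sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. Input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Hypothetical helper: simulate map -> sort -> reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_local(["big data big", "data mining"]):
        print(out)
```

In a real streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, and the sorting between them would be done by the cluster rather than by `sorted()`.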

Python seems to have the academics on its side, especially recent graduates who are now active and influential in professional practice. Also, if you just want to try things out, an expressive dynamic language like Python will obviously prove quite useful.

Then there is R. (There is a lot more, but this is the extent of my knowledge /g/)

I think that, besides the obvious focus on data that R brings to the table (and thus a community of data geeks to help out with the science part as well), it is a delightfully lightweight system and not too shabby at all in terms of libraries either.

That said, one would think the (~) functional languages (Scala, Clojure on JVM; Haskell, etc.) would be quite a good fit for manipulating data and working on huge datasets.

alphazero
0

Python is gaining in popularity, has a lot of libraries, and is very useful for prototyping. However, I find it difficult to deploy because of the many versions of Python and its dependencies on C libraries.

R is also very popular, has a lot of libraries, and was designed for data science. However, the underlying language design tends to make things overcomplicated.

Personally, I prefer Clojure because it has great data manipulation support and can interoperate with the Java ecosystem. The downside at the moment is that there aren't many data science libraries for it yet!

0

I think the most popular combination in this field is Java/Hadoop. When a vacancy also requires Python/Perl/Ruby, it usually means the company is migrating from those scripting languages (usually its main languages until then) to Java as it moves from a startup code base to an enterprise one. Also, in real-world data mining applications Python is frequently used for prototyping and for small data processing tasks.

yura