35

I have been a C++ developer for about 10 years. I need to pick up Java just for Hadoop. I doubt I will be doing anything else in Java. So, I would like a list of things I would need to pick up. Of course, I would need to learn the core language, but what else?

I did Google around for this, and this could be seen as a possible duplicate of "I want to learn Java. Show me how?" but it's not. Java is a huge programming language with lots of libraries, and what I need to learn will depend largely on what I am using Hadoop for. But I suppose it is possible to say something like "don't bother learning this". That would be quite useful too.

Nikhil
  • 14
    I really don't like these questions, as they assume that we will know what you need to do for Hadoop. If I were you, I would get a good introduction to Java and its provided utilities before starting with Hadoop. Java is huge; in Java SE alone there are a lot of collections, generic usage, and classes surrounding it. What you're asking for is how to learn just the Japanese needed to send a letter to your favorite comic book maker. – monksy Apr 20 '11 at 14:58
  • 3
    You can actually use Hadoop with C++. It is called the Streaming library. That's the way Python, PHP and others work with Hadoop. – Thomas Jungblut Apr 20 '11 at 15:28

12 Answers

50

In my day job, I've just spent some time helping a C++ person to pick up enough Java to use some Java libraries via JNI (Java Native Interface) and then shared memory into their primarily C++ application. Here are some of the key things I noticed:

  1. You cannot manage anything beyond a toy project without an IDE. The very first thing you should do is download a popular Java IDE (Eclipse is a fine choice, but there are also alternatives including NetBeans and IntelliJ). Do not be tempted to try and manage with vi / emacs and javac / make. You will be living in a cave and not realising it. Once you're up to speed with even basic IDE functions you will be literally dozens of times more productive than without an IDE.
  2. Learn how to lay out a simple project structure and packages. There will be simple walkthroughs of how to do this on the Eclipse site or elsewhere. Never put anything into the default package.
  3. Java has a type system whereby the reference and primitive types are relatively separate for historic / performance reasons.
  4. Java's generics are not the same as C++ templates. Read up on "type erasure".
  5. You may wish to understand how Java's GC works. Just google "mark and sweep" - at first, you can just settle for the naivest mental model and then learn the details of how a modern production GC would do it later.
  6. The core of the Collections API should be learned without delay. Map / HashMap, List / ArrayList & LinkedList and Set should be enough to get going.
  7. Learn modern Java concurrency. Thread is an assembly-language level primitive compared to some of the cool stuff in java.util.concurrent. Learn ConcurrentHashMap, Atomic*, Lock, Condition, CountDownLatch, BlockingQueue and the threadpools from Executors. Good books here are those by Brian Goetz and Doug Lea.
  8. As soon as you want to use 3rd party libraries, you'll need to learn how the classpath works. It's not rocket science, but it is a bit verbose.
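
Points 3 and 4 are the ones that most often surprise C++ people, so here is a minimal sketch of both. The class and method names (`ErasureDemo`, `sameRuntimeClass`, and so on) are made up for illustration, and the code assumes Java 7 or later for the diamond operator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these names come from the answer above.
class ErasureDemo {

    // Unlike C++ templates, List<String> and List<Integer> are not separate
    // instantiations: the type argument is erased, so both share one runtime class.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    // Primitives cannot be type arguments, so each int is stored as a boxed Integer.
    static int sumBoxed(List<Integer> values) {
        int total = 0;
        for (int v : values) { // auto-unboxing: Integer -> int
            total += v;
        }
        return total;
    }

    // On reference types, == compares identity; use equals() to compare values.
    static boolean boxedValuesEqual(int a, int b) {
        Integer x = Integer.valueOf(a);
        Integer y = Integer.valueOf(b);
        return x.equals(y);
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        values.add(1); // auto-boxing: int -> Integer
        values.add(2);
        values.add(3);
        System.out.println(sameRuntimeClass());
        System.out.println(sumBoxed(values));
        System.out.println(boxedValuesEqual(1000, 1000));
    }
}
```

The practical consequence is that there is no `List<int>`; a collection of numbers is a collection of objects, which matters for memory use on large data sets.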

If you're a low-level C++ guy, then you may find some of this interesting also:

  1. Java has virtual dispatch by default. The keyword static on a Java method is used to indicate a class method. private Java methods use invokespecial dispatch, which is a dispatch onto the exact type in use.
  2. On an Oracle VM at least, objects comprise two machine words of header (the mark word and the class word). The mark word is a bunch of flags the VM uses - notably for thread synchronization. The class word you can think of as a pointer to the VM's representation of the Class object (which is where the vtables for methods live). Following the class word are the member fields of the instance of the object.
  3. Java .class files are an intermediate language, and not really that similar to x86 object code. In particular there are lots more useful tools for .class files (including the javap disassembler which ships with the JDK).
  4. The Java equivalent of the symbol table is called the Constant Pool. It's typed and it has a lot of information in it - arguably more than the x86 object code equivalent.
  5. Java virtual method dispatch consists of looking up the correct method to be called in the Constant Pool and then converting that to an offset into a vtable. Then walking up the class hierarchy until a not-null value is found at that vtable offset.
  6. Java starts off interpreted and then goes compiled (for Oracle and some other VMs anyway). The switch to compiled mode is done method-by-method on an as-needed basis. When benchmarking and perf tuning you need to make sure that you've warmed the system up before you start, and you should typically profile at the method level to start with. The optimizations that are made can be quite aggressive / optimistic (with a check and a fallback if the assumptions are violated) - so perf tuning is a bit of an art.
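
To make point 1 of this list concrete, here is a small sketch of "virtual by default". The class names (`DispatchDemo`, `Animal`, `Dog`) are invented for the example; only the language behaviour is the point.

```java
// Hypothetical classes used only to illustrate dispatch rules.
class DispatchDemo {

    // Calls an instance method through a base-class reference.
    static String viaBaseReference() {
        Animal a = new Dog();
        return a.name(); // resolved at run time against the actual object type
    }

    public static void main(String[] args) {
        System.out.println(viaBaseReference());
        System.out.println(Animal.kind());
        System.out.println(Dog.kind());
    }
}

class Animal {
    // No 'virtual' keyword: instance methods can be overridden unless marked final.
    String name() {
        return "Animal";
    }

    // A static method is a class method and is bound at compile time.
    static String kind() {
        return "Animal.kind";
    }
}

class Dog extends Animal {
    @Override
    String name() {
        return "Dog";
    }

    // This hides Animal.kind(); it does not override it.
    static String kind() {
        return "Dog.kind";
    }
}
```

Coming from C++, the thing to remember is that the default is reversed: you opt out of overriding with `final` rather than opting in with `virtual`.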

Hopefully there's some useful stuff in there to be going on with - please comment / ask followup questions.

kittylyst
  • Thank you. Exactly what I was looking for. I am accepting your answer and you have your bounty. If you have anything to add, please go ahead. Funny you point out the IDE thing... I have been an Emacs user for as long as I can remember. – Nikhil Apr 28 '11 at 13:17
  • 5
    The IDE thing really surprised me - the team are senior guys - 10+ years in most cases - and I had to explain it several times. I'd come back and they'd have back-slid into using emacs or vi just by habit. That's why I was so forceful about it in my answer. It took some time for them to get the benefit of syntax highlighting, basic refactoring and other simple IDE features because it's just not the normal dev practices in C++. Thank you for the bonus. – kittylyst Apr 28 '11 at 15:52
  • This is an amazing answer, It's a pity the question was closed because this is exactly the type of thing that SO should be preserving. – Bill K Jul 15 '15 at 16:05
  • I disagree vehemently with point 1. More important than an IDE is choosing a build tool that will let you manage your project. I think the poster is confusing IDE with build tool. Understanding the Java "stack" is 50% Java language and 50% build tool... Ant, Maven, Gradle, etc. Your project will behave/look very differently based on what choice you make. Unfortunately this is the knowledge most Java developers lack and don't see the value in learning (I have 10 years' experience with ems/java guys). The stuff you get with an IDE is nice, but the best thing about Eclipse is the remote debugger. That's it. – niken Apr 11 '17 at 14:47
17

Learning "just enough" Java is learning Java. Either you learn all the core principles and language design decisions, or you suffer along making easily avoidable mistakes. Considering that you already know how to program, a lot of the information can be skimmed (with an eye for where it differs from other languages you are intimately familiar with).

So you need to learn:

  1. How to get started
  2. The language itself
  3. The core, essential classes
  4. The major Collections

And if you don't have a build framework in place, how to package your compiled code.

Beyond that, nearly every other item you might need to learn depends heavily on what you intend to do. Don't discount the on-line tutorials from Oracle/Sun, they are quite good (compared to other online tutorials).

Edwin Buck
  • 2
    I agree with this. If you intend to use Java for anything (be it only Hadoop) you'll need to master all the core concepts of the Java ecosystem. The next problem will be that soon you'll need to master many other things, even for just Hadoop, depending on what you are really trying to achieve. You could come back with another question then, or take a look at JEE and Spring (the 2 most widely used enterprise frameworks on Java). – Nicolas Bousquet Apr 26 '11 at 15:51
13

Hadoop can use C++ : WordCount example in C++

acharuva
warren
7

You can't really use Java without knowing these packages in the standard API:

java.lang
java.util
java.io

And, to a lesser degree:

java.text
java.math
java.net
java.lang.reflect
java.util.concurrent

They contain a lot of classes you'll need to use constantly for pretty much any application, and it's a good idea to look through them until you know which classes they contain and what those are good for, lest you end up reinventing wheels.
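
As a taste of what is already in those packages, here is a sketch of a word count that uses only `java.util` and `java.util.concurrent`. The class name `ConcurrentWordCount`, the pool size of 4 and the 10 second timeout are arbitrary choices for the example, and it assumes Java 8 or later for lambdas and `Map.merge`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; pool size and timeout are arbitrary.
class ConcurrentWordCount {

    // Counts words across all lines, processing each line as a separate task.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (String line : lines) {
                pool.submit(() -> {
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            // merge() on ConcurrentHashMap updates each key atomically
                            counts.merge(word, 1, Integer::sum);
                        }
                    }
                });
            }
            pool.shutdown(); // accept no new tasks, let submitted ones finish
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow(); // no-op if the pool has already terminated
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a b a", "b c", "a")));
    }
}
```

Nothing here needed a hand-rolled thread, lock or hash table, which is the "reinventing wheels" risk the answer is warning about.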

Michael Borgwardt
7
  • Take it easy, learning Java could be pleasant and fast if you already know C++

    Buy these two books:

    1. The Java™ Programming Language (4th Edition), Ken Arnold, James Gosling, David Holmes
    2. Effective Java (2nd Edition), Joshua Bloch

You will soon be mastering Java, and you will not regret it. Good luck.

Hernán Eche
4

Since C++ and Java share common roots, the core language shouldn't give you too much trouble. You will need to become familiar with the Java SDK, particularly java.lang and the Collections framework (java.util).

But perhaps learning Java is overkill if you don't see yourself using it elsewhere. Hadoop also has bindings to Python - perhaps learning Python would be a better alternative? See Java vs Python on Hadoop.

Community
mdma
3

Answer 1 :

  • It is very desirable to know Java. Hadoop is written in Java. Its popular Sequence File format is dependent on Java.
  • Even if you use Hive or Pig, you'll probably need to write your own UDF someday. Some people still try to write them in other languages, but I guess that Java has more robust and primary support for them.
  • Most Hadoop tools are not mature enough (like Sqoop, HCatalog and so on), so you'll see many Java error stack traces and probably you'll want to hack the source code someday

Answer 2

  • It is not required for you to know Java.
  • As the others said, it would be very helpful depending on how complex your processing may be. However, there is an incredible amount you can do with just Pig and say Hive.
  • I would agree that it is fairly likely you will eventually need to write a user defined function (UDF), however, I've written those in Python, and it is very easy to write UDFs in Python.
  • Granted, if you have very stringent performance requirements, then a Java based MapReduce program would be the way to go. However, great advancements in performance are being made all of the time in both Pig and Hive.
  • So, the short answer to your question is, "No", it is not required for you to know Java in order to perform Hadoop development.

Source : http://www.linkedin.com/groups/Is-it-must-Hadoop-Developer-988957.S.141072851

Abhishek Goel
3

Here is the quickstart for all you will need. I suggest Eclipse (Java) to start working; see this for that.

edgarmtze
3

Maybe you don't even need to know Java to use Hadoop.

Pig is enough for anything from simple to advanced usage of Hadoop.

KARASZI István
  • ... although you'd probably consider writing "user-defined functions" for use with Pig. Of course you can use the streaming interface, but to keep everything nice and clean the UDFs are best composed in Java. – PP. Apr 20 '11 at 15:00
  • That's true but if he does not need to implement user-defined-functions then no Java knowledge is needed for Hadoop. – KARASZI István Apr 20 '11 at 15:04
  • 2
    Why not? No extra java knowledge is an answer, but of course it depends on the usage. – KARASZI István Apr 25 '11 at 16:37
3

I don't know how familiar you are with other higher-level programming languages. Garbage collection is an important function in Java. It would be important to read a bit about the GC in your VM of choice.

Besides the obvious packages, check out the java.util packages for the collection framework. You might want to check out the source of some classes. I suggest HashMap to get the idea of the computing/memory cost of these operations.

Java likes to use streams instead of buffers when processing large amounts of data. That may take some time getting used to.
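
Here is a rough sketch of what that stream style looks like in practice. The class name `StreamingLineCount` is made up, and the input comes from a `StringReader` only to keep the example self-contained; with real data you would wrap a file or socket stream in the same way. It assumes Java 8 or later for `UncheckedIOException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example class for line-at-a-time processing.
class StreamingLineCount {

    // Reads one line at a time, so the whole input is never held in memory at once.
    static long countNonEmptyLines(Reader source) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonEmptyLines(new StringReader("first\n\nsecond\n")));
    }
}
```

The try-with-resources block closes the reader (and the underlying source) on the way out, which is the closest everyday Java gets to RAII.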

Java has no unsigned types. Depending on the packets of data you need to process at once, you can either use larger variables and straight arithmetic (if we're talking about relatively small packets), or you have to apply (b[i] & 0xff) every time you read, for example, unsigned bytes. Also note that Java uses network byte order (big-endian, most significant byte first) when serializing multibyte numbers.
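
A small sketch of the masking trick and the byte-order point, using only `java.nio`. The class and method names (`UnsignedBytes`, `readUnsignedInt`, and so on) are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class; the names are illustrative only.
class UnsignedBytes {

    // A Java byte is signed (-128..127); masking with 0xff gives 0..255 as an int.
    static int toUnsigned(byte b) {
        return b & 0xff;
    }

    // Assembles four big-endian bytes into an unsigned 32-bit value held in a long.
    static long readUnsignedInt(byte[] b, int offset) {
        return ((long) (b[offset] & 0xff) << 24)
             | ((long) (b[offset + 1] & 0xff) << 16)
             | ((long) (b[offset + 2] & 0xff) << 8)
             | (long) (b[offset + 3] & 0xff);
    }

    // ByteBuffer writes in big-endian (network) order unless told otherwise.
    static byte[] bigEndianBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Explicitly switch the order when talking to little-endian native code.
    static byte[] littleEndianBytes(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned((byte) 0xF0));
        byte[] allOnes = {(byte) 0xff, (byte) 0xff, (byte) 0xff, (byte) 0xff};
        System.out.println(readUnsignedInt(allOnes, 0));
        System.out.println(bigEndianBytes(0x01020304)[0]);
        System.out.println(littleEndianBytes(0x01020304)[0]);
    }
}
```

If you are on Java 8 or later, `Byte.toUnsignedInt` does the same masking for you, but the explicit `& 0xff` is what you will see in most existing code.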

The most beloved design patterns by the API are Singleton, Decorator and Factory. Check the source of JFC itself for best practices, how these patterns are achieved in the language.
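
For a quick feel of how those three patterns show up, here is a sketch that only calls standard-library classes. Which classes best represent each pattern is my own pick, not something the answer specifies, and the wrapper class `ApiPatterns` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical wrapper class; the examples chosen for each pattern are assumptions.
class ApiPatterns {

    // Singleton: Runtime exposes one shared instance through a static accessor.
    static boolean runtimeIsShared() {
        return Runtime.getRuntime() == Runtime.getRuntime();
    }

    // Decorator: BufferedReader wraps another Reader and adds buffering and readLine().
    static String firstLine(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Factory method: you ask NumberFormat for an instance instead of calling a constructor.
    static String formatInteger(long value) {
        NumberFormat format = NumberFormat.getIntegerInstance(Locale.US);
        return format.format(value);
    }

    public static void main(String[] args) {
        System.out.println(runtimeIsShared());
        System.out.println(firstLine("hello\nworld"));
        System.out.println(formatInteger(1234567L));
    }
}
```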

... and you can still post more concrete questions on SO :)

vbence
2

Most of the stuff should be pretty familiar to you. I'd just download Eclipse and google a tutorial site. Familiarize yourself with classloading and the keywords. One tricky thing a lot of C++ guys run into is how to run a Java app so that it finds its library classes (sort of analogous to dynamic linking). Learn the difference between the JRE and the JDK. If you can get a few hello-world type apps working, you ought to be able to get a start on Hadoop if you follow the tutorials.

nsfyn55
1

You don't need to learn Java to use Hadoop.

You need to know Linux to install and configure Hadoop.

Then you can write your MapReduce jobs using the Streaming API in any language that understands standard input/output.

Further, you can do more complex MapReduce work using other libraries like Hive.

Even other components of Hadoop, like HBase or Cassandra, have clients for most languages.