
This link describes an excellent way of writing to HBase via the Java API:

Writing to HBase via Spark: Task not serializable

How can I do the same for reading data from HBase?

Say, for example, I have one property (name -> file) that gives me the start and stop rows. Each line of the file provides a separate start and stop row, so the ranges can be divided among different servers. Another property gives me the server name and table name. Can anyone please provide a sample showing how this can be done? I tried, but I keep getting a "not serializable" error.

(This is what I tried: Spark serialization error)

Any help will be deeply appreciated.

user2799564
  • It's a common problem, and TBH I do not fully understand it myself. But after some digging adding `@transient` (as per my comment http://stackoverflow.com/a/23620239/1586965) to all your non-serializable types should work. For me to better help, please could you post the minimum code, particularly for the class that is not serializing properly. – samthebest Aug 12 '14 at 19:31

1 Answer


First -- you haven't given any detail at all. Nobody can help determine what is not serializing this way.

Any object that is used or referenced by the functions you execute in Spark must be Serializable by default. It's up to you to either design the objects that way, or not reference them. It can get complex to reason about what's referenced when inner classes come into play, and Functions are usually inner classes.
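To make the inner-class point concrete, here is a minimal sketch in plain Java that uses only JDK serialization. It is an illustration under assumptions, not the asker's code: `SerFunction`, `Job`, and `PrefixLength` are hypothetical names, `SerFunction` merely stands in for a Spark function interface, and no Spark or HBase classes are involved. The anonymous inner class reads a field of its enclosing `Job`, so it carries a hidden reference to the whole (non-serializable) object, while the static nested class receives only the `String` it needs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class CaptureDemo {
    /** Hypothetical stand-in for a serializable Spark function interface. */
    interface SerFunction<T, R> extends Serializable {
        R call(T t);
    }

    /** Driver-side class that is NOT Serializable (imagine it owns an HBase connection). */
    static final class Job {
        private final Object connection = new Object(); // placeholder for a live resource
        private final String prefix;

        Job(String prefix) {
            this.prefix = prefix;
        }

        /** Anonymous inner class: reading `prefix` keeps a reference to the whole Job. */
        SerFunction<String, Integer> innerFunction() {
            return new SerFunction<String, Integer>() {
                @Override
                public Integer call(String s) {
                    return prefix.length() + s.length();
                }
            };
        }

        /** Static nested class: only the String it needs is copied in. */
        SerFunction<String, Integer> staticFunction() {
            return new PrefixLength(prefix);
        }
    }

    static final class PrefixLength implements SerFunction<String, Integer> {
        private static final long serialVersionUID = 1L;
        private final String prefix;

        PrefixLength(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public Integer call(String s) {
            return prefix.length() + s.length();
        }
    }

    /** Returns false if JDK serialization rejects the object graph. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Job job = new Job("row-");
        System.out.println("inner class serializable:  " + canSerialize(job.innerFunction()));
        System.out.println("static class serializable: " + canSerialize(job.staticFunction()));
    }
}
```

Running a check like `canSerialize` on a function before handing it to Spark can be a quick way to find out which reference is dragging in the offending object.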

Still, there's always a straightforward reason why something of yours is not Serializable. I would advise trying to avoid non-static inner classes. Check that you're not pointlessly holding references to objects in your Function. Convert them to not hold these refs, not need them, or just use a part of the value that is Serializable. Next you may have to mark some objects Serializable if they aren't, and if it makes sense to let default serialization manage serialization. I disagree that @transient is a fix; it just makes some fields not be sent at all, which is only appropriate in certain cases. It will likely lead to surprising NullPointerExceptions when you find values have "disappeared" (not serialized) on the remote machine.
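The "disappearing value" effect of transient fields can be reproduced with JDK serialization alone. The sketch below is hypothetical: `Client` stands in for any non-serializable resource such as a table handle, and `Reader`, `readUnsafe`, and `readSafe` are illustrative names. After a serialization round trip the `transient` field comes back as `null`, so it only works if the code rebuilds it on the remote side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class TransientDemo {
    /** Hypothetical non-serializable resource (e.g. a client or connection). */
    static final class Client {
        String fetch(String row) {
            return "value-of-" + row;
        }
    }

    static final class Reader implements Serializable {
        private static final long serialVersionUID = 1L;

        // Skipped by serialization; the initializer does NOT run again on deserialization.
        private transient Client client = new Client();

        /** Works locally, but throws NullPointerException on a deserialized copy. */
        String readUnsafe(String row) {
            return client.fetch(row);
        }

        /** Rebuilds the transient field on first use, which makes `transient` safe here. */
        String readSafe(String row) {
            if (client == null) {
                client = new Client();
            }
            return client.fetch(row);
        }
    }

    /** Serializes and deserializes a value, roughly what happens when a task is shipped. */
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Reader copy = roundTrip(new Reader());
        System.out.println(copy.readSafe("r1"));
        try {
            roundTrip(new Reader()).readUnsafe("r1");
        } catch (NullPointerException e) {
            System.out.println("transient field was null after deserialization");
        }
    }
}
```

So marking a field transient is reasonable when the remote side can recreate the value, and risky when the value is simply expected to be there.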

Sean Owen
  • Thanks @SeanOwen. I agree `@transient`ing everything is only appropriate in certain cases. Would you agree the natural use case is when a class holds non-serializable fields and one wishes to use a method from that class inside a lambda being passed to an `RDD` higher-order function? If not, how else can we use that method? – samthebest Aug 13 '14 at 08:05
  • I think I'd try to refactor if possible. If a `Function` calls a method and the method "really" needs no state, then it should be a `static` method somewhere. Sometimes `Function`s are defined in some big bloated central class that has 20 fields of various managers and connections. It just has to be further decomposed and separated to only depend on some smaller object with just the essential (and serializable) state. – Sean Owen Aug 13 '14 at 10:27
  • Hmm, I guess if a method needs no state from the class, it should be in the Companion Object of that class. Nice! – samthebest Aug 13 '14 at 10:44
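
The refactoring discussed in these comments can be sketched in Java, where a `static` method plays the role of a Scala companion-object method. All names here (`BloatedJob`, `RowKeys`, `SerFunction`) are hypothetical and no Spark classes are used. A lambda that calls an instance method captures `this`, while a reference to a static method captures nothing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Locale;

final class StaticHelperDemo {
    /** Hypothetical stand-in for a serializable Spark function interface. */
    interface SerFunction<T, R> extends Serializable {
        R call(T t);
    }

    /** A "big central class" that is NOT Serializable. */
    static final class BloatedJob {
        private final Object connection = new Object(); // placeholder for a live resource

        // Needs no state from the class, yet it is an instance method.
        String normalize(String row) {
            return row.trim().toLowerCase(Locale.ROOT);
        }

        /** The lambda calls an instance method, so it captures `this`. */
        SerFunction<String, String> viaInstanceMethod() {
            return row -> normalize(row);
        }

        /** The method reference targets a static method, so nothing is captured. */
        SerFunction<String, String> viaStaticMethod() {
            return RowKeys::normalize;
        }
    }

    /** Stateless helper holding the logic that never needed the job's fields. */
    static final class RowKeys {
        private RowKeys() {}

        static String normalize(String row) {
            return row.trim().toLowerCase(Locale.ROOT);
        }
    }

    /** Returns false if JDK serialization rejects the object graph. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BloatedJob job = new BloatedJob();
        System.out.println("instance-method lambda: " + canSerialize(job.viaInstanceMethod()));
        System.out.println("static-method lambda:   " + canSerialize(job.viaStaticMethod()));
    }
}
```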