
I'm running a Spark job with Spark version 1.4 and Cassandra 2.18. I can telnet from the master to the Cassandra machine and it works. Sometimes the job runs fine and sometimes I get the following exception. Why would this happen only sometimes?

"Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 172.28.0.162): java.io.IOException: Failed to open native connection to Cassandra at {172.28.0.164}:9042 at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155) "

Sometimes it also gives me this exception along with the one above:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.28.0.164:9042 (com.datastax.driver.core.TransportException: [/172.28.0.164:9042] Connection has been closed))
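
The connection goes through the DataStax spark-cassandra-connector (as the stack trace shows). A stripped-down job of the same shape would look roughly like the sketch below; the keyspace and table names are placeholders, not my real schema:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // brings in sc.cassandraTable()

object MinimalCassandraJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("minimal-cassandra-job")
      // 172.28.0.164 is the Cassandra node from the stack trace; 9042 is the default native port
      .set("spark.cassandra.connection.host", "172.28.0.164")

    val sc = new SparkContext(conf)

    // Placeholder keyspace/table; the first action on this RDD is where the
    // connector opens the native connection that intermittently fails.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(rows.count())

    sc.stop()
  }
}
```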

Nipun
  • Have you seen [this question](http://stackoverflow.com/q/30927615/1084879)? – zapstar Aug 11 '15 at 17:37
  • Yes, I have. The problem is that I only get it sometimes; sometimes my code runs fine. When I restart my master and slaves it works, and after running my job 2-3 times it gives me this error again. I closed all the TIME_WAIT ports but still see this issue – Nipun Aug 12 '15 at 08:28

1 Answer


I had the second error, "NoHostAvailableException", happen to me quite a few times this week while porting a Python Spark job to Java Spark.

I was having issues with the driver being nearly out of memory, and the GC was taking up all my cores (98% across all 8 cores), pausing the JVM all the time.
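
If you want to check whether you are hitting the same thing, giving the executors more heap and turning on GC logging makes it obvious fairly quickly. A rough sketch (the sizes are only examples, and the driver heap itself normally has to be set on spark-submit with --driver-memory rather than in code):

```scala
import org.apache.spark.SparkConf

// Sketch only: memory sizes are examples, tune them to your machines.
val conf = new SparkConf()
  .setAppName("gc-check")
  .set("spark.executor.memory", "4g") // more executor heap
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps") // GC logs in the executor stderr
// Note: spark.driver.memory cannot be raised from inside the application in client
// mode, because the driver JVM is already running; pass --driver-memory to spark-submit.
```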

In Python, when this happens, it's much more obvious (to me), so it took me a while to realize what was going on in Java, and I hit this error quite a few times.

I had two theories about the root cause, but either way the fix was stopping the GC from going crazy.

  1. First theory: because the JVM was pausing so often, the driver simply couldn't connect to Cassandra in time.
  2. Second theory: Cassandra was running on the same machine as Spark, the Spark JVM was taking 100% of the CPU, so Cassandra couldn't answer in time and it looked to the driver as if there were no Cassandra hosts available (see the sketch below).
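
For the second theory, besides moving Cassandra to its own machine, you can stop Spark from grabbing every core and give the connector more slack before it gives up on the node. Something like the following; the property names are the ones documented for the 1.4-era spark-cassandra-connector, so double-check them against the version you are running, and the values are only examples:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shared-host-tuning")
  .set("spark.cores.max", "6") // leave a couple of cores free for Cassandra (example value)
  // Give the connector more time before it declares the node dead (milliseconds, example values).
  .set("spark.cassandra.connection.timeout_ms", "15000")
  .set("spark.cassandra.read.timeout_ms", "240000")
```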

Hope this helps!

Code Herder
  • Yep, my Cassandra is running on the same machine as my Spark worker. This could be the issue. Thanks a lot, I will move it to a separate machine – Nipun Aug 14 '15 at 03:33
  • It didn't solve the problem. I still get the issue. – Nipun Aug 18 '15 at 14:51