Questions tagged [distributed-computing]

Using more than one computer, connected by a communication link, to accomplish a common task.

Distributed computing is a field of study that describes how multiple connected computing units can work together on a common task. The combined computing power lets a project handle more work than a single unit could, and searches can be coordinated across units for efficiency. When a search succeeds, credit usually goes to the participant who made the find.

Distributed computing projects include hunting for large prime numbers and analysing DNA codes.

2821 questions
413 votes, 8 answers

Explaining Apache ZooKeeper

I am trying to understand ZooKeeper: how it works and what it does. Is there any application that is comparable to ZooKeeper? If you know, how would you describe ZooKeeper to a layman? I have tried the Apache wiki, ZooKeeper SourceForge... but I…
topgun_ivard
410 votes, 20 answers

Spark - repartition() vs coalesce()

According to Learning Spark: Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the…
Praveen Sripati
296 votes, 2 answers

What are workers, executors, cores in Spark Standalone cluster?

I read Cluster Mode Overview and I still can't understand the different processes in the Spark Standalone cluster and the parallelism. Is the worker a JVM process or not? I ran the bin\start-slave.sh and found that it spawned the worker, which is…
Manikandan Kannan
239 votes, 6 answers

What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
user1261215
131 votes, 25 answers

Calculate the median of a billion numbers

If you have one billion numbers and one hundred computers, what is the best way to locate the median of these numbers? One solution which I have is: Split the set equally among the computers. Sort them. Find the medians for each set. Sort the sets…
anony
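For the median question above, one technique that avoids moving the data is a binary search on the value itself, where each machine only reports a count. The sketch below simulates that in plain Python. It is not the asker's proposal and not taken from any answer; it assumes integer values, shards that are already sorted locally and not all empty, and the names `distributed_median` and `shards` are hypothetical.

```python
from bisect import bisect_right


def distributed_median(shards):
    """Return the lower median of all values spread across the shards.

    Each shard is a locally sorted list standing in for one machine.
    The coordinator only sees per-shard counts and the global min/max,
    never the raw data.
    """
    total = sum(len(s) for s in shards)
    k = (total - 1) // 2  # 0-based rank of the lower median
    lo = min(s[0] for s in shards if s)
    hi = max(s[-1] for s in shards if s)
    # Find the smallest value v such that more than k values are <= v.
    while lo < hi:
        mid = (lo + hi) // 2
        # Each "machine" reports how many of its values are <= mid.
        count = sum(bisect_right(s, mid) for s in shards)
        if count > k:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Two simulated machines holding 1..4 between them.
print(distributed_median([[1, 3], [2, 4]]))  # prints 2
```

Each round costs one small message per machine, and the number of rounds grows with the logarithm of the value range rather than with the amount of data.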
95 votes, 4 answers

Meaning of inter_op_parallelism_threads and intra_op_parallelism_threads

Can somebody please explain the following TensorFlow terms: inter_op_parallelism_threads and intra_op_parallelism_threads? Or please provide links to the right source of explanation. I have conducted a few tests by changing the parameters, but the…
68 votes, 3 answers

2PC vs Sagas (distributed transactions)

I'm developing my insight about distributed systems, and how to maintain data consistency across such systems, where business transactions cover multiple services, bounded contexts and network boundaries. Here are two approaches which I know are…
Tuomas Toivonen
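To make the two approaches named in the question above concrete, here is a minimal in-process sketch in Python. It only illustrates the control flow: a saga runs local steps and compensates on failure, while two-phase commit collects votes before committing anywhere. All names (`run_saga`, `Participant`, `two_phase_commit`, `SagaFailed`) are hypothetical, and the sketch deliberately ignores timeouts, coordinator crashes, and durable logging, which real systems have to handle.

```python
class SagaFailed(Exception):
    """Raised after a saga has compensated its completed steps."""


def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    succeeded, in reverse order, then raise SagaFailed.
    """
    compensations = []
    for action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for undo in reversed(compensations):
                undo()
            raise SagaFailed(str(exc)) from exc
        compensations.append(compensate)


class Participant:
    """Toy 2PC participant that votes in prepare() and records its state."""

    def __init__(self, vote):
        self.vote = vote
        self.state = "init"

    def prepare(self):
        self.state = "prepared" if self.vote else "refused"
        return self.vote

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit only if every vote is yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

The visible difference in this sketch is that the saga lets each step take effect immediately and repairs afterwards, whereas two-phase commit makes nothing final until every participant has agreed.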
67 votes, 3 answers

Apache Spark vs Akka

Could you please tell me the difference between Apache Spark and Akka? I know that both frameworks are meant for programming distributed and parallel computations, yet I don't see the link or the difference between them. Moreover, I would like to get the…
user4658980
62 votes, 4 answers

Why isn't RDBMS Partition Tolerant in CAP Theorem and why is it Available?

Two points I don’t understand about RDBMS being CA in the CAP Theorem: 1) It says RDBMS is not Partition Tolerant, but how is RDBMS any less Partition Tolerant than other technologies like MongoDB or Cassandra? Is there an RDBMS setup where we give up CA…
Glide
61 votes, 5 answers

Difference between cloud computing and distributed computing?

I wanted to know the difference between cloud computing and distributed computing. I read an article about cloud computing and got the feeling that there is some relation between cloud computing and distributed computing, and so wanted to…
Rachel
60 votes, 4 answers

Service discovery vs load balancing

I am trying to understand in which scenario I should pick a service registry over a load balancer. From my understanding both solutions are covering the same functionality. For instance if we consider consul.io as a feature list we have: Service…
60 votes, 1 answer

"Eventual Consistency" vs "Strong Eventual Consistency" vs "Strong Consistency"?

I came across the concept of "Strong Eventual Consistency". Is it supposed to be stronger than "Eventual Consistency" but weaker than "Strong Consistency"? Could someone explain the differences among these three concepts with applicable…
njzhxf
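Strong eventual consistency is often illustrated with conflict-free replicated data types (CRDTs), where replicas that have seen the same set of updates hold the same state regardless of delivery order. The following is a minimal sketch of one such type, a grow-only counter, in plain Python. The class name `GCounter` and its methods are hypothetical and not tied to any particular library.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        # A replica only ever updates its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so merges can
        # arrive in any order, or more than once, without changing the result.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)   # update applied locally on replica a
b.increment(3)   # concurrent update on replica b
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # prints 5 5
```

Under plain eventual consistency the replicas would only be required to converge at some point, possibly after conflict resolution; here no resolution step is needed because the merge function itself cannot conflict.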
51 votes, 1 answer

What is spark.driver.maxResultSize?

The ref says: Limit of total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may…
gsamaras
51 votes, 1 answer

Flattening Rows in Spark

I am doing some testing for Spark using Scala. We usually read JSON files which need to be manipulated like the following example: test.json: {"a":1,"b":[2,3]} val test = sqlContext.read.json("test.json") How can I convert it to the following…
51 votes, 2 answers

What is a task in Spark? How does the Spark worker execute the jar file?

After reading some documentation on http://spark.apache.org/docs/0.8.0/cluster-overview.html, I have some questions that I want to clarify. Take this example from Spark: JavaSparkContext spark = new JavaSparkContext( new…
EdwinGuo