
I have a method A() that compares a pair of 3D protein structures (3D objects). I would like to run this method 10000000 times, once for each of 10000000 pairs of proteins. Each protein description is in its own text file; they are separate. How can I parallelize the repeated method using Spark? Thanks for your help.

  • For starters, you can [load the files into an RDD](https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd) and use a DataFrame with a user defined function that applies A() to the columns of the DataFrame to make comparisons. More details are necessary to formulate a full example, but for the parallelization you can take a look at [this](https://forums.databricks.com/questions/2119/how-do-i-process-several-rdds-all-at-once.html) and [this](https://stackoverflow.com/questions/38069239/parallelize-avoid-foreach-loop-in-spark) – mkaran Jul 11 '17 at 09:40
  • hi mkaran, Method A() (i) reads 2 text files and extracts the 3D coordinates and other information, (ii) finds the maximum clique, (iii) saves the result to an array. Repeat A() for 1000000 pairs – Jimmy Le Viet Hung Jul 12 '17 at 07:01
  • A() is very fast, though it might have a computational bottleneck with large files, and Spark will help there. But the real problem is repeating A() 100000 times. I usually use an ad hoc multithreading implementation with a thread pool of 10 concurrent threads. My questions: 1) can my ad hoc implementation run on a cluster of thousands of nodes? I think it can only run on my typical PC (2 cores, 4 GB RAM). 2) How do I implement it in Spark? Thanks much – Jimmy Le Viet Hung Jul 12 '17 at 07:24
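As a rough illustration of mkaran's suggestion (load the pairs into an RDD and map the comparison over it), a minimal PySpark sketch might look like the following. The `pairs.txt` file (one pair of protein file paths per line), the `parse_pair` helper, and the `compare_structures` placeholder standing in for the asker's A() are all hypothetical names for illustration:

```python
from typing import Tuple

def parse_pair(line: str) -> Tuple[str, str]:
    # One pair per line, e.g. "protein_001.txt,protein_002.txt" (assumed format)
    a, b = line.split(",", 1)
    return a.strip(), b.strip()

def compare_structures(path_a: str, path_b: str) -> float:
    # Placeholder for the asker's A(): read both protein files, extract
    # the 3D coordinates, find the maximum clique, and return a score.
    return 0.0

def main() -> None:
    # Driver to run with spark-submit; requires a Spark installation.
    from pyspark import SparkContext

    sc = SparkContext(appName="protein-pairs")
    # Each comparison is independent, so a plain map() lets Spark spread
    # the work over all executor cores in the cluster -- no manual
    # thread pool is needed.
    results = (sc.textFile("pairs.txt")
                 .map(parse_pair)
                 .map(lambda p: (p, compare_structures(*p))))
    results.saveAsTextFile("results")
```

Submitted via spark-submit, Spark partitions the pair list into tasks and schedules them across however many cores the cluster offers, which replaces the hand-rolled 10-thread pool with cluster-wide parallelism.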

0 Answers