I am new to Apache Spark and wonder whether it is suitable for my specific scenario. I am crawling small datasets (stored as JSON documents in MongoDB). The documents all describe the same kind of entity, but they may have different structures: a given JSON document in a collection may contain more or fewer key/value pairs than the others. My goal is to run machine learning (classification / regression) algorithms on these data files and derive information from them.
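To make the schema-variation issue concrete, here is a minimal sketch in plain Python (the field names and values are hypothetical, just for illustration). It normalizes records with differing key sets to the union of all keys, filling missing fields with `None` — which is essentially what Spark's JSON reader does when it infers a merged schema across heterogeneous documents:

```python
import json

# Hypothetical sample records: JSON documents for the same entity,
# but with differing sets of key/value pairs.
raw_docs = [
    '{"id": 1, "price": 10.0, "color": "red"}',
    '{"id": 2, "price": 12.5}',                   # no "color"
    '{"id": 3, "color": "blue", "weight": 2.3}',  # extra "weight", no "price"
]

records = [json.loads(d) for d in raw_docs]

# Union of all keys -> a common schema; missing values become None,
# similar to the nulls Spark produces for absent JSON fields.
all_keys = sorted({k for r in records for k in r})
normalized = [{k: r.get(k) for k in all_keys} for r in records]

print(all_keys)       # ['color', 'id', 'price', 'weight']
print(normalized[1])  # {'color': None, 'id': 2, 'price': 12.5, 'weight': None}
```

With Spark itself, `spark.read.json(...)` would give a DataFrame with this unioned schema directly, so the varying structures are not an obstacle in principle; you would still need to decide how your ML algorithms should treat the resulting nulls.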
Given this case, do you think Spark is a good fit for speeding things up by processing the data in parallel in a cluster environment? Or do you think I should look at other alternatives?
Thank you.