
I am looking for a solution to build an application with the following features:

  • A database composed of, potentially, millions of rows in one table, which might be related to a few small ones.

  • Fast single queries, such as "SELECT * FROM table WHERE field LIKE '%value'"

  • It will run on a Linux server: a single node for now, but maybe multiple nodes in the future.

Do you think Python and Hadoop are a good choice?

Where could I find a quick example written in Python that adds information to Hadoop and retrieves it, so I can see a proof of concept running with my own eyes and make a decision?

Thanks in advance!

  • This is way too broad a question, but by the sounds of it Hadoop seems overkill. A traditional in-memory framework with support for SQL (e.g. Django for a web app, Pandas for data analysis, etc.) should be more than enough and faster. – jdehesa Aug 11 '17 at 09:25
  • When you say Hadoop, do you mean HDFS? If so, then you would want to look at Apache Parquet. "Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language." https://parquet.apache.org/ – ohad edelstain Aug 11 '17 at 09:27
  • I agree that the question is broad, but fortunately there is a straightforward answer, so I don't think it needs to be closed. – Dennis Jaheruddin Aug 11 '17 at 09:32
  • @ohadedelstain I don't want to discourage you, but the comment doesn't really hit the spot here, so perhaps you might want to delete it. – Dennis Jaheruddin Aug 11 '17 at 09:35
  • Thanks for your suggestion Ohad, I'll take a look at Apache Parquet –  Aug 11 '17 at 10:50

1 Answer


Not sure whether these questions are on topic here, but fortunately the answer is simple enough:

These days a million rows is simply not that large anymore; even Excel can hold more than a million. If you have a few million rows in a large table and want to run quick, small SELECT statements, you are probably better off without Hadoop.

Hadoop is great for sets of 100 million rows or more, but it does not scale down very well (in performance and in required maintenance).

Therefore, I would recommend trying a 'normal' database solution like MySQL, at least until your data starts growing significantly.
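
For the Python proof of concept you asked about, a minimal sketch along the lines below should be enough. It is only a sketch: it assumes the mysql-connector-python package is installed and uses a hypothetical table big_table with a column field, so adjust the names and credentials to your own schema.

```python
# Minimal proof of concept: add a row to MySQL and run the kind of query
# from the question. Assumes `pip install mysql-connector-python` and a
# hypothetical table big_table with a VARCHAR column named field.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="app_user",      # hypothetical credentials and database name
    password="secret",
    database="app_db",
)
cur = conn.cursor()

# Add information
cur.execute("INSERT INTO big_table (field) VALUES (%s)", ("some value",))
conn.commit()

# Retrieve information; the pattern is passed as a parameter so it is escaped safely
cur.execute("SELECT * FROM big_table WHERE field LIKE %s", ("%value",))
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```

One thing to keep in mind regardless of the database you pick: a LIKE pattern with a leading wildcard ('%value') cannot use an ordinary B-tree index, so on millions of rows you may need a FULLTEXT index or a different query shape to keep those lookups fast.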


You can use Python for advanced analytical processing, but for simple queries I would recommend using SQL.
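
If you later do want to pull part of the data into Python for analysis, a sketch like the one below keeps the filtering in SQL and only hands the matching rows to Pandas. Again this is just a sketch under assumptions: it expects pandas, SQLAlchemy and mysql-connector-python to be installed and reuses the hypothetical big_table and credentials from the example above.

```python
# Sketch: let MySQL do the filtering, then analyse the (much smaller)
# result set in pandas. Assumes `pip install pandas sqlalchemy mysql-connector-python`
# and the same hypothetical table/credentials as the previous example.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqlconnector://app_user:secret@localhost/app_db")

# Bind the LIKE pattern as a parameter; only matching rows leave the database
query = text("SELECT * FROM big_table WHERE field LIKE :pattern").bindparams(pattern="%value")

df = pd.read_sql(query, engine)
print(df.describe())
```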

Dennis Jaheruddin
  • I meant millions, and I've already checked: MySQL has serious problems processing a well-designed database with that amount of data –  Aug 11 '17 at 10:48
  • @hertsmael Perhaps you can give some exact numbers, because there is still a large difference between millions (probably SQL scale) and hundreds of millions (where Hadoop becomes interesting). -- In addition, Hadoop mainly just adds overhead if you use it on a single box, so if you have a box that can process the data, don't bother with it. -- Some reference for SQL scalability (perhaps MySQL is not the most scalable, but then go for Oracle or so): https://stackoverflow.com/a/1995078/983722 – Dennis Jaheruddin Aug 11 '17 at 10:55
  • I really would like to give specific details; all I know is that the database is around 2.5 TB of information –  Aug 11 '17 at 11:36