I am working on a project that needs to store a considerable amount of data, and I was wondering what the difference is between using SQL and the datascience library in Python. If I choose SQL, I would use it through one of its Python libraries; if I choose "datascience", I would store the data in a CSV file. I am leaning strongly towards "datascience" because I see the following advantages:

  1. Subjectively, it is very easy for me to use, and I make far fewer mistakes.
  2. With my limited knowledge of runtime performance, I think the datascience library will be more efficient.
  3. Most importantly, it has many built-in functions that would make my own functions easier to write.

However, since so many people are using SQL, I was wondering if I am missing something major, particularly in scalability.

Some people online said that SQL allows us to store files on a database, but I do not see how that makes a difference. I can simply store the file in a folder on a system and save the link in the "datascience" table.

Kavi Vaidya
  • What do you mean by "datascience library" in Python? – antonioACR1 Jul 23 '18 at 19:43
  • Well... SQL is particularly good at searching, filtering, reordering, grouping, ranking, etc. You can also index and get high performance. Well... all of this at the cost of actually learning SQL -- something that can take a few months at least. – The Impaler Jul 23 '18 at 19:43
  • For simple operations SQL is straightforward. However for data manipulation it's better to use Python libraries. As a simple example, try to transpose a table in SQL: https://stackoverflow.com/questions/15297809/sql-transpose-full-table. As you can see, it becomes unnecessarily complicated whereas using Pandas in Python becomes really simple: `df.T`. – antonioACR1 Jul 23 '18 at 20:00
  • However, learning SQL is important. For example, if you want to deal with Big Data, you can use other tools such as Hive to perform operations in an SQL-based language, and learning Hive is simpler if you already know SQL. You can also use PySpark from Python. So I think you should use SQL together with Python libraries for data science, e.g. NumPy, Pandas, Matplotlib, scikit-learn, etc. Both are really important. – antonioACR1 Jul 23 '18 at 20:04
  • @user322778 Tables don't transpose, they pivot. Transposing a matrix means reordering its axes. It is possible to encode a 2-d matrix & its axis info into a table so that pivoting the table transposes the matrix, but that should be done after straightforward relational manipulation. The choice of matrix vs table depends on the manipulations one is doing (which could include converting between types). – philipxy Jul 23 '18 at 23:55
  • @philipxy Sorry about my wording. When I said "transpose a table" I meant transpose the data inside the table (or the matrix if you wish). Trying to do this using only SQL becomes really complicated but in Python it becomes really simple, it's just a single line: `df.T` where `df` is a Pandas dataframe. For fancy feature engineering you need something else than SQL. That's my point – antonioACR1 Jul 24 '18 at 00:02
  • Looks like we're on the same page. (Including your earlier comments.) – philipxy Jul 24 '18 at 00:05
  • @user322778 "datascience" is a library that allows us to store data in a table object and perform operations on it. – Kavi Vaidya Jul 24 '18 at 09:09
  • @Kavi Vaidya Oh I see. Well, as you can see, "datascience" library is not a standard tool among data scientists. Also, if you really have considerable data then I don't think it's a good idea to handle it using introductory libraries, it's a bit like trying to implement Gradient Boosting or Neural Networks using only VBA/Excel... – antonioACR1 Jul 24 '18 at 15:00
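
The transpose-versus-pivot distinction raised in the comments above can be sketched with the standard library alone. This is a minimal illustration, not anyone's recommended method: the `readings` table, its columns, and the sample values are hypothetical. In Pandas the matrix half would be the `df.T` mentioned above.

```python
import sqlite3

# Matrix view: transposing swaps rows and columns.
matrix = [[1, 2, 3],
          [4, 5, 6]]
transposed = [list(column) for column in zip(*matrix)]

# Table view: a hypothetical long-format table with one row per (item, metric).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (item TEXT, metric TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("a", "height", 1), ("a", "width", 2),
     ("b", "height", 3), ("b", "width", 4)],
)

# Pivoting turns the metric values into columns via conditional aggregation.
# Each output column has to be named explicitly, which is why this gets
# verbose in SQL as the number of columns grows.
pivoted = conn.execute(
    """
    SELECT item,
           MAX(CASE WHEN metric = 'height' THEN value END) AS height,
           MAX(CASE WHEN metric = 'width'  THEN value END) AS width
    FROM readings
    GROUP BY item
    ORDER BY item
    """
).fetchall()

print(transposed)
print(pivoted)
```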

1 Answer

The "datascience" library is intended only as a tool for teaching basic concepts in an entry-level academic class. Unless you are taking such a class, you should ignore it and learn more standard tools.

If it helps you, you can learn data science using Pandas, starting just from flat data files such as CSV and JSON. You will absolutely need to learn to interface with SQL and NoSQL servers eventually. The advantages of a database over flat files are numerous (indexing, transactions, concurrent access, and querying without loading everything into memory, to name a few) and are well described elsewhere.
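
As a rough sketch of what moving from a flat file to a database looks like, the snippet below loads CSV rows into SQLite using only Python's standard library. The sample data, the `sales` table, and the helper function names are all hypothetical and chosen purely for illustration; a real project would read the CSV from disk and would likely use a file-backed or server database instead of `:memory:`.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a flat data file on disk.
CSV_TEXT = """city,year,sales
Austin,2017,120
Austin,2018,150
Boston,2017,90
Boston,2018,110
"""


def load_csv_into_sqlite(csv_text):
    """Load CSV rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, year INTEGER, sales INTEGER)")
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["city"], int(r["year"]), int(r["sales"])) for r in reader]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # An index lets the database look up rows by city without a full scan.
    conn.execute("CREATE INDEX idx_sales_city ON sales (city)")
    conn.commit()
    return conn


def total_sales_by_city(conn):
    """Group and sort inside the database; return (city, total) tuples."""
    cursor = conn.execute(
        "SELECT city, SUM(sales) FROM sales GROUP BY city ORDER BY city"
    )
    return cursor.fetchall()


conn = load_csv_into_sqlite(CSV_TEXT)
totals = total_sales_by_city(conn)
print(totals)
```

The grouping and sorting happen inside the database engine, so the same query keeps working when the table is too large to hold in a Python object.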

It's up to you whether you learn Pandas first and SQL second, or SQL first. Many practitioners learned SQL before Python, Pandas, or data science, so you may want to go that route.

If you go ahead and study that datascience library, you will learn some concepts, but will then have to re-learn everything in there "for real." Maybe this is best for your learning style, maybe it isn't. We don't know you well enough. Do you want academic hand holding or do you want to do things the real way?

Good luck and enjoy your journey.

Chad Bernier