Context: I have been working with Cloudera/Impala on a big database, creating more manageable "aggregate" tables which contain substantially less information. These tables are on the order of tens to hundreds of gigabytes, and there are about two dozen of them; in total I am looking at about 500 gigabytes of data, which will fit on a computer in my lab.

Question: I wish to use a non-cloud RDBMS in order to keep working on these tables locally in my lab. The original Impala tables, most of them partitioned by date, have been exported to CSV, in such a way that each "table" folder contains a subfolder per date, and each subfolder contains a single CSV file (in which the partitioned "date" column is absent, since its value is encoded in the subfolder name). Which RDBMS would be adequate, and how would I import these tables?

What I've found so far: there seem to be several GUIs and commands for MySQL that simplify importing CSV files.

However, these do not address my specific situation, since 1. I only have access to Impala on the cluster, i.e. I cannot add any tools there, so the heavy lifting must be done on the lab computer, and 2. they say nothing about importing an already partitioned table with its existing directory/partition structure.

Constraints:

  • The lab computer runs Ubuntu 20.04.
  • Ideally, I would like to avoid having to load each CSV / partition manually, as I have tens of thousands of dates. I am hoping for an RDBMS that already recognizes the partitioned directory structure...
  • The RDBMS itself should have a fairly recent set of functions available, including the lead/lag/first/last window functions; aside from that, it needn't be too fancy.

I'm open to using Spark as an "overkill SQL engine" if that's the best way; I'm just not sure whether it is the best approach for a single computer (not a cluster). Also, if need be (though I would ideally like to avoid this), I can export my Impala tables in another format to ease the import phase, e.g. a different text-based format, Parquet, etc.

Edit 1 As suggested in the comments, I am currently looking at Apache Drill. It is correctly installed, and I have successfully run the basic queries from the documentation / tutorials. However, I am now stuck on how to actually "import" my tables (actually, I only need to "use" them, since Drill seems able to run queries directly on the filesystem). To clarify:

  • I currently have two "tables" in the directories /data/table1 and /data/table2.
  • Those directories contain subdirectories corresponding to the different partitions, e.g.: /data/table1/thedate=1995, /data/table1/thedate=1996, etc., and the same goes for table2.
  • Within each subdirectory, I have a file (without an extension) that contains the CSV data, without headers.

My understanding (I'm still new to Apache Drill) is that I need to create a File System Storage Plugin so that Drill knows where to look and how to interpret what it finds, so I created a pretty basic plugin (a quasi copy/paste from this one) using the web interface on the Plugin Management page. The net result is that I can now type use data; and Drill understands it. I can then say show files in data and it correctly lists table1 and table2 as my two directories. Unfortunately, I am still missing two key things before I can successfully query these tables:

  1. Running select * from data.table1 fails with an error. I have also tried table1 and dfs.data.table1, and I get a different error for each (object 'data' not found, object 'table1' not found, and schema [[dfs,data]] is not valid with respect to either root schema or current default schema, respectively). I suspect this is because there are sub-directories within table1?
  2. I still have not told Drill anything about the structure of the CSV files, and that structure would need to incorporate the "thedate" field, whose value only appears in the sub-directory name... (I sketch below what I imagine such a query would need to look like.)
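
For reference, here is a sketch of what I imagine such a query would need to look like once the format question is solved. It is untested and full of assumptions: it assumes the directory is reachable as dfs.data.`table1` (or data.`table1`, depending on how the plugin/workspace is named), that the workspace's text format is configured to pick up the extension-less files (e.g. via its defaultInputFormat), and col_a / col_b are placeholder column names:

    -- Untested sketch: headerless delimited text shows up in Drill as a single
    -- columns[] array, and dir0 holds the partition folder name (e.g. 'thedate=1995').
    SELECT
      CAST(SUBSTR(dir0, 9) AS INT) AS thedate,  -- strip the 'thedate=' prefix
      columns[0] AS col_a,                      -- placeholder column names
      columns[1] AS col_b
    FROM dfs.data.`table1`
    WHERE dir0 = 'thedate=1995';                -- should restrict the scan to one subdirectory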

Edit 2 After trying a bunch of things, I still had no luck with text-based files; however, using Parquet files worked:

  • I can query a Parquet file.

  • I can query a directory containing a partitioned table, each subdirectory being named in the format thedate=1995, thedate=1996, etc., as stated earlier.

  • I used the advice here in order to be able to query a table the usual way, i.e. without using dir0 but using thedate. Essentially, I created a view:

    create view drill.test as select dir0 as thedate, * from dfs.data/table1_parquet_partitioned

  • Unfortunately, thedate was then a piece of text reading thedate=1994, rather than just 1994 (an int). So I renamed the directories to contain only the date; however, this was not a good solution either, as thedate was still not typed as an int and therefore I could not use it to join with table2 (which has thedate as a regular column). So finally, what I did was cast thedate to an int inside the view (a rough sketch is just below).
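
    A rough sketch of that view, with placeholder names (dfs.tmp being Drill's default writable workspace if no other one is configured):

    -- With the original thedate=1995-style directory names, the prefix has to be
    -- stripped before the cast; with directories renamed to bare values,
    -- CAST(dir0 AS INT) alone would do.
    CREATE VIEW dfs.tmp.table1_v AS
    SELECT CAST(SUBSTR(dir0, 9) AS INT) AS thedate, *
    FROM dfs.data.`table1_parquet_partitioned`;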

=> This is all fine: although these are not CSV files, this alternative is doable for me. However, I am wondering whether, by using such a view with a cast inside, I will still benefit from partition pruning? The answer in the referenced Stack Overflow link suggests partition pruning is preserved by the view, but I am unsure whether that holds when the column is used in a formula (I sketch below how I would try to check this). Finally, given that the only way I could make this work is via Parquet, it raises the question: is Drill the best solution for this in terms of performance? So far I like it, but migrating the database will be time-consuming and I would like to choose the best destination without too much trial and error...
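
On the partition pruning question, my guess is that inspecting the plan of a filtered query on such a view (using the placeholder names from the sketch above) would show whether the scan is limited to one directory, along these lines:

    -- If partition pruning survives the cast, the scan in the resulting plan
    -- should only list files under .../thedate=1995.
    EXPLAIN PLAN FOR
    SELECT * FROM dfs.tmp.table1_v WHERE thedate = 1995;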

  • Have you had a look at Apache Drill? https://drill.apache.org/ It allows you to query CSV files directly and should also allow you to specify a partitioning strategy – Simon D Jun 15 '21 at 09:41
  • No I had never heard of it, thanks! At first glance, the SQL portion has everything I need (window functions with the analytics I'm used to), so I'm really curious about actually getting this up and running. If it's as quick as they advertise, this should be easy :) Still, please allow me 24 hours or so to test it (I have a hard deadline to work on) and I will let you know. Thanks a ton, this might well be the answer I was looking for! – sg1234 Jun 15 '21 at 16:12
  • It's a pretty decent tool and fairly easy to install. 500GB is quite a lot of data though if you're running large queries across lots of files. Performance might be an issue still ¯\_(ツ)_/¯ – Simon D Jun 15 '21 at 20:23
  • I've installed it and am "playing around" with it, however I could use some help in importing the data. I feel confident this could be a good answer to my question, would you please be so kind as to formulate an answer that also involves the configuration of the dfs storage plugin? I will edit my question accordingly, thanks! – sg1234 Jun 16 '21 at 21:34

1 Answer

I ended up using Spark. The only alternative I currently know about, which was brought to my attention by Simon Darr (whom I wish to thank again!), is Apache Drill. Pros and cons for each solution, as far as I could test:

  • Neither solution offers a simple way to import the existing schema when the database is exported as text (in my case, CSV files).
  • Both solutions import the schema correctly from Parquet files, so I have decided to recreate my tables in Parquet format on my source cluster, which uses Impala (a sketch of the CTAS I have in mind for that is below this list).
  • The remaining problem concerns the partitioning: I was at long last able to figure out how to import partitioned files in Spark, and the process of adding that partition dimension is seamless (I got help from here and here for that part), whereas I was not able to find a convincing way to do this in Drill (although the creation of a view, as suggested here, does help somewhat):
    • On Spark, I used spark.sql("select * from parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`") (a minimal Spark SQL sketch is also below this list). Note that it is important not to append a * wildcard to the path, as I first did: with the wildcard, each Parquet file is read on its own without looking at the directory it belongs to, so the directory structure is not taken into account for the partitioning and those fields are not added to the schema. Without the wildcard, the directory names with the field_name=value syntax are correctly added to the schema, and the value types themselves are correctly inferred (int in my case, because I use the thedate=intvalue syntax).
    • On Drill, the trick of creating a view is a bit messy since it involves, first, taking a substring of dir0 to extract the field name and value and, second, casting that field to the correct type for the schema. I am really not certain this sort of view would enable partition pruning in subsequent queries, so I was not fond of this hack. NB: there is likely a proper way to do this; I simply haven't found it.
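
For the re-export on the Impala side, what I have in mind is a per-table CTAS along these lines (a rough sketch with placeholder table and column names, not my actual schema):

    -- Impala: rewrite a text-backed table as Parquet, keeping thedate as the
    -- partition column so the thedate=... directory layout is preserved.
    -- In a CTAS, the partition column(s) must come last in the SELECT list.
    CREATE TABLE table1_parquet
    PARTITIONED BY (thedate)
    STORED AS PARQUET
    AS SELECT col_a, col_b, thedate
    FROM table1;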
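
And to illustrate the Spark side (a minimal sketch using the same path as above): once the directory is read without a wildcard, thedate behaves like any other integer column:

    -- Spark SQL on the partitioned directory (run via spark.sql(...)):
    -- thedate comes from the thedate=... directory names and is inferred as an int,
    -- so it can be filtered, grouped, and joined on like a normal column.
    SELECT thedate, COUNT(*) AS n_rows
    FROM parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`
    WHERE thedate >= 1995
    GROUP BY thedate
    ORDER BY thedate;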

I learned along the way about Drill (which seems great for logs and other data without a known structure), and learned that Spark can do a lot of what Drill does when the data is structured (I had no idea it could read CSV or Parquet files directly without an underlying DB system). I also did not know that Spark was so easy to install on a standalone machine: after following the steps here, I simply created a script in my .bashrc which launches the master, a worker, and the shell all in one go (although I cannot comment on the performance of using a standalone computer for this; perhaps Spark is bad at it). Having used Spark a bit in the past, this solution still seems the best for me given my options. If there are other solutions out there, keep them coming, as I won't accept my own answer just yet (I need a few days to convert all my tables to Parquet anyway).
