
I am looking to load ~10GB of data into memory and perform SQL on it in the form of:

  • Sort on a single column (any column)
  • Aggregate on a single column (any column)
  • Filter on a single column (any column)
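
To make those operations concrete, in SQL they would look something like the following (the table and column names here are only illustrative):

select * from sales order by price;                                      -- sort on any single column
select currency_code_id, count(*) from sales group by currency_code_id; -- aggregate on any single column
select * from sales where territory_id = 'US';                           -- filter on any single column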

What might be a good choice for performance? Some solutions I've come across that could possibly work are TimesTen, eXtremeDB, and SQL Server In-Memory, or even dataframes such as Vaex or cuDF.

I am looking to optimize query times -- that is really all I care about. For a conceptual example, think of something like an Excel table where a user can sort or filter any column (and the application does not know ahead of time which columns to 'index', because all columns may be used).


Update: I'm posting my benchmarks from pandas below. I know pandas isn't ideal for this, but it's great for prototyping and getting benchmark figures:

File (20M rows, 1.2GB): https://storage.googleapis.com/gcp-files/Sales20M.csv.

  • Load time (pd.read_csv): 10.7s
  • Aggregation: (df.groupby('currency_code_id').count()): 3.3s
  • Sort: (df.sort_values('price')): 6.8s
  • Pivot: (df.pivot_table(index='code',columns='territory_id', values='id', aggfunc=len, fill_value=0)): 3.4s.

If using a database, please do not create indexes, since the use case is that we do not know which columns will be used beforehand. (Alternatively, I suppose you could create an index on every field -- but if so, please include the creation of all those indexes in the load time.)

Which tool would be the best for this?

David542
  • I don't think you'd need an in-memory database as much as a database with a lot of memory. 10 GB is nothing. –  Jul 20 '20 at 03:48
  • Regarding Oracle, sorting and/or aggregation can be done without using the in-memory option. 10 GB is nothing. – Roberto Hernandez Jul 20 '20 at 09:49
  • @RobertoHernandez when you say "it's nothing" do you mean, for example, it could do the aggregation in under 0.1s? Or what is 'nothing' ? – David542 Jul 20 '20 at 18:32
  • @bbaird could you please clarify? – David542 Jul 20 '20 at 18:34
  • I mean that with the right CPU and memory in your server, Oracle Database in-memory is not going to give you better performance for just 10 GB. Of course, that is based on your question: aggregating over a single column and sorting by any column. However, if you plan on having more and more data, then it might be an option, but it requires some tuning and it has an extra cost, as it is not included with the normal Enterprise Edition license. Besides, you need to understand how in-memory works with respect to tables that might be modified by batch processing. – Roberto Hernandez Jul 20 '20 at 20:58
  • In the end, it is a trade-off. If you plan only on analytics and your data will grow and grow, then it might be an option to consider. If the difference between 1 second and 10 seconds matters a lot to you, then you might consider it. SQL analytics over tables without indexes, using parallel, could give you great results at zero cost. – Roberto Hernandez Jul 20 '20 at 21:00
  • @RobertoHernandez -- thanks. Could you send a link to this please? SQL Analytics over tables without indexes using parallel could give you great results at zero cost. Is that entirely OLAP? – David542 Jul 20 '20 at 21:04
  • Without using OLAP. It also has extra cost, so it's not worth it. I would prefer to pay for in-memory rather than paying for OLAP. Let me try to find you a link for that. I actually speak from experience: I designed an application that ran queries over 100 million records in tables using parallel for sorting and aggregation. Avoid indexing and provide a PGA_TARGET huge enough for HASH JOIN. Don't use MEMORY_TARGET. Anyway, I will try to send you something regarding parallel. I can't put a test case in the comment section :) – Roberto Hernandez Jul 20 '20 at 21:11
  • @David542 It's difficult to compare separate database products on this site. But I'm confident saying that instead of considering TimesTen, you should look into the Oracle In-Memory option. Oracle seems way more committed to In-Memory than TimesTen - I'm surprised TimesTen is still even available. The In-Memory option is now [free for the first 16GB](https://blogs.oracle.com/in-memory/base-level-198-ru). – Jon Heller Jul 21 '20 at 03:29
  • @JonHeller I've updated the question and added a bounty. Do you think you might be able to provide a benchmark for the above case? – David542 Jul 22 '20 at 23:35
  • @RobertoHernandez see updated question -- I've added a bounty as well! – David542 Jul 22 '20 at 23:35
  • How many concurrent users are you planning for? How many concurrent query executions? Are you okay with dirty reads, or does it need to be committed data? How often do the inserts/updates happen? Is the 10 GB in one table or many tables? – Preben Huybrechts Jul 23 '20 at 11:10
  • @PrebenHuybrechts for this benchmark, just assume one reader/no concurrency and no updates (for now). – David542 Jul 23 '20 at 18:07
  • @David542, some commercial DBMS vendors, notably Oracle and Microsoft, [prohibit disclosing benchmarks](https://www.brentozar.com/archive/2018/05/the-dewitt-clause-why-you-rarely-see-database-benchmarks/) in their license agreement. I suggest you [race your horses](https://ericlippert.com/2012/12/17/performance-rant/). – Dan Guzman Jul 24 '20 at 10:44
  • FWIW, this query against a SQL Server clustered columnstore table runs in 10ms on my PC: `SELECT currency_code_id, COUNT(*) FROM dbo.Sales20M GROUP BY currency_code_id;`. YMMV. – Dan Guzman Jul 24 '20 at 10:56
  • I've added some (very) loose cuDF/pandas benchmarks in an answer in case it's helpful. Note that `pivot` is not yet available, though. – Nick Becker Jul 24 '20 at 14:09
  • I think @PrebenHuybrechts really hit the key question when asking whether it needs to be one table, or many tables. You've indicated load time as relevant, but not "how relevant". So the underlying technology choice does not seem like it would be the primary differentiator. The primary differentiator is likely to be "how much pre-processing are you allowed to perform?". As a trivial example, if you pre-aggregate every column at load time, then you don't need to aggregate at runtime. Separate structures optimising for each use case - in any tech - will of course beat any "generalist" approach. – allmhuran Jul 25 '20 at 08:28
  • @allmhuran one table. I don't think pre-aggregating would really be a viable option, since we don't know which columns will be used beforehand. For example, let's say a user has 100 columns. Do we create 100 indexes/pre-aggregations for a single group by? What if the user wants to do a `group by x,y` -- now that's 10k indexes/aggregations. – David542 Jul 25 '20 at 18:16

4 Answers


I guess you want to materialize an arbitrary data file and perform sub-second queries over it, and you are ready to pay the price (in-memory features are usually enterprise-only).

For SQL Server, for example, there are many options (in-memory OLTP, columnstore indexes), or just using partitioning, or PostgreSQL or MongoDB shards. There are so many examples and demonstrations of such technologies showing sub-second performance, but it depends on the case, because there are limitations.

For example:

  • columnstore indexes may have issues when filtering and returning only a few rows, compared to traditional indexes
  • in-memory OLTP has its own restrictions

In your case, having 10 GB of data and wanting good performance, you are not required to do anything special. Just analyze and normalize the data prior to the insert and create the corresponding indexes.

Well begun is half done, and spending some time up front to get the data written in the right shape will give you the performance you need.

For example:

  1. Insert the data file into a table

  2. For each column in the table:

    • compute the count of distinct values
    • if count distinct / count is smaller than some threshold X, create a separate lookup table with columns id and value
    • insert the distinct values into it
    • add a new column to the main table and populate it with the new ids
    • create an index on this column
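
A minimal T-SQL sketch of step 2 for a single column (using territory_id from the sample file; all object names here are just illustrative):

-- dictionary table holding each distinct value once
CREATE TABLE dbo.Territory (id int IDENTITY(1,1) PRIMARY KEY, value varchar(10) NOT NULL);

INSERT INTO dbo.Territory (value)
SELECT DISTINCT territory_id FROM dbo.Sales20M;

-- add the surrogate key column to the main table, populate it, and index it
ALTER TABLE dbo.Sales20M ADD territory_fk int NULL;

UPDATE s
SET    s.territory_fk = t.id
FROM   dbo.Sales20M s
JOIN   dbo.Territory t ON t.value = s.territory_id;

CREATE INDEX IX_Sales20M_territory_fk ON dbo.Sales20M (territory_fk);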

Reducing the size of the table will improve the I/O operation count. Searching and grouping by numbers is faster than doing so by text.

Of course, you need to change the application - instead of searching by some city name, you will filter by its ID. And after counting cities per country ID, you will perform a second query to transform these IDs back into names.
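
Continuing the sketch above, the aggregation then runs on the small integer column, and a second, tiny query (or a join) translates the IDs back to names:

-- fast aggregation on the surrogate key
SELECT territory_fk, COUNT(*) AS cnt
FROM dbo.Sales20M
GROUP BY territory_fk;

-- second, tiny query to resolve ids to names
SELECT id, value FROM dbo.Territory;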

I feel that applying some fundamental principles in your case will be better than using some high-level, high-priced tech with limitations that can become critical in the future when new requirements for the application arrive.


Update: I ran a test on a virtual machine with 8 GB RAM and 4 virtual processors, running SQL Server 2019 Standard Edition. Unfortunately, it is on an HDD with pretty bad I/O, so because of the hardware the data load is slow.

  • 2 minutes to load the table (I am using the SSMS interface to import the data; if you use bcp it will be better, but... HDD after all)
  • 1.5 minutes to add an index on each field
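
For reference, "an index on each field" simply means something along these lines (column names assumed from the sample CSV):

CREATE INDEX IX_Sales20M_id               ON dbo.Sales20M (id);
CREATE INDEX IX_Sales20M_date_id          ON dbo.Sales20M (date_id);
CREATE INDEX IX_Sales20M_instance_id      ON dbo.Sales20M (instance_id);
CREATE INDEX IX_Sales20M_territory_id     ON dbo.Sales20M (territory_id);
CREATE INDEX IX_Sales20M_code             ON dbo.Sales20M (code);
CREATE INDEX IX_Sales20M_price            ON dbo.Sales20M (price);
CREATE INDEX IX_Sales20M_currency_code_id ON dbo.Sales20M (currency_code_id);
CREATE INDEX IX_Sales20M_price_in_usd     ON dbo.Sales20M (price_in_usd);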

So, the table looks like:

(screenshot of the table design omitted)

You can see how lazy I am: no normalization, and an index on each field, leading to 3.2 GB allocated for the data and indexes:

exec sp_spaceused 'dbo.Sales20M'

(screenshot of the sp_spaceused output omitted)

But here are some results:

select count(*) , currency_code_id from dbo.Sales20M group by currency_code_id

of course, 0 seconds as we use the index:

(screenshot of the query result omitted)

select TOP 10 * from dbo.Sales20M   order by PRICE_IN_USD desc;

0 seconds, but note I am using TOP - basically, when you sort you usually need to display only part of the rows, right? And if I sort all the rows for some reason:

select  * 
INTO #X
from dbo.Sales20M   order by PRICE_IN_USD desc;

it runs for 2 seconds (I am inserting the data into a table, as rendering the output takes time, too).

As for the PIVOT, it is not very fast in SQL Server, but you can use R if you need something massive. I did not understand yours exactly, but I made a PIVOT by Code and territory_id, calculating the average price in USD:

SELECT *
FROM 
( 
    SELECT Code
              ,territory_id
              ,AVG(price_in_usd) price_in_usd
    FROM dbo.Sales20M
    GROUP BY Code
            ,territory_id
)DS
PIVOT
(
    MAX(price_in_usd) FOR territory_id IN ([AE], [AG], [AI], [AL], [AM], [AO], [AR], [AT], [AU], [AW], [AZ], [BA], [BE], [BF], [BG], [BH], [BJ], [BM], [BN], [BO], [BR], [BS], [BW], [BY], [BZ], [CA], [CH], [CI], [CL], [CO], [CR], [CV], [CY], [CZ], [DE], [DK], [DM], [DO], [EC], [EE], [EG], [ES], [FI], [FJ], [FM], [FR], [GA], [GB], [GD], [GH], [GM], [GR], [GT], [GW], [HK], [HN], [HR], [HT], [HU], [ID], [IE], [IL], [IN], [IS], [IT], [JM], [JO], [JP], [KG], [KH], [KN], [KR], [KW], [KY], [KZ], [LA], [LB], [LK], [LT], [LU], [LV], [MD], [MK], [ML], [MN], [MO], [MT], [MU], [MX], [MY], [MZ], [NA], [NE], [NI], [NL], [NO], [NP], [NZ], [OM], [PA], [PE], [PG], [PH], [PL], [PT], [PY], [QA], [RU], [RW], [SA], [SE], [SG], [SI], [SK], [SN], [SV], [SZ], [TG], [TH], [TJ], [TM], [TR], [TT], [TW], [TZ], [UA], [UG], [US], [UY], [UZ], [VE], [VG], [VN], [ZA], [ZM], [ZW])
) PVT;

I am lazy again and not using a dynamic PIVOT (see the sketch below the screenshot). It takes 0-1 seconds.

(screenshot of the PIVOT result omitted)
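
For completeness, a dynamic version of the same PIVOT (so the territory list does not have to be typed out by hand) could look roughly like this - just a sketch, and it assumes SQL Server 2017+ for STRING_AGG:

DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- build the IN (...) column list from the data itself
SELECT @cols = STRING_AGG(QUOTENAME(territory_id), ', ')
FROM (SELECT DISTINCT territory_id FROM dbo.Sales20M) AS t;

SET @sql = N'
SELECT *
FROM
(
    SELECT Code, territory_id, AVG(price_in_usd) price_in_usd
    FROM dbo.Sales20M
    GROUP BY Code, territory_id
) DS
PIVOT
(
    MAX(price_in_usd) FOR territory_id IN (' + @cols + N')
) PVT;';

EXEC sp_executesql @sql;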


Conclusion:

My point is that even though my setup is bad and I am being super lazy, not spending time to normalize the data and create proper indexes, I am still getting close to 0-second results. You can simply start with something free like PostgreSQL and I believe you will get good results. Of course, the "fancy" stuff is always there if you need it to optimize a particular use case.

gotqn
  • what would be the difference (pros and cons) between using in-memory and columnar-store for the above? – David542 Jul 26 '20 at 03:27
  • @David542 My point is that for this amount of data you do not need any fancy tech. You can use, for example, PostgreSQL, which is free and will perform pretty well. Anyway, I will share some results from SQL Server to prove my point. – gotqn Jul 27 '20 at 19:07

I built a real test case with the provided CSV of 20M records. For my proof of concept I will compare two options in Oracle: the Oracle In-Memory option vs Oracle Parallel Query. The idea is to see whether the results match your expectations.

  • The In-Memory option is a separately licensed option, so you have to pay additional costs.
  • Parallel query is a feature included with the Enterprise Edition of the Oracle database.

Lab:

  • Red Hat Linux 7 server: 16 CPUs and 32 GB RAM
  • In-Memory area = 4 GB
  • Virtual server on VMware (IBM ESX hardware series)

Elements for the Test Case

SQL> create tablespace tbtest datafile '/bbdd_odcgrc1r/datos/test.dbf' size 2g autoextend on next 100m maxsize 10g ;

Tablespace created.

SQL> create user test_perf identified by "Oracle_1" ;

User created.

SQL> grant connect to test_perf ;

Grant succeeded.

SQL> grant create table to test_perf ;

Grant succeeded.

SQL> alter user test_perf quota unlimited on tbtest ;

User altered.

SQL>

Oracle Parallel Query without the In-Memory option

I loaded the CSV file using direct-path SQL*Loader:

Table TEST_PERFORMANCE:
  20000000 Rows successfully loaded.
  0 Rows not loaded due to data errors.
  0 Rows not loaded because all WHEN clauses were failed.
  0 Rows not loaded because all fields were null.

Bind array size not used in direct path.
Column array  rows :    5000
Stream buffer bytes:  256000
Read   buffer bytes: 1048576

Total logical records skipped:          1
Total logical records read:      20000000
Total logical records rejected:         0
Total logical records discarded:        0
Total stream buffers loaded by SQL*Loader main thread:     4112
Total stream buffers loaded by SQL*Loader load thread:        0

Run began on Sat Jul 25 00:57:23 2020
Run ended on Sat Jul 25 00:57:34 2020

Obviously, loading a file into a database is not the same (at all) as loading the file into pandas, since pandas does not need to write the data into anything (in this case, a database data file).

SQL> desc test_perf.test_performance
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                                 NUMBER
 DATE_ID                                            DATE
 INSTANCE_ID                                        NUMBER
 TERRITORY_ID                                       VARCHAR2(10 CHAR)
 CODE                                               VARCHAR2(10 CHAR)
 PRICE                                              NUMBER
 CURRENCY_CODE_ID                                   VARCHAR2(10 CHAR)
 PRICE_IN_USD                                       NUMBER

I collected statistics on the table, and now let's see how it behaves:

SQL> exec dbms_stats.gather_table_stats ( 'TEST_PERF' , 'TEST_PERFORMANCE' , block_sample => true );

PL/SQL procedure successfully completed.

Group by currency_code_id (without sorting):

SQL> select count(*) , currency_code_id from test_perf.test_performance group by currency_code_id
90 rows selected.

Elapsed: 00:00:00.35

Group by currency_code_id, but sorting by the count:

SQL>  select count(*) , currency_code_id from test_perf.test_performance group by currency_code_id order by count(*) desc ;

90 rows selected.

Elapsed: 00:00:00.70

Sorting all records and displaying them in SQL*Plus consumes a lot of time just producing the output, but the query itself is ultra fast:

SQL> select * from test_perf.test_performance order by PRICE_IN_USD desc ;

20000000 rows selected.

Elapsed: 00:01:31.48

Execution Plan
----------------------------------------------------------
Plan hash value: 1897103579

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                  |    20M|   972M|       |  5940   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR         |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)    | :TQ10001         |    20M|   972M|       |  5940   (1)| 00:00:01 |  Q1,01 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY        |                  |    20M|   972M|  1376M|  5940   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |                  |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND RANGE      | :TQ10000         |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,00 | P->P | RANGE      |
|   6 |       PX BLOCK ITERATOR |                  |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| TEST_PERFORMANCE |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 32 because of table property


Statistics
----------------------------------------------------------
        364  recursive calls
          3  db block gets
      36963  consistent gets
      45558  physical reads
       2860  redo size
  703698256  bytes sent via SQL*Net to client
   14667271  bytes received via SQL*Net from client
    1333335  SQL*Net roundtrips to/from client
         64  sorts (memory)
          1  sorts (disk)
   20000000  rows processed

The real time consumed by the query itself (excluding the client output) is:

SQL> select * from test_perf.test_performance order by PRICE_IN_USD desc ;
Elapsed: 00:00:02.03

Execution Plan
----------------------------------------------------------
Plan hash value: 1897103579

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                  |    20M|   972M|       |  5940   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR         |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)    | :TQ10001         |    20M|   972M|       |  5940   (1)| 00:00:01 |  Q1,01 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY        |                  |    20M|   972M|  1376M|  5940   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |                  |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND RANGE      | :TQ10000         |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,00 | P->P | RANGE      |
|   6 |       PX BLOCK ITERATOR |                  |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| TEST_PERFORMANCE |    20M|   972M|       |   435   (1)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 32 because of table property

Select the distinct territory_id values from the table:

SQL> select distinct territory_id from test_perf.test_performance order by territory_id desc ;

136 rows selected.

Elapsed: 00:00:00.58

For the pivot I chose this simple example

SQL> select * from test_perf.test_performance
pivot ( count(*) for TERRITORY_ID in 
(
'ZW',
'ZM',
'ZA',
'VN',
'VG',
'VE',
'UZ',
'UY',
'US',
'UG',
'UA',
'TZ',
'TW',
'TT',
'TR',
'TM',
'TJ',
'TH',
'TG',
'SZ',
'SV',
'SN',
'SK',
'SI',
'SG',
'SE',
'SA',
'RW',
'RU',
'QA',
'PY',
'PT',
'PL',
'PH',
'PG',
'PE',
'PA',
'OM',
'NZ',
'NP',
'NO',
'NL',
'NI',
'NE',
'NA',
'MZ',
'MY',
'MX',
'MU',
'MT',
'MO',
'MN',
'ML',
'MK',
'MD',
'LV',
'LU',
'LT',
'LK',
'LB',
'LA',
'KZ',
'KY',
'KW',
'KR',
'KN',
'KH',
'KG',
'JP',
'JO',
'JM',
'IT',
'IS',
'IN',
'IL',
'IE',
'ID',
'HU',
'HT',
'HR',
'HN',
'HK',
'GW',
'GT',
'GR',
'GM',
'GH',
'GD',
'GB',
'GA',
'FR',
'FM',
'FJ',
'FI',
'ES',
'EG',
'EE',
'EC',
'DO',
'DM',
'DK',
'DE',
'CZ',
'CY',
'CV',
'CR',
'CO',
'CL',
'CI',
'CH',
'CA',
'BZ',
'BY',
'BW',
'BS',
'BR',
'BO',
'BN',
'BM',
'BJ',
'BH',
'BG',
'BF',
'BE',
'BA',
'AZ',
'AW',
'AU',
'AT',
'AR',
'AO',
'AM',
'AL',
'AI',
'AG',
'AE'
) ) 
 order by id 

Elapsed: 00:00:04.74

Oracle In Memory Option

I configured an in-memory area of 4 GB, which is not that much.
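
For reference, the in-memory area is sized with an instance parameter; a sketch of how it can be set (it requires an instance restart to take effect):

SQL> ALTER SYSTEM SET INMEMORY_SIZE = 4G SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP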

Total System Global Area 1.2885E+10 bytes
Fixed Size                 12192520 bytes
Variable Size            5184161016 bytes
Database Buffers         3388997632 bytes
Redo Buffers                4583424 bytes
In-Memory Area           4294967296 bytes
Database mounted.
Database opened.
SQL> SHOW PARAMETER INMEMORY

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
inmemory_adg_enabled                 boolean     TRUE
inmemory_clause_default              string
inmemory_expressions_usage           string      ENABLE
inmemory_force                       string      DEFAULT
inmemory_max_populate_servers        integer     8
inmemory_query                       string      ENABLE
inmemory_size                        big integer 4G
inmemory_trickle_repopulate_servers_ integer     1
percent
inmemory_virtual_columns             string      MANUAL
optimizer_inmemory_aware             boolean     TRUE

SQL> ALTER TABLE TEST_PERF.TEST_PERFORMANCE INMEMORY PRIORITY HIGH;

Table altered.

SQL> select segment_name
  2  ,    partition_name
  3  ,    inmemory_size / 1024 / 1024 as inmemory_size_mb
  4  ,    bytes / 1024 / 1024 as bytes_mb
  5  ,    populate_status
  6  ,    trunc(bytes / inmemory_size, 1) * 100 as compression_ratio
  7  from v$im_segments
  8  order by segment_name, partition_name;

SEGMENT_NAME
--------------------------------------------------------------------------------
PARTITION_NAME
--------------------------------------------------------------------------------
INMEMORY_SIZE_MB   BYTES_MB POPULATE_STAT COMPRESSION_RATIO
---------------- ---------- ------------- -----------------
TEST_PERFORMANCE

          362.25 514.046875 COMPLETED                   140

SQL> select count(*),length(date_id) from test_perf.test_performance group by length(date_id)

Execution Plan
----------------------------------------------------------
Plan hash value: 3227171220

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                  |    11 |   121 |    29  (59)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                   |                  |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)             | :TQ10001         |    11 |   121 |    29  (59)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                  |                  |    11 |   121 |    29  (59)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                    |                  |    11 |   121 |    29  (59)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH                 | :TQ10000         |    11 |   121 |    29  (59)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       HASH GROUP BY               |                  |    11 |   121 |    29  (59)| 00:00:01 |  Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR          |                  |    20M|   209M|    14  (15)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS INMEMORY FULL| TEST_PERFORMANCE |    20M|   209M|    14  (15)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Degree of Parallelism is 32 because of table property

Let's test some queries.

Group by territory

SQL> select count(*),TERRITORY_ID from test_perf.test_performance group by TERRITORY_ID ;

136 rows selected.

Elapsed: 00:00:00.24

Group by instance_id

SQL> select count(*) , INSTANCE_ID from test_perf.test_performance group by INSTANCE_ID

11251 rows selected.

Elapsed: 00:00:00.27

Group by 2 fields

SQL> select count(*), instance_id, territory_id from test_perf.test_performance group by instance_id, territory_id ;

278269 rows selected.

Elapsed: 00:00:00.84

As you can see, obviously when the table is in memory the performance of these operations increases, but you would notice more improvement if you start applying analytics to the queries.

Several statistics functions and group by

SQL> select territory_id,sum(to_number(price)),avg(to_number(price)),max(to_number(price)),min(to_number(price))
  2  from test_perf.test_performance group by territory_id ;

Elapsed: 00:00:00.57

However, PIVOT is usually a bad idea for inmemory, because of the column storage used for the tables.

The same PIVOT query that was executed before now takes:

    Elapsed: 00:00:15.93

As you can see in the examples above, I did not change the PARALLEL property of the table, so let me show you how the query behaves when parallel is not enabled for an object that has the INMEMORY option:

SQL> alter table test_perf.TEST_PERFORMANCE noparallel ;

Table altered.

SQL> select count(*) , currency_code_id from test_perf.test_performance group by currency_code_id
  2  ;

90 rows selected.

Elapsed: 00:00:02.14

Execution Plan
----------------------------------------------------------
Plan hash value: 151279035

------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |    90 |   450 |   909  (62)| 00:00:01 |
|   1 |  HASH GROUP BY              |                  |    90 |   450 |   909  (62)| 00:00:01 |
|   2 |   TABLE ACCESS INMEMORY FULL| TEST_PERFORMANCE |    20M|    95M|   411  (16)| 00:00:01 |
------------------------------------------------------------------------------------------------

IMPQ

For the last part, I have left the best of all the options: a combination of both parallel and in-memory, called IMPQ, or In-Memory Parallel Query. This feature provides the best of both worlds. In order for it to kick in, you have to enable parallel for your table, have the table in the in-memory area, and set the parameter PARALLEL_DEGREE_POLICY to AUTO.
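
Note that the table was set to NOPARALLEL a few steps above, so the parallel property has to be switched back on for IMPQ to apply (a quick sketch):

SQL> alter table test_perf.TEST_PERFORMANCE parallel ;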

An example of this would be

SQL> alter session set parallel_degree_policy=auto ;

Session altered.

SQL> set autotrace traceonly explain
SQL> select count(*),territory_id from test_perf.test_performance group by territory_id

Execution Plan
----------------------------------------------------------
Plan hash value: 3227171220

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name             | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                  |   136 |   408 |    78  (59)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                   |                  |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)             | :TQ10001         |   136 |   408 |    78  (59)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                  |                  |   136 |   408 |    78  (59)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                    |                  |   136 |   408 |    78  (59)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH                 | :TQ10000         |   136 |   408 |    78  (59)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       HASH GROUP BY               |                  |   136 |   408 |    78  (59)| 00:00:01 |  Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR          |                  |    20M|    57M|    38  (16)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS INMEMORY FULL| TEST_PERFORMANCE |    20M|    57M|    38  (16)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 12

Note in the plan above the change in DOP from 32 to 12. That indicates that IMPQ has identified that the best degree for the statement is 12, instead of 32, which was derived from CPU_COUNT * 2 (as the table was created with the default automatic degree).

The query below, now with IMPQ, runs in only 0.14 seconds instead of 0.35:

SQL> select count(*),territory_id from test_perf.test_performance group by territory_id
  2  ;

136 rows selected.

Elapsed: 00:00:00.14

This other query now takes 0.14 seconds, where before it took 0.70 seconds:

SQL> select count(*) , currency_code_id from test_perf.test_performance group by currency_code_id order by count(*) desc ;

90 rows selected.

Elapsed: 00:00:00.14

Summary

If you need times below a second, and you are mostly running analytics, the Oracle In-Memory option might be a good idea. If that is the case, you may also want to enable IMPQ to get the best results. There will be some work to do regarding which columns should or should not be included, how the table is populated into the in-memory area, etc. The greater the complexity of the analytics, the greater the performance improvement you will notice.

Nevertheless, if you can live with times between 1 and 5 seconds, parallel query without indexes might be a cost-free solution that is quite easy to configure.

Feel free to comment anything or ask for any clarification regarding the test case.

Roberto Hernandez
  • wow, what an extensive answer. Thanks for taking the time! Out of curiosity, why are things so different in terms of performance with Pivot queries? What does a pivot query do that is so much different that a normal aggregation one? – David542 Jul 26 '20 at 03:23
  • Pivot queries rely on transforming rows to columns. In-Memory uses what is called column storage. Basically, when you invoke PIVOT, you go from columns to rows, only to go back from rows to columns. – Roberto Hernandez Jul 26 '20 at 09:34
  • I see. What is usually the best or suggested way to handle pivot queries in Oracle then? Or are they considered quite rare and not really optimized for (or rather, not a primary consideration when choosing a storage type). – David542 Jul 28 '20 at 02:57
  • @David542, you can optimise PIVOT queries on INMEMORY if you remove them from the column storage. It is a common operation; you can also modify the table, adding more records, etc. The best approach, for example with tables that are populated by batch processes, is to remove them from in-memory at the beginning, run the batch, collect statistics, and put them back into the in-memory area, defining which columns should be subject to column storage (usually the ones you don't want to pivot on). – Roberto Hernandez Jul 28 '20 at 08:50
  • @David542, also a good recommendation is to create the pivot tables as materialized results of the batch. Of course, you need more space, but that allows better exploitation of the data, as the resulting pivot tables are not part of the INMEMORY area. – Roberto Hernandez Jul 28 '20 at 13:11

In case it's useful, these are some quick comparisons I get using cuDF for reading, aggregating, and sorting your file on a 32GB V100. For SQL, you may want to look at BlazingSQL.

Loosely comparing to pandas on my machine, it looks like reading this file is about 20x faster, the aggregation is about 150x faster, and the sort is about 50x faster. Note that cuDF support for pivot is not yet available, so it's not included below.

import cudf

cudf.set_allocator(
    pool=True, # half the GPU by default
)

%time df = cudf.read_csv("Sales20M.csv")
%time x = df.groupby('currency_code_id').count()
%time x = df.sort_values('price')

CPU times: user 402 ms, sys: 157 ms, total: 559 ms
Wall time: 557 ms
CPU times: user 17.1 ms, sys: 8.97 ms, total: 26.1 ms
Wall time: 25.8 ms
CPU times: user 96.2 ms, sys: 32 ms, total: 128 ms
Wall time: 128 ms

import pandas as pd

%time df = pd.read_csv("Sales20M.csv")
%time x = df.groupby('currency_code_id').count()
%time x = df.sort_values('price')

CPU times: user 8.23 s, sys: 1.67 s, total: 9.9 s
Wall time: 9.89 s
CPU times: user 3.32 s, sys: 355 ms, total: 3.68 s
Wall time: 3.67 s
CPU times: user 5.77 s, sys: 586 ms, total: 6.35 s
Wall time: 6.35 s
Nick Becker
  • thanks that's really great to see. Two questions: why is there no pivot? Does that operation not work on GPU or is that possible at all? How is the load operation so fast? Do you specify the type for each column or how does it determine the type and load in so quickly? Finally, are there any caveats with GPU -- for example, known analytic-type queries that are not suited for it (perhaps search? pivot? etc.). Thanks again! – David542 Jul 26 '20 at 03:26
  • All good questions. Some will be hard to answer effectively in comments, though. I will separate them for clarity. 1) Pivot hasn't been implemented *yet*, but there is no fundamental reason it cannot be done on a GPU (https://github.com/rapidsai/cudf/issues/1214). – Nick Becker Jul 26 '20 at 14:24
  • The load operation is fast for several reasons, but the high-level summary is that reading data with cuDF does a memory map and then copies the bytes from host memory into GPU memory. GPU-accelerated implementations for parsing and decompression are faster. Like in pandas, dtypes can be specified by the user but by default are inferred. – Nick Becker Jul 26 '20 at 14:28
  • RAPIDS libraries are designed to bring GPU speed into Python without sacrificing the flexibility or general APIs of the PyData programming model. There are caveats, but an effective treatment is too long for the comments. To give one example, naively iterating row-by-row through a GPU series/frame/array is more expensive than the CPU equivalent. Row-wise user defined functions should leverage Numba. You may want to look at cuDF, cuPy (by Preferred Networks), BlazingSQL, and more. https://docs.rapids.ai/overview/latest.pdf can provide more information. Disclaimer: I am a member of this team. – Nick Becker Jul 26 '20 at 14:34

The Oracle TimesTen In-Memory Database could be used, but the Oracle In-Memory feature of an Oracle Database would be better for this type of workload.

TimesTen is designed more for low-latency / high-throughput ACID transactions than for analytic-type queries.

  • Thanks for this response -- could you please clarify the difference between the two? For example, wouldn't all the queries I had listed in the question be considered a variant of an analytic-type query? – David542 Oct 17 '20 at 04:56
  • Hi David, all of your queries are analytic in nature. Oracle In-Memory would be a good fit as you would not need to create any indexes for fast columnar scans. TimesTen can do analytics, but it is a row store not a column store. – ScalableDBDoug Oct 18 '20 at 13:14