0

What type of data structure (or data structures?) could be used to store a pivot table of two dimensions where the data can be accessed? For example, let's take the following, copied from Excel:

enter image description here

If the data was all in a one-dimensional hierarchy -- that is, it went Group > Product > Year > Value, we could do something along the lines of:

{
    "Electronics": {
        "(all)": {
            "Year": {
                "2018": 2,
                "2019": 1,
                "Grand Total": 3
            }
        },
        "Computer": {
            "Year": {
                "2018": 1,
                "2019": 0,
                "Grand Total": 1
            }
        },
        "Phone": {
            "Year": {
                "2018": 1,
                "2019": 1,
                "Grand Total": 2
            }
        }
    }
}
        

And then I could access the Value at Electronics > Computer > 2018 by doing:

obj["Electronics"]['Computer']['Year']
// {2018: 1, 2019: 0, Grand Total: 1}
obj["Electronics"]['Computer']['Year'][2018]
// 1

But I can't think of how this would be put into a two-dimensional data-structure outside of a 2D multi-dimensional array that really wouldn't have any usage other than being able to retrieve a value at a certain position (and I'd need to store tons of metadata in order to know what is stored at which placement).

What would be a suitable data structure for this? I've tagged this Java, C, C++ -- any language is fine, I'm more interested in the actual data structure.

David542
  • 104,438
  • 178
  • 489
  • 842
  • 4
    The question is really too open-ended to be suitable for Stack Overflow. That said, a "pivot table" is in effect a particular configuration of grouping and aggregation on a plain old data set. You should focus on three parts separately: the database itself (either an actual DB or just an in-memory table, or whatever), the query(ies) to produce the desired aggregated results, and then finally the visual representation of those aggregated results. – Peter Duniho Dec 12 '20 at 23:25
  • People will pile answers here, because of the bounty. But I doubt that any will really be helpful for you. I recommend to read [ask] again. I realise that you have enough reputation that you must have some insight into making a good question (directly or indirectly), but in order to not waste the bounty, making it more easily answerable is probably a wise move. So please excuse me for mentioning it again. – Yunnosch Dec 14 '20 at 23:13
  • I get the impression that the first line in the picture has redundant data, which could be determined from the rest. Or is that the point of your question? Please consider defining your understanding of "pivot table", maybe by referring to the context you learned it in; e.g. "MS Excel". – Yunnosch Dec 14 '20 at 23:17
  • @Yunnosch sure Ill add some details. – David542 Dec 14 '20 at 23:18
  • I the data is in the input format you show (looking like JSON), consider providing a [mre] of how you read it in, store it and output it as is. That would show the data structures, the input/output mechanisms and the output format. In that "output mechanism" could also be something like a data structure to be filled, with desired result for a given sample input. I.e. you have the first third of the desired program and the third third. Filling in the missing second third as an answer would then be much easier and more probably helpful for you. For that you probably want to decide on a language. – Yunnosch Dec 14 '20 at 23:22
  • There are various ways you could store the data, for example as a simple table of the underlying fact data. In order to answer your question we need to know what you want to do with the data. Do you need fast access to the totals for example, or would it suffice to compute them on-demand? – jon hanson Dec 19 '20 at 18:40
  • @jon-hanson either one is fine actually, but I suppose if I had to say one or the other, I'd say (1) whichever is easier; and (2) if it doesn't matter, then yes, materialize the totals. – David542 Dec 19 '20 at 20:04

2 Answers2

5

In python hierarchical indexing is used to implement pivot tables. For example in python pandas pivot tables are multi-index dataframes. The implementation of a dataframe is in https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/frame.py#L236-L6142. MultiIndex is implemented in pandas.core.indexes.multi.py

(https://github.com/pandas-dev/pandas/blob/8dbb593d0c94107ec8b91a4723c40af537807ca4/pandas/core/indexes/multi.py#L179)

For C++ and Java there are also hierarchical indexing libraries i.e. boost::multi_index, here in stackoverflow multi-index has its own tag.

Maybe check the boost::multi_index examples : https://www.boost.org/doc/libs/1_62_0/libs/multi_index/doc/examples.html

Multi-indexing in Java : Is there an equivalent of boost::multi_index for Java someplace?

In https://www.scss.tcd.ie/Owen.Conlan/4d2/4D2-9&10_Multi-Level_Indexes_v1.02.pdf, https://www.cs.uct.ac.za/mit_notes/database/htmls/chp11.html#multilevel-indexes and http://theteacher.info/index.php/architecture-data-comms-and-applications-unit-5/4-organisation-and-structure-of-data/all-topics/3940-multi-level-indexes is an analysis for multi-indexing in files (file system structure) and search algorithm

The commands used in python to handle multi-level indices is i.e. stack and unstack :

https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/, https://www.xplenty.com/glossary/what-is-hierarchical-indexing/

ralf htp
  • 9,149
  • 4
  • 22
  • 34
0

I would store the data in tabular form, i.e.:

+---------------------------------------+
| Sector      | Device   | Year | Count |
+---------------------------------------+
| Electronics | Computer | 2018 |     1 |
| Electronics | Phone    | 2018 |     1 |
| Electronics | Phone    | 2019 |     1 |
+---------------------------------------+

The pivot table is a just a visual way of presenting the above data.

You can use a columnar table representation (i.e. store the data by column rather than row) to optimise the representation for fast querying. Typically you would order the columns by the most-used first, sort the data on each column and store positions to mark where each set of distinct values begins within the column.

Columnar representations include Apache Arrow for in-memory data and Apache Parquet for data stored on disk.

For the above data it might look something like the following:

+-------------+
| Electronics |
| Electronics |
| Electronics |
+-------------+
| Computer    |
| Phone       |
| Phone       |
+-------------+
| 2018        |
| 2018        |
| 2019        |
+-------------+
| 1           |
| 1           |
| 1           |
+-------------+
jon hanson
  • 8,722
  • 2
  • 37
  • 61
  • sure, but you would also need all subtotals, right? (What about something like a distinct count where you can't sum the sub-records?) – David542 Dec 20 '20 at 22:00
  • 1
    The subtotals can be computed by querying the data and storing the result. If the values in the cube are non-static then you would need an entirely different data structure of course - probably a dependency directed graph. Hence my question, "what do you want to do with the data?". It's impossible to answer your question unless you provide this information up front. – jon hanson Dec 21 '20 at 12:33