What is the best way to store a table in C++

Question

I'm programming a decision tree in C++ using a slightly modified version of the C4.5 algorithm. Each node represents an attribute or a column of your data set and it has a children per possible value of the attribute.

My problem is how to store the training data set having in mind that I have to use a subset for each node so I need a quick way to only select a subset of rows and columns.

The main goal is to do it in the most memory and time efficient possible(in that order of priority).

The best way I have thought is to have an array of arrays(or std::vector) or something like that, and for each node have a list(array, vector, etc) or something with the column,line(probably a tuple) pairs that are valid for that node.

I now there should be a better way to do this, any suggestions?

UPDATE: What I need is something like this:

On the beginning I have this data:

Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False

But for the second node I just need this data:

Paris    4    5.0
New York 7    1.3
Paris    9    6.8

And for the third node:

Tokio    2    9.1
Tokio    0    8.4

But with a table of millions of records with up to hundreds of columns.

What I have in mind is keep all the data in a matrix, and then for each node keep the info of the current columns and rows. Something like this:

Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False

Node 2:

columns = [0,1,2]
rows = [0,1,3]

Node 3:

columns = [0,1,2]
rows = [2,4]

This way on the worst case scenario I just have to waste

size_of(int) * (number_of_columns + number_of_rows) * node

That is a lot less than having an independent data matrix for each node.

In C++, always think [`std::vector`](http://en.cppreference.com/w/cpp/container/vector) before arrays. — Some programmer dude, Mar 01 '13 at 06:59
@JoachimPileborg Yes I almost always do, but I have never used a vector of vectors so I wasn't sure if that is a good idea. — Topo, Mar 01 '13 at 07:01
@ÖöTiib As I see it the C5 algorithm just have some memory efficiency enhancements, and it doesn't really generates a better decision tree than the C4.5 or it's predecessor the ID3 algorithm. — Topo, Mar 01 '13 at 07:10

score 0 · Answer 1 · edited May 23 '17 at 11:56

0

How about using at trie: http://en.wikipedia.org/wiki/Trie.

There is also a discussion on how to implement trie: Trie implementation

edited May 23 '17 at 11:56

Community

1
1

answered Mar 01 '13 at 07:01

zzk

1,347
9
15

1

Can you elaborate on how to use a trie to represent a data table? – Topo Mar 01 '13 at 07:11

What is the best way to store a table in C++

1 Answers1