2

I'm programming a decision tree in C++ using a slightly modified version of the C4.5 algorithm. Each node represents an attribute or a column of your data set and it has a children per possible value of the attribute.

My problem is how to store the training data set having in mind that I have to use a subset for each node so I need a quick way to only select a subset of rows and columns.

The main goal is to do it in the most memory and time efficient possible(in that order of priority).

The best way I have thought is to have an array of arrays(or std::vector) or something like that, and for each node have a list(array, vector, etc) or something with the column,line(probably a tuple) pairs that are valid for that node.

I now there should be a better way to do this, any suggestions?

UPDATE: What I need is something like this:

On the beginning I have this data:

Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False

But for the second node I just need this data:

Paris    4    5.0
New York 7    1.3
Paris    9    6.8

And for the third node:

Tokio    2    9.1
Tokio    0    8.4

But with a table of millions of records with up to hundreds of columns.

What I have in mind is keep all the data in a matrix, and then for each node keep the info of the current columns and rows. Something like this:

Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False

Node 2:

columns = [0,1,2]
rows = [0,1,3]

Node 3:

columns = [0,1,2]
rows = [2,4]

This way on the worst case scenario I just have to waste

size_of(int) * (number_of_columns + number_of_rows) * node

That is a lot less than having an independent data matrix for each node.

Topo
  • 4,783
  • 9
  • 48
  • 70
  • 1
    In C++, always think [`std::vector`](http://en.cppreference.com/w/cpp/container/vector) before arrays. – Some programmer dude Mar 01 '13 at 06:59
  • @JoachimPileborg Yes I almost always do, but I have never used a vector of vectors so I wasn't sure if that is a good idea. – Topo Mar 01 '13 at 07:01
  • Why not C5.0 algorithm? – Öö Tiib Mar 01 '13 at 07:06
  • @ÖöTiib As I see it the C5 algorithm just have some memory efficiency enhancements, and it doesn't really generates a better decision tree than the C4.5 or it's predecessor the ID3 algorithm. – Topo Mar 01 '13 at 07:10

1 Answers1

0

How about using at trie: http://en.wikipedia.org/wiki/Trie.

There is also a discussion on how to implement trie: Trie implementation

Community
  • 1
  • 1
zzk
  • 1,347
  • 9
  • 15