Using a dataset, Weka, and the J48 classifier, I've got the following tree: (image: J48 tree)

It splits on 'NumTweets' many times on the right side. Can I prevent J48 from making more than a specified number of splits on one attribute? This is obviously overfitting my data on that one attribute. Ideally, I'd want the same attribute to be reused at most 3-4 times within a branch. Is there any way I can do this?

Thanks in advance!

user3394131

2 Answers

To answer your first question: No, the Weka Explorer does not offer a limit on the number of splits per attribute. You would have to implement that yourself in code.
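As a rough illustration of what doing it manually might involve, here is a minimal sketch of the bookkeeping a hand-written tree builder would need: count how often each attribute already appears on the path from the root, and drop attributes that have reached the cap from the split candidates. `SplitLimiter` and `allowedAttributes` are hypothetical names, not part of Weka, and `NumFollowers` and `AccountAge` are made-up attributes used only for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for a custom tree builder; this is not a Weka class. */
final class SplitLimiter {

    /**
     * Returns the attributes that may still be used for a split at the current node.
     *
     * @param pathAttributes attributes already split on along the path from the root
     * @param candidates     attributes the builder would normally consider
     * @param maxUsesPerPath cap on how often one attribute may appear on a single path
     */
    static List<String> allowedAttributes(List<String> pathAttributes,
                                          List<String> candidates,
                                          int maxUsesPerPath) {
        // Count how many times each attribute was used on this path.
        Map<String, Integer> uses = new HashMap<>();
        for (String attribute : pathAttributes) {
            uses.merge(attribute, 1, Integer::sum);
        }
        // Keep only the candidates that are still below the cap.
        List<String> allowed = new ArrayList<>();
        for (String candidate : candidates) {
            if (uses.getOrDefault(candidate, 0) < maxUsesPerPath) {
                allowed.add(candidate);
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // NumTweets was already used 3 times on this path, so with a cap of 3 it is excluded.
        List<String> path = List.of("NumTweets", "NumTweets", "NumFollowers", "NumTweets");
        List<String> candidates = List.of("NumTweets", "NumFollowers", "AccountAge");
        System.out.println(allowedAttributes(path, candidates, 3)); // prints [NumFollowers, AccountAge]
    }
}
```

The tree builder would then pick the best split only among the returned attributes, so the cap applies per branch rather than to the whole tree.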

With that said, there are several things you can try here to limit the tree size/reduce overfitting.

  1. You could try REPTree instead of J48. It chooses splits by information gain (J48 uses the closely related gain ratio) and prunes with reduced-error pruning. It has an option (maxDepth, -L) to limit the depth of the tree.

  2. Decreasing the J48 pruning confidence (confidenceFactor, the -C parameter, default 0.25) results in more pruning and thus a smaller tree.

  3. You can experiment with the minNumObj parameter (-M, the minimum number of instances per leaf); higher values give a smaller tree.
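To make the three options above concrete, here is a small sketch that only assembles the option strings. The flags (-C and -M for J48, -L and -M for REPTree) are the documented Weka ones, but `TreeOptions` is a hypothetical helper and the values are placeholders to experiment with, not recommendations.

```java
import java.util.Locale;

/** Builds Weka option arrays as plain strings; the values used below are illustrative, not tuned. */
final class TreeOptions {

    /** J48: -C is the pruning confidence (default 0.25), -M the minimum instances per leaf (default 2). */
    static String[] j48(double confidence, int minNumObj) {
        return new String[] {
            "-C", String.format(Locale.ROOT, "%.2f", confidence),
            "-M", Integer.toString(minNumObj)
        };
    }

    /** REPTree: -L is the maximum tree depth (-1 means no limit), -M the minimum instances per leaf. */
    static String[] repTree(int maxDepth, int minNum) {
        return new String[] {
            "-L", Integer.toString(maxDepth),
            "-M", Integer.toString(minNum)
        };
    }

    public static void main(String[] args) {
        // Prints: weka.classifiers.trees.J48 -C 0.10 -M 20
        System.out.println("weka.classifiers.trees.J48 " + String.join(" ", j48(0.10, 20)));
        // Prints: weka.classifiers.trees.REPTree -L 4 -M 5
        System.out.println("weka.classifiers.trees.REPTree " + String.join(" ", repTree(4, 5)));
    }
}
```

With weka.jar on the classpath, arrays like these can be passed to the classifier's `setOptions(String[])`, or the printed line can be used as the scheme on the Weka command line.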

Percolator

No. But you could set the J48 minNumObj config parameter higher. (The default value is 2.) This sets a constraint on the minimum number of data elements that each leaf node will have to contain.

This way (by trial and error) you can balance and/or simplify the decision tree to some extent.

Maybe you can drop or ignore the annoying attribute. Discretizing NumTweets into bins (e.g. <1 tweet/day, <10 tweets/day, >10 tweets/day) might also help. This can be done with the Discretize filter on the Preprocess tab.
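If you would rather pick the cutoffs by hand than let the filter choose them (the unsupervised Discretize filter uses equal-width or equal-frequency bins), the binning itself is simple. A minimal sketch, assuming NumTweets has already been converted to tweets per day; `TweetRateBins` and the labels are made up for the example, and putting exactly 10 tweets/day into the top bin is my assumption:

```java
/** Hypothetical manual binning of a tweets-per-day value into three nominal labels. */
final class TweetRateBins {

    /** Maps a tweets-per-day rate to "low" (<1), "medium" (1 up to but not including 10) or "high" (10 and above). */
    static String bin(double tweetsPerDay) {
        if (tweetsPerDay < 0) {
            throw new IllegalArgumentException("rate must be non-negative: " + tweetsPerDay);
        }
        if (tweetsPerDay < 1.0) {
            return "low";
        }
        if (tweetsPerDay < 10.0) {
            return "medium";
        }
        return "high"; // assumption: exactly 10 tweets/day counts as "high"
    }

    public static void main(String[] args) {
        System.out.println(bin(0.3));  // prints low
        System.out.println(bin(4.0));  // prints medium
        System.out.println(bin(25.0)); // prints high
    }
}
```

With a nominal attribute like this, J48 can split on the attribute at most once per branch, which also addresses the repeated splits directly.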

knb
  • I've already played around with the `minNumObj` config, but nearly all settings result in some attribute not playing nice. Same thing for removing certain attributes. I'll try discretizing features. I was planning on doing that manually, but it's good to know there's a preprocessing filter, thanks! – user3394131 Aug 10 '17 at 13:23