
I am building some rather big Bayesian networks for generating synthetic data, and I find pomegranate to be a good option as it generates data quickly and easily allows for inputting evidence. I have one problem with it: saving the trained models. Pomegranate's built-in method stores the model as a JSON so large that I run out of memory once I have 30 or so variables, even when using "lighter" algorithms. The models cannot be pickled due to the error

TypeError: self.distributions_ptr,self.parent_count,self.parent_idxs cannot be converted to a Python object for pickling

I am wondering if anyone has a good alternative for storing pomegranate models, or else knows of a Bayesian Network library that generates data quickly after training. I would be grateful for any tips.
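For reference, this is roughly the kind of save/pickle attempt being described. It is only a sketch against the pre-1.0 pomegranate API (from_samples, to_json); exact names may differ by version, and the data array is just a placeholder:

import pickle
import numpy as np
from pomegranate import BayesianNetwork

# placeholder data: 1000 samples of 30 binary variables
data = np.random.randint(2, size=(1000, 30))

# learn a structure with one of the "lighter" algorithms
model = BayesianNetwork.from_samples(data, algorithm="chow-liu")

# built-in serialization: the JSON string can become extremely large
with open("model.json", "w") as f:
    f.write(model.to_json())

# pickling raises the TypeError quoted above
pickle.dumps(model)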

1 Answer


If your model can be learned and stored in memory, it can be saved to a file, but maybe not by pickling. There are many different formats for Bayesian networks (bif, xmlbif, dsl, uai, etc.). I don't know pomegranate, but there is certainly a way to read/save using such a format. With pyAgrum (of which I am one of the authors), you just have to write gum.saveBN(model, "model.xxx") to save it, and then bn = gum.loadBN("model.xxx") to read it. You can choose xxx among all the supported formats, for now: bif|dsl|net|bifxml|o3prm|uai (https://pyagrum.readthedocs.io/en/1.3.1/functions.html#pyAgrum.loadBN).
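For instance, a minimal save/load round trip could look like the following (fastBN is used here only to have a model; in practice you would pass your learned network):

import pyAgrum as gum

# any BN (here a toy one); in practice, the learned model
bn = gum.fastBN("A->B<-C")

# the file extension selects the format (bif, dsl, net, bifxml, o3prm, uai)
gum.saveBN(bn, "model.bif")

# read it back later
bn2 = gum.loadBN("model.bif")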

As far as I understand, evidence for sampling is just a way to filter the samples, keeping only those that respect the constraints (rejection sampling). There is no such direct method in pyAgrum, but this can be done as a post-processing step:

import pyAgrum as gum

# create a BN with random CPTs
bn = gum.fastBN("A->B{yes|maybe|no}<-C->D->E<-F<-B")

# generate a sample of size 100
g = gum.BNDatabaseGenerator(bn)
g.setRandomVarOrder()
g.drawSamples(100)
df = g.to_pandas()

# rejection sampling: keep only the rows consistent with the evidence
rslt_df = df[(df['B'] == "yes") &
             (df['E'] == "1")]

And in a notebook: (screenshot of the same code and its output in a Jupyter notebook)

  • Thank you, that looks very promising. When it comes to creating sample data, is it possible to provide evidence, i.e. rules that all output sample data will follow? – Hakon Aug 30 '22 at 08:18
  • I changed my answer to show you how. – Pierre-Henri Wuillemin Aug 30 '22 at 16:10
  • Great! I hope you don't mind me asking one more question: when I try to use gum in Python 3.9, I sometimes encounter a "bad allocation" error. I am not sure if this is a problem with my laptop's memory or something else. In Python 3.7, on the other hand, I only have access to versions up to 0.22, and I get a type error related to C++ arrays. Is there a preferred environment to use gum in? – Hakon Aug 31 '22 at 12:53
  • The reason for the bad allocation error is the curse of dimensionality :-) Unfortunately, a Bayesian network can easily fill all your memory (large domain size, large number of parents, etc.)... aGrum/pyAgrum follows NEP29 (https://numpy.org/neps/nep-0029-deprecation_policy.html), which defines a policy for dropping support. – Pierre-Henri Wuillemin Aug 31 '22 at 17:17
  • Would using forbiddenArcs potentially lower the dimensionality? – Hakon Sep 02 '22 at 10:43
  • Indeed, the maximum number of parents, forbidden arcs, etc. are constraints that allow you to minimize the dimension (the number of parameters) and thus the size of the Bayesian network. However, it may not be sufficient for inference: the treewidth of the BN can be very large even if the number of parents is bounded. – Pierre-Henri Wuillemin Sep 03 '22 at 09:45
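Following up on the constraint discussion in the comments above, here is a rough sketch of how such constraints could be expressed with pyAgrum's BNLearner. The method names (setMaxIndegree, addForbiddenArc) are given from memory and may vary by version, and "data.csv" is only a placeholder file:

import pyAgrum as gum

# learn a BN from a CSV file (placeholder name) under structural constraints
learner = gum.BNLearner("data.csv")
learner.setMaxIndegree(3)          # bound the number of parents per node
learner.addForbiddenArc("A", "B")  # forbid a specific arc between two columns
bn = learner.learnBN()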