I'm currently working on an image-classification problem using Bayesian networks. I have tried pomegranate, pgmpy, and bnlearn. My dataset contains more than 200,000 images, on each of which I run a feature-extraction algorithm to get a feature vector of size 1026.

pgmpy

from pgmpy.models import BayesianModel
from pgmpy.estimators import HillClimbSearch, BicScore

train_df = feature_df[:20]  # only the first 20 rows, to keep the search tractable
est = HillClimbSearch(train_df, scoring_method=BicScore(train_df))
best_model = est.estimate()  # greedy hill-climbing structure search
edges = best_model.edges()
model = BayesianModel(edges)
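One knob that may help, depending on the pgmpy release: in several versions, HillClimbSearch.estimate() accepts a max_indegree argument (check your version's signature), which caps the number of parents per node and shrinks the candidate-edge set. A minimal sketch, continuing from the code above:

# Assumption: this pgmpy release's estimate() accepts max_indegree.
# Capping parents per node prunes the hill-climbing search space.
best_model = est.estimate(max_indegree=2)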

pomegranate

from pomegranate import BayesianNetwork

# Exact structure search over all DAGs; cost is super-exponential in the variable count.
model = BayesianNetwork.from_samples(feature_df[:20], algorithm='exact')
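The 'exact' algorithm searches over all DAGs, and its cost grows super-exponentially with the number of variables, which would explain the freeze with 1026 features. A minimal sketch of a cheaper alternative, assuming the 0.x from_samples API: the Chow-Liu option fits a tree-structured network, which is less expressive but roughly quadratic in the number of variables.

from pomegranate import BayesianNetwork

# Chow-Liu learns a maximum-spanning-tree structure instead of searching all DAGs.
X = feature_df[:20].to_numpy()
model = BayesianNetwork.from_samples(X, algorithm='chow-liu')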

bnlearn

library(bnlearn)
df <- read.csv('conv_encoded_images.csv')
df$Age <- as.numeric(df$Age)    # ensure the column is numeric
res <- hc(df)                   # hill-climbing structure search
model <- bn.fit(res, data = df) # fit the parameters given the learned structure

The bnlearn program in R completes in a couple of minutes, while the pgmpy version runs for hours and pomegranate freezes my system after a few minutes. You can see from my code that I'm giving only the first 20 rows for training in the pgmpy and pomegranate programs, while bnlearn takes the whole dataframe. Since I do all my image preprocessing and feature extraction in Python, it is inconvenient for me to switch between R and Python for training.

My data contains continuous values ranging from 0 to 1. I've also tried discretizing the data into 0s and 1s, which didn't resolve the issue.
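For reference, a minimal sketch of that kind of binarization (the 0.5 threshold is an assumption; the post does not say where the cut was made):

# feature_df is the feature DataFrame from the question.
# Threshold each [0, 1] feature at 0.5 to obtain binary states.
binary_df = (feature_df >= 0.5).astype(int)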

Is there any way to speed up training in these Python packages, or am I doing something wrong in my code?

Thanks in advance for any help.

Edit:

https://drive.google.com/file/d/1HbAqDQ6Uv1417zPFMgWBInC7-gz233j2/view?usp=sharing

This is a dataset with 300 columns and ~40,000 rows, in case you want to try reproducing the output.

Hari Krishnan
  • Are you sure that pgmpy and pomegranate can learn the structure for Gaussian data? Perhaps they are just converting each variable to a large number of states, hence large CPTs, memory issues, and then slowness. As for speed, bnlearn is coded in C, so it is quite fast; what are the Python packages coded in? (Quick comment: are you sure that your bounded [0, 1] data, which you call continuous, is approximately Gaussian?) – user20650 Jun 03 '20 at 10:34
  • ... I have seen people calling R's bnlearn from Python, which would allow you to stay within the Python environment (in fact, IIRC, some people have written a Python wrapper for bnlearn) – user20650 Jun 03 '20 at 10:46
  • 1
    I thought the same about continuous data, which is why I discretized the data into 0's and 1's. There's a bnlearn package in python which is a pgmpy wrapper than R one. Also I tried using rpy2 package in python to run R code. But I was facing error where python could not load R.dll, even though it was in the path. Thanks for the reply – Hari Krishnan Jun 03 '20 at 12:47
  • Have you found a solution? – tommy Aug 28 '20 at 12:10
  • I wrote a greedy structure-learning algorithm to improve the speed, so the quality of the model dropped. – Hari Krishnan Aug 28 '20 at 13:35
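For readers who hit the same wall, a minimal rpy2 sketch of driving R's bnlearn from Python. This assumes rpy2 can locate a working R installation (the R.dll failure mentioned above is typically an R_HOME/PATH issue on Windows) and that the bnlearn R package is already installed; importr exposes R's bn.fit as bn_fit by replacing the dot with an underscore.

from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()          # auto-convert pandas DataFrames to R data.frames
bnlearn = importr('bnlearn')  # load the R package into Python

structure = bnlearn.hc(feature_df)                   # same as hc(df) in R
fitted = bnlearn.bn_fit(structure, data=feature_df)  # same as bn.fit(res, data = df)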

0 Answers