I have trained a Bayesian network using the pgmpy library. I want to find the joint probability of a new event (as the product of the probability of each variable given its parents, if it has any).

Currently I am doing:

from pgmpy.inference import VariableElimination

infer = VariableElimination(model)
evidence = dict(x_test.iloc[0])
result = infer.query(variables=[], evidence=evidence, joint=True)
print(result)

Here x_test is the test dataframe.

The result is a very large output listing every combination of values from the training data along with its probability.

+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+------------------------------------------+-----------------+---------------------------+-----------------------------------------+------------------------------+------------------------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| data_devicetype                                                                                                                              | data_username                      | data_applicationtype                     | event_type      | servicename               | data_applicationname                    | tenantname                   | data_origin            | geoip_country_name        |   phi(data_devicetype,data_username,data_applicationtype,event_type,servicename,data_applicationname,tenantname,data_origin,geoip_country_name) |
+==============================================================================================================================================+====================================+==========================================+=================+===========================+=========================================+==============================+========================+===========================+=================================================================================================================================================+
| data_devicetype(Mozilla_5_0_Windows_NT_10_0_Win64_x64_AppleWebKit_537_36_KHTML_like_Gecko_Chrome_94_0_4606_81_Safari_537_36)                 | data_username(christofer) | data_applicationtype(Custom_Application) | event_type(sso) | servicename(saml_runtime) | data_applicationname(GD)            | tenantname(amx-sni-ksll0) | data_origin(1_0_64_66) | geoip_country_name(Japan) |                                                                                                                                          0.0326 |
+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+------------------------------------------+-----------------+---------------------------+-----------------------------------------+------------------------------+------------------------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| data_devicetype(Mozilla_5_0_Windows_NT_10_0_Win64_x64_AppleWebKit_537_36_KHTML_like_Gecko_Chrome_94_0_4606_81_Safari_537_36)                 | data_username(marty) | data_applicationtype(Custom_Application) | event_type(sso) | servicename(saml_runtime) | data_applicationname(VAULT)      | tenantname(login_pqr_com) | data_origin(1_0_64_66) | geoip_country_name(Japan) |                                                                                                                                          0.0156 |
+----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+------------------------------------------+-----------------+---------------------------+-----------------------------------------+------------------------------+------------------------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| data_devicetype(Mozilla_5_0_Windows_NT_10_0_Win64_x64_AppleWebKit_537_36_KHTML_like_Gecko_Chrome_94_0_4606_81_Safari_537_36)                 | data_username(lincon) | data_applicationtype(Custom_Application) | event_type(sso) | servicename(saml_runtime) | data_applicationname(apps_think4ch_com) | tenantname(login_abc_com) | data_origin(1_0_64_66) | geoip_country_name(Japan) |                                                                                                                                          0.0113 |
......contd

Please help me find the probability of a new event (i.e., a row in the test data). The probability expression is P(data_devicetype, data_username, data_applicationtype, event_type, servicename, data_applicationname, tenantname, data_origin, geoip_country_name).

1 Answer

If I understand you correctly, you are trying to compute the probability of a new data point. Unfortunately, pgmpy does not yet have a direct method for this, but you can read the probability value off an inference result. Something like this:

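from pgmpy.inference import VariableElimination

infer = VariableElimination(model)
# Joint distribution over every variable in the network (no evidence).
result = infer.query(variables=list(model.nodes()), joint=True)
# Look up the entry of the joint table that matches the test row.
evidence = dict(x_test.iloc[0])
p_evidence = result.get_value(**evidence)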

Essentially, this computes the joint distribution over all the variables and then looks up the probability value of the evidence data point. As you would expect, this can be computationally very inefficient for large networks, because the joint table grows exponentially with the number of variables. In such cases, an approximate way to compute the probability is to use simulation:

import numpy as np

nsamples = int(1e6)
samples = model.simulate(nsamples)
evidence = dict(x_test.iloc[0])
# Keep only the simulated rows that match the test row on every variable.
matching_samples = samples[np.logical_and.reduce([samples[k] == v for k, v in evidence.items()])]
p_evidence = matching_samples.shape[0] / nsamples

With the simulation method, we generate simulated data from the model and count how many of those samples match our data point; that fraction is an estimate of its probability.
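Since the question asks for the product of each variable's probability given its parents, that product can also be computed directly, without building the full joint table or sampling. The following is only a minimal sketch of the idea in plain Python, not pgmpy code: the toy network (`rain`, `sprinkler`, `wet`), the `cpts` dictionary layout, and the helper names are all hypothetical and chosen for illustration. With a real pgmpy model, the conditional probability tables would have to be read from the fitted model instead. Logs are summed rather than multiplying raw probabilities, which helps avoid underflow when there are many variables.

```python
import math

# Hypothetical toy network: rain -> sprinkler, (rain, sprinkler) -> wet.
# Each node maps to (parents, table). Table keys are
# (parent values..., node value); table values are P(node | parents).
cpts = {
    "rain": ((), {("yes",): 0.2, ("no",): 0.8}),
    "sprinkler": (("rain",), {
        ("yes", "on"): 0.01, ("yes", "off"): 0.99,
        ("no", "on"): 0.4, ("no", "off"): 0.6,
    }),
    "wet": (("rain", "sprinkler"), {
        ("yes", "on", "yes"): 0.99, ("yes", "on", "no"): 0.01,
        ("yes", "off", "yes"): 0.8, ("yes", "off", "no"): 0.2,
        ("no", "on", "yes"): 0.9, ("no", "on", "no"): 0.1,
        ("no", "off", "yes"): 0.0, ("no", "off", "no"): 1.0,
    }),
}


def joint_log_probability(cpts, row):
    """Sum log P(node | parents) over all nodes for one fully observed row."""
    total = 0.0
    for node, (parents, table) in cpts.items():
        key = tuple(row[p] for p in parents) + (row[node],)
        p = table.get(key, 0.0)  # a combination missing from the table counts as 0
        if p == 0.0:
            return -math.inf  # any zero factor makes the whole product zero
        total += math.log(p)
    return total


def joint_probability(cpts, row):
    """Chain-rule joint probability of a fully observed row."""
    return math.exp(joint_log_probability(cpts, row))


row = {"rain": "no", "sprinkler": "on", "wet": "yes"}
p_row = joint_probability(cpts, row)  # 0.8 * 0.4 * 0.9
```

Note that a test row containing a value combination never seen in training gets probability 0 under this scheme, so for anomaly-style scoring it may be more convenient to work with the log-probability directly.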

Ankur Ankan