I was curious and did some experiments based on Daniel Möller's comment in this thread, using TensorFlow 2.0 with Keras:
**Update: Make the order not matter anymore**
To make the order not matter anymore, we need to remove the order information from our dataset. To do this, we first convert it to a one-hot encoding and then take the max() along the position axis, which squashes it back down to two dimensions:
x_no_order = tf.keras.utils.to_categorical(x)
This gives us a one-hot encoded array looking like this:
array([[[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.]],

       [[0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.]],

       [[0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 1.]]], dtype=float32)
Taking the max() along axis 1 of that array then gives us, for each row, a vector that only knows which numbers occur, without any information about their position:
x_no_order.max(axis=1)
array([[0., 1., 1., 1., 0., 0., 0.],
       [0., 1., 1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1.]], dtype=float32)
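To actually train on this order-free representation, you can reuse the model from further down with an input shape of 7 instead of 3. A minimal sketch (the name model_no_order is mine; x and y_train are built as in the code below):

x_no_order = tf.keras.utils.to_categorical(x).max(axis=1)  # shape (5, 7): which numbers occur, order-free

input_layer = tf.keras.layers.Input(shape=(7,))  # 7 = one-hot depth, instead of 3 positions
dense_layer = tf.keras.layers.Dense(6)(input_layer)
dense_layer2 = tf.keras.layers.Dense(20)(dense_layer)
out_layer = tf.keras.layers.Dense(3, activation="softmax")(dense_layer2)
model_no_order = tf.keras.Model(inputs=[input_layer], outputs=[out_layer])
model_no_order.compile(optimizer="Nadam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_no_order.fit(x_no_order, y_train, epochs=100)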
**First, create the dataframe and the training data**
That's a multiclass classification task, so I use the Tokenizer to encode the labels (there are surely better approaches, since it is really meant for text):
import tensorflow as tf
import numpy as np
import pandas as pd
df = pd.DataFrame({
"problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
"results": ["A", "A", "C", "C", "A"]
})
x = df['problems']
y = df['results']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(y)
y_train = tokenizer.texts_to_sequences(y)
x = np.array([np.array(i, dtype=np.int32) for i in x])
y_train = np.array(y_train, dtype=np.int32)
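For reference, the Tokenizer lowercases the labels and indexes them by frequency, so "A" becomes 1 and "C" becomes 2, with index 0 reserved as a placeholder; you can verify this like so:

print(tokenizer.word_index)  # {'a': 1, 'c': 2}
print(y_train.ravel())       # [1 1 2 2 1]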
**Then create the model**
input_layer = tf.keras.layers.Input(shape=(3,))
dense_layer = tf.keras.layers.Dense(6)(input_layer)
dense_layer2 = tf.keras.layers.Dense(20)(dense_layer)
out_layer = tf.keras.layers.Dense(3, activation="softmax")(dense_layer2)  # 3 outputs: placeholder index 0 plus "A" and "C"
model = tf.keras.Model(inputs=[input_layer], outputs=[out_layer])
model.compile(optimizer="Nadam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
**Train the model by fitting it**
hist = model.fit(x, y_train, epochs=100)
Then, based on Daniel's comment, you take the sequence you want to test and mask out certain values (by setting them to 0) to test their influence:
arr = np.reshape(np.array([1, 2, 3]), (1, 3))
print(model.predict(arr))
arr = np.reshape(np.array([0, 2, 3]), (1, 3))
print(model.predict(arr))
arr = np.reshape(np.array([1, 0, 3]), (1, 3))
print(model.predict(arr))
arr = np.reshape(np.array([1, 2, 0]), (1, 3))
print(model.predict(arr))
This will print the following result. Keep in mind that since the labels start at 1, the first value in each row is the placeholder for index 0, so the second value stands for "A" and the third for "C":
[[0.00441748 0.7981055 0.19747704]]
[[0.00103579 0.9863035 0.01266076]]
[[0.0031549 0.9953074 0.00153765]]
[[0.01631758 0.00633342 0.977349 ]]
There we can see that for the full sequence [1,2,3], "A" is correctly predicted with 0.7981. When we change the 3 in [1,2,3] to 0, giving [1,2,0], the model suddenly predicts "C", so the 3 in the third position has the biggest influence. Putting that into a function, you can run it over all the training data you have and build statistical metrics to analyze it further.
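A minimal sketch of such a function (the name mask_influence and the scoring are my own: it masks one position at a time and records how much the probability of the originally predicted class drops):

def mask_influence(model, sample):
    # Predict the unmasked sample and remember its winning class.
    base = model.predict(np.reshape(np.array(sample), (1, 3)))[0]
    base_class = base.argmax()
    drops = []
    # Mask each position with 0 and measure the probability drop.
    for pos in range(len(sample)):
        masked = np.array(sample, dtype=np.int32)
        masked[pos] = 0
        pred = model.predict(np.reshape(masked, (1, 3)))[0]
        drops.append(base[base_class] - pred[base_class])
    return base_class, drops

print(mask_influence(model, [1, 2, 3]))  # the biggest drop marks the most influential position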
This is just a very simple approach, but keep in mind that there is a big research field around this called sensitivity analysis. You might want to take a deeper look at that topic if you are interested.