Training missing labels using Label Propagation/Spreading for a dataframe with several labels

Question

The example below tests the label spreading algorithm on a dummy dataset (adapted from https://scikit-learn.org/stable/auto_examples/semi_supervised/plot_label_propagation_digits.html) before I apply it to my own dataset.

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
rng = np.random.RandomState(2)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:340]]
y = digits.target[indices[:340]]
images = digits.images[indices[:340]]

tot_samples = len(y)
labeled_points = 40

indices = np.arange(tot_samples)

non_labeled_set = indices[labeled_points:]

# Mark every point outside the labeled subset as unlabeled (-1)
y_train = np.copy(y)
y_train[non_labeled_set] = -1

I would like to apply label propagation to an existing dataset of mine, which has the following fields:

User1   User2     Class   Weight
A1       B1         1      2.1
A1       C1         1      3.3
A2       D3        -1      2.1
C3       C1         0      2.5
D1       A1         1      1.3
C3       D1        -1      2.5
A2       A4        -1      1.5

Class is a property of User1. The nodes are A1, A2, B1, C1, C3, D1, D3 and A4, but only A1, A2, C3 and D1 have labels. The others (B1, C1, D3, A4) do not. I would like to predict their labels using the label propagation algorithm. Can someone explain how to apply the above code to my case? The challenge is that there are multiple labels to determine. I think it should still work, even with a multi-class sample of data.

As I understand the algorithm, it propagates labels to neighboring unlabeled nodes according to the edge weights. This step is repeated many times until the labels on the unlabeled nodes reach an equilibrium, which is the prediction for those nodes.

I would expect the following output:

B1: 1
C1: 0
D3: -1
A4: -1

which values do you want to replace and what would you like to replace them with? — itprorh66, Nov 13 '21 at 16:32
Hi itprorh66. My dataset is fully labelled, so I need to test the algorithm on it to see whether the output is satisfactory. I would like to keep one or two labelled points in each class and determine the rest, if possible, no matter which values get replaced. I would expect the algorithm to recover the other, unlabelled ones. I know it is a challenging dataset since it contains multiple classes. I hope this makes sense; if it doesn't, let me know. Thanks — LdM, Nov 13 '21 at 17:04
I am sorry, but your explanation doesn't help me understand your problem. Can you provide a sample of your input and what your desired output would look like? — itprorh66, Nov 14 '21 at 00:02
I tried to edit the question a bit. Hopefully it is clearer now. I have included a Weight edge property (I forgot to include it initially) and the expected output (it might be slightly different). Please let me know if it is still not clear. Thanks — LdM, Nov 14 '21 at 01:47

score 2 · Accepted Answer · answered Nov 19 '21 at 20:05

Let's first fabricate some data to work with. Because both the pairings between users and the class labels are random, the predictions will not be meaningful, but the data lets us run and illustrate the code.

import random
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.semi_supervised import LabelSpreading

seed = 0
random.seed(seed)
np.random.seed(seed)

u_nodes = ['A1', 'A2', 'B1', 'C1', 'C3', 'D1', 'D3', 'A4']
n_nodes = len(u_nodes)

data = {'User1': [], 'User2': [], 'Class': [], 'Weight': []}
idxs = np.arange(n_nodes)
for u in u_nodes:
    data['User1'].extend([u]*n_nodes)
    data['User2'].extend(u_nodes) # we'll shuffle and remove duplicates in a bit
    cls = np.asarray([random.randint(0, 1) for i in range(n_nodes)])
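    # mask up to two random labels with -1 (np.random.choice samples with
    # replacement by default, so the two indices may coincide)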
    cls[np.random.choice(idxs, 2)] = -1
    data['Class'].extend(list(cls))
    # build random weights in the range (0, 1)
    data['Weight'].extend([round(random.uniform(0, 1), 2) for i in range(n_nodes)])

df = pd.DataFrame(data)
df = df[~(df.User1 == df.User2)].sample(frac=1).reset_index(drop=True)
print(df)

In all, 56 rows are generated (the 8 × 8 user pairs minus the 8 self-pairs).

Let's have a look at the points whose labels are missing:

df_missing = df[df.Class == -1]
unlabeled_set = list(df_missing.index)
print(df_missing)

Finally, encode features and train a label spreading model.

X = df[['User1', 'User2']]
y_train = df['Class'].values

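# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
enc = OneHotEncoder(sparse_output=False)
# fit on a plain array so the encoder is not tied to the column name 'User1'
enc.fit(X[['User1']].to_numpy())

# use the same encoder for both columns, assuming A1 is the same user whether it appears as User1 or User2
X_train = np.concatenate((enc.transform(X[['User1']].to_numpy()), enc.transform(X[['User2']].to_numpy())), axis=1)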

# without weights
lp_model = LabelSpreading(gamma=0.25, max_iter=20)
lp_model.fit(X_train, y_train)
predicted_labels = lp_model.transduction_[unlabeled_set]
print(predicted_labels)

The labels estimated by this model are printed below. These carry no meaning, since both the features and the labels were generated randomly.

[0 1 1 0 1 0 0 0 1 1 1 1 1]
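
On data whose true labels are known, the same kind of fit can be sanity-checked by hiding most of the labels and comparing `transduction_` with the values that were held back. The sketch below reuses the digits subset from the question; the names `hidden` and `accuracy` are illustrative, and the score it prints says nothing about how the approach would behave on the user data.

```python
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading

digits = datasets.load_digits()
rng = np.random.RandomState(2)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:340]]
y_true = digits.target[indices[:340]]

# Keep the first 40 labels and hide the rest behind the -1 marker.
hidden = np.arange(40, len(y_true))
y_train = y_true.copy()
y_train[hidden] = -1

model = LabelSpreading(gamma=0.25, max_iter=20)
model.fit(X, y_train)

# Fraction of hidden labels that the model recovered.
accuracy = np.mean(model.transduction_[hidden] == y_true[hidden])
print(f"accuracy on {len(hidden)} hidden labels: {accuracy:.3f}")
```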

I haven't used the weights yet. You need to define your own kernel, i.e. a function that builds the weight matrix, based on the nature of your data, as described in the LabelSpreading documentation:

kernel:

String identifier for kernel function to use or the kernel function itself. Only ‘rbf’ and ‘knn’ strings are valid inputs. The function passed should take two inputs, each of shape (n_samples, n_features), and return a (n_samples, n_samples) shaped weight matrix.
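
To make that signature concrete, here is one possible shape for such a function. It is only a sketch under assumptions of mine: each row is encoded as a multi-hot vector marking both users, the Weight is appended as the last feature column, and two rows are considered similar when they share a user, scaled by the geometric mean of their weights. The helper `featurize` and this similarity rule are illustrative choices, not something prescribed by scikit-learn or by the data.

```python
import numpy as np

users = ['A1', 'A2', 'B1', 'C1', 'C3', 'D1', 'D3', 'A4']
col = {u: i for i, u in enumerate(users)}

# (User1, User2, Weight) rows taken from the sample table in the question
rows = [('A1', 'B1', 2.1), ('A1', 'C1', 3.3), ('A2', 'D3', 2.1),
        ('C3', 'C1', 2.5), ('D1', 'A1', 1.3), ('C3', 'D1', 2.5),
        ('A2', 'A4', 1.5)]

def featurize(rows):
    """Multi-hot vector marking both users of a row, plus the weight as last column."""
    feats = np.zeros((len(rows), len(users) + 1))
    for i, (u1, u2, w) in enumerate(rows):
        feats[i, col[u1]] = 1.0
        feats[i, col[u2]] = 1.0
        feats[i, -1] = w
    return feats

def my_kernel(A, B):
    """Return a (len(A), len(B)) non-negative weight matrix."""
    shared = A[:, :-1] @ B[:, :-1].T               # number of users two rows share
    scale = np.sqrt(np.outer(A[:, -1], B[:, -1]))  # geometric mean of the weights
    return shared * scale

X_feat = featurize(rows)
K = my_kernel(X_feat, X_feat)
print(K.shape)  # (7, 7)
```

With this encoding, `X_feat` would take the place of `X_train` when fitting, because the kernel expects the weight in the last column.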

Once you have defined your weight-matrix generator my_kernel, pass it to the constructor as shown below:

lp_model = LabelSpreading(gamma=0.25, max_iter=20, kernel=my_kernel)
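
Since Class is described as a property of the user rather than of the pair, another option is to run the propagation on the user graph itself, with Weight as the edge weight. The following is a minimal NumPy sketch of that idea, not the scikit-learn implementation. It assumes the edges are undirected, that -1, 0 and 1 are three genuine classes (as the expected output in the question suggests), that C3 has class 0 (the table also lists -1 for it), and that the known labels are clamped after every pass. The names `edges`, `known` and `propagate` are illustrative.

```python
import numpy as np

# (User1, User2, Weight) edges from the sample table in the question
edges = [('A1', 'B1', 2.1), ('A1', 'C1', 3.3), ('A2', 'D3', 2.1),
         ('C3', 'C1', 2.5), ('D1', 'A1', 1.3), ('C3', 'D1', 2.5),
         ('A2', 'A4', 1.5)]
# Assumed node labels; C3 is ambiguous in the table and is taken as 0 here.
known = {'A1': 1, 'A2': -1, 'C3': 0, 'D1': 1}

def propagate(edges, known, max_iter=1000, tol=1e-9):
    nodes = sorted({u for u1, u2, _ in edges for u in (u1, u2)})
    idx = {n: i for i, n in enumerate(nodes)}
    classes = sorted(set(known.values()))

    # Symmetric weighted adjacency matrix.
    W = np.zeros((len(nodes), len(nodes)))
    for u1, u2, w in edges:
        W[idx[u1], idx[u2]] = W[idx[u2], idx[u1]] = w
    # Row-normalise so each node averages over its neighbours by weight.
    T = W / W.sum(axis=1, keepdims=True)

    # One-hot rows for labeled nodes, zeros for unlabeled ones.
    Y0 = np.zeros((len(nodes), len(classes)))
    for node, c in known.items():
        Y0[idx[node], classes.index(c)] = 1.0
    labeled = np.array([n in known for n in nodes])

    F = Y0.copy()
    for _ in range(max_iter):
        F_new = T @ F
        F_new[labeled] = Y0[labeled]      # clamp the known labels
        if np.abs(F_new - F).max() < tol:  # equilibrium reached
            F = F_new
            break
        F = F_new
    return {n: classes[int(F[idx[n]].argmax())] for n in nodes if n not in known}

pred = propagate(edges, known)
print(pred)
```

Under these assumptions B1 takes the label of A1, and D3 and A4 take that of A2, since those are their only neighbours. C1 comes out as 1 rather than 0 because its edge to A1 (3.3) is heavier than its edge to C3 (2.5).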
