
I've computed a proximity matrix for my ~1000 data points using a Random Forest, but when I visualize this matrix with sklearn's MDS, the results are quite strange and difficult to reason about.

The code I used to process my data is below:

import requests
from io import StringIO

import numpy as np
import pandas as pd

data_url = "https://raw.githubusercontent.com/ychennay/ychennay.github.io/master/KAG_conversion_data.csv"
TARGET = ["Approved_Conversion"]

# read data into memory
data_string = requests.get(data_url).content
conversions_df = pd.read_csv(StringIO(data_string.decode("utf-8")))

# keep the ad ids for labeling the distance matrix later
ad_ids = conversions_df["ad_id"].tolist()

# drop unused columns (COLUMNS_TO_DROP is a list of column names defined earlier)
conversions_df = conversions_df.drop(columns=COLUMNS_TO_DROP)

conversions_df["bias"] = 1  # add a bias/intercept column

# using a dictionary, convert columns into categorical data types
convert_dict = {"gender": "category",
                "interest": "category",
                "age": "category"}
conversions_df = conversions_df.astype(convert_dict)

# get dummy features for the categorical variables
dummified_data = pd.get_dummies(conversions_df, drop_first=True)

# define the target and features
y = dummified_data[TARGET].values.reshape(-1)
X = dummified_data.loc[:, ~dummified_data.columns.isin(TARGET)]

After this preprocessing, I fit a RandomForestRegressor with Approved_Conversion as the target:

from sklearn.ensemble import RandomForestRegressor

B = 500
rf = RandomForestRegressor(n_estimators=B)
rf.fit(X, y)

# rf.apply returns the index of the leaf each sample lands in for every tree (n_samples x B)
final_positions = rf.apply(X)
proximity_matrix = np.zeros((len(X), len(X)))
# adapted from the implementation found here:
# https://stackoverflow.com/questions/18703136/proximity-matrix-in-sklearn-ensemble-randomforestclassifier
for tree_idx in range(B):
    proximity_matrix += np.equal.outer(final_positions[:,tree_idx], 
                                       final_positions[:,tree_idx]).astype(float)
# divide by the # of estimators
proximity_matrix /= B

distance_matrix = 1 - proximity_matrix
distance_matrix = pd.DataFrame(distance_matrix, columns=ad_ids, index=ad_ids)
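
For what it's worth, the proximity matrix passes basic sanity checks for this construction — it should be symmetric, bounded in [0, 1], and exactly 1 on the diagonal, since every point shares a leaf with itself in all B trees:

# illustrative sanity checks: symmetric, values in [0, 1], and 1.0 on
# the diagonal (each point always lands in the same leaf as itself)
assert np.allclose(proximity_matrix, proximity_matrix.T)
assert np.allclose(np.diag(proximity_matrix), 1.0)
assert proximity_matrix.min() >= 0.0 and proximity_matrix.max() <= 1.0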

However, when I plot the MDS embedding, the points form an almost perfect circle, which is not very informative. I expected coherent clusters corresponding to groups of the most similar data points:

from sklearn.manifold import MDS

mds = MDS(n_components=2, dissimilarity='precomputed')
reduced_dimensions = mds.fit_transform(distance_matrix)
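
The plotting step is just a scatter of the two embedded dimensions (a simplified sketch of it with matplotlib):

import matplotlib.pyplot as plt

# scatter the two MDS dimensions; each point is one ad
plt.scatter(reduced_dimensions[:, 0], reduced_dimensions[:, 1], s=10, alpha=0.6)
plt.xlabel("MDS dimension 1")
plt.ylabel("MDS dimension 2")
plt.title("MDS embedding of Random Forest proximities")
plt.show()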

[Image: MDS scatter plot of the distance matrix; the points form a nearly uniform circle]

If I try using MDS with the proximity_matrix instead, it's more or less the same pattern:

[Image: MDS scatter plot using the proximity matrix; essentially the same circular pattern]

I'm not very familiar with MDS, but I can't explain why it's giving me such poor results when most articles online recommend it for visualizing distance/similarity matrices.

I've also validated that the actual entries of the matrix make sense. For instance, when I query the most similar ads to a particular Facebook ad (the dataset is paid Facebook campaign performance data), the results do make sense (the ad I inputted is highlighted, and the most similar results show up below):

[Image: table showing the queried ad highlighted, followed by its most similar ads]
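
The lookup itself is roughly the following sketch (simplified; some_ad_id is a placeholder for whatever ad id I query):

# roughly how I pull the most similar ads: take that ad's row of the
# distance matrix and sort ascending (the queried ad itself comes
# first, with distance 0); some_ad_id is a placeholder id
most_similar = distance_matrix.loc[some_ad_id].sort_values().head(10)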

Can anyone give me some pointers on what I might be doing wrong? If I reduce the dimensions using PCA instead, I get somewhat more "normal" results (at least the variance extends along both principal components):

[Image: PCA scatter plot; points spread along both principal components]
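
For reference, the PCA comparison is roughly the following sketch (my actual code may differ slightly; it treats each row of the distance matrix as a feature vector):

from sklearn.decomposition import PCA

# reduce the rows of the distance matrix to two principal components,
# then scatter them the same way as the MDS output
pca = PCA(n_components=2)
pca_dimensions = pca.fit_transform(distance_matrix)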

  • I'm not sure what this algorithm does, or specifically what you are trying to achieve, but thought I would just give some suggestions anyway, since a question like this will probably receive less attention because it's complicated. Have you tried removing the step where you use dummy data? Perhaps that could have some kind of normalising effect. On that, if any of this performs some kind of normalisation, I would very much guess that's the issue. – Recessive May 23 '19 at 01:23
  • The dummy data is one-hot encoding of the categorical variables, which otherwise can't be represented as numerical values; it's an important part of the feature engineering and data cleaning process. I'm trying to show logical groups of data points (in this case, ads) together. However, the MDS algorithm essentially just shows a uniform spread (i.e., data points are often all equidistant from each other). – Yu Chen May 23 '19 at 14:26
  • I have the same issue – alienflow Jul 02 '19 at 19:32

1 Answer


I believe the issue is coming from this line: `reduced_dimensions = mds.fit_transform(distance_matrix)`. You are fitting your model and then transforming the results, rather than scaling the input data and fitting the model.

I think doing it in this fashion is causing the data to be manipulated into a normal distribution, which would generate a bell curve, or an oval in the case of multiple variables. What happens if you just try `mds.fit(distance_matrix)`?
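
In code, what I have in mind is roughly this (an untested sketch; sklearn's MDS stores the fitted coordinates in its `embedding_` attribute):

mds = MDS(n_components=2, dissimilarity='precomputed')
mds.fit(distance_matrix)  # fit only, no transform of the result
# the fitted coordinates live on the estimator itself
reduced_dimensions = mds.embedding_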

Apologies, as this makes more sense as a comment; I'm just not allowed to comment yet.

  • `fit()` will not return anything. It basically computes the points in the scaled space but does not return them. – Yu Chen May 23 '19 at 14:24
  • Sorry, I meant to fit, then use the model to predict, and then plot the predictions to visualize. – jmdatasci May 23 '19 at 14:28
  • Could you explain more what you mean by use the model to predict and plot the predictions to visualize? I'm not quite clear what you mean. – Yu Chen May 23 '19 at 14:29