
I have a dataframe that represents all the combinations of data sources and the number of common data points for each combination:

Here's how to build a simplified dataframe:

import pandas as pd

data = {'s1': [True, False, False], 's2': [True, True, True],
        's3': [False, False, True], 's4': [False, True, False],
        'count': [2, 2, 2]}
df = pd.DataFrame(data)

      s1    s2     s3     s4  count
0   True  True  False  False      2
1  False  True  False   True      2
2  False  True   True  False      2

The first line says that we have 2 data points that are common to sources 1 and 2 and that aren't available in sources 3 and 4.

I'm trying to make it more "readable" with a plot, which could be a heatmap, because as you can imagine there are many more combinations. But I can't figure out the right transformation to reach that objective.

How can I achieve that?

  • Does this answer your question? [Making heatmap from pandas DataFrame](https://stackoverflow.com/questions/12286607/making-heatmap-from-pandas-dataframe) – Matt Hall Jul 19 '23 at 10:14
  • @MattHall I understand how to make a heatmap, what I can't figure out is how to make a heatmap (or any visualisation) that makes sense with the data, as you can see I have combinations, so a regular heatmap won't work well – Bouji Jul 19 '23 at 10:30
  • `ax = sns.heatmap(data=df.iloc[:, :-1], annot=True)` seems to work just fine. [Plot](https://i.stack.imgur.com/43eUs.png). In any case, the question is ambiguous. It's your responsibility to indicate **clearly** what is needed, and post a complete [mre]. The question lacks clarity, because we need to guess at the desired outcome, and there's no code showing what has been attempted. – Trenton McKinney Jul 19 '23 at 17:09

2 Answers


You could show a square heatmap showing, for each pair of sources, how many times they appear together. (On the diagonal you then have the total number of times each source appeared, and the heatmap is symmetrical.)

import pandas as pd
import seaborn as sns

cols = df.columns[:-1]  # ignoring `count`; you haven't said what it is and what to do with it
M = df[cols].values     # numpy array of booleans (just me being more comfortable with numpy; there are certainly direct ways in pandas)
matrix = (M[:, :, None] & M[:, None, :]).sum(axis=0)  # pairwise co-occurrence counts
coOccDf = pd.DataFrame(matrix, index=cols, columns=cols)
sns.heatmap(coOccDf, annot=True)


Edit

To take the count into account, the way OCa did (also upvoted :D), we can do this, without for loops:

cols = df.columns[:-1]       # the source columns, ignoring `count` (used separately below)
counts = df['count'].values
M = df[cols].values          # numpy array of booleans
matrix = ((M[:, :, None] & M[:, None, :]) * counts[:, None, None]).sum(axis=0)  # count-weighted co-occurrence
coOccDf = pd.DataFrame(matrix, index=cols, columns=cols)
sns.heatmap(coOccDf, annot=True)

This requires some explanation.

M is the 2D array (shape (3, 4) in the example) of the dataframe values, excluding the count column. The 1st axis is the case number (well, rows of the dataframe; I am not sure what the rows represent exactly here), and the 2nd axis is the sources.

So M[:,:,None] is a 3D array (shape (3, 4, 1) in the example): 1st axis the case number, 2nd axis source #1, and 3rd axis source #2, with source #2 being a size-1 axis (for broadcasting later).

Likewise, M[:,None,:] is a 3D array (shape (3, 1, 4) in the example): 1st axis the case number, 2nd axis source #1 (a size-1 axis for broadcasting), and 3rd axis source #2.

So each is just M, but with a different arrangement (when printing M, M[:,None,:] or M[:,:,None], all that changes is some extra [ or ] in the output).
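
A quick sanity check of the shapes (a minimal sketch, assuming the `df` from the question is loaded):

M = df[df.columns[:-1]].values  # 3 cases x 4 sources
print(M.shape)              # (3, 4)
print(M[:, :, None].shape)  # (3, 4, 1)
print(M[:, None, :].shape)  # (3, 1, 4)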

So any operation between those two triggers broadcasting: axes of size 1 are virtually expanded as if they were the same size as the corresponding axis in the other operand, repeating the values along them.

This is how we write nested for loops in numpy without actually writing the for loops (not for aesthetic reasons, of course, but because it is way faster to get numpy to do the for loops, in C, than to write them in Python).

Just one example of broadcasting: `np.array([1,2,3])[:,None] + np.array([10,20,30])[None,:]`. Here `np.array([1,2,3])[:,None]` is a 3×1 array `[[1],[2],[3]]`, and `np.array([10,20,30])[None,:]` is a 1×3 array `[[10,20,30]]`. So the addition behaves as if we were adding `[[1,1,1],[2,2,2],[3,3,3]]` and `[[10,20,30],[10,20,30],[10,20,30]]`: data is repeated along the singleton axes, with result `[[11,21,31],[12,22,32],[13,23,33]]`. Exactly as if I had written `for i in range(3): for j in range(3): res[i,j] = A[i] + B[j]`. Search "numpy broadcasting" for better explanations than this one.

But this is what I do with my M, except that there are 3 axes, because I need 3 nested for loops (`for i in rows: for j in sources: for k in sources`) to count the number of common cases between source j and source k, for all combinations of j and k.
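
That toy example, written out as code (a small illustration, checking the broadcast result against the explicit nested loops):

import numpy as np

A = np.array([1, 2, 3])
B = np.array([10, 20, 30])
broadcast = A[:, None] + B[None, :]    # shape (3, 3), element [i, j] = A[i] + B[j]

res = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        res[i, j] = A[i] + B[j]

print(broadcast)                       # [[11 21 31] [12 22 32] [13 23 33]]
print(np.array_equal(broadcast, res))  # True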

So here `M[:,:,None] & M[:,None,:]` is a 3D array that holds, for each case i and for every pair of sources j and k, a value `(M[:,:,None] & M[:,None,:])[i,j,k]` which is true iff, in case i, both source j and source k are present.

In my first version, `(M[:,:,None] & M[:,None,:]).sum(axis=0)` is therefore a 2D array giving, for every pair of sources j and k, the number of cases having both source j and source k (we summed the True/False, aka 1/0, values along axis 0, that is along the case axis).

In the second version, to take the count into account, I multiply each case by a weight before summing. But to be able to multiply a 3D array of shape (3, 4, 4) by the 3 count values, I need another broadcast, `counts[:,None,None]`, meaning that each of the 3 values, one per case, is virtually repeated 4×4 times.

In other words, it is as if I had written: `for i in cases: for j in sources: for k in sources: res[j,k] += (M[i,j] & M[i,k]) * count[i]`.
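
To make that equivalence concrete (a minimal sketch, assuming the `df` from the question is loaded), the broadcast computation and the triple loop give the same matrix:

import numpy as np

cols = df.columns[:-1]
M = df[cols].values
counts = df['count'].values

vectorized = ((M[:, :, None] & M[:, None, :]) * counts[:, None, None]).sum(axis=0)

looped = np.zeros((len(cols), len(cols)), dtype=int)
for i in range(M.shape[0]):          # cases
    for j in range(M.shape[1]):      # source j
        for k in range(M.shape[1]):  # source k
            looped[j, k] += (M[i, j] & M[i, k]) * counts[i]

print(np.array_equal(vectorized, looped))  # True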

I don't include another screenshot because, in that example, all counts being 2, it would be exactly the same as before with all values multiplied by 2 (as we can see in OCa's answer).

It is not exactly the one-liner OCa called for, but that is just because I didn't want to obfuscate the computation. All the intermediate variables could be replaced by their values to produce a one-liner. The important point is: no for loops.
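
For instance, literally substituting the intermediate variables gives a single (if less readable) statement, assuming `df`, `pd` and `sns` as above:

sns.heatmap(
    pd.DataFrame(
        ((df[df.columns[:-1]].values[:, :, None]
          & df[df.columns[:-1]].values[:, None, :])
         * df['count'].values[:, None, None]).sum(axis=0),
        index=df.columns[:-1],
        columns=df.columns[:-1],
    ),
    annot=True,
)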

chrslg

Prior comments

  • Interesting question! Although too open, hence the downvote by someone?
  • My added value to chrslg's great answer (upvoted): I take 'count' into account.
  • On the flip side, I resorted to a for loop. Pandas vectorized one-liner, anyone? (One possible sketch follows this list.)
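
One possible vectorized sketch (just an idea, not necessarily the tidiest one-liner; it assumes the `df` from the question and arbitrary variable names): the table of sharing is the matrix product Mᵀ · diag(count) · M, which pandas can express directly:

import pandas as pd

srcs = df.columns[:-1]
weighted = df[srcs].astype(int).mul(df['count'], axis=0)  # each row scaled by its count
Sh_vec = weighted.T @ df[srcs].astype(int)                # sources x sources table of shared counts

Note that this sums the counts if the same pair of sources appears in several rows, whereas the assignment in the loop below keeps only the last value; for the example data both give the same table.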

Construct "table of sharing"

Initializing:

import pandas as pd
import seaborn as sns

datasources = df.columns[:-1]

Sh = pd.DataFrame(columns=datasources,
                  index=datasources,
                  data=0)

Filling in the number of shared datapoints for every pair of data sources:

for k in df.index:  # going over the input dataframe line by line

    # Count of datapoints shared
    n = df.loc[k, 'count']

    # Pair of sharing data sources (assumes exactly two sources per row)
    pair_k = df.columns[df.loc[k] == True]

    # Off-diagonal: number of datapoints shared by this pair
    Sh.loc[pair_k[0], pair_k[1]] = n
    Sh.loc[pair_k[1], pair_k[0]] = n  # only for symmetry - this is redundant

    # Diagonal: ADD UP (hence the "+=") the number of datapoints shared
    Sh.loc[pair_k[0], pair_k[0]] += n
    Sh.loc[pair_k[1], pair_k[1]] += n

Sh
    s1  s2  s3  s4
s1   2   2   0   0
s2   2   6   2   2
s3   0   2   2   0
s4   0   2   0   2

Requested figure

Indeed a heatmap looks appropriate:

sns.heatmap(Sh, cmap='Blues')

Overlaps between datasources

Another remark.

When you say: "So basically the first line says that we have 2 data points common to source 1 and 2 and that aren't available in source 3 and 4."

Your input table will never tell you whether the shared points are the same from one line to the next, so that part of the statement cannot be answered with the information available.

OCa