0

I have a function that finds similarity between columns of two dataframes:

def jac_sim_df(df1, df2, thresh):
    L = []
    for col in df1.columns:  
        js_list = [] 
        genes1 = df1.loc[df1[col] >= 2,:].index  #get DEGs for each column in df1
        for column in df2.columns:
            genes2 = df2.loc[df2[column] >= thresh,:].index  #get genes with values higher than a threshold
            js = jaccard_similarity(genes1, genes2)     #calculate jaccard similarity for genes1 and genes2 
            js_list.append(js) 
        L.append(js_list)
    df = pd.DataFrame(L)
    return df

I want to vary the threshold to see how it affects the similarity between the two dataframes.

Is there a way to apply this function to two dataframes df1 and df2 and a list of thresholds?

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0,100,size=(100, 14)), columns=range(1,15))
df2 = pd.DataFrame(np.random.rand(100, 14), columns=range(1,15))

Threshold values could look like this:

thresh = [x / 1000 for x in range(1, 10)]

The jaccard_similarity function:

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))

The outcome should be multiple dataframes, one per threshold value.

Pranav Hosangadi
Yulia Kentieva

2 Answers

1

EDIT

It seems I misunderstood the question originally. You can do this with map().

From https://docs.python.org/3/library/functions.html#map :

map(function, iterable, *iterables)
Return an iterator that applies function to every item of iterable, yielding the results. If additional iterables arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted. For cases where the function inputs are already arranged into argument tuples, see itertools.starmap().

This link also has some usage examples.
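
As a minimal sketch of the multi-iterable form described in the quoted documentation: the example below uses a hypothetical stand-in function, scaled_sum, in place of jac_sim_df, and itertools.repeat to supply the same first two arguments on every call. The names scaled_sum, a and b are assumptions for illustration only.

```python
from itertools import repeat

def scaled_sum(a, b, thresh):
    # Hypothetical stand-in for jac_sim_df(df1, df2, thresh)
    return (a + b) * thresh

a, b = 10, 20
thresholds = [1, 10, 100]

# repeat(a) and repeat(b) are infinite iterators, so map stops
# as soon as the shortest iterable (thresholds) is exhausted.
results = list(map(scaled_sum, repeat(a), repeat(b), thresholds))
print(results)  # [30, 300, 3000]
```

With the real function, the same pattern would presumably read map(jac_sim_df, repeat(df1), repeat(df2), thresholds).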


OP

If it's not out of the question, you could just call it inside a loop that iterates over your thresholds, like this:

for thresh in thresholds:
  ...
  result = jac_sim_df(df1, df2, thresh)
  ...
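
To keep one result per threshold, the loop can store each value in a dict keyed by the threshold. The sketch below is only an illustration under stated assumptions: it uses plain dicts (col1, col2, hypothetical toy data mapping gene names to scores) instead of dataframe columns, and a single pair of columns instead of all pairs.

```python
def jaccard_similarity(list1, list2):
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)

# Hypothetical toy data: gene -> score, standing in for one column of each dataframe
col1 = {"g1": 5, "g2": 1, "g3": 3}
col2 = {"g1": 0.9, "g2": 0.4, "g3": 0.1}

# Genes in col1 with a value of at least 2 (the fixed cutoff used in the question)
genes1 = [g for g, v in col1.items() if v >= 2]

results = {}
for thresh in [0.05, 0.3, 0.5]:
    # Genes in col2 at or above the current threshold
    genes2 = [g for g, v in col2.items() if v >= thresh]
    results[thresh] = jaccard_similarity(genes1, genes2)

print(results[0.5])  # 0.5
```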

You don't need to do anything special in Python to pass a list as an argument, though. If you expect the parameter thresh to be a list instead of a single item, you just need to account for that in the body of your function definition, so this piece:

...
        genes2 = df2.loc[df2[column] >= thresh,:].index  #get genes with values higher than a threshold
...

would need to be changed to treat thresh as a list instead of a single value. How you do that is up to you: you could iterate through it with a for loop similar to the one above (for t in thresh: ...), use blanket checks with any or all, or something else.
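
One possible way to account for a list-valued thresh without rewriting the function body is a small wrapper. The sketch below is hypothetical: for_each_threshold and toy_similarity are names invented here, and toy_similarity merely stands in for jac_sim_df.

```python
def for_each_threshold(func, df1, df2, thresh):
    """Hypothetical wrapper: call func once per threshold, keyed by threshold."""
    # Accept either a single number or an iterable of numbers
    if isinstance(thresh, (int, float)):
        thresh = [thresh]
    return {t: func(df1, df2, t) for t in thresh}

def toy_similarity(df1, df2, thresh):
    # Stand-in for jac_sim_df, using plain numbers instead of dataframes
    return (df1 + df2) * thresh

print(for_each_threshold(toy_similarity, 10, 20, [1, 10]))  # {1: 30, 10: 300}
print(for_each_threshold(toy_similarity, 10, 20, 2))        # {2: 60}
```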

A. Trevelyan
  • thank you. I thought there would be something like lapply in R to apply any function to a list of arguments. – Yulia Kentieva Dec 15 '22 at 14:40
  • @YuliaKentieva Oh I see, I went for the simplest explanation so I didn't originally mention it, but if that's what you're looking for it does indeed have something like that -- you can use `map()` to apply a function over a list. [Here](https://www.geeksforgeeks.org/python-map-function/) is a pretty good explanation of how to use it, but it's basically just `lapply()` with the arguments reversed. – A. Trevelyan Dec 15 '22 at 15:01
  • Using `map` still requires a call to [`functools.partial`](https://docs.python.org/3/library/functools.html#functools.partial) to fix the first two arguments, or creating two iterables that return `df1` and `df2` and are the size of `thresholds` – Pranav Hosangadi Dec 15 '22 at 16:48
0

We can fix the first two arguments of the function using functools.partial. This returns a partial object that can be called with the remaining arguments.

jac_sim_partial = functools.partial(jac_sim_df, df1, df2)

Now, calling jac_sim_partial(t) is the same as calling jac_sim_df(df1, df2, t).

Finally, we can map each element of thresholds to the value that would be returned by the function:

results = list(map(jac_sim_partial, thresholds))

The list(...) around map just converts the result of map to a list so that you can access elements of it using indices.

I'm going to demonstrate a test using a toy function, since it's easier to understand what's happening with simpler inputs and a simpler function:

import functools

df1 = 10
df2 = 20

def jac_sim_df(df1, df2, thresh):
    return (df1 + df2) * thresh

thresholds = [1, 10, 100, 1000]
jac_sim_partial = functools.partial(jac_sim_df, df1, df2)
results = list(map(jac_sim_partial, thresholds)) # [30, 300, 3000, 30000]
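
To keep track of which result belongs to which threshold, one option is to pair them up in a dict. This is a minimal sketch reusing the toy function from this answer; results_by_thresh is a name assumed here for illustration.

```python
import functools

def jac_sim_df(df1, df2, thresh):
    # Toy function from the example above, not the real similarity function
    return (df1 + df2) * thresh

thresholds = [1, 10, 100, 1000]
jac_sim_partial = functools.partial(jac_sim_df, 10, 20)

# zip pairs each threshold with the result computed for it
results_by_thresh = dict(zip(thresholds, map(jac_sim_partial, thresholds)))
print(results_by_thresh[100])  # 3000
```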
Pranav Hosangadi
  • I'm not sure I understand the question, but why bother with a partial when you could just do a list comp? `results = [jac_sim_df(df1, df2, t) for t in thresholds]` – wjandrea Dec 15 '22 at 20:58
  • @wjandrea I figured the other answer was close enough to a list comprehension – Pranav Hosangadi Dec 15 '22 at 21:00