I have a function, called through apply, that goes through a list of indexes, plugs each one into a scikit-learn KNN model, and returns two lists of size n (neighbor distances and neighbor indexes). (Imagine this is for a movie recommendation system.)

I want to add these results to a new DataFrame.

For example, if my function iterates through 3 indexes and the n-neighbors parameter is 5, I should get a DataFrame with 2 columns and 3 × 5 = 15 rows. But currently my script puts each entire list into a single row.
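To make the target layout concrete, here is a minimal sketch of the shape described above. The neighbor values are made up for illustration (they are not real model output), and the names `neighbor_indices`, `neighbor_distances`, and `desired` are hypothetical.

```python
import pandas as pd

# Hypothetical results for 3 query indexes with 5 neighbors each.
# These numbers are placeholders, not output from a real KNN model.
n_queries, n_neighbors = 3, 5
neighbor_indices = [[q * 10 + k for k in range(n_neighbors)] for q in range(n_queries)]
neighbor_distances = [[round(0.1 * (k + 1), 1) for k in range(n_neighbors)] for _ in range(n_queries)]

# One row per (query, neighbor) pair: flatten both nested lists in the same order.
desired = pd.DataFrame({
    'index': [i for row in neighbor_indices for i in row],
    'distance': [d for row in neighbor_distances for d in row],
})

print(desired.shape)  # (15, 2)
```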

This is my code. movies is the DataFrame that holds the input indexes.

testDF = pd.DataFrame()

def get_distances_indices(index):

    distances, indices = model_knn.kneighbors(data[index], n_neighbors=6)

    distances = pd.Series(distances.flatten().tolist())
    indices = pd.Series(indices.flatten().tolist())

    return indices, distances

testDF[['index','distance']] = testDF.append(movies.apply(lambda row: get_distances_indices(row['index']), axis=1).apply(pd.Series),ignore_index=True)

Any help is appreciated. I am a beginner, and I saw articles saying that using apply here would help speed up the process of getting the list of neighbors.

For the sake of simplicity, here is a reproducible example. I just want the lists/Series to be laid out vertically (one value per row), not horizontally.

import pandas as pd

testDF = pd.DataFrame()
moviesData = {'movie': ['The Big Whale', 'Stack Underflow'], 'index': [3, 99]}
movies = pd.DataFrame(data=moviesData)

def get_distances_indices(index):
    list1 = [51, 700, 999]
    list2 = [.2, .3, .4]
    df2 = pd.Series(list1)
    df3 = pd.Series(list2)

    return df2,df3

testDF[['index','distance']] = movies.apply(lambda row: get_distances_indices(row['index']), axis=1).apply(pd.Series)
testDF.head()
AxW
Please take a look at [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). We don't really care where the data comes from. We need small sample datastructures that we can copy and paste into our interpreters and the desired output datastructure. – timgeb May 06 '20 at 19:39

@timgeb I have added a reproducible example, let me know if I should add anything else. Thanks – AxW May 06 '20 at 20:02

1 Answer

You could try something like this:

...

def get_distances_indices(index):
    list1 = [51, 700, 999]
    list2 = [.2, .3, .4]

    # return a dictionary
    return {'index':list1, 'distance':list2}

d = movies.apply(lambda row: get_distances_indices(row['index']), axis=1)

# flatten the per-row lists into one long list per column
l1 = [item for x in d for item in x['index']]
l2 = [item for x in d for item in x['distance']]

# pair up each neighbor index with its distance, one tuple per output row
data_tuples = list(zip(l1, l2))
pd.DataFrame(data=data_tuples, columns=['index', 'distance'])

If I understood your question correctly, this should give you your desired result:

   index  distance
0     51       0.2
1    700       0.3
2    999       0.4
3     51       0.2
4    700       0.3
5    999       0.4
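As a possible alternative to flattening by hand, the same result can probably be reached with `DataFrame.explode`. This is only a sketch under a few assumptions: it needs pandas 1.3 or later (for exploding several columns at once), it reuses the stand-in values from the reproducible example rather than a real model, and the names `per_movie` and `result` are made up for illustration.

```python
import pandas as pd

movies = pd.DataFrame({'movie': ['The Big Whale', 'Stack Underflow'], 'index': [3, 99]})

def get_distances_indices(index):
    # Same stand-in values as the reproducible example; a real version
    # would query the KNN model here.
    return {'index': [51, 700, 999], 'distance': [.2, .3, .4]}

# Build one row per movie, where each cell holds a whole list.
per_movie = pd.DataFrame([get_distances_indices(i) for i in movies['index']])

# Expand both list columns in lockstep: one output row per list element.
# Passing a list of columns to explode assumes pandas >= 1.3.
result = per_movie.explode(['index', 'distance'], ignore_index=True)

# explode leaves the columns as object dtype, so restore numeric types.
result = result.astype({'index': 'int64', 'distance': 'float64'})
print(result)
```

With the real model, it may also be worth checking whether `kneighbors` can be called once with all query rows, since it returns one row of distances and one row of indices per query; flattening those two arrays would avoid apply altogether.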
ssharma