Iterating over the rows of two dataframes

Question

I have two dataframes; let's call the first one df and the second one compare_df. The first one looks like this:

Date         cell         tumor_size (assume it is three dimensional)
25/10/2015    113           [51, 52, 55]
22/10/2015    222           [50, 68, 22]
22/10/2015    883           [45, 23, 67]
20/10/2015    334           [35, 23, 76]

and the second one looks like this:

Date         cell         tumor_size
19/10/2015    564           [47, 23, 56]
19/10/2015    123           [56, 11, 23]
22/10/2014    345           [36, 66, 78]
13/12/2013    456           [44, 21, 83]

For each row in the first dataframe, I want to go through each row in the second dataframe, record the Euclidean distances, and then take the minimum one. This is the code I wrote to try to accomplish this:

# These will be our lists of pairs and size differences.
pairs = []
diffs = []


for row in df.itertuples():
     compare_df['distance'] = np.linalg.norm(compare_df.tumor_size - row.tumor_size) # I get error for this line
     row_of_interest = compare_df.loc[compare_df.distance == compare_df.distance.min()]
     pairs.append(row_of_interest.cell.values[0])
     diffs.append(row_of_interest.distance.values[0])

df['most_similar_to'] = pairs
df['similarity'] = diffs

However I get:

ValueError: Length of values does not match length of index

The vectors are all the same size, and I have dropped the NaN values. Any ideas?

Possible duplicate of: https://stackoverflow.com/questions/42382263/valueerror-length-of-values-does-not-match-length-of-index-pandas-dataframe-u — Elis Byberi, Nov 21 '17 at 19:47
It is not a duplicate because I checked the size of the vectors. There is something wrong with my code itself, but I don't know what — edyvedy13, Nov 21 '17 at 20:12

score 2 · Accepted Answer · answered Nov 22 '17 at 00:12

Your mistake is in trying to subtract a list of size three (row.tumor_size) from a pd.Series with a different number of rows (compare_df.tumor_size). When subtracting a list/tuple from a pd.Series, pandas tries to match the two sequences and subtract them element by element. When the list and the pd.Series have different lengths, it doesn't know how to match them and raises the exception.

Judging from the error message, your pandas version is probably a bit old. You can try to use apply to force the subtraction to be performed row by row:

compare_df.tumor_size.apply(
    lambda compare_size: np.array(compare_size) - np.array(row.tumor_size)
)
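
The apply call above only produces the per-row differences. As a rough sketch of how it might slot into the loop from the question, the norm can be taken inside the lambda so that each row of compare_df yields one scalar distance. The dataframes are rebuilt here from the sample data in the question (Date column omitted), and the names target, distances and nearest are only illustrative:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (Date column left out for brevity).
df = pd.DataFrame({
    "cell": [113, 222, 883, 334],
    "tumor_size": [[51, 52, 55], [50, 68, 22], [45, 23, 67], [35, 23, 76]],
})
compare_df = pd.DataFrame({
    "cell": [564, 123, 345, 456],
    "tumor_size": [[47, 23, 56], [56, 11, 23], [36, 66, 78], [44, 21, 83]],
})

pairs = []
diffs = []
for row in df.itertuples():
    target = np.array(row.tumor_size)
    # One Euclidean distance per row of compare_df.
    distances = compare_df.tumor_size.apply(
        lambda compare_size: np.linalg.norm(np.array(compare_size) - target)
    )
    nearest = distances.idxmin()  # index label of the closest row
    pairs.append(compare_df.cell.loc[nearest])
    diffs.append(distances.loc[nearest])

df["most_similar_to"] = pairs
df["similarity"] = diffs
```

Keeping the distances in a local Series rather than writing them into compare_df['distance'] avoids modifying compare_df on every iteration.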

Of course, it may be beneficial to convert all the lists to np.array ahead of time.
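
One possible way to do that conversion, sketched under the assumption that every tumor_size list has the same length: stack the lists into 2-D NumPy arrays once, then let broadcasting compute all pairwise distances without an explicit loop. The data is again the sample from the question, and the names sizes, compare_sizes and pairwise are illustrative only:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (Date column left out for brevity).
df = pd.DataFrame({
    "cell": [113, 222, 883, 334],
    "tumor_size": [[51, 52, 55], [50, 68, 22], [45, 23, 67], [35, 23, 76]],
})
compare_df = pd.DataFrame({
    "cell": [564, 123, 345, 456],
    "tumor_size": [[47, 23, 56], [56, 11, 23], [36, 66, 78], [44, 21, 83]],
})

# Stack the per-row lists into arrays of shape (n_rows, 3).
sizes = np.array(df.tumor_size.tolist())
compare_sizes = np.array(compare_df.tumor_size.tolist())

# (n, 1, 3) - (1, m, 3) broadcasts to (n, m, 3); the norm over the last
# axis leaves an (n, m) matrix of Euclidean distances.
pairwise = np.linalg.norm(sizes[:, None, :] - compare_sizes[None, :, :], axis=2)

# Position of the closest compare_df row for each row of df.
nearest_pos = pairwise.argmin(axis=1)
df["most_similar_to"] = compare_df.cell.to_numpy()[nearest_pos]
df["similarity"] = pairwise.min(axis=1)
```

The intermediate array holds n * m * 3 numbers, so for very large dataframes the row-by-row loop may be the more memory-friendly choice.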

If you don't like np.array, you can use:

compare_df.tumor_size.apply(
    lambda compare_size: [compare_size[i] - row.tumor_size[i] for i in range(3)]
)

In pandas 0.21.0 (perhaps a bit earlier), you would have got a different error message:

TypeError: unsupported operand type(s) for -: 'list' and 'list'

In this case, there is an easier solution: just convert the list to an np.array, and the subtraction works:

compare_df.tumor_size -  np.array(row.tumor_size)

For me, this works with pandas==0.21.0 and numpy==1.13.3.
