How to subset a pandas dataframe based on iterating through all row values of another df?

Question

I have two pandas dataframes: key_df and value_df

key_dict = {"coordinates": ["AB1", "AC1", "AD1", "EF1", ... ], "start": [762, 1274, 1587, 1991, ...], "end": [2481, 1789, 1689, 2211, ...] }

key_df = pd.DataFrame(key_dict)

  coordinates     start       end
0         AB1       762      2481
1         AC1      1274      1789
2         AD1      1587      1689
3         EF1      1991      2211
...       ...      ...       ...

value_dict = {"coordinates": ["AD1", "AB1"], "meta_data": [101, 010]}

value_df = pd.DataFrame(value_dict)

  coordinates  meta_data
0         AD1        101
1         AB1        110
...       ...        ...

The coordinates column for key_df only contains unique values---there are no repeats. Similarly for value_df.

I would like to iterate through value_df on coordinates, checking each value of coordinates in value_df for the value of coordinates in key_df. I would then like to return start and end for these values.

My thought to subset the dataframe and grab the start and end values would be to create a function:

def parse(x, df):   ### 'x' is each row of value_df$coordinates
    df = df[df.coordinates == x]
    return (df.start, df.end)   ## return as a tuple

and I would call the function be parse(x, df=key_df)

However, I'm not sure how to iterate over value_df. .iterrows() is quick but it doesn't preserve the row dtype, which may be a problem.

Does an inner join would be what you are looking for? like `value_df.merge(key_df,on="coordinates",how="inner")`? — hanego, Dec 13 '17 at 06:43
@andrew_reece Nothing. It's a typo on my end. I think all R should be gone now — EB2127, Dec 13 '17 at 06:44
@angelwally Yes, but the DataFrames are in fact much larger than above. So I would need to somehow only merge with `start` and `end` — EB2127, Dec 13 '17 at 06:45
@angelwally I'm also worried that `merge()` may take up too much RAM, but that could be irrational — EB2127, Dec 13 '17 at 06:53
Well, you could try. And as coordinates is unique you can try to put index and use join instead (seems faster: https://stackoverflow.com/a/43003698/2857926) — hanego, Dec 13 '17 at 06:58
@angelwally How does this deal with "non-matching" `coordinates`? — EB2127, Dec 13 '17 at 17:03

Abhishek Sharma · Accepted Answer · 2017-12-13T16:55:38.537

1

Just doing a left join will suffice

key_df = key_df.merge(value_df,on='coordinates',how='left')

Another way could be to reindex value_df on basis of coordinates

value_df.index = value_df['coordinates']
common_values = set(key_df.coordinates).intersection(set(value_df.index))

value_df.loc[list(common_values),['start','end']]

edited Dec 13 '17 at 16:55

answered Dec 13 '17 at 08:11

Abhishek Sharma

1,909
2
15
24

Do you have benchmarks which would be faster/more RAM efficient? – EB2127 Dec 13 '17 at 16:30
Is it a left join or an inner join? – EB2127 Dec 13 '17 at 16:54
What is the difference between `value_df.index = value_df['coordinates']` and using `value.set_index("coordinates", inplace=True)`? – EB2127 Dec 13 '17 at 17:00
How does this deal with "non-matching" `coordinates`? – EB2127 Dec 13 '17 at 17:02
value_df.index = value_df['coordinates'] and using value.set_index("coordinates", inplace=True) both are same. non matching coordinates will have start and end as null values. – Abhishek Sharma Dec 13 '17 at 17:09

score 0 · Answer 2 · answered Dec 13 '17 at 07:12

0

try following:

key_df = pd.DataFrame(key_dict)
key_df.set_index('coordinates', inplace=True)
key_df.loc[value_dict['coordinates']]

answered Dec 13 '17 at 07:12

zyun

49
3

How does this deal with "non-matching" `coordinates`? – EB2127 Dec 13 '17 at 17:01
Why `value_dict`? – EB2127 Dec 13 '17 at 17:08

score 0 · Answer 3 · edited Dec 13 '17 at 17:09

0

You can do this by inner join, and from the result dataframe you can get the start and end, as shown below:

result = pd.merge(key_df, value_df, on='coordinates', how='inner')

edited Dec 13 '17 at 17:09

Andrew Chisholm

6,362
2
22
41

answered Dec 13 '17 at 07:23

Ankur Alankar Biswal

1,184
1
12
23

How to subset a pandas dataframe based on iterating through all row values of another df?

3 Answers3