0

I have 2 csv files. Each contains a data set with multiple columns and an ASSET_ID column. I used pandas to read each csv file in as a df1 and df2. My problem has been trying to define a function to iterate over the ASSET_ID value in df1 and compare each value against all the ASSET_ID values in df2. From there I want to return all the corresponding rows from df1's ASSET_ID's that matched df2. Any help would be appreciated I've been working on this for hours with little to show for it. dtypes are float or int.

My configuration = windows xp, python 2.7 anaconda distribution

BCR
  • 960
  • 11
  • 27

1 Answers1

2

Create a boolean mask of the values will index the rows where the 2 df's match, no need to iterate and much faster. Example:

# define a list of values
a = list('abcdef')
b = range(6)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
# c has x values for 'a' and 'd' so these should not match
c = list('xbcxef')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(b)})
print(df)
print(df1)

   X  Y
0  a  0
1  b  1
2  c  2
3  d  3
4  e  4
5  f  5

[6 rows x 2 columns]
   X  Y
0  x  0
1  b  1
2  c  2
3  x  3
4  e  4
5  f  5

[6 rows x 2 columns]

In [4]:  

# now index your df using boolean condition on the values
df[df.X == df1.X]

Out[4]:

   X  Y
1  b  1
2  c  2
4  e  4
5  f  5

[4 rows x 2 columns]

EDIT:

So if you have different length series then that won't work, in which case you can use isin:

So create 2 dataframes of different lengths:

a = list('abcdef')

b = range(6)

d = range(10)

df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})





c = list('xbcxefxghi')

df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(d)})

print(df)

print(df1)

   X  Y
0  a  0
1  b  1
2  c  2
3  d  3
4  e  4
5  f  5

[6 rows x 2 columns]
   X  Y
0  x  0
1  b  1
2  c  2
3  x  3
4  e  4
5  f  5
6  x  6
7  g  7
8  h  8
9  i  9

[10 rows x 2 columns]

Now use isin to select rows from df1 where the id's exist in df:

In [7]:

df1[df1.X.isin(df.X)]

Out[7]:

   X  Y
1  b  1
2  c  2
4  e  4
5  f  5

[4 rows x 2 columns]
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Busy now but will try your solution later this evening. Your prose gives me positive encouragement that there is a simple solution. – BCR Jan 31 '14 at 22:34
  • Hey I like your solution but it won't work in my case because the series are of unequal length. df1.index = 2709, df2.index = 117000. Is there another way to make the comparison? Again I want to take one ID from df1 and subset df2 based on each instance where df1.ID == df2.ID – BCR Feb 03 '14 at 18:16