Because of the way the data is being stored -- in strings of comma-separated
values -- code is required to iterate through the rows, pick apart the values,
iterate through the other DataFrame and pick apart its values, then compare the
two, etc.... I don't see a way to really improve upon this situation as long as
the inputs retain the comma-separated values.
Given the constraints, I think su79eu7k's
answer is quite good.
If, however, you buy into the idea that "tidy data" (PDF) is better -- if you allow us
to change the starting point to be DataFrames in tidy format -- then there is a
different approach which may be more performant especially when sd
has many
rows. The problem with using sd.apply(check, axis=1)
is that under the hood it
uses a Python loop to iterate over the rows of sd
. Calling check
once for
every row can be relatively slow compared to equivalent code that takes
advantage of Panda's faster vectorized methods (like merge
or groupby
).
However, to use merge
and groupby
you need the data to be in tidy format.
So suppose instead of pos
and sd
we start with tidypos
and tidysd
. (At
the end of this post you'll find a runnable example which converts pos
and sd
to their tidy
equivalents.)
In [238]: tidypos
Out[238]:
Position Site Station
0 Contractor A 1
1 Contractor A 2
2 Contractor B 1
3 Contractor B 2
4 President A 0
5 President A 1
6 President A 2
7 President A 3
8 President A 4
In [239]: tidysd
Out[239]:
index Site Station
0 0 A 1
1 0 A 2
2 1 B 1
3 1 B 2
4 3 C 1
5 3 C 2
6 4 A 0
7 4 A 1
8 4 A 2
9 5 A 2
tidypos
and tidysd
contain the same information as pos
and sd
(ignoring Items
since they play no role in this problem.)
The difference mainly is that each row in tidypos
and tidysd
corresponds to one "observation". Each observation is independent of every other.
Essentially, this boils down to simply splitting the comma-separated values so that each value ends up on a separate row.
Now we can join the two DataFrames based on the common columns, Site
and Station
:
In [241]: merged = pd.merge(tidysd, tidypos, how='left'); merged
Out[241]:
index Site Station Position
0 0 A 1 Contractor
1 0 A 1 President
2 0 A 2 Contractor
3 0 A 2 President
4 1 B 1 Contractor
5 1 B 2 Contractor
6 3 C 1 NaN
7 3 C 2 NaN
8 4 A 0 President
9 4 A 1 Contractor
10 4 A 1 President
11 4 A 2 Contractor
12 4 A 2 President
13 5 A 2 Contractor
14 5 A 2 President
Now, each row in merged
represents a match between a row of tidysd
and a row
of tidypos
. Therefore, the presence of a row implies that there was a match
between sd['Site']
and pos['Site']
, and moreover, a match between
tidysd['Station']
and tidypos['Station']
. In other words, for that row,
sd['Station(s)']
must contain a number found in pos['Station()']
. The only
critera we are not yet sure about is whether there are extra numbers in
sd['Station(s)']
that do not appear in pos['Station()']
.
We can find that out by counting the number of rows in merged
for each index
and Position
since each such row corresponds to a different Station
. If this
number equals the total number of possible Station
s for that index
then
sd['Station(s)']
contains no "extra numbers".
We can use groupby/nunique
to count the number of Stations
for each index
and Position
:
In [256]: pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack(); pos_count
Out[256]:
Position Contractor President
index
0 2.0 2.0
1 2.0 NaN
4 2.0 3.0
5 1.0 1.0
and we can count the total number of Station
s for each index
:
In [243]: total_count = tidysd.groupby(['index'])['Station'].nunique(); total_count
Out[243]:
index
0 2
1 2
3 2
4 3
5 1
Name: Station, dtype: int64
So that finally, we can assign 1's and 0's to the Contractor
and President
columns,
based on the criterion (pos_count[col] == total_count)
:
pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
# Position Contractor President
# 0 1 1
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 1
# 5 1 1
If you really wish, you can then concatenate this result to the original sd
to produce the exact desired result:
In [246]: result = pd.concat([sd, pos_count], axis=1); result
Out[246]:
Item 1 Item 2 Site Station(s) Contractor President
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 1 C ,1,2,, 0 0
4 1 1 A 0,1,2,, 0 1
5 0 1 A ,,2,, 1 1
but again, if you buy into the idea that data should be tidy, you should avoid packing multiple rows worth of data into a comma-separated string.
How to tidy-up pos
and sd
:
You can use the vectorized string methods, .str.findall
and .str.split
to
convert the comma-separated strings to lists of values. Then use list
comprehensions to iterate through the rows and lists to build tidypos
and
tidysd
.
Putting it all together,
import itertools as IT
import pandas as pd
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
'Position':['Contractor','President'],
'Site(s)':['A,B','A'],
'Item(s)':['1','1,2']})
sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
'Item 1':[1,1,0,0,1,0],
'Item 2':[1,0,0,1,1,1]})
mypos = pos.copy()
mypos['Station(s)'] = mypos['Station(s)'].str.findall(r'(\d+)')
mypos['Site(s)'] = mypos['Site(s)'].str.split(r',')
tidypos = pd.DataFrame(
[(row['Position'], site, station)
for index, row in mypos.iterrows()
for site, station in IT.product(
*[row[col] for col in ['Site(s)', 'Station(s)']])],
columns=['Position', 'Site', 'Station'])
mysd = sd[['Site', 'Station(s)']].copy()
mysd['Station(s)'] = mysd['Station(s)'].str.findall(r'(\d+)')
tidysd = pd.DataFrame(
[(index, row['Site'], station)
for index, row in mysd.iterrows()
for station in row['Station(s)']],
columns=['index', 'Site', 'Station'])
merged = pd.merge(tidysd, tidypos, how='left')
pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack()
total_count = tidysd.groupby(['index'])['Station'].nunique()
pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
result = pd.concat([sd, pos_count], axis=1)
print(result)
yields
Item 1 Item 2 Site Station(s) Contractor President
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 1 C ,1,2,, 0 0
4 1 1 A 0,1,2,, 0 1
5 0 1 A ,,2,, 1 1