
Given the following data frames:

import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']
                   })

pos = pos[['Position','Site(s)','Station(s)','Item(s)']]

pos

    Position    Site(s)  Station(s)  Item(s)
0   Contractor  A,B      ,1,2,,      1
1   President   A        0,1,2,3,4   1,2

and

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})
sd = sd[['Site','Station(s)','Item 1','Item 2']]

sd

    Site    Station(s)  Item 1  Item 2
0   A        ,1,2,,        1    1
1   B        ,1,2,,        1    0
2   B        ,,,,          0    0
3   C        ,1,2,,        0    1
4   A        0,1,2,,       1    1
5   A        ,,2,,         0    1

I'd like to end up with this:

    Contractor  President   Site(s)     Station(s)  Item 1  Item 2
0      1           1           A         ,1,2,,       1     1
1      1           0           B         ,1,2,,       1     0
2      0           0           B         ,,,,         0     0
3      0           0           C         ,1,2,,       0     1
4      0           1           A         0,1,2,,      1     1
5      1           1           A         ,,2,,        0     1

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Site(s)':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})
results[['Contractor','President','Site(s)','Station(s)','Item 1','Item 2']]

based on this logic:

For each kind of position:

  1. Create a new column in sd with the name of that position.

  2. Make its value equal to 1 for each row in which the following conditions are met (else 0 for other rows):

    a. sd['Site'] matches at least one of the values in pos['Site(s)']

    b. sd['Station(s)'] contains at least one number found in pos['Station(s)'], but NO extra numbers

I started with this but was promptly beaten back into submission:

for i in pos['Position']:
    sd[i]= 1 if lambda x: 'x' if x for x in pos['Site(s)'] if x in sd['Site']

2 Answers


Because the data is stored as strings of comma-separated values, code is required to iterate through the rows, pick apart the values, iterate through the other DataFrame and pick apart its values, then compare the two, and so on. I don't see a way to really improve on this situation as long as the inputs retain the comma-separated values.

Given the constraints, I think su79eu7k's answer is quite good.

If, however, you buy into the idea that "tidy data" (PDF) is better, and you allow us to change the starting point to be DataFrames in tidy format, then there is a different approach which may be more performant, especially when sd has many rows. The problem with using sd.apply(check, axis=1) is that under the hood it uses a Python loop to iterate over the rows of sd. Calling check once for every row can be relatively slow compared to equivalent code that takes advantage of pandas' faster vectorized methods (like merge or groupby). However, to use merge and groupby you need the data to be in tidy format.

So suppose instead of pos and sd we start with tidypos and tidysd. (At the end of this post you'll find a runnable example which converts pos and sd to their tidy equivalents.)

In [238]: tidypos
Out[238]: 
     Position Site Station
0  Contractor    A       1
1  Contractor    A       2
2  Contractor    B       1
3  Contractor    B       2
4   President    A       0
5   President    A       1
6   President    A       2
7   President    A       3
8   President    A       4

In [239]: tidysd
Out[239]: 
   index Site Station
0      0    A       1
1      0    A       2
2      1    B       1
3      1    B       2
4      3    C       1
5      3    C       2
6      4    A       0
7      4    A       1
8      4    A       2
9      5    A       2

tidypos and tidysd contain the same information as pos and sd (ignoring the Item columns, since they play no role in this problem). The main difference is that each row in tidypos and tidysd corresponds to one "observation", and each observation is independent of every other. Essentially, this boils down to simply splitting the comma-separated values so that each value ends up on a separate row.
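
As an aside, if your pandas version provides DataFrame.explode (added in 0.25), the splitting step for sd can be written without an explicit Python loop. A minimal sketch, assuming that version is available; tidysd_alt is just an illustrative name:

tidysd_alt = (
    sd[['Site', 'Station(s)']]
    .assign(Station=sd['Station(s)'].str.findall(r'\d+'))  # ',1,2,,' -> ['1', '2']
    .drop(columns='Station(s)')
    .explode('Station')            # one row per station; empty lists become NaN
    .dropna(subset=['Station'])    # drop rows that had no stations at all
    .rename_axis('index')
    .reset_index()                 # keep the original sd row number as a column
)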

Now we can join the two DataFrames based on the common columns, Site and Station:

In [241]: merged = pd.merge(tidysd, tidypos, how='left'); merged
Out[241]: 
    index Site Station    Position
0       0    A       1  Contractor
1       0    A       1   President
2       0    A       2  Contractor
3       0    A       2   President
4       1    B       1  Contractor
5       1    B       2  Contractor
6       3    C       1         NaN
7       3    C       2         NaN
8       4    A       0   President
9       4    A       1  Contractor
10      4    A       1   President
11      4    A       2  Contractor
12      4    A       2   President
13      5    A       2  Contractor
14      5    A       2   President

Now, each row in merged with a non-null Position represents a match between a row of tidysd and a row of tidypos. Therefore, the presence of such a row implies that there was a match between tidysd['Site'] and tidypos['Site'], and moreover, a match between tidysd['Station'] and tidypos['Station']. In other words, for that row, sd['Station(s)'] must contain a number found in pos['Station(s)']. The only criterion we are not yet sure about is whether there are extra numbers in sd['Station(s)'] that do not appear in pos['Station(s)'].

We can find that out by counting the number of rows in merged for each index and Position since each such row corresponds to a different Station. If this number equals the total number of possible Stations for that index then sd['Station(s)'] contains no "extra numbers".

We can use groupby/nunique to count the number of Stations for each index and Position:

In [256]: pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack(); pos_count
Out[256]: 
Position  Contractor  President
index                          
0                2.0        2.0
1                2.0        NaN
4                2.0        3.0
5                1.0        1.0

and we can count the total number of Stations for each index:

In [243]: total_count = tidysd.groupby(['index'])['Station'].nunique(); total_count
Out[243]: 
index
0    2
1    2
3    2
4    3
5    1
Name: Station, dtype: int64

Finally, we can assign 1's and 0's to the Contractor and President columns, based on the criterion pos_count[col] == total_count:

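# index 3 produced no matches at all and index 2 has no stations, so neither
# appears in pos_count; the two reindex calls fill those rows with 0.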
pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
    pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
# Position  Contractor  President
# 0                  1          1
# 1                  1          0
# 2                  0          0
# 3                  0          0
# 4                  0          1
# 5                  1          1

If you really wish, you can then concatenate this result to the original sd to produce the exact desired result:

In [246]: result = pd.concat([sd, pos_count], axis=1); result
Out[246]: 
   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1

but again, if you buy into the idea that data should be tidy, you should avoid packing multiple rows' worth of data into a comma-separated string.


How to tidy up pos and sd:

You can use the vectorized string methods .str.findall and .str.split to convert the comma-separated strings to lists of values, then use list comprehensions to iterate through the rows and lists to build tidypos and tidysd.

Putting it all together,

import itertools as IT
import pandas as pd

pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']})

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})

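# Tidy up pos: split the comma-separated Site(s) and Station(s) strings, then
# emit one (Position, Site, Station) row per combination.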
mypos = pos.copy()
mypos['Station(s)'] = mypos['Station(s)'].str.findall(r'(\d+)')
mypos['Site(s)'] = mypos['Site(s)'].str.split(r',')
tidypos = pd.DataFrame(
    [(row['Position'], site, station) 
     for index, row in mypos.iterrows() 
     for site, station in IT.product(
             *[row[col] for col in ['Site(s)', 'Station(s)']])], 
    columns=['Position', 'Site', 'Station'])

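# Tidy up sd: one (original row index, Site, Station) row per station.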
mysd = sd[['Site', 'Station(s)']].copy()
mysd['Station(s)'] = mysd['Station(s)'].str.findall(r'(\d+)')

tidysd = pd.DataFrame(
    [(index, row['Site'], station)
     for index, row in mysd.iterrows() 
     for station in row['Station(s)']], 
    columns=['index', 'Site', 'Station'])

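# Match each sd observation against the positions' allowed (Site, Station) pairs,
# then keep a position for a row only if every one of that row's stations matched.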
merged = pd.merge(tidysd, tidypos, how='left')
pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack()
total_count = tidysd.groupby(['index'])['Station'].nunique()
pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
    pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
result = pd.concat([sd, pos_count], axis=1)
print(result)

yields

   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1
unutbu
  • Incredible. Thanks again. The tidy data information is much appreciated. Unfortunately, I receive the data in an "untidy" state from the organization that provides the data, but I will certainly relay the information about tidy data to them. – Dance Party May 17 '16 at 00:32

I tried a rough approach; you can improve on the code below.

sd['Contractor'] = 0
sd['President'] = 0

def check(x):
    for p in pos['Position'].tolist():
        # Condition a: the row's Site must be one of this position's Site(s).
        if x['Site'] in pos.set_index('Position').loc[p, 'Site(s)'].split(','):
            # Split the station strings and drop the empty entries.
            ss = pd.Series(x['Station(s)'].split(',')).replace('', np.nan).dropna()
            ps = pd.Series(pos.set_index('Position').loc[p, 'Station(s)'].split(',')).replace('', np.nan).dropna()
            # Condition b: at least one station, and no station outside this position's list.
            if not ss.empty and ss.isin(ps).all():
                x[p] = 1

    return x

print(sd.apply(check, axis=1))


   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1
su79eu7k
  • For better performance, `ps` could be lifted outside of `check` (made a global or default parameter of `check`) so that it is computed only once instead of each time `check` is called. – unutbu May 17 '16 at 00:27
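
A minimal sketch of that suggestion (the helper dicts pos_sites and pos_stations are just illustrative names; the allowed sites and stations are computed once and passed as default arguments instead of being re-split on every call):

sd['Contractor'] = 0
sd['President'] = 0

# Precompute each position's allowed sites and stations once.
pos_sites = {p: set(s.split(','))
             for p, s in zip(pos['Position'], pos['Site(s)'])}
pos_stations = {p: set(s.split(',')) - {''}
                for p, s in zip(pos['Position'], pos['Station(s)'])}

def check(x, sites=pos_sites, stations=pos_stations):
    ss = set(x['Station(s)'].split(',')) - {''}   # the row's stations, minus empties
    for p in stations:
        # Condition a: the Site matches; condition b: at least one station and no extras.
        if x['Site'] in sites[p] and ss and ss <= stations[p]:
            x[p] = 1
    return x

print(sd.apply(check, axis=1))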