separate rows with same values in certain columns

Question

Hi I have dataframe in this form:

    Episode    Number Rating Series
    4 Days Out   2.9    9.1  "Breaking Bad" (2008)
    Buyout       5.6    9.0 "Breaking Bad" (2008)
    Pilot        1.1    9.0 "Breaking Bad" (2008)
    Dog Fight    1.12   9.0 "Suits" (2011)
    We're Done   4.7    9.0 "Suits" (2011)
    Privilege    5.6    8.9 "Suits" (2011)
    Pilot        1.1    8.9 "Suits" (2011)

I would like to create a new column for this dataframe called watched, where I would provide episode numbers (from the 'Number' column) in a list and the apply where method on it so the watched column would have yes or no values.

watchlist=[1.1, 4.7, 2.9]
df['watched'] = np.where(df['Number'].isin(watchlist), 'no', 'yes')

So this would create new column where there is 'no' value in rows with episodes 4.7, 2.9 and 1.1, however the problem is I would like to have the 'no' in one of those 1.1 only, not both. Is there a way to differentiatie somehow those two rows with value '1.1' in column number ? (they have different value in 'Series' column but same one in 'Episode' column.

Hmm didn't think about this in that way. Lets say I can create separate lists for different series — Alex T, Apr 16 '17 at 16:23
Which one should have the 'no'? What is the intended behavior if an element in `watchlist` appears more than twice in `df.Number`? — fuglede, Apr 16 '17 at 16:27
@fuglede so lets say maybe two separate lists for different series? So if there are more than two series there would more lists for each series. — Alex T, Apr 16 '17 at 17:19
@AlexT Please see another much simpler + more efficient (2.5* times faster) way to get the same output. It is much cleaner and with less overhead than the previous method. Let me know what you think :) — Chuck, Apr 18 '17 at 11:48

score 1 · Answer 1 · edited May 23 '17 at 12:18

For a single watchlist

You can use a selected isin with np.where by selecting which series you want to check for and by using a different watchlist for each series. For your dataframe df:

      Episode  Number  Rating               Series
0  4 Days Out    2.90     9.1  Breaking Bad (2008)
1      Buyout    5.60     9.0  Breaking Bad (2008)
2       Pilot    1.10     9.0  Breaking Bad (2008)
3   Dog Fight    1.12     9.0         Suits (2011)
4  We're Done    4.70     9.0         Suits (2011)
5   Privilege    5.60     8.9         Suits (2011)
6       Pilot    1.10     8.9         Suits (2011)

and watchlist:

[1.1, 4.7, 2.9]

Assume that watchlist is just for Breaking Bad. Use np.where to apply function only to rows that match Breaking Bad (2008) and then use isin to see if the value in the column Rating is in your watchlist:

df['Breaking Bad Watched'] = df['Number'][np.where(df['Series'] == "Breaking Bad (2008)")[0]].isin(watchlist)

Gives:

      Episode  Number  Rating               Series Breaking Bad Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)                 True
1      Buyout    5.60     9.0  Breaking Bad (2008)                False
2       Pilot    1.10     9.0  Breaking Bad (2008)                 True
3   Dog Fight    1.12     9.0         Suits (2011)                  NaN
4  We're Done    4.70     9.0         Suits (2011)                  NaN
5   Privilege    5.60     8.9         Suits (2011)                  NaN
6       Pilot    1.10     8.9         Suits (2011)                  NaN

Then use map to convert from true / false to yes / no:

d = {True: 'Yes', False: 'No'}
df['Breaking Bad Watched'] = df['Breaking Bad Watched'].map(d)

      Episode  Number  Rating               Series Breaking Bad Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)                  Yes
1      Buyout    5.60     9.0  Breaking Bad (2008)                   No
2       Pilot    1.10     9.0  Breaking Bad (2008)                  Yes
3   Dog Fight    1.12     9.0         Suits (2011)                  NaN
4  We're Done    4.70     9.0         Suits (2011)                  NaN
5   Privilege    5.60     8.9         Suits (2011)                  NaN
6       Pilot    1.10     8.9         Suits (2011)                  NaN

------------------------ For A dictionary of Watchlists --------------------

If you have a dictionary of watchlists where the series and episode number is specified separately:

watchlist = {'Breaking Bad (2008)': [1.1, 4.7, 2.9], 'Suits (2011)': [4.7, 5.6]}

You can interate over it as follows:

# Save name of new columns into new_col_list
new_col_list = []

for series, wlist in watchlist.iteritems():
    # Save names of new columns into new_col_list
    new_col_list.append('{} Watched'.format(series))
    # Do calculation
    print series, wlist
    df['{} Watched'.format(series)] = df['Number'][np.where(df['Series'] == series)[0]].isin(wlist)

This gives you:

      Episode  Number  Rating               Series  \
0  4 Days Out    2.90     9.1  Breaking Bad (2008)   
1      Buyout    5.60     9.0  Breaking Bad (2008)   
2       Pilot    1.10     9.0  Breaking Bad (2008)   
3   Dog Fight    1.12     9.0         Suits (2011)   
4  We're Done    4.70     9.0         Suits (2011)   
5   Privilege    5.60     8.9         Suits (2011)   
6       Pilot    1.10     8.9         Suits (2011)   

  Breaking Bad (2008) Watched Suits (2011) Watched  
0                        True                  NaN  
1                       False                  NaN  
2                        True                  NaN  
3                         NaN                False  
4                         NaN                 True  
5                         NaN                 True  
6                         NaN                False  

new_col_list = ['Breaking Bad (2008) Watched', 'Suits (2011) Watched']

[1]If have only a few names then manually write them: Then use pd.concatenate to concatenate the two watch columns, and drop those columns:

df['Watched'] = pd.concat([df['Breaking Bad (2008) Watched'].dropna(), df['Suits (2011) Watched'].dropna()])
# Remove old Columns
df.drop(['Breaking Bad (2008) Watched','Suits (2011) Watched'], axis=1, inplace=True)

[2] If have a list of columns names then add list of names to pd.concat using a simple list comprehension, iterating over column names in new_col_list:

df['Watched'] = pd.concat([df['{}'.format(i)].dropna() for i in new_col_list])
# Remove old Name Columns
df.drop(new_col_list, axis=1, inplace=True)

# Convert True False to Yes No
d = {True: 'Yes', False: 'No'}
df['Watched'] = df['Watched'].map(d)
# Final Output:
df:
      Episode  Number  Rating               Series Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)     Yes
1      Buyout    5.60     9.0  Breaking Bad (2008)      No
2       Pilot    1.10     9.0  Breaking Bad (2008)     Yes
3   Dog Fight    1.12     9.0         Suits (2011)      No
4  We're Done    4.70     9.0         Suits (2011)     Yes
5   Privilege    5.60     8.9         Suits (2011)     Yes
6       Pilot    1.10     8.9         Suits (2011)      No

Sources

Source for isin:

[1] How to check if a value is in the list in selection from pandas data frame? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html

Source for concat:

[2] https://stackoverflow.com/a/10972557/2254228

Source for map:

[3] Convert Pandas series containing string to boolean

Somehow, in a bigger dataframe (more rows), there are some rows left with a NaN. Do you know what might be a case? — Alex T, Apr 16 '17 at 18:37
Which method are you using, and can you provide the sample values for one of these rows? — Chuck, Apr 16 '17 at 18:38
Hmm now I reran the code and Im getting NaN values for the series I didn't input into dict. So the best way is just to replace all the NaN as 'Yes' or 'No', right? — Alex T, Apr 16 '17 at 18:49
So, with the dictionary code, you need to include in the dictionary **all Series in the dataframe** even if this is an empty array (i.e. for the case where you haven't watched any of the episodes for a certain series, otherwise you **will** be left with nan values in the final array. Does that make sense? i..e your dictionary might look like `watchlist = {'Breaking Bad (2008)': [1.1, 4.7, 2.9], 'Suits (2011)': [4.7, 5.6],'House (2008)':[],'Seinfeld':[]}` — Chuck, Apr 16 '17 at 19:11
So in case I have many differen series (like 20+), do I have to type all of their names manually when doing concatenating and dropping columns, or do you think there is a good way to automate that? — Alex T, Apr 17 '17 at 15:39
For the concatenation, save the names of the `XX (200XX) Watched` columns into a list. Then use this list for your column names in the concatenation and drop part. (See my edit for clarification) — Chuck, Apr 17 '17 at 16:29

Chuck · Accepted Answer · 2017-04-18T11:49:41.763

There is a simple and more efficient (2.5* quicker than current answer) way to achieve this. For your dataframe df and dictionary of watchlists watchlist, you can use df.loc with multiple conditions.

First, create placeholder column:

df['Watched'] = 'No'

      Episode  Number  Rating               Series Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)      No
1      Buyout    5.60     9.0  Breaking Bad (2008)      No
2       Pilot    1.10     9.0  Breaking Bad (2008)      No
3   Dog Fight    1.12     9.0         Suits (2011)      No
4  We're Done    4.70     9.0         Suits (2011)      No
5   Privilege    5.60     8.9         Suits (2011)      No
6       Pilot    1.10     8.9         Suits (2011)      No

Then iterate over watchlist:

for key, values in watchlist.iteritems():
    df.loc[(df['Number'].isin(values)) & (df['Series'] == key), 'Watched'] = 'yes'

This gives df:

      Episode  Number  Rating               Series Watched
0  4 Days Out    2.90     9.1  Breaking Bad (2008)     yes
1      Buyout    5.60     9.0  Breaking Bad (2008)      No
2       Pilot    1.10     9.0  Breaking Bad (2008)     yes
3   Dog Fight    1.12     9.0         Suits (2011)      No
4  We're Done    4.70     9.0         Suits (2011)     yes
5   Privilege    5.60     8.9         Suits (2011)     yes
6       Pilot    1.10     8.9         Suits (2011)      No

No need for extra columns / concatenation or dropping columns.

Total time this answer = 0.00800013542175 s
Total time accepted answer = 2.624944121596675 s

this means i can provide watchlist list once and dont care about dropping creating columns, seems much better to me, I was thinking about similar way to do this, thanks! — Alex T, Apr 19 '17 at 08:01
@AlexT You are welcome. I always find that with Pandas, there is always a simpler way - it's just having (or in this case finding) the knowledge. — Chuck, Apr 19 '17 at 08:09

separate rows with same values in certain columns

2 Answers2