Find rows that are not in a set of values (similar to SQL Except)

Question

What I'm trying to do is delete several rows of an Excel-Files (with pandas) and then save the File without those rows to .xlsx (with pyexcelerate module).

I'm aware that I can remove rows of a data frame by dropping them (I already got that to work). But I have read in several posts that when there are many (in my case > 5000) rows that should be deleted it's much faster to just get the indexes of the "to delete" rows from the data frame and then slice the data frame (just as a SQL Except statement for example would do). Unfortunately I can't get it to work, even though I've tried several methods.

Here are my "source posts":

Slice Pandas dataframe by labels that are not in a list - Answer from User ASGM

How to drop a list of rows from Pandas dataframe? - Answer from User Dennis Golomazov

And here is a part of the function, that should delete the rows and save the created file:

for index, cell in enumerate(wb_in[header_xlsx]):
    if str(cell) in delete_set:
        set_to_delete.append(index)
        print str(cell) + " deleted from set: " + str(len(set_to_delete))
wb_out = Workbook()
data_out = wb_in.loc[set(wb_in.index) - set(set_to_delete)]
ws_out = wb_out.new_sheet('Main', data=data_out)
wb_out.save(file_path + filename + "_2.xlsx")

Here is an example of the data frame:

               sku  product_group                      name  \
0  ABCDb00610-23.0           ABA1        Anti
1  ABCDb00610-10.0           ABA1        Anti
2   ABCDb00610-1.1           ABA1         Anti
3  ABCDb00609-23.0           ABA1         Anti
4  ABCDb00609-10.0           ABA1         Anti
5   ABCDb00609-1.1           ABA1         Anti
6  ABCDb00608-23.0           ABA1         Anti
7  ABCDb00608-10.0           ABA1         Anti
8   ABCDb00608-3.3           ABA1         Anti
9   ABCDb00608-3.0           ABA1         Anti

Delete_set is a set that contains only skus (e.g.: ABCDb00608-3.3 or ABCDb00609-1.1).

Btw: I have tried many solution suggestions!

Thanks in advance!

Please show two things: 1. Your dataframe 2. Your "delete_set". What does the set contain? Index? Values? This is neither clear nor helpful. — cs95, Oct 02 '17 at 14:21
The dataframe contains 3 columns: sku | product_group | name . And the 'set_to_delete' contains the indexes, that have been found by the for-loop — fmedv, Oct 02 '17 at 14:24
You haven't answered my question completely. It would be nice to see actual data in text form. — cs95, Oct 02 '17 at 14:24
Unfortunately I couldn't find a way to add a Table to my post. What would be the best way for you? Attach a text-file? — fmedv, Oct 02 '17 at 14:25
You could just `print(df.head(10))` into terminal and paste the output here. Also, I asked about "delete_set", not "set_to_delete". — cs95, Oct 02 '17 at 14:26

score 1 · Accepted Answer · answered Oct 02 '17 at 14:37

1

Use pd.Series.isin:

df = df[~df.sku.isin(delete_set)]

print(df)
               sku product_group                   name
0  ABAAb00610-23.0          ABA1  Anti-Involucrin [SY5]
1  ABAAb00610-10.0          ABA1  Anti-Involucrin [SY5]
2   ABAAb00610-1.1          ABA1      Anti-EpCAM [AUA1]
3  ABAAb00609-23.0          ABA1      Anti-EpCAM [AUA1]
4  ABAAb00609-10.0          ABA1      Anti-EpCAM [AUA1]
5   ABAAb00609-1.1          ABA1      Anti-EpCAM [AUA1]
6  ABAAb00608-23.0          ABA1      Anti-EpCAM [AUA1]
7  ABAAb00608-10.0          ABA1      Anti-EpCAM [AUA1]
8   ABAAb00608-3.3          ABA1      Anti-EpCAM [AUA1]
9   ABAAb00608-3.0          ABA1      Anti-EpCAM [AUA1]

print(delete_set)
('ABAAb00608-3.3', 'ABAAb00609-1.1')

m = ~df.sku.isin(delete_set)
print(m) 
0     True
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8    False
9     True
Name: sku, dtype: bool

print(df[m])
               sku product_group                   name
0  ABAAb00610-23.0          ABA1  Anti-Involucrin [SY5]
1  ABAAb00610-10.0          ABA1  Anti-Involucrin [SY5]
2   ABAAb00610-1.1          ABA1      Anti-EpCAM [AUA1]
3  ABAAb00609-23.0          ABA1      Anti-EpCAM [AUA1]
4  ABAAb00609-10.0          ABA1      Anti-EpCAM [AUA1]
6  ABAAb00608-23.0          ABA1      Anti-EpCAM [AUA1]
7  ABAAb00608-10.0          ABA1      Anti-EpCAM [AUA1]
9   ABAAb00608-3.0          ABA1      Anti-EpCAM [AUA1]

answered Oct 02 '17 at 14:37

cs95

379,657
97
704
746

this is my output-file: https://ufile.io/sqn3b (.xlsx - no macros!). it only contains the headings with each character in one column. I don't get what the problem is – fmedv Oct 02 '17 at 14:54
Use `df.to_excel('out.xlsx', sheet_name='Sheet1')`. Don't write directly. – cs95 Oct 02 '17 at 14:55
pandas .to_excel function takes centuries to write my data to file. That's why I have decided to use pyexcelerate to write to file. And up until now (until I've decided not to just drop the rows) it worked perfectly. Can you explain what the difference is? – fmedv Oct 02 '17 at 14:59
@Frame I personally don't have much experience with using these tools. But the problem could be your delimiters? – cs95 Oct 02 '17 at 15:00
it works with df.to_excel! But i still can't imagine why it wouldn't work with pyexcelerate's function. What delimiters do you mean? – fmedv Oct 02 '17 at 15:08
@Frame I'm not sure! You should open a new question, someone more skilled might be able to help :-) – cs95 Oct 02 '17 at 15:14

Find rows that are not in a set of values (similar to SQL Except)

1 Answers1