0

Objective: I have a list of 200 elements (URLs) and I would like to check whether each one appears in a specific column of a DataFrame. If it does, I would like to remove that element from the list.

Problem: I tried a similar approach, building a new list from the elements that are not in the column, but it ends up adding all of them.

pruned = []
for element in list1:
    if element not in transfer_history['Link']:
        pruned.append(element)

I have also tried removing the elements from the list directly, as described above, without success. I think it's something simple, but I can't find the key.

for element in list1:
    if element in transfer_history['Link']:
        list1.remove(element)
  • Can you make a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) for this? That is, can you include a smaller list and a simpler DataFrame for testing? – BeRT2me Aug 28 '22 at 19:18

3 Answers

2

When you use `in` with a pandas Series, you are searching the index, not the values. To get around this, convert the column to a list using `transfer_history['Link'].tolist()`, or better, convert it to a set, which makes each membership check faster.

links = set(transfer_history["Link"])
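To see the index-versus-values behaviour concretely, here is a minimal sketch. The DataFrame contents and URLs are made-up placeholders, not data from the question, and it assumes the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame (default index 0, 1).
transfer_history = pd.DataFrame(
    {"Link": ["https://example.com/a", "https://example.com/b"]}
)

url = "https://example.com/a"

# `in` on a Series tests the index labels (0 and 1 here), not the values.
in_series = url in transfer_history["Link"]
label_in_series = 0 in transfer_history["Link"]

# Converting the column to a set tests the values instead.
links = set(transfer_history["Link"])
in_links = url in links

print(in_series, label_in_series, in_links)
```

This is why the loop in the question appended every element: none of the URLs is an index label, so `not in` was true each time.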

A good way to filter the list is like this:

pruned = [element for element in list1 if element not in links]

Don't remove elements from the list while iterating over it: the loop can skip elements and give unexpected results.
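As a small illustration of that pitfall, the sketch below uses short placeholder strings instead of real URLs and a hand-written `links` set in place of the DataFrame column. The name `broken` is only for illustration.

```python
list1 = ["a", "b", "c", "d"]
links = {"a", "b"}  # assumed values found in the column

# Removing while iterating: after "a" is removed, "b" shifts into
# position 0 and the loop moves on to position 1, so "b" is never checked.
broken = list(list1)
for element in broken:
    if element in links:
        broken.remove(element)

# Building a new list checks every element exactly once.
pruned = [element for element in list1 if element not in links]

print(broken)
print(pruned)
```

Here `broken` still contains "b" even though it is in `links`, while `pruned` holds only the elements that are genuinely absent.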

Stuart
1

Remember that transfer_history['Link'] refers to the entire column. To get an individual item you need to index into it again, as in transfer_history['Link'][x], and use a for loop to go through each item in the column.

A much easier way is to check, in one line, whether the item is in a list built from the entire column:

pruned = []
for element in list1:
    if element not in [link for link in transfer_history['Link']]:
        pruned.append(element)
cap1hunna
  • The first option worked. Thanks so much! – Carlos Lozano Aug 28 '22 at 19:26
  • No problem! I've updated my answer for you. Hope this makes it clearer to understand. – cap1hunna Aug 28 '22 at 19:30
  • "You're essentially checking if one url is equal to the entire column." - this isn't the problem. Also, don't use the 2nd option which removes elements from `list1` while iterating over it. It will often have unexpected results. – Stuart Aug 28 '22 at 19:38
  • @Stuart, you are right. What I meant was that the element needs to be compared against each cell of the column, rather than against the column as a whole. Edited and removed. – cap1hunna Aug 28 '22 at 19:42
0

If the order of the urls doesn't matter, this can be simplified a lot using sets:

list1 = list(set(list1) - set(transfer_history['Link']))
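One thing worth noting about the set approach: besides losing the original order, it also collapses duplicate entries in the list. A rough sketch with placeholder strings (the `links` set stands in for `set(transfer_history['Link'])`, and the variable names are illustrative):

```python
list1 = ["u3", "u1", "u2", "u1"]
links = {"u2"}  # assumed values found in the column

# Set difference: order is arbitrary and the repeated "u1" appears once.
deduped = list(set(list1) - links)

# List comprehension: keeps the original order and any duplicates.
ordered = [url for url in list1 if url not in links]

print(sorted(deduped))
print(ordered)
```

If the list is known to contain unique URLs and order is irrelevant, the two give the same elements.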
BeRT2me