I am trying to remove some strings from a list when the string starts with or contains "@", "#", "http" or "rt". A sample list is below.

text_words1 = ['@football', 'haberci', '#sorumlubenim', 'dedigin', 'tarafsiz', 'olurrt', '@football', 'saysaniz', 'olur', '#sorumlubenim', 'korkakligin', 'sonu']

From the list above, I want to remove '@football' and '#sorumlubenim'. I tried the code below.

 i = 0
 while i < len(text_words1):
     if text_words1[i].startswith('@'):
         del text_words1[i] 
     if text_words1[i].startswith('#'):
         del text_words1[i] 
     i = i+1
 print 'The updated list is: \n', text_words1  

However, the code above only removed some strings, not all of the ones which start with "@" or "#" symbols.

Not all strings of interest start with "@", "#" or "http"; some only contain them. So I then added the code below to the code above.

 while i < len(text_words1):
     if text_words1[i].__contains__('@'):
         del text_words1[i] 
     if text_words1[i].__contains__('#'):
         del text_words1[i]
     if text_words1[i].__contains__('http'):
         del text_words1[i]
     i = i+1
 print 'The updated list: \n', text_words1  

The above code removed some of the items which contain "#" or "@", but not all of them.

Can someone advise me how to remove all items which start with or contain "@", "#", "http", or "rt"?

GreenMatt
Behzat
  • which ones were not removed? – Stewart Jun 03 '15 at 18:44
  • skip the `i = i + 1` when you `del text_words1[i]` in one of your if clauses, because deleting will move the index of the next string to the position of the deleted word. Best to use an `if - elif - elif - else`-structure for this with `i = i + 1` in the `else` condition – avk Jun 03 '15 at 18:44
  • Please use `x in y` instead of `y.__contains__(x)` – Kevin Jun 03 '15 at 18:46
  • Yes, as mentioned above, don't change a list while iterating over it. Please read this for a better understanding: http://stackoverflow.com/questions/1637807/modifying-list-while-iterating – James Sapam Jun 03 '15 at 18:48
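
The fix avk describes in the comments could look roughly like the sketch below. It only covers the "starts with" case from the first snippet, and it uses `print(...)` with parentheses so that it runs on both Python 2 and Python 3 (an assumption on my part, since the question uses the Python 2 print statement).

```python
text_words1 = ['@football', 'haberci', '#sorumlubenim', 'dedigin', 'tarafsiz',
               'olurrt', '@football', 'saysaniz', 'olur', '#sorumlubenim',
               'korkakligin', 'sonu']

i = 0
while i < len(text_words1):
    word = text_words1[i]
    if word.startswith('@'):
        # The next item slides into position i, so do not advance.
        del text_words1[i]
    elif word.startswith('#'):
        del text_words1[i]
    else:
        # Advance only when nothing was deleted at this position.
        i = i + 1

print(text_words1)
```

Because `i` only moves forward when nothing was deleted, every item gets examined exactly once.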

2 Answers

As the comments point out, deleting items while you loop shifts the indexes of the remaining items, so your loop never visits the whole list. You can instead use a list comprehension that keeps only the words you need. Note that startswith accepts a tuple of prefixes:

new_list = [i for i in text_words1 if not i.startswith(('@', '#'))]
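
The comprehension above only handles prefixes. If the words should also be dropped when they merely contain one of the substrings, something like the following sketch might work. The `unwanted` tuple is a name I made up, and its contents are taken from the question.

```python
text_words1 = ['@football', 'haberci', '#sorumlubenim', 'dedigin', 'tarafsiz',
               'olurrt', '@football', 'saysaniz', 'olur', '#sorumlubenim',
               'korkakligin', 'sonu']

# Hypothetical name; the substrings come from the question.
unwanted = ('@', '#', 'http', 'rt')

# Keep a word only if none of the unwanted substrings occurs anywhere in it.
new_list = [w for w in text_words1 if not any(s in w for s in unwanted)]

print(new_list)
```

Since `in` checks the whole string, this covers the "starts with" case as well. Be aware that 'rt' will also match inside ordinary words such as 'olurrt'.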
Daniel

Here is my solution:

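import re
text_words1 = ['@football', 'haberci', '#sorumlubenim', 'dedigin', 'tarafsiz', 'olurrt', '@football', 'saysaniz', 'olur', '#sorumlubenim', 'korkakligin', 'sonu']
# Walk the list backwards so that deleting an item never shifts the ones still to be checked.
for i, word in reversed(list(enumerate(text_words1))):
    # re.search matches the pattern anywhere in the word.
    if re.search('(@|#|http|rt)', word):
        del text_words1[i]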

With a list comprehension:

text_words1 = [w for w in text_words1 if not re.search('(@|#|http|rt)', w)]

Note that I'm using re.search because it checks for a match anywhere in the string, whereas re.match checks for a match only at the beginning of the string. This is important because you want to remove words that begin with and/or contain those characters.
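
To illustrate the difference, here is a small sketch using the same pattern on two words from the sample list. The variable names are only for illustration.

```python
import re

pattern = '(@|#|http|rt)'

# 'olurrt' contains 'rt' but does not start with it.
match_at_start = re.match(pattern, 'olurrt')      # None: nothing matches at position 0
search_anywhere = re.search(pattern, 'olurrt')    # match object: 'rt' is found later in the word

# '@football' starts with '@', so both functions find it.
both_agree = re.match(pattern, '@football') is not None

print(match_at_start is None)
print(search_anywhere.group(0))
print(both_agree)
```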

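The problem with your code snippet is that you're removing items while iterating. Each del shifts the remaining items one position to the left, but the loop still increments i afterwards, so the item that moved into position i is never examined. Checking len(text_words1) on every pass does not prevent this. Add a print statement to your while loop and you will see what I mean.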
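
A stripped-down sketch of that effect, using a made-up three-item list and a single check rather than your full loop, records which items the loop actually looks at:

```python
words = ['@a', '#b', 'c']   # hypothetical data, chosen to keep the trace short
visited = []                # every item the loop actually examines

i = 0
while i < len(words):
    visited.append(words[i])
    if words[i].startswith('@'):
        del words[i]        # '#b' now sits at index 0 ...
    i = i + 1               # ... but i moves on to 1, so '#b' is skipped

print(visited)
print(words)
```

Here '#b' never shows up in `visited`, which is the same skipping you see with your real list.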

fenceop