How to remove non-words from list in python

Question

I am running a function over my list that includes a dictionary look-up, so I need to remove all non-dictionary words because I'm getting a key error if I don't. I can't just use "continue" because I'm not doing this in a loop. I don't think I have very many so I can do it one by one if I have to (although I would prefer not to). The objects in the list are all in unicode, which has been making it more difficult to remove them.

My list looks like this:

my_list:
[[u'stuff',
  u'going',
  u'moment',
  u'mj',
  u've',
  u'started',
  u'listening',
  u'music'

etc...

or, if I call it like this I get a single bracket:

my_list[0]:
[u'stuff',
 u'going',
 u'moment',
 u'mj',
 u've',
 u'started',
 u'listening',
 u'music',

etc...

I've tried things like:

my_list.remove("mj")

and

my_list.remove("u'mj'")

and

my_list.remove[0,3]

Any ideas? Thanks

Edit: Response to Kevin: Here's how I got the data the way it is

my_list = []
for review in train["review"]:
    my_list.append(review_to_wordlist(review, remove_stopwords=True))

and the function is here:

# imports assumed; they are not shown in the original post
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    # remove HTML
    review_text = BeautifulSoup(review).get_text()

    # remove non-letters
    # possibly update this later to include numbers?
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # convert words to lower case and split
    words = review_text.lower().split()

    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    return words

How did you get the data into that structure in the first place? Is there a reason it's not already in a `set()` or dictionary? — Kevin, Jan 26 '15 at 03:14
You probably want to be using `my_list.extend()` instead of `my_list.append()`. — Kevin, Jan 26 '15 at 03:22
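
To illustrate the difference Kevin is pointing at, here is a minimal sketch using made-up word lists in place of the output of review_to_wordlist (the data and the names reviews_as_words, nested and flat are assumptions for illustration). Whether the flat form is actually what you want depends on whether you need to keep the words of each review separate.

```python
# Hypothetical stand-in for the per-review word lists.
reviews_as_words = [['stuff', 'going'], ['mj', 'music']]

nested = []
flat = []
for words in reviews_as_words:
    nested.append(words)  # adds the whole list as a single element
    flat.extend(words)    # adds each word as its own element

print(nested)  # a list of lists, one inner list per review
print(flat)    # a single list of words
```

With append, my_list[0] is a list of words, which is why calling remove on my_list itself does not find 'mj'.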

score 1 · Accepted Answer · answered Jan 26 '15 at 03:16

You are close. The problem isn't the unicode; it's that you are calling remove on your outer list. Your words are in a list inside that list, so the inner list is what you need to remove from.

Do this instead:

my_list[0].remove('mj')

You can also prefix that to be a unicode string (same result in this case):

my_list[0].remove(u'mj')

Example:

my_list = [[u'stuff',
  u'going',
  u'moment',
  u'mj',
  u've',
  u'started',
  u'listening',
  u'music'
  ]]
my_list[0].remove('mj')

print my_list

Outputs:

[[u'stuff', u'going', u'moment', u've', u'started', u'listening', u'music']]

Notice that the string mj is removed.
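
Note that remove only deletes the first occurrence and raises ValueError if the word is not present, so removing many words one by one gets tedious. A minimal sketch of filtering every inner list in one pass instead, assuming a hypothetical lookup dictionary called word_scores (the question does not show the real one) whose keys are the valid words:

```python
# Hypothetical lookup table; the real dictionary is not shown in the question.
word_scores = {u'stuff': 1, u'going': 2, u'moment': 3,
               u'started': 4, u'listening': 5, u'music': 6}

my_list = [[u'stuff', u'going', u'moment', u'mj', u've',
            u'started', u'listening', u'music']]

# Keep only the words that are keys in the dictionary.
# Each inner list is rebuilt rather than mutated in place.
cleaned = [[w for w in words if w in word_scores] for words in my_list]

print(cleaned)
```

This builds new lists and leaves my_list untouched, which avoids the pitfalls of removing items from a list while iterating over it.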

score 1 · Answer 2 · Vedaad Shakib · 2015-01-26T05:26:23.820

You mentioned that you are using the list items for a dictionary lookup.

To avoid the key error, check that each item is a key in your dictionary before looking it up (my_dict below stands for your dictionary):

if list_item in my_dict:
    # do your lookup
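
A minimal sketch of guarding the lookup, using the in operator (which works on both Python 2 and 3, whereas has_key was removed in Python 3) and dict.get as an alternative. The dictionary word_scores and the list words are made up for illustration:

```python
# Hypothetical dictionary and word list, for illustration only.
word_scores = {'stuff': 1, 'music': 6}
words = ['stuff', 'mj', 'music']

# Option 1: test membership before indexing, skipping unknown words.
found = [word_scores[w] for w in words if w in word_scores]

# Option 2: dict.get returns a default (None here) instead of raising KeyError.
with_default = [word_scores.get(w) for w in words]

print(found)
print(with_default)
```

Option 1 drops unknown words from the result, while option 2 keeps one entry per word, so the choice depends on whether positions need to line up with the original list.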

edited Jan 26 '15 at 05:26

answered Jan 26 '15 at 03:24

Vedaad Shakib


[Don't catch empty exceptions](http://stackoverflow.com/questions/21553327/why-is-except-pass-a-bad-programming-practice) – Andy Jan 26 '15 at 03:28
@Andy Noted and fixed – Vedaad Shakib Jan 26 '15 at 05:26
