Is it possible to shorten this using list comprehension?

Question

I am processing a large text file and as output I have a list of words:

['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December', ...]

What I want to achieve next is to transform everything to lowercase, remove all the words that belong to a stopset (commonly used words) and remove punctuation. I can do it by doing 3 iterations:

lower=[word.lower() for word in mywords]
removepunc=[word for word in lower if word not in string.punctuation]
final=[word for word in removepunc if word not in stopset]

I tried to use

[word for word in lower if word not in string.punctuation or word not in stopset]

to achieve what the last 2 lines of code are supposed to do, but it's not working. Where is my error, and is there a faster way to achieve this than iterating through the list 3 times?

You were close; you just need an `and` instead of an `or`. – roippi Dec 07 '13 at 22:45
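A minimal sketch of why the `or` version lets everything through, assuming Python 3 and a made-up `stopset` (the sample values are illustrative, not the asker's real data). With `or`, a word is dropped only if it is both punctuation and a stop word at the same time, which practically never happens.

```python
import string

# Sample data, assumed for illustration only.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}
lower = [word.lower() for word in mywords]

# `or`: a word is kept unless it is BOTH punctuation AND a stop word.
with_or = [w for w in lower
           if w not in string.punctuation or w not in stopset]

# `and`: a word is kept only if it is NEITHER punctuation NOR a stop word.
with_and = [w for w in lower
            if w not in string.punctuation and w not in stopset]

print(with_or)   # nothing was filtered out
print(with_and)  # punctuation and stop words are gone
```

By De Morgan's law, "not A or not B" is the same as "not (A and B)", whereas the intended filter is "not (A or B)", which is "not A and not B".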

score 2 · Answer 1 · answered Dec 07 '13 at 22:43

If your code is working as intended, I don't think shortening it is a good idea. As it stands, it is readable and easy to extend with additional processing. One-liners are good for collecting upvotes on SO, but you'll have a hard time understanding their logic some time later.

You can replace the intermediate lists with generators, so that the words are processed in a single pass and no intermediate lists are built:

lower = (word.lower() for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
final = [word for word in removepunc if word not in stopset]
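The laziness of this pipeline can be checked with a small sketch, assuming Python 3. The `lower_logged` helper and the sample data are hypothetical and exist only to record when work actually happens.

```python
import string

# Sample data, assumed for illustration only.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

seen = []  # records each word at the moment it is actually lowercased


def lower_logged(word):
    """Hypothetical helper: lowercase a word and note that it was processed."""
    seen.append(word)
    return word.lower()


lower = (lower_logged(word) for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
processed_before = len(seen)  # building the generators does no work yet

# The final list comprehension pulls each word through both stages in turn.
final = [word for word in removepunc if word not in stopset]
processed_after = len(seen)   # every word was lowercased exactly once

print(processed_before, processed_after)
print(final)
```

Each word travels through all three steps before the next one is read, so only the final list is ever materialized.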

Thanks for the useful comment! I am still learning Python so I am basically trying to work on small examples and apply whichever technique I can think of just to practice. I will keep this in mind. Thanks! — Anastasia, Dec 07 '13 at 22:53

jonrsharpe · Accepted Answer · 2013-12-07T23:08:58.580

1

You can certainly compress the logic:

final = [word for word in map(str.lower, mywords)
         if word not in string.punctuation and word not in stopset]

For example, if I define stopset = ['is'] I get:

 ['today', 'cold', 'outside', '2013', 'december']
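One caveat worth knowing: `word not in string.punctuation` is a substring test against a single string, not a membership test against a collection of characters. The sketch below (Python 3 assumed) shows the difference; `is_punctuation` is a hypothetical helper and the sample tokens are invented for illustration.

```python
import string

# `in` on a str checks for a substring, so multi-character tokens behave oddly.
print('?' in string.punctuation)    # a single punctuation character is found
print('...' in string.punctuation)  # three dots in a row do not occur in the constant
print('' in string.punctuation)     # the empty string is a substring of every string

punct_chars = set(string.punctuation)


def is_punctuation(token):
    """Hypothetical helper: non-empty token made only of punctuation characters."""
    return bool(token) and all(ch in punct_chars for ch in token)


# Sample data, assumed for illustration only.
mywords = ['today', ',', 'is', '...', 'cold', '?!', 'December']
stopset = {'is'}

final = [word for word in map(str.lower, mywords)
         if not is_punctuation(word) and word not in stopset]
print(final)
```

Whether tokens such as `'...'` should count as punctuation depends on how the word list was produced, so treat this as one possible refinement rather than a required change.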

edited Dec 07 '13 at 23:08

answered Dec 07 '13 at 22:41

jonrsharpe

Thanks! That worked. I see where my error was! I will mark this as an answer. It says I can do it in 10min. – Anastasia Dec 07 '13 at 22:43
But it would be nicer not to call `word.lower()` twice – John La Rooy Dec 07 '13 at 22:51

score 0 · Answer 3 · answered Dec 07 '13 at 22:48

Here is the equivalent single list comprehension, although I agree with alko that what you already have is clearer:

final = [lower_word for word in mywords for lower_word in (word.lower(),) if lower_word not in string.punctuation and lower_word not in stopset]
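The one-element tuple is a way to bind a name inside a comprehension so that `word.lower()` is computed only once per word. As a sketch, assuming Python 3.8 or later, an assignment expression (`:=`) can play the same role; the sample data and `stopset` are illustrative.

```python
import string

# Sample data, assumed for illustration only.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

# One-element tuple: the inner `for` binds lower_word exactly once per word.
via_tuple = [lower_word
             for word in mywords
             for lower_word in (word.lower(),)
             if lower_word not in string.punctuation
             and lower_word not in stopset]

# Assignment expression (Python 3.8+): bind lower_word inside the condition.
via_walrus = [lower_word
              for word in mywords
              if (lower_word := word.lower()) not in string.punctuation
              and lower_word not in stopset]

print(via_tuple)
print(via_walrus)
```

Both forms produce the same list; which one reads better is a matter of taste.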

score 0 · Answer 4 · edited May 23 '17 at 12:03

Note that list comprehensions are not the best way to go when it comes to large files, as the entire file will have to be loaded into memory.

Instead, do something like what is described in "Read large text files in Python, line by line without loading it in to memory":

with open("log.txt") as infile:
    for line in infile:
        if clause goes here:
            ....
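Applied to the word-cleaning task, the line-by-line idea might look like the sketch below (Python 3 assumed). `iter_clean_words` is a hypothetical helper, splitting on whitespace is an assumption about how words are tokenized, and the temporary file only stands in for a real input file.

```python
import os
import string
import tempfile


def iter_clean_words(path, stopset):
    """Hypothetical helper: yield cleaned words while reading one line at a time."""
    with open(path) as infile:
        for line in infile:
            for word in line.split():  # assumed tokenization: whitespace
                word = word.lower()
                if word not in string.punctuation and word not in stopset:
                    yield word


# Build a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Today , is cold\noutside 2013 ? December\n")
    path = tmp.name

try:
    words = list(iter_clean_words(path, {'is'}))
finally:
    os.remove(path)

print(words)
```

Because the helper is a generator, a caller that only counts or streams the words never needs to hold the whole file in memory.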

edited May 23 '17 at 12:03

Community

answered Dec 07 '13 at 22:51

Guy Gavriely

Yes, I am aware of that. It's not a super large text file; I am only trying to grasp list comprehension, so this was just a toy example. In this specific example I already read the text file with something similar to what you posted and got a list of words. – Anastasia Dec 07 '13 at 22:59

score 0 · Answer 5 · answered Dec 07 '13 at 22:51

I'd guess the fastest approach is to move as much of the computation as possible from Python to C. First, precompute the set of forbidden strings. This needs to be done only once.

avoid = set(string.punctuation) | set(x.lower() for x in stopset)

Then let the set subtraction operation do as much of the filtering as possible:

final = set(x.lower() for x in mywords) - avoid

Converting the whole source of words at once to lowercase before starting would probably improve speed too. In that case the code would be

final = set(mywords) - avoid
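Keep in mind that a set holds each word once and has no order, so this approach fits when only the distinct words matter. The sketch below (Python 3 assumed, sample data invented for illustration) contrasts it with a `collections.Counter`, which reuses the same precomputed `avoid` set but keeps the number of occurrences.

```python
import string
from collections import Counter

# Sample data with repeats, assumed for illustration only.
mywords = ['Cold', ',', 'is', 'cold', 'today', '?', 'COLD']
stopset = {'is'}

avoid = set(string.punctuation) | {x.lower() for x in stopset}

# Set subtraction: fast, but duplicates and ordering are lost.
distinct = {x.lower() for x in mywords} - avoid

# Counter: same filter, but each occurrence is counted.
counts = Counter(w for w in map(str.lower, mywords) if w not in avoid)

print(sorted(distinct))
print(counts['cold'])
```

Membership tests against `avoid` are exact matches on whole strings, unlike the substring test performed by `in string.punctuation`.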

score 0 · Answer 6 · answered Dec 07 '13 at 22:53

You can use map to fold in the .lower processing

final = [word for word in map(str.lower, mywords) if word not in string.punctuation and word not in stopset]

If you simply add the characters of string.punctuation to stopset, it becomes

final = [word for word in map(str.lower, mywords) if word not in stopset]
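How the punctuation gets into `stopset` matters. A minimal sketch, assuming Python 3 and that `stopset` is a set (the sample values are illustrative): `add` would insert the whole punctuation string as one element, while `update` inserts each character separately.

```python
import string

# Sample data, assumed for illustration only.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

# add() inserts the entire punctuation string as a single element,
# so individual marks such as ',' are still not members.
wrong = set(stopset)
wrong.add(string.punctuation)

# update() iterates over the string and inserts each character on its own.
combined = set(stopset)
combined.update(string.punctuation)

final = [word for word in map(str.lower, mywords) if word not in combined]
print(final)
```

If `stopset` is a list instead, `stopset + list(string.punctuation)` gives the equivalent result, at the cost of slower membership tests.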

Are you sure you don't want to preserve the case of the words in the output, though?

answered Dec 07 '13 at 22:53

John La Rooy

Yes, I am sure. I am only interested in frequency of words in this case. – Anastasia Dec 07 '13 at 23:00

score 0 · Answer 7 · 2013-12-07T23:37:11.710

is there any faster way to achieve this than to iterate through the list 3 times?

Turn jonrsharpe's code into a generator. This may reduce memory use, and possibly run time as well, because no intermediate list is built.

import string
stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
# Lowercase first, so that the punctuation and stop-word tests see the lowercased word.
final = (word for word in (w.lower() for w in mywords)
         if word not in string.punctuation and word not in stopset)
print("final", list(final))

To inspect the results of a generator while debugging, wrap it in list(), as in this example.
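One thing to watch for when debugging this way: a generator can be consumed only once. The sketch below (Python 3 assumed, sample data illustrative) shows that a second call to `list()` on the same generator yields nothing.

```python
import string

# Sample data, assumed for illustration only.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

final = (word for word in (w.lower() for w in mywords)
         if word not in string.punctuation and word not in stopset)

first_pass = list(final)   # consumes the generator
second_pass = list(final)  # the generator is already exhausted

print(first_pass)
print(second_pass)
```

If the words are needed more than once, keep the list produced by the first pass instead of the generator itself.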

score 0 · Answer 8 · answered Dec 08 '13 at 00:48

If you use filter, you can do it with one list comprehension, and it is easier to read.

final = list(filter(lambda s: s not in string.punctuation and s not in stopset, [word.lower() for word in mywords]))

answered Dec 08 '13 at 00:48

Javier Castellanos
