0

I am processing a large text file and as output I have a list of words:

['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December', ...]

What I want to achieve next is to transform everything to lowercase, remove all the words that belong to a stopset (commonly used words) and remove punctuation. I can do it by doing 3 iterations:

lower=[word.lower() for word in mywords]
removepunc=[word for word in lower if word not in string.punctuation]
final=[word for word in removepunc if word not in stopset]

I tried to use

[word for word in lower if word not in string.punctuation or word not in stopset]

to achieve what last 2 lines of code are supposed to do but it's not working. Where is my error and is there any faster way to achieve this than to iterate through the list 3 times?

Anastasia

8 Answers

2

If your code works as intended, I don't think compressing it is a good idea. As written, it is readable and easy to extend with additional processing. One-liners are good for collecting upvotes on SO, but you'll have a hard time understanding the logic some time later.

You can replace the intermediate lists with generators, so that the data is walked only once and no intermediate lists are built:

lower = (word.lower() for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
final = [word for word in removepunc if word not in stopset]
alko
  • Thanks for the useful comment! I am still learning Python so I am basically trying to work on small examples and apply whichever technique I can think of just to practice. I will keep this in mind. Thanks! – Anastasia Dec 07 '13 at 22:53
1

You can certainly compress the logic:

final = [word for word in map(str.lower, mywords)
         if word not in string.punctuation and word not in stopset]

For example, if I define stopset = ['is'] I get:

 ['today', 'cold', 'outside', '2013', 'december']
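
As for why the `or` version in the question does not work: a word should be kept only when it is neither punctuation nor a stop word, that is `not (A or B)`, which by De Morgan's law equals `not A and not B`. With `or`, a word is dropped only if it fails both tests at once. A minimal sketch, assuming the sample list from the question and an illustrative stopset of `['is']`:

```python
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = ['is']  # assumed stop words, for illustration only

lower = [word.lower() for word in mywords]

# `or` keeps a word if it passes EITHER test, so here nothing is removed.
with_or = [w for w in lower
           if w not in string.punctuation or w not in stopset]

# `and` keeps a word only if it passes BOTH tests.
with_and = [w for w in lower
            if w not in string.punctuation and w not in stopset]

print(with_or)
print(with_and)
```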
jonrsharpe
0

Here is the equivalent single list comprehension, although I agree with alko that what you already have is clearer:

final = [lower_word for word in mywords for lower_word in (word.lower(),) if lower_word not in string.punctuation and lower_word not in stopset]
Vaughn Cato
0

Note that list comprehensions are not the best way to go when it comes to large files, as the entire file will have to be loaded into memory.

Instead, do something like the approach from "Read large text files in Python, line by line without loading it in to memory":

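with open("log.txt") as infile:
    for line in infile:  # the file object yields one line at a time
        if line.strip():  # placeholder condition; your own clause goes here
            ...  # process the line here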
Guy Gavriely
  • Yes, I am aware of that. It's not super large text file, I am only trying to grasp list comprehension so this was just a toy example. In this specific example I already read the text file with something similar to what you posted and got a list of words. – Anastasia Dec 07 '13 at 22:59
0

I'd guess the fastest approach is to move as much of the computation as possible from Python to C. First precompute the set of forbidden strings. This needs to be done just once.

avoid = set(string.punctuation) | set(x.lower() for x in stopset)

Then let the set subtraction operation do as much of the filtering as possible:

final = set(x.lower() for x in mywords) - avoid

Converting the whole source of words at once to lowercase before starting would probably improve speed too. In that case the code would be

final = set(mywords) - avoid
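
One caveat worth noting: a set holds each word once and has no defined order, so the set-subtraction result loses duplicates and the original word order. If those matter, a sketch like the following keeps a list comprehension but still uses the precomputed set for fast membership tests. The sample data and the contents of `stopset` are assumptions for illustration:

```python
import string

mywords = ['today', ',', 'is', 'cold', 'outside', 'today', '?', 'December']
stopset = ['is']  # assumed stop words, for illustration only

# Precompute the forbidden strings once.
avoid = set(string.punctuation) | set(x.lower() for x in stopset)

# Set subtraction: unordered, duplicates collapsed.
as_set = set(x.lower() for x in mywords) - avoid

# List comprehension with set lookup: order and duplicates preserved.
as_list = [w for w in (x.lower() for x in mywords) if w not in avoid]

print(sorted(as_set))
print(as_list)
```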
6502
0

You can use map to fold in the .lower processing

final = [word for word in map(str.lower, mywords) if word not in string.punctuation and word not in stopset]

You can simply add string.punctuation to stopset, then it becomes

final = [word for word in map(str.lower, mywords) if word not in stopset]
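
Merging the two could look like the sketch below. It assumes `stopset` starts out as a plain list or set of lowercase words; the sample data is taken from the question and the stop word is made up for illustration:

```python
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = ['is']  # assumed stop words, for illustration only

# Build one combined set up front so each word needs a single lookup.
stopset = set(stopset) | set(string.punctuation)

final = [word for word in map(str.lower, mywords) if word not in stopset]
print(final)
```

Using a set here also makes each membership test a hash lookup rather than a scan of a list.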

Are you sure you don't want to preserve the case of the words in the output though?

John La Rooy
0

is there any faster way to achieve this than to iterate through the list 3 times?

Turn jonrsharpe's code into a generator. A generator avoids building intermediate lists, which lowers memory use and may also speed things up.

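import string
stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
# Lowercase first, so the punctuation and stopset tests see the lowercased word.
final = (word for word in (w.lower() for w in mywords)
         if word not in string.punctuation and word not in stopset)
print("final", list(final))  # Python 3 print; list() consumes the generator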

To inspect the results for debugging, wrap the generator in list(), as in this example. Note that a generator can only be consumed once.

0

If you use filter you can do it with one list comprehension and it is easier to read.

final = filter(lambda s: s not in string.punctuation and s not in stopset, [word.lower() for word in mywords])
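
On Python 3, `filter` returns a lazy iterator rather than a list, so wrap it in `list()` if a list is needed. A small sketch with assumed sample data; the helper name `keep` is hypothetical and only replaces the lambda for readability:

```python
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = ['is']  # assumed stop words, for illustration only

def keep(s):
    # Hypothetical helper: True when s is neither punctuation nor a stop word.
    return s not in string.punctuation and s not in stopset

# The generator expression lowercases lazily; list() materializes the result.
final = list(filter(keep, (word.lower() for word in mywords)))
print(final)
```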
Javier Castellanos