
I have a very long string named `tekst` (600 MB read from a file) and a list of 11,000 words called `nlwoorden`. I want everything that is in `tekst` but not in `nlwoorden`.

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

would produce exactly what I want. Obviously, this takes a very long time to compute. Is there a more efficient way?

Thanks!

– damian
  • Make a `set` from `nlwoorden`. Tada! – tobias_k Jun 26 '14 at 20:42
  • `belangrijk = set(tekst.split()) - set(nlwoorden)` – Ry- Jun 26 '14 at 20:43
  • @false: I'm a bit of a new user, so maybe there's something I don't know, but is there a reason why you're posting the answer as a comment instead of an answer? – Sohcahtoa82 Jun 26 '14 at 20:48
  • @Sohcahtoa82: I’m still trying to find a duplicate, but I also wanted to expand on what tobias_k said, and answering a question you’re about to close is bad form. Stack Overflow’s hard! Anyways, sorry. – Ry- Jun 26 '14 at 20:49

4 Answers


Using a set-based solution, the difference itself is one O(1) hash lookup per word, so it costs O(len(tekst.split())) in total, and building the two sets takes another O(len(tekst)) + O(len(nlwoorden)). The whole thing is linear, instead of the O(len(tekst.split()) * len(nlwoorden)) your list-based membership test gives you.

So the snippet you're looking for is basically the one listed in a comment:

belangrijk = list(set(tekst.split()) - set(nlwoorden))

(assuming you want it as a list again at the end)
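
One caveat: the set difference drops duplicate words and the original ordering. If either of those matters, a minimal sketch of an alternative (only the lookup side becomes a set; `nlwoorden_set` is just an illustrative name) is:

nlwoorden_set = set(nlwoorden)

# same comprehension as before, but each membership test is now O(1) on average
belangrijk = [woord for woord in tekst.split() if woord not in nlwoorden_set]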

– Patrick Collins

I think the most straightforward approach would be to use sets. For example,

s = "This is a test"
s2 = ["This", "is", "another", "test"]
set(s.split()) - set(s2)

# returns {'a'}

However, given the size of your text, it might be worthwhile to use a generator, so you avoid holding the full list of words in memory at once, e.g.,

import re

def itersplit(s, sep=None):
    # mimic str.split: runs of whitespace when sep is None, else the literal sep
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            # no more separators: yield the tail (split with sep=None suppresses
            # a trailing empty field, while an explicit sep keeps it)
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

s2set = set(s2)  # convert the lookup side to a set, for O(1) membership tests
[word for word in itersplit(s) if word not in s2set]

# returns ['a']
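
As a quick sanity check (assuming the `itersplit` above), its output matches `str.split` for both the whitespace and explicit-separator cases:

s = "This  is a test"
assert list(itersplit(s)) == s.split()
assert list(itersplit("a,b,,c", ",")) == "a,b,,c".split(",")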

This snippet:

woord not in nlwoorden

will take O(N) for N = len(nlwoorden) each time it is called.

So your list comprehension,

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

takes a total time of O(N * M) for M = len(tekst.split()).

This is because `nlwoorden` is a list, not a set. Testing for membership in an unordered list means traversing the whole list in the worst case.

This is why your statement took a long time with a large input size.

If you use a hash set instead, membership tests take constant time on average once the set is constructed.

So, in prototypical code form, something like this:

def words(fileobj):
    for line in fileobj:             # iterates lazily; large files are read a buffered chunk at a time
        for word in line.split():
            yield word

# first, build the set of whitelisted words
wpath = 'whitelist.txt'
wset = set()
with open(wpath) as w:
    for word in words(w):
        wset.add(word)

def emit(word):
    # output 'word' - to a list, to another file, to a pipe, etc.
    print(word)

fpath = 'input.txt'
with open(fpath) as f:
    for word in words(f):               # total run time - O(M) where M = len(words(f))
        if word not in wset:            # testing for membership in a hash set - O(1)
            emit(word)
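
For example, with two small files (hypothetical contents, purely to illustrate the script above):

with open('whitelist.txt', 'w') as w:
    w.write('de het een\n')
with open('input.txt', 'w') as f:
    f.write('de kat zat op het mat\n')

# running the script above would then print:
# kat
# zat
# op
# mat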
– Santa

Read and process "woorden.txt" line by line

  1. Load all nlwoorden into a set (this is more efficient than a list).
  2. Read the large file part by part, split each part, and write to the resulting file only what is not in nlwoorden.

Assuming your large 600 MB file has reasonably long lines (not one 600 MB line), I would do it this way:

nlwoorden = set()
with open("nlwoorden.txt") as f:
    for line in f:
        nlwoorden.update(line.split())

with open("woorden.txt") as f, open("out.txt", "w") as fo:
    for line in f:
        newwords = set(line.split())        # note: this also deduplicates words within a line
        newwords.difference_update(nlwoorden)
        fo.write(" ".join(newwords) + "\n")

Conclusions

This solution should not consume much memory, as you never read all the data from "woorden.txt" at once.

In case your file is not split into lines, you must change the way you read parts of the file; see the sketch below. But I assume your file will have newlines.
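
For completeness, here is a minimal sketch of that case (assumption: the file has no newlines at all; `chunk_words` is a hypothetical helper, not part of the answer above). It reads fixed-size chunks and carries a possibly-incomplete last word over to the next chunk:

def chunk_words(f, size=1 << 20):
    rest = ""                      # possibly-incomplete word from the previous chunk
    while True:
        chunk = f.read(size)
        if not chunk:
            if rest:
                yield rest         # flush the final word
            return
        parts = (rest + chunk).split()
        if chunk[-1].isspace():
            rest = ""              # chunk ended on whitespace, so every word is complete
        else:
            rest = parts.pop() if parts else ""
        yield from parts

The main loop stays the same; it just iterates `for word in chunk_words(f)` instead of line by line.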

– Jan Vlcinsky