
I'm fairly new to Python and especially new to working with large amounts of data. I'm working on a little fun project, which is effectively an upscale of something I've done before in another language.

For now, I'm loading a sizeable (100 MB+) text document, breaking it up into words and then determining the frequencies of which words follow each prefix (each prefix being one or more words). It was fairly simple and fun to implement in Python, and I ended up with something along the lines of:

def prefix(symbols):
    relations = {}

    for i in reversed(range(len(symbols))):
        prefix = seperator.join(symbols[i:i+samples])
        suffix = None

        if i+samples < len(symbols):
            suffix = symbols[i+samples]

        if prefix not in relations:
            relations[prefix] = {}

        if suffix not in relations[prefix]:
            relations[prefix][suffix] = 1
        else:
            relations[prefix][suffix] += 1

    return relations

(the function name, its argument and the use of a global "samples" is just temporary while I work out the kinks)

This works well, taking about 20 seconds or so to process a 60 MB plaintext dump of top articles from Wikipedia. Stepping up the sample size (the "samples" variable) from 1 to even 4 or 5, however, greatly increases memory usage (as expected -- there are ~10 million words, and for each one, "samples" words are sliced and joined into a new string). This quickly approaches and then hits the 2 GB memory limit.

One method I've applied to alleviate this problem is to delete the initial words from memory as I iterate, as they are no longer needed (the list of words could simply be constructed as part of the function so I'm not modifying anything passed in).

def prefix(symbols):
    relationships = {}

    n = len(symbols)
    for i in range(n):
        prefix = seperator.join(symbols[0:samples])

        suffix = None
        if samples < len(symbols):
            suffix = symbols[samples]

        if prefix not in relationships:
            relationships[prefix] = {}

        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1

        del symbols[0]
    return relationships

This does help, but not by much, and at the cost of some performance.

So what I'm asking is whether this kind of approach is fairly efficient or recommended, and if not, what would be more suitable? I may be missing some method of avoiding the redundant creation of strings and copying of lists, seeing as most of this is new to me. I'm considering:

  • Chunking the symbols/words list, processing it, dumping the relationships to disk and combining them after the fact (see the rough sketch after this list)
  • Working with something like Redis instead of keeping the relationships within Python the whole time
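
A rough, untested sketch of what I mean by the first option (the chunk size is arbitrary, and it reuses the global "samples" and "seperator" from above):

import os
import pickle

def build_partial(words):
    # Same counting as above, but over a single chunk of words.
    partial = {}
    for i in range(len(words) - samples):
        prefix = seperator.join(words[i:i + samples])
        suffix = words[i + samples]
        bucket = partial.setdefault(prefix, {})
        bucket[suffix] = bucket.get(suffix, 0) + 1
    return partial

def chunked_relationships(symbols, chunk_size=1000000, tmp_dir="chunks"):
    os.makedirs(tmp_dir, exist_ok=True)
    paths = []
    # Chunks overlap by "samples" words so no prefix/suffix pair is lost
    # at a chunk boundary.
    for n, start in enumerate(range(0, len(symbols), chunk_size)):
        chunk = symbols[start:start + chunk_size + samples]
        path = os.path.join(tmp_dir, "part_{}.pkl".format(n))
        with open(path, "wb") as f:
            pickle.dump(build_partial(chunk), f)
        paths.append(path)

    # Combine the partial dictionaries back into one.
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            for prefix, suffixes in pickle.load(f).items():
                bucket = merged.setdefault(prefix, {})
                for suffix, count in suffixes.items():
                    bucket[suffix] = bucket.get(suffix, 0) + count
    return merged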

Cheers for any advice or assistance!

Frohman
    This has nothing to do with your question regarding memory efficiency / performance, but for your `relations[prefix]` and `relations[prefix][suffix]` dictionaries you may want to use a [`defaultdict(dict)`](https://docs.python.org/2/library/collections.html#collections.defaultdict) and a [`Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) respectively - they should be a perfect fit and make your code even more succinct. – Lukas Graf Apr 01 '15 at 09:39
  • if `symbols` is a `list`, don't use `del symbols[0]`; it needs to shift every following element in the list, so a `collections.deque()` should be more efficient. Ideally you could make `symbols` a generator yielding `samples` words while consuming your input file (see the sketch after these comments). – mata Apr 01 '15 at 09:49
  • Which variable is the one whose memory usage is increasing here? – itzMEonTV Apr 01 '15 at 09:50
  • Would a `shelve` help here? https://docs.python.org/3/library/shelve.html and http://pymotw.com/2/shelve/ – cdarke Apr 01 '15 at 10:00
  • @LukasGraf I tried both, in fact it's part of the reason I moved to Python 3.x (better-performing Counter), but in the end I found that I got better performance doing it "manually". I'll probably move back to using defaultdict with Counter afterwards, though! – Frohman Apr 01 '15 at 10:08
  • @mata I'm using a [blist](https://pypi.python.org/pypi/blist/?), which is apparently functionally identical to a normal list other than being generally faster, but good call on the deque, I'll give that a shot. As for a generator, I forgot to add that to what I was considering -- I'm fairly sure I can use the zip() function to achieve this? I'm not very well versed in generators in Python, but I'm effectively trying to return every group of N elements, including overlap (k1 -> k1+N, k2 -> k2+N etc). – Frohman Apr 01 '15 at 10:16
  • I found some interesting tips that can help you - https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops – Alexander R. Apr 01 '15 at 11:07
  • I suppose I should add I'm mostly concerned with memory usage, as I can't see my current implementation scaling to larger files well -- performance is less of an issue here, especially as I plan to eventually chunk out the input to multiple processes to speed up relationship generation. Generator expressions seem very promising, though! – Frohman Apr 01 '15 at 11:17
  • @Frohman [here](http://pastebin.com/2cU4Xtur) is a simple version of a generator yielding tuples of a specified length from an input file. It reads the file one line at a time, and keeps only last `samples` words, so it should be a good solution if you want to keep memory usage low. – mata Apr 01 '15 at 12:17
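
Putting the `defaultdict`/`Counter` and generator suggestions from the comments together, a minimal sketch (assuming the same global `samples` and `seperator` as in the question, and that `words` can be any iterable of words, e.g. one produced lazily from the input file) might look like:

from collections import Counter, defaultdict, deque

def windows(words, size):
    # Yield overlapping tuples of `size` consecutive words; only `size`
    # words are ever held in the deque at once.
    window = deque(maxlen=size)
    for word in words:
        window.append(word)
        if len(window) == size:
            yield tuple(window)

def prefix_counts(words):
    relationships = defaultdict(Counter)
    for *prefix_words, suffix in windows(words, samples + 1):
        relationships[seperator.join(prefix_words)][suffix] += 1
    return relationships

If `words` is a generator that reads the file piece by piece, memory use is dominated by the result dictionary rather than by the full word list.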

3 Answers


The main advice when working with large strings in Python, if you need to make a lot of changes to them, is:

  1. Convert string to list
  2. Do all the work with list
  3. When finished, convert to strings back

The reason is that strings in Python are immutable. Every operation such as symbols[i:i+samples], for example, forces Python to allocate new memory, copy the needed data, and return it as a new string object. Because of that, when you need to make many changes to a string (changing it by index, splitting it), you are better off working with lists, which are mutable.
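
A toy illustration of the point (not taken from the question's code): repeated concatenation builds a brand-new string on every step, while collecting pieces in a list and joining once converts back to a string in a single final step.

words = ["the", "quick", "brown", "fox"]

# Repeated concatenation: each += creates a new string object.
text = ""
for w in words:
    text += " " + w

# List + join: mutate the cheap list, then convert back to a string once.
pieces = []
for w in words:
    pieces.append(w)
text = " ".join(pieces)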

Alexander R.
  • You definitely have a very good point there, accidentally copying strings around in memory can happen pretty quickly in Python and should be avoided. However, I think in the OP's case `symbols` is a list (of words). – Lukas Graf Apr 01 '15 at 09:54
  • Yeah, symbols is a list of strings - I'm fairly sure that means the slices I get from them at least aren't copies. The memory usage issue arises due to the fact that I immediately combine them into a (new?) string `prefix = seperator.join(symbols[0:samples])` (only just noticed the separator typo, hah!). I'm not sure how I can get around this, however, as I believe I need to be able to immediately add relationships (via the dict). I've also used a tuple instead as a key before, but this didn't improve memory usage -- I'm fairly sure it also makes copies. – Frohman Apr 01 '15 at 10:32

Re. speeding up the process, I've found it useful to use try/except blocks to update hash tables (Python dictionaries). For example:

try:
    relationships[prefix][suffix] += 1
except KeyError:
    # assumes relationships[prefix] itself already exists, as it does in
    # the question's code where the prefix entry is created first
    relationships[prefix][suffix] = 1

Rather than using 'in' to check for a key and then updating/creating it (which requires a second lookup of the same key), the above code eliminates one check and thus runs faster.
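
For comparison, this is the membership-test version it replaces (roughly the shape used in the question), which looks the key up once for the check and again for the update:

if suffix in relationships[prefix]:
    relationships[prefix][suffix] += 1
else:
    relationships[prefix][suffix] = 1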


Use `iterator_of_symbols = iter(list_of_symbols)` and call `next()` on that iterator, i.e. `next(iterator_of_symbols)` (in Python 3 the built-in `next()` replaces the old `.next()` method).

Read more: Which is more efficient, a for-each loop, or an iterator?

Although it's explained with Java, I think the concepts are the same.
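
A tiny sketch of the idea, with made-up values:

list_of_symbols = ["alpha", "beta", "gamma"]
iterator_of_symbols = iter(list_of_symbols)

print(next(iterator_of_symbols))  # alpha
print(next(iterator_of_symbols))  # beta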

pnv