I'm fairly new to Python and especially new to working with large amounts of data. I'm working on a fun little project, effectively a scaled-up version of something I've done before in another language.
For now, I'm loading a sizeable (100 MB+) text document, breaking it up into words, and then determining the frequencies of the words that follow each prefix (each prefix being one or more words). It's fairly simple and fun to implement in Python, and I ended up with something along the lines of:
def prefix(symbols):
    relations = {}
    for i in reversed(range(len(symbols))):
        # Join the next "samples" words into a single prefix key.
        prefix = seperator.join(symbols[i:i+samples])
        # The word that follows the prefix, or None at the end of the text.
        suffix = None
        if i + samples < len(symbols):
            suffix = symbols[i + samples]
        # Count how often this suffix follows this prefix.
        if prefix not in relations:
            relations[prefix] = {}
        if suffix not in relations[prefix]:
            relations[prefix][suffix] = 1
        else:
            relations[prefix][suffix] += 1
    return relations
(the function name, its argument, and the use of a global "samples" are just temporary while I work out the kinks)
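For context, the code around it is nothing special; roughly along these lines, with the file name and the globals as placeholders:

# Placeholder driver: load the dump, split it into words, build the table.
samples = 2        # words per prefix
seperator = " "    # used to join the prefix words into a single key

with open("wikipedia_dump.txt") as f:
    words = f.read().split()

relations = prefix(words)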
This works well, taking about 20 seconds to process a 60 MB plaintext dump of top articles from Wikipedia. Stepping up the sample size (the "samples" variable) from 1 to even 4 or 5, however, greatly increases memory usage (as expected -- there are ~10 million words, and for each one, "samples" words are sliced out and joined into a new string). This quickly hits the 2 gigabyte memory limit.
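For concreteness, a rough way to gauge how big the table itself has grown is to walk the nested dicts with sys.getsizeof; this is just an illustration, and it double-counts strings shared between entries, so treat the result as an approximation rather than an exact figure:

import sys

def table_size(relations):
    # sys.getsizeof only measures the outer dict, so walk the nested
    # structure and sum the sizes of every key, inner dict, and count.
    total = sys.getsizeof(relations)
    for prefix_key, suffixes in relations.items():
        total += sys.getsizeof(prefix_key) + sys.getsizeof(suffixes)
        for suffix_key, count in suffixes.items():
            total += sys.getsizeof(suffix_key) + sys.getsizeof(count)
    return total

print("%.1f MB" % (table_size(relations) / (1024.0 * 1024.0)))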
One method I've applied to alleviate this problem is to delete the initial words from the list as I iterate, since they are no longer needed (the list of words could simply be constructed inside the function, so I'm not modifying anything passed in).
def prefix(symbols):
    relationships = {}
    n = len(symbols)  # remember the original length, since the list shrinks below
    for _ in range(n):
        prefix = seperator.join(symbols[0:samples])
        suffix = None
        if samples < len(symbols):
            suffix = symbols[samples]
        if prefix not in relationships:
            relationships[prefix] = {}
        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1
        # Drop the word just consumed so it can be garbage collected.
        del symbols[0]
    return relationships
This does help, but not by much, and at the cost of some performance (presumably because del symbols[0] has to shift the entire remaining list each time).
So what I'm asking is whether this kind of approach is fairly efficient or recommended, and if not, what would be more suitable? I may be missing some method of avoiding redundantly creating strings and copying lists, seeing as most of this is new to me. I'm considering:
- Chunking the symbols/words list, processing each chunk, dumping its relationships to disk, and combining them after the fact (rough sketch below)
- Working with something like Redis, rather than keeping the relationships in Python memory the whole time (rough sketch below)
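Just to show what I mean by the first option, a rough sketch of the chunking idea (the chunk size, file names, and pickle-based merge are all placeholder choices, not something I've settled on):

import pickle

CHUNK = 1000000  # placeholder chunk size, in words

def build_in_chunks(symbols):
    # Build a partial table per chunk and pickle it to disk.
    paths = []
    for start in range(0, len(symbols), CHUNK):
        # Overlap chunks by "samples" words so boundary prefixes still see their suffixes.
        chunk = symbols[start:start + CHUNK + samples]
        path = "relations_%d.pkl" % start
        with open(path, "wb") as f:
            pickle.dump(prefix(chunk), f)
        paths.append(path)

    # Merge the partial tables back together.
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            part = pickle.load(f)
        for pre, suffixes in part.items():
            bucket = merged.setdefault(pre, {})
            for suf, count in suffixes.items():
                bucket[suf] = bucket.get(suf, 0) + count
    return merged

(Obviously the merged table still has to fit in memory at the end, so the merge itself might also need to be done incrementally or kept on disk.)

And a rough sketch of the Redis option using redis-py, with one hash per prefix and HINCRBY doing the counting (the connection details and key naming are placeholders):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def prefix_redis(symbols):
    for i in range(len(symbols)):
        pre = seperator.join(symbols[i:i + samples])
        if i + samples < len(symbols):
            suf = symbols[i + samples]
        else:
            suf = ""  # hash fields must be strings, so stand in for None
        # One Redis hash per prefix; HINCRBY creates the field if it doesn't exist.
        r.hincrby("prefix:" + pre, suf, 1)

(I'd expect this to need pipelining or batching in practice, since a round trip per word would be slow, but it would keep the counts out of Python's memory.)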
Cheers for any advice or assistance!