How to improve lambda expression in python suffix array

Question

There is a code that uses lambda expression

def ComputeArray(text):
    # text is ended with $
    if text[-1] != "$":
        text += "$"
    sarray  = sorted(range(len(text)), key = lambda i: text[i:])
    print ", ".join([str(x) for x in sarray])

if __name__ == "__main__":
    ComputeArray("AACGATAGCGGTAAACGATAGCGGTAGA$")

it correctly outputs desired array

28, 27, 12, 0, 13, 1, 14, 25, 6, 19, 4, 17, 2, 15, 8, 21, 26, 3, 16, 7, 20, 9, 22, 10, 23, 11, 24, 5, 18

How could I improve line

sarray  = sorted(range(len(text)), key = lambda i: text[i:])

so when increasing length of text I do not use lots of memory on a lambda expression?

Traceback (most recent call last):
  File "C:\Array.py", line 23, in <module>
    ComputeArray(text)
  File "C:\Array.py", line 11, in ComputeArray
    sarray  = sorted(range(len(text)), key = lambda i: text[i:])
  File "C:\Array.py", line 11, in <lambda>
    sarray  = sorted(range(len(text)), key = lambda i: text[i:])
MemoryError

UPDATE

There is other code like:

sarray=[]
for i in range(len(text)):
  sarray.append(text[i:])
order=[i[0] for i in sorted(enumerate(sarray), key=lambda x:x[1])]
print ", ".join([str(x) for x in order])

However is taking so much memory,

Also I tried solution using library available on https://code.google.com/p/pysuffix/

s = 'AACGATAGCGGTAGA'
s = unicode(s,'utf-8','replace') 
n = len(s) 
sa = tks.simple_kark_sort(s) 
lcp = tks.LCP(s,sa) 
print n print sa

although it solves the problem, it takes too much time with larger strings, ... do you know other library or a method to improve suffix?

score 6 · Accepted Answer · edited May 23 '17 at 10:25

Looks like you're trying to build a suffix array. Luckily, there are already Python implementations of this algorithm: https://code.google.com/p/pysuffix/

If you must implement it yourself, think about what your code is doing:

Make a list of integers the same length as your text with range.
Apply the key function to each element of the list and store the result in a new list.
Sort the new list
Return the original integer associated with each element of the new list.

(This is also known as the Schwartzian Transform, which a pretty neat idea.)

The point is, you're making a slice of your (presumably large) text for each offset in the text, and storing it in a new list. You'll want to use a more specialized suffix array construction algorithm to avoid this cost.

Finally, to address your original question: the lambda expression isn't the culprit here. You're simply running into an algorithmic wall.

EDIT

Here's a good resource for fast SA algorithms: What's the current state-of-the-art suffix array construction algorithm?

I am using library like: ` s = 'AACGATAGCGGTAGA' s = unicode(s,'utf-8','replace') n = len(s) sa = tks.simple_kark_sort(s) lcp = tks.LCP(s,sa) print n print sa ` however is taking too much with larger strings... do you know other library ? — edgarmtze, Jan 19 '14 at 04:05
I've added a link to more info. You may have to use a C/C++ library, perhaps wrapping the interface up to call it from Python code. — perimosocordiae, Jan 19 '14 at 05:04

abarnert · Answer 2 · 2014-01-19T03:35:54.933

5

How could I improve line … so when increasing length of text I do not use lots of memory on a lambda expression?

That lambda expression creates a simple function. Most likely the bytecode for that function is around 8 bytes, and the cost of the code and function objects that get wrapped around it are maybe 80-128 bytes depending on your platform. Only one such function exists at a time, so your total memory cost is 8 bytes in the module, and another 128 bytes while this function runs.

So… do nothing.

To reduce memory usage, you have to find the part of your code that actually is using a lot of memory, and reduce that. In particular, it's most likely that you're creating an list of N numbers (with range), and then another list of N numbers (with sorted). You're also creating N strings of N/2 length transiently, which you don't need if you make your comparison a little more subtle. And so on. Don't do that.

Using an xrange to replace the first list.
Rewrite your algorithm so it doesn't need to build a whole sorted list, and instead generates the elements one by one in the first place.
If that's not possible, consider using an array.array or a NumPy ndarray, which will at least eliminate the 24-48 bytes of "boxing" overhead for each number.
You can write a cleverer key function that doesn't need to actually hold onto the suffix in its result object; instead, it holds onto i and to text, and has custom comparison operators that reference text[i:] as needed. Then at worst two of these suffixes will exist at the same time.
If you're generating the values one by one, you can't use str.join (because it will create a list behind your back), so you will need a different alternative, like writing them to a StringIO. Or, in your case, since you're not actually returning anything, just dumping it to stdout, just print them one by one in a loop.

edited Jan 19 '14 at 03:35

answered Jan 19 '14 at 03:29

abarnert

354,177
51
601
671

2

I'm not entirely sure I agree. Because we're guaranteed that the key function is only evaluated once on each element being sorted, it has to cache the results, resulting in enormous intermediate storage. This is why `ComputeArray("a"*100000)` gives me a `MemoryError`, but `sarray = sorted(range(len(text)), cmp=lambda i,j: cmp(text[i:], text[j:]))` works just fine. ISTM the OP has actually correctly identified the location triggering the `MemoryError`. – DSM Jan 19 '14 at 03:59
Would it be a way to optimize it a little bit? – edgarmtze Jan 19 '14 at 04:09
@DSM: The function isn't taking up memory. The values that it returns, and that `sorted` caches, may be—as I already explained in the answer. But the way to fix that is to either (a) rewrite the algorithm so it doesn't use `sorted`, or (b) return _different values_ for it to cache. Changing the way it returns the same value will not help anything. Which, again, I already explained in the answer. – abarnert Jan 19 '14 at 04:16
@abarnert: it's surprising how often the two of us don't agree on what's clear, but perhaps I'm old and set in my country ways. – DSM Jan 19 '14 at 05:22
@DSM: I'm not sure how to be clearer than "You're also creating N strings of N/2 length transiently, which you don't need if you make your comparison a little more subtle…" followed up by "You can write a cleverer key function that doesn't need to actually hold onto the suffix in its result object; instead, it holds onto i and to text, and has custom comparison operators that reference text[i:] as needed. Then at worst two of these suffixes will exist at the same time." If you can suggest a better way to put that, I'd be happy to hear it. – abarnert Jan 19 '14 at 05:34
@DSM: And I realize there's a good chance that the answer is just "strip out irrelevant stuff and say less in general", which I know I sometimes have a problem with… But examples can still help. – abarnert Jan 19 '14 at 05:35

How to improve lambda expression in python suffix array

UPDATE

Also I tried solution using library available on https://code.google.com/p/pysuffix/

2 Answers2