20

If you call split() on a Python string, it returns a list of strings. Each of these substrings is a copy of its part of the parent string.

Is it possible to instead get some cheaper slice object that holds only a reference, offset and length to the bits split out?

And is it possible to have some 'string view' to extract and treat these sub-strings as if they are strings yet without making a copy of their bytes?

(I ask as I have very large strings I want to slice and am running out of memory occasionally; removing the copies would be a cheap profile-guided win.)
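
To make the request concrete, here is a minimal sketch (Python 3) of splitting by offsets rather than by copies. The helper name `split_spans` is hypothetical, not a built-in, and it only handles an explicit, non-empty separator. It yields `(start, stop)` pairs, so no substring is created until the caller slices one out.

```python
def split_spans(s, sep):
    """Yield (start, stop) offsets of the pieces s.split(sep) would return."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        stop = s.find(sep, start)
        if stop == -1:
            # Last piece runs to the end of the string.
            yield start, len(s)
            return
        yield start, stop
        start = stop + len(sep)


text = "alpha,beta,,gamma"
spans = list(split_spans(text, ","))
print(spans)
# A piece is only materialised (copied) when it is actually sliced out.
print([text[a:b] for a, b in spans] == text.split(","))
```

The offsets are cheap to keep around, but anything that needs a real `str` still has to slice, which is the copy the question is trying to avoid.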

Will
  • 2
    The answers below that use buffer() only apply to 2.7. memoryview() cannot be used with unicode strings, which are normal strings in 3.x. – Terry Jan Reedy Oct 21 '14 at 19:39

3 Answers

24

In Python 2, buffer will give you a read-only view of a string without copying it.

>>> s = 'abcdefghijklmnopqrstuvwxyz'
>>> b = buffer(s, 2, 10)
>>> b
<read-only buffer for 0x7f935ee75d70, size 10, offset 2 at 0x7f935ee5a8f0>
>>> b[:]
'cdefghijkl'
Ignacio Vazquez-Abrams
8

String objects always point to a NUL-terminated buffer in Python, so substrings must be copied. As Ignacio pointed out, you can use buffer() to get a read-only view on the string memory. The buffer() built-in function has been superseded by the more versatile memoryview objects, though, which are available in Python 2.7 and 3.x (buffer() is gone in Python 3.x). Note that in 3.x a memoryview can only wrap bytes-like objects, not str. The example below is Python 2.7:

s = "abcd" * 50
view = memoryview(s)
subview = view[10:20]
print subview.tobytes()

This code prints

cdabcdabcd

As soon as you call tobytes(), a copy of the string will be created, but the same happens when slicing the old buffer objects as in Ignacio's answer.
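
For Python 3, the same idea can be sketched as follows, assuming the data is available as `bytes` rather than `str` (the variable names are illustrative only). Slicing the view does not copy; `tobytes()` does.

```python
data = b"abcd" * 50          # memoryview needs a bytes-like object in 3.x
view = memoryview(data)
subview = view[10:20]        # slicing a memoryview does not copy the bytes

print(subview.obj is data)   # the slice still refers to the original object
print(len(subview))
print(subview.tobytes())     # tobytes() is where the copy finally happens

# A str does not expose the buffer protocol in Python 3.
try:
    memoryview("abcd")
except TypeError as exc:
    print("TypeError:", exc)
```

If the text has to stay a `str`, this route is not available without encoding it first, which is itself a copy.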

Sven Marnach
  • Yes, it's the copy I'm super-keen to avoid; any thoughts on how to get something that always remains a view yet works like a string? – Will Apr 11 '12 at 05:25
  • 1
    @Will: Both Ignacio's solution and this one avoid the copy if you only keep around the buffer/memoryview. If you want to use it as a string, you have to temporarily turn it into a string and work on it. And as I said before, Python string buffers are NUL-terminated, so it's impossible to use only a portion of another string as a string buffer. – Sven Marnach Apr 11 '12 at 12:04
  • I meant more quacking like a duck; I've added 'in' and iteration to my StringView and it's working nicely. It's just a shame it's not a built-in. – Will Apr 11 '12 at 12:08
2

Here's the quick string-like buffer wrapper I came up with; I was able to use this in place of classic strings without changing the code that expected to consume strings.

import sys

class StringView(object):
    def __init__(self,s,start=0,size=sys.maxint):
        self.s, self.start, self.stop = s, start, min(start+size,len(s))
        self.size = self.stop - self.start
        self._buf = buffer(s,start,self.size)
    def find(self,sub,start=0,stop=None):
        assert start >= 0, start
        assert (stop is None) or (stop <= self.size), stop
        ofs = self.s.find(sub,self.start+start,self.stop if (stop is None) else (self.start+stop))
        if ofs != -1: ofs -= self.start
        return ofs
    def split(self,sep=None,maxsplit=sys.maxint):
        assert maxsplit > 0, maxsplit
        ret = []
        if sep is None: #whitespace logic
            pos = [self.start,self.start] # start and stop
            def eat(whitespace=False):
                while (pos[1] < self.stop) and (whitespace == (ord(self.s[pos[1]])<=32)):
                    pos[1] += 1
            def eat_whitespace():
                eat(True)
                pos[0] = pos[1]
            eat_whitespace()
            while pos[1] < self.stop:
                eat()
                ret.append(self.__class__(self.s,pos[0],pos[1]-pos[0]))
                eat_whitespace()
                if len(ret) == maxsplit:
                    ret.append(self.__class__(self.s,pos[1]))
                    break
        else:
            start = stop = 0
            while len(ret) < maxsplit:
                stop = self.find(sep,start)
                if -1 == stop:
                    break
                ret.append(self.__class__(self.s,self.start+start,stop-start))
                start = stop + len(sep)
            ret.append(self.__class__(self.s,self.start+start,self.size-start))
        return ret
    def split_str(self,sep=None,maxsplit=sys.maxint):
        "if you really want strings and not views"
        return [str(sub) for sub in self.split(sep,maxsplit)]
    def __cmp__(self,s):
        if isinstance(s,self.__class__):
            return cmp(self._buf,s._buf)
        assert isinstance(s,str), type(s)
        return cmp(self._buf,s)
    def __len__(self):
        return self.size
    def __str__(self):
        return str(self._buf)
    def __repr__(self):
        return "'%s'"%self._buf

if __name__=="__main__":
    test_str = " this: is: a: te:st str:ing :"
    test = StringView(test_str)
    print "find('is'):"
    print "\t",test_str.find("is")
    print "\t",test.find("is")
    print "find('is',4):"
    print "\t",test_str.find("is",4)
    print "\t",test.find("is",4)
    print "find('is',4,7):"
    print "\t",test_str.find("is",4,7)
    print "\t",test.find("is",4,7)
    print "split():"
    print "\t",test_str.split()
    print "\t",test.split()
    print "split(None,2):"
    print "\t",test_str.split(None,2)
    print "\t",test.split(None,2)
    print "split(':'):"
    print "\t",test_str.split(":")
    print "\t",test.split(":")
    print "split('x'):"
    print "\t",test_str.split("x")
    print "\t",test.split("x")
    print "''.split('x'):"
    print "\t","".split("x")
    print "\t",StringView("").split("x")
Will
  • You should consider writing the main stanza as a doctest in the real thing. – Ignacio Vazquez-Abrams Apr 10 '12 at 17:55
  • On a 32-bit system, every single instance of this class will use 232 bytes of memory, on a 64-bit system it will be even more, so this will be only worth it for rather long substrings. You should at least use `__slots__` to reduce the memory each instance consumes to about half that amount. – Sven Marnach Apr 11 '12 at 12:23
  • To save even more memory, either get rid of the buffer object, or get rid of `s`, `start` and `stop`. In any case, get rid of `size`. – Sven Marnach Apr 11 '12 at 12:30
  • Yep; my strings are 10MB+ and I have a lot of them. You'll see that I'm using s itself as much as possible, in the hope that its .find etc. are implemented in C. – Will Apr 11 '12 at 12:47
  • Removing the use of buffer should allow a version of this class to work in 3.x. – Terry Jan Reedy Oct 21 '14 at 19:37
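
Pulling the comments above together, here is a rough Python 3 sketch of what a buffer-free variant might look like. It is not the original author's code: the name `StrView` and its exact method set are assumptions made for illustration. It uses `__slots__`, stores only `s`, `start` and `stop`, and supports only explicit-separator `split`.

```python
class StrView:
    """Hypothetical Python 3 take on the StringView idea: no buffer() needed."""

    __slots__ = ("s", "start", "stop")   # avoid a per-instance __dict__

    def __init__(self, s, start=0, stop=None):
        self.s = s
        self.start = start
        self.stop = len(s) if stop is None else min(stop, len(s))

    def __len__(self):
        return self.stop - self.start

    def __str__(self):
        # The only place a copy of the viewed characters is made.
        return self.s[self.start:self.stop]

    def __repr__(self):
        return "StrView(%r)" % str(self)

    def __eq__(self, other):
        if isinstance(other, StrView):
            other = str(other)           # copies the right-hand side only
        if not isinstance(other, str):
            return NotImplemented
        return (len(other) == len(self)
                and self.s.startswith(other, self.start, self.stop))

    def __contains__(self, sub):
        return self.find(sub) != -1

    def __iter__(self):
        for i in range(self.start, self.stop):
            yield self.s[i]

    def find(self, sub, start=0, stop=None):
        # Delegate to str.find on the parent so the search stays in C.
        end = self.stop if stop is None else min(self.start + stop, self.stop)
        ofs = self.s.find(sub, self.start + start, end)
        return ofs if ofs == -1 else ofs - self.start

    def split(self, sep):
        # Explicit-separator split only; returns views, not strings.
        if not sep:
            raise ValueError("empty separator")
        pieces, start = [], 0
        while True:
            stop = self.find(sep, start)
            if stop == -1:
                break
            pieces.append(StrView(self.s, self.start + start, self.start + stop))
            start = stop + len(sep)
        pieces.append(StrView(self.s, self.start + start, self.stop))
        return pieces


text = " this: is: a: te:st str:ing :"
view = StrView(text)
parts = view.split(":")
print([str(p) for p in parts] == text.split(":"))
print(view.find("is", 4), "te" in view, len(parts[1]))
```

Each view still costs one small Python object, so, as noted in the comments, this is only likely to pay off when the substrings are long.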