
I've got the entire contents of a text file (at least a few KB) in string myStr.

Will the following code create a copy of the string (less the first character) in memory?

myStr = myStr[1:]

I'm hoping it just refers to a different location in the same internal buffer. If not, is there a more efficient way to do this?

Thanks!

Note: I'm using Python 2.5.

Glenn Maynard
Cameron
  • @Glenn: Thanks for the edit. I always forget to proof-read the title! – Cameron Mar 16 '10 at 19:41
  • @Mike: Hah, I guess I'm over-optimizing. The files loaded could _potentially_ be large (in theory) -- but currently the largest is 8KB :-) – Cameron Mar 16 '10 at 22:44
  • 1
    A few KB is tiny, but if you're doing an algorithm like `[s[0:n] for n in range(0, len(s))]`, you'll end up with O(n^2), where in-place slicing would give you O(n). You can always code around it, obviously; it's just extra work. – Glenn Maynard Mar 19 '10 at 01:13

4 Answers


At least in 2.6, slices of strings are always new allocations; string_slice() calls PyString_FromStringAndSize(). It doesn't reuse memory, which is a little odd: since strings are immutable, sharing the underlying buffer would be a relatively easy thing to do.

Short of the buffer API (which you probably don't want), there isn't a more efficient way to do this operation.
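
For reference, this is what a zero-copy view looks like. On Python 2.5 the spelling would be `buffer(myStr, 1)`; the sketch below uses `memoryview`, the later (Python 2.7/3.x) successor, and operates on bytes rather than str (an assumption of the example, not of the question):

```python
data = b"\xef\xbb\xbfhello"   # some bytes we want to view without copying

# slicing a memoryview references the original buffer; no new allocation
view = memoryview(data)[3:]

rest = bytes(view)            # converting back to bytes is what finally copies
```

The catch, as noted above, is that a buffer/memoryview is not a string: most string methods are unavailable, so it only helps if the rest of your code can consume raw bytes.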

Glenn Maynard
  • 1
    Thanks for the info. I'm actually using Python 2.5 (I've updated my question) but I doubt it's done differently. I'll just have to live with the duplication, I guess (I _really_ need to remove that one character). – Cameron Mar 16 '10 at 19:37
  • 1
    Can't you just read the first character out of the file, and not assign it to the string to begin with? See my answer, coming momentarily. **edit:** see benson's answer instead. – jcdyer Mar 16 '10 at 20:53

As in most garbage-collected languages, strings are created as often as needed, which is very often. This is because tracking substrings the way you describe would make garbage collection more difficult.

What is the actual algorithm you are trying to implement? It might be possible to suggest ways to get better results if we knew a bit more about it.

As for an alternative, what is it you really need to do? Could you look at the issue differently, such as by just keeping an integer index into the string? Could you use an `array.array('u')`?
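
The integer-index idea can be sketched as a small wrapper (the `StringCursor` class and its method names here are hypothetical, purely for illustration):

```python
class StringCursor(object):
    """Hypothetical helper: track an offset into a string instead of slicing it."""

    def __init__(self, s):
        self.s = s
        self.pos = 0

    def skip(self, n=1):
        self.pos += n                      # O(1), no copy

    def startswith(self, prefix):
        # str.startswith accepts a start offset, so no slice is needed
        return self.s.startswith(prefix, self.pos)

    def materialize(self):
        return self.s[self.pos:]           # copies only when you need a real str

cur = StringCursor("Xhello")
cur.skip()                                 # drop the unwanted first character
rest = cur.materialize()
```

The copy only happens in `materialize()`, so if most operations can work through offset-taking methods like `startswith`, the large string is never duplicated.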

SingleNegationElimination
  • I'm removing the BOM from a UTF-8 decoded file in memory, then sending the contents of this file into a templating engine (Jinja2), then writing the result to an HTML response. I just figured out a way that I'll only have to do this once per template file, though, so it's not really an issue anymore :-) – Cameron Mar 16 '10 at 19:39
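
For the BOM case specifically, Python 2.6+ has a `utf-8-sig` codec that strips the marker as part of decoding; on 2.5 you can test for `codecs.BOM_UTF8` by hand instead. A sketch (the sample bytes are invented for illustration):

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")   # file bytes with a leading BOM

# "utf-8-sig" consumes the BOM, if present, during decoding (Python 2.6+)
text = raw.decode("utf-8-sig")

# plain utf-8 decoding leaves it in as U+FEFF at the front of the string
with_bom = raw.decode("utf-8")
```

Either way, this happens once at decode time, so there is no need to slice the decoded string afterwards.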

One (albeit slightly hacky) solution would be something like this:

f = open("test.c")
f.read(1)         # read and discard the first character
myStr = f.read()  # the rest of the file; the full string is never duplicated
print myStr

It will skip the first character, and then read the data into your string variable.
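
Folding in the comments below (a context manager, which on 2.5 needs `from __future__ import with_statement`), the same idea looks like this; a temporary file stands in for the hypothetical `test.c`:

```python
import os
import tempfile

# create a stand-in for the hypothetical "test.c"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("#include <stdio.h>")

with open(path) as f:   # the file is closed even if an exception is raised
    f.read(1)           # read and discard the first character
    myStr = f.read()    # everything after it

os.remove(path)
```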

Benson
  • Actually, that will read the first byte, not necessarily the first character. In a utf-8 encoded file only 128 US-ASCII characters are encoded in one byte. – tgray Mar 16 '10 at 20:24
  • 3
    So read the first line, convert to unicode, and then strip the first character. Proceed more or less as above, converting to unicode as you go along. If you don't convert, then you're dealing with bytes. – jcdyer Mar 16 '10 at 20:56
  • I would use this technique, but at the time I'm reading it from file I don't know whether the BOM should be kept or not. When I later retrieve the contents (from a DB), I get the entire file back at once. A version of your technique has actually already been presented to me in the answer to another (related) question I asked earlier: http://stackoverflow.com/questions/2456380/utf-8-html-and-css-files-with-bom-and-how-to-remove-the-bom-with-python/2456524#2456524 – Cameron Mar 16 '10 at 22:52
  • 2
    Always use a context manager when dealing with files, i.e. `with open("test.c") as f:` – Mike Graham Mar 16 '10 at 23:22
  • @Mike: I would have, but he said he was using 2.5, and I didn't want to muck about with the `from __future__ import with_statement` junk. – Benson Mar 18 '10 at 21:20
  • @Benson, that isn't junk; that is making your code right. If he was using pre-2.5 Python, I would have said `close` needs to go inside a `finally` block. – Mike Graham Mar 18 '10 at 21:32
  • You might not be reading from a file in the first place; there are many (though infrequent) cases where being able to make string slices for "free" makes it easier to get an efficient algorithm. – Glenn Maynard Mar 19 '10 at 01:10

Depending on what you are doing, `itertools.islice` may be a suitable memory-efficient solution (should one become necessary).
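
A minimal sketch of that idea: `islice` skips the prefix lazily, which helps when you consume the characters one at a time; joining the result back into a string, as here, still materializes a copy:

```python
from itertools import islice

myStr = "\ufeffsome file contents"

# iterate from index 1 onward without building an intermediate string
tail = islice(myStr, 1, None)

rest = "".join(tail)   # this join is the only point where a copy is made
```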

Mike Graham