Would I be better off using memoryview, itertools.islice or something else (e.g. var = (start, stop)) as a pointer in Python to a substring of a very large string?

Context: I have some very long strings that I need to manipulate (cut and paste substrings, etc.) without creating a new string each time.

I accomplish that by making a binary search tree in which each node represents a substring and then using split/merge operations (a Rope data structure).

Each node needs a reference attached to it to the substring of the original very large string that the node represents. (That's necessary so that, when I walk the tree in-order to produce the final edited string, I get back the parts of the original string in the amended sequence.)

I could attach a tuple representing start/stop values to each node and then use slicing string[start:stop], but in C you would use a pointer and a character count.
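For reference, the start/stop idea can be sketched roughly as follows. This is only a minimal illustration, not a full rope: the names `Node`, `pieces` and `render` are made up for the example, and it assumes every node (not just the leaves) carries a range into one shared source string.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical tree node holding offsets into a shared source string."""
    start: int
    stop: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def pieces(node: Optional[Node], source: str) -> Iterator[str]:
    """Walk the tree in-order and yield each node's substring."""
    if node is None:
        return
    yield from pieces(node.left, source)
    # Slicing a str copies the characters, but only here, at output time.
    yield source[node.start:node.stop]
    yield from pieces(node.right, source)


def render(root: Optional[Node], source: str) -> str:
    """Join the in-order pieces into the final edited string."""
    return "".join(pieces(root, source))


source = "hello world"
# Move "world" in front of "hello " without touching the source string.
root = Node(0, 6, left=Node(6, 11))
print(render(root, source))
```

With this layout the tree itself only stores integers, so split and merge operations never copy text; the copy happens once, when the result is rendered.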

Would it be better to do something similar in Python, either with memoryview or with islice or with something else?

George Hilliard
curlew77
  • How did you load the string into memory? If it's from the filesystem you might want to look into [`mmap`](https://docs.python.org/3.6/library/mmap.html) – metatoaster May 28 '18 at 00:12
  • Thanks, @metatoaster, will do. At present, I'm reading test files into memory all at once and storing them as default unicode strings in python3. But in the future, I'd like to be able to handle large files in chunks. – curlew77 May 28 '18 at 00:15
  • Oh, if you are doing actual human readable text manipulation this can be trickier, you might want to consider using `ctypes` directly, such as [`ctypes.create_unicode_buffer`](https://docs.python.org/3/library/ctypes.html#ctypes.create_unicode_buffer) and work directly within it. – metatoaster May 28 '18 at 00:47
  • I would vote for `memoryview` here, if you are fine with working with what are essentially bytes. `itertools.islice` will be memory efficient but will only allow a single pass, and will be slow – juanpa.arrivillaga Nov 14 '18 at 23:47

1 Answer

I'm not familiar enough with the rope data structure or your specific requirements to know how hard a requirement it is not to copy data around. For a lot of use cases having an extra copy in memory isn't a problem, but some optimizations or large files may require other solutions.

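Of the options you listed, memoryview is the only one that gives you a real zero-copy view of the data: slicing a memoryview returns another view rather than a copy. (A start/stop tuple doesn't copy anything either, but only until you actually slice the string with it.) Note that memoryview works on bytes-like objects, not on str, so you would have to work with encoded text. See this question for more information, as well as an answer that includes an example of where memoryview can be useful. While it can speed up some operations, as in that example, there may be better ways to approach the problem or structure your code that remove the need for it in the first place. Your use case and mileage may vary, of course.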
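As a rough sketch of what that looks like in practice (the sample text and offsets here are made up for illustration, and the offsets are byte offsets into the UTF-8 encoding, not character offsets):

```python
text = "héllo wörld"

# memoryview needs a bytes-like object; passing a str raises TypeError.
try:
    memoryview(text)
    str_supported = True
except TypeError:
    str_supported = False

data = text.encode("utf-8")   # one-off copy into bytes
view = memoryview(data)

# Slicing the view gives another view onto the same buffer, not a copy.
# "héllo " takes 7 bytes in UTF-8, so "wörld" occupies bytes 7..13.
piece = view[7:13]

print(str_supported)                     # False
print(piece.obj is data)                 # True: still backed by `data`
print(piece.tobytes().decode("utf-8"))   # copying happens only here
```

The catch for text is visible in the offsets: once any character takes more than one byte, character positions and byte positions no longer line up, so the tree would have to track byte offsets that fall on character boundaries.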

I also found some links talking about memory-mapped (mmap) files and the memoryview (buffer in Python 2) interface. If you do end up needing zero-copy pointers, I'd definitely suggest checking out the memoryview interface.
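If the data comes from a file, the two can be combined. The following is only a sketch under assumptions: it writes a small temporary file to stand in for the real (large) one, and the offsets are arbitrary byte positions chosen for the demo.

```python
import mmap
import os
import tempfile

# Stand-in for a large file on disk (assumption for the demo).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    path = tmp.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; pages are loaded lazily by the OS.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Views must be released before the mmap is closed, hence the
        # nested with-blocks.
        with memoryview(mm) as view:
            with view[6:11] as piece:      # zero-copy slice of the mapping
                result = bytes(piece)      # copy out only what is needed

os.remove(path)
print(result)
```

This keeps the whole file out of Python's heap, at the cost of working in bytes rather than str.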

btharper