
The input is a string containing a huge number of characters, and I want to split this string into a list of strings on a specific delimiter.

But I suspect that simply using split would generate new strings rather than splitting the original input string itself, and in that case it consumes a lot of memory (it's guaranteed that the original string will not be used any longer).

So is there a convenient way to do this destructive split?

Here is the case:

input_string = 'data1 data2 <...> dataN'
output_list = ['data1', 'data2', <...> 'dataN']

What I hope is that data1 in output_list and data1 (and all the others) in input_string share the same memory area.

BTW, each input string is 10 MB-20 MB in size; but since there are lots of such strings (about 100), I guess memory consumption should be taken into consideration here?

Hongxu Chen
  • `comline = "foo bar"; com = comline.split(' '); print(com[0])` will result in "foo", so a string split will actually generate a list. – Rik Verbeek Nov 20 '14 at 07:52
  • What is your guess based on? – Tim Nov 20 '14 at 07:53
  • I think `split` generates a `list` in both Python 2 and Python 3. – Vishnu Upadhyay Nov 20 '14 at 07:54
  • 1
    "a huge number of characters" -- How huge are we talking? Mb? Gb? – mgilson Nov 20 '14 at 07:56
  • I know split would generate a list in Python, but I hope the elements of the list will `reuse` the original string. – Hongxu Chen Nov 20 '14 at 08:00
  • @mgilson MB. I know the design is bad, but I don't have permission to change the input structure. – Hongxu Chen Nov 20 '14 at 08:01
  • Can you show us the input data? – Hackaholic Nov 20 '14 at 08:04
  • 2
    @HongxuChen -- MB isn't really that big of a deal these days. My advice would be to just `.split()` and not worry about it. You'll approximately double your memory consumption -- If the strings are 10-20 Mb, you'll use about 40 Mb. That's really not bad these days. Let the garbage collector clean up after you when it is able. :-) – mgilson Nov 20 '14 at 08:26
  • @mgilson What I worry about is that I have about 100 such strings to handle; will that lead to performance problems? (Currently I'm using the `split` version, but *it seems* slow for the later string handling.) – Hongxu Chen Nov 20 '14 at 08:49

4 Answers


In Python, strings are immutable, which means that any operation that changes a string will create a new string. If you are worried about memory (although this shouldn't be much of an issue unless you are dealing with gigantic strings), you can always rebind the old name to the new, modified string, replacing it.

The situation you are describing is a little different though, because the input to split is a string and the output is a list of strings. They are different types. In this case, I would just create a new variable containing the output of split and then set the old string (that was the input to the split function) to None, since you guarantee it will not be used again.

Code:

split_str = input_string.split(delim)  # delim is whatever delimiter you use
input_string = None  # drop the reference so the old string can be reclaimed
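
(Using `del input_string` instead would work just as well; either way, the point is simply to drop the last reference so the garbage collector can reclaim the 10-20 MB buffer.)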
Duke
  • Is there any string wrapper (i.e. a `mutable` string) for doing string modification operations in place? – Hongxu Chen Nov 20 '14 at 08:04
  • @HongxuChen There is no such thing as a mutable string in Python. I'm not sure why this is as important to you as it is. How large are your strings? – Adam Smith Nov 20 '14 at 08:05
  • @AdamSmith About 10-20 MB; do I really need to consider memory consumption for that? – Hongxu Chen Nov 20 '14 at 08:16

The only alternative would be to access the substrings using slicing instead of split, using str.find to locate the position of each delimiter. However, this would be slow and fiddly; if you can use split and let the original string drop out of scope, the manual approach is unlikely to be worth the effort.
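
A minimal sketch of that find-based approach (the `iter_split` generator below is my own naming, not a standard function): walk the string with `str.find` and yield one piece at a time, so at most one extra substring is alive at any moment.

def iter_split(s, delim):
    # walk s with str.find, yielding one piece at a time
    start = 0
    while True:
        end = s.find(delim, start)
        if end == -1:
            yield s[start:]  # final piece (or the whole string if delim is absent)
            return
        yield s[start:end]
        start = end + len(delim)

for piece in iter_split('data1 data2 data3', ' '):
    print(piece)  # data1, data2, data3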

You say that this string is input, so you might like to consider reading a smaller number of characters so you are dealing with more manageable chunks. Do you really need all the data in memory at the same time?
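
For instance, if the huge string is being read from a file in the first place, you could avoid materializing it at all. A hedged sketch, assuming a file-like object and a plain string delimiter (the `fields_from_file` helper is hypothetical):

def fields_from_file(f, delim, chunk_size=1 << 20):
    # Read f about chunk_size characters at a time and lazily yield
    # delim-separated fields; only one chunk plus one partial field
    # is ever held in memory. (Trailing empty fields are dropped.)
    leftover = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            if leftover:
                yield leftover  # final field with no trailing delimiter
            return
        pieces = (leftover + chunk).split(delim)
        leftover = pieces.pop()  # may be cut off mid-field; keep for next round
        for piece in pieces:
            yield piece

# e.g.: with open('big_input.txt') as f:
#           for field in fields_from_file(f, ' '): ...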

cdarke

Perhaps the Pythonic way would be to use iterators? That way, only one new substring is in memory at a time. Based on Splitting a string into an iterator:

import re
string_long = "my_string " * 100000000    # takes some memory (roughly 1 GB)
# strings_split = string_long.split()     # takes too much memory
strings_reiter = re.finditer(r"(\S+)\s*", string_long)  # lazy; builds no list
for match in strings_reiter:
    print(match.group(1))

This works fine without leading to memory problems.
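
The pattern above is hard-wired to whitespace; if your delimiter is some other character, the same trick adapts by matching runs of non-delimiter characters. A sketch assuming a single-character delimiter (the `split_iter` name is made up here; like the whitespace version, it skips empty fields):

import re

def split_iter(string, delim):
    # match maximal runs of characters that are not the delimiter
    pattern = re.compile('[^' + re.escape(delim) + ']+')
    return (m.group(0) for m in pattern.finditer(string))

for field in split_iter('data1,data2,data3', ','):
    print(field)  # data1, data2, data3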

BlackShift

If you're talking about strings that are SO huge that you can't stand to duplicate them in memory, then maybe running through the string once (O(n), probably improvable using str.find, but I'm not sure) and then storing a generator over slice objects would be more memory-efficient?

long_string = "abc,def,ghi,jkl,mno,pqr"  # ad nauseam
splitters = {','}  # add whatever single characters you want to split by

# one O(n) pass to record the index of every delimiter
marks = [i for i, ch in enumerate(long_string) if ch in splitters]

slices = []
start = 0
for end in marks:
    slices.append(slice(start, end))
    start = end + 1
slices.append(slice(start, None))  # the final piece after the last delimiter

# a generator, so substrings are materialized one at a time
split_string = (long_string[slice_] for slice_ in slices)
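
Consuming the generator then materializes one substring at a time; note that each `long_string[slice_]` still copies its characters (strings are immutable), but only one piece needs to be alive at any moment:

for piece in split_string:
    print(piece)  # 'abc', then 'def', then 'ghi', ...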
Adam Smith
  • I've never found the occasion to use it, but perhaps a [`mmap`](https://docs.python.org/2/library/mmap.html) would help as well... – mgilson Nov 20 '14 at 08:29