Assumptions
- "...[I]s there a way to achieve the same result in one pass?"
- I assume you mean iterating over the contents of your "large ASCII delimited text file" exactly once.
- I further assume your posted snippet is a representative sample of your data.
Overview
You have three things here in your nested structure:
- The strings (the individual records)
- The sublists (lists of strings, separated by the inner delimiter '\x1f')
- The outer list (the list of sublists making up the large file, separated by the outer delimiter '\x1e')
Based on your sample data, both the strings and the sublists are short, so we should be able to do relatively expensive operations on them without too much performance loss. It's really the outer list we want to optimize for, and we want to make sure we only pass over that huge ASCII text file once.
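Concretely, judging from your sample data and expected output, the target structure looks something like this (outer list, containing sublists, containing strings):
[['10000', '4959', '4567'], ['20000', '456', '456']]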
The Algorithm
Understand that there is more than one way to do this, and I'm not claiming this is the most powerful, efficient or expressive way; YMMV depending on how closely my assumptions match reality. What this does is generate the sublists (read all about generators in this excellent SO question); the sublists themselves are plain lists rather than generators, since they are short. Consuming the generator then builds the outer list, while passing over the text only once.
>>> def generate_sublists(text, outer_delim='\x1e', inner_delim='\x1f'):
...     sublist = []
...     list_item = ''
...     for character in text:  # comb over the text exactly once (one loop)
...         if character == inner_delim:
...             # when we hit the inner delimiter,
...             # push the string onto the list
...             sublist.append(list_item)
...             # and reset the string placeholder to empty
...             list_item = ''
...         elif character == outer_delim:
...             # when we hit the outer delimiter, we generate a sublist
...             yield sublist
...             # and reset the sublist placeholder to empty
...             sublist = []
...         else:
...             # any other character we add onto the string placeholder
...             list_item += character
...
>>> # The sample data you provided
>>> text = '10000\x1f4959\x1f\4567\x1f\x1f\x1e20000\x1f456\x1f456\x1f\x1f\x1e'
>>> outer_list = []
>>> for sublist in generate_sublists(text):
...     outer_list.append(sublist)
...
>>> outer_list
[['10000', '4959', '.7', ''], ['20000', '456', '456', '']]
Wait, why doesn't this match my expected output?
There are some oddities in the sample data you posted. For example, the inner delimiter appears twice in a row ('\x1f\x1f'). My algorithm treats this as an empty record (an empty string in the sublist), while your expected output leaves it out. One possible fix for this is to filter the output. In other words:
>>> outer_list = []
>>> for sublist in generate_sublists(text):
...     outer_list.append(list(filter(bool, sublist)))  # list() so this also works on Python 3, where filter() is lazy
...
>>> outer_list
[['10000', '4959', '.7'], ['20000', '456', '456']]
Again, since your sublists are short, this shouldn't add much processing time. If it does, you could add another check before sublist.append(list_item):
...     for character in text:
...         if character == inner_delim:
...             if not list_item:
...                 continue
...             sublist.append(list_item)
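For reference, here is a sketch of the whole generator with that guard folded in (nothing else changed); on your sample data it reproduces the filtered output directly:
>>> def generate_sublists(text, outer_delim='\x1e', inner_delim='\x1f'):
...     sublist = []
...     list_item = ''
...     for character in text:
...         if character == inner_delim:
...             if not list_item:
...                 # skip empty items caused by doubled-up inner delimiters
...                 continue
...             sublist.append(list_item)
...             list_item = ''
...         elif character == outer_delim:
...             yield sublist
...             sublist = []
...         else:
...             list_item += character
...
>>> [sublist for sublist in generate_sublists(text)]
[['10000', '4959', '.7'], ['20000', '456', '456']]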
You've probably also noticed that in my output data I have .7 where you have 4567. That's because in your example input you have 4959\x1f\4567 (note the extra backslash -- maybe a typo?). That backslash causes \456 to be interpreted as an octal escape sequence. Using Python we can decipher this:
>>> 0o456
302
>>> 302 % 256
46
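>>> chr(302 % 256)  # the wrapped byte value 46 is the '.' character
'.'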
\456 in octal is the same as 302 in decimal, but the valid range for a single byte is 0-255, so Python applies modulo 256 to see what value it really becomes: 46, which is the '.' character. I assume that is probably a typo, though, since you don't expect it in your output.
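One last note: since your real input is a large file rather than an in-memory string, it's worth pointing out that generate_sublists only needs an iterable of characters, so you can feed it from the file without reading the whole thing into memory first and still touch each character exactly once. A rough sketch (the helper, the chunk size and the file name 'data.txt' are placeholders, not from your question):
>>> def iter_characters(fileobj, chunk_size=8192):
...     # read the file in fixed-size chunks and hand characters out one at a time
...     while True:
...         chunk = fileobj.read(chunk_size)
...         if not chunk:
...             break
...         for character in chunk:
...             yield character
...
>>> with open('data.txt') as f:
...     outer_list = [sublist for sublist in generate_sublists(iter_characters(f))]
...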