Assumptions
- "...[I]s there a way to achieve the same result in one pass?"
- I assume you mean iterating over the contents of your "large ASCII delimited text file" exactly once.
- I further assume your posted snippet is a representative sample of your data.
Overview
You have three things here in your nested structure:
- The strings (the individual records)
- The sublists (lists of strings, separated by the inner delimiter '\x1f')
- The outer list (the list of sublists making up the large file, separated by the outer delimiter '\x1e')
Based on your sample data, both the strings and the sublists are short, so we should be able to do relatively expensive operations on them without too much performance loss. It's really the outer list we want to optimize for, and we want to make sure we only pass over that huge ASCII text file once.
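Concretely, judging from your sample data and expected output, the target structure looks something like this (outer list, containing sublists, containing strings):
[['10000', '4959', '4567'], ['20000', '456', '456']]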
The Algorithm
Understand that there is more than one way to do this, and I'm not claiming this is the most powerful, efficient or expressive way; YMMV depending on how closely my assumptions match reality. What this does is generate the sublists (read all about generators in this excellent SO question); the sublists themselves are plain lists rather than generators, since they are short. Consuming the generator then builds the outer list, while passing over the text only once.
>>> def generate_sublists(text, outer_delim='\x1e', inner_delim='\x1f'):
...     sublist = []
...     list_item = ''
...     for character in text:  # comb over the text exactly once (one loop)
...         if character == inner_delim:
...             # when we hit the inner delimiter,
...             # push the string onto the list
...             sublist.append(list_item)
...             # and reset the string placeholder to empty
...             list_item = ''
...         elif character == outer_delim:
...             # when we hit the outer delimiter, we generate a sublist
...             yield sublist
...             # and reset the sublist placeholder to empty
...             sublist = []
...         else:
...             # any other character we add onto the string placeholder
...             list_item += character
...
>>> # The sample data you provided
>>> text = '10000\x1f4959\x1f\4567\x1f\x1f\x1e20000\x1f456\x1f456\x1f\x1f\x1e'
>>> outer_list = []
>>> for sublist in generate_sublists(text):
...     outer_list.append(sublist)
...
>>> outer_list
[['10000', '4959', '.7', ''], ['20000', '456', '456', '']]
Wait, why doesn't this match my expected output?
There are some oddities in the sample data you posted. For example, the inner delimiter appears twice in a row ('\x1f\x1f'). My algorithm treats this as an empty record (an empty string in the sublist), while your expected output leaves it out. One possible fix for this is to filter the output. In other words:
>>> outer_list = []
>>> for sublist in generate_sublists(text):
...     outer_list.append(list(filter(bool, sublist)))  # list() so this also works on Python 3, where filter() is lazy
...
>>> outer_list
[['10000', '4959', '.7'], ['20000', '456', '456']]
Again, since your sublists are short, this shouldn't add much processing time. If it does, you could add another check before sublist.append(list_item):
...     for character in text:
...         if character == inner_delim:
...             if not list_item:
...                 continue
...             sublist.append(list_item)
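For reference, here is a sketch of the whole generator with that guard folded in (nothing else changed); on your sample data it reproduces the filtered output directly:
>>> def generate_sublists(text, outer_delim='\x1e', inner_delim='\x1f'):
...     sublist = []
...     list_item = ''
...     for character in text:
...         if character == inner_delim:
...             if not list_item:
...                 # skip empty items caused by doubled-up inner delimiters
...                 continue
...             sublist.append(list_item)
...             list_item = ''
...         elif character == outer_delim:
...             yield sublist
...             sublist = []
...         else:
...             list_item += character
...
>>> [sublist for sublist in generate_sublists(text)]
[['10000', '4959', '.7'], ['20000', '456', '456']]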
You've probably also noticed that in my output data I have .7 where you have 4567. That's because in your example input you have 4959\x1f\4567 (note the extra backslash -- maybe a typo?). That backslash causes \456 to be interpreted as an octal escape sequence. Using Python we can decipher this:
>>> 0o456
302
>>> 302 % 256
46
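>>> chr(302 % 256)  # the wrapped byte value 46 is the '.' character
'.'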
\456 in octal is the same as 302 in decimal, but the valid range for a single byte is 0-255, so Python applies modulo 256 to see what value it really becomes: 46, which is the '.' character. I assume that is probably a typo, though, since you don't expect it in your output.
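One last note: since your real input is a large file rather than an in-memory string, it's worth pointing out that generate_sublists only needs an iterable of characters, so you can feed it from the file without reading the whole thing into memory first and still touch each character exactly once. A rough sketch (the helper, the chunk size and the file name 'data.txt' are placeholders, not from your question):
>>> def iter_characters(fileobj, chunk_size=8192):
...     # read the file in fixed-size chunks and hand characters out one at a time
...     while True:
...         chunk = fileobj.read(chunk_size)
...         if not chunk:
...             break
...         for character in chunk:
...             yield character
...
>>> with open('data.txt') as f:
...     outer_list = [sublist for sublist in generate_sublists(iter_characters(f))]
...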