A better way to read in the substrings of a text without a loop (Python)

Question

I am reading lines from a file and then traversing every overlapping substring of length k in a loop to process those substrings. What would be a better (more efficient and more elegant) way to read in the substrings? How can I build the list without the loop?

for line in lines[1::4]:
    startIdx = 0
    while startIdx + k <= len(line):
        substring = line[startIdx:(startIdx+k)]
        countFromSb[substring] = countFromSb.get(substring, 0) + 1
        startIdx += 1
    linesProcessed += 1

Does your solution work? If so, why do you want something different? — wwii, Jul 27 '17 at 14:06
@wwii just want to make it more efficient. I need to make more sweeps on the same text to traverse the substrings. Efficient in the sense that faster/doesn't require same computations over and over/doesn't keep large data structures in the memory — dusa, Jul 27 '17 at 14:08
@cricket_007 I am trying to find substring matches but that is not the problem, I need to sweep multiple times over the text to get the substrings, I am trying to make that part more efficient/elegant than a while loop — dusa, Jul 27 '17 at 14:11
@dusa If you have working example you should ask on https://codereview.stackexchange.com/ — user3053452, Jul 27 '17 at 14:16
Possible duplicate of [What's the best way to split a string into fixed length chunks and work with them in Python?](https://stackoverflow.com/questions/18854620/whats-the-best-way-to-split-a-string-into-fixed-length-chunks-and-work-with-the) — wwii, Jul 27 '17 at 14:18
It doesn't look like you are doing anything unnecessary - maybe you could make use of [enumerate](https://docs.python.org/3/library/functions.html#enumerate) — wwii, Jul 27 '17 at 14:20
Another duplicate - [Split python string every nth character?](https://stackoverflow.com/q/9475241/2823755) — wwii, Jul 27 '17 at 14:23
@wwii, my interpretation of the question is that OP wants "abcdef" to yield "ab", "bc", "cd", "de", "ef" when window size is 2. The questions you are linking appear to yield "ab", "cd", "ef". — Kevin, Jul 27 '17 at 14:25
@wwii chunkstring looks interesting but seems like it doesn't do overlaps. I will edit my question, those should be overlaps. I just need to run this while loop a few times in the complete program, so it feels unnecessary to do it more than once, but I also don't want to keep a large datastructure in the memory. — dusa, Jul 27 '17 at 14:26
```while startIdx + k <= len(line):``` - did you intend to *abandon* fragments at the end of each line that are less than ```k``` in length? — wwii, Jul 27 '17 at 16:17
@wwii Oh no, thanks for bringing it up. It should be len(line)-k — dusa, Jul 27 '17 at 17:41
@wwii No, no..It should be as is..because I do this: startIdx + k — dusa, Jul 28 '17 at 17:40

Gribouillis · Answer 1 · 2017-07-27T14:39:29.953

1

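It can be made more elegant by using a collections.Counter instance:

from collections import Counter

countFromSb = Counter()
# ...
n = -1  # keeps lines_processed at 0 if there are no lines to iterate
for n, line in enumerate(lines[1::4]):
    # count every overlapping substring of length k in this line
    countFromSb.update(line[i:i+k] for i in range(1+len(line)-k))
lines_processed = n + 1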
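For anyone who wants to try this out, here is a minimal runnable sketch of the same Counter idea. The sample `lines` list and the value of `k` are made up for illustration (they are not from the question), and the snake_case names are just a stylistic choice.

```python
from collections import Counter

# Hypothetical sample data standing in for the file contents.
# As in the question, only every 4th line starting at index 1 is processed.
lines = ["@id1", "ACGTAC", "+", "IIIIII", "@id2", "GTACG", "+", "IIIII"]
k = 3  # assumed window size

count_from_sb = Counter()
lines_processed = 0
for line in lines[1::4]:
    # range() is empty when the line is shorter than k,
    # so short lines simply contribute no substrings.
    count_from_sb.update(line[i:i + k] for i in range(len(line) - k + 1))
    lines_processed += 1

print(count_from_sb)
print(lines_processed)
```

Counter.update accepts any iterable, so the generator expression feeds it one substring at a time without building an intermediate list.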

edited Jul 27 '17 at 14:39

answered Jul 27 '17 at 14:33


score 1 · Answer 2 · answered Jul 27 '17 at 14:35

You can't visit every fixed-size slice of a sequence in less than O(N) time, because a line of length N has N - k + 1 of them, so your current approach is already asymptotically as efficient as it gets.

In terms of elegance, you could abstract the iteration into its own function, which keeps your current scope less cluttered with one-letter variable names:

def iter_slices(s, size):
    for i in range(len(s)-size+1):
        yield s[i:i+size]

for line in lines[1::4]:
    for substring in iter_slices(line, k):
        countFromSb[substring] = countFromSb.get(substring, 0) + 1
    linesProcessed += 1

This can also be combined with Gribouillis' suggestion to use a Counter, eliminating the for blocks entirely:

from collections import Counter
countFromSb = Counter(substring for line in lines[1::4] for substring in iter_slices(line, k))
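A self-contained sketch of the generator-plus-Counter combination, in case it helps to see it run end to end. `sample_lines` and the window sizes are invented for the example. The first check mirrors the "abcdef" case from the comments, where a window of 2 should give overlapping pairs.

```python
from collections import Counter

def iter_slices(s, size):
    """Yield every overlapping slice of length `size` from `s`."""
    for i in range(len(s) - size + 1):
        yield s[i:i + size]

# Overlapping windows, not fixed chunks: each one starts one character later.
windows = list(iter_slices("abcdef", 2))

# Hypothetical input: the lines of interest sit at indices 1, 5, 9, ...
sample_lines = ["h1", "banana", "+", "q", "h2", "bandana", "+", "q"]
k = 3  # assumed window size

counts = Counter(
    substring
    for line in sample_lines[1::4]
    for substring in iter_slices(line, k)
)
print(windows)
print(counts)
```

Because iter_slices is a generator, the substrings are produced lazily and only the counts are kept in memory, which matters if the text has to be swept several times.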
