I am reading lines from a file and then looping over every overlapping substring of length k in each line in order to process those substrings. Is there a better (more efficient and more elegant) way to produce the substrings? Can I build the list without an explicit loop?

for line in lines[1::4]:
    startIdx = 0
    while startIdx + k <= len(line):
        substring = line[startIdx:(startIdx+k)]
        countFromSb[substring] = countFromSb.get(substring, 0) + 1
        startIdx += 1
    linesProcessed += 1
dusa
  • What are you trying to do? How about a regex? – OneCricketeer Jul 27 '17 at 14:06
  • Does your solution work? If so Why do you want something different? – wwii Jul 27 '17 at 14:06
  • @wwii just want to make it more efficient. I need to make more sweeps on the same text to traverse the substrings. Efficient in the sense that faster/doesn't require same computations over and over/doesn't keep large data structures in the memory – dusa Jul 27 '17 at 14:08
  • @cricket_007 I am trying to find substring matches but that is not the problem, I need to sweep multiple times over the text to get the substrings, I am trying to make that part more efficient/elegant than a while loop – dusa Jul 27 '17 at 14:11
  • @dusa If you have working example you should ask on https://codereview.stackexchange.com/ – user3053452 Jul 27 '17 at 14:16
  • Possible duplicate of [What's the best way to split a string into fixed length chunks and work with them in Python?](https://stackoverflow.com/questions/18854620/whats-the-best-way-to-split-a-string-into-fixed-length-chunks-and-work-with-the) – wwii Jul 27 '17 at 14:18
  • It doesn't look like you are doing anything unnecessary - maybe you could make use of [enumerate](https://docs.python.org/3/library/functions.html#enumerate) – wwii Jul 27 '17 at 14:20
  • Another duplicate - [Split python string every nth character?](https://stackoverflow.com/q/9475241/2823755) – wwii Jul 27 '17 at 14:23
  • @wwii, my interpretation of the question is that OP wants "abcdef" to yield "ab", "bc", "cd", "de", "ef" when window size is 2. The questions you are linking appear to yield "ab", "cd", "ef". – Kevin Jul 27 '17 at 14:25
  • @wwii chunkstring looks interesting but seems like it doesn't do overlaps. I will edit my question, those should be overlaps. I just need to run this while loop a few times in the complete program, so it feels unnecessary to do it more than once, but I also don't want to keep a large datastructure in the memory. – dusa Jul 27 '17 at 14:26
  • @kevin My bad I will retract my duplicate vote if I can – wwii Jul 27 '17 at 14:27
  • @kevin you are right. I want "ab", "bc", "cd", "de", "ef" – dusa Jul 27 '17 at 14:27
  • `while startIdx + k <= len(line):` - did you intend to *abandon* fragments at the end of each line that are less than `k` in length? – wwii Jul 27 '17 at 16:17
  • @wwii Oh no, thanks for bringing it up. It should be len(line)-k – dusa Jul 27 '17 at 17:41
  • @wwii No, no. It should stay as is, because I do this: startIdx + k – dusa Jul 28 '17 at 17:40

2 Answers

1

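It can be made more elegant by using a collections.Counter instance:

from collections import Counter

countFromSb = Counter()
# ...
n = -1  # keeps lines_processed at 0 when no lines are selected
for n, line in enumerate(lines[1::4]):
    # count every overlapping slice of length k in this line
    countFromSb.update(line[i:i+k] for i in range(1+len(line)-k))
lines_processed = n + 1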
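
For anyone who wants to try the Counter approach in isolation, here is a minimal, self-contained sketch. The sample `lines` list and the value of `k` are made up for illustration; they are not from the question, and the names are my own.

```python
from collections import Counter

# Hypothetical sample data: only every 4th line, starting at index 1, is processed.
lines = ["h1", "abcab", "+", "!!!!!", "h2", "bca", "+", "!!!"]
k = 2  # assumed window size

count_from_sb = Counter()
lines_processed = 0
for line in lines[1::4]:
    # One slice per start index; the range is empty when len(line) < k,
    # so lines shorter than k contribute nothing.
    count_from_sb.update(line[i:i + k] for i in range(len(line) - k + 1))
    lines_processed += 1

print(count_from_sb.most_common())
print(lines_processed)
```

Counter.update accepts any iterable, so the generator expression feeds the slices in one at a time without building an intermediate list per line.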
Gribouillis
1

A sequence of length N has N - k + 1 overlapping slices of length k, so you can't iterate over them in fewer than O(N) steps. Your current approach is already asymptotically as efficient as it gets.

In terms of elegance, you could abstract the iteration into its own function, which keeps your current scope less cluttered with index variables:

def iter_slices(s, size):
    for i in range(len(s)-size+1):
        yield s[i:i+size]

for line in lines[1::4]:
    for substring in iter_slices(line, k):
        countFromSb[substring] = countFromSb.get(substring, 0) + 1
    linesProcessed += 1
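
As a quick sanity check of the generator on its own, the sketch below runs it on the "abcdef" example from the comments. The variable names `windows` and `too_short` are only illustrative.

```python
def iter_slices(s, size):
    """Yield every overlapping slice of length `size` from `s`."""
    for i in range(len(s) - size + 1):
        yield s[i:i + size]

# Window size 2 over "abcdef" gives the overlapping pairs discussed in the comments.
windows = list(iter_slices("abcdef", 2))

# When the window is longer than the string, the range is empty and nothing is yielded.
too_short = list(iter_slices("abc", 5))

print(windows)
print(too_short)
```

Because the function is a generator, each slice is produced on demand; wrapping it in list() here is only for display.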

This can also be combined with Gribouillis' suggestion to use a Counter, eliminating the for blocks entirely:

from collections import Counter
countFromSb = Counter(substring for line in lines[1::4] for substring in iter_slices(line, k))
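
If holding every line in memory is the concern, one possible variation is to select the lines lazily with itertools.islice instead of slicing a list. This is only a sketch: `count_substrings` is a hypothetical helper name, the io.StringIO object stands in for a real open file, and stripping the trailing newline is my assumption about what the data needs.

```python
import io
from collections import Counter
from itertools import islice

def iter_slices(s, size):
    for i in range(len(s) - size + 1):
        yield s[i:i + size]

def count_substrings(handle, k):
    """Count overlapping k-length substrings of lines 1, 5, 9, ... of `handle`."""
    # islice(handle, 1, None, 4) picks the same lines as lines[1::4],
    # but reads them one at a time instead of from a full list.
    selected = (line.rstrip("\n") for line in islice(handle, 1, None, 4))
    return Counter(sub for line in selected for sub in iter_slices(line, k))

# Stand-in for an open file; the contents are invented sample data.
sample = io.StringIO("h1\nabcab\n+\n!!!!!\nh2\nbca\n+\n!!!\n")
counts = count_substrings(sample, 2)
print(counts.most_common())
```

The Counter itself still grows with the number of distinct substrings, so this only avoids keeping the raw lines around, not the counts.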
Kevin