
I have this text in a PDF: "John is a french person that likes pancakes, he also likes to play soccer"

I want to iterate through the characters in the PDF text three at a time. I tried the code below, but I got the error: can only concatenate str (not "int") to str. I understand what this error means, but I am not sure how to solve it within the code.

pdf_text = pdf_file.getPage(1).extractText()

for c in pdf_text:
    print(pdf_text[c:c+3])

I was expecting to get a result, such as:

Joh
ohn
hn 
etc...

Any suggestions, with explanation, will be appreciated. Please let me know if you need more information. Thanks.

Edit: I was able to resolve this question utilizing the comment from @slider.

For educational purposes:

# len(text) - 2 start positions, so the last three characters are included
for c in range(len(text) - 2):
    print(text[c:c+3])
Patriots299

1 Answer


The code you provided does not do what you intend. Your text is a str, "John is a french person...", and your loop amounts to:

for char in text: print(text[char:char+3])

Here you can clearly see what's wrong -- char is not a valid index, because it is a str itself ("J" in the first iteration). Instead, you want to take the indices from the text, and because there are exactly as many indices as characters in the text, range(len(text)) does the trick.
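
To make the difference concrete, here is a minimal sketch using a short stand-in string instead of the PDF text (the value of text and the helper names chars, indices and error_message are assumptions for illustration). It shows that looping over a string yields characters, looping over range(len(text)) yields integer positions, and using a character in index arithmetic raises a TypeError like the one in the question:

```python
text = "John"  # stand-in for the extracted PDF text

# Iterating over the string yields one-character strings, not positions.
chars = [c for c in text]

# Iterating over range(len(text)) yields integer positions.
indices = [i for i in range(len(text))]

# Using a character in index arithmetic fails before any slicing happens,
# because str + int is not defined.
first_char = text[0]
try:
    text[first_char:first_char + 3]
except TypeError as exc:
    error_message = str(exc)

print(chars)
print(indices)
print(error_message)
```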

You say you want to advance three characters at a time. range() accepts a step argument (see docs), so with a step of 3 it yields every third index:

>>> [i for i in range(0, 10, 3)]
[0, 3, 6, 9]

Now you just have to decide what happens at the end of the text. Slicing past the end of a str does not raise an error, it simply returns a shorter string, so stop the range at len(text)-2 if you only want full three-character pieces:

steps = [i for i in range(0, len(text)-2, 3)]
for step in steps:
    print(text[step:step+3])

(Note that explicitly saying range(0, n) is the same as saying range(n))
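
If the trailing characters matter, one option is to let the last slice run past the end of the string, since slicing never raises for an out-of-range end. The sketch below wraps that in a helper; the name chunks and the size parameter are hypothetical, chosen only for this example:

```python
def chunks(text, size=3):
    """Return consecutive, non-overlapping pieces of text, size characters each."""
    # Slicing past the end of a string does not raise; it returns a shorter
    # string, so the last piece may have fewer than size characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


print(chunks("John is a French"))
```

Whether to keep or drop a short final piece depends on what you do with the pieces afterwards.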

Edit:

You say you need the pieces to overlap, so instead of skipping characters you simply iterate through every index of your text, again stopping at the last index that starts a full three-character slice (len(text)-3, which is why the range ends at len(text)-2):

steps = [i for i in range(len(text)-2)]
for step in steps:
    print(text[step:step+3])

which is the same as

for char_index in range(len(text)-2):
    print(text[char_index:char_index+3])
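
The same idea can be sketched for any window length. The helper name windows and the size parameter below are assumptions for illustration, not part of the original code:

```python
def windows(text, size=3):
    """Return every run of size consecutive characters in text, overlapping."""
    # The last full window starts at len(text) - size, so range() stops one past it.
    # If text is shorter than size, the range is empty and the result is [].
    return [text[i:i + size] for i in range(len(text) - size + 1)]


print(windows("John is"))
```

For "John is" this produces the windows "Joh", "ohn", "hn ", "n i" and " is", which matches the output the question asks for.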


mariogarcc
  • Hi mariogarcc, I sincerely appreciate you taking the time to provide an answer. However, it does not cover the overlapping of characters that I needed. I have edited my OP with the solution. – Patriots299 Dec 30 '18 at 04:02
  • Then the question is just the first part of the answer :) @FranJ I will edit my answer. – mariogarcc Dec 30 '18 at 04:04