So I'm trying to make my program print out the indexes of each word and punctuation, when it occurs, from a text file. I have done that part. - But the problem is when I'm trying to recreate the original text with punctuation using those index positions. Here is my code:
with open('newfiles.txt') as f:
s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {}
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
if match not in d.keys():
d[match] = i
i+=1
list_with_positions.append(d[match])
print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
word_base = [None] + [z.strip() for z in f_base.read().split()]
sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
print(' '.join(sentence_seq))
As i said the first part works fine but then i get the error:-
Traceback (most recent call last):
File "E:\Python\Indexes.py", line 33, in <module>
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
File "E:\Python\Indexes.py", line 33, in <listcomp>
sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer
This error occurs when the program runs through 'sentence_seq' towards the bottom of the code
newfiles is the original text file - a random article with more than one sentence with punctuation
list_with_positions is the list with the actual positions of where each word occurs within the original text
matches is the separated DIFFERENT words - if words repeat in the file (which they do) matches should have only the different words.
Does anyone know why I get the error?