IndexError: cannot fit 'int' into an index-sized integer

Question

I'm trying to make my program print the index of each word and punctuation mark as it occurs in a text file. That part works. The problem comes when I try to recreate the original text, with punctuation, from those index positions. Here is my code:

with open('newfiles.txt') as f:
    s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {} 
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i+=1
    list_with_positions.append(d[match])

print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]

print(' '.join(sentence_seq))

As I said, the first part works fine, but then I get this error:

Traceback (most recent call last):
    File "E:\Python\Indexes.py", line 33, in <module>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
    File "E:\Python\Indexes.py", line 33, in <listcomp>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer

This error occurs when the program builds 'sentence_seq', towards the bottom of the code.

newfiles is the original text file: a random article with more than one sentence, with punctuation.

list_with_positions is the list with the actual positions of where each word occurs within the original text.

matches is the separated DIFFERENT words: if words repeat in the file (which they do), matches should contain only the different words.

Does anyone know why I get the error?

your `int` must be too big for array indexing: probable duplicate of http://stackoverflow.com/questions/4751725/python-overflowerror-cannot-fit-long-into-an-index-sized-integer (not closing the question yet) — Jean-François Fabre, Jan 19 '17 at 17:22
@Jean-FrançoisFabre Indeed because we are replacing each word in the text file for integers (it's indexes) - probably around 60-80 words. So, does that mean the only way to overcome this is to use a shorter text file? — The World In 5, Jan 19 '17 at 17:27
Stab in the dark here. `file.write (''.join(str(e) for e in list_with_positions))` writes the data with no spaces, such that when you read it back in, your `split()` does nothing and actually you're trying to index by an 80-digit number. — roganjosh, Jan 19 '17 at 17:28
@roganjosh Wow that did solve a lot of the problem but the final output comes as - " They say it ' s a dog ' s life " instead of "They say it's a dog's life" - Is it a whitespace error between the punctuation? This happens for full stops too - i guess all the punctuation gets treated like the words because of the way i split the original file. Do you know any way to let there not be any unnecessary space between the punctuation (as you do need whitespace after a fullstop but not before. etc) — The World In 5, Jan 19 '17 at 17:33
In that case, try `sentence_seq = [word_base[int(i)].strip() for i in f_select.read().split()]`. I won't write as an answer yet because I can't test any of this — roganjosh, Jan 19 '17 at 17:36
@roganjosh Unfortunately there is no difference BTW i just noticed, there are random letter 's' in the final output. This just makes me totally confused. Here is what it outputs:- "They say it ' s a dog ' s ' s life , but s for Estrella" - not the unnecessary letter 's's in the output — The World In 5, Jan 19 '17 at 17:40
This is getting tough for me to visualise; without data it's difficult for anyone to keep track in the debugging. But again you do have `file.write(''.join(matches))` where you join words with no separation. What happens if you change that to `file.write(' '.join(matches))`? Really, I might be reaching my limit to what I can suggest without a test case here. — roganjosh, Jan 19 '17 at 17:45
@roganjosh Genuine question just curious not rude : Are you not on a computer or anything? Why cant you test it - not being rude genuinely asking - is there anything wrong with the code? And i did separate the matches filewrite when you suggested to do it for the other one so no luck so far — The World In 5, Jan 19 '17 at 17:49
The first line of your code: `with open('newfiles.txt') as f:`. I don't have `newfiles.txt`, that's on _your_ computer. There is the idea of an [MCVE](http://stackoverflow.com/help/mcve) here, so that people can replicate the issue easily. I don't know what your file contains, so I don't know if any test case I create is accurate to what you're using and if I can't be assured I can recreate the issue, it's wasted effort on my part to end up giving false advice. It always helps to try pinpoint the issue you have and make it easily reproducible :) — roganjosh, Jan 19 '17 at 17:54
Oh, sorry for being stupid - newfiles is just a random article with more than one sentence with punctuation. That's all that matters within the context of the question - just saying in case you are bothered enough :-) I've changed the question - thx for letting me know — The World In 5, Jan 19 '17 at 17:56
So if I create a file containing `Welcome to Stack Overflow. It's fine that you didn't quite create an MCVE on your first question as otherwise it's quite interesting.` then I'm set? :) — roganjosh, Jan 19 '17 at 18:01
Hopefully the last question. I'm really trying to stick with your current code but I'm finding it tough. The issue here is that punctuation cannot be included in the `join()`. Do you need to stick with your current format? — roganjosh, Jan 19 '17 at 18:40
No, not necessarily as long as it's along the same lines and does the requirements — The World In 5, Jan 19 '17 at 18:43
@Jean-FrançoisFabre first dibs here; I've identified the problem in my answer but I don't like my solution. Is there a cleaner way? OP may/may not accept as answer but I will upvote if you find a better way. — roganjosh, Jan 19 '17 at 19:52
@roganjosh see my improvement suggestions. I don't want to post a slightly better solution plagarizing yours while you did all the legwork with the OP. Lower part can be improved, upper part cannot with listcomps because you generate 2 lists. — Jean-François Fabre, Jan 19 '17 at 20:38

roganjosh · Answer 1 · 2017-01-19T20:50:28.647

The issue with your approach is using ''.join(), as this joins everything with no spaces. The immediate problem is that you then try to split() what is effectively one long run of digits with no whitespace, so what you get back is a single value with 100+ digits. Converting that to int succeeds, but the resulting number is far too large to be used as a list index, which is what the IndexError is telling you. Even with a shorter text there is a second problem: once indices reach double digits, split() has no way of telling where one number ends and the next begins when they are joined without spaces.
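
To make that concrete, here is a minimal sketch that reproduces the failure without any files. The `positions` list is just the start of the list from the question, and `word_base` is filled with hypothetical placeholder words, since only the indexing matters here.

```python
# Start of the position list from the question; the words are placeholders.
positions = [1, 2, 3, 4, 5, 6, 7, 4, 5, 8, 9, 10, 11, 12, 9, 13, 14, 15, 16, 17, 9, 18]
word_base = [None] + ["word%d" % n for n in range(1, max(positions) + 1)]

# Joined with no separator: split() finds no whitespace, so there is one chunk.
glued = "".join(str(p) for p in positions)
glued_chunks = glued.split()

try:
    word_base[int(glued_chunks[0])]  # one enormous integer used as an index
    glued_error = None
except IndexError as exc:
    glued_error = exc

# Joined with a space: split() recovers every number individually.
spaced = " ".join(str(p) for p in positions)
restored = [word_base[int(p)] for p in spaced.split()]

print(len(glued_chunks), repr(glued_error))
print(len(restored))
```

The only change between the failing and the working half is the separator passed to `join()`.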

Beyond that, you fail to treat punctuation properly. ' '.join() is equally invalid when trying to reconstruct a sentence because you have commas, full stops etc. getting whitespace on either side.
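
One way to handle the spacing, sketched under some assumptions: treat a word with an internal apostrophe (it's, dog's) as a single token, and only put a space before a token when it is not closing punctuation. The helper names `tokenize` and `detokenize` and the `NO_SPACE_BEFORE` set are made up for this illustration, and the rules only cover simple English prose; quotes, brackets and numbers would need more care.

```python
import re

# Punctuation that attaches to the preceding token (assumed set, not exhaustive).
NO_SPACE_BEFORE = set(".,;:!?")

def tokenize(text):
    # A word, optionally with internal apostrophes, or one non-letter symbol.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z]", text)

def detokenize(tokens):
    out = ""
    for token in tokens:
        # Separate tokens with a space unless the token is closing punctuation.
        if out and token not in NO_SPACE_BEFORE:
            out += " "
        out += token
    return out

sample = "They say it's a dog's life, but for Estrella, born without her front legs."
sample_tokens = tokenize(sample)
rebuilt = detokenize(sample_tokens)
print(sample_tokens)
print(rebuilt)
```

Because the apostrophe never becomes a token of its own, the stray " ' s " pieces from the comments above do not appear.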

I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from) but it still feels shaky to me. I dropped the regex; perhaps that was needed. I'm not immediately aware of a library for doing this kind of thing, but almost certainly there must be a better way.

import string

punctuation_list = set(string.punctuation) # Has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    indices = infile1.read().split()
    words = infile2.read().split()
    reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base])
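
Note that `index_dict[item] = index` in the listing above keeps the last position of a repeated word, because each assignment overwrites the previous one. If the goal is the scheme from the question, where a repeated word reuses the number it got the first time it appeared, a rough sketch of the full round trip could look like this. The sample text, the temporary file names and the tokenising regex are assumptions for illustration, not part of the original code.

```python
import os
import re
import tempfile

text = "They say it's a dog's life, but it's a good life."

# Words (apostrophes kept inside the word) or single punctuation marks.
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z]", text)

first_seen = {}   # token -> 1-based number assigned on first occurrence
positions = []
for token in tokens:
    if token not in first_seen:          # only number a token the first time
        first_seen[token] = len(first_seen) + 1
    positions.append(first_seen[token])

unique_tokens = list(first_seen)         # insertion order (Python 3.7+)

workdir = tempfile.mkdtemp()
words_path = os.path.join(workdir, "unique_words.txt")
positions_path = os.path.join(workdir, "positions.txt")

# Write both files with a space separator so split() can undo the join.
with open(words_path, "w") as f:
    f.write(" ".join(unique_tokens))
with open(positions_path, "w") as f:
    f.write(" ".join(str(p) for p in positions))

# Read back; the leading None makes the list 1-based like the positions.
with open(words_path) as f:
    word_base = [None] + f.read().split()
with open(positions_path) as f:
    restored_tokens = [word_base[int(p)] for p in f.read().split()]

closing = set(".,;:!?")
restored = ""
for token in restored_tokens:
    if restored and token not in closing:
        restored += " "
    restored += token

print(positions)
print(restored)
```

The `if token not in first_seen` check is what keeps the first number for a repeated word; everything else is the same write, read, and look-up sequence as in the question.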

edited Jan 19 '17 at 20:50

answered Jan 19 '17 at 19:11

roganjosh

Thx a lot man, thx for your time & skills it really really helps. :-) – The World In 5 Jan 19 '17 at 19:13
@TheWorldIn5 you're very welcome, this felt like it became much more difficult than it should have been; in your broader goals I would think there is a library for this. Dealing with punctuation is always going to be an issue otherwise. I wonder if NLTK can help. – roganjosh Jan 19 '17 at 19:15
@TheWorldIn5 again, you're welcome :) I think I've missed the mark on this one, feel free to un-accept my answer, which leaves it open... this means that more people might answer. I am interested in an answer myself for more general things. My answer pinpoints the main issue but I don't like how it deals with it; we can both learn from this – roganjosh Jan 19 '17 at 19:46
How come, when we print 'words', we get a seemingly random list of numbers instead of the position of each word in the file? I checked manually, and the integers in the output do not represent the occurrence of each word in the text. In case you want to know what my text actually says, here it is: "They say it's a dog's life, but for Estrella, born without her front legs, she's adapted to more of a kangaroo way of living. The Peruvian mutt hasn't let her disability hold her back, gaining celebrity status in the small town of Tinga Maria. " – The World In 5 Jan 19 '17 at 20:02
Instead of [1, 2, 3, 4, 5, 6, 7, 4, 5, 8, 9, 10, 11, 12, 9, 13, 14, 15, 16, 17, 9, 18, 4, 5, 19, 20, 21, 22, 6, 23, 24, 22, 25, 26, 27, 28, 29, 30, 4, 31, 32, 15, 33, 34, 15, 35, 9, 36, 37, 38, 39, 40, 41, 42, 22, 43, 44, 26], the actual positions of each word in the text, which I got by printing 'result' from the question above, I get ['0', '1', '2', '19', '4', '5', '6', '7', '8', '9', '10', '32', '12', '13', '14', '15', '16', '17', '41', '19', '20', '21', '41', '23', '24', '25', '26', '27', '28', '32', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] – The World In 5 Jan 19 '17 at 20:05
These are not the positions of the word occurrences. The other parts of the code work perfectly, just not this part. Or am I printing the wrong variable? – The World In 5 Jan 19 '17 at 20:06
Cannot replicate with `words` however there are two potential things here: dictionaries are not [ordered](http://stackoverflow.com/questions/15479928/why-is-the-order-in-dictionaries-and-sets-arbitrary) and also dictionaries need unique keys so the index position will be overwritten if a word occurs more than once. – roganjosh Jan 19 '17 at 20:07
your solution looks all right, a few comments: `indices = [item for item in infile1.read().split()]` => `indices = infile1.read().split()` (same for line below). Also `for item in word_base: if item in punctuation_list: reconstructed += item + ' ' else: reconstructed += ' ' + item + ' '` is ugly & underperformant. I'd write `reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base])`. But this isn't http://codereview-on-answers.stackexchange.com :) – Jean-François Fabre Jan 19 '17 at 20:36
@Jean-FrançoisFabre (s)he did, I asked for it to be revoked as I'm not sure I liked my approach. You have given valid feedback, working on it now. I don't know why I used list comp. for those – roganjosh Jan 19 '17 at 20:40
@roganjosh In the dictionary link you gave me, it says that we can use the first occurring key if the same word occurs more than once. That is what I want to do here too. How do I do that? – The World In 5 Jan 19 '17 at 21:34
@TheWorldIn5 What difference would it make? – roganjosh Jan 19 '17 at 21:35
Lol, that was half the purpose of my question. I had already done it, but I don't think I spelled it out in the question, because I assumed it would carry over into what you wrote. Since it doesn't, I'm trying to find a way to combine the two. – The World In 5 Jan 19 '17 at 21:38
