I am getting a `MemoryError` with 64-bit Python. Here is my function:

import codecs
import re
import string

import numpy as np
from nltk.collocations import BigramCollocationFinder

def entr_langue(path,nom_langue):
    mots_ts=[]
    table_tr=dict((ord(char),None) for char in string.punctuation) # translation/mapping table
    with codecs.open(path,"r","utf-8") as filep:

        for i,line in enumerate(filep):
            # per-line extraction: drop the first token of each line
            line=" ".join(line.split()[1:])
            line=line.lower()
            line=re.sub(r"\d+"," ",line) # remove digits

            if len(line) !=0:
                line=line.translate(table_tr) # remove punctuation
                mots_ts += line # extends the list one character at a time
                mots_ts.append(" ") # add a space after each line

    ts_str=''.join(mots_ts)
    ts_str=re.sub(' +',' ',ts_str) # replace runs of spaces with a single space
    seq_ts=[i for i in ts_str]


    # now extract the bigrams and sort them by frequency
    fn=BigramCollocationFinder.from_words(seq_ts)
    fn.apply_freq_filter(6) # bigrams with a frequency below 6 are filtered out
    bigram_model=sorted(fn.ngram_fd.items(), key=lambda item: item[1],reverse=True)

    print (bigram_model)
    np.save(nom_langue+".npy",bigram_model)

The error:

File "C:/Users/msi/Documents/projIA/extraction_bigram.py", line 23, in entr_langue
    mots_ts += line
  MemoryError
martineau
  • How large is your input file and how much RAM is available? – Klaus D. Jan 12 '19 at 03:34
  • The line `mots_ts += line` is very inefficient. Use `.append()` and `.extend()` for lists. – Klaus D. Jan 12 '19 at 03:38
  • You may need to also install the 64-bit version of the NLTK (or reinstall it after installing the 64-bit version of Python). – martineau Jan 12 '19 at 03:39
  • @KlausD.: `list`s overload `+=` such that it's largely equivalent to `extend`. That said, there is a decent chance the OP should be using `append` here; since `line` is a `str`, `+=` (and `extend`) would both add each character from `line` individually, and they probably just want the whole line as a single value. – ShadowRanger Jan 12 '19 at 03:41
  • Side-note: Folks, please stop using `codecs.open`. [It's buggy, slow, and unnecessary on Python 2.6 and higher, where `io.open` is available](https://stackoverflow.com/a/46438434/364696). On Py3, `open` is an alias of `io.open`; on Py2, `io.open` is basically a correct, efficient version of `codecs.open`. `with io.open(path,encoding="utf-8"):` is what you want here. – ShadowRanger Jan 12 '19 at 03:43
  • It's also possible you can't use the NLTK with 64-bit Python... – martineau Jan 12 '19 at 03:43
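
The `+=` versus `append` point from the comments can be checked in a few lines. The sketch below also shows one possible way to apply both suggestions (`append` whole lines, read with `io.open`); `read_clean_text` and `PUNCT_TABLE` are hypothetical names that do not appear in the question, and the cleaning only approximates what `entr_langue` does.

```python
import io
import re
import string

# `+=` on a list with a str operand extends the list one character at a time.
chars = []
chars += "ab cd"          # chars == ['a', 'b', ' ', 'c', 'd']

# `append` stores the whole string as a single element.
lines = []
lines.append("ab cd")     # lines == ['ab cd']

# Hypothetical name: maps every punctuation code point to None (deletion).
PUNCT_TABLE = {ord(ch): None for ch in string.punctuation}


def read_clean_text(path):
    """Hypothetical helper: return the cleaned text as one string.

    Keeps one list element per line instead of one per character,
    then joins once at the end.
    """
    parts = []
    with io.open(path, encoding="utf-8") as fh:
        for line in fh:
            line = " ".join(line.split()[1:]).lower()  # drop the first token
            line = re.sub(r"\d+", " ", line)           # remove digits
            line = line.translate(PUNCT_TABLE)         # remove punctuation
            if line:
                parts.append(line)
    # Join with spaces, then collapse runs of spaces into one.
    return re.sub(" +", " ", " ".join(parts))
```

Whether this alone avoids the `MemoryError` depends on the file size, which the question does not state.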

1 Answer

If you don't get that error on 32-bit Python, the code was probably ported incorrectly, because 64-bit Python lets a list hold more elements than any standard PC has the memory to fill. But even on a 32-bit OS, a list can't hold more than fits in 4 GB (or something similar, I'm not sure).

There is a similar topic about memory limits: Memory errors and list limits?
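
As a rough illustration of how memory use can be inspected, here is a minimal sketch using `sys.getsizeof` from the standard library. The sample text is made-up data, not the asker's corpus, and the comparison only measures the containers themselves, so treat the numbers as indicative rather than exact.

```python
import sys

# Hypothetical stand-in for the cleaned corpus: one million characters.
text = "a" * 1000000

as_list = list(text)       # one list slot per character, like mots_ts in the question
as_str = "".join(as_list)  # the same characters held in a single string

# getsizeof reports the size of the object itself; for a list that is the
# header plus its array of references, not the elements it points to.
list_bytes = sys.getsizeof(as_list)
str_bytes = sys.getsizeof(as_str)

print("list of characters: %d bytes" % list_bytes)
print("single string:      %d bytes" % str_bytes)
```

On a 64-bit CPython build each list slot is a pointer, so the list of characters is expected to come out several times larger than the equivalent string.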

Guaz
  • I'm just assuming you have used up all the memory available to you. The link has example code for checking the size with the `sizeof()` function :) – Guaz Jan 12 '19 at 03:35
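
If memory really is being exhausted, one option is to count character bigrams line by line rather than building the full character list first. The sketch below is an assumption-laden alternative, not the asker's method: it swaps NLTK's `BigramCollocationFinder` for `collections.Counter`, and `count_char_bigrams`, `PUNCT_TABLE`, and `min_freq` are hypothetical names.

```python
import io
import re
import string
from collections import Counter

# Hypothetical name: maps every punctuation code point to None (deletion).
PUNCT_TABLE = {ord(ch): None for ch in string.punctuation}


def count_char_bigrams(path, min_freq=6):
    """Hypothetical helper: count character bigrams one line at a time.

    Only the Counter is kept in memory, so usage grows with the number
    of distinct bigrams rather than with the size of the file.
    """
    counts = Counter()
    prev = ""  # last character of the previous chunk, so bigrams span line breaks
    with io.open(path, encoding="utf-8") as fh:
        for line in fh:
            line = " ".join(line.split()[1:]).lower()                # drop first token
            line = re.sub(r"\d+", " ", line).translate(PUNCT_TABLE)  # digits, punctuation
            line = re.sub(" +", " ", line).strip()                   # collapse spaces
            if not line:
                continue
            chunk = prev + line + " "  # each line is followed by one space
            counts.update(zip(chunk, chunk[1:]))
            prev = " "
    # Keep bigrams seen at least min_freq times, most frequent first.
    kept = [(bigram, n) for bigram, n in counts.items() if n >= min_freq]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

The default `min_freq=6` mirrors the `apply_freq_filter(6)` call in the question, which drops bigrams occurring fewer than six times.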