
I have a strange problem that I can't get to the bottom of. I am testing a new lemmatizer for an NLP project. It works fine in the Jupyter notebook I was testing in, but as soon as I copy it over to a .py file for production, it fails with a StopIteration error (full traceback below). Any tips or suggestions on where to look? I have spent far too long trying workarounds, all to no avail. I am using the exact same test dataset for both, so it is not a difference in data frames; both use the same environment, and all of the code is identical.

Thanks in advance!

Here is the function:

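import re
from pattern.en import lemma

def prepareStringTEST(x):
    # Replace every character that is not a digit or lowercase letter with a space
    x = re.sub(r"[^0-9a-z]", " ", x)
    if len(x) == 0:
        return ''
    # Lemmatize each whitespace-separated token and rejoin with single spaces
    return " ".join([lemma(wd) for wd in x.split()])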

and here is how it is being called:

df['text_cleaned_test'] = df['text'].apply(lambda x: prepareStringTEST(x))

Here is the error message:

Traceback (most recent call last):
  File "C:\Users\xxx\AppData\Roaming\Python\Python39\site-packages\pattern\text\__init__.py", line 609, in _read
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "z:\CEC Python\NLP\clean_raw_text_new.py", line 138, in <module>
    df['text_cleaned_test'] = df['text'].apply(lambda x: prepareStringTEST(x))
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\series.py", line 4138, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas\_libs\lib.pyx", line 2467, in pandas._libs.lib.map_infer
  File "z:\CEC Python\NLP\clean_raw_text_new.py", line 138, in <lambda>
    df['text_cleaned_test'] = df['text'].apply(lambda x: prepareStringTEST(x))
  File "z:\CEC Python\NLP\clean_raw_text_new.py", line 75, in prepareStringTEST
    return " ".join([lemma(wd) for wd in x.split()])
  File "z:\CEC Python\NLP\clean_raw_text_new.py", line 75, in <listcomp>
    return " ".join([lemma(wd) for wd in x.split()])
  File "C:\Users\xxx\AppData\Roaming\Python\Python39\site-packages\pattern\text\__init__.py", line 2172, in lemma
    self.load()
  File "C:\Users\xxx\AppData\Roaming\Python\Python39\site-packages\pattern\text\__init__.py", line 2127, in load
    for v in _read(self._path):
RuntimeError: generator raised StopIteration

Here is some code to test:

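import re
from pattern.en import lemma

def prepareStringTEST(x):
    # Replace every character that is not a digit or lowercase letter with a space
    x = re.sub(r"[^0-9a-z]", " ", x)
    if len(x) == 0:
        return ''
    # Lemmatize each whitespace-separated token and rejoin with single spaces
    return " ".join([lemma(wd) for wd in x.split()])


string = '''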
Peter Navarro, who as a White House adviser to President Donald J. Trump worked to keep Mr. Trump in office after his defeat in the 2020 election, disclosed on Monday that he has been summoned to testify on Thursday to a federal grand jury and to provide prosecutors with any records he has related to the attack on the Capitol last year, including “any communications” with Mr. Trump.

The subpoena to Mr. Navarro — which he said the F.B.I. served at his house last week — seeks his testimony about materials related to the buildup to the Jan. 6 attack on the Capitol, and signals that the Justice Department investigation may be progressing to include activities of people in the White House.

Mr. Navarro revealed the existence of the subpoena in a draft of a lawsuit he said he is preparing to file against the House committee investigating the Jan. 6 attack, Speaker Nancy Pelosi and Matthew M. Graves, the U.S. attorney for the District of Columbia.
'''
print(prepareStringTEST(string.lower()))

Here are my results in Jupyter (in VS Code):

peter navarro who a a white house adviser to president donald j trump work to keep mr trump in office after hi defeat in the 2020 election disclose on monday that he have be summons to testify on thursday to a federal grand jury and to provide prosecutor with any record he have relate to the attack on the capitol last year include any communication with mr trump the subpoena to mr navarro which he say the f b i serve at hi house last week seek hi testimony about material relate to the buildup to the jan 6 attack on the capitol and signal that the justice department investigation may be progress to include activity of people in the white house mr navarro reveal the existence of the subpoena in a draft of a lawsuit he say he be prepare to file against the house committee investigate the jan 6 attack speaker nancy pelosi and matthew m grave the u  attorney for the district of columbia

Here are my results running the exact same code in a .py file (in VS Code):

PS Z:\CEC Python> & "C:/Program Files/Python39/python.exe" "z:/CEC Python/NLP/clean_raw_test_new.py"
Traceback (most recent call last):
  File "C:\Users\mkzou183\AppData\Roaming\Python\Python39\site-packages\pattern\text\__init__.py", line 609, in _read
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "z:\CEC Python\NLP\clean_raw_test_new.py", line 31, in <module>
    print(prepareStringTEST(string.lower()))
  File "z:\CEC Python\NLP\clean_raw_test_new.py", line 22, in prepareStringTEST
    return " ".join([lemma(wd) for wd in x.split()])
  File "z:\CEC Python\NLP\clean_raw_test_new.py", line 22, in <listcomp>
    return " ".join([lemma(wd) for wd in x.split()])
  File "C:\Users\mkzou183\AppData\Roaming\Python\Python39\site-packages\pattern\text\__init__.py", line 2172, in lemma
    self.load()
  File "C:\Users\mkzou183\AppData\Roaming\Python\Python39\site-packages\pattern\text\__init__.py", line 2127, in load
    for v in _read(self._path):
RuntimeError: generator raised StopIteration
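
For context on the last line of the traceback: since Python 3.7 (PEP 479), a `StopIteration` raised inside a generator body is converted into `RuntimeError: generator raised StopIteration`. The sketch below reproduces that behavior in isolation. `read_lines`, `read_lines_fixed`, and `consume` are hypothetical names; `read_lines` only stands in for the `_read` function named in the traceback and is not pattern's actual code.

```python
def read_lines(lines):
    """Hypothetical stand-in for a reader that ends with an explicit raise."""
    for line in lines:
        yield line
    # Old idiom for ending a generator. Since Python 3.7 this is
    # converted into RuntimeError("generator raised StopIteration").
    raise StopIteration


def read_lines_fixed(lines):
    """Same generator, ended with a plain return instead."""
    for line in lines:
        yield line
    return


def consume(gen):
    """Exhaust a generator and report either the items or the error text."""
    try:
        return list(gen), None
    except RuntimeError as exc:
        return None, str(exc)


print(consume(read_lines(["a", "b"])))
print(consume(read_lines_fixed(["a", "b"])))
```

Note that in `read_lines` every item has already been yielded by the time the exception fires, which may be why a second call into a library that caches what it read can succeed.
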
  • Is it possible that you have different versions of Python or of the NLP library in Jupyter vs. where you run the `.py` file? – Rabinzel May 31 '22 at 12:11
  • Thanks for the response. They are in the same folders, using the same environments and packages. I also tried rebuilding a new environment and redoing the python script on a different machine and still got the same results – Mike Zoucha May 31 '22 at 13:44
  • Just to be sure: did you check `sys.version` for Python and `.__version__` for your NLP library in Jupyter, and do the same in your IDE? Which IDE do you use? You could also share the code and I'll see if it runs on my system. – Rabinzel May 31 '22 at 13:59
  • Yes they are the same. How would you like me to send it? – Mike Zoucha May 31 '22 at 14:21
  • Is it too big to add to your question? There are some free sharing platforms where you can upload it, and I can get it from the link. – Rabinzel May 31 '22 at 14:26
  • I was able to append some test code to the bottom of my original post. Thank you so much! – Mike Zoucha May 31 '22 at 14:30
  • I'd need to know what lemma is – Rabinzel May 31 '22 at 14:37
  • Sorry about that. It is from pattern, so `import pattern` and `from pattern.en import lemma` – Mike Zoucha May 31 '22 at 14:43
  • I also found this doesn't work: `print(prepareStringTEST(string.lower()))` But this does: `try: print(prepareStringTEST(string.lower())) except: print(prepareStringTEST(string.lower()))` which makes absolutely 0 sense to me. – Mike Zoucha May 31 '22 at 14:44
  • I get the error in JupyterLab as well as in VS Code, running the code both as a `.py` and as an `.ipynb` file. Python version 3.8, pattern version 3.6 – Rabinzel May 31 '22 at 15:25
  • https://github.com/RaRe-Technologies/gensim/issues/2716 It looks like this is a bug. I just debugged the code, and it crashed at the point where you try to split the string into a list. https://stackoverflow.com/a/67544967/15521392 Here is a workaround; it seems like the pattern lib is not a good way to go. – Rabinzel May 31 '22 at 15:42
  • Two more things: 1) As mentioned in the 2nd link, it only crashes the first time you run it. I guess that is why you don't get an error in Jupyter: you probably ran the cell multiple times (I also get the error only the first time, then it works). 2) I think you want to change your regex pattern to `[^0-9a-zA-Z]`, or do you remove the capital letters on purpose? – Rabinzel May 31 '22 at 15:48
  • Thanks for the input! I did find that adding way too many try/excepts was able to solve the issue, so that makes sense! The regex is that way because the full script includes a .lower(), so it would be redundant. Do you have any suggestions for alternatives? I picked pattern because it is so damn fast compared to the others I have tried (nltk, spacy, etc.), and we have HUGE amounts of text data run through this script. – Mike Zoucha May 31 '22 at 16:11
  • OK, at least we know now why it doesn't work with `.py` but does with `.ipynb`. Sorry, I don't have enough expertise to give solid advice here. To be honest, I have only tried nltk myself a few times and didn't know the others. – Rabinzel May 31 '22 at 16:17
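
The comments above report that the error only appears on the first call and that a retry inside try/except succeeds. One way to keep that workaround in a single place, rather than scattering try/excepts, is a warm-up call made once before the real work. This is a minimal sketch, not a fix for the library itself: `warm_up` and `fake_lemma` are hypothetical names, and `fake_lemma` only imitates the fail-once behavior described here so that the example runs without pattern installed.

```python
def warm_up(func, sample="warmup"):
    """Call func once and swallow the one-time RuntimeError, if any."""
    try:
        func(sample)
    except RuntimeError:
        # Assumption: the failure happens only on the first call,
        # as reported in the comments, so it is safe to ignore here.
        pass


# Hypothetical stand-in for pattern.en.lemma: fails on the first call only.
_state = {"loaded": False}


def fake_lemma(word):
    if not _state["loaded"]:
        _state["loaded"] = True
        raise RuntimeError("generator raised StopIteration")
    return word.rstrip("s")  # toy "lemmatizer", for illustration only


warm_up(fake_lemma)  # absorbs the first-call failure
result = " ".join(fake_lemma(w) for w in "records prosecutors".split())
print(result)
```

In the script from the question, the equivalent would presumably be a single `warm_up(lemma)` right after `from pattern.en import lemma` and before the `df['text'].apply(...)` call. Catching only `RuntimeError` keeps unrelated errors visible, unlike a bare `except:`.
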
