How to initialize a `Doc` in textacy 0.6.2?

Question

Following the simple `Doc` initialization example from the docs doesn't work in Python 2:

>>> import textacy
>>> content = '''
...     The apparent symmetry between the quark and lepton families of
...     the Standard Model (SM) are, at the very least, suggestive of
...     a more fundamental relationship between them. In some Beyond the
...     Standard Model theories, such interactions are mediated by
...     leptoquarks (LQs): hypothetical color-triplet bosons with both
...     lepton and baryon number and fractional electric charge.'''
>>> metadata = {
...     'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
...     'author': 'Burton DeWilde',
...     'pub_date': '2012-08-01'}
>>> doc = textacy.Doc(content, metadata=metadata)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 120, in __init__
    {compat.unicode_, SpacyDoc}, type(content)))
ValueError: `Doc` must be initialized with set([<type 'unicode'>, <type 'spacy.tokens.doc.Doc'>]) content, not "<type 'str'>"

What should that simple initialization look like for a string or a sequence of strings?

UPDATE:

Passing unicode(content) to textacy.Doc() spits out

ImportError: 'cld2-cffi' must be installed to use textacy's automatic language detection; you may do so via 'pip install cld2-cffi' or 'pip install textacy[lang]'.

which would've been nice to have from the moment when textacy was installed, imo.

Even after installing cld2-cffi, running the code above raises

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 114, in __init__
    self._init_from_text(content, metadata, lang)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 136, in _init_from_text
    spacy_lang = cache.load_spacy(langstr)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/cache.py", line 99, in load_spacy
    return spacy.load(name, disable=disable)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/util.py", line 120, in load_model
    raise IOError("Can't find model '%s'" % name)
IOError: Can't find model 'en'

[Textacy's author said the ReadTheDocs documentation "builds stopped working months ago."](https://github.com/chartbeat-labs/textacy#links) The textacy documentation is not currently (Aug 2018) maintained on ReadTheDocs and lives here instead: https://chartbeat-labs.github.io/textacy – aaronpenne Aug 03 '18 at 15:44
Thanks for the pointer, fixed the link. The content of the initialization steps is identical. – arturomp Aug 03 '18 at 20:58

arturomp · Answer 1 · 2018-08-03T21:43:46.940

The issue, as shown in the traceback, is in the _init_from_text() function in textacy/doc.py, which tries to detect the language and then, in line 136, calls cache.load_spacy() with the string 'en'. spaCy cannot find a model by that name, hence the IOError. (The spacy repo touches on this in this issue comment.)

I solved this by providing a valid lang (unicode) string of u'en_core_web_sm' and by using unicode in the content and lang argument strings.

import textacy

content = u'''
    The apparent symmetry between the quark and lepton families of
    the Standard Model (SM) are, at the very least, suggestive of
    a more fundamental relationship between them. In some Beyond the
    Standard Model theories, such interactions are mediated by
    leptoquarks (LQs): hypothetical color-triplet bosons with both
    lepton and baryon number and fractional electric charge.'''

metadata = {
    'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
    'author': 'Burton DeWilde',
    'pub_date': '2012-08-01'}

doc = textacy.Doc(content, metadata=metadata, lang=u'en_core_web_sm')

Three things here all seem like bugs to me: passing a string instead of a unicode string changes behaviour (with a cryptic error message), a required package is missing after installation, and the way of using spacy language strings is perhaps outdated or not comprehensive.
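Before calling textacy.Doc(), it can help to check whether the full model name resolves in the current environment. The following is only a sketch: it assumes the model was installed as a pip package (which is what `python -m spacy download en_core_web_sm` does), so that it can be imported like any other package. `model_is_importable` is a hypothetical helper name, not part of textacy or spacy.

```python
import importlib


def model_is_importable(name):
    """Return True if `name` can be imported as a Python package.

    Assumption: the spacy model was installed via pip, so the model
    name (e.g. 'en_core_web_sm') is also an importable package name.
    """
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# Check the full model name used in the answer above.
if not model_is_importable("en_core_web_sm"):
    print("en_core_web_sm is not importable; install it before calling textacy.Doc()")
```

If the check fails, the full model name passed as lang will most likely fail in spacy.load() as well.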

Glad you figured this out! – aaronpenne Aug 03 '18 at 23:25

aaronpenne · Answer 2 · 2018-08-03T15:53:33.317

It appears you are using Python 2 and got a unicode error. In the textacy docs there is a note about some unicode nuances when using Python 2:

Note: In almost all cases, textacy (as well as spacy) expects to be working with unicode text data. Throughout the code, this is indicated as str to be consistent with Python 3’s default string type; users of Python 2, however, must be mindful to use unicode, and convert from the default (bytes) string type as needed.

Therefore I would give this a shot (note the u'''):

content = u'''
          The apparent symmetry between the quark and lepton families of
          the Standard Model (SM) are, at the very least, suggestive of
          a more fundamental relationship between them. In some Beyond the
          Standard Model theories, such interactions are mediated by
          leptoquarks (LQs): hypothetical color-triplet bosons with both
          lepton and baryon number and fractional electric charge.'''

This produced a Doc object as expected for me (on Python 3 though).
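The u''' prefix only helps for string literals. When the text comes from somewhere else (a file, a database, a web request), it may arrive as bytes and has to be decoded first. Below is a minimal sketch of such a conversion that runs on both Python 2 and Python 3; `to_text` is a hypothetical helper name, and UTF-8 is an assumed default encoding.

```python
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
# The conditional only evaluates the name `unicode` on Python 2.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(content, encoding="utf-8"):
    """Return `content` as unicode text, decoding bytes if needed."""
    if isinstance(content, text_type):
        return content
    if isinstance(content, bytes):  # `bytes` is `str` on Python 2
        return content.decode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(content))
```

The result of to_text(raw_content) can then be passed as the content argument to textacy.Doc() in place of the raw value.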

edited Aug 03 '18 at 15:53

answered Aug 03 '18 at 15:41

aaronpenne


What output do you get when you try this? – arturomp Aug 03 '18 at 15:42
I get a working `Doc` object as expected. I'm using Python 3 though so can't test your exact case. Good luck! – aaronpenne Aug 03 '18 at 15:48
at my computer/desktop view again. a previous answer suggested this, and the poster deleted it because it didn't work. – arturomp Aug 03 '18 at 20:20
here's my response to the previous answer: also, passing `unicode(content)` to `textacy.Doc()` spits out `ImportError: 'cld2-cffi' must be installed to use textacy's automatic language detection; you may do so via 'pip install cld2-cffi' or 'pip install textacy[lang]'.`, which would've been nice to have from when textacy was installed, imo. Even after installing it, it throws out `IOError: Can't find model 'en'`. (I think this is around the time when I gave up last night, and went back to `sklearn`...). – arturomp Aug 03 '18 at 20:22
but thanks for reviving this, it made me look into it again and figure it out. if possible, in the future, it'll help to try your solution before posting it as an answer. ;) – arturomp Aug 03 '18 at 20:56
perhaps try installing spacy as well then run `python -m spacy download en`? – aaronpenne Aug 03 '18 at 21:29
I posted a solution to this. :) (and running that `python -m spacy download en` unfortunately doesn't work (I even tried it) because the issue isn't a missing language package.) – arturomp Aug 03 '18 at 21:39
