2

Usually we start from:

import spacy

nlp = spacy.load('en_core_web_sm')  # or en_core_web_md, or en_core_web_lg

or

from spacy.lang.en import English
nlp = English()

then:

doc = nlp('my text')

After that we can do a lot of fun things with the result, even without understanding what the first line does.

But what exactly is 'nlp'? What is going on under the hood? Is 'nlp' a pretrained model, as understood in machine learning, and therefore some big file located somewhere on the disk?

I came across an explanation that 'nlp' is an 'object containing the processing pipeline', but that explains only a little.

desertnaut
Andrew Anderson
  • This doesn't seem like a programming question about non-working code, and should probably be on [Data Science](https://datascience.stackexchange.com/). Some research in the spaCy docs or forums would also probably answer your question directly; in particular, I think you're asking about the [`Doc`](https://spacy.io/api/doc) object. – Matt Hall Sep 15 '22 at 16:39
  • @kwinkunks I am asking about the step that precedes instantiating the Doc, not about the Doc object. – Andrew Anderson Sep 15 '22 at 17:50
  • SO has a lot of useful info that is not only about "not working code". E.g. https://stackoverflow.com/questions/53645882/pandas-merging-101 – Andrew Anderson Sep 15 '22 at 17:51

3 Answers

2

You can always check the type of any Python object:

import spacy

nlp = spacy.load('en_core_web_sm')  # or en_core_web_md, or en_core_web_lg
print(type(nlp))
print(dir(nlp))  # view a list of attributes

You will get something like this (depending on the arguments passed):

<class 'spacy.lang.en.English'>

You are right: it is something like a 'pretrained' model, as it contains the vocabulary, binary weights, etc.

Please check the official documentation:

https://spacy.io/api/language
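
As a rough illustration of the point above, the sketch below inspects a pipeline object and writes it to disk to see what it is made of. It assumes spaCy v3 is installed and uses `spacy.blank("en")` instead of a trained package, so no model download is needed. A blank pipeline has only the tokenizer and language data, with no trained weights. The temporary directory and the variable names are illustrative choices, not anything from the question.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.language import Language

# Assumption: a blank English pipeline (tokenizer and language data only).
nlp = spacy.blank("en")

is_language = isinstance(nlp, Language)  # English is a subclass of Language
component_names = nlp.pipe_names         # trained components would be listed here

# Calling the object runs the tokenizer, then every component, and returns a Doc.
doc = nlp("This is my text")
tokens = [token.text for token in doc]

# Serialize the pipeline to a temporary directory and list what was written.
with tempfile.TemporaryDirectory() as tmp:
    nlp.to_disk(tmp)
    saved = sorted(p.name for p in Path(tmp).iterdir())

print(type(nlp))
print(is_language, component_names)
print(tokens)
print(saved)
```

A trained package such as en_core_web_sm is laid out in a similar way on disk, with additional subdirectories holding the weights of each trained component.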

u1234x1234
0

You could infer what nlp() is by exploring it. For example:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")

text = "Elon Musk 889-888-8888 elonpie@tessa.net Jeff Bezos (345)123-1234 bezzi@zonbi.com Reshma Saujani example.email@email.com 888-888-8888 Barkevious Mingo"

text = nlp(text)

print(text)

This will print the exact same text. On the other hand, if you do:

for word in text.ents:
    print(word.text,word.label_)

you will get the entities of the string:

Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON

It is indeed a large pre-trained model for the English language, and it has many components (parser, lemmatizer, tagger) besides the entity recognizer demonstrated above. Hope this helps a bit to clarify your question.

lynx
0

nlp is a spaCy pipeline. You can see the details on it here: https://spacy.io/models/en#en_core_web_sm

Pipelines contain multiple components, in this case:

  • tok2vec: Token-to-vector layer that turns tokens into vectors shared by the components after it (the tokenizer itself runs before all of these components)
  • tagger: Part-of-speech (POS) tagger
  • parser: Dependency parser
  • attribute_ruler: Attribute mapping based on rules
  • lemmatizer: Lemmatization (base forms of words)
  • ner: Named entity recognition
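
The names of the components in a pipeline can be read from `nlp.pipe_names`. The sketch below is a minimal example of that, assuming spaCy v3 is installed. To avoid downloading a model it starts from a blank pipeline and adds the built-in rule-based `sentencizer`, which is not one of the components listed above. With a trained pipeline such as en_core_web_sm, the same attribute would list the trained components instead.

```python
import spacy

# Assumption: a blank pipeline, so no trained model needs to be installed.
nlp = spacy.blank("en")
before = list(nlp.pipe_names)  # no components registered yet

# Add a built-in, rule-based component by its registered name.
nlp.add_pipe("sentencizer")
after = list(nlp.pipe_names)

# The added component now annotates every Doc that nlp produces.
doc = nlp("This is one sentence. This is another.")
sentences = [sent.text for sent in doc.sents]

print(before, after)
print(sentences)
```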

Hope this helps. There are more details in the documentation on Pipelines here: https://spacy.io/usage/processing-pipelines
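
To make "an object containing a processing pipeline" more concrete, here is a toy sketch in plain Python. It is not spaCy's implementation. `ToyPipeline`, `ToyDoc`, `lowercaser` and `capital_flagger` are hypothetical names invented for illustration. It only shows the general shape: a callable object that tokenizes the text and then passes the result through a list of named components, each of which adds its own annotations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Holds the text plus whatever annotations the components add."""
    text: str
    tokens: List[str] = field(default_factory=list)
    annotations: Dict[str, list] = field(default_factory=dict)


Component = Callable[[ToyDoc], ToyDoc]


class ToyPipeline:
    """Hypothetical stand-in for an `nlp` object: a tokenizer plus named components."""

    def __init__(self) -> None:
        self._components: List[Tuple[str, Component]] = []

    def add_pipe(self, name: str, component: Component) -> None:
        self._components.append((name, component))

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self._components]

    def __call__(self, text: str) -> ToyDoc:
        # Naive whitespace tokenizer; a real tokenizer is far more careful.
        doc = ToyDoc(text=text, tokens=text.split())
        for _, component in self._components:
            doc = component(doc)  # each component annotates the doc in turn
        return doc


def lowercaser(doc: ToyDoc) -> ToyDoc:
    doc.annotations["lower"] = [t.lower() for t in doc.tokens]
    return doc


def capital_flagger(doc: ToyDoc) -> ToyDoc:
    doc.annotations["is_title"] = [t.istitle() for t in doc.tokens]
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("lowercaser", lowercaser)
toy_nlp.add_pipe("capital_flagger", capital_flagger)

toy_doc = toy_nlp("Elon Musk founded SpaceX")
print(toy_nlp.pipe_names)
print(toy_doc.tokens)
print(toy_doc.annotations)
```

In a trained spaCy pipeline the components are statistical models with learned weights rather than one-line functions, which is why the object behaves like a pretrained model even though it is a container.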