What is nlp in spacy?

Question

Usually we start from:

nlp = spacy.load('en_core_web_sm') # or medium, or large

or

nlp = English()

then:

doc = nlp('my text')

After that we can do a lot of useful things without understanding what the first line actually does.

But what exactly is 'nlp'? What is going on under the hood? Is 'nlp' a pretrained model in the machine-learning sense, and therefore some big file stored somewhere on disk?

I came across an explanation that 'nlp' is an 'object containing the processing pipeline', but that explains only a little.

This doesn't seem like a programming question about non-working code, and should probably be on [Data Science](https://datascience.stackexchange.com/). Some research in the Spacy docs or forums would also probably directly answer your question; in particular, I think you're asking about the [`Doc`](https://spacy.io/api/doc) object. — Matt Hall, Sep 15 '22 at 16:39
@kwinkunks I am asking about the step that precedes instantiating the Doc, not about the Doc object. — Andrew Anderson, Sep 15 '22 at 17:50
SO has a lot of useful info not only about "not working code". E.g. https://stackoverflow.com/questions/53645882/pandas-merging-101 — Andrew Anderson, Sep 15 '22 at 17:51

score 2 · Answer 1 · answered Sep 15 '22 at 19:26

You can always check the type of any Python object:

import spacy

nlp = spacy.load('en_core_web_sm') # or medium, or large
print(type(nlp))
print(dir(nlp))  # view a list of attributes

You will get something like this (depending on the arguments passed):

<class 'spacy.lang.en.English'>

You are right that it is something like a 'pretrained' model: it is a Language object that contains a vocabulary, binary weights, etc.

Please check the official documentation:

https://spacy.io/api/language
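
As a rough sketch of the kind of inspection described above, the snippet below builds an nlp object and looks at its type and its pipeline. It uses spacy.blank("en") instead of spacy.load(...) purely as an assumption to keep the example independent of any downloaded model package; a blank pipeline has a tokenizer but no trained components.

```python
import spacy
from spacy.language import Language

# Assumption for illustration: a blank English pipeline, so that no
# trained model package (such as en_core_web_sm) has to be installed.
nlp = spacy.blank("en")

# nlp is an instance of a Language subclass.
print(type(nlp))
print(isinstance(nlp, Language))

# A blank pipeline has no components after the tokenizer.
print(nlp.pipe_names)

# Calling nlp on a string still works: the tokenizer produces a Doc.
doc = nlp("my text")
print([token.text for token in doc])
```

With a trained package loaded through spacy.load, the same nlp.pipe_names call lists the components that package ships with.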

score 0 · Answer 2 · answered Oct 27 '22 at 09:49

You can infer what nlp is by exploring it. For example:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")

text = "Elon Musk 889-888-8888 elonpie@tessa.net Jeff Bezos (345)123-1234 bezzi@zonbi.com Reshma Saujani example.email@email.com 888-888-8888 Barkevious Mingo"

text = nlp(text)

print(text)

This prints the exact same text. On the other hand, if you do:

for word in text.ents:
    print(word.text,word.label_)

you will get the entities of the string:

Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON

It is indeed a large pre-trained model for the English language, and it has several components (parser, lemmatizer, tagger) in addition to the named entity recognizer demonstrated above. Hope this helps a bit to clarify your question.

NLP from scratch · Answer 3 · 2023-08-10T15:53:52.577

nlp is a spaCy pipeline. You can see the details on it here: https://spacy.io/models/en#en_core_web_sm

Pipelines contain multiple components, in this case:

tok2vec: Token-to-vector layer that embeds tokens as vectors for other components to use (tokenization itself is done by a separate tokenizer that runs first)
tagger: Part-of-speech (POS) tagger
parser: Dependency parser
attribute_ruler: Attribute mapping based on rules
lemmatizer: Lemmatization (base forms of words)
ner: Named entity recognition

Hope this helps. There are more details in the documentation on pipelines here: https://spacy.io/usage/processing-pipelines
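
To make the "object containing a pipeline" idea concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: ToyDoc, ToyPipeline, whitespace_tokenizer, lowercaser and token_counter are hypothetical names invented for this illustration. The only point is the general shape: a callable object that tokenizes text into a document and then passes that document through its components in order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a processed document."""
    text: str
    tokens: List[str]
    annotations: Dict[str, object] = field(default_factory=dict)


class ToyPipeline:
    """Hypothetical pipeline object: a tokenizer plus ordered components."""

    def __init__(self, tokenizer: Callable[[str], ToyDoc]) -> None:
        self.tokenizer = tokenizer
        self.components: List[Tuple[str, Callable[[ToyDoc], ToyDoc]]] = []

    def add_pipe(self, name: str, component: Callable[[ToyDoc], ToyDoc]) -> None:
        self.components.append((name, component))

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self.components]

    def __call__(self, text: str) -> ToyDoc:
        # Step 1: turn raw text into a document.
        doc = self.tokenizer(text)
        # Step 2: let each component annotate the document in order.
        for _, component in self.components:
            doc = component(doc)
        return doc


def whitespace_tokenizer(text: str) -> ToyDoc:
    return ToyDoc(text=text, tokens=text.split())


def lowercaser(doc: ToyDoc) -> ToyDoc:
    doc.annotations["lower"] = [token.lower() for token in doc.tokens]
    return doc


def token_counter(doc: ToyDoc) -> ToyDoc:
    doc.annotations["n_tokens"] = len(doc.tokens)
    return doc


nlp = ToyPipeline(whitespace_tokenizer)
nlp.add_pipe("lowercaser", lowercaser)
nlp.add_pipe("token_counter", token_counter)

doc = nlp("My Text")
print(nlp.pipe_names)
print(doc.tokens, doc.annotations)
```

In a real trained pipeline the components are statistical models with learned weights rather than small functions, which is why the packaged pipelines are comparatively large.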
