35

I installed spaCy on my system and I want to extract person and organization names from English text. But I see that there are four models for English, and the models are versioned. I don't understand which model is the large one, or which one I should choose for development.

Anil Jagtap

2 Answers

36

sm/md/lg refer to the sizes of the models (small, medium, large respectively).

As it says on the models page you linked to,

Model differences are mostly statistical. In general, we do expect larger models to be "better" and more accurate overall. Ultimately, it depends on your use case and requirements. We recommend starting with the default models (marked with a star below).

FWIW, the sm model is the default (as alluded to above).
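For the use case in the question (person and organization names), a minimal sketch could look like the following. It assumes the model has already been downloaded and that the English pipelines tag these entities as `PERSON` and `ORG`; the helper names `extract_people_and_orgs` and `demo`, and the sample sentence, are illustrative only and not part of spaCy.

```python
def extract_people_and_orgs(doc):
    """Group entity texts from a spaCy Doc by label, keeping PERSON and ORG only."""
    found = {"PERSON": [], "ORG": []}
    for ent in doc.ents:
        if ent.label_ in found:
            found[ent.label_].append(ent.text)
    return found


def demo(model_name="en_core_web_sm"):
    """Load a model and print the entities found in a sample sentence."""
    import spacy  # imported here so the helper above works without spaCy installed

    nlp = spacy.load(model_name)  # swap in "en_core_web_lg" to compare results
    doc = nlp("Tim Cook is the CEO of Apple.")
    print(extract_people_and_orgs(doc))
```

Calling `demo()` and then `demo("en_core_web_lg")` on your own text is a quick way to see whether the larger model actually changes the entities you care about.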

AKX
  • Thank you so much. You made my day. I will go with the en_core_web_lg model. – Anil Jagtap May 23 '18 at 12:00
  • @AnilJagtap I would suggest starting with `_sm`, as spaCy recommends, though. That model is only 29 megabytes, while `_lg` is over 800. – AKX May 23 '18 at 12:08
  • I tested some of my text with **_sm**. I don't care about size; I want more accurate results, and I think **_lg** gives more accurate results than **_sm**. – Anil Jagtap May 23 '18 at 12:11
28

The difference is in the accuracy of the predictions.

But, as you can see in the comparison in the spaCy documentation, the difference is very small.

Here is en_core_web_lg (788 MB) compared to en_core_web_sm (10 MB):

  • LAS: 90.07% vs 89.66%
  • POS: 96.98% vs 96.78%
  • UAS: 91.83% vs 91.53%
  • NER F-score: 86.62% vs 85.86%
  • NER precision: 87.03% vs 86.33%
  • NER recall: 86.20% vs 85.39%

All that while en_core_web_lg is roughly 79 times larger, and therefore loads much more slowly.
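To put the trade-off in numbers, here is a small sketch that computes the accuracy gaps (in percentage points) and the size ratio. The figures are simply copied from the comparison above; nothing here is measured.

```python
# (lg, sm) scores in percent, copied from the comparison above.
scores = {
    "LAS": (90.07, 89.66),
    "POS": (96.98, 96.78),
    "UAS": (91.83, 91.53),
    "NER F-score": (86.62, 85.86),
    "NER precision": (87.03, 86.33),
    "NER recall": (86.20, 85.39),
}
size_mb = {"en_core_web_lg": 788, "en_core_web_sm": 10}

# Gap in percentage points between the large and the small model.
gaps = {metric: round(lg - sm, 2) for metric, (lg, sm) in scores.items()}
size_ratio = size_mb["en_core_web_lg"] / size_mb["en_core_web_sm"]

for metric, gap in gaps.items():
    print(f"{metric}: +{gap:.2f} points")
print(f"size ratio: {size_ratio:.1f}x")
```

Every gap is under one percentage point, while the download is about 79 times bigger.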

What I recommend is using en_core_web_sm while developing and then switching to a larger model in production. You can switch easily, just by changing the name of the model you load.

import spacy
nlp = spacy.load("en_core_web_lg")
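One way to make that switch without touching the code is to read the model name from the environment. This is only a sketch: the `SPACY_MODEL` variable and the helper names are made up for illustration and are not something spaCy defines.

```python
import os

DEFAULT_MODEL = "en_core_web_sm"  # small model for development


def pick_model_name(env=None):
    """Return the model name from SPACY_MODEL (hypothetical variable), else the default."""
    env = os.environ if env is None else env
    return env.get("SPACY_MODEL", DEFAULT_MODEL)


def load_pipeline(env=None):
    """Load whichever model pick_model_name() selects."""
    import spacy  # imported lazily so pick_model_name() works without spaCy installed

    return spacy.load(pick_model_name(env))
```

In production you would set `SPACY_MODEL=en_core_web_lg` before starting the application, while development machines fall back to the small model.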
typhon04