
I have two lists containing the same strings, except that the strings in the second list have slight variations, e.g. no capitalization, spelling errors, etc.

I want to check whether or not spaCy does anything differently between the two strings. This means that even if the strings aren't equivalent, I want to know if there are differences in the tagging and parsing.

I tried the following:

import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("foo")
doc2 = nlp("foo")

print(doc == doc2)

This prints False so == is not the way to go.

Ideally, I would want my code to find where potential differences are, but checking if anything at all is different would be a very helpful first step.

EDIT:

== was changed to work in newer spaCy versions. However, it only compares the documents at the text level. Comparing the dependency annotation is an entirely different story, and it had not been answered for spaCy before this thread.

Smogshaik
  • Possible duplicate of [In Python, when are two objects the same?](https://stackoverflow.com/questions/36898917/in-python-when-are-two-objects-the-same) – Wiktor Stribiżew Mar 08 '19 at 13:55

2 Answers


Token-Level Comparison

If you want to know whether the annotation is different, you'll have to go through the documents token by token to compare POS tags, dependency labels, etc. Assuming the tokenization is the same for both versions of the text, you can compare:

import spacy
nlp = spacy.load('en')
doc1 = nlp("What's wrong with my NLP?")
doc2 = nlp("What's wring wit my nlp?")
for token1, token2 in zip(doc1, doc2):
    print(token1.pos_, token2.pos_, token1.pos_ == token2.pos_)

Output:

NOUN NOUN True
VERB VERB True
ADJ VERB False
ADP NOUN False
ADJ ADJ True
NOUN NOUN True
PUNCT PUNCT True
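
If you also want to know where the analyses diverge, you can extend this to report only the mismatching positions, checking the dependency label and head index as well as the POS tag. This is just a sketch and still assumes the tokenization is identical:

import spacy

nlp = spacy.load('en')
doc1 = nlp("What's wrong with my NLP?")
doc2 = nlp("What's wring wit my nlp?")

for i, (token1, token2) in enumerate(zip(doc1, doc2)):
    # compare POS tag, dependency label, and head position for each token pair
    same = (token1.pos_ == token2.pos_
            and token1.dep_ == token2.dep_
            and token1.head.i == token2.head.i)
    if not same:
        print(i, token1.text, token2.text,
              token1.pos_, token2.pos_,
              token1.dep_, token2.dep_,
              token1.head.i, token2.head.i)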

Visualization for Parse Comparison

If you want to visually inspect the differences, you might be looking for something like What's Wrong With My NLP?. If the tokenization is the same for both versions of the document, then I think you can do something like this to compare the parses:

First, you'd need to export your annotation into a supported format (some version of CoNLL for dependency parses), which is something textacy can do. (See: https://www.pydoc.io/pypi/textacy-0.4.0/autoapi/export/index.html#export.export.doc_to_conll)

from textacy import export
print(export.doc_to_conll(nlp("What's wrong with my NLP?")))

Output:

# sent_id 1
1       What    what    NOUN    WP      _       2       nsubj   _       SpaceAfter=No
2       's      be      VERB    VBZ     _       0       root    _       _
3       wrong   wrong   ADJ     JJ      _       2       acomp   _       _
4       with    with    ADP     IN      _       3       prep    _       _
5       my      -PRON-  ADJ     PRP$    _       6       poss    _       _
6       NLP     nlp     NOUN    NN      _       4       pobj    _       SpaceAfter=No
7       ?       ?       PUNCT   .       _       2       punct   _       SpaceAfter=No

Then you need to decide how to modify things so you can see both versions of the token in the analysis. I'd suggest concatenating the tokens where there are variations, say:

1       What         what    NOUN    WP      _       2       nsubj   _       SpaceAfter=No
2       's           be      VERB    VBZ     _       0       root    _       _
3       wrong_wring  wrong   ADJ     JJ      _       2       acomp   _       _
4       with_wit     with    ADP     IN      _       3       prep    _       _
5       my           -PRON-  ADJ     PRP$    _       6       poss    _       _
6       NLP_nlp      nlp     NOUN    NN      _       4       pobj    _       SpaceAfter=No
7       ?            ?       PUNCT   .       _       2       punct   _       SpaceAfter=No

vs. the annotation for What's wring wit my nlp?:

1       What         what    NOUN    WP      _       3       nsubj   _       SpaceAfter=No
2       's           be      VERB    VBZ     _       3       aux     _       _
3       wrong_wring  wr      VERB    VBG     _       4       csubj   _       _
4       with_wit     wit     NOUN    NN      _       0       root    _       _
5       my           -PRON-  ADJ     PRP$    _       6       poss    _       _
6       NLP_nlp      nlp     NOUN    NN      _       4       dobj    _       SpaceAfter=No
7       ?            ?       PUNCT   .       _       4       punct   _       SpaceAfter=No
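
Producing these merged FORM columns by hand is tedious, so here is a rough sketch of how you could generate them from the two textacy exports above. It assumes doc_to_conll returns the CoNLL text as a string (as in the output shown earlier), that the tokenization is identical, and that each non-comment line has the usual tab-separated columns with the form in the second column:

import spacy
from textacy import export

nlp = spacy.load('en')
conll1 = export.doc_to_conll(nlp("What's wrong with my NLP?"))
conll2 = export.doc_to_conll(nlp("What's wring wit my nlp?"))

def merge_forms(conll_a, conll_b):
    # Replace the FORM column in both files with "form_a_form_b" wherever
    # the two texts differ, so both variants stay visible in whatswrong.
    out_a, out_b = [], []
    for line_a, line_b in zip(conll_a.splitlines(), conll_b.splitlines()):
        if line_a.startswith('#') or not line_a.strip():
            out_a.append(line_a)
            out_b.append(line_b)
            continue
        cols_a = line_a.split('\t')
        cols_b = line_b.split('\t')
        if cols_a[1] != cols_b[1]:
            merged = cols_a[1] + '_' + cols_b[1]
            cols_a[1] = cols_b[1] = merged
        out_a.append('\t'.join(cols_a))
        out_b.append('\t'.join(cols_b))
    return '\n'.join(out_a), '\n'.join(out_b)

conll1_merged, conll2_merged = merge_forms(conll1, conll2)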

Then you need to convert both files to an older version of CoNLL supported by whatswrong. (The main issue is just removing the commented lines starting with #.) One existing option is the UD tools CoNLL-U to CoNLL-X converter: https://github.com/UniversalDependencies/tools/blob/master/conllu_to_conllx.pl, and then you have:

1       What         what    NOUN    NOUN_WP _       2       nsubj   _       _
2       's           be      VERB    VERB_VBZ        _       0       root    _       _
3       wrong_wring  wrong   ADJ     ADJ_JJ  _       2       acomp   _       _
4       with_wit     with    ADP     ADP_IN  _       3       prep    _       _
5       my           -PRON-  ADJ     ADJ_PRP$        _       6       poss    _       _
6       NLP_nlp      nlp     NOUN    NOUN_NN _       4       pobj    _       _
7       ?            ?       PUNCT   PUNCT_. _       2       punct   _       _
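
If the Perl converter is more than you need, a minimal alternative is to strip the comment lines yourself before loading the files, since that is the main incompatibility mentioned above (note this sketch does not reproduce the merged POS column that the converter output shows):

def strip_comments(conll_text):
    # Drop CoNLL-U comment lines (starting with '#'); keep everything else.
    return '\n'.join(line for line in conll_text.splitlines()
                     if not line.startswith('#'))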

You can load these files (one as gold and one as guess) and compare them using whatswrong. Choose the format CoNLL 2006 (CoNLL 2006 is the same as CoNLL-X).

What's Wrong With My NLP? screenshot

This python port of whatswrong is a little unstable, but also basically seems to work: https://github.com/ppke-nlpg/whats-wrong-python

What's Wrong With My NLP? python port screenshot

Both of them seem to assume that we have gold POS tags, though, so that comparison isn't shown automatically. You could also concatenate the POS columns to be able to see both (just like with the tokens) since you really need the POS tags to understand why the parses are different.

For both the token pairs and the POS pairs, I think it would be easy to modify either the original implementation or the python port to show both alternatives separately in additional rows so you don't have to do the hacky concatenation.

aab

Try using spaCy's similarity() function.

For example:

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

The result would be the similarity score for each pair of tokens (similarity calculation screenshot).

Reference: https://spacy.io

Giang Nguyen
    This is not the right direction. This example is comparing tokens within one document to each other using lexical/semantic similarity metrics rather than comparing the *linguistic annotation* across two different documents. – aab Mar 08 '19 at 15:56