
AllenNLP Interpret and TextAttack are supposed to "attack" models to figure out why they generate the output they do. I have mostly used spaCy to train my models and would like to try either framework to see if it gives me a better understanding of my models. But it seems they're not compatible with spaCy models (or maybe I'm doing something wrong). For TextAttack I tried following this example: https://textattack.readthedocs.io/en/latest/quickstart/overview.html but swapping the model for a spaCy model. That didn't work, because inside the class TokenizedText there is

ids = tokenizer.encode(text)

which throws an error, because spaCy's Tokenizer object doesn't have a method called encode(). I noticed that TextAttack's Tokenizer has multiple subclasses, with a SpacyTokenizer among them. If that's the compatible version of Tokenizer, why isn't it detected and used automatically? I tried swapping them in myself, but got confused by some of the parameters SpacyTokenizer requires:

def __init__(self, word2id, oov_id, pad_id, max_seq_length=128)

word2id is a word-ID mapping, but which IDs? Does it cover every word in the vocab or just the tokens of this particular sentence? oov_id is even more confusing: "oov" presumably stands for "out-of-vocabulary", but in spaCy that is a boolean token attribute (is_oov), not an ID. pad_id is not explained at all and I have no idea what it is.

So there seems to be some connection between TextAttack and spaCy, but I can't figure out how to put it together into a working example.
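For reference, here is a minimal snippet (using a blank pipeline just for illustration, not my actual trained model) showing the mismatch: spaCy's Tokenizer produces Token objects, not integer IDs, so there is nothing for encode() to return:

```
import spacy

# blank pipeline just for illustration; my real models are trained spaCy pipelines
nlp = spacy.blank('en')

doc = nlp.tokenizer('An example sentence')
print([token.text for token in doc])  # ['An', 'example', 'sentence'] -- strings, not IDs

# TextAttack's TokenizedText expects to be able to call this:
# ids = nlp.tokenizer.encode('An example sentence')
# but spaCy's Tokenizer has no encode() method, so it raises AttributeError
```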

When it comes to AllenNLP Interpret, I tried using the HotFlip attack, but the very first thing that happens is this error:

for i in self.vocab._index_to_token[self.namespace]:
AttributeError: 'spacy.vocab.Vocab' object has no attribute '_index_to_token'

so this framework doesn't seem suited for spaCy either: it expects an _index_to_token attribute, which spaCy's Vocab doesn't have.
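Just to confirm what spaCy's Vocab actually exposes (again a blank pipeline, just for illustration):

```
import spacy

nlp = spacy.blank('en')

# spaCy's Vocab keeps its string mappings in a StringStore, not in an _index_to_token dict
print(type(nlp.vocab.strings))                # <class 'spacy.strings.StringStore'>
print(hasattr(nlp.vocab, '_index_to_token'))  # False -- hence the AttributeError above
```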

Can someone help me out?

Kaisa K

1 Answer


I'm one of the creators of TextAttack. Our built-in SpacyTokenizer uses spaCy to convert words to tokens, but takes a dictionary that maps those tokens to their corresponding IDs. This is so that you can pass in your embeddings' word-to-ID mapping and use those IDs with the spaCy tokens. This is how our models work behind the scenes.

I need a bit more information to help. When you train your model, how do you convert the text to IDs? Can you provide a snippet of code that uses your tokenizer to convert a string to a list of token IDs? Then I can show you how to wrap your tokenizer to work with TextAttack.

Hope that makes sense. We could definitely be smarter about detecting tokenizers and support more of them out of the box, and we'll work on that in future updates.
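In the meantime, here is a rough sketch of what the wrapping might look like. Treat it as a starting point, not a drop-in solution: the word2id below is just a stand-in built by enumerating spaCy's vocab (in practice you should reuse the word-to-ID mapping your embeddings were trained with), the import path for SpacyTokenizer may differ between TextAttack versions, and oov_id/pad_id here are just the IDs reserved for out-of-vocabulary words and for padding.

```
import spacy
from textattack.tokenizers import SpacyTokenizer  # import path may differ by TextAttack version

nlp = spacy.load('path/to/your/spacy_model')  # your trained spaCy pipeline

# stand-in word-to-ID mapping built from the lexemes spaCy already knows about;
# in practice, reuse the mapping your embedding matrix was built with
word2id = {lex.text: i for i, lex in enumerate(nlp.vocab)}
oov_id = len(word2id)      # ID reserved for out-of-vocabulary words
pad_id = len(word2id) + 1  # ID reserved for padding sentences up to max_seq_length

tokenizer = SpacyTokenizer(word2id, oov_id, pad_id, max_seq_length=128)

# this is what TokenizedText calls internally
ids = tokenizer.encode('An example sentence')
```

Whether those IDs actually line up with your model's embedding matrix is exactly the part I still need your training code for.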

jxmorris12
  • Hi, thanks for your answer :) It looks like spaCy uses the **Vocab** to store the string-ID mappings in **Vocab.strings**. Example: `import spacy; nlp = spacy.load('spacy_models/news4'); eple_id = nlp.vocab.strings['eple']`. Here **eple_id** is the hash value of the word 'eple' (apparently spaCy switched from IDs to hash values). The **Vocab** is also needed to create the **Tokenizer**: `tokenizer = Tokenizer(nlp.vocab)`. Does this help? – Kaisa K May 26 '20 at 08:07
  • Hi @KaisaK - that's part of the info I need. Can you explain how you'd then use the *Vocab* to query your model? Please give an example of how you'd tokenize a string and pass it to the model. Then I'll show you how to configure that for TextAttack! – jxmorris12 May 26 '20 at 18:29
  • When you pass a string to the **Language** object (nlp), it creates a **Tokenizer** object that is passed the string. The **Tokenizer** then splits the text and saves the span and hash of each token to a **Doc** object containing all the tokens (the **Doc** is defined like this: `cdef Doc doc = Doc(self.vocab)`). Note that **Tokenizer** and **Doc** are written in Cython/C++ (depending on the spaCy version), which I'm unfortunately not very familiar with, so I'm trying my best :) To get the token IDs you do: `doc = nlp(text_string)` and then `for token in doc: hash_id = nlp.vocab.strings[token.text]` (a runnable version of this appears after these comments). – Kaisa K Jun 03 '20 at 12:03
  • Each **Doc** object consists of multiple **Token** objects that store their string value in **.text**, while the **Vocab** is "a storage class for vocabulary and other data shared across a language", so it's created for each language based on lookup tables. Hope it makes sense somehow :) – Kaisa K Jun 03 '20 at 12:04
  • @KaisaK -- ok, I think I can help you! Do you mind raising an issue on our Github repo so we can respond more quickly over there? Thanks. – jxmorris12 Jun 04 '20 at 03:22
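For readability, here is the spaCy code from the comments above in runnable form. The model path and the word 'eple' are taken from the comments; any trained spaCy pipeline works the same way:

```
import spacy

nlp = spacy.load('spacy_models/news4')  # the asker's trained pipeline

# spaCy maps strings to 64-bit hash values via the StringStore on the Vocab
eple_id = nlp.vocab.strings['eple']

# tokenize a string and collect the hash value of every token
doc = nlp('eple')  # any text works here
hash_ids = [nlp.vocab.strings[token.text] for token in doc]
```

Note that these hash values are not small, contiguous indices, so they can't be used directly as rows of an embedding matrix; that is what the separate word2id mapping in SpacyTokenizer is for (see the answer above).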