
I want to create a spaCy Doc given that I have the raw text and the words, but I am missing the whitespace data.

import spacy
from spacy.tokens import Doc

nlp = spacy.blank('en')
doc = Doc(nlp.vocab, words=words, spaces=spaces)  # `spaces` is what I am missing

How do I do this correctly so that the information about whitespace is not lost? Example of the data I have:

data = {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
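
For the sample above, the spaces list I am effectively looking for (hand-computed here just to illustrate the expected result) would be:

spaces = [True, True, True, True, True, False, False]
doc = Doc(nlp.vocab, words=data['words'], spaces=spaces)
assert doc.text == data['text']  # no whitespace information lost

I want to derive that list automatically from the raw text.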

D V

1 Answer


Based on our discussion in the comments, I would suggest doing either of the following:

Preferred route:

Substitute in the spaCy pipeline those elements you want to improve. If you don't trust the POS tagger for some reason, substitute in a custom parser that is more fit for purpose. Optionally, you can train the existing POS tagger model with your own annotated data using a tool like Prodigy.
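
For example, a minimal sketch of swapping in your own component, assuming spaCy v2.x (current at the time of this thread); custom_tagger here is a hypothetical stand-in for your fit-for-purpose component:

import spacy

nlp = spacy.load('en_core_web_sm')

def custom_tagger(doc):
    # assign your own annotations to the tokens here
    return doc

# replace the stock tagger with the custom component
nlp.replace_pipe('tagger', custom_tagger)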

Quick and dirty route:

  1. Load the document as plain text into a spaCy doc.
  2. Loop over the tokens as spaCy parsed them and match them to your own list of tokens by checking whether all the characters match.
  3. If you don't get matches, handle the exceptions as input for a better tokenizer / check why your tokenizer is doing things differently.
  4. If you do get a match, load your additional information as extension attributes (https://spacy.io/usage/processing-pipelines#custom-components-attributes); see the sketch after this list.
    1. Use these extra attributes in further loops to check whether they match the spaCy parser, and output the eventual training dataset.
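
A minimal sketch of this quick-and-dirty route, assuming your custom tokens happen to line up one-to-one with spaCy's output; my_label and the labels list are hypothetical placeholders for whatever extra information you carry:

import spacy
from spacy.tokens import Token

# step 4: register a custom extension attribute
Token.set_extension('my_label', default=None)

nlp = spacy.load('en_core_web_sm')
data = {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
labels = ['DT', 'VBZ', 'RB', 'DT', 'NN', 'NN', '.']  # hypothetical per-token info

doc = nlp(data['text'])  # step 1: parse the plain text
if len(doc) != len(data['words']):
    raise ValueError('Tokenizers disagree; inspect why (step 3)')
for token, word, label in zip(doc, data['words'], labels):
    if token.text != word:  # step 2: character-level match
        raise ValueError('Token mismatch: %r vs %r (step 3)' % (token.text, word))
    token._.my_label = label  # step 4: attach your own annotation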
T. Altena
  • This will tokenize the text according to the spaCy parser. I need to tokenize the text according to data['words'] and, in doing so, not lose any info about whitespace. Maybe spaCy has an inbuilt function for this? This real text was already tokenized and I want the token map to be exactly the same. – D V May 06 '19 at 17:49
  • Sorry D V, I didn't get that from the question. Please see the answer here: https://stackoverflow.com/questions/53594690/is-it-possible-to-use-spacy-with-already-tokenized-input You can pretty much extend the spaCy pipeline to suit your use case. – T. Altena May 06 '19 at 17:51
  • Looks like that answer has the same issue. Info about whitespace is not preserved in doc = Doc(nlp.vocab, words=words). It will tokenize the text, but every token will be assumed to be followed by a whitespace. We would need to extract the whitespace from the raw text. Can spaCy do this based on the raw text and 'words'? (see the sketch after this comment thread) – D V May 06 '19 at 18:11
  • spaCy is deliberately non-destructive in its parsing, so you are kind of trying to get it to do something it doesn't want to do. Why do you want to keep your own whitespace info vs. customizing the tokenizer? – T. Altena May 06 '19 at 19:14
  • For example, I have tokenized, partially labeled data in spaCy format. I would like to manually correct some labels, to see if all dependencies and tags match the original dataset before I train my model... I will need to create a doc so that I can use the visualizer and correct it properly. And the spaCy data format doesn't store information about whitespace. – D V May 06 '19 at 19:41
  • I mean the training data format... like: TRAIN_DATA = [("They trade mortgage-backed securities.", {"heads": [1, 1, 4, 4, 5, 1, 1], "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"]})] – D V May 06 '19 at 19:55
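
A minimal sketch of reconstructing the spaces list from the raw text and the words, assuming every word occurs in the text in order (newer spaCy releases also ship a helper along these lines, spacy.util.get_words_and_spaces):

import spacy
from spacy.tokens import Doc

def infer_spaces(text, words):
    spaces = []
    offset = 0
    for word in words:
        start = text.index(word, offset)  # raises ValueError if text and words diverge
        offset = start + len(word)
        # True if the word is followed by a space, False otherwise (incl. end of text)
        spaces.append(offset < len(text) and text[offset] == ' ')
    return spaces

data = {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
nlp = spacy.blank('en')
spaces = infer_spaces(data['text'], data['words'])  # [True, True, True, True, True, False, False]
doc = Doc(nlp.vocab, words=data['words'], spaces=spaces)
assert doc.text == data['text']  # whitespace fully preserved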