I am trying to extract quotations and their attributions (i.e., the speaker) from text using textacy, but I am not getting the output I expect. Here is what I have tried so far:
import textacy
from textacy import extract
from textacy.representations import Vectorizer
data = [
    ("\"Hello, nice to meet you,\" said world 1", {"url": "example1.com", "date": "Jan 1"}),
    ("\"Hello, nice to meet you,\" said world 2", {"url": "example2.com", "date": "Jan 2"}),
]
corpus = textacy.Corpus("en_core_web_sm", data=data)
vectorizer = Vectorizer(tf_type="linear", idf_type="smooth")
doc = vectorizer.fit_transform(
    ((term.lemma_ for term in extract.terms(doc, ngs=1, ents=True)) for doc in corpus)
)
quotes = (textacy.extract.triples.direct_quotations(doc) for records in doc)
print(list(quotes))
And here is the output:
[<generator object direct_quotations at 0x7fdc0faaf6d0>, <generator object direct_quotations at 0x7fdc0faaf5f0>]
The desired output is something like this:
[DQTriple(speaker=[world 1], cue=[said], content="Hello, nice to meet you,")] [DQTriple(speaker=[world 2], cue=[said], content="Hello, nice to meet you,")]
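For context on the output shown above: the printed objects are named `direct_quotations`, so the function is a generator function, and the outer generator expression yields one unconsumed generator per item. `list(quotes)` therefore collects generator objects rather than results. The sketch below reproduces that behaviour with a hypothetical stand-in extractor (`fake_quotations`, which is not part of textacy) so that it runs without any dependencies; `quotes_per_doc` is likewise an illustrative helper name, not a library function.

```python
from typing import Callable, Iterable, Iterator, List


def fake_quotations(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a lazy extractor; NOT textacy's implementation."""
    # Yield the text between the first pair of double quotes, if there is one.
    parts = text.split('"')
    if len(parts) >= 3:
        yield parts[1]


def quotes_per_doc(
    docs: Iterable[str], extractor: Callable[[str], Iterable[str]]
) -> List[List[str]]:
    """Consume one extractor generator per document into a concrete list."""
    return [list(extractor(doc)) for doc in docs]


texts = [
    '"Hello, nice to meet you," said world 1',
    '"Hello, nice to meet you," said world 2',
]

# Lazy: each element is still an unconsumed generator object.
lazy = list(fake_quotations(t) for t in texts)
print(lazy)

# Materialized: each inner generator is consumed into a list of results.
print(quotes_per_doc(texts, fake_quotations))
```

Because a generator's body does not run until it is consumed, the call in the code above never actually executed on `doc`, which at that point holds the vectorizer's output rather than a spaCy document. That would explain why no error was raised at this stage.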
EDIT
Here is a revised attempt, in which extract.terms is now called on the corpus and direct_quotations is called on doc directly:
import textacy
from textacy import extract
from textacy.representations import Vectorizer
data = [
    ("\"Hello, nice to meet you,\" said world 1", {"url": "example1.com", "date": "Jan 1"}),
    ("\"Hello, nice to meet you,\" said world 2", {"url": "example2.com", "date": "Jan 2"}),
]
corpus = textacy.Corpus("en_core_web_sm", data=data)
vectorizer = Vectorizer(tf_type="linear", idf_type="smooth")
doc = vectorizer.fit_transform(
    ((term.lemma_ for term in extract.terms(corpus, ngs=1, ents=True)) for record in corpus)
)
print(list((textacy.extract.triples.direct_quotations(doc))))
But now I get a different error:
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'is_space'