I am attempting to extract quotations and quotation attributions (i.e., the speaker) from text, but I am not obtaining the desired output. I am using textacy. Here is what I have tried so far:

import textacy
from textacy import extract
from textacy.representations import Vectorizer

data = [
        ("\"Hello, nice to meet you,\" said world 1", {"url": "example1.com", "date": "Jan 1"}),
        ("\"Hello, nice to meet you,\" said world 2", {"url": "example2.com", "date": "Jan 2"}),
        ]

corpus = textacy.Corpus("en_core_web_sm", data=data)

vectorizer = Vectorizer(tf_type="linear", idf_type="smooth")
doc = vectorizer.fit_transform(
    ((term.lemma_ for term in extract.terms(doc, ngs=1, ents=True)) for doc in corpus)
    ) 
         
quotes = (textacy.extract.triples.direct_quotations(doc) for records in doc)

print(list(quotes))

And here is the output:

[<generator object direct_quotations at 0x7fdc0faaf6d0>, <generator object direct_quotations at 0x7fdc0faaf5f0>]

The desired output is something like this:

[DQTriple(speaker=[world 1], cue=[said], content="Hello, nice to meet you,")] [DQTriple(speaker=[world 2], cue=[said], content="Hello, nice to meet you,")]

EDIT

Here is some improved code, with the doc now created using the corpus, not data:

import textacy
from textacy import extract
from textacy.representations import Vectorizer

data = [
        ("\"Hello, nice to meet you,\" said world 1", {"url": "example1.com", "date": "Jan 1"}),
        ("\"Hello, nice to meet you,\" said world 2", {"url": "example2.com", "date": "Jan 2"}),
        ]

corpus = textacy.Corpus("en_core_web_sm", data=data)

vectorizer = Vectorizer(tf_type="linear", idf_type="smooth")
doc = vectorizer.fit_transform(
    ((term.lemma_ for term in extract.terms(corpus, ngs=1, ents=True)) for record in corpus)
    ) 
         
print(list((textacy.extract.triples.direct_quotations(doc))))

But now I have a new error:

AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'is_space'

  • Did you try to turn the generators into a list (`list(textacy.extract.triples.direct_quotations(doc))`)? – fsimonjetz Jun 10 '22 at 18:26
  • Just tried, and that might be the right approach, but then I get the error `raise AttributeError(attr + " not found") AttributeError: lang_ not found` which is what I have gotten with other approaches as well. So I think there are additional issues. – jedmund Jun 10 '22 at 18:35
  • Look [here](https://textacy.readthedocs.io/en/latest/installation.html#downloading-data) and [here](https://spacy.io/usage/models). You have to install the spaCy language-specific model data to fix that: `python -m spacy download en_core_web_sm` – Timus Jun 10 '22 at 19:51
  • I actually had that installed already, and just re-installed to be sure. Also installed `python -m textacy download lang_identifier --version 2.0`. Still got the same error. – jedmund Jun 11 '22 at 04:13

1 Answer

This works:

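import textacy
from textacy import extract

# Plain strings are enough here; no metadata dicts are needed
data = [
    "\"Hello, nice to meet you,\" said world 1",
    "\"Hello, nice to meet you,\" said world 2",
]
for record in data:
    # direct_quotations() expects a spaCy Doc, not the matrix returned by Vectorizer
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    # The function returns a generator, so wrap it in list() to print the triples
    print(list(textacy.extract.triples.direct_quotations(doc)))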
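As for why the first attempt printed generator objects: the outer `list()` only consumed the outer generator expression, so each inner generator was left unevaluated. Here is a minimal sketch of that behaviour in plain Python. Note that `fake_direct_quotations` is a hypothetical stand-in I made up for illustration (it is not part of textacy and does no real parsing); it only mimics a function that yields results lazily.

```python
from typing import Iterator, List, Tuple


def fake_direct_quotations(text: str) -> Iterator[Tuple[str, str]]:
    """Hypothetical lazy extractor: yields one (speaker, content) pair."""
    content, _, speaker = text.partition(" said ")
    yield speaker, content.strip('"')


texts: List[str] = [
    '"Hello, nice to meet you," said world 1',
    '"Hello, nice to meet you," said world 2',
]

# list() consumes only the outer generator expression;
# every element is still an unconsumed inner generator.
lazy = list(fake_direct_quotations(t) for t in texts)

# Wrapping each call in list() consumes the inner generators too.
results = [list(fake_direct_quotations(t)) for t in texts]

print(lazy)
print(results)
```

The first `print` shows generator objects, much like the output in the question, while the second shows the actual values.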

This answer was posted as an edit to the question "Extract quotations and attribution from text" by the OP jedmund under CC BY-SA 4.0.
