I am fairly new to spacy / textacy and I have a complicated task ahead. Your help is much appreciated.

In a nutshell, from a sentence like "Did assault paramedic by kicking and pushing him", I want to establish whether the reported abuse was against a police officer or other worker (ambulance, hospital staff, traffic warden, etc).

The challenges are:

- The language the officers write in is not standard English, and the sentences contain many punctuation and other errors.
- The subject is often omitted from the reports, so using 'textacy.extract.subject_verb_object_triples', for example, does not work because it cannot find a subject. (The subject is not needed here anyway: we already know that the individual has been charged with the abuse, and we only want to know from the text which category of worker they assaulted.)
- The text can consist of several sentences that give other context to the crime, or it might list a number of abuse charges against multiple types of workers in one text.

Examples:

1. "Did shout, swear and threaten her neighbours, assault A Police Officer."
2. "Did get ejected from a liecenced premises thereafter act aggressively towards his wife and push her.Did act in an aggressive threatening manner towards door staff and other persons.Did resist arrest.Did assault Police by biting and kicking."
3. "Accused did punch PC Smith then in the execution of his duty by throwing a punch towards his face to his non injury."
4. "Did throw a mobile phone at witness constable Smith"

What I am expecting to get is something like VERB, OBJECT, e.g. (punch, PC Smith), which would then need to be classified as "yes, this is a police officer". The compound objects could include PC (Police Constable), Sgt (Sergeant), etc.

I tried this:

import spacy
import textacy


nlp = spacy.load('en')
text = nlp(u'Did assault paramedic by kicking and pushing him')

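# subject_verb_object_triples returns a generator, so materialise it to inspect the result
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print(text_ext)  # nothing is extracted for this sentence, because no subject is found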

But that only works after adding a subject (which I do not need), as well as 'the' in front of the object (paramedic), so the sentence becomes "Accused did assault the paramedic by kicking and pushing him". I have 55k statements to begin with, so correcting the language by hand is not feasible.

How can I work around this issue? Thanks

Kristin

1 Answer

A good starting point would be to take the code for textacy.extract.subject_verb_object_triples() and modify it to work for your data. Be aware that your non-standard sentences might not end up with great dependency parses; it is also worth trying en_core_web_lg instead of the default model. From textacy.extract:

def subject_verb_object_triples(doc):
    """
    Extract an ordered sequence of subject-verb-object (SVO) triples from a
    spacy-parsed doc. Note that this only works for SVO languages.
    Args:
        doc (:class:`spacy.tokens.Doc` or :class:`spacy.tokens.Span`)
    Yields:
        Tuple[:class:`spacy.tokens.Span`]: The next 3-tuple of spans from ``doc``
        representing a (subject, verb, object) triple, in order of appearance.
    """
    # TODO: What to do about questions, where it may be VSO instead of SVO?
    # TODO: What about non-adjacent verb negations?
    # TODO: What about object (noun) negations?
    if isinstance(doc, Span):
        sents = [doc]
    else:  # spacy.Doc
        sents = doc.sents

    for sent in sents:
        start_i = sent[0].i

        verbs = spacy_utils.get_main_verbs_of_sent(sent)
        for verb in verbs:
            subjs = spacy_utils.get_subjects_of_verb(verb)
            if not subjs:
                continue
            objs = spacy_utils.get_objects_of_verb(verb)
            if not objs:
                continue

            # add adjacent auxiliaries to verbs, for context
            # and add compounds to compound nouns
            verb_span = spacy_utils.get_span_for_verb_auxiliaries(verb)
            verb = sent[verb_span[0] - start_i : verb_span[1] - start_i + 1]
            for subj in subjs:
                subj = sent[
                    spacy_utils.get_span_for_compound_noun(subj)[0]
                    - start_i : subj.i
                    - start_i
                    + 1
                ]
                for obj in objs:
                    if obj.pos == NOUN:
                        span = spacy_utils.get_span_for_compound_noun(obj)
                    elif obj.pos == VERB:
                        span = spacy_utils.get_span_for_verb_auxiliaries(obj)
                    else:
                        span = (obj.i, obj.i)
                    obj = sent[span[0] - start_i : span[1] - start_i + 1]

                    yield (subj, verb, obj)
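
As a rough sketch of what such a modification might look like (this is not textacy's code), the version below drops the subject requirement and walks the dependency parse directly, yielding (verb, object) pairs from direct objects and from prepositional objects such as "towards door staff". The dependency labels in OBJECT_DEPS, the helper names, and the ROLE_PATTERNS keyword lists are all assumptions made for illustration; a keyword lookup stands in for whatever classifier you end up training. Whether the parser tags "assault" as a verb in these clipped sentences is something to check on your own data.

```python
import re

# Assumed set of spaCy English dependency labels to treat as verb objects;
# inspect real parses of your statements and adjust.
OBJECT_DEPS = {"dobj", "obj", "dative"}
# Modifiers pulled into the object phrase, e.g. "PC" in "PC Smith".
MODIFIER_DEPS = {"compound", "amod"}

# Hypothetical keyword lexicon: worker category -> pattern over the object phrase.
ROLE_PATTERNS = {
    "police": re.compile(r"\b(?:police|pc|sgt|sergeant|constable)\b", re.I),
    "ambulance": re.compile(r"\b(?:paramedic|ambulance)\b", re.I),
    "hospital": re.compile(r"\b(?:nurse|doctor|hospital)\b", re.I),
    "traffic_warden": re.compile(r"\btraffic\s+warden\b", re.I),
}


def object_phrase(token):
    """Join a noun with its compound/adjective modifiers in document order."""
    parts = [c for c in token.children if c.dep_ in MODIFIER_DEPS] + [token]
    return " ".join(t.text for t in sorted(parts, key=lambda t: t.i))


def verb_object_pairs(doc):
    """Yield (verb lemma, object phrase) pairs; no subject is required."""
    for token in doc:
        if token.pos_ != "VERB":
            continue
        for child in token.children:
            if child.dep_ in OBJECT_DEPS:
                yield token.lemma_, object_phrase(child)
            elif child.dep_ == "prep":
                # Objects reached through a preposition ("at", "towards", ...).
                for grandchild in child.children:
                    if grandchild.dep_ == "pobj":
                        yield token.lemma_, object_phrase(grandchild)


def categorise(phrase):
    """Return the first category whose pattern matches, else 'other'."""
    for category, pattern in ROLE_PATTERNS.items():
        if pattern.search(phrase):
            return category
    return "other"


def classify_statement(text, nlp):
    """Parse `text` with a loaded spaCy pipeline `nlp` and label each pair."""
    return [(verb, obj, categorise(obj)) for verb, obj in verb_object_pairs(nlp(text))]
```

With a pipeline loaded via spacy.load("en_core_web_lg"), calling classify_statement(statement, nlp) on each of the 55k statements would give a list of (verb, object, category) tuples to review. Keeping the lexicon separate from the extraction means the keyword lists can later be swapped for a trained classifier without touching the parsing step.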
aab