How to use NLTK Regex patterns to annotate financial news with UP/DOWN indicator?

Question

I'm working on replicating an algorithm described in this paper: https://arxiv.org/pdf/1811.11008.pdf

On the last page it describes extracting a leaf defined in the grammar labelled 'NP JJ' using the following example: Operating profit margin was 8.3%, compared to 11.8% a year earlier.

I'm expecting to see a leaf labelled 'NP JJ' but I'm not. I'm tearing my hair out as to why (relatively new to regular expressions.)

import nltk
from nltk import word_tokenize

def split_sentence(sentence_as_string):
    '''Split a sentence into a list of word tokens.'''
    words = word_tokenize(sentence_as_string)

    return words

def pos_tagging(sentence_as_list):

    words = nltk.pos_tag(sentence_as_list)

    return words

def get_regex(sentence, grammar):

    sentence = pos_tagging(split_sentence(sentence))

    cp = nltk.RegexpParser(grammar) 

    result = cp.parse(sentence) 

    return result


example_sentence = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."

grammar = """JJ : {< JJ.∗ > ∗}
            V B : {< V B.∗ >}
            NP : {(< NNS|NN >)∗}
            NP P : {< NNP|NNP S >}
            RB : {< RB.∗ >}
            CD : {< CD >}
            NP JJ : : {< NP|NP P > +(< (>< .∗ > ∗ <) >) ∗ (< IN >< DT > ∗ < RB > ∗ < JJ > ∗ < NP|NP P >) ∗ < RB > ∗(< V B >< JJ >< NP >)∗ < V B > (< DT >< CD >< NP >) ∗ < NP|NP P > ∗ < CD > ∗ < .∗ > ∗ < CD > ∗| < NP|NP P >< IN >< NP|NP P >< CD >< .∗ > ∗ <, >< V B > < IN >< NP|NP P >< CD >}"""

grammar = grammar.replace('∗','*')

tree = get_regex(example_sentence, grammar)

print(tree)

Thanks for the feedback Alvas, really helpful. Not sure who downvoted your answer, I thought it was really intuitive. — Iain MacCormick, May 14 '20 at 12:42

alvas · Answer 1 · 2020-05-13T01:30:14.027

Firstly, see How to use nltk regex pattern to extract a specific phrase chunk?

Let's see what the POS tags for the sentence are:

from nltk import word_tokenize, pos_tag

text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
pos_tag(word_tokenize(text))

[out]:

[('Operating', 'NN'),
 ('profit', 'NN'),
 ('margin', 'NN'),
 ('was', 'VBD'),
 ('8.3', 'CD'),
 ('%', 'NN'),
 (',', ','),
 ('compared', 'VBN'),
 ('to', 'TO'),
 ('11.8', 'CD'),
 ('%', 'NN'),
 ('a', 'DT'),
 ('year', 'NN'),
 ('earlier', 'RBR'),
 ('.', '.')]

First gotcha! No `JJ` in any of the tags

None of the POS tags assigned to that sentence is JJ.
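A quick way to confirm this without reading the list by eye is to collect the distinct tags and test for the JJ prefix. This is only a minimal sketch: it hard-codes the tagger output shown above so that it runs without the NLTK models, and the variable names are made up for illustration.

```python
# Tagger output copied from the [out] listing above.
tagged = [
    ('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
    ('8.3', 'CD'), ('%', 'NN'), (',', ','), ('compared', 'VBN'), ('to', 'TO'),
    ('11.8', 'CD'), ('%', 'NN'), ('a', 'DT'), ('year', 'NN'),
    ('earlier', 'RBR'), ('.', '.'),
]

# Distinct tags present in the sentence.
tags_present = sorted({tag for _, tag in tagged})

# JJ, JJR and JJS all start with "JJ".
has_adjective = any(tag.startswith('JJ') for tag in tags_present)

print(tags_present)
print("Any JJ* tag:", has_adjective)
```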

Let's head back to the paper: https://arxiv.org/pdf/1811.11008.pdf

Thinking it through, the `NP JJ` chunk isn't the ultimate goal; the ultimate goal is to produce the `UP` or `DOWN` label based on some heuristics.

Let's rephrase the steps:

1. Parse the sentence with a parser (in this case a regular expression parser using some sort of grammar).
2. Identify a signal that the sentence has a pattern that can tell us about the ultimate label.

2a. Traverse the parse tree to extract another pattern that tells us about the performance indicator and numeric values.

2b. Use the extracted numeric values to determine the directionality UP / DOWN using some heuristics.

2c. Tag the sentence with the UP / DOWN identified in (2b).

Let's see which component we can build first.

2a. Extract another pattern that tells us about the performance indicator and numeric values.

We know that in this sentence a percentage is tagged as CD NN, from

('8.3', 'CD'), ('%', 'NN')
('11.8', 'CD'), ('%', 'NN')

So let's try catching that in the grammar.

from nltk import RegexpParser

patterns = """
PERCENT: {<CD><NN>}
"""

PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(word_tokenize(text)))

[out]:

Tree('S', [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'), 
  Tree('PERCENT', [('8.3', 'CD'), ('%', 'NN')]), 
(',', ','), ('compared', 'VBN'), ('to', 'TO'), 
  Tree('PERCENT', [('11.8', 'CD'), ('%', 'NN')]), 
('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')])
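One caveat worth checking: `<CD><NN>` matches any number followed by a singular noun, not only percentages. The sketch below mimics the PERCENT rule on plain (word, tag) pairs instead of an NLTK tree, and additionally keeps only pairs whose noun is literally %. The function name find_percentages is hypothetical, and it assumes every matching CD token parses as a float.

```python
def find_percentages(tagged):
    """Return floats for every CD token immediately followed by a '%' token tagged NN."""
    values = []
    # Walk over adjacent token pairs.
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if tag == 'CD' and next_tag == 'NN' and next_word == '%':
            values.append(float(word))
    return values

# Tagger output copied from the listing earlier in this answer.
tagged = [
    ('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
    ('8.3', 'CD'), ('%', 'NN'), (',', ','), ('compared', 'VBN'), ('to', 'TO'),
    ('11.8', 'CD'), ('%', 'NN'), ('a', 'DT'), ('year', 'NN'),
    ('earlier', 'RBR'), ('.', '.'),
]

print(find_percentages(tagged))

# A CD NN pair that is not a percentage is skipped.
print(find_percentages([('1', 'CD'), ('year', 'NN')]))
```

Whether the extra % check is wanted depends on the data; without it, the rule behaves like the `<CD><NN>` pattern above.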

Now, how do we get this:

Identify a signal that the sentence has a pattern that can tell us about the ultimate label.

We know that <PERCENT> compared to <PERCENT> is a good pattern, so let's try to encode it.

We know that "compared to" has the tags VBN TO from

 ('8.3', 'CD'),
 ('%', 'NN'),
 (',', ','),
 ('compared', 'VBN'),
 ('to', 'TO'),
 ('11.8', 'CD'),
 ('%', 'NN'),

How about this:

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT>}
"""

PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(word_tokenize(text)))

[out]:

Tree('S', [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'), 
           Tree('P2P', [
               Tree('PERCENT', [('8.3', 'CD'), ('%', 'NN')]), 
               (',', ','), ('compared', 'VBN'), ('to', 'TO'), 
               Tree('PERCENT', [('11.8', 'CD'), ('%', 'NN')])]
               ), 
           ('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')]
    )
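To see why the rule matches, it can help to think of the chunker as running an ordinary regular expression over the sequence of tags. The sketch below is only an approximation of that idea written with Python's re module, not NLTK's actual implementation: it writes the tags after "was" as one string of `<TAG>` items, with the two percentages already collapsed into `<PERCENT>`, and translates the tag pattern by hand.

```python
import re

# Tags following "was", with each CD NN pair already replaced by PERCENT.
tag_string = "<PERCENT><,><VBN><TO><PERCENT><DT><NN><RBR><.>"

# Hand translation of {<PERCENT><.*>?<VB.*><TO><PERCENT>}:
# inside <...>, ".*" is read as "any characters except the closing bracket".
p2p = re.compile(r"<PERCENT>(<[^>]*>)?<VB[^>]*><TO><PERCENT>")

match = p2p.search(tag_string)
print(match.group(0))

# The comma is optional, so the pattern also matches without it.
print(bool(p2p.search("<PERCENT><VBN><TO><PERCENT>")))
```

The optional `<.*>?` is what absorbs the comma; without it, the comma between the first percentage and "compared" would block the match.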

But that pattern could match any arbitrary pair of percentages. We need a signal for the `performance indicator`

Since I'm no expert in the financial domain, simply checking for the existence of "operating profit margin" might be a good signal, i.e.

from nltk import word_tokenize, pos_tag, RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT>}
"""

PChunker = RegexpParser(patterns)


text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."

indicators = ['operating profit margin']
for i in indicators:
    if i in text.lower():
        print(PChunker.parse(pos_tag(word_tokenize(text))))

[out]:

(S
  Operating/NN
  profit/NN
  margin/NN
  was/VBD
  (P2P
    (PERCENT 8.3/CD %/NN)
    ,/,
    compared/VBN
    to/TO
    (PERCENT 11.8/CD %/NN))
  a/DT
  year/NN
  earlier/RBR
  ./.)
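If more than one indicator matters, the substring check can be pulled into a small helper. The sketch below is an assumption about how one might organise it: find_indicator is a made-up name, and every indicator other than "operating profit margin" is an invented placeholder, not taken from the paper.

```python
# Only the first entry comes from the example; the others are hypothetical.
INDICATORS = ['operating profit margin', 'net sales', 'operating profit']

def find_indicator(text, indicators=INDICATORS):
    """Return the longest indicator phrase found in text, or None."""
    lowered = text.lower()
    hits = [phrase for phrase in indicators if phrase in lowered]
    # Prefer the longest hit so "operating profit margin" wins over "operating profit".
    return max(hits, key=len) if hits else None

text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
print(find_indicator(text))
print(find_indicator("The weather was nice."))
```

Only sentences for which this returns something other than None would then be passed on to the chunker.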

Now how do we get the `UP` / `DOWN`?

2b. Use the extracted numeric values to determine the directionality UP / DOWN using some heuristics

Just from the example sentence, nothing other than "earlier" tells us which of the two numbers came first.

So let's hypothesize this: if we have the pattern PERCENT VBN TO PERCENT earlier, we say that the 2nd percent is an older number.

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""

def traverse_tree(tree, label=None):
    # Yield the direct children of `tree` that are subtrees with the given label.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() == label:
            yield subtree

PChunker = RegexpParser(patterns)

parsed_text = PChunker.parse(pos_tag(word_tokenize(text)))
for p2p in traverse_tree(parsed_text, 'P2P'):
    print(p2p)

[out]:

(P2P
  (PERCENT 8.3/CD %/NN)
  ,/,
  compared/VBN
  to/TO
  (PERCENT 11.8/CD %/NN)
  a/DT
  year/NN
  earlier/RBR)

And the `UP` / `DOWN` label?

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""

PChunker = RegexpParser(patterns)


def traverse_tree(tree, label=None):
    # Yield the direct children of `tree` that are subtrees with the given label.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() == label:
            yield subtree

def labelme(text):
    parsed_text = PChunker.parse(pos_tag(word_tokenize(text)))
    for p2p in traverse_tree(parsed_text, 'P2P'):
        # Check if the subtree ends with "earlier".
        if p2p.leaves()[-1] == ('earlier', 'RBR'):
            # Collect the percentages: the current value first, then the earlier one.
            percentages = [float(num[0]) for num in p2p.leaves() if num[1] == 'CD']
            # Sanity check that there are only 2 numbers from our pattern.
            assert len(percentages) == 2
            # The 2nd percentage is the older number, so a larger 1st value means UP.
            if percentages[0] > percentages[1]:
                return 'UP'
            else:
                return 'DOWN'

text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."

labelme(text)

Now this raises some questions...

**Do you want to write so many rules and catch them using the labelme() above?**

Are the patterns you write foolproof?

E.g. will there be a case where the pattern that compares percentages using the indicator and "earlier" does not give "UP" or "DOWN" as expected?
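One way to probe this is to try small variations of the sentence against the rule. The sketch below does not call NLTK: it approximates the P2P rule with a plain regular expression over a string of `<TAG>` items, and the tag sequences for the reworded sentences are hand-assigned assumptions rather than real tagger output.

```python
import re

# Hand translation of {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>} into a
# plain regex over <TAG> items (an approximation, not NLTK's own code).
P2P_EARLIER = re.compile(
    r"<PERCENT>(<[^>]*>)?<VB[^>]*><TO><PERCENT>(<[^>]*>)*<RBR>"
)

# Tag strings with CD NN pairs already collapsed into PERCENT.
# The first comes from the example; the other two are hand-tagged guesses.
cases = {
    "8.3%, compared to 11.8% a year earlier":
        "<PERCENT><,><VBN><TO><PERCENT><DT><NN><RBR>",
    "8.3%, compared with 11.8% a year earlier":
        "<PERCENT><,><VBN><IN><PERCENT><DT><NN><RBR>",
    "8.3%, down from 11.8% a year earlier":
        "<PERCENT><,><RB><IN><PERCENT><DT><NN><RBR>",
}

results = {phrase: bool(P2P_EARLIER.search(tags)) for phrase, tags in cases.items()}
for phrase, matched in results.items():
    print(f"{matched!s:5}  {phrase}")
```

Under these assumed tags, only the original wording matches; "compared with" and "down from" would each need another rule.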

Why are we writing rules in the AI age?

Do you already have human-annotated data with sentences and their corresponding UP/DOWN labels? If so, let me suggest something like https://allennlp.org/tutorials or https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb
