Firstly, see How to use nltk regex pattern to extract a specific phrase chunk?
Let's see what the POS tags are for the sentence:
from nltk import word_tokenize, pos_tag
text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
pos_tag(word_tokenize(text))
[out]:
[('Operating', 'NN'),
('profit', 'NN'),
('margin', 'NN'),
('was', 'VBD'),
('8.3', 'CD'),
('%', 'NN'),
(',', ','),
('compared', 'VBN'),
('to', 'TO'),
('11.8', 'CD'),
('%', 'NN'),
('a', 'DT'),
('year', 'NN'),
('earlier', 'RBR'),
('.', '.')]
First gotcha! No JJ in any of the tags
There's no JJ tag among the POS tags for that sentence.

Thinking about it though, the NP JJ pattern isn't the ultimate goal; the ultimate goal is to produce the UP or DOWN label based on some heuristics.
Let's rephrase the steps:
1. Parse the sentence with a parser (in this case, a regular expression parser using some sort of grammar).
2. Identify a signal that the sentence has a pattern that can tell us about the ultimate label.
2a. Traverse the parse tree to extract another pattern that tells us about the performance indicator and numeric values.
2b. Use the extracted numeric values to determine the directionality UP / DOWN using some heuristics.
2c. Tag the sentence with the UP / DOWN label identified in (2b).
Let's see which component we can build first.
2a. Extract another pattern that tells us about the performance indicator and numeric values.
We know that a percentage here is tagged CD NN, from
('8.3', 'CD'), ('%', 'NN')
('11.8', 'CD'), ('%', 'NN')
So let's try catching that in the grammar.
from nltk import RegexpParser

patterns = """
PERCENT: {<CD><NN>}
"""
PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(word_tokenize(text)))
[out]:
Tree('S', [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
Tree('PERCENT', [('8.3', 'CD'), ('%', 'NN')]),
(',', ','), ('compared', 'VBN'), ('to', 'TO'),
Tree('PERCENT', [('11.8', 'CD'), ('%', 'NN')]),
('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')])
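Once the PERCENT chunks exist, the numeric values can be read back out of the tree. Here is a minimal sketch, assuming the tagged tokens from the output above are pasted in by hand so that it runs without the tokenizer and tagger models; `percent_values` is a name introduced here for illustration only.

```python
from nltk import RegexpParser

# Tagged tokens copied by hand from the pos_tag output shown earlier
# (an assumption made so the sketch does not need the tagger models).
tagged = [
    ('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
    ('8.3', 'CD'), ('%', 'NN'), (',', ','), ('compared', 'VBN'), ('to', 'TO'),
    ('11.8', 'CD'), ('%', 'NN'), ('a', 'DT'), ('year', 'NN'),
    ('earlier', 'RBR'), ('.', '.'),
]

patterns = """
PERCENT: {<CD><NN>}
"""
chunker = RegexpParser(patterns)
tree = chunker.parse(tagged)

# Each PERCENT subtree holds (number, 'CD') followed by ('%', 'NN');
# take the word of the first leaf and convert it to a float.
percent_values = [
    float(subtree[0][0])
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'PERCENT')
]
print(percent_values)  # [8.3, 11.8]
```

The values come out in sentence order, which matters later when deciding which number is the older one.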
Now, how do we get this:
- Identify a signal that the sentence has a pattern that can tell us about the ultimate label.
We know that <PERCENT> compared to <PERCENT> is a good pattern, so let's try to encode it.
We know that "compared to" has the tags VBN TO, from
('8.3', 'CD'),
('%', 'NN'),
(',', ','),
('compared', 'VBN'),
('to', 'TO'),
('11.8', 'CD'),
('%', 'NN'),
How about this:
patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT>}
"""
PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(word_tokenize(text)))
[out]:
Tree('S', [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
Tree('P2P', [
Tree('PERCENT', [('8.3', 'CD'), ('%', 'NN')]),
(',', ','), ('compared', 'VBN'), ('to', 'TO'),
Tree('PERCENT', [('11.8', 'CD'), ('%', 'NN')])]
),
('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')]
)
But that pattern could match any arbitrary pair of percentages. We need a signal for the performance indicator.
Since I'm no expert in the financial domain, simply checking for the existence of "operating profit margin" might be a good signal, i.e.
from nltk import word_tokenize, pos_tag, RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT>}
"""
PChunker = RegexpParser(patterns)

text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
indicators = ['operating profit margin']

for i in indicators:
    if i in text.lower():
        print(PChunker.parse(pos_tag(word_tokenize(text))))
[out]:
(S
Operating/NN
profit/NN
margin/NN
was/VBD
(P2P
(PERCENT 8.3/CD %/NN)
,/,
compared/VBN
to/TO
(PERCENT 11.8/CD %/NN))
a/DT
year/NN
earlier/RBR
./.)
Now how do we get the UP / DOWN?
2b. Use the extracted numeric values to determine the directionality UP / DOWN using some heuristics
Just from the example sentence, nothing other than "earlier" tells us which number came first.
So let's hypothesize this: if we have the pattern PERCENT VBN TO PERCENT earlier, we say that the 2nd percent is the older number.
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""

def traverse_tree(tree, label=None):
    # Yield the direct children of `tree` that are subtrees with the given label.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() == label:
            yield subtree

text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
PChunker = RegexpParser(patterns)
parsed_text = PChunker.parse(pos_tag(word_tokenize(text)))

for p2p in traverse_tree(parsed_text, 'P2P'):
    print(p2p)
[out]:
(P2P
(PERCENT 8.3/CD %/NN)
,/,
compared/VBN
to/TO
(PERCENT 11.8/CD %/NN)
a/DT
year/NN
earlier/RBR)
And the UP / DOWN label?
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""
PChunker = RegexpParser(patterns)

def traverse_tree(tree, label=None):
    # Yield the direct children of `tree` that are subtrees with the given label.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() == label:
            yield subtree

def labelme(text):
    parsed_text = PChunker.parse(pos_tag(word_tokenize(text)))
    for p2p in traverse_tree(parsed_text, 'P2P'):
        # Check if the subtree ends with "earlier".
        if p2p.leaves()[-1] == ('earlier', 'RBR'):
            # Collect the numbers: the first is the current value, the second the earlier one.
            percentages = [float(num[0]) for num in p2p.leaves() if num[1] == 'CD']
            # Sanity check that there are only 2 numbers from our pattern.
            assert len(percentages) == 2
            # A current value above the earlier one means the indicator went up.
            if percentages[0] > percentages[1]:
                return 'UP'
            else:
                return 'DOWN'

text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
labelme(text)
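The comparison step can also be tried on its own, without the tokenizer or tagger. The following is only a sketch under stated assumptions: `direction_from_leaves` is a hypothetical helper, the leaves are typed in by hand to mirror the P2P chunk printed earlier, and the extra 'FLAT' label for two equal percentages is an addition that is not part of the rules above.

```python
def direction_from_leaves(leaves):
    """Label a list of (word, tag) leaves taken from a P2P chunk.

    Assumes the first CD is the current value and the second CD is the
    earlier one, following the "earlier" hypothesis.
    """
    values = [float(word) for word, tag in leaves if tag == 'CD']
    if len(values) != 2:
        return None  # not the two-number pattern we expect
    current, earlier = values
    if current > earlier:
        return 'UP'
    if current < earlier:
        return 'DOWN'
    return 'FLAT'  # hypothetical third label for equal values


# Leaves written by hand to mirror the P2P chunk printed earlier.
leaves = [
    ('8.3', 'CD'), ('%', 'NN'), (',', ','), ('compared', 'VBN'), ('to', 'TO'),
    ('11.8', 'CD'), ('%', 'NN'), ('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'),
]
print(direction_from_leaves(leaves))  # DOWN
```

Returning None instead of asserting lets the caller decide what to do with sentences that do not fit the two-number pattern.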
Now this raises some questions...
**Do you want to write so many rules and catch them using the labelme() above?**
Are the patterns you write foolproof?
E.g. will there be a case where the pattern to compare percentages using the indicator and "earlier" does not give "UP" or "DOWN" as expected?
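As one possible case, here is a minimal sketch with a sentence that states the same kind of comparison as "rose to ... from ..." instead of "compared to". The tags are written by hand as an assumption (they are not the output of pos_tag), so the snippet only exercises the grammar.

```python
from nltk import RegexpParser

patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""
chunker = RegexpParser(patterns)

# Hand-written tags (an assumption) for the sentence:
# "Operating profit margin rose to 11.8% from 8.3% a year earlier."
tagged = [
    ('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('rose', 'VBD'),
    ('to', 'TO'), ('11.8', 'CD'), ('%', 'NN'), ('from', 'IN'),
    ('8.3', 'CD'), ('%', 'NN'), ('a', 'DT'), ('year', 'NN'),
    ('earlier', 'RBR'), ('.', '.'),
]

tree = chunker.parse(tagged)
p2p_chunks = list(tree.subtrees(filter=lambda t: t.label() == 'P2P'))

# The two PERCENT chunks are separated by "from" (IN), not by a verb plus TO,
# so the P2P rule does not fire and no label would be produced.
print(len(p2p_chunks))  # 0
```

Each new phrasing like this needs its own rule, and each rule needs its own convention for which number is the older one.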
Why are we writing rules in the AI age?
Do you already have human-annotated data with sentences and their corresponding UP/DOWN labels? If so, let me suggest something like https://allennlp.org/tutorials or https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb