3

I'm trying to do sentiment analysis on financial news, and I want to recognise companies from their ticker symbols, e.g. recognise Spotify from SPOT. The final objective is to build a sentiment model for each company. spaCy is pretty good at named entity recognition out of the box, but it falls short when matching a ticker symbol to its company. I have a list of ticker symbols and company names (from NASDAQ, NYSE, AMEX) in CSV format.

Using spaCy's similarity() function, the results aren't good so far. The table below shows a sample of companies with a low similarity score, even though the ticker and the name look similar. I want to train the model on the list of company names and ticker symbols so that the similarity scores are higher after training.

+------------+-------------------------+------------+
|   Stock    |          Name           | Similarity |
+------------+-------------------------+------------+
| CSPI stock | CSP Inc.                | 0.072      |
| CHGG stock | Chegg, Inc.             | 0.071      |
| QADA stock | QAD Inc.                | 0.065      |
| SPOT stock | Spotify Technology S.A. | 0.064      |
+------------+-------------------------+------------+

spaCy's documentation lists several tools for this: PhraseMatcher, EntityRuler, rule-based matching, and the token Matcher. Which one would be best suited to this use case?

Akshay

4 Answers

3

My recommendation would be not to match the ticker symbol to the company name, but to match the company name in the text to the company name you have in the CSV. You will get MUCH better results.

For fuzzy matching, I would recommend the Levenshtein algorithm; there is an example here: T-SQL Get percentage of character match of 2 strings

For a Python Levenshtein, I would recommend this: https://github.com/ztane/python-Levenshtein/#documentation
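
To make the idea concrete, here is a minimal pure-Python sketch of a Levenshtein-based match percentage. It is not the python-Levenshtein library's API; the function names `levenshtein` and `similarity` are made up for illustration, and scaling by the longer string's length is just one common choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this, a company name found in the text can be compared against every name in the CSV, keeping the highest-scoring one above some threshold that you would need to tune on your own data.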

I have personally used EntityRuler with a combination of JSONL rule sets.

But you will have to bring your own data: you need a database of ticker symbols and company names.

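# Note: this uses the spaCy v2 API (component objects passed to nlp.add_pipe).
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_lg')

# Ruler 1: the exchange name plus generic ticker shapes (tokens shaped like XXX.X).
stock_symbol_shapes_ruler = EntityRuler(nlp)
stock_symbol_shapes_ruler.name = 'stock_symbol_shapes_ruler'
patterns_stock_symbol_shapes = [
    {"label": "ORG", "pattern": "NASDAQ"},
    {"label": "STOCK_SYMBOL", "pattern": [{"SHAPE": "XXX.X"}]},
    {"label": "STOCK_SYMBOL", "pattern": [{"SHAPE": "XXXX.X"}]},
]
stock_symbol_shapes_ruler.add_patterns(patterns_stock_symbol_shapes)
nlp.add_pipe(stock_symbol_shapes_ruler, before='ner')

# Ruler 2: known ticker symbols, loaded from a JSONL pattern file.
stock_symbol_ruler = EntityRuler(nlp).from_disk("./stock_symbol_pattern.jsonl")
stock_symbol_ruler.name = 'stock_symbol_ruler'
nlp.add_pipe(stock_symbol_ruler, before='ner')

# Ruler 3: known company names, loaded from a JSONL pattern file.
company_name_ruler = EntityRuler(nlp).from_disk("./company_name_patterns.jsonl")
company_name_ruler.name = 'company_name_ruler'
nlp.add_pipe(company_name_ruler, before='ner')

# test_text is the news text to analyse (defined elsewhere).
doc = nlp(test_text)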

The two JSONL files are generated using SQL. Sample lines from the stock symbol file, then from the company name file:

{"label": "STOCK_SYMBOL", "pattern": "AAON"}
{"label": "STOCK_SYMBOL", "pattern": "AAP"}
{"label": "STOCK_SYMBOL", "pattern": "AAPL"}
{"label": "STOCK_SYMBOL", "pattern": "AAVL"}
{"label": "STOCK_SYMBOL", "pattern": "AAWW"}


{"label": "ORG", "pattern": "AMAG Pharmaceuticals"}
{"label": "ORG", "pattern": "AMAG Pharmaceuticals Inc"}
{"label": "ORG", "pattern": "AMAG Pharmaceuticals Inc."}
{"label": "ORG", "pattern": "AMAG Pharmaceuticals, Inc."}
{"label": "ORG", "pattern": "Amarin"}
{"label": "ORG", "pattern": "Amarin Corporation plc"}
{"label": "ORG", "pattern": "Amazon.com Inc."}
{"label": "ORG", "pattern": "Amazon Inc"}
{"label": "ORG", "pattern": "Amazonm"}
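
If the listing lives in a CSV rather than a database, the same kind of pattern lines could be produced with a few lines of Python instead of SQL. This is only a sketch: the column names `Symbol` and `Name`, the sample rows, and the helper names `build_patterns` and `to_jsonl` are assumptions for illustration, not something taken from the files above.

```python
import csv
import io
import json

# Hypothetical sample standing in for the exchange listing CSV.
SAMPLE_CSV = """Symbol,Name
SPOT,Spotify Technology S.A.
CHGG,"Chegg, Inc."
"""


def build_patterns(csv_file):
    """Return (symbol_patterns, name_patterns) as lists of pattern dicts."""
    symbol_patterns, name_patterns = [], []
    for row in csv.DictReader(csv_file):
        symbol = row["Symbol"].strip()
        name = row["Name"].strip()
        symbol_patterns.append({"label": "STOCK_SYMBOL", "pattern": symbol})
        name_patterns.append({"label": "ORG", "pattern": name})
    return symbol_patterns, name_patterns


def to_jsonl(patterns):
    """Serialise one pattern per line, as in the sample files above."""
    return "\n".join(json.dumps(p) for p in patterns) + "\n"


symbol_patterns, name_patterns = build_patterns(io.StringIO(SAMPLE_CSV))
print(to_jsonl(symbol_patterns), end="")
print(to_jsonl(name_patterns), end="")
```

In practice you would open the real CSV instead of `io.StringIO(SAMPLE_CSV)` and write each `to_jsonl(...)` result to the matching pattern file. Name variants such as "Inc" versus "Inc." still have to be added as separate patterns.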
Dragos Durlut
  • Hey, I might sound stupid, but I am trying to do the same thing and I can't translate this code to spaCy v3 (I just started with Python). Could you post a variant for v3? Thanks! – bmcristi Feb 13 '21 at 07:57
  • Sorry, I am a beginner in Python as well. Maybe try to make it work yourself, and if it does not work, make an SO post about it with the code you have and what you have tried. – Dragos Durlut Feb 19 '21 at 10:46
  • Can you not just add this to the existing 'ner' component in the pipeline? The existing one might already know a few organization names. Wouldn't that be more efficient? I am new to NLP / spaCy and have only been at it a few days, so pardon me if my question is not good... – Tushar Dec 15 '21 at 13:41
  • @Tushar I am also a beginner; Python and NLP are not my main area. You can try it and see what results you get. :) – Dragos Durlut Dec 16 '21 at 07:03
1

You can train sense2vec models and then use them in conjunction with spaCy; the two work well together. https://github.com/explosion/sense2vec

sense2vec would help you identify that SPOT is similar to Spotify in context.

DhruvPathak
1

I recommend trying the fuzzywuzzy library. It is very easy to work with, and I think it could do a good job in your situation. A good example can be found here: https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49
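
As a rough illustration of the lookup step, here is a sketch that picks the closest company name from a list. It uses the standard library's `difflib` as a stand-in so that it runs without extra installs; it is not fuzzywuzzy's API, and the scores will differ from fuzzywuzzy's. The helper name `best_match` and the 0.5 cutoff are assumptions.

```python
import difflib

# Names taken from the table in the question.
COMPANY_NAMES = ["CSP Inc.", "Chegg, Inc.", "QAD Inc.", "Spotify Technology S.A."]


def best_match(mention, choices, cutoff=0.5):
    """Return the choice most similar to mention, or None if nothing clears the cutoff."""
    # Compare in lowercase, but return the original spelling.
    lowered = {choice.lower(): choice for choice in choices}
    hits = difflib.get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


print(best_match("Spotify Technology", COMPANY_NAMES))
print(best_match("Chegg Inc", COMPANY_NAMES))
```

The cutoff needs tuning: too low and unrelated names match, too high and abbreviated mentions are missed.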

Chadee Fouad
0

I used the EntityRuler for this successfully.

You can create multiple patterns for both the company name and stock symbol, and create an id for both so that they are interlinked.
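
A minimal sketch of what such linked patterns could look like, assuming the ticker itself is used as the shared id. EntityRuler patterns accept an optional "id" key; the helper name `linked_patterns` and the choice of id are illustrative assumptions.

```python
def linked_patterns(symbol, names):
    """Build ticker and company-name patterns that all carry the same id."""
    patterns = [{"label": "STOCK_SYMBOL", "pattern": symbol, "id": symbol}]
    for name in names:
        patterns.append({"label": "ORG", "pattern": name, "id": symbol})
    return patterns


patterns = linked_patterns("SPOT", ["Spotify", "Spotify Technology S.A."])
for pattern in patterns:
    print(pattern)
```

Once these are passed to the ruler's add_patterns, a match on either the ticker or a name variant should carry the same id on the entity, so mentions can be grouped per company.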

For the ticker, I used a combination of regex patterns to detect, for example, three capital letters followed by a . or :
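
The exact expressions aren't given here, so the following is only a guess at one of them: a plain Python regex for exactly three capital letters immediately followed by "." or ":". The names `TICKER_RE` and `find_tickers` are hypothetical.

```python
import re

# Assumed shape: a word boundary, three capital letters, then "." or ":".
TICKER_RE = re.compile(r"\b([A-Z]{3})[.:]")


def find_tickers(text):
    """Return the three-letter candidates found in text, in order of appearance."""
    return TICKER_RE.findall(text)


print(find_tickers("Shares of IBM: up 2%, while QAD. fell"))
```

A pattern this loose will also pick up ordinary acronyms at the end of a sentence, so the candidates are best checked against the known ticker list before being treated as stock symbols.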

kayuzee