I am having a problem with a seemingly easy task. Here's a recreation of my problem:
I have a dataframe called legal of this form:
+----+-----------------+
| | legal |
|----+-----------------|
| 0 | gmbh |
| 1 | kg |
| 2 | ag |
| 3 | GmbH & Co. KGaA |
| 4 | LP |
| 5 | LLP |
| 6 | LLLP |
| 7 | LLC |
| 8 | PLLC |
| 9 | corp |
| 10 | corporation |
| 11 | inc |
| 12 | cic |
| 13 | cio |
| 14 | ltd |
| 15 | s.a. |
+----+-----------------+
It contains all the words that can represent a legal term of a given company.
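For anyone who wants to reproduce this, the dataframe above can be rebuilt roughly as follows. This is only a sketch: the variable name `legals` matches the parameter in my method below, but the name `legal_terms` is made up for this snippet.

```python
import pandas as pd

# The legal terms from the table above, in the same order (index 0 to 15).
legal_terms = [
    "gmbh", "kg", "ag", "GmbH & Co. KGaA", "LP", "LLP", "LLLP", "LLC",
    "PLLC", "corp", "corporation", "inc", "cic", "cio", "ltd", "s.a.",
]

# Single-column dataframe with the column name "legal", as in the table.
legals = pd.DataFrame({"legal": legal_terms})
print(legals)
```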
Now I have another dataframe containing a list of company raw names that might also contain some legal terms.
My task is to identify such legal terms for each raw company name in the companies dataframe.
I am trying to use some regex so that the legal terms might both be uppercase and lowercase (or a mix). So I am using the method extract for that.
For the sake of the demonstration, my first raw company name is 2&0 Technologies Inc, so for that company I would expect to extract the word inc from my legal dataframe.
This is the simplified version of my code with some comments:
def format_companies(self, legals, locations):
    self.companies['base_name'] = ''
    self.companies['location'] = ''
    self.companies['legal'] = ''
    for i, row in self.companies.iterrows():
        # Build an alternation of the words in the raw company name
        legal_pattern = '/(' + "|".join(row['raw'].split()) + ')/ig'
        legal_pattern = rf'{legal_pattern}'
        print(legal_pattern)  # It prints out -> /(2&0|Technologies|Inc)/ig
        legal = legals['legal'].str.extract(legal_pattern)
        print(tabulate(legal, headers='keys', tablefmt='psql'))  # Everything is NaN (results are printed below)
        if i >= 0:  # stop after the first row for this demonstration
            break
The first print statement prints the pattern passed to the extract method, which is /(2&0|Technologies|Inc)/ig.
The second print statement prints the results of the extract method and, as noted in the comment, every value is NaN:
+----+-----+
| | 0 |
|----+-----|
| 0 | nan |
| 1 | nan |
| 2 | nan |
| 3 | nan |
| 4 | nan |
| 5 | nan |
| 6 | nan |
| 7 | nan |
| 8 | nan |
| 9 | nan |
| 10 | nan |
| 11 | nan |
| 12 | nan |
| 13 | nan |
| 14 | nan |
| 15 | nan |
+----+-----+
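The same behaviour can be reproduced outside my class with a few lines. This is a minimal sketch, assuming pandas is installed; the three-row `legals` dataframe is a shortened, hypothetical stand-in for the full table, and `raw_name` stands in for `row['raw']`.

```python
import pandas as pd

# Shortened stand-in for the full legal dataframe (hypothetical subset).
legals = pd.DataFrame({"legal": ["gmbh", "inc", "ltd"]})

# Stand-in for row['raw'] of the first company.
raw_name = "2&0 Technologies Inc"

# Same pattern construction as in format_companies.
legal_pattern = "/(" + "|".join(raw_name.split()) + ")/ig"
print(legal_pattern)

# One capture group, so extract returns a dataframe with a single column 0.
legal = legals["legal"].str.extract(legal_pattern)
print(legal)
```

With this pattern, every row of column 0 comes back as NaN, including the row containing inc.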
I am very confused, because if you try the regular expression /(2&0|Technologies|Inc)/ig on the text 'inc' at https://www.regextester.com/, inc gets matched correctly.
What am I doing wrong?