I am having trouble with a seemingly easy task. Here's a recreation of my problem:

I have a dataframe called legals with a single column, legal:

+----+-----------------+
|    | legal           |
|----+-----------------|
|  0 | gmbh            |
|  1 | kg              |
|  2 | ag              |
|  3 | GmbH & Co. KGaA |
|  4 | LP              |
|  5 | LLP             |
|  6 | LLLP            |
|  7 | LLC             |
|  8 | PLLC            |
|  9 | corp            |
| 10 | corporation     |
| 11 | inc             |
| 12 | cic             |
| 13 | cio             |
| 14 | ltd             |
| 15 | s.a.            |
+----+-----------------+

It contains the terms that can denote a company's legal form.

I have another dataframe, companies, containing raw company names that may include some of these legal terms. My task is to identify the legal terms in each raw company name. The match should be case-insensitive (uppercase, lowercase, or mixed), so I am using a regex with the str.extract method.

For the sake of demonstration, my first raw company name is 2&0 Technologies Inc, so for that company I would expect to extract the word inc from the legal terms.

This is the simplified version of my code with some comments:

def format_companies(self, legals, locations):
        self.companies['base_name'] = ''
        self.companies['location'] = ''
        self.companies['legal'] = ''
        for i, row in self.companies.iterrows():
            legal_pattern = '/(' + "|".join(row['raw'].split()) + ')/ig'
            legal_pattern = rf'{legal_pattern}'
            print(legal_pattern) # It prints out -> /(2&0|Technologies|Inc)/ig
            legal = legals['legal'].str.extract(legal_pattern)
            print(tabulate(legal, headers='keys', tablefmt='psql')) # Everything is NaN (results are printed below)
            if i >= 0:
                break

The first print statement is just to print out the pattern used in the extract method, which is /(2&0|Technologies|Inc)/ig.

The second print statement shows the result of the extract method; as noted in the comment, every value is NaN:

+----+-----+
|    |   0 |
|----+-----|
|  0 | nan |
|  1 | nan |
|  2 | nan |
|  3 | nan |
|  4 | nan |
|  5 | nan |
|  6 | nan |
|  7 | nan |
|  8 | nan |
|  9 | nan |
| 10 | nan |
| 11 | nan |
| 12 | nan |
| 13 | nan |
| 14 | nan |
| 15 | nan |
+----+-----+

I am very confused because if you try out the regular expression /(2&0|Technologies|Inc)/ig on the text 'inc' on https://www.regextester.com/, inc gets selected correctly.

What am I doing wrong?

giulio di zio

1 Answer

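str.extract() uses Python's re module, which does not understand the JavaScript-style /pattern/flags notation. The slashes and the trailing ig are treated as literal characters to match, so nothing in the legal column matches. You can fix this in two ways: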

Method 1: Define legal_pattern without the leading / and the trailing /ig:

legal_pattern = '(' + "|".join(row['raw'].split()) + ')'
legal_pattern = rf'{legal_pattern}'

Then pass the re.IGNORECASE flag to str.extract(), as follows:

import re
legals['legal'].str.extract(legal_pattern, flags=re.IGNORECASE)
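To see the difference side by side, here is a minimal self-contained sketch. It assumes a shortened version of the legal column and takes the raw name from the question as a plain string (raw_name, js_style and plain are names made up for this illustration):

```python
import re
import pandas as pd

# Shortened stand-in for the legals dataframe from the question
legals = pd.DataFrame({"legal": ["gmbh", "kg", "ag", "corp", "inc", "ltd"]})
raw_name = "2&0 Technologies Inc"

# JavaScript-style literal: in Python the slashes and "ig" must match literally
js_style = "/(" + "|".join(raw_name.split()) + ")/ig"
# Plain pattern with one capture group, as str.extract expects
plain = "(" + "|".join(raw_name.split()) + ")"

no_match = legals["legal"].str.extract(js_style)
matched = legals["legal"].str.extract(plain, flags=re.IGNORECASE)

print(no_match[0].isna().all())      # True: nothing matches the slash version
print(matched[0].dropna().tolist())  # ['inc']
```

Passing the flag by keyword (flags=...) is a readability choice; it makes clear that the second argument is not part of the pattern.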

Method 2: Alternatively, put the inline flag (?i) at the start of the regex to make it case-insensitive:

legal_pattern = '(?i)(' + "|".join(row['raw'].split()) + ')'
legal_pattern = rf'{legal_pattern}'

Then, you can use str.extract() without specifying re.IGNORECASE:

legals['legal'].str.extract(legal_pattern)

Result:

      0
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN
10  NaN
11  inc
12  NaN
13  NaN
14  NaN
15  NaN
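One caveat with either method: the pattern is built from the raw company name, so a name containing regex metacharacters such as + or ( can produce an invalid or misleading pattern. Below is a sketch of one way to guard against that with re.escape. It uses only the standard re module, and the company name and the shortened list of legal terms are hypothetical:

```python
import re

# Shortened, hypothetical list of legal terms
legal_terms = ["gmbh", "kg", "corp", "inc", "s.a.", "ltd"]
# Hypothetical raw name containing regex metacharacters
raw_name = "C++ Holdings (Europe) Inc"

# re.escape keeps "+", "(" and ")" from being read as regex syntax
words = [re.escape(word) for word in raw_name.split()]
legal_pattern = "(?i)(" + "|".join(words) + ")"

# Keep the legal terms in which any word of the raw name is found
found = [term for term in legal_terms if re.search(legal_pattern, term)]
print(found)  # ['inc']
```

Without the escaping step, compiling a pattern that contains C++ raises re.error (multiple repeat), so the escape is worth adding whenever the names are not under your control.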
SeaBean