Optimizing this pandas transformation

Question

First of all, I'm completely new to pandas, so I'm not sure if this is even called a "Transformation", but I've done this and it feels like there must be a much more efficient way (either in LOC or running time) to achieve the same. This is what I've written:

def findInCol(row, col, tags):
    try:
        for m in validTags:
            if m.lower() in row[col].lower():
                tags.add(m)
            else:
                m_fix = pat.sub('', m.lower())
                row_fix = pat.sub('', row[col].lower())
                if m_fix in row_fix:
                    tags.add(f"{m}")
    except KeyError:
        pass
def getTags(row):
    tags = set()
    findInCol(row, 'Ad_data', tags)
    findInCol(row, 'Name', tags)
    tags = clean(tags)
    return ','.join(tags) if tags else "NA"

def clean(tags):
    arr = list(tags)
    remove = set()
    for i in range(len(arr)):
        for j in range(i+1, len(arr)):
            i_l = pat.sub('', arr[i].lower())
            j_l = pat.sub('', arr[j].lower())
            if i_l in j_l:
                remove.add(i)
            elif j_l in i_l:
                remove.add(j)
    arr = [i for j, i in enumerate(arr) if j not in remove]
    return arr

df['Tag'] = df.fillna("NA").apply(getTags, axis=1)

If anything needs clarification please ask for it. This essentially tries to find some valid tags inside the fields 'Ad_data' y 'Name'. Many times it will find more than one matching tag, which is fine. However, it could be the case that both 'Horse' and 'RedHorse' are valid tags, so after I do the initial search, I need to clean up to keep just the more specific tag (ie. RedHorse) This is performed in the clean function.

EDIT:

Here is a sample dataset. So, going from a table with just the Name (ignore the Ad_data for now, as it may not exist in some cases), and a separate list of valid tags defined in python code (eg. just ["Horse", "RedHorse"] in this case), I need to get the following output table (with the tag added as a column):

Thanks for adding data, but Please [do not post images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) of your data. You can just copy-paste the data in a code block, that’s [much more useful](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Cimbali, Jun 21 '21 at 11:10

Cimbali · Accepted Answer · 2021-06-21T11:13:39.653

I would start with validTags being a sequence (list, series, whatever) of words, but importantly that are sorted from least specific to most specific (at least within each set of tags that can be applied to a single row).

>>> validTags = pd.Series(['Horse', 'RedHorse'])
>>> validTags
0       Horse
1    RedHorse
dtype: object

Now you can to extract these tags from the columns Ad_data and Name:

>>> df = pd.DataFrame({'Name': ['this row is a horse'], 'Ad_data': ['buy a nice REDHORSE']}, index=['a'])
>>> regex = '(' + '|'.join(validTags) + ')'
>>> import re
>>> df['Name'].str.extractall(regex, flags=re.IGNORECASE)[0]
   match
a  0        horse
Name: 0, dtype: object
>>> df['Ad_data'].str.extractall(regex, flags=re.IGNORECASE)[0]
   match
a  0        REDHORSE
Name: 0, dtype: object

You can see extractall returns all matches with a first index level that corresponds to the row where the tag was found, and a second index level that is just the numbering of the match.

I see in your data there a some unexpected characters that are removed by what I suppose is a regular expression pat. You can do that by replacing df[col].str.extractall(...) by df[col].str.replace(pat, '').str.extractall(...).

Let’s put all these matches together, then use GroupBy.unique() to extract unique elements:

>>> allmatches = pd.concat([
...   df[col].str.replace(pat, '').str.extractall(regex, flags=re.IGNORECASE)[0]
... for col in ['Name', 'Ad_data']])
>>> tags = allmatches.str.lower().groupby(level=0).unique()
>>> tags
a    [horse, redhorse]
Name: 0, dtype: object

Now you have lists of items, you can join them together with .str.join:

>>> df['Tag'] = tags.str.join(', ')
>>> df
                  Name              Ad_data              Tag
a  this row is a horse  buy a nice REDHORSE  horse, redhorse

Now if we want to clean we can use the order in validTags. Note that tags is simply df['Tag'].str.split(', ')

>>> tags = tags.explode()
>>> tags
a       horse
a    redhorse
Name: 0, dtype: object
>>> tag_order = validTags.str.lower().reset_index(name='tag').rename(columns={'index': 'order'})
>>> tag_order
   order       tag
0      0     horse
1      1  redhorse
>>> ordered_tags = pd.merge(tags.reset_index(name='tag'), tag_order, on='tag')
>>> ordered_tags
  index       tag  order
0     a     horse      0
1     a  redhorse      1
>>> clean = ordered_tags.loc[ordered_tags.groupby('index')['order'].idxmax()].set_index('index')['tag']
>>> clean
index
a    redhorse
Name: tag, dtype: object

As you can see the order column now defines which tags are the most specific, and GroupBy + idxmax allows to get the most specific thanks to the ordering.

>>> df['clean tag'] = clean
>>> df
                  Name              Ad_data              Tag clean tag
a  this row is a horse  buy a nice REDHORSE  horse, redhorse  redhorse

Wow, I was going to comment the Unexpected Characters thing and you fixed it just before I could comment. Thanks for this, this approach makes a lot more sense than what I was doing! — ALEJANDRO PEREZ MORENO, Jun 21 '21 at 11:18
I just tried to run this in my actual dataset, and I was probably not clear enough about that fact that multiple modeltags are allowed. For example, if both "Horse" and "Dog" match, both should be included. The clean up stage only remove tags that are substring of the others. So if name is "Hi this Dog is a Red Horse" and "Dog" is a valid tag too, initially the tags would be "Dog", "Horse" and "Red Horse", and after clean up it would be just "Dog" and "Red Horse". Any clue on how to achieve this modifying you code? — ALEJANDRO PEREZ MORENO, Jun 21 '21 at 12:02
I might just use my function for the clean up stage for now anyway, thanks for your help. — ALEJANDRO PEREZ MORENO, Jun 21 '21 at 12:07
@ALEJANDROPEREZMORENO I think modifying that cleanup is complex enough for another question. — Cimbali, Jun 21 '21 at 12:09

Optimizing this pandas transformation

1 Answers1