1

I need a code that does 2 functions. Update I added something to the requirements that I forgot

1, if a substring if found on a list, it should tag it with a name, for example, if "0325" is found then in another column, it should add "animal_customer" and if no match is found, it should say "unknown" in a column that in this example is named, 'specie'

2, if the substring is found, then, on another column that has a long string, it should extract 1 substring (space separated) to the left and 3 substrings (also space separated) to the right. After that if found and added to another column, the column (in this case named story) is not necesary. Note that if there was no match in step 1, then this column should say "I.D. not found"

3, There is a chance that more than 1 tag will occur, I need all those that show up as lists in the column

Here is an example of the lists and how the inital table and result table should look

animal = ["0325", "9985"]

human = ["9984", "1859"]

Original Table ->

name species_id story
Bob 010199840101 based on research, human is from U.S. or nearby
Fido 010199850101 based on research, animal is from taiwan or nearby
E.T. 010145660101 based on research, E.T. is from mars or nearby
ManBearPig 03259984010101 based on research, human is from mars or nearby and animal is probably alien too

Resulting table ->

name species_id specie origin_list extract
Bob 010199840101 human_customer human research, human is from U.S.
Fido 010199850101 animal_customer animal research, animal is from taiwan
E.T. 010145660101 unknown none I.D. Not found
manbearpig 03259984010101 "human_customer", "animal_customer" "animal", "human" "research, human is from mars", "and animal is probably alien"

My attempt to fix this:

animal_df = df[df["species_id"].str.contains('|'.join(map(re.escape, animal)))]
animal_df["specie"] = "animal_customer"

human_df = df[df["species_id"].str.contains('|'.join(map(re.escape, human)))]

human_df["specie"] = "human_customer"

df_append = pd.concat(["human_df", "animal_df"])

Problem:

As you can see my attempt will identify and tag the row as if it contains an animal or a human, but wont show up the error if it does not match anything and also will add duplicates.

  • You are looking for Pandas str.contains, there are multiple answers using this solution, eg - https://stackoverflow.com/questions/11350770/filter-pandas-dataframe-by-substring-criteria – Vaishali Apr 26 '23 at 15:18

1 Answers1

2

I would use regex for this. First you need to find any of the values that identify a species. This can be done with create_pattern function that returns a regex pattern with named group. For simplicity I put all species mappings under the species_mapping dictionary, where the key represents the value you want to put into the df. Then all the patterns will be simply joined with pipe to create the regex species_pattern. The map_species function simply returns whatever regex group will be found first or "unknown". The last step is to map this function to the column species_id and store it in species column.

import pandas as pd
import re

def create_pattern(name: str, values: list):
    return rf"(?P<{name}>{'|'.join(values)})"

df = pd.DataFrame(
    {
        "name": ["Bob", "Fido", "E.T.", "ManBearPig"],
        "species_id": ['010199840101', '010199850101', '010145660101', '03259984010101']
    }
)

species_mapping = {
    "animal_customer": ["0325", "9985"],
    "human_customer": ["9984", "1859"],
}

joined_patterns = "|".join(create_pattern(name, values) for name, values in species_mapping.items())
species_pattern = re.compile(rf"({joined_patterns})+")
# will be like '((?P<animal_customer>0325|9985)|(?P<human_customer>9984|1859))+'

def map_species(species_id):
    m = species_pattern.search(species_id)
    if not m:
        return "unknown"
    return ",".join(group for group, value in m.groupdict().items() if value is not None)

df["species"] = df["species_id"].map(map_species)

print(df.to_string())

>>>           name      species_id                         species
>>>  0         Bob    010199840101                  human_customer
>>>  1        Fido    010199850101                 animal_customer
>>>  2        E.T.    010145660101                         unknown
>>>  3  ManBearPig  03259984010101  animal_customer, human_customer

In a similar way I would approach the second task.

adam
  • 159
  • 9
  • Thank you for the alternative solution. I made a small change since I forgot that sometimes I need more than 1 tag. Could you update the answer accordingly? sorry about any issue. – Random Person Apr 26 '23 at 16:55
  • OK I edited the solution: the only thing I needed to touch is the `species_pattern` which now needs to be enclosed in another group with + quantifier. And of course return all matches in `map_species` in some way. As in your example I simply joined them with comma, but I personally would leave them as a list. – adam Apr 26 '23 at 18:37