Pandas: Create new column and add value depending on value (substring) in a string column and value on another column

Question

I am sorry if this is a duplicate question, I did hunt around a bit before I felt like I had to post a question.

I am trying to assign a value in a new column devicevalue based on a value of another 2 columns. My dataframe looks a bit like this;

devicename           make     devicevalue
switch1               cisco        0
switch1-web100        netgear      0  
switch10              cisco        0
switch23              cisco        1
switch31-web200       netgear      0
switch31              cisco        1
switch41-new          cisco        1
switch40e             cisco        1
switch31-web200-new   netgear      0
switch40e             cisco        1
switch11-data100e     netgear      0

I am trying to add a value depending on these criteria;

If make == netgear (set to 0)
If the value after switch is 20 or greater (set to 1, otherwise set to 0)

(If both conditions met, set to 0, i.e. condition of "make == netgear set to 0" takes precedence. Note that this is different from the existing codes where the 2nd condition override (and overwrite result value) if both conditions met.)

I originally had some help getting this together however some devices now have a -new and por a or e which breaks the code that looking at a number at the end of the string

The code I am using is essentially;

def get_number_suffix(devicename: str) -> int:
    i = 1
    while i < len(devicename) and devicename[-i:].isnumeric():
        i += 1

    return int(devicename[-(i-1):])


def compute_devicevalue(row) -> int:
    if 'netgear' in row['make']:
        return 0
    if 20 <= get_number_suffix(row['devicename']):
        return 1
    else:
        return 0

df['devicevalue'] = df.apply(compute_devicevalue, axis=1)

this worked fine before the new additions to the end of some of the naming, now it obviously breaks. I have tried all sorts of ways but I can't find a decent way that ignores -new and por a or e

edit

Sorry all, I completely messed up what I was trying to ask, I'm trying to do the value based on the value after 'switch'.

Essentially using the existing code when it converts the string to an integer and does len it falls over on any name that has a -new and por a or e following it

as an example saying

ValueError: invalid literal for int() with base 10: 'switch23-new'

Why does switch10 cisco have a value of 0 but switch23 cisco has a value of 1? — not_speshal, Oct 11 '21 at 16:11
Whenever 2 conditions of setting to 1 and 0 co-exist, what should we set ? E.g. make == netgear and web or data >= 200 at the same time ? — SeaBean, Oct 11 '21 at 16:29
ah damn, sorry I messed up the example, very sorry sleep deprived new dad. I'll tidy up the original example above — Uggers, Oct 12 '21 at 06:59
@Uggers I've edited my solution above to extract the numbers after 'switch' instead. Please take a look. One thing still outstanding to clarify with you is when both the conditions of make == netgear and the number after switch >= 20, what will be the required value? 0 or 1 ? Seen your edited example you seems want 0, but from your original codes sequence (i.e. your current processing) seems will give 1 since the logic of setting 0 for make == netgear is before checking value of number after switch and will be overwritten to 1 in this case. Please clarify. — SeaBean, Oct 12 '21 at 07:55
@Uggers my current solution above keeps your original code sequence and in the case of both conditions met will set to 1. But if you want 0 instead, you can simply move my codes of setting 0 for make == netgear to the end after setting 1 for switch case. — SeaBean, Oct 12 '21 at 08:02
thanks SeaBean will give it a whirl see if it fixes my issues, — Uggers, Oct 12 '21 at 08:23

SeaBean · Accepted Answer · 2021-10-12T11:14:01.003

3

You can use .loc and str.extract(), as follows:

df['devicevalue'] = 0     # init value to 0

# Set to 1 if the value after 'switch' >= 20. 
# Otherwise part is set during init to 0 at the first statement
df.loc[df['devicename'].str.extract(r'switch(\d+)', expand=False).astype(float) >= 20, 'devicevalue'] = 1

# Set to 0 if `make` == 'netgear'
df.loc[df['make'] == 'netgear', 'devicevalue'] = 0 
# If you have 2 or more values of `make` to match, use, e.g.:
#df.loc[df['make'].isin(['netgear', 'dell']), 'devicevalue'] = 0

Regex r'switch(\d+)' works together with str.extract() to extract the digits after 'switch' no matter they are at the end or in the middle. Therefore, it solves your problem of having the digits previously at the end now at the middle.

Result:

             devicename     make  devicevalue
0               switch1    cisco            0
1        switch1-web100  netgear            0
2              switch10    cisco            0
3              switch23    cisco            1
4       switch31-web200  netgear            0
5              switch31    cisco            1
6          switch41-new    cisco            1
7             switch40e    cisco            1
8   switch31-web200-new  netgear            0
9             switch40e    cisco            1
10    switch11-data100e  netgear            0

edited Oct 12 '21 at 11:14

answered Oct 11 '21 at 16:11

SeaBean

22,547
3
13
25

Doesn't match OP's output. "netgear" in row 8 has a value of 1. I'm unsure what OP needs though. – not_speshal Oct 11 '21 at 16:13
1

@not_speshal Yes, I guess OP's set the desired output quite random. Not adhere to his own data. – SeaBean Oct 11 '21 at 16:13
Correct - the output doesn't match the stated requirements. – not_speshal Oct 11 '21 at 16:14
1

Prefer `.str.extract('(?:web|data)(\d+)', expand=False)` instead of `.str.extract(r'(?:web|data)(\d+)')[0]`. IMHO – Corralien Oct 11 '21 at 16:15
@Corralien Right, that's better. Thank you :-) – SeaBean Oct 11 '21 at 16:19
Shouldn't this statement `df.loc[df['make'] == 'netgear', 'devicevalue'] = 0 ` be after loc statement bcz in ur output `netgar` rows have device value of 1 while it should be zero – Muhammad Hassan Oct 11 '21 at 16:23
1

@MuhammadHassan Good point! Thanks! Then it no longer redundant even when we init to 0 at the beginning. Thanks for pointing up to swap the sequence :-) Anyway, I think better clarify with OP. – SeaBean Oct 11 '21 at 16:27
1

@MuhammadHassan Before clarified with OP, better place the netgear condition before the web/data condition. This is because OP's old program logics is also in this order. – SeaBean Oct 11 '21 at 16:33
Very sorry, I meant the value after switch, not the 3 digit value, really REALLY appreciate the help looking at this though. – Uggers Oct 12 '21 at 07:06
I this appears to have worked perfectly, one quick question. How would I go about adding say new models to ignore. For example netgear and dell Either of these don't seem to work; 'df.loc[df['make'] == 'netgear', 'devicevalue'] = 0 ' 'df.loc[df['make'] == 'dell', 'devicevalue'] = 0 ' 'df.loc[df['make'] == 'netgear' or df['make'] == 'dell' or , 'devicevalue'] = 0' – Uggers Oct 12 '21 at 08:40
sorry messed up formating; `df.loc[df['make'] == 'netgear', 'devicevalue'] = 0 ` `df.loc[df['make'] == 'dell', 'devicevalue'] = 0 ` or `df.loc[df['make'] == 'netgear' or df['make'] == 'dell' or , 'devicevalue'] = 0 ` – Uggers Oct 12 '21 at 08:46
@Uggers you can use `df.loc[df['make'].isin(['netgear', 'dell']), 'devicevalue'] = 0` Simply use `.isin` and put the items in a list. – SeaBean Oct 12 '21 at 08:56
that worked perfectly, I did have to adjust the order the df.loc's were. I added the device name first to set all 20 or higher to =1 then set specific makes back to 0 again as if I did make first it would get set back to 1 again when the device name portion check. – Uggers Oct 12 '21 at 09:26
@Uggers That's great you got it. Remember to accept my solution and also upvote if you have not already done so. – SeaBean Oct 12 '21 at 09:44

score 0 · Answer 2 · answered Oct 11 '21 at 16:53

I tried with regex to extract number from string, here for example.

For my simplicity I converted your dataframe to list

a = [{"devicename" : "switch1","make": "cisco", "devicevalue" :0}, {"devicename" : "switch1-web100", "make" : "netgear", "devicevalue" :0}, {"devicename" : "switch10" , "make" : "cisco", "devicevalue" :0}.... ]

Then I used this function to do it:

import re

def clean_data(data):
    for i in range(len(data)): #remove this if using dataframe row
        row = data[i] #Dict
        if row["make"] == "netgear":
            row["devicevalue"] = 0
        
        tmp = -1
        if "web" in row["devicename"]:
            tmp = [int(s) for s in re.findall(r'\d+', row["devicename"].split("web")[1])][0]
        elif "data" in row["devicename"]:
            tmp = [int(s) for s in re.findall(r'\d+', row["devicename"].split("data")[1])][0]

        if tmp >= 200:
            row["devicevalue"] = 0
        elif tmp == -1:
            pass #Nothing to change

        data[i] = row 
    return data #remove this and return row

I get the following

[{'devicename': 'switch1', 'make': 'cisco', 'devicevalue': 0}, {'devicename': 'switch1-web100', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch10', 'make': 'cisco', 'devicevalue': 0}, {'devicename': 'switch23', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch31-web200', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch31', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch40', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch23', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch31-web200-new', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch31-web100a', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch40', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch11-data100e', 'make': 'cisco', 'devicevalue': 1}]

Since you are sending rows of dataframe, remove the outer loop and return row instead of data in your code

Pandas: Create new column and add value depending on value (substring) in a string column and value on another column

2 Answers2