1

I have a dataframe(df) with different columns. One of the column (col1) is as follows:

    col1
    ----
0   1
1   2
2   1-2
3   1,2
4   1-3
5   3

I am using .replace method in python/pandas to replace the codes in col1 using the code:

 df.col1.replace(to_replace=({'1':'Normal','2':'1-2 more than normal','3':'3-4 more than normal'}), regex=True)

I am using regex=True because there are codes like 1-2 in cells where 1 and 2 have different meanings as mentioned in the dictionary.

Output

    col1
    --------
0   Normal
1   1-2 more than normal
2   Normal-1-2 more than normal
3   Normal,1-2 more than normal
4   Normal-1-2 more than normal-3 more than normal
5   1-2 more than normal-3 more than normal

Desired Output

    col1
    --------
0   Normal
1   1-2 more than normal
2   Normal-1-2 more than normal
3   Normal,1-2 more than normal
4   Normal-3-4 more than normal
5   3-4 more than normal

The Problem:

If I do not consider the fourth row (1-3) then all the codes are replaced correctly, except for the code 3. I further experimented with adding a row with only code 3 and there I found that regex first replaces the values for code 3 and then in those values replaces the codes with values from the dictionary.

It is strange as I am running the regex code/command only once.

One solution is that instead of using numbers in the dictionary values i could use English words, e.g. instead of writing 1-2 more than normal, i can write one-two more than normal and then it works. But I want to keep the numbers as they are easy to interpret.

Any suggestions?

Hanif
  • 377
  • 4
  • 19

1 Answers1

0

Repeating your work I don't seem to get the same error as you do with input

df = pd.DataFrame({'col1' : ['1', '2', '1-2', '1,2', '1-3', '3']})

and applying the same .replace method:

df.col1.replace(to_replace=({'1':'Normal','2':'1-2 more than normal','3':'3-4 more than normal'}), regex=True)

My output matches your Desired Output

Output:

    col1
    ---------
0   Normal
1   1-2 more than normal
2   Normal-1-2 more than normal
3   Normal,1-2 more than normal
4   Normal-3-4 more than normal
5   3-4 more than normal

So I can't really see any problem.

Beyond that though I would consider what transformation you are doing here, and how readable the output is. If you are evaluating each value against some pre-determined limits, why not create another column with a label for each row indicating which classification group it is a member of? Hope that helps!