0

I understand there is a way to create case expressions in SQL based on current columns. Is there a way to do this in Python with Pandas?

The goal is for Python to read an Excel file and create new columns based on conditions.

The query would start like this:

import pandas as pd
df1 = pd.read_excel('data.xlsx')

in SQL it would be read as follows:

Case 
When Type = 'Banana%' THEN 'Fruit'
End

Case 
When Type = 'Apple%' THEN 'Fruit'
End

Case 
When Type = 'Carrot' THEN 'Vegetable'
End

The data in Excel looks like this:

Type
Apple
Banana
Carrot
Austin
  • 2,203
  • 3
  • 12
  • 28
bryd
  • 13
  • 4

3 Answers3

0

Yes, you would be able to do this:

import pandas as pd

# create an example dataframe
d = {'type':['apple', 'banana', 'carrot'],
    'amount':[10, 100, 200]}

df = pd.DataFrame(d)

# change the types
df['type'] = df['type'].replace('apple','fruit')
df['type'] = df['type'].replace('banana','fruit')
df['type'] = df['type'].replace('carrot','vegitable')

# show the modified dataframe
df

which returns this:

enter image description here

note that you could do the same in 1 line like this:

df['type'] = df['type'].replace(['apple','banana','carrot'],['fruit','fruit','vegitable'])

or with arrays, so there is a bit of additional flexibility.

D.L
  • 4,339
  • 5
  • 22
  • 45
0

A good way is to use regular expressions

import pandas as pd

# create an example dataframe
d = {'type':['apple', 'banana', 'carrot','apple1', 'banana1', 'carrot1'],
    'amount':[10, 100, 200,10,10,10]}

df = pd.DataFrame(d)

# change the types
#df['type'] = df['type'].replace(to_replace=r'^apple.$', value='fruit', regex=True)
df['type'] = df['type'].replace(regex={r'^apple(.$|$)': 'fruit', '^banana(.$|$)': 'fruit', '^carrot(.$|$)': 'vegitable'})

# show the modified dataframe
print(df)

Which would change the datatframe to

        type  amount
0      fruit      10
1      fruit     100
2  vegitable     200
3      fruit      10
4      fruit      10
5  vegitable      10
nbk
  • 45,398
  • 8
  • 30
  • 47
0

You can also apply anonymous functions on your pandas series like this

import pandas as pd

# create an example dataframe
data = {'type':['apple', 'banana', 'carrot'],
    'amount':[10, 100, 200]}
df = pd.DataFrame(data)

# define a function that converts value passed to type
def change_type(value: str):
    if 'banana' in str(value).lower():
        return 'Fruit'
    elif 'apple' in str(value).lower():
        return 'Fruit'
    elif 'carrot' in str(value).lower():
        return 'Vegetable'

# apply the function inside an anonymous function.
df['type'] = df['type'].apply(lambda value: change_type(value))

print(df)

and the output becomes

        type  amount
0      Fruit      10
1      Fruit     100
2  Vegetable     200
  • This works @dalmas otieno – bryd Sep 06 '22 at 17:19
  • How would I write code for a date range? For instance, Case WHEN date BETWEEN '2021-02-01' AND '2021-04-31' THEN 'Q1-2021' WHEN date BETWEEN '2021-05-01' AND '2021-07-31' THEN 'Q2-2021' WHEN date BETWEEN '2021-08-01' AND '2021-10-31' THEN 'Q3-2021' WHEN date BETWEEN '2021-11-01' AND '2022-01-31' THEN 'Q4-2021' – bryd Sep 06 '22 at 17:21
  • @bryd for date range you can still use the if conditions as shown in [this answer](https://stackoverflow.com/a/5464465/19800108) – Dalmas Otieno Sep 06 '22 at 17:25
  • thanks! I read through this page. Is there a "between" function in python like there is in SQL? – bryd Sep 07 '22 at 13:25
  • Nope but you can write your own function to do that based on the link that I share above and [this link](https://stackoverflow.com/questions/2634127/how-do-i-check-if-a-given-datetime-object-is-between-two-datetimes) – Dalmas Otieno Sep 07 '22 at 15:02