Pandas - Data Cleaning - Add New Column with if else statement for text values

Question

I have a list of publisher that looks like this :

+--------------+
|  Site Name   |
+--------------+
| Radium One   |
| Euronews     |
| EUROSPORT    |
| WIRED        |
| RadiumOne    |
| Eurosport FR |
| Wired US     |
| Eurosport    |
| EuroNews     |
| Wired        |
+--------------+

I'd like to create the following result:

+--------------+----------------+
|  Site Name   | Publisher Name |
+--------------+----------------+
| Radium One   | RadiumOne      |
| Euronews     | Euronews       |
| EUROSPORT    | Eurosport      |
| WIRED        | Wired          |
| RadiumOne    | RadiumOne      |
| Eurosport FR | Eurosport      |
| Wired US     | Wired          |
| Eurosport    | Eurosport      |
| EuroNews     | Euronews       |
| Wired        | Wired          |
+--------------+----------------+

I would like to understand how I can replicate this code I use in Power Query :

search first 4 characters

if Text.Start([Site Name],4) = "WIRE" then "Wired" else

search last 3 characters

if Text.End([Site Name],3) = "One" then "RadiumOne" else

If no match is found, then add "Rest"

It does not have to be case sensitive.

score 0 · Answer 1 · answered Nov 01 '16 at 12:01

0

you can use apply method and function like :

def handle_text(txt):
    if txt.lower()[:4] == 'wire':
        return 'Wired'
    elif txt.lower()[-3:] == 'one':
        return 'RadiumOne'
    return 'Rest'

df['Publisher Name'] = df['Site Name'].apply(handle_text)

answered Nov 01 '16 at 12:01

Dmitry Ziolkovskiy

3,834
2
18
20

Hi Dmitry, thank you for the quick reply, is there a way, when I do a merge using your measure, to make sure that elif txt is blank then show Rest – Matteo M Nov 07 '16 at 10:40
`handle_text` is usual Python function, which accepts single string, do something and returns string. Of course you can do everything inside it. – Dmitry Ziolkovskiy Nov 07 '16 at 12:30

score 0 · Accepted Answer · answered Nov 01 '16 at 13:00

0

I think you can use double numpy.where with conditions created with indexing with str:

s = df['Site Name'].str.lower()
df['new'] = np.where(s.str[:4] == 'wire', 'Wired', 
            np.where(s.str[-3:] == 'one', 'RadiumOne', 'Rest'))

but if need your output, need also split and title:

df['new1'] = np.where(s.str[:4] == 'wire', 'Wired', 
             np.where(s.str[-3:] == 'one', 'RadiumOne', s.str.split().str[0].str.title()))

print (df)
      Site Name        new       new1
0    Radium One  RadiumOne  RadiumOne
1      Euronews       Rest   Euronews
2     EUROSPORT       Rest  Eurosport
3         WIRED      Wired      Wired
4     RadiumOne  RadiumOne  RadiumOne
5  Eurosport FR       Rest  Eurosport
6      Wired US      Wired      Wired
7     Eurosport       Rest  Eurosport
8      EuroNews       Rest   Euronews
9         Wired      Wired      Wired

answered Nov 01 '16 at 13:00

jezrael

822,522
95
1,334
1,252

this is amazing. So if I understood correctly, the split and title will use the original site name if the result is Rest? One last thing, is there a way to merge the new column, e.g. 'new', to an existing dataframe that contains site name, but simultaniously delete the old site name column? The steps would be then a vlookup, which after matching, deletes/substitutes the old column automatically – Matteo M Nov 01 '16 at 17:31
no, there is different. I use split for two words name like `Eurosport FR` for `split` by whitespace - get `eurosport` and `fr` and then select first (`eurosport`) by `str[0]`. `title` is for add first letter to uppercase - so `eurosport` is changed to `Eurosport`. – jezrael Nov 01 '16 at 17:49
also I get output if `Rest` to column `new` and if use spliting with title output is in `new1`. But if need overwrite old column `Site Name` by new values use assign - `df['Site Name'] = np.where(s.str[:4] == 'wire', 'Wired', np.where(s.str[-3:] == 'one', 'RadiumOne', s.str.split().str[0].str.title())) ` – jezrael Nov 01 '16 at 17:54
I thinkt the best is try it for all column - so you can use your sample - `df = pd.DataFrame({'Site Name': {0: 'Radium One', 1: 'Euronews', 2: 'EUROSPORT', 3: 'WIRED', 4: 'RadiumOne', 5: 'Eurosport FR', 6: 'Wired US', 7: 'Eurosport', 8: 'EuroNews', 9: 'Wired'}}) print (df)` – jezrael Nov 01 '16 at 17:59
then test all my code per partes `print (df['Site Name'].str.lower())` for convert to lowercase, `print (df['Site Name'].str.lower().str.split())` for split by whitespace (default separator), `print (df['Site Name'].str.lower().str.split().str[0])` - select first value of list, `print (df['Site Name'].str.lower().str.split().str[0].str.title())` - convert first letter to uppercase – jezrael Nov 01 '16 at 18:01
@MatteoM - some problem? `np.where` is much faster as `apply` - check [this question](http://stackoverflow.com/q/41166348/2901002) – jezrael Jan 09 '17 at 10:57
@jezreal, sorry I changed by mistake! – Matteo M Jan 09 '17 at 11:12

Pandas - Data Cleaning - Add New Column with if else statement for text values

search first 4 characters

search last 3 characters

2 Answers2