How to use a conditional statement with startswith() in Python - dfply?

Question

I'm doing data wrangling in Python, using the dfply package.

I want to create a new variable "a06" from the column 'FC06' of the dataset data_a, such that:

a06 = 1 if FC06[i] starts with the character "1" (e.g. FC06[i] = 173)
a06 = 2 if FC06[i] starts with the character "2"
a06 = NaN if FC06[i] = NaN

For instance, with the input:

df = pd.DataFrame({'FC06':[173,170,220,float('nan'),110,230,float('nan')]})

I want to get the output:

df1= pd.DataFrame({'a06':[1,1,2,float('nan'),1,2,float('nan')]})

In R, this would be:

data_a %>% mutate(a06 = ifelse(substr(FC06,1,1)=="1",1,ifelse(substr(FC06,1,1)=="2",2,NaN)))

but I can't find how to do this in Python.

I managed a first version with just two alternatives, NaN or 1:

data_a >> mutate(a06=if_else(X['FC06'].apply(pd.isnull), float('nan'), 1))

but I can't find how to differentiate the result according to the first character of FC06.

I tried things like:

(data_a >> mutate(a06=if_else(X['FC06'].apply(pd.isnull),float('nan'),if_else(X['FC06'].apply(str)[0]=='1',1,2))))

but without success: [0] doesn't work there to get the first character, and/or str() can't be used with apply (and neither can str.startswith('1')).

Does anybody know how to solve such situations?

Or is there another package that can do this in Python?

Thank you !!

Can you please provide some example data and your expected output? [Here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) are some tips on how to make good, reproducible pandas examples. — juanpa.arrivillaga, Aug 13 '18 at 12:51
Ok, for instance, my input can be : df = pd.DataFrame({'FC06':[173,170,220,float('nan'),110,230,float('nan')]}) — Elise1369, Aug 13 '18 at 13:08
and the expected output : df1= pd.DataFrame({'a06':[1,1,2,float('nan'),1,2,float('nan')]}) — Elise1369, Aug 13 '18 at 13:08
Please [edit](https://stackoverflow.com/posts/51822709/edit) your question and provide these examples there. — juanpa.arrivillaga, Aug 13 '18 at 13:09
Note, your examples provide *integers*, not strings... which are they? — juanpa.arrivillaga, Aug 13 '18 at 13:09
About the data type : there must be rather strings, since some values are like "110M" (the data are about functions in an association, 110M or 270 is the code of the function). if I write type(data_a['FC06']), I get : pandas.core.series.Series — Elise1369, Aug 13 '18 at 13:17
You want `data_a['FC06'].dtypes`. The series is of course of type series, you want the type of the *data* in the series. But it seems you have some object column, presumably from some csv? — juanpa.arrivillaga, Aug 13 '18 at 13:20
They must always start with 1, 2 or NaN (if ever some start with something else, I want to associate them with '2') — Elise1369, Aug 13 '18 at 13:20
Yes, indeed. it's logic that the type is series (I believed that it could be different like on R ;)). My dataset : data_a is a dataframe from a CSV, with several columns including FC06 — Elise1369, Aug 13 '18 at 13:34
Well, I've provided a solution that should work if your data is as messy as you think it might be, you could also try jpp's answer, in case your data is more well-behaved. — juanpa.arrivillaga, Aug 13 '18 at 13:35

score 0 · Accepted Answer · answered Aug 13 '18 at 13:24

If you only have 3-digit numbers, you can use floor division:

df['FC06'] //= 100
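As a minimal sketch of the floor-division idea, here it is applied to the sample data from the question, writing the result to a new column a06 (the name used in the question) instead of overwriting FC06. It assumes every non-missing value is a number between 100 and 999; missing values simply stay NaN.

```python
import pandas as pd

# Sample input from the question.
df = pd.DataFrame({'FC06': [173, 170, 220, float('nan'), 110, 230, float('nan')]})

# Floor division by 100 keeps only the leading digit of a 3-digit number.
# NaN // 100 is still NaN, so missing values need no special handling.
df['a06'] = df['FC06'] // 100

print(df)
```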

If you have strings, you can use pd.Series.mask:

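# Take the first character of each value as text, then convert it back to a number
ints = pd.to_numeric(df['FC06'].astype(str).str[:1], errors='coerce')
# Assign the result back; inplace=True on a selected column may not update df in recent pandas
df['FC06'] = df['FC06'].mask(df['FC06'].notnull(), ints)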

print(df)

   FC06
0   1.0
1   1.0
2   2.0
3   NaN
4   1.0
5   2.0
6   NaN

You will notice that your integers become floats. This is forced by the existence of NaN values, which are considered float. In general, this shouldn't be a problem.
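If the column mixes numbers with codes such as "110M", and anything that does not start with "1" should map to 2 (as described in the comments), a plain pandas variant with str.startswith might look like the sketch below. The sample values are hypothetical, chosen only to cover those cases, and this uses pandas and NumPy directly rather than dfply.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: numbers, a suffixed code, and two kinds of missing value.
df = pd.DataFrame({'FC06': [173, '110M', 220, np.nan, '305', None]})

# Convert every value to text so that .str.startswith works on mixed types.
starts_with_1 = df['FC06'].astype(str).str.startswith('1')

# 1 when the code starts with "1", otherwise 2 (the fallback for any other prefix).
df['a06'] = np.where(starts_with_1, 1.0, 2.0)

# Restore NaN wherever the original value was missing.
df.loc[df['FC06'].isna(), 'a06'] = np.nan

print(df)
```

The missing values are handled last because astype(str) turns them into ordinary text, which would otherwise fall into the "2" branch.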

Thank you ! I would try to use dfply (or an equivalent of dplyr in R), but as long as it works it's very good :) — Elise1369, Aug 13 '18 at 14:51
