
I have two columns, company_size and company_type. company_size contains NaN values, which I want to fill based on the company_type column. How can I map company_type to a particular value so that I can fill the NaN (i.e. missing) values in company_size based on company_type?

Example data:

 company_size    company_type
   10-50           startup
   1000+           PvtLtd
   NaN             Public Sector
   NaN             PvtLtd
   100-500         Public Sector
   NaN             startup
   100-500         Public Sector

Now, based on company_type, I want to fill company_size:

startup -> 10-50
Public Sector -> 100-500
PvtLtd -> 1000+

Note both company_type and company_size are categorical variables.
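
For reference, a minimal reproducible construction of the example above (the names df and size_map are mine, not from the original post; plain strings are used here rather than a pandas Categorical dtype):

import pandas as pd
import numpy as np

# Reconstruction of the example frame shown above
df = pd.DataFrame({
    'company_size': ['10-50', '1000+', np.nan, np.nan, '100-500', np.nan, '100-500'],
    'company_type': ['startup', 'PvtLtd', 'Public Sector', 'PvtLtd',
                     'Public Sector', 'startup', 'Public Sector'],
})

# The desired company_type -> company_size mapping from the question
size_map = {'startup': '10-50', 'Public Sector': '100-500', 'PvtLtd': '1000+'}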

stone rock
  • I think there's a better dupe that actually does it from a dataframe not a dictionary – user3483203 Jul 19 '18 at 18:22
  • @xyzjayne Here I want to fill NaN values how can I do that in pandas ? – stone rock Jul 19 '18 at 18:22
  • @stonerock you can select the rows where company_size is NaN – xyzjayne Jul 19 '18 at 18:23
  • Perhaps this one? [remap values in pandas column with a dict](https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict) – ALollz Jul 19 '18 at 18:23
  • Have you tried anything? It is quite similar to your last question, which was answered. :)) – harvpan Jul 19 '18 at 18:24
  • I don't think it's mapping at all. Why not just groupby `company_type` and ffill or bfill – user3483203 Jul 19 '18 at 18:25
  • @HarvIpan I don't think `between()` and `where()` will work – stone rock Jul 19 '18 at 18:25
  • Instead of posting comments can someone please post answer which will be more helpful :) – stone rock Jul 19 '18 at 18:26
  • Instead of asking questions that are already answered on the forum you could have done some research first and saved all of us some time :) – xyzjayne Jul 19 '18 at 18:28
  • I think people are reluctant to post an answer because this question is almost certainly answered elsewhere on this site, and an almost identical question (asked by you) was answered earlier today, with a method that could easily be expanded to answer this question. – ALollz Jul 19 '18 at 18:29
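
For completeness, a small sketch of the dict-based approach the comments point to: map company_type through the dictionary and fill only the missing values (df and size_map as sketched in the question above are assumptions). If the columns are true pandas Categoricals this still works, as long as the fill values are existing categories:

# Map company_type through the dict and use the result to fill only the NaN rows
df['company_size'] = df['company_size'].fillna(df['company_type'].map(size_map))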

2 Answers


You can first sort by company_size to make sure that the NaNs are at the end, then use groupby and ffill, then sort by index again to reinstate your original order:

df.sort_values('company_size').groupby('company_type').ffill().sort_index()

  company_size   company_type
0        10-50        startup
1        1000+         PvtLtd
2      100-500  Public Sector
3        1000+         PvtLtd
4      100-500  Public Sector
5        10-50        startup
6      100-500  Public Sector
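
A slight variation on the same idea, assuming the df sketched in the question, if you only want to fill company_size in place rather than build a whole new frame (the column selection is my addition, not part of the original answer):

# Sort so the NaNs in each group come last, forward-fill within each group,
# then rely on index alignment to put the filled values back in the original order
df['company_size'] = (df.sort_values('company_size')
                        .groupby('company_type')['company_size']
                        .ffill()
                        .sort_index())
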
sacuL

groupby your company_type, and use transform('first') to get rid of the NaN values:

df['company_size'] = df.groupby("company_type")["company_size"].transform("first")

  company_size  company_type
0        10-50       startup
1        1000+        PvtLtd
2      100-500  PublicSector
3        1000+        PvtLtd
4      100-500  PublicSector
5        10-50       startup
6      100-500  PublicSector
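
One caveat: transform('first') broadcasts each group's first non-missing value to every row, so it also rewrites values that were already present (harmless here, since they agree within each group). If you would rather touch only the missing entries, a hedged alternative is to combine it with fillna, assuming the same df as above:

# Fill only the NaN rows with each group's first non-missing value
df['company_size'] = df['company_size'].fillna(
    df.groupby('company_type')['company_size'].transform('first'))
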
user3483203