Handle missing values in python by attribute mean for all samples belonging to the same class as the given tuple?

Question

I have a large database of housing data and I need to fill in the missing values by mean of the same class. For example, in the column "Bedrooms" the missing data needs to be filled by the mean bedrooms of houses with the same/similar size and price. The sizes are stored in sq. ft. in an attribute called "Area". There are a lot of different values for the Area and price attributes so I'm a little confused about how to approach this. Is there a simple way to do this in python? Also, is combining areas into intervals to have less distinct values and finding the mean for each interval more suitable?

Here is the sample data:

location    bedrooms    Size(sq. ft.)   price
abc             7           4500        5.5 Crore 
cde             6           2250        2.1 Crore 
bda             7           4500        4.75 Crore 
abc             NA          4500        4.5 Crore 
abc             5           2250        2.3 Crore 
bda             NA          1350        54 Lakh 
cde             5           1575        1.6 Crore 
bda             NA          2452        3.25 Crore 
bda             3           1260        95 Lakh 
cde             6           2250        2.15 Crore 
abc             8           4500        3.5 Crore

Please share the data as data structure and not as a picture. — balderman, Feb 20 '19 at 18:02
@balderman how do I do that? The whole dataset is pretty large I made this for simplicity. — Marij Khan, Feb 20 '19 at 18:05
Select 10 or 20 lines and upload them as part of the question. — balderman, Feb 20 '19 at 18:06
Define "similar." Is it an exact match or not? If you are specifying a range of values to use in the range then you need to determine first what similar means to you. — Edeki Okoh, Feb 20 '19 at 18:19
@EdekiOkoh there are a few common areas such as 9000 sq ft. For these I can use an exact match. However, there are a few rare values like 9200 sq ft. these are only a few and I'd like to use the mean of 9000 sq ft one as it is a common house area. Hope that makes sense. — Marij Khan, Feb 20 '19 at 19:16
You need to make helper columns for your range. Use the answer [here](https://stackoverflow.com/questions/27041724/using-conditional-to-generate-new-column-in-pandas-dataframe) to make the column so 9200 becomes 9000 in the new column. Then use my answer below with the new helper column — Edeki Okoh, Feb 20 '19 at 19:18

Edeki Okoh · Accepted Answer · 2019-02-20T19:19:11.233

You can use the groupby and transform method in python to get the desired results.

Example :

d = {'col1': ['a', 'b', 'a', 'a', 'b', 'a'], 'col2': [3, 4, None, 5, None, 7], 'col3':[4,5,6,7,8,6]}
df = pd.DataFrame(data=d)

Output:

  col1  col2  col3
0    a   3.0     4
1    b   4.0     5
2    a   NaN     6
3    a   5.0     7
4    b   NaN     8
5    a   7.0     6

Now using the groupby and transform method:

df["col2"] = df.groupby("col1").transform(lambda x: x.fillna(x.mean()))

Output

  col1  col2  col3
0    a   3.0     4
1    b   4.0     5
2    a   5.0     6
3    a   5.0     7
4    b   4.0     8
5    a   7.0     6

As you can see, the Nan values in col2 were replaced with the mean of the class in col1. For example index two became 5 which is (3+5+7)/3 and index 4 became 4, which is 4/1.

If your case, if you want to add multiple columns as your "grouper", you can simply pass in more columns in the groupby method in order to do so. However, these will only take the mean of that values where the values are Excatly the same across all of the columns. In your case what you can do is:

df["bedrooms"] = df.groupby(["location","Size(sq. ft.)", "price"]).transform(lambda x: x.fillna(x.mean()))

So for example if location was abc, size was 4000 and price was 2000, it would take the mean for all instances of values where those three values are the same. However if you had a column with abc, size was 4000, and price was 2001, it would not use the mean of the values that had the same location and size, but different price.

You would need a helper column which defines your similarity measure in order to use the mean of a range of values.

This is a good answer, however is there a way to do this without Pandas? — Rodney, Apr 24 '19 at 23:08

Handle missing values in python by attribute mean for all samples belonging to the same class as the given tuple?

1 Answers1