Trying to create a DF using the values from multiple arrays/groups (within a loop)
DataFrame:
sample_data = [['USA', 'gdp', 2001, 10],['USA', 'avgIQ', 2001, 100],['USA', 'people', 2001, 1000],['USA', 'dragons', 2001, 3],['CHN', 'gdp', 2001, 12], ['CHN', 'avgIQ', 2001, 120],['CHN', 'people', 2001, 2000],['CHN', 'dragons', 2001, 1],['RUS', 'gdp', 2001, 11],['RUS', 'avgIQ', 2001, 105], ['RUS', 'people', 2001, 1500],['RUS', 'dragons', 2001, np.nan],['USA', 'gdp', 2002, 12],['USA', 'avgIQ', 2002, 105],['USA', 'people', 2002, 1200], ['USA', 'dragons', 2002, np.nan],['CHN', 'gdp', 2002, 14],['CHN', 'avgIQ', 2002, 127],['CHN', 'people', 2002, 3100],['CHN', 'dragons', 2002, 4], ['RUS', 'gdp', 2002, 11],['RUS', 'avgIQ', 2002, 99],['RUS', 'people', 2002, 1600],['RUS', 'dragons', 2002, np.nan],['USA', 'gdp', 2003, 15], ['USA', 'avgIQ', 2003, 115],['USA', 'people', 2003, 2000],['USA', 'dragons', 2003, np.nan],['CHN', 'gdp', 2003, 16],['CHN', 'avgIQ', 2003, 132], ['CHN', 'people', 2003, 4000],['CHN', 'dragons', 2003, 6],['RUS', 'gdp', 2003, 11],['RUS', 'avgIQ', 2003, 108],['RUS', 'people', 2003, 2000], ['RUS', 'dragons', 2003, np.nan],['USA', 'gdp', 2004, 18],['USA', 'avgIQ', 2004, 111],['USA', 'people', 2004, 2500],['USA', 'dragons', 2004, np.nan], ['CHN', 'gdp', 2004, 18],['CHN', 'avgIQ', 2004, 140],['CHN', 'people', 2004, np.nan],['CHN', 'dragons', 2004, np.nan], ['RUS', 'gdp', 2004, 15],['RUS', 'avgIQ', 2004, 103],['RUS', 'people', 2004, 2800],['RUS', 'dragons', 2004, np.nan], ['USA', 'gdp', 2005, 23],['USA', 'avgIQ', 2005, 111],['USA', 'people', 2005, 3700],['USA', 'dragons', 2005, 8],['CHN', 'gdp', 2005, 22], ['CHN', 'avgIQ', 2005, 143],['CHN', 'people', 2005, 6000],['CHN', 'dragons', 2005, 15],['RUS', 'gdp', 2005, 17],['RUS', 'avgIQ', 2005, np.nan], ['RUS', 'people', 2005, 3000],['RUS', 'dragons', 2005, 3]]
sample_df = pd.DataFrame(sample_data, columns = ['A','B','C','D'])
sample_df['C'] = sample_df['C'].astype(float)
sample_df.head()
Data columns (total 4 columns):
A 60 non-null object
B 60 non-null object
C 60 non-null float64
D 50 non-null float64
dtypes: float64(2), object(2)
from impyute.imputation.cs import mice
sample_group = sample_df.groupby(['A', 'B'])
for group_index, group in sample_group:
if group.isnull().values.any() == True:
group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
print(group)
else:
print(group)
continue
A B C D
5 CHN avgIQ 2,001.00 120.00
17 CHN avgIQ 2,002.00 127.00
29 CHN avgIQ 2,003.00 132.00
41 CHN avgIQ 2,004.00 140.00
53 CHN avgIQ 2,005.00 143.00
A B C D
7 CHN dragons 2,001.00 1.00
19 CHN dragons 2,002.00 4.00
31 CHN dragons 2,003.00 6.00
43 CHN dragons 2,004.00 10.86
55 CHN dragons 2,005.00 15.00
A B C D
4 CHN gdp 2,001.00 12.00
16 CHN gdp 2,002.00 14.00
28 CHN gdp 2,003.00 16.00
40 CHN gdp 2,004.00 18.00
52 CHN gdp 2,005.00 22.00
A B C D
6 CHN people 2,001.00 2,000.00
18 CHN people 2,002.00 3,100.00
30 CHN people 2,003.00 4,000.00
42 CHN people 2,004.00 5,014.29
54 CHN people 2,005.00 6,000.00
A B C D
9 RUS avgIQ 2,001.00 105.00
21 RUS avgIQ 2,002.00 99.00
33 RUS avgIQ 2,003.00 108.00
45 RUS avgIQ 2,004.00 103.00
57 RUS avgIQ 2,005.00 104.50
A B C D
11 RUS dragons 2,001.00 3.00
23 RUS dragons 2,002.00 3.00
35 RUS dragons 2,003.00 3.00
47 RUS dragons 2,004.00 3.00
59 RUS dragons 2,005.00 3.00
A B C D
8 RUS gdp 2,001.00 11.00
20 RUS gdp 2,002.00 11.00
32 RUS gdp 2,003.00 11.00
44 RUS gdp 2,004.00 15.00
56 RUS gdp 2,005.00 17.00
A B C D
10 RUS people 2,001.00 1,500.00
22 RUS people 2,002.00 1,600.00
34 RUS people 2,003.00 2,000.00
46 RUS people 2,004.00 2,800.00
58 RUS people 2,005.00 3,000.00
A B C D
1 USA avgIQ 2,001.00 100.00
13 USA avgIQ 2,002.00 105.00
25 USA avgIQ 2,003.00 115.00
37 USA avgIQ 2,004.00 111.00
49 USA avgIQ 2,005.00 111.00
A B C D
3 USA dragons 2,001.00 3.00
15 USA dragons 2,002.00 4.25
27 USA dragons 2,003.00 5.50
39 USA dragons 2,004.00 6.75
51 USA dragons 2,005.00 8.00
A B C D
0 USA gdp 2,001.00 10.00
12 USA gdp 2,002.00 12.00
24 USA gdp 2,003.00 15.00
36 USA gdp 2,004.00 18.00
48 USA gdp 2,005.00 23.00
A B C D
2 USA people 2,001.00 1,000.00
14 USA people 2,002.00 1,200.00
26 USA people 2,003.00 2,000.00
38 USA people 2,004.00 2,500.00
50 USA people 2,005.00 3,700.00
Note that the nan values have been properly filled in^^^ (All good so far)
Having trouble making a dataframe similar to the one I began with, out of the updated values:
I followed a solution from a similar problem seen here: How to build and fill pandas dataframe from for loop?
sample_group = sample_df.groupby(['A', 'B'])
d = []
for group_index, group in sample_group:
if group.isnull().values.any() == True:
group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
d.append({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
else:
d.append({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
continue
d = pd.DataFrame(d)
d.head(10)
- You can see that the values are all accurate, but it is returning the whole list of values for each index
Alternative approach using .concat
(seems to take longer):
sample_group = sample_df.groupby(['A', 'B'])
d = pd.DataFrame()
for group_index, group in sample_group:
if group.isnull().values.any() == True:
group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
temp = pd.DataFrame({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
else:
temp = pd.DataFrame({'A': group.A, 'B': group.B, 'C': group.C.values, 'D': group.D.values})
continue
d = pd.concat([d, temp])
d.head()
- This output looks correct, but as you can see, only returns the first group...
Desired Output:
I'd like to have a DataFrame that look exactly like the initial one, but with the updated nan values in their respective places.