0

Trying to create a DF using the values from multiple arrays/groups (within a loop)

DataFrame:

sample_data = [['USA', 'gdp', 2001, 10],['USA', 'avgIQ', 2001, 100],['USA', 'people', 2001, 1000],['USA', 'dragons', 2001, 3],['CHN', 'gdp', 2001, 12], ['CHN', 'avgIQ', 2001, 120],['CHN', 'people', 2001, 2000],['CHN', 'dragons', 2001, 1],['RUS', 'gdp', 2001, 11],['RUS', 'avgIQ', 2001, 105], ['RUS', 'people', 2001, 1500],['RUS', 'dragons', 2001, np.nan],['USA', 'gdp', 2002, 12],['USA', 'avgIQ', 2002, 105],['USA', 'people', 2002, 1200], ['USA', 'dragons', 2002, np.nan],['CHN', 'gdp', 2002, 14],['CHN', 'avgIQ', 2002, 127],['CHN', 'people', 2002, 3100],['CHN', 'dragons', 2002, 4], ['RUS', 'gdp', 2002, 11],['RUS', 'avgIQ', 2002, 99],['RUS', 'people', 2002, 1600],['RUS', 'dragons', 2002, np.nan],['USA', 'gdp', 2003, 15], ['USA', 'avgIQ', 2003, 115],['USA', 'people', 2003, 2000],['USA', 'dragons', 2003, np.nan],['CHN', 'gdp', 2003, 16],['CHN', 'avgIQ', 2003, 132], ['CHN', 'people', 2003, 4000],['CHN', 'dragons', 2003, 6],['RUS', 'gdp', 2003, 11],['RUS', 'avgIQ', 2003, 108],['RUS', 'people', 2003, 2000], ['RUS', 'dragons', 2003, np.nan],['USA', 'gdp', 2004, 18],['USA', 'avgIQ', 2004, 111],['USA', 'people', 2004, 2500],['USA', 'dragons', 2004, np.nan], ['CHN', 'gdp', 2004, 18],['CHN', 'avgIQ', 2004, 140],['CHN', 'people', 2004, np.nan],['CHN', 'dragons', 2004, np.nan], ['RUS', 'gdp', 2004, 15],['RUS', 'avgIQ', 2004, 103],['RUS', 'people', 2004, 2800],['RUS', 'dragons', 2004, np.nan], ['USA', 'gdp', 2005, 23],['USA', 'avgIQ', 2005, 111],['USA', 'people', 2005, 3700],['USA', 'dragons', 2005, 8],['CHN', 'gdp', 2005, 22], ['CHN', 'avgIQ', 2005, 143],['CHN', 'people', 2005, 6000],['CHN', 'dragons', 2005, 15],['RUS', 'gdp', 2005, 17],['RUS', 'avgIQ', 2005, np.nan], ['RUS', 'people', 2005, 3000],['RUS', 'dragons', 2005, 3]]

sample_df = pd.DataFrame(sample_data, columns = ['A','B','C','D'])

sample_df['C'] = sample_df['C'].astype(float) 
sample_df.head()

enter image description here

Data columns (total 4 columns):
A    60 non-null object
B    60 non-null object
C    60 non-null float64
D    50 non-null float64
dtypes: float64(2), object(2)
from impyute.imputation.cs import mice
sample_group = sample_df.groupby(['A', 'B'])

for group_index, group in sample_group:

    if group.isnull().values.any() == True:

                
        group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values        

        
        print(group)

        
    else:
                
        print(group)
        continue
        
A      B        C      D
5   CHN  avgIQ 2,001.00 120.00
17  CHN  avgIQ 2,002.00 127.00
29  CHN  avgIQ 2,003.00 132.00
41  CHN  avgIQ 2,004.00 140.00
53  CHN  avgIQ 2,005.00 143.00
      A        B        C     D
7   CHN  dragons 2,001.00  1.00
19  CHN  dragons 2,002.00  4.00
31  CHN  dragons 2,003.00  6.00
43  CHN  dragons 2,004.00 10.86
55  CHN  dragons 2,005.00 15.00
      A    B        C     D
4   CHN  gdp 2,001.00 12.00
16  CHN  gdp 2,002.00 14.00
28  CHN  gdp 2,003.00 16.00
40  CHN  gdp 2,004.00 18.00
52  CHN  gdp 2,005.00 22.00
      A       B        C        D
6   CHN  people 2,001.00 2,000.00
18  CHN  people 2,002.00 3,100.00
30  CHN  people 2,003.00 4,000.00
42  CHN  people 2,004.00 5,014.29
54  CHN  people 2,005.00 6,000.00
      A      B        C      D
9   RUS  avgIQ 2,001.00 105.00
21  RUS  avgIQ 2,002.00  99.00
33  RUS  avgIQ 2,003.00 108.00
45  RUS  avgIQ 2,004.00 103.00
57  RUS  avgIQ 2,005.00 104.50
      A        B        C    D
11  RUS  dragons 2,001.00 3.00
23  RUS  dragons 2,002.00 3.00
35  RUS  dragons 2,003.00 3.00
47  RUS  dragons 2,004.00 3.00
59  RUS  dragons 2,005.00 3.00
      A    B        C     D
8   RUS  gdp 2,001.00 11.00
20  RUS  gdp 2,002.00 11.00
32  RUS  gdp 2,003.00 11.00
44  RUS  gdp 2,004.00 15.00
56  RUS  gdp 2,005.00 17.00
      A       B        C        D
10  RUS  people 2,001.00 1,500.00
22  RUS  people 2,002.00 1,600.00
34  RUS  people 2,003.00 2,000.00
46  RUS  people 2,004.00 2,800.00
58  RUS  people 2,005.00 3,000.00
      A      B        C      D
1   USA  avgIQ 2,001.00 100.00
13  USA  avgIQ 2,002.00 105.00
25  USA  avgIQ 2,003.00 115.00
37  USA  avgIQ 2,004.00 111.00
49  USA  avgIQ 2,005.00 111.00
      A        B        C    D
3   USA  dragons 2,001.00 3.00
15  USA  dragons 2,002.00 4.25
27  USA  dragons 2,003.00 5.50
39  USA  dragons 2,004.00 6.75
51  USA  dragons 2,005.00 8.00
      A    B        C     D
0   USA  gdp 2,001.00 10.00
12  USA  gdp 2,002.00 12.00
24  USA  gdp 2,003.00 15.00
36  USA  gdp 2,004.00 18.00
48  USA  gdp 2,005.00 23.00
      A       B        C        D
2   USA  people 2,001.00 1,000.00
14  USA  people 2,002.00 1,200.00
26  USA  people 2,003.00 2,000.00
38  USA  people 2,004.00 2,500.00
50  USA  people 2,005.00 3,700.00

Note that the nan values have been properly filled in^^^ (All good so far)

Having trouble making a dataframe similar to the one I began with, out of the updated values:

I followed a solution from a similar problem seen here: How to build and fill pandas dataframe from for loop?


sample_group = sample_df.groupby(['A', 'B'])

d = []

for group_index, group in sample_group:

        if group.isnull().values.any() == True:
        
            group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
            
            d.append({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
            

        else:
            
            d.append({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
                
            continue
            
d = pd.DataFrame(d)
d.head(10)
        

enter image description here

  • You can see that the values are all accurate, but it is returning the whole list of values for each index

Alternative approach using .concat (seems to take longer):

sample_group = sample_df.groupby(['A', 'B'])

d = pd.DataFrame()

for group_index, group in sample_group:

        if group.isnull().values.any() == True:
        
            group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
            
            temp = pd.DataFrame({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
            
        else:
            
            temp = pd.DataFrame({'A': group.A, 'B': group.B, 'C': group.C.values, 'D': group.D.values})
                
            continue

d = pd.concat([d, temp])
d.head()

enter image description here

  • This output looks correct, but as you can see, only returns the first group...

Desired Output:

I'd like to have a DataFrame that look exactly like the initial one, but with the updated nan values in their respective places.

Community
  • 1
  • 1
Alex
  • 188
  • 11
  • I also tried it using a similar method, wherein I use `.concat`, and while this worked, it only worked for the first group (did not iterate through all the way). Furthermore, this method was significantly slower, and seemed to be frowned upon in terms of "best practices". – Alex Dec 13 '19 at 00:27
  • I've also seen reference to `.transform`, but I could not figure out if this was applicable here... – Alex Dec 13 '19 at 00:29

1 Answers1

0

Solved it!

sample_group = sample_df.groupby(['A', 'B'])

d = pd.DataFrame([])

for group_index, group in sample_group:

        if group.isnull().values.any() == True:

            group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values

            d = d.append(pd.DataFrame({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values}))



        else:


            d = d.append(pd.DataFrame({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values}))


            continue


d.head(20)

The index looked a little odd in the output, but after a simple .reset_index it was right as rain. Hope this helps someone at some point down the line.

This was very helpful:

Using pandas .append within for loop

Alex
  • 188
  • 11