R an algorithm for missing data imputation

Question

I have a matrix M[i,j] with a large number of NA representing the population stock in a number of urban areas [i,] over the period 2000-2013 [,j]. I would like to complete missing data assuming that population grows at a constant rate.

I would like to run a loop which, for each row i of the matrix M 1. Computes the average growth rate using all the non-missing observations 2. Fills in the gaps using imputed values

Generating an example dataset:

city1=c(10,40,NA,NA,320,640);
city2=c(NA,300,NA,1200,2400,NA);
city3=c(NA,NA,4000,8000,16000,32000);
mat=rbind(city1,city2,city3)

Since the average growth rate is 4 for city1, 3 for city2, and 2 for city3, the corresponding result matrix should be:

r.city1=c(10,40,80,160,320,640);
r.city2=c(100,300,600,1200,2400,4800);
r.city3=c(1000,2000,4000,8000,16000,32000);
r.mat=rbind(city1,city2,city3)

Do you have any idea of how I could go about this?

Best,

Clément

I think you have a mistake in `city1` as once it grows by `4` and the other time only by `2`. Also your desired output shows growth rate of `2` instead of `4` (as per your description). — David Arenburg, May 06 '15 at 10:33
@DavidArenburg yep you're right the growth rates are the same for all cities unless the entries are not representing the same time ... between cities ... have aded some code to answer — Spektre, May 06 '15 at 11:17

score 1 · Answer 1 · edited May 23 '17 at 11:59

simply approximate the growth rate

you have 1D array p[n] per city so start with that (p as population)
index i={0,1,2...,n-1} means time in some constant step

so if growth rate is constant (let call it m) then

p[1]=p[0]*m
p[2]=p[0]*(m^2)
p[3]=p[0]*(m^3)
p[i]=p[0]*(m^i)

So now just guess or approximate m and minimize the distance from each known point

as a start point you can get 2 continuous known values
m0=p[i+1]/p[i];
and then make loop around this value minimizing error for all known values at once
also you can use mine approx class (in C++) if you want

If you want to be more precise use dynamic growth rate

for example interpolation cubic or spline
and add curve fitting or also use approximation ...

Here C++ example (simple ugly slow non precise...)

const int N=3; // cities
const int n=6; // years
int p[3][n]=
    {
    10, 40,  -1,  -1,  320,  640,   // city 1 ... -1 = NA
    -1,300,  -1,1200, 2400,   -1,   // city 2
    -1, -1,4000,8000,16000,32000,   // city 3
    };

void estimate()
    {
    int i,I,*q,i0;
    float m,m0,m1,dm,d,e,mm;
    for (I=0;I<N;I++)   // all cities
        {
        q=p[I];         // pointer to actual city
        // avg growth rate
        for (m0=0.0,m1=0.0,i=1;i<n;i++)
         if ((q[i]>0)&&(q[i-1]>0))
          { m0+=q[i]/q[i-1]; m1++; }
        if (m1<0.5) continue;   // skip this city if avg m not found
        m0/=m1;
        // find m more closelly on interval <0.5*m0,2.0*m0>
        m1=2.0*m0; m0=0.5*m0; dm=(m1-m0)*0.001;
        for (m=-1.0,e=0.0;m0<=m1;m0+=dm)
            {
            // find first valid number
            for (mm=-1,i=0;i<n;i++)
             if (q[i]>0) { mm=q[i]; break; }
            // comute abs error for current m0
            for (d=0.0;i<n;i++,mm*=m0)
             if (q[i]>0) d+=fabs(mm-q[i]);
            // remember the best solution
            if ((m<0.0)||(e>d)) { m=m0; e=d; }
            }
        // now compute missing data using m
        // ascending
        for (mm=-1,i=0;i<n;i++) if (q[i]>0) { mm=q[i]; break; }
        for (;i<n;i++,mm*=m) if (q[i]<0) q[i]=mm;
        // descending
        for (mm=-1,i=0;i<n;i++) if (q[i]>0) { mm=q[i]; break; }
        for (;i>=0;i--,mm/=m) if (q[i]<0) q[i]=mm;
        }
    }

Result:

// input
[    10    40    NA    NA   320   640 ]
[    NA   300    NA  1200  2400    NA ]
[    NA    NA  4000  8000 16000 32000 ]
// output
[    10    40    52   121   320   640 ]
[   150   300   599  1200  2400  4790 ]
[  1000  2000  4000  8000 16000 32000 ]

can add some rounding and tweak it a bit
also you can compute the error in both asc and desc order to be more safe
also can enhance this by using as start point value surrounded by valid values instead of first valid entry
or by computing each NA gap separately

Thanks a lot for your answer. I'm going to try translating that into R language and keep you updated. — goclem, May 06 '15 at 14:16

R an algorithm for missing data imputation

1 Answers1