2

I have a matrix M[i,j] with a large number of NA representing the population stock in a number of urban areas [i,] over the period 2000-2013 [,j]. I would like to complete missing data assuming that population grows at a constant rate.

I would like to run a loop which, for each row i of the matrix M 1. Computes the average growth rate using all the non-missing observations 2. Fills in the gaps using imputed values

Generating an example dataset:

city1=c(10,40,NA,NA,320,640);
city2=c(NA,300,NA,1200,2400,NA);
city3=c(NA,NA,4000,8000,16000,32000);
mat=rbind(city1,city2,city3)

Since the average growth rate is 4 for city1, 3 for city2, and 2 for city3, the corresponding result matrix should be:

r.city1=c(10,40,80,160,320,640);
r.city2=c(100,300,600,1200,2400,4800);
r.city3=c(1000,2000,4000,8000,16000,32000);
r.mat=rbind(city1,city2,city3)

Do you have any idea of how I could go about this?

Best,

Clément

Johan
  • 74,508
  • 24
  • 191
  • 319
goclem
  • 904
  • 1
  • 10
  • 21
  • 1
    I think you have a mistake in `city1` as once it grows by `4` and the other time only by `2`. Also your desired output shows growth rate of `2` instead of `4` (as per your description). – David Arenburg May 06 '15 at 10:33
  • @DavidArenburg yep you're right the growth rates are the same for all cities unless the entries are not representing the same time ... between cities ... have aded some code to answer – Spektre May 06 '15 at 11:17

1 Answers1

1

simply approximate the growth rate

  • you have 1D array p[n] per city so start with that (p as population)
  • index i={0,1,2...,n-1} means time in some constant step
  • so if growth rate is constant (let call it m) then

    p[1]=p[0]*m
    p[2]=p[0]*(m^2)
    p[3]=p[0]*(m^3)
    p[i]=p[0]*(m^i)
    

So now just guess or approximate m and minimize the distance from each known point

  • as a start point you can get 2 continuous known values
  • m0=p[i+1]/p[i];
  • and then make loop around this value minimizing error for all known values at once
  • also you can use mine approx class (in C++) if you want

If you want to be more precise use dynamic growth rate

  • for example interpolation cubic or spline
  • and add curve fitting or also use approximation ...

Here C++ example (simple ugly slow non precise...)

const int N=3; // cities
const int n=6; // years
int p[3][n]=
    {
    10, 40,  -1,  -1,  320,  640,   // city 1 ... -1 = NA
    -1,300,  -1,1200, 2400,   -1,   // city 2
    -1, -1,4000,8000,16000,32000,   // city 3
    };

void estimate()
    {
    int i,I,*q,i0;
    float m,m0,m1,dm,d,e,mm;
    for (I=0;I<N;I++)   // all cities
        {
        q=p[I];         // pointer to actual city
        // avg growth rate
        for (m0=0.0,m1=0.0,i=1;i<n;i++)
         if ((q[i]>0)&&(q[i-1]>0))
          { m0+=q[i]/q[i-1]; m1++; }
        if (m1<0.5) continue;   // skip this city if avg m not found
        m0/=m1;
        // find m more closelly on interval <0.5*m0,2.0*m0>
        m1=2.0*m0; m0=0.5*m0; dm=(m1-m0)*0.001;
        for (m=-1.0,e=0.0;m0<=m1;m0+=dm)
            {
            // find first valid number
            for (mm=-1,i=0;i<n;i++)
             if (q[i]>0) { mm=q[i]; break; }
            // comute abs error for current m0
            for (d=0.0;i<n;i++,mm*=m0)
             if (q[i]>0) d+=fabs(mm-q[i]);
            // remember the best solution
            if ((m<0.0)||(e>d)) { m=m0; e=d; }
            }
        // now compute missing data using m
        // ascending
        for (mm=-1,i=0;i<n;i++) if (q[i]>0) { mm=q[i]; break; }
        for (;i<n;i++,mm*=m) if (q[i]<0) q[i]=mm;
        // descending
        for (mm=-1,i=0;i<n;i++) if (q[i]>0) { mm=q[i]; break; }
        for (;i>=0;i--,mm/=m) if (q[i]<0) q[i]=mm;
        }
    }

Result:

// input
[    10    40    NA    NA   320   640 ]
[    NA   300    NA  1200  2400    NA ]
[    NA    NA  4000  8000 16000 32000 ]
// output
[    10    40    52   121   320   640 ]
[   150   300   599  1200  2400  4790 ]
[  1000  2000  4000  8000 16000 32000 ]
  • can add some rounding and tweak it a bit
  • also you can compute the error in both asc and desc order to be more safe
  • also can enhance this by using as start point value surrounded by valid values instead of first valid entry
  • or by computing each NA gap separately
Community
  • 1
  • 1
Spektre
  • 49,595
  • 11
  • 110
  • 380
  • Thanks a lot for your answer. I'm going to try translating that into R language and keep you updated. – goclem May 06 '15 at 14:16