Get certain data from 2 files into a matrix

Question

I went to the World Bank database and chose 2 files: GDP and literacy rates. Intuitively, I expect there may be a correlation. The problem is therefore to find the correlation between GDP and literacy rates over roughly 60 years for about 200 countries.

Here are the links:

http://data.worldbank.org/indicator/NY.GDP.PCAP.CD?view=chart [FOR GDP]

http://data.worldbank.org/indicator/SE.ADT.LITR.ZS?view=chart [FOR LIT]

I got the data in .CSV format and read it after skipping a few lines from the top.

Then, this is the code I started writing:

Lit = read.csv("C:/DIRECTORY/API_SE.ADT.LITR.ZS_DS2_en_csv_v2.csv", skip = 3, header = TRUE, dec = ".")
Gdp = read.csv("C:/DIRECTORY/API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv", skip = 3, header = TRUE, dec = ".")



#creating a list of variables for each different year
#Without initializing the variables here, the code below did not work

for (i in 5:62)
{
assign(paste0("year", i), 0*i)
}



#running a loop for all the values of each dataset
#The desired result of this is 55 vectors (1 for each year) of some length
#(as there are many missing values) which have in them values of gdp and lit
#of the same country in the same row

for (y in 5:62){
  for (c in 1:264){


#checking if values are available as many values are missing
q = is.na(Gdp[c,y])
r = is.na(Lit[c,y])

#now we will assign the values to the specific year

  assign(paste0("year", y), c(Gdp[c,y], Lit[c,y]))

}}

What I get from this is 55 vectors (titled year1 to year55) with 2 values in each.

I understand what is happening: for each vector, only the last values are kept (each pair is replaced by the next, and so on until the last).

Now, What would be ideal, is a way to grow the year vector so that it contains all the coexisting (i.e. when a country, for a given year, has both gdp and lit values) values for a given year.

Welcome to SO. I'm having a hard time understanding your question. Are you asking how to put the data into long form, so that there is a row for each combination of country & year with observations of gdp, and lit? — C8H10N4O2, Aug 15 '17 at 14:57
Hey Caffeine - thanks. I am asking how to put the data into a matrix form so that each matrix is for one year and has 2 columns (GDP, LIT) and as many rows as there is a country with data for both GDP and LIT for that year... — Sharma Ji, Aug 15 '17 at 15:03
OK - it would help if you made a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including the code to either download & unzip the files in question or (even better) little example versions of your vectors (maybe for year1-year3) — C8H10N4O2, Aug 15 '17 at 15:09
You might want to look into the `wbstats` package, which provides an API interface to the worldbank data — Jake Kaupp, Aug 15 '17 at 15:49
@C8H10N4O2 For the reproducible example: I read the link and am happy to tell you that I am using only packages already built into R. The data is easily available through the links by clicking download .csv. For a basic idea of the data, the columns are: Country name, Indicator, Year 1, Year 2, Year 3, ...; an example row would be: x, GDP, 50, 60, 70. I would be more than happy to provide any additional info — Sharma Ji, Aug 15 '17 at 16:46
@JakeKaupp Thanks a lot for the package! It's very useful... while it does help with what I am trying to do, it does distract from being able to write my own code for other similar data sets. Thanks again but :) — Sharma Ji, Aug 15 '17 at 16:54

Sean Murphy · Answer 1 · 2017-08-15T17:53:53.157

Out of curiosity, have you worked with MATLAB before? Your approach looks a lot like what I would have tried early on, and I came to R from MATLAB. In R, when possible, I'd recommend performing operations on whole columns/variables of a data.frame rather than iterating through them cell by cell.

Forgive me if this response isn't formatted well; I'm fairly new to Stack Exchange.

I'm not entirely sure what your goal is here. This code

> for (i in 5:62) { assign(paste0("year", i), 0*i) }

creates 58 numeric objects (5:62 covers 58 values) named "year5", "year6", "year7", and so on through "year62", each containing nothing but the number 0. They are not connected in any way with the rest of your code and are overwritten by the second portion of your code

for (y in 5:62){
  for (c in 1:264){


#checking if values are available as many values are missing
q = is.na(Gdp[c,y])
r = is.na(Lit[c,y])

#now we will assign the values to the specific year

  assign(paste0("year", y), c(Gdp[c,y], Lit[c,y]))

}}

which, in its last line, creates each object and overwrites it again on every pass through the inner loop, so only the values from the last row (c = 264) survive for each year.
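As a minimal sketch of one way to avoid the overwriting (not the only way, and slower than a vectorized approach), the pairs can be appended to a named list instead of being assigned to separate objects. The data frames `gdp_toy` and `lit_toy`, their values, and the column names `X2000` and `X2001` are made up for illustration and only stand in for `Gdp` and `Lit`.

```r
# Toy stand-ins for the World Bank tables (all values are made up).
gdp_toy <- data.frame(Country = c("A", "B", "C"),
                      X2000 = c(10, NA, 30),
                      X2001 = c(11, 21, NA))
lit_toy <- data.frame(Country = c("A", "B", "C"),
                      X2000 = c(90, 80, 70),
                      X2001 = c(NA, 81, 71))

year_cols <- c("X2000", "X2001")
years <- list()  # one list entry per year instead of year5, year6, ... objects

for (y in year_cols) {
  pairs <- NULL
  for (i in seq_len(nrow(gdp_toy))) {
    g <- gdp_toy[i, y]
    l <- lit_toy[i, y]
    # keep the pair only when both values are present
    if (!is.na(g) && !is.na(l)) {
      pairs <- rbind(pairs, c(GDP = g, LIT = l))  # append rather than overwrite
    }
  }
  years[[y]] <- pairs  # note: a year with no complete pairs gets no entry
}
```

Each element of `years` is then a two-column matrix with one row per country that has both values for that year.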

As to what you are trying to accomplish:

Now, What would be ideal, is a way to grow the year vector so that it contains all the coexisting (i.e. when a country, for a given year, has both gdp and lit values) values for a given year.

This is difficult, as selecting only the data that exists in both Gdp and Lit will create an oddly shaped data frame.
If you run

!is.na(Gdp) & !is.na(Lit)

you can see this: each TRUE marks a country and year with a value in both datasets, and each FALSE marks one where at least one of the two values is missing.
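A small sketch of that logical mask on toy data may make the ragged shape easier to see. The data frames `gdp_toy` and `lit_toy` and their values are made up for illustration; only the masking expression comes from the answer.

```r
# Toy stand-ins for the year columns of Gdp and Lit (values are made up).
gdp_toy <- data.frame(X2000 = c(10, NA, 30), X2001 = c(11, 21, NA))
lit_toy <- data.frame(X2000 = c(90, 80, 70), X2001 = c(NA, 81, 71))

# TRUE where both tables have a value for that country (row) and year (column)
both_present <- !is.na(gdp_toy) & !is.na(lit_toy)

# number of usable country pairs per year; it differs from column to column
counts <- colSums(both_present)
```

Because the count of TRUE values differs between columns, the filtered result for each year has a different number of rows, which is why a list is a more natural container than a single rectangular table.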

EDIT:

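If I am correct in understanding your response, try this:

mapply(FUN = function(x, y) {
  output <- cbind(GDP = x, LIT = y)
  # keep only rows where both values are present; drop = FALSE keeps a matrix even for a single row
  output[!is.na(x) & !is.na(y), , drop = FALSE]
}, x = Gdp[, 5:62], y = Lit[, 5:62], SIMPLIFY = FALSE)  # SIMPLIFY = FALSE always returns a list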

What this does is, for each pair of year columns in Gdp and Lit, return the values for that year only for the rows where both values are present. It relies on the rows of the two files being in the same country order.

It returns a list in which each entry is a two-column matrix for one of the years. I am not quite sure this is what you want, though, since the country labels are lost and you can no longer tell which country each row belongs to. You could fix this by rejoining the data with the names, or by modifying the code to keep the country name variable as well, but I'll leave that to you.
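One possible way to keep the country names, sketched on toy data: loop over the year column names with `lapply` and build a small data frame per year. The objects `gdp_toy`, `lit_toy`, the `Country` column name, and all values are assumptions made for illustration; the real files may name their columns differently. The sketch also assumes both tables list the countries in the same order, which it checks first.

```r
# Toy stand-ins for Gdp and Lit (values and column names are made up).
gdp_toy <- data.frame(Country = c("A", "B", "C"),
                      X2000 = c(10, NA, 30),
                      X2001 = c(11, 21, NA),
                      stringsAsFactors = FALSE)
lit_toy <- data.frame(Country = c("A", "B", "C"),
                      X2000 = c(90, 80, 70),
                      X2001 = c(NA, 81, 71),
                      stringsAsFactors = FALSE)

# The row-wise pairing below is only valid if the countries line up.
stopifnot(identical(gdp_toy$Country, lit_toy$Country))

year_cols <- c("X2000", "X2001")
by_year <- lapply(year_cols, function(y) {
  keep <- !is.na(gdp_toy[[y]]) & !is.na(lit_toy[[y]])  # both values present
  data.frame(Country = gdp_toy$Country[keep],
             GDP = gdp_toy[[y]][keep],
             LIT = lit_toy[[y]][keep],
             stringsAsFactors = FALSE)
})
names(by_year) <- year_cols
```

With this layout, a per-year correlation could be computed with something like `cor(by_year$X2000$GDP, by_year$X2000$LIT)` for any year that has at least two complete rows.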

I was introduced to some C++ back in the day, and did indeed work on Matlab, though I guess this has more to do with just being an absolute noob with R. Your analysis of the situation is correct. The end result will be an oddly shaped data frame. Which is expected and wanted. Let me try to explain the aim again: To get 57 vectors (1 for each year of the data) which contain 2 columns (GDP and LIT). The 57 vectors vary in number of rows (depending on no. of countries, for a given year, having both GDP and LIT values.) — Sharma Ji, Aug 15 '17 at 16:59
Thanks a LOT! That seems to solve exactly what I want to do! And I have the values. This is amazing! Can you please explain what and how you did? — Sharma Ji, Aug 15 '17 at 17:59
Certainly, mapply is useful for cases like this where you want to do the same thing to respective columns from two different datasets. For more on that I'd refer you to the help just run `?mapply` . The rest of it is just an anonymous function which takes the two columns (x and y which I've set in the args for mapply to be Gdp and Lit) combines them into one dataframe called output then filters output to only contain results where neither x nor y was an NA. — Sean Murphy, Aug 15 '17 at 18:09
Glad I could help! would appreciate it if you'd accept my answer :) — Sean Murphy, Aug 15 '17 at 18:30
