1

I have a big chunk of Excel Data with 4260 lines which I want to process using R (and the XLConnect package). There are 43 different districts in these 4260 lines, each district containing 98 lines. Instead of repeating a chunk of similar code for every district 43 times, I want to use a loop going through all lines and storing specific data (columns) into vectors.

Here is an abstract of my initial primitive code:

wb = loadWorkbook("Bel_housing_prices_disctrict.xlsx") 

#--------------Malines (district 1)
df_Malines = readWorksheet(wb, sheet="Par arrondissement", startRow = 103, endRow = 201, startCol = 0, endCol = 0)
maisons_total_price_Malines <- df_Malines$Col8
maisons_price_mean_Malines <- df_Malines$Col10

villas_total_price_Malines <- df_Malines$Col18
villas_price_mean_Malines <- df_Malines$Col20

#--------------Turnhout (district 2)
df_Turnhout = readWorksheet(wb, sheet="Par arrondissement", startRow = 202, endRow = 300, startCol = 0, endCol = 0)
maisons_total_price_Turnhout <- df_Turnhout$Col8
maisons_price_mean_Turnhout <- df_Turnhout$Col10

villas_total_price_Turnhout <- df_Turnhout$Col18
villas_price_mean_Turnhout <- df_Turnhout$Col20

#-------------- (district 3)

What changes at every district (Malines, Turnhout,... etc.) is the "startRow"- and the "endRow"- value, which increase by the number 99 until the last line 4260 is reached. I thought of writing a loop that looks somehow like this:

i=103
for (n in c("Malines","Turnhout","district3",...))
{
df_"n" = readWorksheet(wb, sheet="Par arrondissement", startRow = i, endRow = i+98, startCol = 0, endCol = 0)
maisons_total_price_"n" <- df_"n"$Col8
...
i=i+99
}

But of course, this loop/function does not work because I did it wrong and I could not find a solution so far... It is merely an "idea" for a function. The function (loop) would create new variables with my data "on the go", by runnning through all 4260 lines. If my idea of saving this data do different variables is wrong, I am glad for every alternative solution (list?)?

I hope I made my problem clear and I am glad for any tips on a neat solution!

Best regards

Gilles Cosyn
  • 435
  • 2
  • 9
  • 17
  • Please make the question more clear. You have a function which you want to use loop to apply on 34 data set with the same structure etc? –  Mar 01 '15 at 13:22
  • @Nemo I edited my question in the hope that it is more clear now. – Gilles Cosyn Mar 01 '15 at 14:16

2 Answers2

2

From you post, I assumed you had a vector with the names of all the districts you are interested in.

You can use an apply type function to get the data from all the districts:

wb <- loadWorkbook("your_file.xlsx") 

#dummy vector of district names
districts<-c("district1","district2","district3")

#creates a vector of startRows
p<-seq(from=103,to=(length(districts)+1)*99,by=99)
p
#[1] 103 202 301

#for each value of p, get the rows, and rename the columns of the dataframes
data_list<-lapply(p,function(x){
        df<-readWorksheet(wb, sheet="Par arrondissement", startRow = x, endRow = x+98)[,c(8,10,18,20)];
        colnames(df)<-c("maisons_total_price","maisons_price_mean","villas_total_price","villas_price_mean")
        df$sum_price<-df$maisons_total_price+df$villas_price_mean
        df})

#this will return a list of dataframes, one for each district.
#to easily access them, add the names of the districts.
names(data_list)<-district

With this list, you can access the data for each district by using data_list$district1 for example.

If you want to make a dataframe of all your data, you can do, assuming there are no rows missing:

data<-do.call(cbind,res)

You can then access the data for each district using data$district1.maisons_total_price for example

NicE
  • 21,165
  • 3
  • 51
  • 68
  • What can I do if I want to add one (or more) colums in each district? A column that is calculated as the division from 2 other columns. For example: price_per_m2 = total_price/total_surface – Gilles Cosyn Mar 01 '15 at 17:11
  • you can modify the `df` dataframe in the function as you would any other dataframe, for example I added to the code a column `sum_price` that is the sum of two other columns – NicE Mar 01 '15 at 17:44
1

First, it is much easier for others to help you if you are able to provide a reproducible example....

One way of accomplishing this is with a combination of assign() and paste0() or paste(). Run this simple example in a clean session to get a feel for what is happening:

for (i in 1:3) {
  assign(paste0("Variable_", i), i)
}

Extending to your example, you should be able to do something like this:

i=103
for (n in c("Malines","Turnhout","district3",...))
{
  assign(paste0("df_", n), readWorksheet(wb, sheet="Par arrondissement", startRow = i, endRow = i+98, startCol = 0, endCol = 0))
  assign(paste0("maisons_total_price_", n), paste0("maisons_total_price_", n, "$Col8"))
  ...
  i=i+99
}  
Community
  • 1
  • 1
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116