0

I am relatively new to R and I am facing an issue in extracting summarized data from one dataset and then storing this data in another dataset. I am presently trying to analyze flight dataset along with the weather information. I am a having a query in this regard which is as follows:

I am having one dataset for the flight data.This dataset comprises of columns such as flight number, date,origin city name, origin state name, departure time, departure time block(each block of 1 hour duration spanning the entire 24 hours), arrival time, arrival time block, flight distance, total flight time and a binary response variable indicating whether the flight was delayed or not.

I have another dataset which comprises of weather data. This dataset includes columns such as date, time of recording, hour block(each block of 1 hour duration spanning the entire 24 hours), state, city and the respective weather related data such as temperature,humidity levels, wind speed etc. For each city,state,date and a single hour block, there can be several entries in the weather dataset indicating varying temperature levels at various times in the hour block.

I am trying to extract the temperature and humidity levels from the weather dataset and place this information in the flight dataset. I have created four new columns in the flight dataset for origin temperature,origin humidity, destination temperature and destination humidity for this purpose.

My objective is to store the average temperature and humidity levels for the origin city during the departure time hour block as well as for the destination city during the arrival time hour block in the flight dataset. The reason I am choosing average temperature and humidity is because there can be multiple entries in weather dataset for the same hour block,city and state combination.

I have tried to extract the origin temperature information from weather dataset using the following loop:

for(i in 1:nrow(flight))  
  {

  flight[[i,13]]=mean(weather$temperature[weather$date == flight[[i,1]] & weather$time_blk == train[[i,5]]
              & weather$state == train[[i,3]] & weather$city == train[[i,2]]])

  }

However, the weather data set contains around 2 Million rows and the flight dataset also comprises of around 20000 observations. As a result, the above calculation takes a lot of time to execute and even then the origin temperature data in flight data set is not calculated properly.

I am also attaching the images of sample flight and weather datasets for reference.

Can you please suggest another approach for obtaining the above information considering the constraints of the large datasets and the time required for processing.

enter image description here enter image description here

Willie Cheng
  • 7,679
  • 13
  • 55
  • 68
rj09
  • 1
  • 1
  • 1
    Please use `dput` to show a small reproducible example instead of images. – akrun May 13 '16 at 01:53
  • http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Bulat May 13 '16 at 02:56
  • What you would need to do: (i) create a numerical variable (e.g. Julian date) for date and time on both arrival & weather data; (ii) subset weather data-set by finding overlaps for based on the numerical time format for arrival/departure, using `foverlaps` (data.table package) see http://stackoverflow.com/questions/25815032/finding-overlaps-between-interval-sets-efficient-overlap-joins (iii) profit.... oops! I meant summarse/aggregate; (iv) merge – Adam Quek May 13 '16 at 04:04
  • 2
    Yecch. Imaged data. Moving on to next question. – IRTFM May 13 '16 at 04:07

0 Answers0