I have an R dataset of flight data. I need to add 365 columns to this dataset, one for each day of the year, with value 1 if the entry's data[i]$FlightDate falls on that day of the year and 0 otherwise (see this question for why).

Previously I had managed to extract the day of the year from a FlightDate string using lubridate:

data$DayOfYear <- yday(ymd(data$FlightDate))

How would I go about generating these 365 columns and keeping only those columns (along with some others) for a future SVD? I will also need to repeat the same for the time of day (which I will probably split into ranges of 30 or 10 minutes), so an extra 48 to 144 one-hot columns for a different variable will have to be added later.

Note: my dataset contains about 500k flights per month (so about 16k flights for a single DayOfYear if I take just one year of data) and has 100 variables (columns).

Sample input data row data[1,]:

{
  DayOfYear: 10,
  FieldGoodForSvd1: 235,
  FieldBadForSvd2: "some string",
  ...
} 

Sample output data row (after generating 365 binary cols and selecting fields compatible with an SVD)

{
  DayOfYear1: 0,
  ... 
  DayOfYear9: 0, 
  DayOfYear10: 1, // The flight had taken place on that DayOfYear
  DayOfYear11: 0, 
  ...
  DayOfYear365: 0, 
  FieldGoodForSvd1 : 235
} 

EDIT

Suppose my input data matrix looks like this:

DayOfYear ; FieldGoodForSvd1 ; FieldBadForSvd2

1         ; 275              ; "los angeles"
1         ; 256              ; "san francisco"
5         ; 15               ; "chicago"

The final output should be

FieldGoodForSvd1 ; DayOfYear1 ; DayOfYear2 ; ... ; DayOfYear4 ; DayOfYear5 ; DayOfYear6 ; ... ; DayOfYear365

275              ; 1          ; 0          ; ... ; 0          ; 0          ; 0          ; ... ; 0
256              ; 1          ; 0          ; ... ; 0          ; 0          ; 0          ; ... ; 0
15               ; 0          ; 0          ; ... ; 0          ; 1          ; 0          ; ... ; 0
Cyril Duchon-Doris
  • Possible duplicate of [Reshape data from long to wide format R](http://stackoverflow.com/questions/5890584/reshape-data-from-long-to-wide-format-r) – germcd Jan 16 '16 at 21:07
  • @germcd That question seems to be about swapping some rows/columns, but in my case I want to keep the same entries of my dataset and just add 365 more columns based on the content of one column. I also don't want to group any entries together. – Cyril Duchon-Doris Jan 17 '16 at 00:21
  • ah right, I see the difference, [this](http://stackoverflow.com/questions/13727918/r-convert-row-data-to-binary-columns) might help – germcd Jan 17 '16 at 14:16

1 Answer

Here is my final code, which one-hot encodes DayOfYear and TimeSlot and then runs the SVD:

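library(lubridate)  # provides yday() and ymd()

# Keep only the rows where both numeric fields are present
dsan <- d[!is.na(d$FieldGoodForSvd1) & !is.na(d$FieldGoodForSvd2), ]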

# We need factors to perform one hot encoding
dsan$DayOfYear <- as.factor(yday(ymd(dsan$FlightDate)))
# DepTime is encoded as hhmm (e.g. 2055 for 20h55), so this gives roughly hourly slots (2055 -> 21)
dsan$TimeSlot <- as.factor(round(dsan$DepTime/100))

dSvd= with(dsan,data.frame(
  FieldGoodForSvd1,
  FieldGoodForSvd2,
  # ~ performs one hot encoding (on factors), -1 removes intercept term
  model.matrix(~DayOfYear-1,dsan),
  model.matrix(~TimeSlot-1,dsan)
))
theSVD = svd(scale(dSvd))
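
One caveat with as.factor(): it only creates levels for the days that actually occur, so a month of data would yield about 30 columns rather than 365. The sketch below, run on the sample rows from the question's EDIT, shows one way to get a fixed set of columns by declaring the factor levels up front, and one possible way to build the 30-minute slots mentioned in the question. The DepTime values, the 0..47 slot numbering, and the assumption that DepTime lies in 0..2359 are illustrative choices, not part of the original data.

```r
# Sample rows from the question's EDIT section.
d <- data.frame(
  DayOfYear        = c(1, 1, 5),
  FieldGoodForSvd1 = c(275, 256, 15),
  FieldBadForSvd2  = c("los angeles", "san francisco", "chicago"),
  DepTime          = c(5, 2055, 1230)  # hypothetical hhmm values, not in the original sample
)

# Declaring the levels guarantees one column per day, even for days that
# never occur in the data (use 1:366 if leap years are present).
d$DayOfYear <- factor(d$DayOfYear, levels = 1:365)

# 30-minute slots: convert hhmm to minutes since midnight, then bucket.
# Assumes DepTime is within 0..2359; slots are numbered 0..47.
minutes    <- (d$DepTime %/% 100) * 60 + d$DepTime %% 100
d$TimeSlot <- factor(minutes %/% 30, levels = 0:47)

# model.matrix() keeps a column for every declared level; -1 drops the intercept.
day_cols  <- model.matrix(~ DayOfYear - 1, d)
slot_cols <- model.matrix(~ TimeSlot - 1, d)

out <- data.frame(FieldGoodForSvd1 = d$FieldGoodForSvd1, day_cols, slot_cols)

# Constant columns (e.g. days absent from the data) have zero variance, and
# scale() would turn them into NaN, so drop them before calling svd().
keep <- vapply(out, function(col) sd(col) > 0, logical(1))
s    <- svd(scale(out[, keep, drop = FALSE]))
```

The trade-off is that fixed levels give a stable column layout across different subsets of the data, while the zero-variance filter has to be applied before scaling. With as.factor() as in the code above, every level is present in the data by construction, so the filter only matters when a column happens to be constant.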
Cyril Duchon-Doris