0

I am trying to make a dataset of railway timetable data usable for a project. The raw data has the following structure:

#00000001
%100,11410,      ,001,002,Nachtnettrein                 
%100,01412,      ,002,007,Nachtnettrein                 
-00002,000,999
&IC  ,001,007
>rtd    ,2532
?11   ,11   ,00002
;rtn    
;rta    
;cps    
;nwk    
+gd     ,2550,2557
?5    ,5b   ,00002
;ztmo   
;ztm    
;ypb    
;vb     
;gvc    
+gv     ,2616,2623
?6    ,6    ,00002
;laa    
;gvm    
;vst    
;dvnk   
+ledn   ,2635,2645
?5b   ,5b   ,00002
;ssh    
;nvp    
;hfd    
+shl    ,2700,2702
?1-2  ,1-2  ,00002
;asdl   
;ass    
+asd    ,2716,2719
?11a  ,11a  ,00002
;asdm   
;asa    
;dvd    
;asb    
;ashd   
;ac     
;bkl    
;mas    
;utzl   
<ut     ,2758
?5    ,5    ,00002
#00000002
%100,11410,      ,001,002,Nachtnettrein 

I am interested in separating this data based on trip number. Each trip number starts with an #. I use the following code to separate by trip number:

DF <- read.delim("timetbls.dat", sep="#", skip=1)

This produces the following result:

> head(DF)
                                        X X00000001
1 %100,11410,      ,001,002,Nachtnettrein         1
2 %100,01412,      ,002,007,Nachtnettrein        NA
3                          -00002,000,999        NA
4                           &IC  ,001,007        NA
5                           >rtd    ,2532        NA
6                      ?11   ,11   ,00002        NA

Where column X contains some of the data for that particular trip and column X00000001 contains the trip number. The trip number is only shown for the first line of each trip. The rest of the values are NA until another trip number is detected. Now I want to merge the rows that contain the same trip. As a first step I tried creating a for loop that fills every NA with it's corresponding trip number, so I could perhaps group by this trip number later. The problem is that this for loop is painstakingly slow for a million-plus rows dataset.

DF$X00000001[1] <- 1 #Initializing first number, which is normally missing out
DF$Trip[1] <-1 #Bruteforcing first trip number


for(i in dim(DF)[1] {

  ifelse(DF$X00000001[i] == NA, DF$Trip[i] <- DF$Trip[i-1], DF$Trip[i] <-DF$X00000001[i])
}

I am looking for a replacement of the for loop or any other method to gather the data belonging to one trip in a single row. Thanks!

Kevin
  • 61
  • 6
  • 1
    Have you tried using `data.table`? It's usually the speediest option for bigger data sets. This previous question might help you: https://stackoverflow.com/questions/38526086/speed-up-this-loop-to-create-dummy-columns-with-data-table-and-set-in-r – Randall Helms Oct 13 '18 at 10:48
  • 1
    I'd suggest reading this (and the following post about the `unheadr` package), which walk through a similar challenge: https://luisdva.github.io/rstats/unbreaking-vals/ – Jon Spring Oct 13 '18 at 20:53
  • @JonSpring Excellent suggestion. This link solved all of my problems, thank you very much. – Kevin Oct 14 '18 at 15:09

0 Answers0