I have time series data in a fixed-width file where observation rows (n varies with the sample size) sit under a "header" row that contains the important metadata (i.e. sample number, date, etc.). Both row types contain alphanumeric characters. It looks something like this (strings shortened for readability):
4 64001416230519844TP blahblah
5416001130 1 F 492273
5416001140 3 F 492274
5416001145 1 F 492275
5416001150 19 F 492276
5416001155 21 F 492277
5416001160 21 F 492278
5416001165 13 F 492279
5416001170 3 F 492280
5416001180 1 F 492281
4 64001544250619844RA blahblah
5544001125 1 F 492291
5544001130 3 F 492292
5544001135 4 F 492293
5544001140 11 F 492294
5544001145 13 F 492295
4 64002544250619844RA blahblah
etc.
Header rows are distinguished by a first character of 4 and are 89 characters long; observation rows start with 5 and are 24 characters long.
What I would like is to paste the header row onto each subsequent observation row (the subset of data) so that I can later parse the string with read_fwf and be assured that I can sort each observation by the information contained in its header row. I don't care if the original header row is dropped. Like this:
5416001130 1 F 492273 4 64001416230519844TP blahblah
5416001140 3 F 492274 4 64001416230519844TP blahblah
5416001145 1 F 492275 4 64001416230519844TP blahblah
5416001150 19 F 492276 4 64001416230519844TP blahblah
5416001155 21 F 492277 4 64001416230519844TP blahblah
5416001160 21 F 492278 4 64001416230519844TP blahblah
5416001165 13 F 492279 4 64001416230519844TP blahblah
5416001170 3 F 492280 4 64001416230519844TP blahblah
5416001180 1 F 492281 4 64001416230519844TP blahblah
5544001125 1 F 492291 4 64001544250619844RA blahblah
5544001130 3 F 492292 4 64001544250619844RA blahblah
5544001135 4 F 492293 4 64001544250619844RA blahblah
5544001140 11 F 492294 4 64001544250619844RA blahblah
5544001145 13 F 492295 4 64001544250619844RA blahblah
etc...
The closest solution I found was here: fwf file with headers every 5th row, where the headers were characters and the observations numeric. The solution was a loop that iterates through the rows, tests whether each is character or numeric, and pastes them together accordingly:
text <- readLines('/path/to/file')                    # read in the file
split_text <- strsplit(text, "\\s+")                  # split each line on whitespace
for (line in split_text) {                            # iterate through lines
  numeric_line <- suppressWarnings(as.numeric(line))  # try to convert the current line into a vector of numbers
  if (is.na(numeric_line[[1]])) {                     # if it fails, we know we're on a header line
    header <- line
  } else {
    for (i in seq(1, length(line), 2)) {              # otherwise, we're on a data line, so take two numbers at once
      print(c(header, line[[i]], line[[i + 1]]))      # and output the latest header with each pair of values
    }
  }
}
I tried to adapt this to my data by first reading the file with read.fwf() or read_fwf() and putting the first character in its own column to distinguish headers from observations:
packages <- c('tidyverse', 'rgdal', 'car', 'audio', 'beepr', 'xlsx',
              'magrittr', 'lubridate', 'RColorBrewer', 'haven')
invisible(lapply(packages, function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
    require(x, character.only = TRUE)
  }
}))
DF <- read.fwf("directory/.dat", widths = c(1, 88), header = FALSE)
My adaptation:
newdf <- for (i in DF) {                           # iterate through lines
  if (DF$V1 == 4) {                                # if true, we know we're on a header row
    header <- i
  } else {
    for (i in seq(1, length(DF$V2), 1)) {          # otherwise = observation row
      print(c(header, DF$V2[[i]], DF$V2[[i + 1]])) # and output the latest header with each observation until you hit another header
    }
  }
}
# This is very slow and/or does not work. I get the following warning:
# Warning messages:
# 1: In if (DF$V1 == 4) { :
#   the condition has length > 1 and only the first element will be used
I have also tried distinguishing header vs. observation rows with nchar() (headers == 89, observations == 24). I realise the loop solution here might be to use ifelse, but another issue arises: the dataset is ~39,700 rows long and I get new data all the time, so loops would take a long time...
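For reference, the warning appears because `for (i in DF)` iterates over the columns of the data frame, so `DF$V1 == 4` is a whole vector inside `if ()`. A row-wise version of the loop (a sketch built from a few of the sample rows above; it runs, but is still slow) would look like:

```r
# Sketch: iterate over rows, not columns, so DF$V1[i] is a single value.
# DF here is built by hand from the sample rows; in practice it would come
# from read.fwf() with widths = c(1, 88).
DF <- data.frame(V1 = c(4, 5, 5),
                 V2 = c(" 64001416230519844TP blahblah",
                        "416001130 1 F 492273",
                        "416001140 3 F 492274"),
                 stringsAsFactors = FALSE)

out <- character(0)
header <- NA_character_
for (i in seq_len(nrow(DF))) {
  if (DF$V1[i] == 4) {                              # header row: remember it
    header <- paste0(DF$V1[i], DF$V2[i])
  } else {                                          # observation row: append the latest header
    out <- c(out, paste(paste0(DF$V1[i], DF$V2[i]), header))
  }
}
```

Growing `out` with `c()` inside the loop is itself slow, which is another reason to prefer a vectorized approach.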
I would like to do this with the data.table or dplyr syntax.
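One vectorized sketch in data.table syntax, assuming the raw file is read with readLines(): flag each line by its first character, number the header groups with cumsum(), and attach the first line of each group (its header) to every observation in that group. The `lines` vector below just reuses a few of the sample rows.

```r
library(data.table)

# Stand-in for readLines('/path/to/file')
lines <- c("4 64001416230519844TP blahblah",
           "5416001130 1 F 492273",
           "5416001140 3 F 492274",
           "4 64001544250619844RA blahblah",
           "5544001125 1 F 492291")

dt <- data.table(line = lines)
dt[, grp := cumsum(substr(line, 1, 1) == "4")]  # each header starts a new group
dt[, header := line[1L], by = grp]              # first line of a group is its header
result <- dt[substr(line, 1, 1) == "5", paste(line, header)]
```

No explicit loop, so it should stay fast even as the file grows past ~39,700 rows.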
I've tried playing with dplyr::lag as per these posts: dplyr example 1 and dplyr example 2, and get close to what I want:
newdf <- DF %>%
  mutate(new = replace(lag(V2), V1 != '5', NA))
But as you can see, the new column only pastes the information from the previous row... as lag() is supposed to do.
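The piece lag() cannot provide is "carry the header forward until the next header", which is exactly what tidyr::fill() does. A sketch assuming DF was read with widths = c(1, 88) as above (the small DF here is built by hand from the sample rows):

```r
library(dplyr)
library(tidyr)

DF <- data.frame(V1 = c(4, 5, 5, 4, 5),
                 V2 = c(" 64001416230519844TP blahblah",
                        "416001130 1 F 492273",
                        "416001140 3 F 492274",
                        " 64001544250619844RA blahblah",
                        "544001125 1 F 492291"),
                 stringsAsFactors = FALSE)

newdf <- DF %>%
  mutate(header = ifelse(V1 == 4, paste0(V1, V2), NA_character_)) %>%
  fill(header) %>%                  # carry each header down to its observations
  filter(V1 == 5) %>%               # drop the original header rows
  mutate(combined = paste(paste0(V1, V2), header))
```

The `combined` column then matches the desired output, and the string can be handed to read_fwf for parsing.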
Any help would be greatly appreciated, thank you in advance.
As a side note, this data was previously processed in SAS, but since I don't do SAS, there you go. I do have the SAS code if that helps:
DATA A1;
FILENAME FREQLONG 'dir/FL.DAT';
INFILE FREQLONG;
INPUT
TYPE 1 @ ;
IF TYPE EQ 4 THEN LINK LIGNE4;
IF TYPE EQ 5 THEN DELETE;
RETURN;
LIGNE4:
INPUT var1 $ 6 - 8
var2 $ 9 - 11
var3 12 - 13
var4 14 - 15
var5 18 - 19
var6 $ 20 - 22
var7 $ 44 - 46
var8 $ 78;
DATA A2;
FILENAME FREQLONG 'dir/FL.DAT';
INFILE FREQLONG;
INPUT
TYPE 1 @ ;
IF TYPE EQ 4 THEN DELETE;
IF TYPE EQ 5 THEN LINK LIGNE5;
RETURN;
LIGNE5:
INPUT var1 $ 5 - 7
var2 $ 2 - 4
varz 8 - 10
vara 11 - 13
varb $ 15;
DATA A3;
SET A1;
PROC SORT;
BY var1 var2;
RUN;
DATA A4;
SET A2;
PROC SORT;
BY var1 var2;
RUN;
DATA A5;
MERGE A4 A3;
BY var1 var2;
RUN;
As you can see, it splits the file, sorts the variables, and merges them. However, this was done year by year, and I want to work with a single file containing all the years.
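In R, there should be no need to split and merge at all: once each observation carries its header string (as above), a single read_fwf() call can parse both parts in one pass. The sketch below assumes readr >= 2.0 (for the I() literal-data wrapper); the observation fields follow the type-5 SAS INPUT positions, and the header fields are the type-4 positions offset by 25, since the header begins at character 26 of the combined string. All positions here are assumptions to check against the real layout.

```r
library(readr)

# One combined line: the 24-character observation padded out,
# a space, then the 89-character header.
combined <- paste(sprintf("%-24s", "5416001130 1 F 492273"),
                  "4 64001416230519844TP blahblah")

parsed <- read_fwf(
  I(combined),
  fwf_positions(
    start = c(1, 2, 5, 26, 26 + 5),   # header fields: SAS position + 25
    end   = c(1, 4, 7, 26, 26 + 7),
    col_names = c("type", "var2", "var1", "h_type", "h_var1")
  ),
  col_types = cols(.default = col_character())
)
```

From here, a sort on the header variables (e.g. dplyr::arrange(parsed, h_var1)) replaces the PROC SORT / MERGE steps, and the whole file can be processed at once rather than year by year.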