3

I'd like to split a dataset made of character strings into columns specified by start and end.

My dataset looks something like this:

>head(templines,3)
[1] "201801 1  78"
[2] "201801 2  67"
[3] "201801 1  13"

and i'd like to split it by specifying my columns using the data dictionary:

>dictionary
col_name col_start col_end  
year      1         4  
week      5         6  
gender    8         8  
age       11        12  

so it becomes:

year    week    gender    age
2018    01      1         78
2018    01      2         67
2018    01      1         13

In reality the data comes from a long running survey and the white spaces between some columns represent variables that are no longer collected. It has many variables so i need a solution that would scale.

In tidyr::separate it looks like you can only split by specifying the position to split at, rather than the start and end positions. Is there a way to use start / end?

I thought of doing this with read_fwf but I can't seem to be able to use it on my already loaded dataset. I only managed to get it to work by first exporting as a txt and then reading from this .txt:

write_lines(templines,"t1.txt")

read_fwf("t1.txt", 
     fwf_positions(start = dictionary$col_start,
                   end = dictionary$col_end,
                   col_names = dictionary$col_name)

is it possible to use read_fwf on an already loaded dataset?

chrisjacques
  • 635
  • 1
  • 5
  • 17

6 Answers6

3

Answering your question directly: yes, it is possible to use read_fwf with already loaded data. The relevant part of the docs is the part about the argument file:

Either a path to a file, a connection, or literal data (either a single string or a raw vector).
...
Literal data is most useful for examples and tests. 
It must contain at least one new line to be recognised as data (instead of a path).

Thus, you can simply collapse your data and then use read_fwf:

templines %>% 
  paste(collapse = "\n") %>% 
  read_fwf(., fwf_positions(start = dictionary$col_start,
                            end = dictionary$col_end,
                            col_names = dictionary$col_name))

This should scale to multiple columns, and is fast for many rows (on my machine for 1 million rows and four columns about half a second).

There are a few warnings regarding parsing failures, but they stem from your dictionary. If you change the last line to age, 11, 12 it works as expected.

Thomas K
  • 3,242
  • 15
  • 29
2

A solution with substring:

library(data.table)
x <- transpose(lapply(templines, substring, dictionary$col_start, dictionary$col_end))
setDT(x)
setnames(x, dictionary$col_name)
# > x
#    year week gender age
# 1: 2018   01      1  78
# 2: 2018   01      2  67
# 3: 2018   01      1  13
mt1022
  • 16,834
  • 5
  • 48
  • 71
1

How about this?

data.frame(year=substr(templines,1,4), 
           week=substr(templines,5,6), 
           gender=substr(templines,7,8), 
           age=substr(templines,11,13))
Amnon
  • 2,212
  • 1
  • 19
  • 35
  • too simplistic unfortunately,, sorry i should have specified that i need a solution that would scale to hundreds of variables... – chrisjacques Jul 03 '18 at 11:10
1

Using base R:

m = list(`attr<-`(dat$col_start,"match.length",dat$col_end-dat$col_start+1))

d = do.call(rbind,regmatches(x,rep(m,length(x))))

setNames(data.frame(d),dat$col_name)

  year week gender age
1 2018   01      1  78
2 2018   01      2  67
3 2018   01      1  13

DATA USED:

x = c("201801 1  78", "201801 2  67", "201801 1  13")

dat=read.table(text="col_name col_start col_end  
           year      1         4  
           week      5         6  
           gender    8         8  
           age       11        13 ",h=T)
Onyambu
  • 67,392
  • 3
  • 24
  • 53
1

We could use separate from tidyverse

library(tidyverse)
data.frame(Col = templines) %>% 
      separate(Col, into = dictionary$col_name, sep= head(dictionary$col_end, -1))
#  year week gender  age
#1 2018   01      1   78
#2 2018   01      2   67
#3 2018   01      1   13

The convert = TRUE argument can also be used with separate to have numeric columns as output

tibble(Col = templines) %>% 
   separate(Col, into = dictionary$col_name, 
       sep= head(dictionary$col_end, -1), convert = TRUE)
# A tibble: 3 x 4
#   year  week gender   age
#  <int> <int>  <int> <int>
#1  2018     1      1    78
#2  2018     1      2    67
#3  2018     1      1    13

data

dictionary <- structure(list(col_name = c("year", "week", "gender", "age"), 
col_start = c(1L, 5L, 8L, 11L), col_end = c(4L, 6L, 8L, 13L
)), .Names = c("col_name", "col_start", "col_end"),
 class = "data.frame", row.names = c(NA, -4L))

templines <- c("201801 1  78", "201801 2  67", "201801 1  13")
akrun
  • 874,273
  • 37
  • 540
  • 662
0

This is an explicit function which seems to be working the way you wanted.

split_func<-function(char,ref,name,start,end){
  res<-data.table("ID" = 1:length(char))
  for(i in 1:nrow(ref)){
    res[,ref[[name]][i] := substr(x = char,start = ref[[start]][i],stop = ref[[end]][i])]
  }
  return(res)
}

I have created the same input files as you:

templines<-c("201801 1  78","201801 2  67","201801 1  13")
dictionary<-data.table("col_name" = c("year","week","gender","age"),"col_start" = c(1,5,8,11),
                       "col_end" = c(4,6,8,13))
#   col_name col_start col_end
#1:     year         1       4
#2:     week         5       6
#3:   gender         8       8
#4:      age        11      13

As for the arguments,
char - The character vector with the values you want to split
ref - The reference table or dictionary
name - The column number in the reference table containing the column names you want
start - The column number in the reference table containing the start points
end - The column number in the reference table containing the stop points

If I use this function with these inputs, I get the following result:

out<-split_func(char = templines,ref = dictionary,name = 1,start = 2,end = 3)

#>out
#   ID year week gender age
#1:  1 2018   01      1  78
#2:  2 2018   01      2  67
#3:  3 2018   01      1  13

I had to include an "ID" column to initiate the data table and make this easier. In case you want to drop it later you can just use:

out[,ID := NULL]

Hope this is closer to the solution you were looking for.

Rage
  • 323
  • 1
  • 13
  • thanks, but again this wouldn't scale up to hundreds of variables... is there a way of supplying start and end as vectors? – chrisjacques Jul 03 '18 at 11:21
  • You could provide an additional dataset like the one you showed here which would be used to pick up the start and end values I think. – Rage Jul 03 '18 at 12:09
  • OK let me look through it and I'll edit my answer accordingly – Rage Jul 03 '18 at 12:35
  • Please check if the updated solution works better. This was an interesting exercise :) – Rage Jul 03 '18 at 13:42