I think this is a great question. I have similar files I need to read so this was a useful exercise for me.
Here is some code that does what you ask. First I save out the example text you gave to a file so that I can read it in via sparklyr::spark_read_text
like you suggest.
Then I use dplyr::mutate
and substr
(which will be evaluated by SparkR::substr
in this case) to split the data into the columns you requested.
# because we'll want to use pipes later
library(magrittr)
text <- c("0001BRAjonh ",
"0002USAmarina ",
"0003GBPcharles")
# let's save out this text to a file for the purposes of this example
save_path <- normalizePath("~/testing_fwf.dat")
readr::write_lines(text, file = save_path)
# use spark_read_text to load it into spark
test_fwf <- sparklyr::spark_read_text(
sc,
name = "test_fwf",
path = paste0("file://",save_path)
)
# just checking the column names
# it creates a single column called "line"
colnames(test_fwf)
# [1] "line"
# split that column into the ones you requested
split_fwf <- test_fwf %>%
dplyr::mutate(
ID = substr(line, 1, 4), # Column, start, stop
Country = substr(line, 5, 7),
Name = substr(line, 8, 14)
) %>%
dplyr::select(-line) # we can get rid of the original unsplit column now
# for ease of printing the results, let's collect this
split_fwf_collected <- split_fwf %>%
dplyr::collect()
# take a look at what we've created
split_fwf_collected
# # A tibble: 3 x 3
# ID Country Name
# <chr> <chr> <chr>
# 1 0001 BRA "jonh "
# 2 0002 USA "marina "
# 3 0003 GBP "charles"