Stata vs R: loop to load multiple inputs with unique fixed width variables

Question

I am curious if anyone has a good R equivalency for the Stata program command. I have a bunch of fixed width .asc files that I needed to load up individually because fixed width positions for the same data change year to year. Stata program command does this very succinctly, managing the changes in position file by file as the program runs.

I am unclear how to do this in R without loading up each year's data individually then rbind.

I understand that something like :: ldply(myfiles, read_csv) would do a good job of identically formatted files in a set directory, but, because the conditions of some of the variables change from file to file, I am stuck on needing to write it file by file too.

target data frame would have c("name", "fips", "var1", "var2", "year")

my attempt:

a97 <- read_fwf("/A/A2000.asc", fwf_positions(c(67, 122, 18533, 18563), c(91, 126, 18538, 18568), c("name", "fips", "var1", "var2"))) %>%  filter(fips == "12345") %>% mutate(year = 1997)

a00 <- read_fwf("/A/A2004.asc", fwf_positions(c(67, 122, 17982, 18012), c(91, 126, 17987, 18017), c("name", "fips", "var1", "var2"))) %>%  filter(fips == "12345") %>% mutate(year = 2000)

rbind(a97, a00)

A Stata program would create a dataframe of name, fips, var1, var2, filename, year; and then fill it based on file by file positions for var1 and 2.

program import_a

infix str name 67-91 str fips 122-126 var1 `1' var2 `2' ///
using "$\A\\`3'.asc", clear

keep if fips=="12345"

gen year = `4'

append using "$output\a.dta"

save "$output\a.dta", replace

end

import_a 18533-18538 18563-18568 A2000 1997

import_a 17982-17987 18012-18017 A2004 2000

I am not looking for a solution just ideas on how to loop something in R that can do this for read_fwf().

Could you share a toy example of a couple files illustrating the problem? Obviously we don't want 18000 characters of example, but if you could mock up a small example of 2 or 3 lines for a couple files to get the idea across it would help someone like me (who doesn't know Stata) understand the problem better. — Gregor Thomas, Jul 28 '22 at 18:21
Like, when you say *"fill it based on file by file positions for var1 and 2"*, can you illustrate that? Are these strings in the header/first row that we can locate the positions of to infer the widths of the columns? — Gregor Thomas, Jul 28 '22 at 18:22
sure i guess it would look like, say for the first file c("abc defg asb var1") and fwf_positions needs to point 14-17 in file 1, but in file2, c("abc var1 xcv wpt"), var1 is in position 4-7. the stata code loads "abc" based on 1-3 for every file, then file1 14-17, file2 4-7.... — gered, Jul 28 '22 at 18:29
So that makes more sense, but I can't draft a solution in R and test it on what you provided. Could you please edit into your question something like that, with 1 or 2 rows of data per table, so that there's something to test solutions on? — Gregor Thomas, Jul 28 '22 at 18:44
But basically, yeah, you can use `read_table(..., n_max = 1)` to read the first line and use string manipulation functions like `stringr::str_locate` to find the positions and feed that to `fwf_positions`. — Gregor Thomas, Jul 28 '22 at 18:55
makes sense. I guess the question really is, how to tell R to loop a directory where each file has a different set of positions... — gered, Jul 28 '22 at 19:01
Looping through files in a directory is easy. And writing a little custom function as I described in my previous comment shouldn't be too bad. — Gregor Thomas, Jul 28 '22 at 19:30
interesting. would this work with a flat file that has no column names? — gered, Jul 29 '22 at 12:55
If you can precisely describe the logic to determine where the columns start, then we can code up a solution. Usually this is not clear from fixed width files, which is why `read_fwf()` expects the user to supply these. In your example `"abc var1 xcv wpt"` there are spaces, so the columns are clear. It made it seem like we could look for spaces (or at least specific strings) in the headers to learn the rule. That's not generally true for fixed with files. If the first line is `123456` and even if we know there are 3 columns, there's no way to know if its `12, 34, 56` or `1, 234, 5, 6`, etc. — Gregor Thomas, Jul 29 '22 at 13:25

Stata vs R: loop to load multiple inputs with unique fixed width variables

0 Answers0