
I have a very large text file with several million lines containing census data like this:

83400081732734890
2873846391010001944545
1829304000292399445
934745875985958344552
40599505500505055
3457584947597594933332
3938493840333398333
444234432346777927272
...

Every row contains a set of variables that are separated by fixed widths. In my example above, four rows together make up one complete questionnaire from the survey, so the example shows two complete questionnaires, i.e. two visited households.

What I would like to do is read only specific variables from each household, since reading the whole file takes too much time. I would therefore like to read only specific lines from the file without loading it entirely into memory.

Let's say that I am only interested in variables contained in lines 1 and 3 of each block of 4 lines. How could I force R to read only lines 1, 3, 5, 7, and so on?

And: besides reading only the relevant lines, is it possible to limit reading further to the specific chunk of each line that contains the relevant information? Say, for example, I would like to read only the first three digits from the first line (834 and 405) and the last five digits from the third line (99445 and 98333).

Edit

Since I want to read selectively, the solutions offered here do not solve my problem. Furthermore, I cannot set up a SQL database since I work on a Windows 7 workstation without administrative rights. I can, however, use command-line tools from PowerShell or similar.

  • @EricJ. - not sure it's a duplicate of that one. This adds the requirement of selecting certain chunks / lines. Maybe processing the file outside of R using command line tools (awk, sed etc) might be beneficial? – thelatemail Oct 07 '15 at 22:26

1 Answer


The scan function can handle multi-line records if the original file is sufficiently regular. It doesn't do too well with variable record lengths, though.

 res <- scan(text="83400081732734890
 2873846391010001944545
 1829304000292399445
 934745875985958344552
 40599505500505055
 3457584947597594933332
 3938493840333398333
 444234432346777927272
 ", what=list(one="", two="", three="", four=""))  # one list element per line of a record;
                                                   # "" means read the field as text

Read 2 records
> first <- lapply(res[1], substr, 1, 3)
> first
$one
[1] "834" "405"


> third <- lapply(res[3], function(x) substr(x , nchar(x)-4, nchar(x)))
> third
$three
[1] "99445" "98333"
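If lines 2 and 4 are never needed, a variant of the call above can skip them while reading. This is only a sketch, and it relies on my reading of the documentation for scan's what argument, namely that a NULL list component causes the corresponding field to be skipped. The variable txt is just an illustrative stand-in for the real file. scan still has to pass over every line, so this mainly saves the memory for the unused fields rather than the read itself.

```r
# Sample data standing in for the real file (two 4-line records).
txt <- "83400081732734890
2873846391010001944545
1829304000292399445
934745875985958344552
40599505500505055
3457584947597594933332
3938493840333398333
444234432346777927272
"

# Assumption: NULL components in `what` are skipped, so only the fields
# from lines 1 and 3 of each record are stored.
res <- scan(text = txt,
            what = list(one = "", two = NULL, three = "", four = NULL),
            quiet = TRUE)

first <- substr(res$one, 1, 3)                                    # first three digits
third <- substr(res$three, nchar(res$three) - 4, nchar(res$three)) # last five digits
```

For a real file, text = txt would be replaced by the file name as the first argument.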

Another method would be to read with readLines, which would then let you choose division markers at irregular intervals.
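A rough sketch of the readLines route, reading from an open connection in chunks so the whole file is never in memory at once. The function name read_selected and its arguments are hypothetical, and the sketch assumes the number of lines in the file is an exact multiple of the block size (4 here).

```r
# Hypothetical helper: keep only lines 1 and 3 of every 4-line block and
# extract the wanted substrings, reading the file chunk by chunk.
read_selected <- function(path, block = 4L, blocks_per_chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    # The chunk size is a multiple of the block size, so a block never
    # straddles two chunks.
    lines <- readLines(con, n = block * blocks_per_chunk)
    if (length(lines) == 0L) break
    pos <- (seq_along(lines) - 1L) %% block + 1L  # position within its block
    l1 <- lines[pos == 1L]
    l3 <- lines[pos == 3L]
    chunks[[length(chunks) + 1L]] <- data.frame(
      first = substr(l1, 1, 3),                   # first three digits of line 1
      third = substr(l3, nchar(l3) - 4, nchar(l3)), # last five digits of line 3
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, chunks)
}

# Small demonstration on a temporary file holding the sample data.
tmp <- tempfile()
writeLines(c("83400081732734890", "2873846391010001944545",
             "1829304000292399445", "934745875985958344552",
             "40599505500505055", "3457584947597594933332",
             "3938493840333398333", "444234432346777927272"), tmp)
out <- read_selected(tmp, blocks_per_chunk = 1L)
```

Only the extracted substrings accumulate across chunks; the raw lines of each chunk are discarded once it has been processed.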

IRTFM