How to count the number of non-empty fields in a delimited file?

Question

You can count the number of fields per line in a comma/tab/whatever delimited text file using utils::count.fields.

Here's a reproducible example:

d <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, "b", "c", NA, NA),
  z = c(NA, "beta", "gamma", NA, "epsilon")
)

fname <- "test.csv"
write.csv(d, fname, na = "",  row.names = FALSE)
count.fields(fname, sep = ",")
## [1] 3 3 3 3 3 3

I want to calculate the number of non-empty fields per line. I can do this in a clunky way by reading in everything and counting the number of values that aren't NA.

d2 <- read.csv(fname, na.strings = "")
rowSums(!is.na(d2))
## [1] 1 2 3 0 2

I'd really like a way of scanning the file (like count.fields) so I can target specific sections to read in.

Is there a better way of counting the number of non-empty fields in a delimited file?

score 6 · Answer 1 · edited May 23 '17 at 11:51

This should be completely portable provided you have the Rcpp, inline & BH packages installed:

library(Rcpp)
library(inline)

csvblanks <- '
string data = as<string>(filename);
ifstream fil(data.c_str());
if (!fil.is_open()) return(R_NilValue);

typedef tokenizer< escaped_list_separator<char> > Tokenizer;

vector<string> fields;
vector<int> retval;
string line;

while (getline(fil, line)) {
  int numblanks = 0;
  Tokenizer tok(line);
  for(Tokenizer::iterator beg=tok.begin(); beg!=tok.end(); ++beg){
    numblanks += (beg->length() == 0) ? 1 : 0 ;
  };
  retval.push_back(numblanks);
}
return(wrap(retval));
'

count_blanks <- rcpp(
  signature(filename="character"),
  body=csvblanks,
  includes=c("#include <iostream>",
             "#include <fstream>",
             "#include <vector>",
             "#include <string>",
             "#include <algorithm>",
             "#include <iterator>",
             "#include <boost/tokenizer.hpp>",
             "using namespace Rcpp;",
             "using namespace std;",
             "using namespace boost;")
)

Once that's sourced, you can call count_blanks(FULLPATH) and it will return an integer vector with the count of blank fields on each line.

I ran it against this file:

"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
1,2,3,4,5
1,,3,4,5
1,2,3,4,5
1,2,,4,5
1,2,3,4,5
1,2,3,,5
1,2,3,4,5
1,2,3,4,
1,2,3,4,5
1,,3,,5
1,2,3,4,5
,2,,4,
1,2,3,4,5

via:

count_blanks("/tmp/a.csv")
## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0

CAVEATS

It's fairly obvious that it's not ignoring the header, so it could use a header logical parameter with associated C/C++ code (which will be pretty straightforward).
If you're counting "spaces" (i.e. [:space:]+) as "empty" you'll need something a bit more complex than the call to length. This is one potential way to deal with it if you need to.
It's using the default configuration for the Boost escaped_list_separator class, which is defined here. That can also be customized with quote & separator characters (making it possible to further mimic read.csv/read.table).
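For the whitespace caveat, here is a minimal sketch of one way to treat whitespace-only fields as empty, using only the C++ standard library. The helper name is_blank is hypothetical (it is not part of Boost or Rcpp), and the lambda assumes a C++11 or later compiler.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true when a field is empty or holds only whitespace.
// std::all_of returns true for an empty range, so "" counts as blank too.
bool is_blank(const std::string& field) {
    return std::all_of(field.begin(), field.end(), [](unsigned char ch) {
        return std::isspace(ch) != 0;
    });
}
```

In the tokenizer loop, the length test would then become numblanks += is_blank(*beg) ? 1 : 0;.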

This should come close to count.fields/C_countfields performance, and it avoids consuming memory by reading in every line just to find the lines you eventually want to target. I don't think preallocating space for the returned vector will add much to the speed, but you can see the discussion here, which shows how to do so if need be.

i just realized you wanted to count _non-empty_, i hope that the inverse of this is trivial enough to not warrant an edit ;-) — hrbrmstr, Sep 20 '15 at 23:44
