This should be completely portable provided you have the Rcpp, BH and inline packages installed:
library(Rcpp)
library(inline)

csvblanks <- '
  string data = as<string>(filename);
  ifstream fil(data.c_str());
  if (!fil.is_open()) return(R_NilValue);

  typedef tokenizer< escaped_list_separator<char> > Tokenizer;

  vector<string> fields;
  vector<int> retval;   // one blank-field count per line
  string line;

  while (getline(fil, line)) {
    int numblanks = 0;
    Tokenizer tok(line);
    // a zero-length token is a blank field
    for (Tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      numblanks += (beg->length() == 0) ? 1 : 0;
    }
    retval.push_back(numblanks);
  }

  return(wrap(retval));
'

count_blanks <- rcpp(
  signature(filename="character"),
  body=csvblanks,
  includes=c("#include <iostream>",
             "#include <fstream>",
             "#include <vector>",
             "#include <string>",
             "#include <algorithm>",
             "#include <iterator>",
             "#include <boost/tokenizer.hpp>",
             "using namespace Rcpp;",
             "using namespace std;",
             "using namespace boost;")
)
Once that's sourced you can call count_blanks(FULLPATH) and it will return an integer vector with the count of blank fields on each line.
I ran it against this file:
"DATE","APIKEY","FILENAME","LANGUAGE","JOBID","TRANSCRIPT"
1,2,3,4,5
1,,3,4,5
1,2,3,4,5
1,2,,4,5
1,2,3,4,5
1,2,3,,5
1,2,3,4,5
1,2,3,4,
1,2,3,4,5
1,,3,,5
1,2,3,4,5
,2,,4,
1,2,3,4,5
via:
count_blanks("/tmp/a.csv")
## [1] 0 0 1 0 1 0 1 0 1 0 2 0 3 0
CAVEATS
- It's fairly obvious that it's not ignoring the header, so it could use a header logical parameter with associated C/C++ code (which will be pretty straightforward).
- If you're counting "spaces" (i.e. [:space:]+) as "empty" you'll need something a bit more complex than the call to length. This is one potential way to deal with it if you need to.
- It's using the default configuration for the Boost function escaped_list_separator which is defined here. That can also be customized with quote & separator characters (making it possible to further mimic read.csv/read.table).
This will more closely approach count.fields/C_countfields performance and will eliminate the need to consume memory by reading in every line just to find the lines you eventually want to more optimally target. I don't think preallocating space for the returned numeric vector will add much to the speed, but you can see the discussion here which shows how to do so if need be.