
I need to open a file, read each (long) line, produce a hash of it (MD5 is fine), and write that hash to a file. The file has millions of lines, and hashing seems ideally suited to C++.

Opening the files, reading line by line, and writing to a file are all done, but the hash function is where I am stuck.

I have found some hashing code, but shoehorning it into Rcpp just doesn't seem to work. I have also called the R function from within Rcpp, but for some reason the result is truncated, which creates a lot of collisions. The result also changes each time it is run, whereas the same string run in R always gives the same, full-length result.

Has anyone successfully implemented a line-by-line MD5 (or SHA; it doesn't matter which) hash with Rcpp?

tadman
  • Can you use an existing C or C++ implementation of MD5? – tadman Jan 22 '21 at 02:54
  • That is essentially what my [`digest` package](https://cran.r-project.org/package=digest) does. It predates `Rcpp` but is pretty quick. Otherwise, this is a pretty basic task if you have a little bit of C and C++ experience. If you don't, then text and strings and char vectors ... can take some getting used to. (And I didn't downvote you.) – Dirk Eddelbuettel Jan 22 '21 at 03:00
  • Thanks Dirk, it was the R digest package I used and it works really well in R but the output was truncated when called from Rcpp and I would have expected the MD5 to be the same and not change. Any thoughts why? – Neil Walker Jan 22 '21 at 04:28
  • Why would you use Rcpp to call digest? Just use readLines, and loop over the lines, or use the new vectorised access in digest. – Dirk Eddelbuettel Jan 22 '21 at 04:56
  • Thanks again, I will explain a bit more. Each line in the file is over 500 characters; one file is over 4 GB and has over 20M rows. R throws an error when importing a 2 GB file, but I can see nothing obvious, and it is very slow. I need to check for duplicates, hence I thought I would create a hash and work on that. I want to avoid using organic R because reading the file is slow and fails, whereas Rcpp works well. I have a loop that reads each line in Rcpp and it works fine. I am having trouble using hash functions, so I am using the R digest, which does work but with the problems I have mentioned. – Neil Walker Jan 22 '21 at 06:03
  • Sure. It is still a pretty simple task and a great opportunity to learn some basic C and C++ (or incentivise someone to do it for you). Take two parameters for input and output file, read input line by line and transform (maybe string reverse as a first test) and then write to output. Calling a hasher instead of string reversal is then a simple extension. – Dirk Eddelbuettel Jan 22 '21 at 15:12

1 Answer


Here is a complete answer, relying on two other StackOverflow answers:

  • one for one of a bazillion possible ways to read a file line by line and write the result out
  • and one for the somewhat more interesting part of getting md5 from Boost so that we don't have to link.

This now works with one simple sourceCpp() call, provided you have the CRAN package BH installed.

Demo

> sourceCpp("~/git/stackoverflow/65838609/answer.cpp")
> hasher("/home/edd/git/stackoverflow/65838609/answer.cpp", "/tmp/md5.txt")
[1] TRUE
> 

Code

#define BOOST_NO_AUTO_PTR

#include <Rcpp.h>
#include <fstream>
#include <iterator>   // std::back_inserter
#include <string>
#include <boost/uuid/detail/md5.hpp>
#include <boost/algorithm/hex.hpp>

// [[Rcpp::depends(BH)]]

using boost::uuids::detail::md5;

// [[Rcpp::export]]
bool hasher(const std::string& infilename, const std::string& outfilename) {

  // cf https://stackoverflow.com/questions/48545330/c-fastest-way-to-read-file-line-by-line
  //    https://stackoverflow.com/questions/55070320/how-to-calculate-md5-of-a-file-using-boost
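  std::ifstream infile(infilename);
  std::ofstream outfile(outfilename);
  // report failure instead of silently returning true when a file cannot be opened
  if (!infile.is_open() || !outfile.is_open()) return false;
  std::string line;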
  while (std::getline(infile, line)) {
    // line contains the current line
    md5 hash;
    md5::digest_type digest;
    hash.process_bytes(line.data(), line.size());
    hash.get_digest(digest);
    const auto charDigest = reinterpret_cast<const char *>(&digest);
    std::string res;
    boost::algorithm::hex(charDigest, charDigest + sizeof(md5::digest_type), std::back_inserter(res));
    outfile << res << '\n';
  }
  outfile.close();
  infile.close();
  return true;
}
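
The last step inside the loop turns the raw digest bytes into text; `boost::algorithm::hex` writes two uppercase hex characters per byte, so a 16-byte MD5 digest should always come out as 32 characters, and anything shorter points to truncation. As a rough illustration of that step without Boost, a hand-rolled equivalent could look like this (the name `to_hex` is hypothetical, not part of the answer's code):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: uppercase hex encoding of `len` raw bytes,
// two output characters per input byte.
std::string to_hex(const unsigned char* bytes, std::size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(2 * len);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(digits[bytes[i] >> 4]);    // high nibble
        out.push_back(digits[bytes[i] & 0x0F]);  // low nibble
    }
    return out;
}
```

Because the length is passed explicitly, embedded zero bytes in a digest are encoded like any other byte rather than being treated as a string terminator.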
Dirk Eddelbuettel
  • Thanks Dirk, I had all the file opening and reading already done - I already use the method you mentioned. Got it all working, wasn't familiar with Boost so I will need to study up on that and see what else it does. Thanks again. – Neil Walker Jan 26 '21 at 02:11