After reading carefully I thought the reason could be signed characters. If signed characters are involved,
for (char i : str) {
    value = (value + i) % m;
}
could result in negative hashes. However, it's unlikely that this would actually happen, since URLs don't often contain high ASCII (indeed, you would expect their IDNA-encoded version to be in the list).
A quick check reveals 70 such domains in the list:
xn---21-6cdxogafsegjgafgrkj4opcf2a.xn--p1ai/images/aaa/,bad
xn--cafsolo-dya.de/media/header/A.php,bad
xn--b1afpdqbb8d.xn--p1ai/web/data/mail/-/aaaaa/Made.htm,bad
xn-----6kcbb3dhfdcijbv8e2b.xn--p1ai/libraries/openid/Auth/OpenID/sec/RBI/index/index.htm,bad
etc.
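For illustration, here is a minimal sketch (not the reviewed code; the byte value is made up) of how a single high-ASCII byte makes the hash negative where char is signed:
#include <iostream>

int main() {
    char c = '\xC3';            // a high-ASCII byte; negative on platforms where char is signed
    int value = 0;
    value = (value + c) % 100;  // c promotes to a negative int, so the remainder is negative too
    std::cout << value << "\n"; // e.g. -61 on such platforms: not a valid index
}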
If that's not the case, then something else is throwing out_of_range. Nothing /should/, because operator[] doesn't bounds-check according to the standard. However, maybe some implementations do the bounds-checking in Debug builds (look at MSVC here: they have all manner of iterator debugging enabled by default in Debug builds, so maybe this as well?).
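To illustrate why that matters, a minimal sketch (not from the reviewed code): .at() is specified to throw std::out_of_range, while an out-of-bounds operator[] is plain Undefined Behaviour:
#include <iostream>
#include <stdexcept>
#include <string>

int main() {
    std::string s = "abc";
    // s[10];                    // no bounds check per the standard: Undefined Behaviour
    try {
        s.at(10);                // .at() does check and throws
    } catch (std::out_of_range const& e) {
        std::cout << "caught: " << e.what() << "\n";
    }
}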
Ironically, you could use some bounds-checking yourself, e.g. here:
int main(int argc, char* argv[]) {
    std::vector<std::string_view> const args(argv, argv + argc);
    std::string const filename(args.at(1));
This way you don't just invoke Undefined Behaviour when the command line argument is not given.
Line Ends
There's a gotcha with line-ends. The files are CRLF, so the way you parse the columns results in the last column containing "bad\r" instead of "bad" on Linux.
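A minimal sketch of the effect (with a made-up line, not the reviewed code): on Linux, getline only strips the '\n', so the '\r' stays in the last column:
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::istringstream file("http://example.com/x,bad\r\n"); // simulate reading a CRLF file
    std::string line;
    std::getline(file, line);                                // line is "http://example.com/x,bad\r"
    auto label = line.substr(line.find_last_of(',') + 1);
    std::cout << std::boolalpha << (label == "bad") << "\n"; // false: label is "bad\r"
}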
Review/Simplify
In the search for other bugs I've simplified the code. It will perform a lot
better now. Here's the run-down of improvement suggestions.
The includes. Just include what you need, really:
#include <bitset>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>
No need for wonky types or macros.
static constexpr size_t m = 100;
static constexpr size_t k = 1; // unused for now
As mentioned before, protect against signed character results:
static size_t hash1(std::string_view str) {
    size_t value = 0;
    for (unsigned char i : str) {
        value = (value + i) % m;
    }
    return value;
}
Also, as mentioned before, protect against trailing special characters in the classification text:
enum goodbad { good, bad, invalid }; // The Good, The Bad and ...
static goodbad to_classification(std::string_view text) {
    if (text == "bad") return bad;
    if (text == "good") return good;
    return invalid;
}
Next up, a big one. You parse through the same file twice. And repeat the code. Let's not. Instead, have a function that parses it, and pass it a callback to decide what to do with the parsed data.
While we're at it, let's stop the ubiquitous vector<vector<vector<string> > > disease. Really, you know how many columns there are and what they mean. This also reduces the allocations by a lot.
void process_csv(std::string filename, auto callback) {
    std::ifstream fin(filename);
    std::stringstream ss;
    for (std::string line; getline(fin, line);) {
        std::string_view row(line);
        // eat line end remnants
        row = row.substr(0, row.find_last_not_of("\r\n") + 1);
        // find_last_of returns npos when there is no comma; npos + 1 wraps to 0 (false)
        if (auto comma = row.find_last_of(','); comma + 1) {
            auto url = row.substr(0, comma);
            auto goodbad = row.substr(comma + 1);
            auto classif = to_classification(goodbad);
            if (classif == invalid)
                std::cerr << "Ignored unclassified line '" << row << "'\n";
            else
                callback(url, classif);
        }
    }
}
That's all. Notice a key insight: we split using the last comma only, because otherwise you would get wrong results when the URL itself contains commas.
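For illustration, a minimal sketch (with a made-up row) of why the first comma would pick the wrong column:
#include <iostream>
#include <string_view>

int main() {
    std::string_view row = "http://example.com/a,b.html,bad";            // URL itself contains a comma
    std::cout << row.substr(row.find_first_of(',') + 1) << "\n";         // "b.html,bad" - wrong
    std::cout << row.substr(row.find_last_of(',') + 1) << "\n";          // "bad" - what we want
}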
Now, you can piece together the main program:
int main(int argc, char* argv[]) {
    std::vector const args(argv, argv + argc);
    std::string const filename(args.at(1));
Starting with the aforementioned safe way to use the command line arguments,
    std::bitset<m> bloom;
reducing the verbose vector<vector<> > syndrome (and improving the name - v?!).
Here comes the first file reading pass:
    size_t bloom_size = 0;
    process_csv(filename,
                [&](std::string_view url, goodbad classification) {
                    if (classification == bad) {
                        bloom_size += 1;
                        bloom.set(hash1(url));
                        //std::cout << url << " inserted into bloom filter\n";
                    }
                });
I decided it would not be necessary (and slow) to print all bad URLs, so let's print the number of them instead:
    // Now bitset contains all the malicious urls
    std::cerr << "Bloom filter primed with " << bloom_size << " bad urls\n";
Now the validation pass:
    // do a 1 in 10 validation check
    process_csv(filename,
                [&bloom, line = 0](std::string_view url,
                                   goodbad classification) mutable {
                    line += 1;
                    if (rand() % 10) return; // TODO #include <random>
                    auto hit = bloom.test(hash1(url));
                    bool expected = (classification == bad);
                    std::cout << line << ": " << std::boolalpha
                              << (hit == expected)
                              << (hit ? " positive" : " negative") << "\n";
                });
}
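As the TODO hints, rand() could later be replaced with <random>; here's a minimal sketch of a 1-in-10 sampler (my own naming, not part of the demo):
#include <random>

// keeps roughly 1 in 10 lines, as a drop-in for the rand() % 10 check (sketch)
bool sample_one_in_ten() {
    static std::mt19937 gen(std::random_device{}());
    static std::bernoulli_distribution keep(0.1);
    return keep(gen);
}
// in the validation lambda: if (!sample_one_in_ten()) return;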
Live Demo
Live On Coliru
#include <bitset>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

static constexpr size_t m = 100;
//static constexpr size_t k = 1;

static size_t hash1(std::string_view str) {
    size_t value = 0;
    for (unsigned char i : str) {
        value = (value + i) % m;
    }
    return value;
}

enum goodbad { good, bad, invalid }; // The Good, The Bad and ...

static goodbad to_classification(std::string_view text) {
    if (text == "bad") return bad;
    if (text == "good") return good;
    return invalid;
}

void process_csv(std::string filename, auto callback) {
    std::ifstream fin(filename);
    std::stringstream ss;
    for (std::string line; getline(fin, line);) {
        std::string_view row(line);
        // eat line end remnants
        row = row.substr(0, row.find_last_not_of("\r\n") + 1);
        if (auto comma = row.find_last_of(','); comma + 1) {
            auto url = row.substr(0, comma);
            auto goodbad = row.substr(comma + 1);
            auto classif = to_classification(goodbad);
            if (classif == invalid)
                std::cerr << "Ignored unclassified line '" << row << "'\n";
            else
                callback(url, classif);
        }
    }
}

int main(int argc, char* argv[]) {
    std::vector const args(argv, argv + argc);
    std::string const filename(args.at(1));

    std::bitset<m> bloom;
    size_t bloom_size = 0;
    process_csv(filename, [&](std::string_view url, goodbad classification) {
        if (classification == bad) {
            bloom_size += 1;
            bloom.set(hash1(url));
        }
    });

    // Now bitset contains all the malicious urls
    std::cerr << "Bloom filter primed with " << bloom_size << " bad urls\n";

    // do a 1 in 10 validation check
    process_csv(filename,
                [&bloom, line = 0](std::string_view url,
                                   goodbad classification) mutable {
                    line += 1;
                    if (rand() % 10) return;
                    auto hit = bloom.test(hash1(url));
                    bool expected = (classification == bad);
                    std::cout << line << ":" << std::boolalpha
                              << (hit == expected)
                              << (hit ? " positive" : " negative") << "\n";
                });
}
On Coliru only the bad dataset fits, so we never get any false positives:
g++ -std=c++2a -O2 -Wall -pedantic -pthread main.cpp
./a.out bad.csv | cut -d: -f2 | sort | uniq -c | sort -n
Prints
Ignored unclassified line 'url,label'
Bloom filter primed with 75643 bad urls
Ignored unclassified line 'url,label'
7602 true positive
On my own system, it runs in 0.4s instead of 3.7s before. After the parsing bug fix it's even faster: 0.093s on average for the full set.