6

We have a 'delete all my data' feature. I'd like to delete a set of IPs from a large number of web log files.

Currently at runtime I open a CSV with the IP addresses to delete, turn it into a set, scan through files, and execute the delete logic if log IPs match.

Is there any way I can load the CSV and turn it into a set at compile time? We're trying to migrate things to AWS Lambda, and it's nifty to have only a single static binary to deploy with no dependencies.

Shepmaster
ForeverConfused

3 Answers

9

The Rust-PHF crate provides compile-time data structures, including (ordered) maps and sets.

Unfortunately, to date, it does not support initializing a set of std::net::IpAddr, but it can be used with static strings:

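use phf::phf_set; // requires phf's `macros` feature

// The perfect hash is computed at compile time; lookups need no runtime setup.
static IP_SET: phf::Set<&'static str> = phf_set! {
    "127.0.0.1",
    "::1",
};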
loganfsmyth
mcarton
  • isn't IPV6 const ? https://doc.rust-lang.org/src/std/net/ip.rs.html#855, for some reason IPV4 is not stable... https://doc.rust-lang.org/src/std/net/ip.rs.html#332 – Stargateur Mar 10 '19 at 07:13
  • @Stargateur it doesn't matter, the crate doesn't allow any expression in `phf_set!` and an `enum` constructor isn't one of the allowed one. I guess it needs to computes the hash at compile time, and since the crate existed before `const`, it could only accept simpler expressions. – mcarton Mar 10 '19 at 11:41
  • If this only IPs, not IP masks, then a `u32` may be preferable. – Matthieu M. Mar 10 '19 at 17:13
  • @MatthieuM. "If this only IP*v4*s" – mcarton Mar 10 '19 at 17:32
  • Couldn't you just create a newtype around `IpAddr` and implement [`PhfHash`](https://docs.rs/phf/0.7.24/phf/trait.PhfHash.html) for it? – Shepmaster Mar 10 '19 at 23:47
  • @Shepmaster maybe if you use `phf_codegen`, but if you use the simpler `phf_set!` macro, you are [limited to simple expressions](https://github.com/sfackler/rust-phf/blob/53000d3dc844968e955fe30272734d3af36efe8f/phf_macros/src/lib.rs#L52-L102) and `enum` constructors (or anything function-call-like) aren't allowed. – mcarton Mar 11 '19 at 10:00
3

I would recommend simply using a build script to read the CSV and produce a source file containing the initialization of a standard HashSet with a custom hasher (FxHash, for example).

This would let you keep the convenience of editing a CSV file, while still baking all the data into a binary. It would require some initialization time (unlike PHF), but the ability to specify a custom hash is quite beneficial.

Also, depending on the format of IPs in the logs, you may want to store either &'static str or u32; the latter is more efficient (search-wise), but the gain may be negated if a conversion is required.
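
As a rough illustration of the build-script idea, here is a minimal sketch using only the standard library. It assumes the CSV holds one IPv4 address in the first column of each row; the file name `ips.csv`, the output name `blocked_ips.rs`, and the identifiers `render_ip_table` and `BLOCKED_IPS` are all hypothetical. Note that it emits a sorted `u32` array rather than a HashSet initializer, which is a simplification of what is described above.

```rust
// build.rs (sketch): turn a CSV of IPv4 addresses into a Rust source file.
use std::{env, fs, net::Ipv4Addr, path::Path};

/// Render the first column of `csv` as source code for a sorted, de-duplicated
/// array of `u32`s. Rows that do not parse as IPv4 (e.g. a header) are skipped.
fn render_ip_table(csv: &str) -> String {
    let mut ips: Vec<u32> = csv
        .lines()
        .filter_map(|line| line.split(',').next())
        .filter_map(|field| field.trim().parse::<Ipv4Addr>().ok())
        .map(u32::from)
        .collect();
    ips.sort_unstable();
    ips.dedup();
    format!("pub static BLOCKED_IPS: [u32; {}] = {:?};\n", ips.len(), ips)
}

fn main() {
    // `ips.csv` is a hypothetical file name, assumed to sit next to Cargo.toml.
    println!("cargo:rerun-if-changed=ips.csv");
    // A real build script should probably fail loudly here instead of
    // falling back to an empty table when the file is missing.
    let csv = fs::read_to_string("ips.csv").unwrap_or_default();
    // Cargo sets OUT_DIR for build scripts; outside Cargo nothing is written.
    if let Ok(out_dir) = env::var("OUT_DIR") {
        let dest = Path::new(&out_dir).join("blocked_ips.rs");
        fs::write(dest, render_ip_table(&csv)).expect("failed to write blocked_ips.rs");
    }
}
```

The main crate could then pull the generated file in with `include!(concat!(env!("OUT_DIR"), "/blocked_ips.rs"));` and either call `BLOCKED_IPS.binary_search(&ip).is_ok()` directly or collect the array into a HashSet once at startup. Swapping in a custom hasher such as FxHash would need an extra crate and is left out of this sketch.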

Matthieu M.
  • Why the custom hasher? I don't see what benefit it brings here. – Shepmaster Mar 10 '19 at 23:43
  • Actually, why create a *set* at all at runtime? Create a set in the build script. Now you know it's unique and you can just spit out all the data as a big array or slice. – Shepmaster Mar 11 '19 at 00:44
  • @Shepmaster: I think you may have misunderstood the purpose of the exercise. The goal is not to print a list of unique IPs in the csv (they are likely already unique), but to scrap a lot of log files, looking up any IP we stumble upon in this set. As such, the goal is to optimize look-up speed, and the look-up speed of a hash-map with an optimized hash is *much* better than the look-up speed in a sorted array... even assuming an Eytzinger layout. – Matthieu M. Mar 11 '19 at 07:46
3

have only a single static binary to deploy

Inline your entire CSV file using include_bytes! or include_str!, then go about the rest of your program as usual.

use csv; // 1.0.5

static CSV_FILE: &[u8] = include_bytes!("/etc/hosts");

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b'\t')
        .from_reader(CSV_FILE);

    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }

    Ok(())
}
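
If the goal is a lookup set rather than printing records, the embedded text can be parsed once on first use. The following is a minimal standard-library sketch, not part of the answer's original code: `CSV_TEXT` stands in for `include_str!("ips.csv")` (a hypothetical file name) so the example compiles on its own, and `blocked_ips` and `is_blocked` are illustrative names.

```rust
use std::collections::HashSet;
use std::net::IpAddr;
use std::sync::OnceLock;

// Stand-in for `include_str!("ips.csv")` (hypothetical file), inlined here so
// the sketch is self-contained. One address per line, with a header row.
static CSV_TEXT: &str = "ip\n127.0.0.1\n::1\n10.0.0.7\n";

/// Parse the embedded text into a set the first time it is needed and reuse
/// it afterwards. Lines that are not valid IPs (such as the header) are skipped.
fn blocked_ips() -> &'static HashSet<IpAddr> {
    static SET: OnceLock<HashSet<IpAddr>> = OnceLock::new();
    SET.get_or_init(|| {
        CSV_TEXT
            .lines()
            .filter_map(|line| line.trim().parse::<IpAddr>().ok())
            .collect()
    })
}

/// Return true if `field` parses as an IP address that is in the set.
fn is_blocked(field: &str) -> bool {
    field
        .trim()
        .parse::<IpAddr>()
        .map_or(false, |ip| blocked_ips().contains(&ip))
}

fn main() {
    for candidate in ["127.0.0.1", "192.168.1.1", "::1"] {
        println!("{candidate}: blocked = {}", is_blocked(candidate));
    }
}
```

This still pays a small parsing cost on the first lookup of each process, but the data ships inside the binary, so nothing extra has to be deployed alongside it.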

Shepmaster
  • Is it possible to parse that csv at compile time using Rust-PHF mcarton mentioned? – ForeverConfused Mar 10 '19 at 23:02
  • @ForeverConfused nothing special from this answer, but yes, in general. As [Matthieu M. mentions](https://stackoverflow.com/a/55090316/155423), you can use a build script to perform the reading of the CSV itself, then construct the set. That being said, why do you need a *set* at runtime? Once you've ensured there are no duplicates, an array would be sufficient in most cases. – Shepmaster Mar 11 '19 at 00:07
  • For each log I check if it exists in the set of emails, and I delete it if found. This type of operation is much faster on sorted set. Lambda charges per invocation, so there's some minimal benefit to having a set generated at compile time. – ForeverConfused Mar 11 '19 at 06:22