3

One of my first minor Rust projects involves running a regex on a large XML file:

extern crate regex;

use regex::Regex;
use std::fs::File;
use std::io::Read;

fn main() {
    let filename = "data.xml";
    let mut f = File::open(filename).expect("file not found");

    let mut contents = String::new();
    f.read_to_string(&mut contents)
        .expect("something went wrong reading the file");

    let re = Regex::new("url=\"(?P<url>.+?)\"").unwrap();
    let urls: Vec<&str> = re.captures_iter(&contents)
        .map(|c| c.name("url").unwrap().as_str())
        .collect();

    println!("{}", urls.len());
}

I am sure I am doing something very inefficient:

time ./target/release/hello_cargo 144408 ./target/release/hello_cargo
1.60s user
0.03s system
99% cpu
1.643 total 

It seems unusual that 99% of the CPU usage is by system.

Python 2.7 does the same job in far less than a second:

import re 
data = open('data.xml').read()
urls = set(re.findall('url="(.+?)"', data))
print len(urls)

Using a BufReader like this doesn't seem to change performance:

let f = File::open(filename).expect("file not found");
let mut reader = BufReader::new(f);
let mut contents = String::new();
reader
    .read_to_string(&mut contents)
    .expect("something went wrong reading the file");

If you'd like to try it locally, this is the zipped XML file.

What am I doing inefficiently?

user964375
  • 2,201
  • 3
  • 26
  • 27
  • 6
    [Please, don't use regular expressions to parse XML.](https://stackoverflow.com/a/1732454/234590) (That answer is about HTML, but the same reasoning applies to XML.) – Francis Gagné Feb 14 '18 at 00:28
  • Did you compile in release mode (`cargo run --release`)? – Francis Gagné Feb 14 '18 at 00:31
  • Fair point on regex with XML, this was just convenient data to learn rust with regex with. Release mode made it run in 1.72 seconds, which is a vast improvement. Is the line with map doing something inefficient? This was pulled together from a few examples, so it wasn't well thought out. – user964375 Feb 14 '18 at 00:34
  • Try wrapping your `File` in a [`BufReader`](https://doc.rust-lang.org/stable/std/io/struct.BufReader.html). – Francis Gagné Feb 14 '18 at 00:38
  • Is the regex `"url=\"(?P[^\"]+)\""` faster? – Francis Gagné Feb 14 '18 at 00:52
  • I just independently came up with the same regex :) Unfortunately it seems to be around the same speed. It seems like the regex is the bottleneck, reading the data from the file seems to be very fast, as expected. – user964375 Feb 14 '18 at 00:54
  • I tried this code, and it seems like it is the actual regex processing which is slow, not map etc: let mut counter = 0; for cap in re.captures_iter(&contents) { counter += 1 } – user964375 Feb 14 '18 at 00:56
  • 1
    Please provide enough details so that other people can reproduce your problem. This includes your Python program and data. – BurntSushi5 Feb 14 '18 at 01:02
  • Please *do* use [my crate, sxd-document](https://crates.io/crates/sxd-document), to parse XML. ;-) – Shepmaster Feb 14 '18 at 02:19
  • 1
    Please don't respond with vital information in comments; instead, [edit] your question to include the requested feedback. Comments are ephemeral (may be deleted at any time) and it's *really* hard to read code in them (on purpose). – Shepmaster Feb 14 '18 at 02:23
  • 2
    Your Python program is not doing equivalent work. It is merely counting matches. The Rust program isn't just getting matches, but collecting all instances of a specific capture group into a vec. You also don't include the timing information for your Python program. – BurntSushi5 Feb 14 '18 at 02:50
  • The python program is generating a set of the strings of all matches, which is then being counted, but the set does exist in memory. It takes 0.47 seconds to run. This is equivalent and takes the same time to run: urls = set(re.findall('url="(.+?)"', data)); print len(urls) – user964375 Feb 14 '18 at 03:40
  • No, it is most certainly *not* equivalent. The Rust program is extracting a capture group, but your Python program doesn't. This Rust program does the equivalent work of your Python program, and is faster: https://gist.github.com/anonymous/4936167a5ee28bb66b345f5642a3bdf8 --- With that said, adding the capturing group code to the Python program doesn't impact its runtime significantly. I'll elaborate in a proper answer. – BurntSushi5 Feb 14 '18 at 21:10

1 Answers1

6

The answer to your question is that the Rust code you've written is optimal with respect to standard usage of the regex crate. The reason why it's a little bit slower than Python here (about 3x slower on my machine) is because resolving capturing groups hasn't received much optimization attention in Rust's regex crate. In particular, resolving capture groups requires running a slower internal regex engine.

In the question as stated today, do note though that your Python program isn't doing equivalent work because all it's doing is collecting the matches and counting them. The Rust program is actually extracting a capture group, which is more work. For example, if you use this instead:

let urls: Vec<&str> = re.find_iter(&contents).map(|m| m.as_str()).collect();

then the Rust program is doing the same work as the Python program, and it is about 2x faster on my machine. Now, to be fair, if you modify the Python program to do the same work as your original Rust program, i.e.,

urls = set(m.group('url') for m in re.finditer('url="(?P<url>.+?)"', data))

then the Python program only gets a little slower, and the original Rust program is still noticeably slower as explained above.

If, instead of waiting for the regex crate to better optimize capture handling, you'd like to make your program run more quickly today, then you can exploit a particular feature of your regex. That is, avoid asking for a capture group and just extract the URL from the matched text directly. Like so:

let urls: Vec<&str> = re
    .find_iter(&contents)
    .map(|m| {
        let text = m.as_str();
        &text[5..text.len() - 1]
    })
    .collect();

This runs as fast as my modification above, which is about 2x faster than Python. Not ideal, but it's something.

BurntSushi5
  • 13,917
  • 7
  • 52
  • 45
  • 1
    Thank you for your response, apologies for misunderstanding what you meant by capture group. With your first answer example, the rust program is now 3x faster than the python program for me. Thanks again – user964375 Feb 14 '18 at 22:57