One of my first minor Rust projects involves running a regex on a large XML file:
    extern crate regex;

    use regex::Regex;
    use std::fs::File;
    use std::io::Read;

    fn main() {
        let filename = "data.xml";
        let mut f = File::open(filename).expect("file not found");
        let mut contents = String::new();
        f.read_to_string(&mut contents)
            .expect("something went wrong reading the file");

        let re = Regex::new("url=\"(?P<url>.+?)\"").unwrap();
        let urls: Vec<&str> = re.captures_iter(&contents)
            .map(|c| c.name("url").unwrap().as_str())
            .collect();
        println!("{}", urls.len());
    }
I am sure I am doing something very inefficient:
    $ time ./target/release/hello_cargo
    144408
    ./target/release/hello_cargo  1.60s user 0.03s system 99% cpu 1.643 total
Almost all of that is user time (1.60s user versus 0.03s system) at 99% CPU, so the program seems to be CPU-bound rather than waiting on I/O.
Python 2.7 does the same job (and additionally deduplicates the URLs with a set) in far less than a second:
    import re

    data = open('data.xml').read()
    urls = set(re.findall('url="(.+?)"', data))
    print len(urls)
Using a BufReader like this doesn't seem to change performance:
    use std::io::BufReader; // needed in addition to std::io::Read

    let f = File::open(filename).expect("file not found");
    let mut reader = BufReader::new(f);
    let mut contents = String::new();
    reader
        .read_to_string(&mut contents)
        .expect("something went wrong reading the file");
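To check whether reading the file is the bottleneck at all, the read can be timed on its own, separately from the regex. This is only a minimal sketch using the standard library: `read_plain`, `read_buffered`, and `timed` are hypothetical helper names, and the file is assumed to be `data.xml` in the working directory unless a path is passed as the first argument.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::time::{Duration, Instant};

/// Reads the whole file directly, as in the first program.
fn read_plain(path: &str) -> io::Result<String> {
    let mut contents = String::new();
    File::open(path)?.read_to_string(&mut contents)?;
    Ok(contents)
}

/// Reads the whole file through a BufReader, as in the second snippet.
fn read_buffered(path: &str) -> io::Result<String> {
    let mut contents = String::new();
    BufReader::new(File::open(path)?).read_to_string(&mut contents)?;
    Ok(contents)
}

/// Hypothetical helper: runs `f` once and returns its result with the elapsed time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // Assumption: the input path is the first argument, defaulting to data.xml.
    let path = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "data.xml".to_string());

    // Note: the second read may benefit from the OS file cache.
    let (plain, t_plain) = timed(|| read_plain(&path));
    let (buffered, t_buffered) = timed(|| read_buffered(&path));

    match (plain, buffered) {
        (Ok(a), Ok(b)) => {
            println!("plain:    {} bytes in {:?}", a.len(), t_plain);
            println!("buffered: {} bytes in {:?}", b.len(), t_buffered);
        }
        (Err(e), _) | (_, Err(e)) => eprintln!("could not read {}: {}", path, e),
    }
}
```

If both reads take only a small fraction of the 1.6 seconds, the time is presumably going into the regex step rather than the file access, which would also fit the low system time reported by `time`.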
If you'd like to try it locally, this is the zipped XML file.
What am I doing inefficiently?