I am very new to Rust and am trying to build an HTML parser. I first tried to parse the string and put the tags in a HashMap<&str, i32>. Then I figured out that I have to take care of letter case, so I added tag.to_lowercase(), which creates a String. From there it got my brain to panic.

Below is my code snippet.

fn html_parser<'a>(html:&'a str, mut tags:HashMap<&'a str, i32>) -> HashMap<&'a str, i32>{

    let re = Regex::new("<[:alpha:]+?[\\d]*[:space:]*>+").unwrap();
    let mut count;
    for caps in re.captures_iter(html) {        
        if !caps.at(0).is_none(){
            let tag = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
            count = 1;

            if tags.contains_key(tag){
                count = *tags.get_mut(tag).unwrap() + 1;
            }
            tags.insert(tag,count);
        }       
    }    
    tags
}

which throws this error,

src\main.rs:58:27: 58:97 error: borrowed value does not live long enough
src\main.rs:58 let tag:&'a str = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
                                                         ^~~~~~~~~~~~~~~~~~~
src\main.rs:49:90: 80:2 note: reference must be valid for the lifetime 'a as defined on the block at 49:89...
src\main.rs:49 fn html_parser<'a>(html:&'a str, mut tags:HashMap<&'a str, i32>)-> HashMap<&'a str, i32>{
src\main.rs:58:99: 68:6 note: ...but borrowed value is only valid for the block suffix following statement 0 at 58:98
src\main.rs:58 let tag:&'a str = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());
src\main.rs:63
           ...
error: aborting due to previous error

I read about lifetimes in Rust but still cannot understand this situation.

If anyone has a good HTML tag regex, please recommend so that I can use it.

Shepmaster
Somang Nam
  • Just FYI: if you want to use POSIX character classes, use them inside bracket expressions: `[[:alpha:]]`, `[[:space:]]`. – Wiktor Stribiżew Feb 26 '16 at 15:50
  • @Wiktor Stribizew thank you! – Somang Nam Feb 26 '16 at 18:28
  • Unless you only want to parse a known set of input files that have a very regular format (rather than arbitrary HTML from the web), you probably shouldn’t write your own HTML parser (and [particularly not with regexes](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)). html5ever is an HTML parser in Rust; it has thousands of lines of code to handle all kinds of edge cases. I recommend using [Kuchiki](https://github.com/SimonSapin/kuchiki), which uses html5ever for parsing and provides a tree data structure for parsed documents. – Simon Sapin Feb 26 '16 at 20:25
  • @SimonSapin Thank you for your kind answer. I replied in the comment below, but I am new to Rust and was just trying to learn things through a small project of building my own web scraper. I now understand the warnings about using regexes for an HTML parser. Thank you. – Somang Nam Feb 26 '16 at 22:11

1 Answer

To understand your problem it is useful to look at the function signature:

fn html_parser<'a>(html: &'a str, mut tags: HashMap<&'a str, i32>) -> HashMap<&'a str, i32>

From this signature we can see, roughly, that both the accepted and the returned hash maps may only be keyed by subslices of html. However, in your code you are attempting to insert a string slice that is completely unrelated (in the lifetime sense) to html:

let tag = &*(caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase());

The first problem here (your particular error is about exactly this problem) is that you're attempting to take a slice out of a temporary String returned by to_lowercase(). This temporary string is only alive during this statement, so when the statement ends, the string is deallocated, and its references would become dangling if this was not prohibited by the compiler. So, the correct way to write this assignment is as follows:

let tag = caps.at(0).unwrap().trim_matches('<').trim_matches('>').to_lowercase();
let tag = &*tag;

(Or you can keep just the first tag binding, which is a String, and convert it to a slice where it is used.)
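As a minimal sketch of the difference, using only the standard library: the helper name is_tag and its parameters are made up for illustration and are not part of your program. The commented-out line is the form from the question, which the compiler rejected in the error output you posted.

```rust
/// Hypothetical helper: reports whether `raw` (e.g. "<DIV>") names the tag `name`.
fn is_tag(raw: &str, name: &str) -> bool {
    // The form from the question borrows from a temporary String:
    //     let tag = &*raw.trim_matches('<').trim_matches('>').to_lowercase();
    //
    // Binding the String to a name first keeps it alive until the end of the block.
    let owned: String = raw.trim_matches('<').trim_matches('>').to_lowercase();
    let tag: &str = &owned; // borrows from `owned`, valid as long as `owned` lives
    tag == name
}
```

Here tag is fine to use anywhere inside the function, but it still cannot outlive owned.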

However, your code is not going to work even after this change. The to_lowercase() method allocates a new String which is unrelated to html in terms of lifetime. Therefore, any slice you take out of it will have a lifetime necessarily shorter than 'a. Hence it is not possible to insert such a slice as a key into the map, because the data it points to may not be valid after this function returns (and in this particular case, it will be invalid).
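The same rule can be seen in isolation with three small functions. This is only a sketch; the function names are invented for illustration.

```rust
// Does not compile: the returned slice would point into `lowered`,
// which is dropped when the function returns.
//
// fn lower_slice<'a>(html: &'a str) -> &'a str {
//     let lowered = html.to_lowercase();
//     &lowered
// }

// Compiles: the caller receives ownership of the newly allocated String.
fn lower_owned(html: &str) -> String {
    html.to_lowercase()
}

// Also compiles: trimming only narrows the view into `html`,
// so the result really does live for 'a.
fn strip_brackets<'a>(html: &'a str) -> &'a str {
    html.trim_matches('<').trim_matches('>')
}
```

Trimming can hand back a &'a str because it never allocates, while lowercasing has to produce new data, and that data needs an owner.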

It is hard to tell what the best way to fix this problem is, because it may depend on the overall architecture of your program, but the simplest way would be to create a new HashMap<String, i32> inside the function:

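use std::collections::HashMap;
use regex::Regex;

fn html_parser(html: &str, tags: HashMap<&str, i32>) -> HashMap<String, i32> {
    // Copy the existing counts into a map that owns its keys.
    let mut result: HashMap<String, i32> =
        tags.into_iter().map(|(k, v)| (k.to_owned(), v)).collect();
    // POSIX classes must sit inside a bracket expression: [[:alpha:]].
    let re = Regex::new("<[[:alpha:]]+?[\\d]*[[:space:]]*>+").unwrap();
    for caps in re.captures_iter(html) {
        // Newer regex releases spell this caps.get(0).map(|m| m.as_str()).
        if let Some(cap) = caps.at(0) {
            let tag = cap
                .trim_matches('<')
                .trim_matches('>')
                .to_lowercase();
            // get() yields Option<&i32>, so copy the value out before defaulting to 0.
            let count = result.get(&tag).cloned().unwrap_or(0) + 1;
            result.insert(tag, count);
        }
    }
    result
}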

I've also changed the code for it to be more idiomatic (if let instead of if something.is_none(), unwrap_or() instead of mutable local variables, etc.). This is a more or less direct translation of your original code.
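Another common way to write the counting step is HashMap::entry, which avoids the separate lookup and insert. The sketch below is an illustration rather than a drop-in replacement: count_tags is a made-up name, and it takes the raw tags as a hand-written slice so that it does not depend on the regex crate.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts tags such as "<DIV>" case-insensitively.
fn count_tags(raw_tags: &[&str]) -> HashMap<String, i32> {
    let mut counts = HashMap::new();
    for raw in raw_tags {
        // The owned, lowercased String becomes the map key.
        let tag = raw.trim_matches('<').trim_matches('>').to_lowercase();
        // Insert 0 if the key is missing, then increment in place.
        *counts.entry(tag).or_insert(0) += 1;
    }
    counts
}
```

Calling count_tags(&["<DIV>", "<div>", "<P>"]) gives a map with two keys, "div" and "p".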

As for parsing HTML with regexes, I just cannot resist providing a link to this answer. Seriously consider using a proper HTML parser instead of relying on regexes.

Community
Vladimir Matveev