In my project, I need to download an HTML page (about 50K-100K characters long when read into a String; yes, quite large) and extract some content from it using regular expressions, then insert that content into the database. The performance is quite bad, and I want to know why.
The code works like this (multithreaded):
- use HttpComponents to download the HTML file into a String (String html)
- use regular expressions to extract the content and insert it (the database is MySQL)
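import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("<h.*</a></h.>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(html);
boolean result = m.find();
while (result) {
    // insert into database stuff
    // update database stuff
    result = m.find(); // advance to the next match; without this the loop never ends
}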
The string is very long, but if I split it into pieces, some matches may be missed, which worries me.
I added some print statements and found that, after inserting into the database, there is a delay before the update operations. I can't figure out why, since the connection to the database isn't closed.
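One way to narrow down where the delay comes from is to time each step separately instead of relying on print lines alone. The following is only a minimal sketch: `StepTimer`, `timeMillis`, and the `sleep` placeholders are hypothetical names introduced for illustration, and the sleeps stand in for the real insert and update calls.

```java
import java.util.concurrent.TimeUnit;

public class StepTimer {
    // Runs one step and returns how long it took, in milliseconds.
    static long timeMillis(Runnable step) {
        long start = System.nanoTime();
        step.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Hypothetical stand-in for a database call; replace with the real work.
    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long insertMs = timeMillis(() -> sleep(20)); // placeholder for the insert
        long updateMs = timeMillis(() -> sleep(5));  // placeholder for the update
        System.out.println("insert: " + insertMs + " ms, update: " + updateMs + " ms");
    }
}
```

Timing the regex loop, the insert, and the update as separate steps should show which one actually accounts for the delay.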
Test link then comes `. I.e. it matches until the last occurrence of `` of the page, which might be a much larger string than you intended? – Christoffer Mar 06 '12 at 11:09
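The greediness concern can be illustrated with a small, self-contained sketch. The sample HTML string and the names `GreedyDemo` and `findAll` are assumptions made up for the demonstration; the reluctant variant `.*?` is one possible alternative, not necessarily the right fix for the real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Greedy: ".*" extends to the LAST "</a></h.>" it can reach.
    static final Pattern GREEDY =
            Pattern.compile("<h.*</a></h.>", Pattern.CASE_INSENSITIVE);
    // Reluctant: ".*?" stops at the FIRST "</a></h.>" after each "<h".
    static final Pattern RELUCTANT =
            Pattern.compile("<h.*?</a></h.>", Pattern.CASE_INSENSITIVE);

    // Collects every match of the pattern in the input.
    static List<String> findAll(Pattern p, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample: two headings on one line with a paragraph between them.
        String html = "<h1><a href=\"a\">One</a></h1><p>text</p>"
                + "<h2><a href=\"b\">Two</a></h2>";
        System.out.println(findAll(GREEDY, html).size());    // prints 1
        System.out.println(findAll(RELUCTANT, html).size()); // prints 2
    }
}
```

With the greedy pattern the single match spans both headings and the paragraph between them, so on a 50K-100K string one match can cover far more text than intended. Because `Pattern.DOTALL` is not set, `.` does not cross line terminators, so in practice the over-match is limited to a single line of the page.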