I've attempted to build a program to scrape the web for company management teams. It's very accurate at obtaining many things, including:
-names
-job titles
-images
-emails
-qualifications (MD, PhD, etc.) and suffixes (II, III, Jr.)
The issue I'm running into is scraping each person's description (bio). For instance, on Facebook's Executive Bios page I would want Mark Zuckerberg's description. However, because the HTML structure differs so much from site to site, it is very difficult to scrape this with anything close to 100% accuracy.
I am using Perl and many regular expressions, some of which I believe to be fairly advanced. Is there a better approach or tool for this problem?
My latest attempt was to find the last occurrence of the person's full name on the page, then take all the text until I hit a co-worker's name. While this seems like it should work, it gives me less-than-desirable results.
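To make the heuristic concrete, here is a minimal sketch of it. It is written in Python for illustration rather than being my actual Perl, and the function name `extract_bio`, the sample text, and the names in it are all made up. It assumes the page has already been reduced to plain text and that the list of co-worker names is already known.

```python
def extract_bio(page_text, person, coworkers):
    """Take the text after the LAST occurrence of `person`,
    stopping at the first co-worker name that follows it."""
    start = page_text.rfind(person)
    if start == -1:
        return None  # the name does not appear on the page at all
    start += len(person)

    # Find the nearest co-worker name after the person's name.
    end = len(page_text)
    for name in coworkers:
        pos = page_text.find(name, start)
        if pos != -1 and pos < end:
            end = pos

    # Collapse runs of whitespace left over from the HTML-to-text step.
    return " ".join(page_text[start:end].split())


# Hypothetical, already-flattened page text.
sample = (
    "Our Team: Alice Smith, Bob Jones. "
    "Alice Smith Chief Executive Officer Alice founded the company in 2010. "
    "Bob Jones Chief Financial Officer Bob joined in 2012."
)

print(extract_bio(sample, "Alice Smith", ["Bob Jones"]))
print(extract_bio(sample, "Bob Jones", ["Alice Smith"]))
```

On a flat sample like this one the approach behaves as intended. It depends entirely on the order in which names appear in the page text, though, and it is on real pages that I get the poor results.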
EDIT: I realized this question came across as only trying to parse this specific page. I need something general enough to work on any company's "people page". I know 100% accuracy is unachievable; I'm looking for something that would get me to 50% or more, as I'm currently down around 15-20%.