Interesting challenge... I'm looking for ways to selectively extract data from PDFs. The files are a collection of research abstracts which consistently contain pieces of text I don't want (e.g. authors' names and email addresses, web links, cities etc.).
The body of the text is what I want. I looked at using stopwords to strip out the rest, but that quickly becomes counterproductive (many of the stopwords are actually necessary words within the body text I need).
So, is there a way to do almost the opposite of stopwords, i.e. to specify only the large areas of text you want to keep? For example, where there is a title followed by a block of text (e.g. objective, methods, results), could these sections be selectively extracted?
To add a further challenge, there isn't much consistency between the documents (they don't all have the same headings or length).
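To make the idea concrete, here is a rough sketch of the kind of "keep list" approach I have in mind. It assumes the PDF text has already been pulled out into a plain string by some extraction tool (e.g. pdfminer.six or pypdf), that a section heading either sits alone on its line or is followed by a colon, and that a blank line ends a section. The heading list and the `extract_sections` name are made up for illustration, and I suspect real abstracts will need something much looser.

```python
import re

# Hypothetical keep list; the real documents will need more (and fuzzier) headings.
WANTED_HEADINGS = ("objective", "objectives", "background",
                   "methods", "results", "conclusion", "conclusions")

# A heading is a wanted word alone on a line, or a wanted word followed by
# a colon and (optionally) the first part of the section text.
HEADING_RE = re.compile(
    r"^\s*(" + "|".join(WANTED_HEADINGS) + r")\s*(?::\s*(.*))?$",
    re.IGNORECASE,
)

def extract_sections(text):
    """Return {heading: body} for each wanted heading found in text."""
    sections = {}
    current = None  # heading of the section currently being collected
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = [match.group(2) or ""]
        elif current is not None:
            if not line.strip():
                current = None  # assumption: a blank line ends the section
            else:
                sections[current].append(line.strip())
        # Lines seen before any heading (title, authors, emails) are dropped.
    return {h: " ".join(p for p in parts if p) for h, parts in sections.items()}

sample = (
    "A Study of Things\n"
    "Jane Doe, jane@example.org\n"
    "London, UK\n"
    "\n"
    "Objective: To test things.\n"
    "Methods\n"
    "We did this\n"
    "and that.\n"
    "Results: It worked.\n"
    "\n"
    "http://example.org/more\n"
)
print(extract_sections(sample))
```

On this made-up sample the author line, city and trailing link are dropped and only the three headed sections are kept. The obvious weakness is the one I describe above: it only works when a heading from the list is actually present.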
If anyone has any experience, tips or recommendations, it would be really helpful, as the alternative of manual copy and paste just isn't sustainable.
Many thanks,
Graham.