Extracting text from a PDF... selectively

Question

Interesting challenge... I'm looking for ways to extract data from a PDF selectively. The PDFs are a collection of research abstracts that consistently contain pieces of text I don't want (e.g. authors' names and email addresses, web links, cities, etc.).

The body of the text is what I want. I looked at using stopwords to solve the problem, but that quickly becomes counterproductive (many of the stopwords are actually necessary words within the body text I need).

So, is there a way to take almost the opposite approach to stopwords and keep only the large areas of text you want? For example, where there is a title followed by a block of text (e.g. objective, methods, results), could these sections be selectively extracted?

To add a further challenge, there isn't much consistency between the documents (they don't all have the same headings or length).
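To make the idea concrete, here is a rough Python sketch of the "keep only what sits under a known heading" approach. It assumes the text has already been pulled out of the PDF as a plain string by some extraction tool, and the heading names in HEADINGS are hypothetical placeholders, not taken from the real abstracts. A line counts as a heading only if it is a known heading word on its own or is followed by a colon, and everything before the first heading (title, authors, emails, city) is ignored.

```python
import re

# Hypothetical heading names (an assumption); extend to match the real abstracts.
HEADINGS = ("objectives", "objective", "background", "methods",
            "results", "conclusions", "conclusion")

# A heading is a known word alone on a line, or followed by ":" and optional text.
HEADING_RE = re.compile(
    r"^\s*(" + "|".join(HEADINGS) + r")\s*(?::\s*(.*))?$",
    re.IGNORECASE,
)


def extract_sections(text):
    """Return {heading: body}; text before the first recognised heading is dropped."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Start a new section, keeping any text that follows the colon.
            current = match.group(1).lower()
            sections[current] = [match.group(2) or ""]
        elif current is not None:
            # Continuation line of the section we are currently inside.
            sections[current].append(line.strip())
    # Join the non-empty pieces of each section into one string.
    return {h: " ".join(p for p in parts if p) for h, parts in sections.items()}
```

Because the headings vary between documents, abstracts with none of the listed headings come back as an empty dict, which at least makes them easy to flag for a different treatment. Any trailing text after the last heading (a footer link, say) would be folded into the last section, so this is only a starting point.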

If anyone has any experience, tips or recommendations, it would be really helpful, as the alternative of manual copy and paste just isn't sustainable.

Many thanks,

Graham.

Thanks KJ. I've seen this possible solution (below) that addresses the problem by looking for and extracting blocks of paragraphs. Might be a good first step? https://stackoverflow.com/questions/46211806/extract-subpart-of-pdf-text-in-r — GrBrn, Jun 13 '22 at 10:58
Cheers KJ. All sounds like we're going in the right direction in that case. The nub of it is how to make a reproducible methodology that avoids the early lines we don't want and just extracts the paragraphs/lines of interest. But, it does look like this is a massively useful first step. FYI it's abstracts from an RSS feed that we are dealing with. I wonder if anyone else out there has already faced this problem? — GrBrn, Jun 14 '22 at 12:18
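
For abstracts without usable headings, the paragraph-block idea from the comments could be sketched roughly as follows in Python. This is a heuristic illustration and not the method from the linked answer. It assumes the extracted text separates blocks with blank lines, and the function name body_paragraphs and the min_words threshold are made up for the example. Short blocks (titles, cities, bare links) are dropped, as are blocks containing an email address or URL (typically author lines).

```python
import re

# Loose patterns for spotting front matter; they are not full validators.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def body_paragraphs(text, min_words=25):
    """Keep blank-line-separated blocks that look like body text."""
    kept = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse line breaks and repeated spaces inside the block.
        flat = " ".join(block.split())
        if len(flat.split()) < min_words:
            continue  # too short: likely a title, city or bare link
        if EMAIL_RE.search(flat) or URL_RE.search(flat):
            continue  # likely an author or contact block
        kept.append(flat)
    return kept
```

The threshold needs tuning against a sample of real abstracts, and a genuine body paragraph that happens to contain a link would be discarded too, so it is worth spot-checking what gets dropped.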
