
I have a PDF file and I need to extract small pieces of data from it. It is structured like this:

Page1:

Question 1

......................................

......................................

Question 2

......................................

......................................

Page End

I want to get Question 1 and Question 2 as separate HTML files, each containing its text and images.

I've tried

pdftohtml -c pdffile.pdf output.html

I got HTML files with PNG images, but how do I cut each image into smaller chunks that fit the size of each question? (I want to separate each question into its own file.)

P.S. I have a lot of PDF files, so a command-line tool would be nice.

ahk
  • check this website - http://smallpdf.com/split-pdf, once you split the pages, convert the same into jpeg images if you need!! – Aru Oct 10 '14 at 04:31
  • @Aru I forgot to specify this in the question: I have a lot of PDF files, so a command-line tool would be nice. – ahk Oct 10 '14 at 04:36
  • try this http://www.tiffsoftware.com/Batch-pdf-splitter.html or http://pdf-split.com/download, hope it helps you – Aru Oct 10 '14 at 04:52

1 Answer


I'll try to outline how I would go about it. You mention that every page in your PDF document might have multiple questions, and you basically want to have one HTML file for every question.

It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.

OK, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split that file into multiple files, using the delimiter 'Question' in your case. (Side note: csplit and awk are Unix utilities, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't specifically tried the following code.)

From a relevant SO Post :

 csplit input.txt '/Question/' '{*}'

 awk 'BEGIN{filename="preamble.txt"} /Question/{close(filename); filename=NR".txt"} {print > filename}' input.txt

So, assuming this works, you will have a set of broken HTML files. They are broken because the split leaves dangling < or > characters or other stray HTML elements at the cut points.

So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically, and studying the general structure the program produces when it converts the PDF into HTML. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, and that is something you can take care of. That is why I mention .txt files in the code snippets.

You will basically have a bunch of text files containing only the content HTML, without the usual opening tags of an HTML file, because we removed those initially. Then it's only a matter of reading each file, taking care of the element that surrounds the string 'Question', adding the html, head and body elements around the content, and saving the result as an .html file. You could do this in any programming language of your choice that supports file reading and writing (it would be a fun exercise).
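As a rough illustration of that last step, here is a minimal Python sketch that does the splitting and the re-wrapping in one pass. It is untested against real pdftohtml output and rests on assumptions that are not in the original post: each question starts on a line containing "Question" followed by a number, every element sits on a single line, and the html, head and body tags have already been removed. The names split_questions, wrap_html, write_question_files and the question_N.html file names are hypothetical. It does not crop the PNG images.

```python
import re
from pathlib import Path

# Assumption: a question starts on any line containing "Question <number>".
QUESTION_RE = re.compile(r"Question\s+\d+")


def split_questions(body_html):
    """Return one HTML fragment per question, split at lines matching QUESTION_RE."""
    chunks = []
    current = None  # stays None until the first question line is seen
    for line in body_html.splitlines():
        if QUESTION_RE.search(line):
            current = [line]  # start a new chunk at the question marker
            chunks.append(current)
        elif current is not None:
            current.append(line)  # lines before the first question are dropped
    return ["\n".join(chunk) for chunk in chunks]


def wrap_html(fragment, title):
    """Put the html, head and body elements back around a fragment."""
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f'<meta charset="utf-8">\n<title>{title}</title>\n'
        "</head>\n<body>\n"
        f"{fragment}\n"
        "</body>\n</html>\n"
    )


def write_question_files(body_html, out_dir):
    """Write question_1.html, question_2.html, ... into out_dir and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_questions(body_html), start=1):
        path = out / f"question_{i}.html"
        path.write_text(wrap_html(chunk, f"Question {i}"), encoding="utf-8")
        paths.append(path)
    return paths
```

To use it, read the stripped .txt file into a string and pass it to write_question_files together with an output directory. If the converter wraps several questions inside one parent element, the line-based split will still leave unbalanced tags, so the marker pattern would need adjusting to the actual output.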

I hope this gets you started in the right direction.

Vivek Pradhan