Here is how I would go about it. You mention that every page in your PDF document might have multiple questions, and you basically want to have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
OK, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split your file into multiple files based on the delimiter 'Question' in your case. (Side note: csplit and awk are Unix utilities that are standard on Linux, but I'm sure there are alternatives or ports if you are on Windows or a Mac. I haven't specifically tried the following code.)
From a relevant SO post:
csplit input.txt '/^Question$/' '{*}'
awk '/Question/{filename=NR".txt"} filename{print > filename}' input.txt
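If you'd rather not depend on csplit or awk, the same split can be sketched in Python. This is only a rough equivalent, not what the SO post does: the function names, the 'Question' marker default and the chunk_NNN.txt output names are my own assumptions, and anything before the first marker ends up in its own chunk.

```python
from pathlib import Path


def split_on_marker(lines, marker="Question"):
    """Group lines into chunks, starting a new chunk at every line that contains marker."""
    chunks = []
    current = []
    for line in lines:
        # A marker line closes the chunk collected so far (if any).
        if marker in line and current:
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    return chunks


def write_chunks(chunks, out_dir):
    """Write each chunk to out_dir/chunk_000.txt, chunk_001.txt, ... (hypothetical names)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, chunk in enumerate(chunks):
        (out / f"chunk_{index:03d}.txt").write_text("".join(chunk), encoding="utf-8")
```

You would call it with something like write_chunks(split_on_marker(open("input.txt", encoding="utf-8").readlines()), "parts").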
So, assuming this works, you will have a couple of broken HTML files. Broken because they'll be unsanitized, with dangling < or > or some other stray HTML elements left over after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically, and going through the general structure of how the program converts the PDF into HTML. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, which is something you can take care of. That is why I mention .txt files in the code snippets.
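Removing the wrapper elements by hand gets tedious, so here is a minimal Python sketch of that step. It is an assumption on my part that a regular expression is good enough for the converter's output. It only deletes the html, head and body tags themselves, so whatever sat inside head (title, styles) stays in the text and may need separate cleanup.

```python
import re

# Matches opening and closing <html>, <head> and <body> tags, with or without
# attributes, in any letter case. The \b keeps tags like <header> untouched.
WRAPPER_TAG = re.compile(r"</?(?:html|head|body)\b[^>]*>", re.IGNORECASE)


def strip_wrapper_tags(html_text):
    """Return html_text with the html/head/body tags removed, content kept."""
    return WRAPPER_TAG.sub("", html_text)
```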
You will basically have a bunch of text files with just the HTML content and not the usual starting tags for an HTML file, because we removed those initially. Then it's only a matter of reading each file, taking care of the element that surrounds the string 'Question', adding the html, head and body elements around the content, and saving the results as .html files. You could do this in any programming language of your choice that supports file reading and writing (would be a fun exercise).
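As a starting point for that exercise, the wrapping step could look roughly like this in Python. The function names and the use of the file name as the page title are my own choices, and the sketch assumes the source directory holds only the split fragments. It does not repair dangling tags inside a fragment; that part depends on the pattern you find in your converter's output.

```python
from pathlib import Path


def wrap_fragment(fragment, title="Question"):
    """Wrap a bare HTML fragment in a minimal html/head/body skeleton."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"{fragment.strip()}\n"
        "</body>\n</html>\n"
    )


def wrap_all(src_dir, dst_dir):
    """Turn every .txt fragment in src_dir into a standalone .html file in dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for txt in sorted(Path(src_dir).glob("*.txt")):
        page = wrap_fragment(txt.read_text(encoding="utf-8"), title=txt.stem)
        (dst / f"{txt.stem}.html").write_text(page, encoding="utf-8")
```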
I hope this gets you started in the right direction.