I have roughly 6,000 to 6,500 Microsoft Word .docx files with various types of formatted answer scripts inside them, in the sequence:
Python Programming Question in Bold
Answer in form of complete, correctly-indented, single-spaced, self-sufficient code
Unfortunately, there seems to be no fixed pattern delineating the code blocks from normal text. Some examples from the first 50 or so files:
Entire Question in bold, after which code starts abruptly, in bold/italics
Question put in comments, after which code continues
Question completely missing, just code with numbered lists indicating start
Question completely missing, with C/Python-style comments indicating start
etc.
For now, I'm extracting the entire unformatted text through python-docx like this:
from docx import Document

doc = Document(infil)
# For Unicode handling.
new_paragraphs = []
for paragraph in doc.paragraphs:
    new_paragraphs.append((paragraph.text).encode("utf-8"))
# convert() is a helper that is not shown here; it must return str
# for the '\n'.join() below to work.
new_paragraphs = list(map(lambda x: convert(x), new_paragraphs))
with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)
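The plain-text extraction above throws away the bold cue that marks many of the questions. A minimal sketch of a variant that keeps it is below. The function name paragraphs_with_bold is hypothetical; it relies only on the paragraphs, runs, text and bold attributes that python-docx objects expose, so it should accept the object returned by Document(infil). It assumes bold is applied directly to runs: run.bold is None when bold is inherited from a style, and that case is not detected here.

```python
def paragraphs_with_bold(doc):
    """Yield (text, is_bold) for every paragraph that has visible text.

    A paragraph counts as bold only when every non-blank run has
    run.bold set to True. Bold inherited from a style shows up as
    None and is therefore reported as not bold (an assumption of
    this sketch, not a limitation of python-docx itself).
    """
    for paragraph in doc.paragraphs:
        # Ignore runs that hold only whitespace; they often carry stray formatting.
        runs = [run for run in paragraph.runs if run.text.strip()]
        if not runs:
            continue
        yield paragraph.text, all(run.bold is True for run in runs)
```

A bold paragraph followed by non-bold ones would then be a candidate question/answer boundary, although this does nothing for the files where the code itself is bold or the question is missing.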
Once extracted, I'll run the programs using the PyPy sandboxing feature, which I understand is safe, and then assign points as if in a contest.
What I'm completely stuck on is how to detect the start and end of the code programmatically. Most of the language detection APIs are unneeded since I already know the language. This question, "How to detect source code in a text?", suggests using linters and syntax highlighters like the Google Code Prettifier, but they don't solve the issue of detecting separate programs.
A suitable solution, from this programmers.se question, seems to be training Markov chains, but I wanted some second opinions before embarking on such a vast project.
This extraction code will also be provided to all students after evaluation.
I apologize if the question is too broad or the answer too obvious.