What is a safe way to extract Python code blocks from docx files and run them in a sandbox?

Question

I have roughly 6000~6500 Microsoft Word .docx files with various types of formatted answer scripts inside them, in the sequence:

Python Programming Question in Bold

Answer in form of complete, correctly-indented, single-spaced, self-sufficient code

Unfortunately, there seems to be no fixed pattern delineating the code blocks from normal text. Some examples from the first 50 or so files:

Entire Question in bold, after which code starts abruptly, in bold/italics
Question put in comments, after which code continues
Question completely missing, just code with numbered lists indicating start
Question completely missing, with C/Python-style comments indicating the start

etc.

For now, I'm extracting the entire unformatted text through python-docx like this:

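from docx import Document

# infil and outfil are the input .docx path and the output text path.
doc = Document(infil)

# For Unicode handling: encode each paragraph's text to UTF-8 bytes.
new_paragraphs = [paragraph.text.encode("utf-8") for paragraph in doc.paragraphs]

# convert() is a helper not shown here; it must return str for the join below to work.
new_paragraphs = [convert(x) for x in new_paragraphs]

with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)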

Once extracted, I'll run them using the PyPy sandboxing feature, which I understand is safe, and then assign points as if in a contest.

What I'm completely stuck on is how to detect the start and end of the code programmatically. Most language detection APIs are unneeded, since I already know the language. The question "How to detect source code in a text?" suggests using linters and syntax highlighters like the Google Code Prettifier, but they don't solve the issue of detecting separate programs.

A suitable solution, from this programmers.se question, seems to be training Markov chains, but I wanted some second opinions before embarking on such a vast project.

This extraction code will also be provided to all students after evaluation.

I apologize if the question is too broad or the answer too obvious.

score 1 · Accepted Answer · answered Mar 13 '17 at 03:31

Hmm, so you are looking for some kind of formatting pattern? That sounds kind of weird to me. Is there any kind of text or string pattern that you can exploit? I'm not sure if this will help or not, but the VBA script below searches through all Word documents in a folder. For each search term you enter in row 1, it puts "Yes" in that column on the document's row when the term is found. It also puts a hyperlink in column A, so you can click the link and open the file rather than searching around for it.

Script:

Sub OpenAndReadWordDoc()

    Rows("2:1000000").Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.ClearContents
    Range("A1").Select

    ' assumes that the previous procedure has been executed
    Dim oWordApp As Word.Application
    Dim oWordDoc As Word.Document
    Dim blnStart As Boolean
    Dim r As Long
    Dim sFolder As String
    Dim strFilePattern As String
    Dim strFileName As String
    Dim sFileName As String
    Dim ws As Worksheet
    Dim c As Long
    Dim n As Long

    '~~> Establish a Word application object
    On Error Resume Next
    Set oWordApp = GetObject(, "Word.Application")
    If Err() Then
        Set oWordApp = CreateObject("Word.Application")
        ' We started Word for this macro
        blnStart = True
    End If
    On Error GoTo ErrHandler

    Set ws = ActiveSheet
    r = 1 ' startrow for the copied text from the Word document
    ' Last column
    n = ws.Range("A1").End(xlToRight).Column

    sFolder = "C:\Users\your_path_here\"

    '~~> This is the extension you want to go in for
    strFilePattern = "*.doc*"
    '~~> Loop through the folder to get the word files
    strFileName = Dir(sFolder & strFilePattern)
    Do Until strFileName = ""
        sFileName = sFolder & strFileName

        '~~> Open the word doc
        Set oWordDoc = oWordApp.Documents.Open(sFileName)
        ' Increase row number
        r = r + 1
        ' Enter file name in column A
        ws.Cells(r, 1).Value = sFileName

        ActiveCell.Offset(1, 0).Select
        ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
            SubAddress:="A" & r, TextToDisplay:=sFileName

        ' Loop through the columns
        For c = 2 To n
            If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
                    MatchWholeWord:=True, MatchCase:=False) Then
                ' If text found, enter Yes in column number c
                ws.Cells(r, c).Value = "Yes"
            End If
        Next c
        oWordDoc.Close SaveChanges:=False

        '~~> Find next file
        strFileName = Dir()
    Loop

ExitHandler:
    On Error Resume Next
    ' close the Word application
    Set oWordDoc = Nothing
    If blnStart Then
        ' We started Word, so we close it
        oWordApp.Quit
    End If
    Set oWordApp = Nothing
    Exit Sub

ErrHandler:
    MsgBox Err.Description, vbExclamation
    Resume ExitHandler
End Sub

Function GetDirectory(path)
    GetDirectory = Left(path, InStrRev(path, "\"))
End Function
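
On the text-pattern idea: the question says each answer is complete, correctly indented, self-sufficient code, so one pattern that might be exploitable is simply "does this run of lines parse as Python?". The sketch below is a rough baseline only, not a tested solution, and it is a different technique from the VBA keyword scan above. The names looks_like_code and find_code_blocks are made up for illustration. It takes the paragraph texts as a list of strings and returns (start, end) index pairs for the longest runs that ast.parse accepts. Parsing does not execute the text, so this step can run before anything goes into the sandbox.

```python
import ast


def looks_like_code(source):
    """Return True if `source` parses as Python and is more than bare words."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    # A lone word such as "Question" is a valid expression, so require at
    # least one call or one non-expression statement (def, for, =, import).
    return any(
        isinstance(node, ast.Call)
        or (isinstance(node, ast.stmt) and not isinstance(node, ast.Expr))
        for node in ast.walk(tree)
    )


def find_code_blocks(lines):
    """Return (start, end) pairs, end exclusive, of runs that parse as code."""
    blocks = []
    i, n = 0, len(lines)
    while i < n:
        if not lines[i].strip():
            i += 1
            continue
        end = None
        # Try the longest candidate first and shrink until something parses.
        for j in range(n, i, -1):
            if looks_like_code("\n".join(lines[i:j])):
                end = j
                break
        if end is None:
            i += 1  # this line cannot start a block; treat it as prose
            continue
        # Drop trailing blank lines from the block.
        while end > i and not lines[end - 1].strip():
            end -= 1
        blocks.append((i, end))
        i = end
    return blocks


sample = [
    "Write a program that prints squares of 1 to 3.",
    "for i in range(1, 4):",
    "    print(i * i)",
    "",
    "Question 2: add two numbers",
    "def add(a, b):",
    "    return a + b",
    "print(add(2, 3))",
]
print(find_code_blocks(sample))
```

Known weak spots: a prose line that happens to be valid Python (a single word directly above the code, for example) gets absorbed into the block, two programs with no prose between them come out as one block, and the check uses the grammar of whichever interpreter runs it, so Python 2 style print statements are rejected under Python 3. Each candidate start re-parses the text, which is fine for short answer scripts but slow on long documents.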
