I have been using the FUNSD dataset for sequence labeling on unstructured documents, following the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding. After cleaning the data and converting it from a dict to a dataframe, it looks like this:
The dataset is laid out as follows:
- The column 'id' is the unique identifier for each word group inside a document; the word group's text is shown in the column 'text' (like nodes).
- The column 'label' identifies whether the word group is classified as a 'question' or an 'answer'.
- The column 'linking' denotes the word groups that are linked (like edges), connecting each 'question' to its corresponding 'answers'.
- The column 'box' denotes the location coordinates (x, y of the top-left corner; x, y of the bottom-right corner) of the word group, relative to the top-left corner (0, 0).
- The column 'words' holds each individual word inside the word group, together with its location (box).
I aim to train a classifier, using a Graph Neural Network, that identifies which words inside the column 'words' are linked together. The first step is to transform my current dataset into a network. My questions are as follows:
- Is there a way to break each row in the column 'words' into two columns, [box_word, text_word], each holding a single word, while replicating the other columns that stay the same, [id, label, text, box], so that the final dataframe has these columns: [box, text, label, box_word, text_word]?
- I can tokenize the columns 'text' and 'text_word', one-hot encode the column 'label', and split the multi-valued numeric columns 'box' and 'box_word' into individual columns, but how do I split up or rearrange the column 'linking' to define the edges of my network graph?
- Am I taking the correct route in using the dataframe to generate a network and then using it to train a GNN?
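To make the first two questions concrete, here is a minimal sketch of one possible approach in pandas. It runs on a made-up toy frame, not real FUNSD rows, and assumes that each entry of 'words' is a list of dicts with 'text' and 'box' keys and that each entry of 'linking' is a list of [source_id, target_id] pairs. The names `word_df`, `edge_df`, `word_edges`, and `word_id` are hypothetical. The word-level edges at the end, which connect every word of a linked group to every word of its partner group, are only one possible modeling choice.

```python
import pandas as pd

# Toy stand-in for ONE document (made-up values, columns as described above).
df = pd.DataFrame({
    "id": [0, 1, 2],
    "text": ["Name:", "John Smith", "Date:"],
    "label": ["question", "answer", "question"],
    "box": [[10, 10, 60, 25], [70, 10, 160, 25], [10, 40, 55, 55]],
    "linking": [[[0, 1]], [[0, 1]], []],
    "words": [
        [{"text": "Name:", "box": [10, 10, 60, 25]}],
        [{"text": "John", "box": [70, 10, 110, 25]},
         {"text": "Smith", "box": [115, 10, 160, 25]}],
        [{"text": "Date:", "box": [10, 40, 55, 55]}],
    ],
})

# Question 1: one row per word, with the other columns replicated.
# Rows whose 'words' list is empty become NaN after explode, so drop them.
word_df = df.explode("words", ignore_index=True).dropna(subset=["words"])
word_df["text_word"] = word_df["words"].map(lambda w: w["text"])
word_df["box_word"] = word_df["words"].map(lambda w: w["box"])
word_df = word_df.drop(columns="words").reset_index(drop=True)

# Question 2: flatten 'linking' into a de-duplicated edge list between word groups.
# Each link appears on both of its endpoints, hence drop_duplicates().
edges = df["linking"].explode().dropna().map(tuple).drop_duplicates().tolist()
edge_df = pd.DataFrame(edges, columns=["source", "target"])

# Optional (assumption): lift group-level edges to word-level edges by
# pairing every word of the source group with every word of the target group.
word_df["word_id"] = range(len(word_df))
src = word_df[["id", "word_id"]].rename(columns={"id": "source", "word_id": "source_word"})
dst = word_df[["id", "word_id"]].rename(columns={"id": "target", "word_id": "target_word"})
word_edges = edge_df.merge(src, on="source").merge(dst, on="target")

print(word_df[["id", "label", "text", "text_word", "box_word"]])
print(edge_df)
print(word_edges)
```

Because 'id' is only unique inside a document, a real dataset with several documents would need a document key added to both the node and edge tables before merging. An edge table in this source/target shape can be passed to `networkx.from_pandas_edgelist`, or transposed into the 2 x num_edges index array that most GNN libraries expect.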
Any and all help or tips are appreciated.