I have been using the FUNSD dataset to predict sequence labels in unstructured documents, following this paper: LayoutLM: Pre-training of Text and Layout for Document Image Understanding. After cleaning the data and converting it from a dict to a dataframe, it looks like this (screenshot: FUNSD Dataframe). The dataset is laid out as follows:

  • The column 'id' is the unique identifier for each word group inside a document; the word group itself is shown in the column 'text' (like nodes).
  • The column 'label' identifies whether the word group is classified as a 'question' or an 'answer'.
  • The column 'linking' denotes which word groups are 'linked' (like edges), connecting 'questions' to their corresponding 'answers'.
  • The column 'box' gives the location coordinates (x, y of the top left; x, y of the bottom right) of the word group, relative to the top-left corner (0, 0).
  • The column 'words' holds each individual word inside the word group, along with its location (box).

I aim to train a classifier to identify words inside the column 'words' that are linked together by using a Graph Neural Net, and the first step is to be able to transform my current dataset into a Network. My questions are as follows:

  1. Is there a way to break each row in the column 'words' into two columns [box_word, text_word], each holding a single word, while replicating the other columns that stay the same, [id, label, text, box], so that the final dataframe has these columns: [box, text, label, box_word, text_word]?

  2. I can tokenize the columns 'text' and 'text_word', one-hot encode the column 'label', and split the multi-value columns 'box' and 'box_word' into individual numeric columns, but how do I split up or rearrange the column 'linking' to define the edges of my network graph?

  3. Am I taking the correct route in using the dataframe to generate a network and then using it to train a GNN?

Any and all help or tips are appreciated.

El_1988
  • Which library are you using to construct GNN? What is the main difficulty of putting the `linking` data into the network constructor? – Bill Huang Oct 06 '20 at 01:56
  • @BillHuang I'm trying networkx, however I am not sure at all how to use the linked data in its current format or layout. Do you have any links I could use as a guide? I've tried https://plotly.com/python/network-graphs/ and https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python, but neither addresses how to deal with variable-length lists inside a dictionary. – El_1988 Oct 06 '20 at 01:59
  • Please [provide a reproducible dataset](https://stackoverflow.com/questions/20109391) and don't paste data only as screenshots. Besides, what are the complete contents of the `words` column in row 4? – Bill Huang Oct 06 '20 at 02:15

1 Answer

Edit: the code now processes multiple entries in the column words.

Your questions 1 and 2 are answered in the code. It is actually quite simple (assuming the data format is correctly represented by what is shown in the screenshot). Digest:

Q1: apply the splitting function on the column and unpack by .tolist() such that separate columns can be created. See this post also.

Q2: Use list comprehension to unpack the extra list layer and retain only non-empty edges.

Q3: Yes and no. Yes, because pandas is good at organizing data with heterogeneous types. For example, lists, dicts, ints and floats can live in different columns. Several I/O functions, such as pd.read_csv() or pd.read_json(), are also very handy.

No, because there is overhead in data access, which is especially costly when iterating over rows (records). Therefore, the transformed data that feeds directly into your model is usually converted into numpy.array or a more efficient format. That format conversion is a task the data scientist has to handle.
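As a rough illustration of that conversion step, here is a minimal sketch that turns a dataframe into plain NumPy arrays. The sample rows, the feature layout (box coordinates followed by one-hot labels) and the names x and edge_index are my own assumptions for illustration, not something FUNSD or any GNN library prescribes. It also assumes the same pair may be listed under both linked word groups, so duplicates are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row per word group, with FUNSD-style columns.
df = pd.DataFrame(
    {
        "id": [0, 4, 6],
        "label": ["question", "answer", "question"],
        "box": [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
        "linking": [[[0, 4]], [[0, 4]], []],
    }
)

# Node features: four box coordinates, then the one-hot encoded label.
box = np.array(df["box"].tolist(), dtype=np.float32)             # shape (3, 4)
onehot = pd.get_dummies(df["label"]).to_numpy(dtype=np.float32)  # shape (3, 2)
x = np.hstack([box, onehot])                                     # shape (3, 6)

# Edges: flatten every row's links, drop duplicates, and map the ids
# to row positions so they can index into the feature matrix.
pos = {node_id: i for i, node_id in enumerate(df["id"])}
edges = sorted({(pos[a], pos[b]) for links in df["linking"] for a, b in links})
edge_index = np.array(edges, dtype=np.int64).reshape(-1, 2).T    # shape (2, E)
```

The (2, E) layout is only one common convention; check what your GNN library actually expects before relying on it.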

Code and Output

I made up my own sample dataset. Irrelevant columns are left out (I am not obliged to reproduce them, and a minimal example should not include them).

import networkx as nx
import pandas as pd

# data
df = pd.DataFrame(
    data={
        "words": [
            [{"box": [1, 2, 3, 4], "text": "TO:"}, {"box": [7, 7, 7, 7], "text": "777"}],
            [{"box": [1, 2, 3, 4], "text": "TO:"}],
            [{"text": "TO:", "box": [1, 2, 3, 4]}, {"box": [4, 4, 4, 4], "text": "444"}],
            [{"text": "TO:", "box": [1, 2, 3, 4]}],
        ],
        "linking": [
            [[0, 4]],
            [],
            [[4, 6]],
            [[6, 0]],
        ]
    }
)


# Q1. split
def split(el):
    ls_box = []
    ls_text = []
    for dic in el:
        ls_box.append(dic["box"])
        ls_text.append(dic["text"])
    return ls_box, ls_text

# Straightforward, but it raises a deprecation warning (and may fail on
# newer NumPy versions) because the nested lists are ragged:
#   df[["box_word", "text_word"]] = df["words"].apply(split).tolist()
# To avoid that, transpose the list of tuples and assign column by column.
ls_tup = df["words"].apply(split).tolist()  # len: 4x2
ls_tup_tr = list(map(list, zip(*ls_tup)))  # len: 2x4
df["box_word"] = ls_tup_tr[0]
df["text_word"] = ls_tup_tr[1]

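# Q2. construct graph
# Flatten one list layer; this keeps every pair when a row has several links
# and skips rows whose linking list is empty.
ls_edges = [edge for links in df["linking"] for edge in links]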
print(ls_edges)  # [[0, 4], [4, 6], [6, 0]]

g = nx.Graph()
g.add_edges_from(ls_edges)
list(g.nodes)  # [0, 4, 6]
list(g.edges)  # [(0, 4), (0, 6), (4, 6)]
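The edge list above only uses made-up ids. As a minimal sketch of how the other columns could be attached to the graph, the following stores label and text as node attributes keyed by id. The sample rows are hypothetical, and it assumes each id appears once and that a pair may be repeated under both linked word groups (an undirected nx.Graph collapses such repeats).

```python
import networkx as nx
import pandas as pd

# Hypothetical sample: a question linked to two answers.
df = pd.DataFrame(
    {
        "id": [0, 4, 6],
        "label": ["question", "answer", "answer"],
        "text": ["TO:", "John", "Smith"],
        "linking": [[[0, 4], [0, 6]], [[0, 4]], [[0, 6]]],
    }
)

g = nx.Graph()
for row in df.itertuples(index=False):
    # One node per word group, carrying its label and text as attributes.
    g.add_node(row.id, label=row.label, text=row.text)
    # Add every [source, target] pair listed for this word group.
    g.add_edges_from(tuple(pair) for pair in row.linking)

print(sorted(g.edges))      # [(0, 4), (0, 6)]
print(g.nodes[4]["label"])  # answer
```

If the direction from question to answer matters for your model, nx.DiGraph could be swapped in, but then repeated pairs need a consistent orientation.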

Q1 output

# trim the first column for printing
df_show = df.copy()
df_show["words"] = df_show["words"].apply(lambda s: str(s)[:10])
df_show

Out[51]: 
        words   linking                      box_word   text_word
0  [{'box': [  [[0, 4]]  [[1, 2, 3, 4], [7, 7, 7, 7]]  [TO:, 777]
1  [{'box': [        []                [[1, 2, 3, 4]]       [TO:]
2  [{'text':   [[4, 6]]  [[1, 2, 3, 4], [4, 4, 4, 4]]  [TO:, 444]
3  [{'text':   [[6, 0]]                [[1, 2, 3, 4]]       [TO:]
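The output above keeps one row per word group, with lists in box_word and text_word. If what you want is literally one row per word with the other columns replicated, a minimal sketch using DataFrame.explode could look like this. The sample rows and the name df_words are hypothetical, and it assumes every row has at least one entry in words (an empty list would explode to NaN and break the lookups).

```python
import pandas as pd

# Hypothetical sample with the columns named in the question.
df = pd.DataFrame(
    {
        "id": [0, 1],
        "label": ["question", "answer"],
        "text": ["TO: 777", "TO:"],
        "words": [
            [{"box": [1, 2, 3, 4], "text": "TO:"}, {"box": [7, 7, 7, 7], "text": "777"}],
            [{"box": [1, 2, 3, 4], "text": "TO:"}],
        ],
    }
)

# One row per dict in 'words'; id, label and text are repeated automatically.
df_words = df.explode("words", ignore_index=True)

# Pull the two fields out of each word dict into their own columns.
df_words["box_word"] = df_words["words"].map(lambda w: w["box"])
df_words["text_word"] = df_words["words"].map(lambda w: w["text"])
df_words = df_words.drop(columns="words")

print(df_words[["id", "text_word"]])
```

Here the first word group contributes two rows and the second contributes one, so df_words has three rows in total.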
Bill Huang
  • I'm using this as a roadmap. The issue is that when it splits and applies in part 1, it only takes the first element in 'words'; it doesn't create rows for the other elements in 'words'. – El_1988 Oct 06 '20 at 23:42
  • If the column `words` contains multiple entries, just write a `for` loop in the `split()` function to return lists of multiple elements. Besides, you STILL HAVEN'T provided the sample dataset, nor shown what the desired output is, after such a long time. Please note that it is the asker's sole responsibility to describe the data well and make it easy to reproduce. An answer cannot foresee issues that are not present in the sample you provided. – Bill Huang Oct 07 '20 at 01:32
  • I have added the code for this issue. Please also take a look at the output added. – Bill Huang Oct 07 '20 at 02:56