Measuring the similarity of documents from images is more complicated than measuring it from text, for two reasons.
- The images could be similar in terms of brightness, textual content, diagrams, or symbols.
- It is often harder to find a representation of a document from the images it contains than from its textual information.
Solution
My solution is to use machine learning to find a representation of a document and then use this representation to classify the document. Here I will give a Keras implementation of the solution I propose.
Network type
I propose using convolutional layers for feature extraction followed by recurrent layers for sequence classification. I've chosen Keras because of my familiarity with it and because it has a simple API for defining a network that combines convolutional and recurrent layers. But the code can easily be ported to other libraries such as PyTorch, TensorFlow, etc.
Images pre-processing
There are many ways to pre-process document images for neural networks. I make the following assumptions.
- Images contain horizontal text rather than vertical text.
- The document image size is fixed. If the image size is not fixed, it can be resized using OpenCV's resize method.
Split the images into horizontal strips so that the text lines are fed to the network as a sequence (it is more efficient if the split can be made on an empty line). I will show this using NumPy for a single document. In the following implementation, I assume the image of a single document has shape (100, 100, 3).
First, let's define image_shape, the shape of the document images:
import numpy as np

image_shape = (100, 100, 3)
split_size = 25  # this should be a factor of image_shape[0]

doc_images = []  # sequence of image strips for a single document
doc_image = np.zeros(image_shape)
splitted_images = np.split(doc_image, image_shape[0] // split_size, axis=0)
doc_images.extend(splitted_images)
doc_images = np.array(doc_images)  # shape: (4, 25, 100, 3)
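The same split can be applied to several documents and the results stacked to build a training batch. A minimal sketch, assuming a hypothetical batch of 8 placeholder document images:

```python
import numpy as np

image_shape = (100, 100, 3)
split_size = 25  # a factor of image_shape[0]
num_docs = 8     # hypothetical batch size

# placeholder batch of document images
batch = np.zeros((num_docs,) + image_shape)

# split each document into a sequence of horizontal strips, then stack
sequences = np.stack(
    [np.array(np.split(img, image_shape[0] // split_size, axis=0)) for img in batch]
)
print(sequences.shape)  # (8, 4, 25, 100, 3)
```

The resulting array has one sequence of 4 strips per document, which matches the (samples, timesteps, rows, cols, channels) input layout the network below expects.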
The network implementation
Keras has a ConvLSTM2D layer for dealing with sequential images. The input to the network is a list of image sequences produced by splitting the document images.
from keras.models import Sequential
from keras.layers import ConvLSTM2D, Dense, Flatten

num_classes = 10

model = Sequential()
model.add(ConvLSTM2D(32, (3, 3),
                     input_shape=(None, split_size, image_shape[1], image_shape[2]),
                     padding='same',
                     return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=False))
model.add(Flatten())
model.add(Dense(1024, activation="relu"))
model.add(Dense(num_classes, activation="softmax"))
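To check that the network accepts split-image sequences end to end, it can be compiled with a categorical cross-entropy loss and fit on placeholder data. This is a pared-down sketch, not the full model above: it uses fewer ConvLSTM2D layers and drops the Dense(1024) layer to keep the parameter count small, and the random data, batch size, and epoch count are illustrative only:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import ConvLSTM2D, Dense, Flatten
from keras.utils import to_categorical

num_classes = 10
split_size = 25
image_shape = (100, 100, 3)
timesteps = image_shape[0] // split_size  # strips per document

# pared-down version of the proposed network
model = Sequential()
model.add(ConvLSTM2D(8, (3, 3),
                     input_shape=(None, split_size, image_shape[1], image_shape[2]),
                     padding='same', return_sequences=False))
model.add(Flatten())
model.add(Dense(num_classes, activation="softmax"))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# random placeholder data standing in for real split document images
x = np.random.rand(4, timesteps, split_size, image_shape[1], image_shape[2])
y = to_categorical(np.random.randint(num_classes, size=4), num_classes)

model.fit(x, y, epochs=1, batch_size=2, verbose=0)
preds = model.predict(x, verbose=0)
print(preds.shape)  # (4, 10)
```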
Ideally this model will work because it might learn a hierarchical representation (characters, words, sentences, contexts, symbols) of a document from its image.