
I am working on a project where I have processed and stored single-page medical reports with labelled categories. The user will input one document, and I have to classify which category it belongs to.

I have converted all documents to grayscale image format and stored them for comparison purposes.

I have a dataset of images with the following columns:

  • image_path: path to the image
  • histogram_value: histogram of the image, calculated using the cv2.calcHist function
  • np_avg: average value of all pixels of the image, calculated using np.average
  • category: category of the image

I am planning to use this approach:

  • Calculate the histogram_value of the input image and find the nearest 10 matching images
  • Calculate the np_avg of the input image and find the nearest 10 matching images
  • Take the intersection of both result sets
  • If more than one image is found, do template matching to find the best fit (see the sketch after this list)
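Roughly what I have in mind, as a sketch (the helper names, K = 10, and the correlation metric are my own guesses; rows are dicts with the columns above):

import cv2
import numpy as np

K = 10  # number of nearest matches to keep

def nearest_by_histogram(input_img, rows):
    # Higher correlation = more similar histograms.
    hist = cv2.calcHist([input_img], [0], None, [256], [0, 256])
    ranked = sorted(rows, key=lambda r: cv2.compareHist(
        hist, r['histogram_value'], cv2.HISTCMP_CORREL), reverse=True)
    return ranked[:K]

def nearest_by_average(input_img, rows):
    # Rank stored rows by absolute difference of mean pixel value.
    avg = np.average(input_img)
    return sorted(rows, key=lambda r: abs(r['np_avg'] - avg))[:K]

def classify(input_img, rows):
    # Intersect both candidate sets on image_path.
    paths = ({r['image_path'] for r in nearest_by_histogram(input_img, rows)} &
             {r['image_path'] for r in nearest_by_average(input_img, rows)})
    candidates = [r for r in rows if r['image_path'] in paths]
    # Tie-break with template matching: highest normalized correlation wins
    # (assumes at least one candidate survives the intersection).
    best = max(candidates, key=lambda r: cv2.matchTemplate(
        input_img, cv2.imread(r['image_path'], cv2.IMREAD_GRAYSCALE),
        cv2.TM_CCOEFF_NORMED).max())
    return best['category']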

I have very little knowledge of the image-processing domain. Will the above mechanism be reliable for my purpose?

I checked SO and found a few questions on the same topic, but they have very different problems and desired outcomes. This question looks similar to my situation, but it's very generic and I am not sure it will work in my scenario.

Link to sample reports

Gaurav Gandhi
  • Since these are documents containing text, if you can do a reasonable OCR, the words thus obtained might serve as better features than pixel values. – dhanushka Dec 26 '18 at 11:32
  • Is the comparison solely text-based? Do the reports contain any images as well? If yes, are they consistently appearing for every report? – amanb Dec 26 '18 at 17:55
  • @dhanushka, a reasonable OCR is a difficult thing to create by myself; can you suggest some open-source solution I can use? I tried Tesseract; it's not working well in my case. – Gaurav Gandhi Dec 27 '18 at 06:00
  • @amanb, nice idea. But I just checked and found that images are not on all reports; also, some images are common to different categories of report. For example, reports from the same laboratory have the same logo across all different categories of reports. – Gaurav Gandhi Dec 27 '18 at 06:02
  • Is it possible to share a dummy report? The important parts can be hidden. Another report for comparison could also be useful. If not the whole, just a part of it. – amanb Dec 27 '18 at 06:07
  • Please share a subset of sample reports. The solution to this problem is highly dependent upon the type of input images. – ZdaR Dec 27 '18 at 11:08
  • Yeah, ok. Will share in some time. – Gaurav Gandhi Dec 27 '18 at 12:34
  • @amanb, Have updated the question with link to sample reports. – Gaurav Gandhi Dec 28 '18 at 09:20
  • @ZdaR, Have updated the question with link to sample reports. – Gaurav Gandhi Dec 28 '18 at 09:20
  • *I have to classify which category it belongs to*: what are the categories for the given samples? – ZdaR Dec 28 '18 at 12:24
  • @ZdaR, each one is of a different category. Let's consider that each different type of report belongs to its own unique category. – Gaurav Gandhi Dec 28 '18 at 17:45

3 Answers


I'd recommend a few things:

Text-Based Comparison

OCR the documents and extract text features using Google's Tesseract, which is one of the best open-source OCR packages out there. There is also a Python wrapper for it called PyTesseract. You'll likely need to play with the resolution of your images for the OCR to work to your satisfaction; this will require some trial and error.
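For example, a minimal PyTesseract sketch (the 2x upscaling is just a starting guess to tune; the path is a placeholder):

import cv2
import pytesseract

img = cv2.imread('report.png', cv2.IMREAD_GRAYSCALE)
# Upscaling often helps Tesseract on low-resolution scans.
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
text = pytesseract.image_to_string(img)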

Once you have extracted the words, one commonly accepted approach is to calculate TF-IDF (Term Frequency - Inverse Document Frequency) vectors and then use a distance-based measure (cosine similarity is one of the common ones) to compare which documents are "similar" (closer) to each other.
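A minimal scikit-learn sketch, assuming each document has already been OCR'd into a string (the ocr_text_* variables are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [ocr_text_a, ocr_text_b, ocr_text_c]  # OCR output per document
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
sims = cosine_similarity(tfidf)  # sims[i][j] = similarity between docs i and j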

Image-Based Comparison

If you already have the images as vectors, then apply a distance-based measure to figure out similarity. Generally the L1 or L2 norm would work. This paper suggests that Manhattan distance (the L1 norm) might work better for natural images. You could start with that and try other distance-based measures.
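With NumPy, both norms are one-liners over the flattened image vectors (img_a and img_b are assumed to be same-shape grayscale arrays):

import numpy as np

a = img_a.ravel().astype(float)
b = img_b.ravel().astype(float)
l1 = np.linalg.norm(a - b, ord=1)  # Manhattan / L1 distance
l2 = np.linalg.norm(a - b)         # Euclidean / L2 distance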

Ensemble of Text- and Image-Based Comparisons

Run both approaches and then average the distances from the two approaches to arrive at the documents that are most similar to each other.

For example:

The text-based approach might rank DocB and DocC as the two documents closest to DocA, at distances of 10 and 20 units respectively.

The image-based approach might rank DocC and DocB as the two closest, at distances of 5 and 20 units respectively.

Then you can average the two distances: DocB would be (10+20)/2 = 15 and DocC would be (20+5)/2 = 12.5 units apart from DocA. So you'd treat DocC as closer to DocA than DocB in the ensembled approach.
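In code, the averaging is just (hypothetical distance dicts from the two approaches):

text_dist = {'DocB': 10, 'DocC': 20}   # text-based distances from DocA
image_dist = {'DocB': 20, 'DocC': 5}   # image-based distances from DocA
ensemble = {d: (text_dist[d] + image_dist[d]) / 2 for d in text_dist}
# ensemble == {'DocB': 15.0, 'DocC': 12.5} -> DocC is closest
nearest = min(ensemble, key=ensemble.get)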

HakunaMaData
  • I tried but didn't get much success with Text or Image based comparisons, will try Ensemble Text and Image Based Comparisons, thanks. – Gaurav Gandhi Dec 31 '18 at 09:48
  • Check how good your OCR is. OCR tools struggle with tabular data, and in your case you might have a lot of that. So focus on pre-processing whatever the OCR tool extracts for you into a reasonable set of features before running any models on it. – HakunaMaData Jan 01 '19 at 07:51

Measuring the similarity of documents from images is more complicated than measuring it from text, for two reasons.

  1. The images could be similar in terms of brightness, textual content, diagrams, or symbols.
  2. It is often harder to find a representation of a document from the images it contains than from its textual information.

Solution

My solution is to use machine learning to find a representation of a document and use this representation to classify the document. Here I will give a Keras implementation of the solution I propose.

Network type

I propose using convolutional layers for feature extraction, followed by recurrent layers for sequence classification. I've chosen Keras because of my familiarity with it, and it has a simple API for defining a network with a combination of convolutional and recurrent layers. But the code can easily be ported to other libraries such as PyTorch or TensorFlow.

Images pre-processing

There are many ways to pre-process document images for neural networks. I'm making the following assumptions:

  • Images contain horizontal text rather than vertical text.
  • The document image size is fixed. If the image size is not fixed, it can be resized using OpenCV's resize method (see the one-liner below).
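For the resizing, a one-liner with OpenCV (the 100x100 target matches the shape used below; the path is a placeholder):

import cv2
img = cv2.imread('report.png')
img = cv2.resize(img, (100, 100), interpolation=cv2.INTER_AREA)  # dsize is (width, height)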

Split the images vertically so that the lines are fed as a sequence (it is more efficient if the split can be made on an empty line). I will show this using NumPy for a single document. In the following implementation, I assume the image shape of a single document is (100, 100, 3). First, let's define image_shape, the shape of the document images:

import numpy as np

image_shape = (100, 100, 3)
split_size = 25  # must be a factor of image_shape[0] so the strips are equal
doc_images = []  # sequence of horizontal strips for one document
doc_image = np.zeros(image_shape)

# Split the document into equal strips of height split_size.
splitted_images = np.split(doc_image, image_shape[0] // split_size, axis=0)
doc_images.extend(splitted_images)
doc_images = np.array(doc_images)  # shape: (4, 25, 100, 3)

The network implementation

Keras has a ConvLSTM2D layer to deal with sequential images. The input to the network is a list of sequences of images produced by splitting the document images.

from keras.models import Sequential
from keras.layers import ConvLSTM2D, Dense, Flatten

num_classes = 10
model = Sequential()

# input_shape is (time steps, strip height, strip width, channels);
# None lets the number of strips per document vary.
model.add(ConvLSTM2D(32, (3, 3),
        input_shape=(None, split_size, image_shape[1], image_shape[2]),
        padding='same',
        return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=False))
model.add(Flatten())
model.add(Dense(1024, activation="relu"))
model.add(Dense(num_classes, activation="softmax"))

Ideally this model will work because it might learn a hierarchical representation (characters, words, sentences, contexts, symbols) of the document from its image.
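To make the sketch trainable end to end, a minimal compile/fit stub (the optimizer, loss, and dummy batch shapes are my assumptions, not part of the answer above):

from keras.utils import to_categorical
import numpy as np

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Dummy batch: 8 documents, each split into 4 strips of shape (25, 100, 3).
x = np.random.rand(8, 4, split_size, image_shape[1], image_shape[2])
y = to_categorical(np.random.randint(num_classes, size=8), num_classes)
model.fit(x, y, epochs=1, batch_size=2)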

Mitiku
  • I have not tried the RNN approach, but just CNN didn't work out well. I will give this a shot, thanks. – Gaurav Gandhi Dec 31 '18 at 09:47
  • Thanks for the detailed answer with a code sample. It's working better than the non-NN approach; still, a lot of improvements are needed, but for now the ship has sailed. Thanks. – Gaurav Gandhi Jan 01 '19 at 19:49

The sample documents vary greatly; it's impossible to compare them at the image level (histogram, np_avg).

The contents of the reports are multiple numeric results (min, max, recommended) or categorical results (Negative/Positive).

For each type of report, you'd have to do preprocessing.

If the documents' source is digital (not scanned), you do extraction and comparison of fields and rows, each row separately:

  • extract the field or row as an image part and compare it with a NN
  • extract to text and compare the values (OCR)

If the documents are scanned, you have to deal with image rotation, quality, and artifacts before extraction.

Each type of report is a problem on its own. Pick one type of report with multiple samples to start.

Since you are dealing with numbers, you'll only get good results with extraction to text and numbers. If a report says the value is 0.2 and the tolerated range is between 0.1 and 0.3, a NN is not the tool for that; you have to compare numbers (see the snippet below).

NNs are not the best tool for this, at least not for comparing values. Maybe for part of the extraction process.
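Once the values are extracted, the comparison itself is plain code, e.g. a hypothetical range check:

def interpret(value, low, high):
    # Business rule: flag a result outside its tolerated range.
    return 'within range' if low <= value <= high else 'out of range'

print(interpret(0.2, 0.1, 0.3))  # -> within range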

Steps to a solution:

  • automate the categorization of reports
  • for each type of report, mark the fields with data
  • for each type of report, automate the extraction of values
  • for each type of report, interpret the values according to business rules
dario