Thanks in advance for your help. Briefly, I have been asked to help my organization in an accreditation process that repeats every 5 years. The document we need to compile is roughly 50 pages long (150 or so questions, total), so we would like to reuse as much of the content we produced in our last round as possible.

The problem: The order and wording of the questions changed in this latest round, but not completely (e.g., "Please describe your organization's commitment to diversity" vs. "What policies are in place to ensure organizational diversity?"). Thus, we need a way to find which questions from the old round map onto questions in the new round, at least approximately (they don't need to be a perfect match, just similar).

My thought was to establish a bipartite network, with old questions and new questions as the vertex sets of the network. Edges would be weighted by some measure of word overlap in their questions or answers.
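For concreteness, here is a rough sketch of that bipartite formulation, assuming the networkx library is available; Jaccard word overlap is just one simple choice of edge weight, and the question strings are the example pair from above.

import re
import networkx as nx

old_qs = ["Please describe your organization's commitment to diversity"]
new_qs = ["What policies are in place to ensure organizational diversity?"]

def words(text):
    # Lowercase and strip punctuation so "diversity?" matches "diversity"
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    # Fraction of shared words between two questions (0 = disjoint, 1 = identical)
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

# Old and new questions form the two vertex sets of the bipartite graph
G = nx.Graph()
G.add_nodes_from((("old", i) for i in range(len(old_qs))), bipartite=0)
G.add_nodes_from((("new", j) for j in range(len(new_qs))), bipartite=1)
for i, oq in enumerate(old_qs):
    for j, nq in enumerate(new_qs):
        G.add_edge(("old", i), ("new", j), weight=jaccard(oq, nq))

# A maximum-weight matching then pairs each old question with at most one new one
print(nx.max_weight_matching(G))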

Does anyone know how to start to tackle this problem?

Again, thank you, any help you offer will likely save hours of time.

PS - I am totally open to alternative solutions too. In case it helps, a picture of how I initially thought about modelling the problem is below.

[Figure: an example solution]

Ian Cero
2 Answers


First thought: for 50 pages of work, you might save more time by just doing the matching by hand.

But if you have a good data scientist on your team, you can try gensim. A current standard technique for comparing two phrases is word embeddings: words are converted to high-dimensional vectors (typically 200 to 1,000 dimensions) by training on millions of documents.

For example, if your query string is "Human computer interaction", you would be looking for ranked output like this:

[(2, 0.99844527),   # The EPS user interface management system
 (0, 0.99809301),   # Human machine interface for lab abc computer applications
 (3, 0.9865886),    # System and human system engineering testing of EPS
 (1, 0.93748635),   # A survey of user opinion of computer system response time
 (4, 0.90755945),   # Relation of user perceived response time to error measurement
 (8, 0.050041795),  # Graph minors A survey
 (7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
 (6, -0.1063926),   # The intersection graph of paths in trees
 (5, -0.12416792)]  # The generation of random binary unordered trees

from: https://radimrehurek.com/gensim/tut3.html
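For reference, here is a minimal sketch of the LSI similarity workflow that tutorial walks through, applied to the question-matching problem; the question strings below are made-up placeholders.

from gensim import corpora, models, similarities

# Hypothetical data: questions from the previous accreditation round
old_questions = [
    "Please describe your organization's commitment to diversity",
    "How does your organization evaluate staff performance",
]

# Tokenize and build a dictionary / bag-of-words corpus over the old questions
texts = [q.lower().split() for q in old_questions]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Train a small LSI model and index the old questions
# (num_topics is tiny here only because the toy corpus is tiny)
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

# Rank the old questions by cosine similarity to one new question
new_q = "What policies are in place to ensure organizational diversity?"
vec = lsi[dictionary.doc2bow(new_q.lower().split())]
sims = sorted(enumerate(index[vec]), key=lambda item: -item[1])
print(sims)  # highest-scoring index is the best-matching old question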

aerin

Bit of an outline, but the overall steps for a quick solution are:

1. Convert your words to a format more suitable for machine processing with a stemming tool like http://www.nltk.org/api/nltk.stem.html
2. Follow the steps outlined in "Similarity between two text documents" to calculate the tf-idf similarity.
3. Use np.argsort() to extract the most similar items.
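A minimal sketch of those three steps, assuming NLTK for stemming and scikit-learn for the tf-idf and cosine-similarity parts; the question lists are placeholders taken from the example in the question.

import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

old_questions = ["Please describe your organization's commitment to diversity"]
new_questions = ["What policies are in place to ensure organizational diversity?"]

stemmer = PorterStemmer()

def stem(text):
    # Step 1: reduce words to stems so reworded questions still share terms
    return " ".join(stemmer.stem(w) for w in text.lower().split())

# Step 2: tf-idf vectors over both question sets, then pairwise cosine similarity
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([stem(q) for q in old_questions + new_questions])
old_vecs = tfidf[: len(old_questions)]
new_vecs = tfidf[len(old_questions):]
sim = cosine_similarity(new_vecs, old_vecs)  # shape: (n_new, n_old)

# Step 3: for each new question, old questions ordered from most to least similar
ranking = np.argsort(-sim, axis=1)
for i, new_q in enumerate(new_questions):
    print(new_q, "->", old_questions[ranking[i, 0]])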

Sohier Dane