Thanks in advance for your help. Briefly, I have been asked to help my organization with an accreditation process that repeats every 5 years. The document we need to compile is roughly 50 pages long (about 150 questions in total), so we would like to reuse as much of the content we produced in the previous round as possible.
The problem: the order and wording of the questions have changed in this latest round, but not completely (e.g., "Please describe your organization's commitment to diversity" vs. "What policies are in place to ensure organizational diversity?"). We therefore need a way to find out which questions from the old round map onto questions in the new round, at least approximately (they don't need to be a perfect match, just similar).
My thought was to build a bipartite network, with the old questions and the new questions as its two vertex sets. Each edge between an old question and a new question would be weighted by some measure of word overlap between the two questions or their answers.
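To make the idea concrete, here is a minimal Python sketch of roughly what I have in mind. It is only an illustration under my own assumptions: the function names (`tokens`, `jaccard`, `weighted_edges`, `best_old_match`), the small stopword list, and the choice of Jaccard similarity as the overlap measure are all hypothetical, and the second question pair is made up for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of",
             "please", "the", "to", "what", "your"}

def tokens(text):
    """Return the set of lowercase words in text, minus stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Word overlap between two strings: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def weighted_edges(old_questions, new_questions, min_weight=0.0):
    """Edges (old_index, new_index, weight) of the bipartite network."""
    edges = []
    for i, old in enumerate(old_questions):
        for j, new in enumerate(new_questions):
            w = jaccard(old, new)
            if w > min_weight:
                edges.append((i, j, w))
    return edges

def best_old_match(edges):
    """For each new question, keep the old question with the heaviest edge."""
    best = {}
    for i, j, w in edges:
        if j not in best or w > best[j][1]:
            best[j] = (i, w)
    return best

old_questions = [
    "Please describe your organization's commitment to diversity",
    "Describe your financial audit procedures",  # made-up example
]
new_questions = [
    "What are your financial audit procedures?",  # made-up example
    "What policies are in place to ensure organizational diversity?",
]

matches = best_old_match(weighted_edges(old_questions, new_questions))
for j, (i, w) in sorted(matches.items()):
    print(f"new #{j} <- old #{i} (weight {w:.3f})")
```

With plain word overlap, "organization's" and "organizational" count as different words, so the diversity pair only shares the word "diversity" and gets a low weight. I suspect some form of stemming or a better similarity measure would be needed for this to work well.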
Does anyone know how to start to tackle this problem?
Again, thank you. Any help you can offer will likely save us hours of time.
PS - I am totally open to alternative solutions too. In case it helps, a picture of how I initially thought about modelling the problem is below.