-1

I have several tables that have different column names which are mapped through ETL. There are a total of around 200 tables and 500 attributes, so the set is not massive.

Some column mappings are as follows:

startDate EFT_DATE
startDate START_DATE
startDate entryDate 

As you can see the same column name can be mapped to different names across different tables.

I'm trying to solve the following problem :

Given two schemas I want to find matches between attribute names.

I was wondering if there is a way to leverage gensim to solve this problem similar to source-words from Google example. The challenge I'm facing is which dataset to use to train the model. Also I am wondering if there is another approach to solve the problem.

petezurich
  • 9,280
  • 9
  • 43
  • 57
lakssundar
  • 61
  • 6

1 Answers1

2

You can apply basic text analyzers to this by pre-processing each term.

  • Split by any non-alphabetic character.
    • e.g. EFT_DATE becomes [eft,date]
  • Split by camelCase.
    • e.g. startDate becomes [start,date]
  • Lowercase each term
  • Apply a fuzzy dictionary lookup to each token
    • e.g. startt -> start (typo detection..)
  • Apply stemming
    • e.g. starting -> start
  • Maybe apply a synonym conversion .
    • e.g. begin -> start

Optionally Sort the terms:

  • dateStarted -> [date,start]
  • startingDate -> [start,date] -> [date,start]

You can apply set distance operations now - which is O(^2). Given your moderate cardinality that is fine. If you had larger set of terms than scalable set-comparison approaches like the following can help reduce the complexity.

WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
  • thanks, this helps. I'm planning to have a function match(attr1, attr2) which returns a percentage match between 2 attributes after applying the above logic. Given two schemas, I will do a cartesian product of attributes and return only the attributes which have a cut-off beyond certain level. This will help in identifying possible attribute matches based on training already provided and new input could be added to training set to increase accuracy going forward. – lakssundar Jul 23 '18 at 00:31