We have quite a lot of text (mostly written in English) which was incorrectly imported (from a source we have no control over). For example, we would like to split:
1. "configuredincorrectly" - into the 2 words "configured" & "incorrectly"
2. "RegardsJohn Doe" - into a word "Regards" and a named entity "John Doe"
3. "To: person1@example.comCC:addr2@example.co.ukBCC:person3@example.sg" - into 3 tuples (To,person1@example.com), (CC,addr2@example.co.uk), (BCC,person3@example.sg)
4. "problem.Possible" - into the 2 words "problem" & "possible"
I acknowledge that we are trying to address multiple problems here. It is tempting to write non-scalable code such as
- regular expressions each time we try to solve a particular dirty text scenario,
- string.replace(keyword,keywordwithSpace)
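To make the "non-scalable" point concrete, here is a rough sketch of the kind of one-off helpers we would end up writing. The function names, the regular expressions and the hard-coded header labels are all made up for illustration; each one only handles the single scenario it was written for.

```python
import re

# Hypothetical one-off fix: insert a space after a period that is
# squeezed between a lowercase and an uppercase letter.
def split_after_period(text):
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)

# Hypothetical one-off fix: pull (label, address) pairs out of a fused
# header line. The labels To/CC/BCC are hard-coded assumptions.
def parse_headers(text):
    pattern = r"(BCC|CC|To):\s*(.+?)(?=(?:BCC|CC|To):|$)"
    return re.findall(pattern, text)

# Hypothetical one-off fix: one replace() call per known fused keyword.
def fix_known_keyword(text):
    return text.replace("hardwarefailure", "hardware failure")

print(split_after_period("problem.Possible"))
print(parse_headers("To: person1@example.comCC:addr2@example.co.ukBCC:person3@example.sg"))
print(fix_known_keyword("a hardwarefailure occurred"))
```

Every new kind of dirty text needs yet another helper like these, which is exactly what we would like to avoid.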
Could anyone please point me towards a (partial) solution for problems 1 & 2?
A solution that makes use of natural language understanding would be ideal.
We have ~1000 words in our vocabulary, such as [communication, database, hardware, network, problem, rectify, solution, etc.]. Is there a way we can "train" a model to recognize that a token like "hardwarefailure" really means the 2 separate words "hardware" & "failure"?
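To show the sort of behaviour we are after, here is a minimal sketch that splits a fused token using only a vocabulary lookup. It is plain dynamic programming, not a trained model, and it cannot handle names such as "John Doe" that are outside the vocabulary. The function name segment and the tiny vocabulary below are assumptions for illustration; our real vocabulary would be the ~1000 words mentioned above.

```python
def segment(token, vocabulary):
    """Split token into vocabulary words; return None if no split exists."""
    token = token.lower()
    # best[i] holds the shortest known segmentation of token[:i], or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                # Prefer the segmentation with the fewest words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]

# Illustrative vocabulary only (an assumption, not our real word list).
vocabulary = {"hardware", "failure", "hard", "ware", "configured", "incorrectly"}

print(segment("hardwarefailure", vocabulary))
print(segment("configuredincorrectly", vocabulary))
print(segment("xyz", vocabulary))
```

Preferring the fewest words is just one possible tie-breaker; a statistical approach would presumably score candidate splits by word frequency instead.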
Many thanks in advance!