
What are some popular methods to find and replace concatenated words such as:

brokenleg -> (broken,leg)

The method should run on thousands of lines without knowing in advance whether they contain concatenated words.

I'm using the SpaCy library for most of my string handling, so the best method would be one that works well with SpaCy.

David Batista
Latent
  • What's your data, i.e., what kind of words do you have? Only English? Do all the words have sub-words/sub-concepts, or just a few of them? Have you thought about using a dictionary and checking whether each word matches words in it? – David Batista Mar 07 '19 at 13:11
  • All in English. They are short free texts written by people in a hurry (mostly on mobile). Most of the sentences don't contain concatenated words, but I need to handle the ones that do properly. I thought about using a dictionary, but I'm not sure how to stop the program from segmenting legitimate words just because sub-words are available, like "bookkeeper" into (book, keeper). Also, since it should run on >300k sentences, I'm looking for an implementation with good running time, if one is available. – Latent Mar 07 '19 at 13:17
  • Have you tried something? Try the dictionary approach using a Trie as the data structure. – David Batista Mar 07 '19 at 14:24
  • There is a solution on github... – amirouche Mar 09 '19 at 02:09
  • Also check out the answers here: https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words – Sofie VL Mar 11 '19 at 09:25
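
For reference, the dictionary-plus-Trie idea from the comments could look roughly like the sketch below. This is only an illustration under stated assumptions, not a tested solution for the data in question: the names `Trie`, `split_concatenated`, `fix_line` and the toy `VOCAB` are hypothetical, and a real run would need a proper English word list. To avoid breaking legitimate words such as "bookkeeper", a token is only split when it is not itself in the dictionary, and among all possible splits the one with the fewest pieces is kept.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def prefix_ends(self, text, start):
        """Yield every index e such that text[start:e] is a dictionary word."""
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.is_word:
                yield i + 1


def split_concatenated(token, trie, min_len=2):
    """Split a token into dictionary words; return [token] if it is
    already a known word or cannot be fully covered by known words."""
    word = token.lower()  # assumes English text, where lowercasing keeps the length
    if word in trie:
        return [token]  # legitimate word, e.g. "bookkeeper": leave it alone
    n = len(word)
    # best[i] holds the split of word[:i] with the fewest pieces, or None
    best = [None] * (n + 1)
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue
        for end in trie.prefix_ends(word, i):
            if end - i < min_len:  # skip very short pieces such as "a"
                continue
            candidate = best[i] + [token[i:end]]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n] if best[n] else [token]


def fix_line(line, trie):
    """Apply the splitter to every whitespace-separated token of a line."""
    return " ".join(" ".join(split_concatenated(t, trie)) for t in line.split())


# Toy vocabulary for illustration only; replace with a real word list.
VOCAB = ["broken", "broke", "leg", "book", "keeper", "bookkeeper", "my", "the", "has"]
trie = Trie(VOCAB)
print(fix_line("the bookkeeper has my brokenleg", trie))
```

Each Trie walk stops as soon as no dictionary word can continue, so the cost per token stays small even over hundreds of thousands of sentences. Whitespace splitting is used here only to keep the sketch self-contained; the same `split_concatenated` function could instead be applied to `token.text` for each token of a SpaCy `Doc`.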

0 Answers