
Folks,

I have been using the wordsegment Python library by Grant Jenks for the past couple of hours. The library works fine for rejoining broken words or splitting combined words, e.g. e nd ==> end and thisisacat ==> this is a cat.

I am working on textual data that involves numbers as well, and using the library on that data has the reverse effect. The perfectly fine text increased $55 million or 23.8% for converts to something very weird like increased 55millionor238 for (after performing a join on the returned list). Note that this happens unpredictably (it may or may not occur) for any part of the text that involves numbers.
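
For reference, the usage pattern is roughly the following (a minimal sketch; recent versions of wordsegment require calling load() before segment()):

import wordsegment

wordsegment.load()  # required before segment() in recent versions

text = 'increased $55 million or 23.8% for'
print(' '.join(wordsegment.segment(text)))
# sometimes comes back as: 'increased 55millionor238 for'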

  • Has anybody worked with this library before?
  • If yes, have you faced a similar situation and found a workaround?
  • If not, do you know of any other Python library that does this trick for us?

Thank you.

Saurabh Gokhale

2 Answers


Looking at the code, the segment function first runs clean, which removes all non-alphanumeric characters; it then searches for known unigrams and bigrams within the text clump and scores the words it finds based on their frequency of occurrence in English.

'increased $55 million or 23.8% for'

becomes

'increased55millionor238for'

When searching for sub-terms, it finds 'increased' and 'for', but the score for the unknown clump '55millionor238' turns out to be better than the score for breaking it up any further.
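
You can see that first clean step directly (assuming the module-level clean helper exposed by recent wordsegment versions):

import wordsegment

print(wordsegment.clean('increased $55 million or 23.8% for'))
# increased55millionor238for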

It seems to do better when the unknown text elements are small. You could substitute out the non-alphabetic character sequences, run the result through segment, and then substitute the originals back in.

import re
from random import choices

import wordsegment

wordsegment.load()  # required before segment() in recent versions

s = 'increased $55 million or 23.8% for'

CONS = 'bdghjklmpqvwxz'

def sub_map(s, mapping):
    out = s
    for k, v in mapping.items():
        out = out.replace(k, v)
    return out

# map each numeric/currency run to a short random consonant placeholder
mapping = {m.group(): ''.join(choices(CONS, k=3))
           for m in re.finditer(r'[0-9\.,$%]+', s)}
revmap = {v: k for k, v in mapping.items()}

word_list = wordsegment.segment(sub_map(s, mapping))
word_list = [revmap.get(w, w) for w in word_list]
word_list
# returns:
# ['increased', '$55', 'million', 'or', '23.8%', 'for']
James
  • Thanks for your answer. I will try this. Anyway, I've also opened an issue on the wordsegment GitHub issue page: https://github.com/grantjenks/python-wordsegment/issues/20 – Saurabh Gokhale Nov 30 '18 at 04:43

There are implementations in Ruby and Python at "Need help understanding this Python Viterbi algorithm".

The algorithm (and those implementations) are pretty straightforward, and copy & paste may be better than using a library because, in my experience, this problem almost always needs some customisation to fit the data at hand (i.e. language, specific topics, custom entities, date or currency formats).
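
To illustrate the idea, here is a minimal sketch of the dynamic-programming core those implementations share, using a toy unigram table (all names and counts here are illustrative; the real versions derive costs from a large corpus):

from math import log

# toy unigram counts; a real implementation uses corpus-derived frequencies
TOY_COUNTS = {'this': 500, 'is': 800, 'a': 1000, 'cat': 50,
              'increased': 30, 'million': 40, 'or': 600, 'for': 700}
TOTAL = sum(TOY_COUNTS.values())

def word_cost(word):
    # negative log probability; unseen words pay a length-based penalty
    if word in TOY_COUNTS:
        return -log(TOY_COUNTS[word] / TOTAL)
    return 10.0 + 2.0 * len(word)

def segment_dp(text, max_word_len=20):
    # best[i] = (cost, split point) for the prefix text[:i]
    best = [(0.0, 0)] + [(float('inf'), 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word_len), i):
            cost = best[j][0] + word_cost(text[j:i])
            if cost < best[i][0]:
                best[i] = (cost, j)
    # follow the back-pointers to recover the chosen words
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(segment_dp('thisisacat'))  # ['this', 'is', 'a', 'cat']

Because each state depends only on the best cost of a shorter prefix, this is the same left-to-right pass as in the linked answers, and word_cost is the natural place to plug in custom handling for numbers, currencies, or domain terms.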

Matthias Winkelmann