What is the correct way to use gensim's Phrases and preprocess_string together? I am doing it this way, but it feels a little contrived.

from gensim.models.phrases import Phrases
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_tags
from gensim.parsing.preprocessing import strip_short
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_numeric
import re
from gensim import utils

# removed "_" from regular expression
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^`{|}~"""

RE_PUNCT = re.compile(r'([%s])+' % re.escape(punctuation), re.UNICODE)
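A quick sanity check (pure `re`, no gensim needed) confirms that dropping `_` from the punctuation set keeps bigram tokens like `new_york` intact:

```python
import re

# same pattern as above: "_" removed from the punctuation set
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^`{|}~"""
RE_PUNCT = re.compile(r'([%s])+' % re.escape(punctuation), re.UNICODE)

# the underscore survives, while "," and "!" become spaces
print(RE_PUNCT.sub(" ", "new_york, mayor!"))  # -> "new_york  mayor "
```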


def strip_punctuation(s):
    """Replace punctuation characters with spaces in `s` using :const:`~gensim.parsing.preprocessing.RE_PUNCT`.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without punctuation characters.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_punctuation
    >>> strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
    u'A semicolon is a stronger break than a comma  but not as much as a full stop '

    """
    s = utils.to_unicode(s)
    return RE_PUNCT.sub(" ", s)



my_filter = [
    lambda x: x.lower(), strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric,
    remove_stopwords, strip_short, stem_text
]


documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]

sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2)
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
test = " ".join(bigram[sent])


print(preprocess_string(test))
print(preprocess_string(test, filters=my_filter))

The result is:

['mayor', 'new', 'york']
['mayor', 'new_york'] #correct
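For context, the promotion of "new york" to "new_york" can be checked by hand. Assuming gensim's documented default scorer (`original_scorer`), a bigram is kept when `(bigram_count - min_count) * vocab_size / (count_a * count_b)` is at least `threshold`:

```python
# minimal sketch of gensim's default bigram scorer (original_scorer),
# assuming score = (bigram_count - min_count) * vocab_size / (count_a * count_b)
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# counts from the three example documents: "new york" occurs twice,
# "new" and "york" occur twice each, and the vocabulary has 14 unique words
score = original_scorer(worda_count=2, wordb_count=2, bigram_count=2,
                        len_vocab=14, min_count=1)
print(score)  # -> 3.5, above threshold=2, so "new_york" is promoted
```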

Part of the code was taken from: How to extract phrases from corpus using gensim

sophros
carlos
  • `test = " ".join(bigram[sent])` is fine. It has something to do with `preprocess_string(test)`. Try removing it, or use some string methods. – explorer Aug 24 '18 at 12:37

1 Answer


I would recommend using gensim.utils.tokenize() instead of gensim.parsing.preprocessing.preprocess_string() for your example.

In many cases tokenize() does a very good job as it will only return sequences of alphabetic characters (no digits). This saves you the extra cleaning steps for punctuation etc.
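That behavior boils down to a regex scan for runs of word characters that do not start with a digit; a rough stdlib approximation (the pattern here is an assumption modeled on gensim's alphabetic-run matching, not gensim itself) shows why digits and punctuation drop out:

```python
import re

# rough approximation of gensim.utils.tokenize(): runs of word characters
# not starting with a digit, so "2018" is dropped but "new_york" is kept whole
PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)

def simple_tokenize(text, lower=True):
    if lower:
        text = text.lower()
    return [m.group() for m in PAT_ALPHABETIC.finditer(text)]

print(simple_tokenize("The Mayor of New York was there in 2018!"))
# -> ['the', 'mayor', 'of', 'new', 'york', 'was', 'there', 'in']
```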

However, tokenize() does not remove stopwords or short tokens, nor does it do stemming. This has to be customized anyway if you are dealing with languages other than English.

Here is some code for your (already clean) example documents which gives you the desired bigrams.

documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]

import gensim, pprint

# tokenize documents with gensim's tokenize() function
tokens = [list(gensim.utils.tokenize(doc, lower=True)) for doc in documents]

# build bigram model
bigram_mdl = gensim.models.phrases.Phrases(tokens, min_count=1, threshold=2)

# do more pre-processing on tokens (remove stopwords, stemming etc.)
# NOTE: this can be done better
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, stem_text
CUSTOM_FILTERS = [remove_stopwords, stem_text]
tokens = [preprocess_string(" ".join(doc), CUSTOM_FILTERS) for doc in tokens]

# apply bigram model on tokens
bigrams = bigram_mdl[tokens]

pprint.pprint(list(bigrams))

Output:

[['mayor', 'new_york'],
 ['machin', 'learn', 'us'],
 ['new_york', 'mayor', 'present']]
goerlitz