
I am vectorizing a text blob with tokens that have the following style:

hi__(how are you), 908__(number code), the__(POS)

As you can see, each token has some information attached in the form __(info). I am extracting keywords using TF-IDF as follows:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(doc)
# indices of the features sorted from highest to lowest idf
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()

The problem is that when I run the above procedure to extract keywords, I suspect the vectorizer is stripping the parentheses from my text blob. Which parameter of TfidfVectorizer can I use to preserve the information in the parentheses?
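
For illustration, here is a quick check of what the default settings do to the sample string (the one-element `doc` list below is a hypothetical stand-in):

from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["hi__(how are you), 908__(number code), the__(POS)"]

vectorizer = TfidfVectorizer()
vectorizer.fit_transform(doc)
print(vectorizer.get_feature_names())
# the parenthesized info is split off into separate tokens, e.g.
# ['908__', 'are', 'code', 'hi__', 'how', 'number', 'pos', 'the__', 'you']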

UPDATE

I also tried:

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)  

and

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None) 

However, this returns a sequence of characters instead of the tokens I have already tokenized:

['e', 's', '_', 'a', 't', 'o', 'c', 'r', 'i', 'n']
anon
  • `tokenizer=dummy_fun` results in a list of characters because the tokenizer needs to take in a string and return an iterable of tokens. Because `dummy_fun` returns a string, it's interpreted as an iterable of characters. Try `return doc.split()` instead. – acattle Jul 27 '18 at 11:28
  • Could you update your answer both with a regex and with a dummy method just for future reference to the community? @acattle – anon Jul 27 '18 at 11:36
  • 1
    Is `doc` a single document or a list of documents? `TfidfVectorizer.fit_transform()` expects [a list of documents](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform), not a single document. Maybe try `vectorizer.fit_transform([doc])`? – acattle Jul 27 '18 at 11:49
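
As requested in the comments, here is a minimal sketch of the dummy-tokenizer approach acattle describes: the tokenizer callable must return an iterable of tokens rather than a raw string, so either return `doc.split()` or pass in documents that are already lists of tokens. The pre-tokenized `docs` sample below is hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical sample: each document is already a list of tokens
docs = [["hi__(how are you)", "908__(number code)", "the__(POS)"]]

def dummy_fun(doc):
    # the document is already a list of tokens, so pass it through unchanged;
    # returning a plain string here makes the vectorizer iterate over its characters
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)

X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names())
# should keep the tokens intact, e.g.
# ['908__(number code)', 'hi__(how are you)', 'the__(POS)']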

1 Answer


The problem is that the default tokenization used by TfidfVectorizer explicitly ignores all punctuation:

token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

Your problem is related to this previous question, but instead of treating punctuation as separate tokens, you want to prevent token__(info) from being split. In both cases, the solution is to write a custom token_pattern, although the exact patterns are different.

Assuming every token already has __(info) attached:

vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+__\([\w\s]*\)')
X = vectorizer.fit_transform(doc)

I simply modified the default token_pattern so it now matches any 2 or more alphanumeric characters followed by __(, 0 or more alphanumeric or whitespace characters, and ending with a ). If you want more information on how to write your own token_pattern, see the Python doc for regular expressions.
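
For example, applying it to the sample string from the question (wrapped in a list, since `fit_transform` expects an iterable of documents) should keep the parenthesized info attached to each token; note that the tokens are lowercased by default:

from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["hi__(how are you), 908__(number code), the__(POS)"]

vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+__\([\w\s]*\)')
X = vectorizer.fit_transform(doc)
print(vectorizer.get_feature_names())
# ['908__(number code)', 'hi__(how are you)', 'the__(pos)']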

acattle
  • Thanks for the help!... for some reason I am getting: `ValueError: empty vocabulary; perhaps the documents only contain stop words` – anon Jul 27 '18 at 09:38
  • I updated with more info, thanks for the help again! – anon Jul 27 '18 at 09:52
  • I changed the regular expression for `\S+__\([^()]+\)` and still having issues! – anon Jul 27 '18 at 11:27
  • 1
    I fixed a small issue with my regex. I've verified the updated version tokenizes your strings properly however I can't reproduce your issue without knowing more about what `doc` contains. – acattle Jul 27 '18 at 11:52