
I have some JS functions that tokenize my strings using Wink Tokenizer.

I'm moving some services to Python and would now like an equivalent tokenizer function. I've researched a lot, and it seems the Wink tokenizer is only available for JS. I'm also not aware of the subtle differences between Wink and Python tokenizers such as spaCy.

Basically I would like to be able to get the same results as:

var tokenizer = require( 'wink-tokenizer' );
// Create its instance.
var myTokenizer = tokenizer();
 
// Tokenize a tweet.
var s = '@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party tom at 3pm:) #fun';
myTokenizer.tokenize( s );

in Python.

Can anyone help me out by pointing me in the right direction for replicating the tokenization functions Wink offers in Python? What parameters, configs, and regexes do I have to check to get equivalent behaviour?

Pablo Estrada

1 Answer


There are many ways. Python has a rich data science community and many NLP packages. Here is a reasonable list of easy-to-implement ways to tokenize text:

https://towardsdatascience.com/5-simple-ways-to-tokenize-text-in-python-92c6804edfc4

I personally use https://github.com/stanfordnlp/stanza

All of these resources were on the first page of Google results for "python" "tokenization".
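For illustration, the simplest of those approaches need nothing beyond the standard library. A quick sketch (generic patterns, not tuned to tweets or to Wink's behaviour):

```python
import re

text = "2 of us plan party tom at 3pm:) #fun"

# Plain whitespace splitting: fast, but punctuation stays glued to words.
print(text.split())
# ['2', 'of', 'us', 'plan', 'party', 'tom', 'at', '3pm:)', '#fun']

# A small regex: runs of word characters, or runs of other non-space symbols.
print(re.findall(r"\w+|[^\w\s]+", text))
# ['2', 'of', 'us', 'plan', 'party', 'tom', 'at', '3pm', ':)', '#', 'fun']
```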

jorf.brunning
  • Thanks for the resources. The thing is, I'm specifically interested in the Wink Tokenizer behaviour and the differences between Wink and the Python implementations – Pablo Estrada Jan 21 '22 at 20:04
  • I did a cursory look at the requires/imports in the wink tokenizer, and it seems to just be two files, one of which is a list of regex patterns. You could rewrite the project in python, or update your code to use different tokenization -- hard to say which is easier. – jorf.brunning Jan 26 '22 at 20:21
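A port along those lines could start from a small ordered list of tagged regexes, loosely modeled on wink-tokenizer's token categories (mention, hashtag, email, time, emoticon, number, word, punctuation). This is a rough sketch: the patterns below are simplified approximations I wrote for illustration, not Wink's actual regexes, which would need to be copied from its source file for identical output.

```python
import re

# Ordered (tag, pattern) pairs; earlier entries win at the same position,
# so specific patterns (email, time) must precede generic ones (word).
TOKEN_PATTERNS = [
    ("mention",     r"@\w+"),
    ("hashtag",     r"#\w+"),
    ("email",       r"[\w.+-]+@[\w-]+\.[\w.]+"),
    ("time",        r"\d{1,2}(?::\d{2})?\s?(?:am|pm)"),
    ("emoticon",    r"[:;][-']?[)(DPp]"),
    ("number",      r"\d+(?:\.\d+)?"),
    ("word",        r"[A-Za-z]+(?:'[A-Za-z]+)?"),
    ("punctuation", r"[^\w\s]"),
]

# One master regex with a named group per tag.
MASTER = re.compile(
    "|".join(f"(?P<{tag}>{pat})" for tag, pat in TOKEN_PATTERNS),
    re.IGNORECASE,
)

def tokenize(text):
    """Return wink-style token dicts: {'value': ..., 'tag': ...}."""
    return [
        {"value": m.group(), "tag": m.lastgroup}
        for m in MASTER.finditer(text)
    ]

s = "@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party tom at 3pm:) #fun"
for tok in tokenize(s):
    print(tok)  # e.g. {'value': '@superman', 'tag': 'mention'}
```

To match Wink exactly, the regexes in this list would have to be replaced one-for-one with the patterns from wink-tokenizer's regex file, keeping the same ordering.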