
I used tokenizer = RegexpTokenizer(r'\w+'), which retains alphanumeric characters. But how do I write a regular expression that drops everything else and keeps only tokens longer than 2 characters?

Below is one row of the dataframe, which contains random text:

0 [ANOTHER 2'' F/P SAMPLE 01:52 ...A13232 / AS OUTPUT MSG...

Wiktor Stribiżew
Hackerds

1 Answer


I think you need to match words with a length greater than 2:

RegexpTokenizer(r'\w{3,}')

Or, if you need only letters:

RegexpTokenizer(r'[a-zA-Z]{3,}')
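
To see how the two patterns differ, here is a minimal sketch that runs them with the standard-library re module rather than NLTK. It assumes RegexpTokenizer is used in its default mode (gaps=False), where the pattern describes the tokens themselves, so re.findall should give the same tokens for these patterns. The sample string is only the visible part of the row from the question; the variable names are made up for illustration.

```python
import re

# Visible part of the sample row from the question (the rest is truncated there).
text = "ANOTHER 2'' F/P SAMPLE 01:52 ...A13232 / AS OUTPUT MSG"

# Runs of 3 or more word characters (letters, digits, underscore).
word_tokens = re.findall(r'\w{3,}', text)

# Runs of 3 or more ASCII letters only.
letter_tokens = re.findall(r'[a-zA-Z]{3,}', text)

print(word_tokens)    # ['ANOTHER', 'SAMPLE', 'A13232', 'OUTPUT', 'MSG']
print(letter_tokens)  # ['ANOTHER', 'SAMPLE', 'OUTPUT', 'MSG']
```

Note that the letters-only pattern drops A13232 entirely, because it contains no run of three letters. On mixed tokens such as ab123cde it would keep only the fragment cde, so \w{3,} is probably the safer choice if codes like A13232 matter.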
jezrael