
I want to tokenize an input file in Python. Please advise; I am new to Python.

I have read a little about regular expressions, but I am still confused, so please suggest a link or a code overview for tokenizing.
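For a sense of what a bare regular-expression tokenizer looks like, here is a minimal sketch; the \w+ pattern and the file name myfile.txt are illustrative assumptions, and the answers below show more robust options:

import re

with open('myfile.txt') as fin:
    text = fin.read()

# \w+ matches runs of letters, digits and underscores, so whitespace
# and punctuation act as separators.
tokens = re.findall(r'\w+', text)
print(tokens)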

Target
  • What do you want to tokenize? Do you need to create a generic tokenizer? Or do you need a tokenizer/parser for a specific (programming) language? – Hans Then Oct 03 '12 at 08:02

2 Answers


Try something like this:

import nltk
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)

The NLTK tutorial is also full of easy-to-follow examples: https://www.nltk.org/book/ch03.html

Yanek
  • The OP may not want to tokenize natural text, but source code in a formal language. nltk is for parsing natural languages. For formal languages you can use ply; I use it extensively for building custom compilers. http://www.dabeaz.com/ply/ With ply you can also parse and compile into an abstract syntax tree. – nagylzs Oct 03 '12 at 07:44
  • @nagylzs - The question is tagged "nltk". – del Oct 03 '12 at 07:50
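For reference, a minimal ply lexer along the lines suggested in the comment above might look like the sketch below; the token names and the sample input are illustrative assumptions, not part of the original comment:

import ply.lex as lex

# Token names ply will produce; this tiny arithmetic set is illustrative.
tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'      # simple tokens are defined by regex strings
t_ignore = ' \t'    # characters skipped between tokens

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)   # convert the matched text to an int
    return t

def t_error(t):
    t.lexer.skip(1)          # skip characters that match no rule

lexer = lex.lex()
lexer.input('3 + 14')
for tok in lexer:
    print(tok.type, tok.value)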

Using NLTK

If your file is small:

  • Open the file with the context manager with open(...) as x,
  • then do a .read() and tokenize it with word_tokenize()

[code]:

from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:

  • Open the file with the context manager with open(...) as x,
  • read the file line by line with a for-loop
  • tokenize the line with word_tokenize()
  • write the output in your desired format (to a file opened with the 'w' flag)

[code]:

from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)

Using spaCy

from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = tokenizer(line)  # the Tokenizer is callable and returns a Doc
        print(' '.join(token.text for token in tokens), end='\n', file=fout)
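Note that a Tokenizer built from just nlp.vocab has no prefix/suffix/infix rules, so it effectively splits on whitespace only. If you want spaCy's full English tokenization rules, one option (a sketch, assuming the same placeholder file names) is to call the blank pipeline directly:

[code]:

from spacy.lang.en import English

nlp = English()  # blank pipeline; its default tokenizer includes the English rules
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        doc = nlp(line)  # calling the pipeline runs its tokenizer
        print(' '.join(token.text for token in doc), file=fout)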
alvas