
I want to tokenize an input file in Python. Please advise; I am new to Python.

I have read a little about regular expressions, but I am still confused, so please suggest a link or a code overview for tokenizing.
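For a sense of what a bare regular-expression tokenizer looks like, here is a minimal sketch; the \w+ pattern and the file name myfile.txt are illustrative assumptions, and the answers below show more robust options:

import re

with open('myfile.txt') as fin:
    text = fin.read()

# \w+ matches runs of letters, digits and underscores, so whitespace
# and punctuation act as separators.
tokens = re.findall(r'\w+', text)
print(tokens)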

Target
  • What do you want to tokenize? Do you need to create a generic tokenizer? Or do you need a tokenizer/parser for a specific (programming) language? – Hans Then Oct 03 '12 at 08:02

2 Answers


Try something like this:

import nltk
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)

The NLTK tutorial is also full of easy-to-follow examples: https://www.nltk.org/book/ch03.html

Yanek
  • The OP may not want to tokenize natural text, but source code in a formal language. nltk is for parsing natural languages. For formal languages you can use ply; I use it extensively for building custom compilers. http://www.dabeaz.com/ply/ With ply you can also parse and compile into an abstract syntax tree. – nagylzs Oct 03 '12 at 07:44
  • @nagylzs - The question is tagged "nltk". – del Oct 03 '12 at 07:50
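For reference, a minimal ply lexer along the lines suggested in the comment above might look like the sketch below; the token names and the sample input are illustrative assumptions, not part of the original comment:

import ply.lex as lex

# Token names ply will produce; this tiny arithmetic set is illustrative.
tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'      # simple tokens are defined by regex strings
t_ignore = ' \t'    # characters skipped between tokens

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)   # convert the matched text to an int
    return t

def t_error(t):
    t.lexer.skip(1)          # skip characters that match no rule

lexer = lex.lex()
lexer.input('3 + 14')
for tok in lexer:
    print(tok.type, tok.value)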

Using NLTK

If your file is small:

  • Open the file with the context manager with open(...) as x,
  • then do a .read() and tokenize it with word_tokenize()

[code]:

from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:

  • Open the file with the context manager with open(...) as x,
  • read the file line by line with a for-loop
  • tokenize the line with word_tokenize()
  • write the output in your desired format (to a file opened with the 'w' flag)

[code]:

from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)

Using spaCy

from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = tokenizer(line)  # the Tokenizer is callable and returns a Doc
        print(' '.join(token.text for token in tokens), end='\n', file=fout)
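Note that a Tokenizer built from just nlp.vocab has no prefix/suffix/infix rules, so it effectively splits on whitespace only. If you want spaCy's full English tokenization rules, one option (a sketch, assuming the same placeholder file names) is to call the blank pipeline directly:

[code]:

from spacy.lang.en import English

nlp = English()  # blank pipeline; its default tokenizer includes the English rules
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        doc = nlp(line)  # calling the pipeline runs its tokenizer
        print(' '.join(token.text for token in doc), file=fout)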
alvas