I have this Python script where I am using the nltk library to parse, tokenize, tag and chunk some (let's say random) text from the web.
I need to format and write to a file the output of chunked1, chunked2 and chunked3, which are of type nltk.tree.Tree.
More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2 and chunkGram3.
How can I do that?
#! /usr/bin/python2.7
import nltk
import re
import codecs
xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)
        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)
        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)
        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i, line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)

processLanguage()
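For clarity, this is roughly what I am trying to achieve with the commented-out block above, written out as a rough, untested sketch (it would go inside the for loop; I am assuming NLTK 3's Tree API here, i.e. subtrees() and label(), while older NLTK versions use .node instead):

# rough sketch (untested): keep only the pieces that chunkGram1 actually matched,
# i.e. the subtrees labelled 'Chunk', and write them out as plain text
with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
    for subtree in chunked1.subtrees():
        if subtree.label() == 'Chunk':
            # subtree.leaves() is a list of (word, tag) tuples
            outfile.write(u' '.join(word for word, tag in subtree.leaves()) + u'\n')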
For the time being, when I try to run it I get this error:
Traceback (most recent call last):
File "sentdex.py", line 47, in <module>
processLanguage()
File "sentdex.py", line 40, in processLanguage
outfile.write(line)
File "C:\Python27\lib\codecs.py", line 688, in write
return self.writer.write(data)
File "C:\Python27\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found
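From the traceback I suspect that iterating over chunked1 yields a mix of plain (word, tag) tuples and nltk.tree.Tree objects rather than strings, and that is what codecs refuses to write. A quick check like the following (sketch) should confirm it:

for line in chunked1:
    print type(line)   # <type 'tuple'> for unchunked tagged words, nltk.tree.Tree for chunks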
edit: After @Alvas's answer I managed to do what I wanted. However, now I would like to know how I could strip all non-ASCII characters from a text corpus. Example:
#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged

processLanguage()
The above is taken from another answer here on S/O. However, it doesn't seem to work. What might be wrong? The error I am getting is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
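Would reading the file as Unicode up front make a difference? Something along these lines is what I had in mind (rough sketch, assuming the source file is actually UTF-8 encoded):

# rough sketch: read the corpus as Unicode instead of raw byte strings (assumes UTF-8 input),
# then strip the non-ASCII characters with the same helper as above
with codecs.open('path\to\file.txt', 'r', encoding='utf8') as infile:
    xstring = [remove_non_ascii(line) for line in infile]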