
I'm trying to load a .json file produced by an application so I can feed it into different machine learning algorithms to classify the text. The problem is that I can't figure out why NLTK is not loading my .json file; even when I try it with NLTK's own .json file, it doesn't work. From what I gather from the book, I should only need to import 'nltk' and use the 'load' function from 'nltk.data'. Can somebody help me see what I am doing wrong?

Below is the code I used to try loading the file with NLTK.

import nltk
nltk.data.load('corpora/twitter_samples/negative_tweets.json')

After running that, I got the following error:

C:\Python34\python.exe "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py"
Traceback (most recent call last):
   File "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py", line 7, in <module>
     nltk.data.load('corpora/twitter_samples/negative_tweets.json')
  File "C:\Python34\lib\site-packages\nltk\data.py", line 810, in load
    resource_val = json.load(opened_resource)
  File "C:\Python34\lib\json\__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "C:\Python34\lib\json\__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'

Process finished with exit code 1
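
For the bundled corpus specifically, NLTK's dedicated twitter corpus reader avoids nltk.data.load() and its JSON branch entirely. A minimal sketch, assuming the twitter_samples corpus has been downloaded with nltk.download('twitter_samples'):

from nltk.corpus import twitter_samples

# strings() yields just the tweet texts; docs() yields the full
# parsed JSON objects, one per tweet.
tweets = twitter_samples.strings('negative_tweets.json')
print(tweets[0])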

EDIT #1: I'm using Python 3.4.1 and NLTK 3.

EDIT #2: Below is another attempt, this time using json.load() directly.

import json
json.load('corpora/twitter_samples/negative_tweets.json')

But I encountered a similar error:

C:\Python34\python.exe "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py"
Traceback (most recent call last):
  File "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py", line 5, in <module>
    json.load('corpora/twitter_samples/quotefileNeg.json')
  File "C:\Python34\lib\json\__init__.py", line 265, in load
    return loads(fp.read(),
AttributeError: 'str' object has no attribute 'read'

Process finished with exit code 1
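
For reference, json.load() expects an open file object, not a path string, which is what this AttributeError is saying. A direct-reading sketch, assuming the path points at the file on disk and that these corpus files store one JSON object per line (so each line goes through json.loads()):

import json

with open('corpora/twitter_samples/negative_tweets.json', encoding='utf-8') as f:
    tweets = [json.loads(line) for line in f]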
  • Looks like a Python 3 vs Python 2 issue. Are you using an older version of NLTK? – tripleee Jul 04 '16 at 08:58
  • I'm using Python 3.4.1, and NLTK 3. @tripleee – Jarvin Li Jul 04 '16 at 10:55
  • Seems to be a weird issue. Can you use double-quotes and additionally escape the `/` and check? – Ic3fr0g Jul 04 '16 at 12:19
  • Probably a bug in NLTK then. http://stackoverflow.com/questions/6862770/python-3-let-json-object-accept-bytes-or-let-urlopen-output-strings discusses the underlying problem. – tripleee Jul 04 '16 at 12:27
  • @MayurH Even Windows accepts forward slashes as directory separators completely transparently. Escaping slashes makes no sense because they have no special meaning (unlike backslash). – tripleee Jul 04 '16 at 12:29
  • @MayurH I tried what you suggested and it did not work, I tried out different combinations of it but none worked. – Jarvin Li Jul 05 '16 at 01:09
  • @tripleee Thanks for the link, I'll try reading it and hopefully understand it to fix my problems. Thanks :) – Jarvin Li Jul 05 '16 at 01:09

1 Answer


If you want to access a new corpus with a specific format, you can extend the NLTK CorpusReader class as follows (note that the StoryTokenizer used as the default tokenizer is defined further down):

import json
import os

from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView, concat, ZipFilePathPointer

class StoryCorpusReader(CorpusReader):
    corpus_view = StreamBackedCorpusView

    def __init__(self, word_tokenizer=StoryTokenizer(), encoding="utf8"):
        # <folder_path> and <file_name> are the corpus root directory
        # and the fileid(s) to read.
        CorpusReader.__init__(self, <folder_path>, <file_name>, encoding)

        # Refuse to build a reader over empty (zero-byte) files.
        for path in self.abspaths(self._fileids):
            if isinstance(path, ZipFilePathPointer):
                pass
            elif os.path.getsize(path) == 0:
                raise ValueError(f"File {path} is empty")

        self._word_tokenizer = word_tokenizer

    def docs(self, fileids=None):
        # Lazily stream the parsed JSON objects from each corpus file.
        return concat(
            [
                self.corpus_view(path, self._read_stories, encoding=enc)
                for (path, enc, fileid) in self.abspaths(fileids, True, True)
            ]
        )

    def titles(self):
        # Collect the "title" field of every JSON object.
        titles = self.docs()
        standards_list = []
        for jsono in titles:
            text = jsono["title"]
            if isinstance(text, bytes):
                text = text.decode(self.encoding)
            standards_list.append(text)
        return standards_list

    def _read_stories(self, stream):
        # Block reader for StreamBackedCorpusView: parse up to 10
        # line-delimited JSON objects per block.
        stories = []
        for i in range(10):
            line = stream.readline()
            if not line:
                return stories
            story = json.loads(line)
            stories.append(story)
        return stories

together with a specific tokenizer:

import re
import string
import typing

from nltk.tokenize.api import TokenizerI
from nltk.tokenize.casual import _replace_html_entities

REGEXPS = (
    # HTML tags:
    r"""<[^<>]+>""",
    # email addresses:
    r"""[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]""",
)

class StoryTokenizer(TokenizerI):
    _WORD_RE = None

    def tokenize(self, text: str) -> typing.List[str]:
        # Fix HTML character entities:
        safe_text = _replace_html_entities(text)

        # Tokenize
        words = self.WORD_RE.findall(safe_text)

        # Drop tokens that start with punctuation
        words = [
            word
            for word in words
            if re.match(f"[{re.escape(string.punctuation)}——–’‘“”×]", word.casefold())
            is None
        ]

        return words

    @property
    def WORD_RE(self) -> "re.Pattern":
        # Compile the regex once and cache it for this and all future
        # instantiations of StoryTokenizer.
        if not type(self)._WORD_RE:
            type(self)._WORD_RE = re.compile(
                f"({'|'.join(REGEXPS)})",
                re.VERBOSE | re.I | re.UNICODE,
            )
        return type(self)._WORD_RE
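
A quick usage sketch, assuming <folder_path> and <file_name> above have been filled in to point at a directory of line-delimited JSON files whose objects each carry a "title" field:

reader = StoryCorpusReader()

# Stream the parsed JSON objects lazily
for story in reader.docs():
    print(story["title"])

# Or collect every title in one list
print(reader.titles())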