I am trying to find the most frequently used words in tweets. I tokenized a text file and wrote the tokens to a JSON file, but when I call `json.loads` on each line I get an error: `No JSON object could be decoded`.

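  import json
  import re
  from collections import Counter
  from nltk import bigrams
  from nltk.tokenize import word_tokenize

  # s_tweets, preprocess and stop are defined elsewhere (not shown)
  s_tweets.head()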
  print(s_tweets.iloc[:,2])
  tweets = s_tweets.iloc[:,2]
  #step 2: remove the special characters and punctuation
  tlist = []
  for t in tweets:
      t_new=re.sub('[^A-Za-z0-9]+', ' ', t)
      tlist.append(t_new)
      #print(t_new)
      #print(t_list)


  test=word_tokenize(tlist[1])
  print(test)


  fname = 'tokensALL.json'
  ff = open(fname, 'a')
  for i in range(0,1751):
      ff.write(str(word_tokenize(tlist[i])) + "\n")

  ff.close()



  ###### find most frequent words


  fname2 = 'tokensALL.json'
  with open(fname2, 'r') as f:
      count_all = Counter()
      for line in f:
          tweet = json.loads(line)
          # Create a list with all the terms
          terms_stop = [term for term in preprocess(tweet['text'])
                        if term not in stop]
          # Update the counter
          # terms_single = set(terms_all)
          # Count hashtags only
          terms_hash = [term for term in preprocess(tweet['text'])
                        if term.startswith('#')]
          # Count terms only (no hashtags, no mentions)
          terms_only = [term for term in preprocess(tweet['text'])
                        if term not in stop and
                        not term.startswith(('#', '@'))]
          # mind the ((double brackets))
          # startswith() takes a tuple (not a list) if
          # we pass a list of inputs
          terms_single = set(terms_stop)
          terms_bigram = bigrams(terms_stop)
          count_all.update(terms_stop)

      # Print the first 5 most frequent words
      print(count_all.most_common(5))

That's my code. Here is an example of the JSON file's contents: ['cries', 'for', 'help', 'like', 'tears', 'in', 'rain'] ['rain', 'rain', 'go', 'away'] ... etc.

Could anybody help me solve this problem? Thank you!

AlexS
  • Do you really need more than two lines (`line="['cries', 'for', 'help', 'like', 'tears', 'in', 'rain']"; json.loads(line)`) to reproduce the bug? See the [mcve] definition; in keeping with the **minimal** aspect of same, a reproducer should be the *simplest possible code* that generates the same bug. – Charles Duffy Aug 14 '18 at 21:45
  • BTW, is your input really JSON, or is it JSONL? Trying to call `json.loads(line)` for each line in a file will only work for a JSONL file (that is to say, a file for which *each line is a separate valid JSON document*), not a JSON file. – Charles Duffy Aug 14 '18 at 21:46
  • Given what you're writing to `tokensALL.json`, it doesn't sound like you have any idea what JSON is or what it looks like. You should probably go read about JSON first. – user2357112 Aug 14 '18 at 21:46
  • It can't decode your JSON because the JSON you show is invalid. – dmulter Aug 14 '18 at 21:46
  • Could you explain why it is invalid? Thank you – AlexS Aug 15 '18 at 15:26
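
To illustrate why the sample lines are rejected, here is a minimal sketch using only the standard `json` module. The question's code writes `str(word_tokenize(...))`, which is the Python repr of a list with single-quoted strings, while JSON requires double-quoted strings. The helper `is_valid_json` is a hypothetical name introduced for this example; it is not part of the original code.

```python
import json

tokens = ['cries', 'for', 'help', 'like', 'tears', 'in', 'rain']

# str() gives the Python repr: single-quoted strings, which JSON does not allow.
as_repr = str(tokens)

# json.dumps() gives a JSON array: double-quoted strings.
as_json = json.dumps(tokens)


def is_valid_json(text):
    """Return True if text parses as a JSON document (hypothetical helper)."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(as_repr, is_valid_json(as_repr))  # ['cries', 'for', ...] False
print(as_json, is_valid_json(as_json))  # ["cries", "for", ...] True
```

Catching `ValueError` keeps the check working on both Python 2, where the message is "No JSON object could be decoded", and Python 3, where `json.JSONDecodeError` is raised.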

1 Answer

Check whether the contents of your file are actually valid JSON. Here is a post that covers the possibilities and how to check whether text is valid JSON: previous post
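
As a rough sketch of one possible fix (not necessarily the only one), each token list can be written with `json.dumps`, one JSON array per line, and read back with `json.loads`. The sample `token_lists`, the temporary directory, and the file name `tokens_all.jsonl` are assumptions made for illustration; in the question the lists would come from `word_tokenize(tlist[i])`.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical stand-in for the tokenized tweets from the question.
token_lists = [
    ['cries', 'for', 'help', 'like', 'tears', 'in', 'rain'],
    ['rain', 'rain', 'go', 'away'],
]

# Illustrative path in a temporary directory; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'tokens_all.jsonl')

# Write one JSON array per line. Mode 'w' starts a fresh file, whereas 'a'
# would keep any invalid lines left over from earlier runs.
with open(path, 'w') as f:
    for tokens in token_lists:
        f.write(json.dumps(tokens) + '\n')

count_all = Counter()
with open(path) as f:
    for line in f:
        tokens = json.loads(line)  # each line decodes to a list of tokens
        count_all.update(tokens)

print(count_all.most_common(1))  # [('rain', 3)]
```

Note that each decoded line is a plain list, so a lookup such as `tweet['text']` from the question's code would not apply to this file; stop-word filtering would operate on the list directly.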

Ruxi Zhang