
I want to create a visualization of the words my girlfriend and I use most frequently on Facebook. I downloaded all our messages directly from FB as a JSON file and I got the counter working.

BUT:

  • Counter also counts element names from JSON like "sender_name" or timestamps which are 13 digit numbers
  • The JSON file seems to lack proper UTF-8 encoding: escape sequences like \u00c5 and \u0082a are hardcoded into the words

How do I exclude short meaningless words like 'you, I, a, but' etc?

For the first problem I tried creating a dictionary of words to exclude, but I have no idea how to actually exclude them. Deleting the timestamp numbers is also a problem, because they differ from message to message.

For the second problem I tried opening the file in a text editor and replacing the symbol codes, but it crashes every time because of the size of the file (more than 1.5 million lines).

Here's the code that I used to print most frequent words:

import re
import collections
import json

file = open('message.json', encoding="utf8")
a = file.read()

words = re.findall(r'\w+', a)

most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)

And JSON file structure looks like this:

{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
Rarblack
Marsin Ka

2 Answers


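The problem is that you are running findall over the raw text of the whole file, so it also matches key names and timestamps. Parse the JSON first and count only the words in the 'content' field, like this: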

import re
import collections
import json


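def words(s):
    # Raw string so that '\w' is not treated as an invalid escape sequence
    return re.findall(r'\w+', s)

# The with block closes the file once it has been parsed
with open('message.json', encoding="utf8") as file:
    data = json.load(file)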

counts = collections.Counter((w.lower() for e in data for w in words(e.get('content', ''))))
most_common = counts.most_common(50)
print(most_common)

Output

[('siä', 1), ('ci', 1), ('podobajä', 1)]

The output is for a file with the following content (a list of JSON objects):

[{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
}]

Explanation

json.load reads the content of the file into data, a list of dictionaries. The code then iterates over the elements of that list and counts the words of each 'content' field using the function words and a Counter. Key names and timestamps are never passed to words, so they are not counted.

Further

  1. For removing words such as I, a and but see this
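As a rough sketch of that filtering step, the words can be checked against a set of stop words before they are counted. The set STOP_WORDS, the function count_meaningful_words and the min_length cutoff below are hypothetical names chosen for illustration, not part of any library; a real stop-word list would have to match the language of the messages.

```python
import collections
import re

# Hypothetical stop-word set (an assumption); replace it with a list
# suited to the language of your messages.
STOP_WORDS = {"you", "i", "a", "but"}

def count_meaningful_words(texts, stop_words=STOP_WORDS, min_length=2):
    """Count lower-cased words, skipping stop words and very short tokens."""
    counts = collections.Counter()
    for text in texts:
        for word in re.findall(r"\w+", text.lower()):
            if len(word) >= min_length and word not in stop_words:
                counts[word] += 1
    return counts

sample = ["I like you but not a lot", "You like coffee"]
counts = count_meaningful_words(sample)
print(counts.most_common(3))
```

Filtering before counting keeps the Counter small. Filtering the finished Counter afterwards would work as well.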

UPDATE

Given the actual format of your file, you need to change the line data = json.load(file) to data = json.load(file)["messages"]. For the following content:

{
  "participants":[],
  "messages": [
    {
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329382942,
      "content": "aaa",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329262248,
      "content": "aaa",
      "type": "Generic"
    }
  ]
}

The output is:

[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
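The words in this output still contain the mis-encoded characters from the question. A possible fix, assuming the export stored each UTF-8 byte as a separate Latin-1 code point (which matches the sample, where \u00c4\u0085 corresponds to the UTF-8 bytes C4 85), is to encode each string back to Latin-1 and decode it as UTF-8. The helper name fix_mojibake is made up for this sketch.

```python
import json

def fix_mojibake(text):
    """Re-interpret Latin-1 code points as UTF-8 bytes.

    Returns the text unchanged if it does not fit the assumed pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# One message in the same shape as the sample from the question
raw = '{"content": "Podobaj\\u00c4\\u0085 ci si\\u00c4\\u0099"}'
message = json.loads(raw)
fixed = fix_mojibake(message["content"])
print(fixed)
```

Applying fix_mojibake to each 'content' value before passing it to words should make the counter see the repaired words. Whether every string in a given export follows this pattern is not guaranteed, which is why the helper falls back to the original text.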
Dani Mesejo
  • I'm getting an error: counts = collections.Counter((w.lower() for e in data for w in words(e.get('content', '')))) AttributeError: 'str' object has no attribute 'get' – Marsin Ka Oct 24 '18 at 14:37
  • Is the JSON a list of only JSON objects or does the list include strings also? – Dani Mesejo Oct 24 '18 at 14:38
  • Sadly I can't answer that question with my current knowledge yet. It starts like this and just goes on till the last message: { "participants": [ { "name": "aaa" }, { "name": "aaa" } ], "messages": [ { "sender_name": "aaa", "timestamp_ms": 1540329382942, "content": "aaa", "type": "Generic" }, { "sender_name": "aaa", "timestamp_ms": 1540329262248, "content": "aaa", "type": "Generic" }, – Marsin Ka Oct 24 '18 at 14:40
  • @MarsinKa Updated the answer! – Dani Mesejo Oct 24 '18 at 14:46
  • Damn, works like a charm! So if I want to read a specific element from JSON I can just name it in square brackets when loading the file? – Marsin Ka Oct 24 '18 at 14:51
  • @MarsinKa json.load transforms the input into a dictionary, so you can access any key using square brackets. – Dani Mesejo Oct 24 '18 at 14:52
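To illustrate the square-bracket access discussed in the comments, here is a small sketch on an inline string shaped like the excerpt posted above. It uses json.loads (the string counterpart of json.load) so that it runs without a file; the variable names are arbitrary.

```python
import json

raw = '''
{
  "participants": [{"name": "aaa"}],
  "messages": [
    {"sender_name": "aaa", "timestamp_ms": 1540329382942,
     "content": "aaa", "type": "Generic"}
  ]
}
'''

data = json.loads(raw)        # a JSON object becomes a dict
messages = data["messages"]   # a JSON array becomes a list
first = messages[0]           # each message is again a dict

print(type(data).__name__, type(messages).__name__)
# .get returns the default instead of raising KeyError for a missing key
print(first["sender_name"], first.get("content", ""))
```

Using .get with a default, as the answer does, avoids a KeyError for any message that has no 'content' key.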

Have you tried reading the json as a dictionary and inspecting types? You can also look for unwanted words after the fact and remove them.

import json
from collections import Counter

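# Placeholder set of words to drop (an assumption); extend it as needed.
UNWANTED_WORDS = {'you', 'i', 'a', 'but'}

def get_words(string):
    # Lower-case every whitespace-separated token
    return [word.lower() for word in string.split()]

def count_words(json_item):
    # Recursively collect the words from keys and string values.
    # Key names such as 'sender_name' are collected too; add them to
    # UNWANTED_WORDS if you want to drop them.
    if isinstance(json_item, dict):
        words = []
        for key, value in json_item.items():
            words += count_words(key) + count_words(value)
        return words
    elif isinstance(json_item, str):
        return get_words(json_item)
    elif isinstance(json_item, list):
        return [word for item in json_item for word in count_words(item)]
    else:
        return []  # numbers (e.g. timestamps), booleans and null carry no words

with open('message.json', encoding="utf-8") as f:
    json_input = json.load(f)
counter = Counter(count_words(json_input))
result = {key: value for key, value in counter.items() if key not in UNWANTED_WORDS}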
sihrc