I want to create a visualization of the most frequently used words between my gf and me on Facebook. I downloaded all messages directly from FB as a JSON file and I got the counter working, BUT:
- The counter also counts element names from the JSON, like "sender_name", and timestamps, which are 13-digit numbers.
- The JSON file is lacking UTF encoding, so I have strings like \u00c5\u0082a hardcoded into the words.

How do I exclude short meaningless words like 'you', 'I', 'a', 'but', etc.?
For the first problem I tried creating a dictionary of words to exclude, but I have no idea how to even approach actually excluding them. Deleting the timestamp numbers is also a problem, because they are not constant.
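This is roughly the kind of filtering I had in mind, just a minimal sketch: the EXCLUDED set and the digits check are my own guesses at an approach, and the same set could also hold the short stop words like 'you' and 'but':

```python
import re
import collections

# words I want to skip: the JSON element names plus short stop words
# (this EXCLUDED set is just my first guess, not a complete list)
EXCLUDED = {"sender_name", "timestamp_ms", "content", "type",
            "you", "i", "a", "but"}

def keep(word):
    # drop excluded words and purely numeric tokens (e.g. 13-digit timestamps)
    return word not in EXCLUDED and not word.isdigit()

with open('message.json', encoding='utf8') as f:
    text = f.read()

words = (w.lower() for w in re.findall(r'\w+', text))
most_common = collections.Counter(w for w in words if keep(w)).most_common(50)
print(most_common)
```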
For the second problem I tried just opening the file in a text editor and replacing the symbol codes, but it crashes every time because of the size of the file (more than 1.5 million lines).
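From what I've read, Facebook's export seems to write the raw UTF-8 bytes of each character as separate \uXXXX escapes, so maybe the fix can be done in code instead of an editor. This is just a sketch of what I mean, assuming a latin-1 round-trip is the right repair:

```python
# re-encode the mis-decoded string as latin-1 bytes, then decode those
# bytes as UTF-8 to recover the real characters (my assumption)
def fix_mojibake(s):
    return s.encode('latin-1').decode('utf-8')

print(fix_mojibake("Podobaj\u00c4\u0085 ci si\u00c4\u0099"))  # -> "Podobają ci się"
```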
Here's the code that I used to print the most frequent words:
```python
import re
import collections
import json  # imported, but I never actually parse the JSON yet

# read the whole export as plain text
with open('message.json', encoding='utf8') as f:
    a = f.read()

# grab every "word" in the raw file, JSON keys and numbers included
words = re.findall(r'\w+', a)
most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)
```
And the JSON file structure looks like this:
```json
{
    "sender_name": "xxxxxx",
    "timestamp_ms": 1540327935616,
    "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
    "type": "Generic"
},
```
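I'm guessing the cleaner direction is to actually parse the JSON and count only the "content" fields, which would skip the element names and timestamps entirely. Here is a rough sketch of that idea; I'm assuming the messages sit under a top-level "messages" list, which matches what I see in my file:

```python
import collections
import json
import re

with open('message.json', encoding='utf8') as f:
    data = json.load(f)

counter = collections.Counter()
for message in data.get('messages', []):
    content = message.get('content')  # photos/stickers have no text content
    if content:
        counter.update(w.lower() for w in re.findall(r'\w+', content))

print(counter.most_common(50))
```

Would combining something like this with the encoding fix and the stop-word filtering above be the right approach?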