
I have a folder of files totalling around 50GB. Each file consists of line after line of JSON data, and in this JSON structure there is a field for user_id.

I need to count the number of unique User IDs across all of the files (and only need the total count). What is the most memory-efficient and relatively quick way of counting these?

Of course, loading everything into a huge list maybe isn't the best option. I tried pandas but it took quite a while. I then tried to simply write the IDs to text files, but I thought I'd find out if I was maybe missing something far simpler.

Pika Supports Ukraine
apgsov
  • Have you tried to use a `set` instead of a `list`? That would probably be the most efficient way... – Ralf Apr 12 '19 at 20:23
  • Possible duplicate of [Process data, much larger than physical memory, in chunks](https://stackoverflow.com/questions/17710748/process-data-much-larger-than-physical-memory-in-chunks) – ivan_pozdeev Apr 12 '19 at 20:23
  • A set may not fit in memory any better than a list would, although of course lookup is faster. This is not a duplicate of the above because of the uniqueness restriction--it's not as simple as just reading line by line. – ggorlen Apr 12 '19 at 20:24
  • You do not know whether it would fit or not – Serge Apr 12 '19 at 20:25
  • Yet lists take less memory (though they are slow) – Serge Apr 12 '19 at 20:29
  • You can't escape having to iterate through the files once to fetch the IDs. While you're at it, throw them into a set. This will remove any duplicates. This is the simpler approach, an alternative would be the two-stack queue. – RoyM Apr 12 '19 at 20:30
  • So basically everything depends on the number of users. See https://stackoverflow.com/questions/495161/fast-disk-based-hashtables if you have billions – Serge Apr 12 '19 at 20:32
  • If you have enough memory, `len({user_id: True for user_id in user_ids})` will be faster than a set on a list, I guess. If the dictionary doesn't fit, and the user_ids are incremental, maybe you could bin the ids and repeat this on all the chunks – Lante Dellarovere Apr 12 '19 at 20:44
  • "in this JSON structure is a field for user_id." So there are other fields? If you have a file with 1000 lines, what percentage will be user_id? What characters can be in the user_id? – Steven Rumbalski Apr 12 '19 at 20:57
  • Does the context in which `user_id` shows up matter? Or can we just `grep` for it? – Jan Christoph Terasa Apr 12 '19 at 21:00
  • @StevenRumbalski https://firestream-portal.stocktwits.com/documentation/stream is the full JSON structure. Also, Jan, the only thing that matters is the unique user count, context is irrelevant. – apgsov Apr 12 '19 at 21:04

4 Answers


Since you only need the user_ids, load each .json file (as a data structure), extract the ids, then destroy all references to that structure and any of its parts so that they are garbage collected.

To speed up the process, you can do this in a few processes in parallel; take a look at multiprocessing.Pool.map.
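
For example, here is a minimal sketch of that idea, assuming one JSON object per line with a top-level user_id field (the glob pattern and the extract_ids helper are illustrative, not part of the original answer; adjust them to the real structure of your files):

import glob
import json
from multiprocessing import Pool

def extract_ids(path):
    # Parse one file line by line and keep only the user_ids, so each
    # parsed object can be garbage collected right away.
    ids = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            if 'user_id' in obj:
                ids.add(obj['user_id'])
    return ids

if __name__ == '__main__':
    unique_ids = set()
    with Pool(4) as pool:
        # Each worker returns the set of ids found in one file; the parent
        # process only keeps the running union of those sets.
        for ids in pool.imap_unordered(extract_ids, glob.glob('*.json')):
            unique_ids.update(ids)
    print(len(unique_ids))

imap_unordered is used here rather than Pool.map so that per-file results can be merged (and discarded) as they arrive, instead of all being held in a list at once.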

ivan_pozdeev

Since it was stated that the JSON context of user_id does not matter, we just treat the JSON files as the pure text files they are.

GNU tools solution

I'd not use Python at all for this, but rather rely on the tools provided by GNU, and pipes:

cat *.json | sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p' | sort -un --parallel=4 | wc -l
  • cat *.json: Output contents of all files to stdout
  • sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p': Look for lines containing "user_id": "{number}" and print only the number to stdout
  • sort -un --parallel=4: Sort the output numerically, ignoring duplicates (i.e. output only unique values), using multiple (4) jobs, and output to stdout
  • wc -l: Count number of lines, and output to stdout

To determine whether the values are unique, we just sort them. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.

Python solution

If you want to use Python nonetheless, I'd recommend using a set and re (regular expressions):

import fileinput
import re

# Match "user_id": "<digits>" anywhere in a line and capture the digits.
r = re.compile(r'"user_id"\s*:\s*"([0-9]+)"')

s = set()
for line in fileinput.input():
    m = r.search(line)
    if m:
        s.add(m.group(1))

print(len(s))

Run this using python3 <scriptname>.py *.json.

Jan Christoph Terasa
  • I doubt the OP is as memory constrained as he thinks, but if he is, the Python solution will last a little longer by changing `s.add(m.group(1))` to `s.add(int(m.group(1)))`, as integer objects are smaller than strings, and strings grow with each successive character, whereas an `int` won't grow until its internal representation is larger than what the platform can handle in whatever internal C integer type it's using (I'm fairly certain, but not positive of that last statement). – Steven Rumbalski Apr 12 '19 at 21:45
  • `sed -nE 's/.*\"user_id\"\: \"([0-9]+)\".*/\1/p'` is not reliable since it doesn't parse JSON properly. – ivan_pozdeev Apr 12 '19 at 23:33
  • I think the edit should cover the use case. This was not meant to parse JSON in a general fashion. – Jan Christoph Terasa Apr 13 '19 at 00:17

Try the simplest approach first.

Write a function get_user_ids(filepath) that returns a list of the user_ids in a JSON file.
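
A minimal sketch of such a function, assuming one JSON object per line with a top-level user_id field (adjust the parsing to the real structure of your files), could be:

import json

def get_user_ids(filepath):
    # Collect the user_id of every JSON object in the file, one object per line.
    ids = []
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if line:
                obj = json.loads(line)
                if 'user_id' in obj:
                    ids.append(obj['user_id'])
    return ids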

Then do:

from pathlib import Path
the_folder = Path("path/to/the/folder")
user_ids = set()
for jsonpath in the_folder.glob('*.json'):
    user_ids.update(get_user_ids(jsonpath))
print(len(user_ids))
Nathan Vērzemnieks
Steven Rumbalski

If the list of user IDs is so large that it can't reasonably fit into a set in memory, an easy and memory-efficient way to de-duplicate is to simply create files named after user IDs in an empty directory, and then count the number of files in the directory. This works because most filesystems are efficient at indexing file names in a directory.

import os
os.mkdir('count_unique')   # an empty working directory, created in the current directory
os.chdir('count_unique')
# change the following demo tuple to a generator that reads your JSON files and yields user IDs
for user_id in 'b', 'c', 'b', 'a', 'c':
    open(user_id, 'w').close()   # one empty file per ID; duplicates simply reuse the same name
print(sum(1 for _ in os.scandir('.')))

This outputs: 3

blhsing