Since it was stated that the JSON context of `user_id` does not matter, we can simply treat the JSON files as the plain text files they are.
GNU tools solution
I wouldn't use Python at all for this, but rather rely on the tools provided by GNU, and pipes:

```sh
cat *.json | sed -nE 's/\s*"user_id"\s*:\s*"([0-9]+)"\s*/\1/p' | sort -un --parallel=4 | wc -l
```
- `cat *.json`: output the contents of all files to stdout
- `sed -nE 's/\s*"user_id"\s*:\s*"([0-9]+)"\s*/\1/p'`: look for lines containing `"user_id": "{number}"` and print only the number to stdout
- `sort -un --parallel=4`: sort the output numerically, ignoring duplicates (i.e. output only unique values), using multiple (4) jobs, and output to stdout
- `wc -l`: count the number of lines and output to stdout
To determine whether the values are unique, we just sort them. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.
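If you don't want to hard-code the job count, a possible variant (assuming GNU coreutils, which provides `nproc`) derives it from the machine's core count:

```sh
cat *.json | sed -nE 's/\s*"user_id"\s*:\s*"([0-9]+)"\s*/\1/p' | sort -un --parallel="$(nproc)" | wc -l
```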
Python solution
If you want to use Python nonetheless, I'd recommend using a `set` and `re` (regular expressions):
```python
import fileinput
import re

# Same pattern as in the sed solution: capture the digits of "user_id": "<number>"
r = re.compile(r'\s*"user_id"\s*:\s*"([0-9]+)"\s*')

s = set()
for line in fileinput.input():
    m = r.match(line)
    if m:
        # The set keeps each value only once, so duplicates are absorbed automatically
        s.add(m.group(1))

print(len(s))
```
Run this using `python3 <scriptname>.py *.json`.
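Because `fileinput` falls back to standard input when no filenames are passed, you can also sanity-check the script on made-up sample data; `count_ids.py` below is just a placeholder name:

```sh
printf '"user_id": "42"\n"user_id": "42"\n' | python3 count_ids.py
```

This should print `1`, since the duplicate value is absorbed by the set.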