First, I am relatively new at programming; Python is the only language I have any familiarity with using. Secondly, I put DB in the question because that's what seems right to me after searching around, but I am open to not using a DB at all if that's easier or more efficient.
What I Have to Work With
I have a folder with ~75,000 JSON files. They all have the same structure; here is an example of what they look like (more on that below):
{
"id": 93480
"author": "",
"joined by": [],
"date_created": "2010-04-28T16:07:21Z"
"date_modified": "2020-02-21T21:42:45.655644Z"
"type": "010combined",
"page_count": null,
"plain_text": "",
"html": "",
"extracted_by_ocr": false,
"cited": [
]
}
One way that the real files differ from the above is that either the "plain_text" or the "html" key will have an actual value, namely text (whether plaintext or HTML). The length of that text can vary from a couple of sentences to over 200 pages worth of text. Thus, the JSON files range in size from 907 bytes at the smallest to 2.1 MB.
What I'm Trying to Do
I want to be able, essentially, to search through all the files for a word or phrase contained in either the plain_text or HTML fields and, at a minimum, return a list of files containing that word or phrase. [Ideally, I'd do other things with them, as well, but I can figure that stuff out later. What I'm stumped on is where to begin.]
What I Can't Figure Out
Whether to even bother with a document-store db like MongoDB (or PostgreSQL). If that's the appropriate way to handle this, I'm open to working my way through it. But I can't even tell if that's how I should attack the problem, or if I should instead just use a Python script to iterate over the files in the folder directly. Can you use populate a DB with all the files in a folder, then search for a substring in each row? The fact that some of these files have a ton of text in one of the values makes it seem weird to me to use a DB at all, but again: I don't know what I'm doing.
I think I know to iterate over the files directly with Python. I know how to open files, and I know how to get a list of keys from JSON files. But how do you search for a matching substring in two JSON values? And then, if the substring is found in one of them, how do you return the "id" field to a list, close the file, and move to the next one? (I mean, obviously, the basic structure is a conditional. Here's the logical structure of what I'm thinking here:
- Variable = "substring I want to match"
- List = [] # Will hold ids of files containing variable
- Open file
- Read file to the end
- Search file [or just the two JSON keys?] for variable
- If variable found append "id" to list
- Close file
- Move to the next one in the directory
It's the actual code part that I'm stumbling over.