
First, I am relatively new at programming; Python is the only language I have any familiarity with using. Secondly, I put DB in the question because that's what seems right to me after searching around, but I am open to not using a DB at all if that's easier or more efficient.

What I Have to Work With

I have a folder with ~75,000 JSON files. They all have the same structure; here is an example of what they look like (more on that below):

{
    "id": 93480,
    "author": "",
    "joined by": [],
    "date_created": "2010-04-28T16:07:21Z",
    "date_modified": "2020-02-21T21:42:45.655644Z",
    "type": "010combined",
    "page_count": null,
    "plain_text": "",
    "html": "",
    "extracted_by_ocr": false,
    "cited": []
}

One way the real files differ from the example above is that either the "plain_text" or the "html" key holds an actual value, namely text (plaintext or HTML, respectively). The length of that text varies from a couple of sentences to over 200 pages' worth, so the JSON files range in size from 907 bytes at the smallest to 2.1 MB at the largest.

What I'm Trying to Do

I want to be able, essentially, to search through all the files for a word or phrase contained in either the plain_text or html field and, at a minimum, return a list of files containing that word or phrase. [Ideally, I'd do other things with them as well, but I can figure that stuff out later. What I'm stumped on is where to begin.]

What I Can't Figure Out

  1. Whether to even bother with a document-store DB like MongoDB (or a relational one like PostgreSQL). If that's the appropriate way to handle this, I'm open to working my way through it. But I can't even tell if that's how I should attack the problem, or if I should instead just use a Python script to iterate over the files in the folder directly. Can you populate a DB with all the files in a folder, then search for a substring in each row? The fact that some of these files have a ton of text in one of the values makes it seem weird to me to use a DB at all, but again: I don't know what I'm doing.

  2. I think I know how to iterate over the files directly with Python. I know how to open files, and I know how to get a list of keys from JSON files. But how do you search for a matching substring in two JSON values? And then, if the substring is found in one of them, how do you return the "id" field to a list, close the file, and move on to the next one? (I mean, obviously, the basic structure is a conditional.) Here's the logical structure of what I'm thinking, followed by a rough sketch:

  • Variable = "substring I want to match"
  • List = [] # Will hold ids of files containing variable
  • Open file
  • Read file to the end
  • Search file [or just the two JSON keys?] for variable
  • If the variable is found, append "id" to the list
  • Close file
  • Move to the next one in the directory

It's the actual code part that I'm stumbling over.
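
For concreteness, here is roughly what I picture, as an untested sketch (the folder path and the search phrase are placeholders, and I'm only guessing that checking the two text fields directly is the right move):

import json
from pathlib import Path

phrase = "substring I want to match"  # placeholder search term
matching_ids = []  # will hold the ids of files containing the phrase

for path in Path("path/to/json_folder").glob("*.json"):  # placeholder folder
    with open(path, encoding="utf-8") as f:  # the with-block closes the file for me
        record = json.load(f)
    # search only the two text fields rather than the whole file
    if phrase in record.get("plain_text", "") or phrase in record.get("html", ""):
        matching_ids.append(record["id"])

Even if that shape is right, it would re-read all ~75,000 files on every search, which is part of why I wonder whether a DB or some other tool is the better approach.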

dmvjjvmd
  • Using some tool is preferred to iterating over files since it'll take a ton of time to process files naively. Take a look at Elasticsearch. – Alexandr Tatarinov Apr 04 '21 at 18:39
  • If you want to actually search arbitrary text for words and phrases, use a search engine (Elasticsearch, Solr, Graylog, Splunk) not a document store which requires indicies for efficient searches – OneCricketeer Apr 04 '21 at 18:39
  • Thanks. I had a hunch that either or both of my naive approaches were wrong. It didn't take long to come to that conclusion when all the examples and tutorials for various dbs, etc. use examples with very simple/short values in key:value pairs. But this is the steepest part, I think, of the learning curve with programming. The zillion different tools for doing jobs and not knowing even how to approach figuring out which one to use. – dmvjjvmd Apr 04 '21 at 19:43

1 Answer


An idea using pandas, since I don't know about search engines; some of this is adapted from How to read multiple json files into pandas dataframe?

import json
import pandas as pd

dfs = []  # an empty list to store the data frames
for file in file_list:  # file_list = paths to your ~75,000 JSON files
    with open(file, encoding="utf-8") as f:
        record = json.load(f)           # each file holds a single JSON object
    dfs.append(pd.DataFrame([record]))  # one-row data frame per file

temp = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list

Creating it will take forever, but once that's done you can search and do operations quickly. E.g. if you want to find all ids where author is not empty:

id_list = temp.loc[temp['author'] != '']['id'].tolist()
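
The substring search from the question would look similar: build a boolean mask over the two text columns and pull out the matching ids. A rough sketch (the phrase is a placeholder; na=False guards against missing values, and regex=False treats the phrase as a literal string rather than a regular expression):

phrase = "some word or phrase"  # placeholder
mask = (temp['plain_text'].str.contains(phrase, case=False, regex=False, na=False)
        | temp['html'].str.contains(phrase, case=False, regex=False, na=False))
matching_ids = temp.loc[mask, 'id'].tolist()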

If the combined size of all your files is gigantic, you may want to consult the docs on storing things more efficiently (https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html) or use another method.

Sam Szotkowski