I am relatively new to Python and struggling with the following:
I have a list of about 52,000 dictionaries containing metadata on PDFs (that are stored separately). Now, I want to match 5,000 of these PDFs to their corresponding metadata dictionaries, but I'm not sure how to do this.
Metadata:
[{'Title': 'This is the title', 'Author': 'John A.', 'Code': '8372', ...}, {'Title': 'This is another title', 'Author': 'Peter B.', 'Code': '5837_c', ...}, ...]
The PDF file names correspond to the 'Code' values (e.g. the file names are 5346, 8372, 3475_c, 0294, 5837_c, etc.; always three, four or five digits, optionally followed by _c). Is there a way to match the PDFs to the right dictionaries in the metadata list, using the PDF file names as the key?
Other solutions are also very welcome!
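To make the naming scheme concrete, here is a minimal sketch of the file-name pattern described above as a regular expression. The function name code_from_filename and the ".pdf" extension in the examples are assumptions for illustration only; the real files may be named differently.

```python
import re
from pathlib import Path

# Three to five digits, optionally followed by "_c" (pattern taken from the description above).
CODE_PATTERN = re.compile(r"\d{3,5}(?:_c)?")


def code_from_filename(filename):
    """Return the code encoded in a PDF file name, or None if the name does not fit the pattern.

    Hypothetical helper: assumes the code is the file name without its extension.
    """
    stem = Path(filename).stem  # e.g. "8372.pdf" -> "8372"
    return stem if CODE_PATTERN.fullmatch(stem) else None
```

Because the codes are kept as strings, leading zeros (as in 0294) survive and compare equal to the 'Code' values in the metadata.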
Edit: My aim is to create a Textacy Corpus in which every entry is a Textacy Doc (i.e. the content of one PDF) together with its corresponding Textacy Metadata (i.e. that PDF's metadata).
textacy_corpus = textacy.Corpus(u'en', texts=pdfs_list, metadatas=metadata_list)
From Textacy's documentation:
"[Metadata] stream must align exactly with texts or docs, or else metadata will be mis-assigned. More concretely, the first item in metadatas will be assigned to the first item in texts or docs, and so on from there."
This is why I want to match the PDFs to the right metadata.
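One possible way to get the alignment, sketched under some assumptions: the codes have already been extracted from the PDF file names as strings, every metadata dictionary has a 'Code' key, and the 'Code' values are unique (if they are not, the later dictionary silently wins). The function name align_metadata is made up for this example. It indexes the metadata by 'Code' once and then walks the PDF codes in order, so the two returned lists line up position by position.

```python
def align_metadata(pdf_codes, metadata):
    """Return (matched_codes, matched_metadata, missing).

    matched_codes[i] and matched_metadata[i] belong to the same PDF.
    Codes with no metadata entry are collected in `missing` instead of being dropped silently.
    """
    # One pass over the metadata; assumes 'Code' values are unique.
    by_code = {entry["Code"]: entry for entry in metadata}

    matched_codes, matched_metadata, missing = [], [], []
    for code in pdf_codes:
        entry = by_code.get(code)
        if entry is None:
            missing.append(code)
        else:
            matched_codes.append(code)
            matched_metadata.append(entry)
    return matched_codes, matched_metadata, missing


# Small example using the sample metadata from the question.
metadata = [
    {"Title": "This is the title", "Author": "John A.", "Code": "8372"},
    {"Title": "This is another title", "Author": "Peter B.", "Code": "5837_c"},
]
codes, metas, missing = align_metadata(["5837_c", "8372", "0294"], metadata)
```

The texts would then be loaded in the order of matched_codes, so that the text list and the metadata list passed to the corpus have the same order and length. The dictionary lookup avoids scanning all 52,000 entries for each of the 5,000 PDFs.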