0

I am relatively new to Python and struggling with the following:

I have a list of about 52,000 dictionaries containing metadata on PDFs (that are stored separately). Now, I want to match 5,000 of these PDFs to their corresponding metadata dictionaries, but I'm not sure how to do this.

Metadata:

[{'Title': 'This is the title', 'Author': 'John A.', 'Code': '8372', ...}, {'Title': 'This is another title', 'Author': 'Peter B.', 'Code': '5837_c', ...}, ...]

The PDF file names correspond to the 'Code' values (i.e. the file names are 5346, 8372, 3475_c, 0294, 5837_c, etc., always either three, four or five numbers or three, four or five numbers complemented by _c). Is there a way in which I can match the PDFs to the right dictionaries in the list of metadata dictionaries, using the file names of the PDFs to match?

Other solutions are also very welcome!

Edit: My aim is to create a Textacy Corpus, in which every entry is a Textacy Doc (i.e. the content of one PDF) and its corresponding Textacy Metadata (i.e. the PDFs metadata).

textacy_corpus = textacy.Corpus(u'en', texts=pdfs_list, metadatas=metadata_list)

From Textacy's documentation: "[Metadata] stream must align exactly with texts or docs, or else metadata will be mis-assigned. More concretely, the first item in metadatas will be assigned to the first item in texts or docs, and so on from there." This is why I want to match the PDFs to the right metadata.

NynkeLys
  • 95
  • 2
  • 11
  • 1
    Possible duplicate of [Python list of dictionaries search](https://stackoverflow.com/questions/8653516/python-list-of-dictionaries-search) – Priyank Jul 17 '17 at 09:44
  • How has you stored the names of your files? Are they contained in a list? – asongtoruin Jul 17 '17 at 09:44
  • No, I have only read in the PDFs as I want to perform analyses on the texts later on. The thing is I want to create a Textacy corpus containing the PDFs and their metadata, only to perform analyses after. – NynkeLys Jul 17 '17 at 09:48
  • 1
    Not sure what it is we can do for you.. What do you mean by *"match the PDF to the dictionary"*? Like the filepath? What should the outcome of the code be? – Ma0 Jul 17 '17 at 09:52
  • What you have in your first example is a list of dictionaries. From your explanation I understand that each dictionary's 'Code' key contains a value or a list of values. Each of those values can be found somewhere as an actual filename. Your question is how can you search for each of those filenames on that specific location? – BoboDarph Jul 17 '17 at 09:56
  • Yes, the 'Code' key always contains only one value, corresponding to an actual filename. Please see my edit, I hope it clarifies my problem a bit. – NynkeLys Jul 17 '17 at 09:59
  • So you need to parse the PDF file at that location and then extract it's metadata, then build two lists: one of the files and one of the metadata for each. Both lists contain dicts. Could you please post the structure for the metadata dict? – BoboDarph Jul 17 '17 at 10:05
  • I have already! I have retrieved the metadata dict list from an external source, not directly from the PDFs themselves (the metadata is stored in separate software). – NynkeLys Jul 17 '17 at 10:12
  • Do you need to make sure that the data in the metadata list corresponds and is in the same order as the data in the pdfs_list? If so, I suggest ordering both list based on a common key (Code if present in both). – BoboDarph Jul 17 '17 at 10:24
  • It worked, thanks a million! Funny I couldn't get to this solution myself - sometimes you just need someone to point out the obvious :-) – NynkeLys Jul 17 '17 at 11:21

1 Answers1

0
dict((x['Code'],x) for x in <YOUR_LIST>)
Johnny Kuo
  • 91
  • 1
  • 7
  • Welcome to Stack Overflow! Thank you for this code snippet, which may provide some immediate help. A proper explanation [would greatly improve](//meta.stackexchange.com/q/114762) its educational value by showing *why* this is a good solution to the problem, and would make it more useful to future readers with similar, but not identical, questions. Please [edit] your answer to add explanation, and give an indication of what limitations and assumptions apply. – Toby Speight Jul 17 '17 at 13:44