
I have created the following dictionary from the Cranfield Collection:

{
    'd1'   : ['experiment', 'studi', ..., 'configur', 'experi', '.'], 
    'd2'   : ['studi', 'high-spe', ..., 'steadi', 'flow', '.'],
    ..., 
    'd1400': ['report', 'extens', ..., 'graphic', 'form', '.']
}

Each key-value pair represents a single document: the key is the document ID and the value is a list of that document's tokenized, stemmed words with stopwords removed. I need to create an inverted index from this dictionary with the following format:

{
    'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...}, 

    'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}

    ...
}

Here each key is a term, and each value is a dictionary that maps every document containing the term to the number of times the term occurs in it and the positions (indices into that document's token list) where it is found. I am not sure how to approach this conversion, so I am just looking for some starter pointers on how to think about this problem. Thank you.

Hefe
  • where is the code you've tried to write? – DevLounge Sep 01 '22 at 00:42
  • You tagged the question with `python` and `lucene`. Does that mean you want to use [PyLucene](https://lucene.apache.org/pylucene/) for this? I'm not sure how that aligns with the format you expect for your output. Why the Lucene tag? – andrewJames Sep 01 '22 at 00:50
  • Hey, sorry. I tagged Lucene because what I’m trying to do is loosely related to Information Retrieval in general. Regardless, I can remove that tag because I don’t believe I need to use PyLucene – Hefe Sep 01 '22 at 01:06
  • @andrewJames good point. I would have tried something but honestly had no idea where to start. I am not very well-versed in dictionaries – Hefe Sep 01 '22 at 01:10
  • No problem - I understand. If you want to use PyLucene take a look at [PyLucene Indexer and retriever sample](https://stackoverflow.com/q/47668000/12567365). That will create an inverted index, of course - but may be far more than you need. Or you may end up, piece-by-piece, writing your own Lucene-Lite, before you are done. – andrewJames Sep 01 '22 at 01:19

1 Answer


I hope I've understood your question well:

dct = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
    "d1400": ["report", "extens", "graphic", "form", "."],
}

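out = {}
# First pass: collect the positions of every word, per document.
for k, v in dct.items():
    for idx, word in enumerate(v):
        out.setdefault(word, {}).setdefault(k, []).append(idx)

# Second pass: rewrite each position list in place as [count, [positions]].
for v in out.values():
    for positions in v.values():
        positions[:] = [len(positions), list(positions)]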

print(out)

Prints:

{
    "experiment": {"d1": [1, [0]]},
    "studi": {"d1": [1, [1]], "d2": [1, [0]]},
    "configur": {"d1": [1, [2]]},
    "experi": {"d1": [1, [3]]},
    ".": {"d1": [1, [4]], "d2": [1, [5]], "d1400": [1, [4]]},
    "high-spe": {"d2": [1, [1]]},
    "steadi": {"d2": [1, [2]]},
    "flow": {"d2": [2, [3, 4]]},
    "report": {"d1400": [1, [0]]},
    "extens": {"d1400": [1, [1]]},
    "graphic": {"d1400": [1, [2]]},
    "form": {"d1400": [1, [3]]},
}
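
If you would rather build the final shape without mutating the lists in place, the same idea can be sketched with `collections.defaultdict`. This is only a minimal sketch under the assumptions of the toy data above; the function name `build_inverted_index` and the variable names are mine, not anything from the question.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [count, [positions]]}."""
    # Nested defaultdicts: term -> doc_id -> list of positions.
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for idx, token in enumerate(tokens):
            positions[token][doc_id].append(idx)
    # Convert to plain dicts and attach the per-document count.
    return {
        term: {doc_id: [len(pos), pos] for doc_id, pos in postings.items()}
        for term, postings in positions.items()
    }


docs = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
}
index = build_inverted_index(docs)
print(index["flow"])   # {'d2': [2, [3, 4]]}
print(index["studi"])  # {'d1': [1, [1]], 'd2': [1, [0]]}
```

Looking up a term is then a plain dictionary access, and `index.get(term, {})` gives an empty result for terms that never occur.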
Andrej Kesely
  • nice code, but d2 should be 2, you have both 1 and 2. And d1400 shouldn't be 1 – blackraven Sep 01 '22 at 00:43
  • Thank you Andrej! I will try that out and get back to you. – Hefe Sep 01 '22 at 01:08
  • Hey Andrej, realized I left something out of this question so wrote up a new one if you get the chance to see it: https://stackoverflow.com/q/73563630/15975987 – Hefe Sep 01 '22 at 02:17