
I'm trying to include my own data for ChatGPT to use by storing it in a vector database (Pinecone). I'm using the ChatGPT retrieval plugin, in Python, to vectorise the data and store it. The plugin can be found here: https://github.com/openai/chatgpt-retrieval-plugin

Following the guide at https://betterprogramming.pub/enhancing-chatgpt-with-infinite-external-memory-using-vector-database-and-chatgpt-retrieval-plugin-b6f4ea16ab8, everything works so far. However, I'm having trouble attaching the metadata, i.e. the source of the information, such as the author or URL.

I believe this needs to be done in the function below (note that I added the metadata part myself):

import json
import os

import requests

# Bearer token for the plugin's database interface (defined elsewhere,
# e.g. loaded from an environment variable)
DATABASE_INTERFACE_BEARER_TOKEN = os.environ["DATABASE_INTERFACE_BEARER_TOKEN"]


def upsert_file(directory: str):
    """
    Upload all files under a directory to the vector database.
    """
    url = "http://0.0.0.0:8000/upsert-file"
    headers = {"Authorization": "Bearer " + DATABASE_INTERFACE_BEARER_TOKEN}
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if not os.path.isfile(file_path):
            continue
        with open(file_path, "rb") as f:
            file_content = f.read()
        metadata = {
            "source": filename,  # Add your metadata values
            "author": "Tim Cook",
            "url": "Some fake url",
            # Add more metadata fields as needed
        }
        print(metadata)
        # Send the file plus the metadata as a JSON string in its own part
        files = {
            "file": (filename, file_content, "text/plain"),
            "metadata": (None, json.dumps(metadata), "application/json"),
        }
        response = requests.post(url, headers=headers, files=files, timeout=600)
        if response.status_code == 200:
            print(filename + " uploaded successfully.")
        else:
            print(
                f"Error: {response.status_code} {response.content} "
                f"for uploading {filename}"
            )

My issue is that the files still get vectorised and stored in Pinecone, but the metadata comes back as None, as shown below:

'metadata': {'source': 'file', 'source_id': None, 'url': None, 'created_at': None, 'author': None, 'document_id': 'Some_Doc_Id_here_that_is_not_None'}

My question is: how do I get the metadata stored? Why is it returning None for so many fields? I should also mention that for the line:

"metadata": (None, json.dumps(metadata), "application/json")

if I change None to anything else, say "testing", I get the error below when I try to upsert the files:

Error: 422 b'{"detail":[{"loc":["body","metadata"],"msg":"str type expected","type":"type_error.str"}]}'
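
(As far as I understand, passing None as the filename makes requests encode that part as a plain form field, whereas giving it a filename turns it into a file upload, so the server then sees an uploaded file instead of the str it validates against; that would explain the 422. A quick way to see the difference, using httpbin.org as a stand-in endpoint:)

import json
import requests

metadata = {"source": "file", "author": "Tim Cook"}

# No filename: the part is sent as a plain form field
ok = {"metadata": (None, json.dumps(metadata), "application/json")}
print(requests.post("https://httpbin.org/post", files=ok).json()["form"])

# With a filename: the part is sent as a file upload instead
bad = {"metadata": ("testing", json.dumps(metadata), "application/json")}
print(requests.post("https://httpbin.org/post", files=bad).json()["files"])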

1 Answer


For those who come across the same problem: the reason the metadata shows None is that it must contain the correct attributes. Digging into the retrieval plugin's code, the expected model is:

from enum import Enum
from typing import Optional

from pydantic import BaseModel


class Source(str, Enum):
    email = "email"
    file = "file"
    chat = "chat"


class DocumentMetadata(BaseModel):
    source: Optional[Source] = None
    source_id: Optional[str] = None
    url: Optional[str] = None
    created_at: Optional[str] = None
    author: Optional[str] = None
I was setting source to a filename, which is not a valid value for the Source enum.
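
You can see the failure directly by validating the old metadata against these models: it raises a pydantic ValidationError on the source field. The plugin appears to catch this and fall back to a default DocumentMetadata(source=Source.file), which matches the output in the question. A minimal check, assuming pydantic v1 and the models above:

from pydantic import ValidationError

try:
    DocumentMetadata.parse_raw('{"source": "my_notes.txt"}')
except ValidationError as e:
    print(e)
    # source -> value is not a valid enumeration member;
    # permitted: 'email', 'file', 'chat'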

I ended up updating the metadata to something like the below, and I can now retrieve it:

metadata = {
    "source": "file",
    "source_id": filename,
    "url": "https://example.com",
    "created_at": str(date.today()),
    "author": "Tim Cook",
}
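
To confirm the metadata is actually stored, you can query it back through the plugin's /query endpoint and filter on one of the fields. A sketch, using the same bearer token and assuming the query/response schema from the plugin repo (queries is a list of {query, filter, top_k} objects):

import requests

response = requests.post(
    "http://0.0.0.0:8000/query",
    headers={"Authorization": "Bearer " + DATABASE_INTERFACE_BEARER_TOKEN},
    json={"queries": [{"query": "anything", "filter": {"author": "Tim Cook"}, "top_k": 3}]},
    timeout=600,
)
for result in response.json()["results"]:
    for chunk in result["results"]:
        print(chunk["metadata"])  # should now include author, url, created_at, ...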