I'm trying to use my own data with ChatGPT by storing it in a vector database (Pinecone). I'm using the ChatGPT retrieval plugin from Python to vectorise the data and store it. The plugin can be found here: https://github.com/openai/chatgpt-retrieval-plugin
I have been following the guide at https://betterprogramming.pub/enhancing-chatgpt-with-infinite-external-memory-using-vector-database-and-chatgpt-retrieval-plugin-b6f4ea16ab8, and everything has worked so far. However, I'm having trouble accessing the metadata, i.e. the source of the information, such as the author or URL.
I believe this needs to be done in the function below. Note that I added the metadata part myself:
import json
import os

import requests

# DATABASE_INTERFACE_BEARER_TOKEN is assumed to be defined elsewhere,
# e.g. imported from a config module as in the guide.


def upsert_file(directory: str):
    """
    Upload all files under a directory to the vector database.
    """
    url = "http://0.0.0.0:8000/upsert-file"
    headers = {"Authorization": "Bearer " + DATABASE_INTERFACE_BEARER_TOKEN}
    files = []
    for filename in os.listdir(directory):
        if os.path.isfile(os.path.join(directory, filename)):
            file_path = os.path.join(directory, filename)
            with open(file_path, "rb") as f:
                file_content = f.read()
                # files.append(("file", (filename, file_content, "text/plain")))
            metadata = {
                "source": filename,  # Add your metadata values
                "author": "Tim Cook",
                "url": "Some fake url"
                # Add more metadata fields as needed
            }
            print(metadata)
            # Multipart form: the file itself plus the metadata as a JSON string
            files = {
                "file": (filename, file_content, "text/plain"),
                "metadata": (None, json.dumps(metadata), "application/json"),
            }
            # response = requests.post(url, headers=headers, files=files, timeout=600)
            response = requests.post(url,
                                     headers=headers,
                                     files=files,
                                     # data={"metadata": json.dumps(metadata)},
                                     timeout=600)
            if response.status_code == 200:
                print(filename + " uploaded successfully.")
            else:
                print(
                    f"Error: {response.status_code} {response.content} for uploading "
                    + filename)
My issue is that the files still get vectorised and stored in Pinecone, but most of the metadata fields come back as None, as shown below:
'metadata': {'source': 'file', 'source_id': None, 'url': None, 'created_at': None, 'author': None, 'document_id': 'Some_Doc_Id_here_that_is_not_None'}
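One way to narrow this down is to sanity-check the metadata before sending it. The sketch below is assumption-based: it takes the field names from the returned output above (source, source_id, url, created_at, author) and assumes that source is restricted to a small set of values such as email, file, or chat. Both assumptions should be verified against the plugin's own models. The helper name check_metadata is hypothetical.

```python
from typing import List

# ASSUMPTION: field names are taken from the returned metadata shown above.
ASSUMED_FIELDS = {"source", "source_id", "url", "created_at", "author"}
# ASSUMPTION: allowed values for "source"; verify against the plugin's models.
ASSUMED_SOURCES = {"email", "file", "chat"}


def check_metadata(metadata: dict) -> List[str]:
    """Return a list of human-readable problems found in the metadata dict."""
    problems = []
    # Flag keys that the assumed schema does not know about.
    for key in sorted(metadata.keys() - ASSUMED_FIELDS):
        problems.append(f"unknown field: {key}")
    # Flag a source value outside the assumed set.
    source = metadata.get("source")
    if source is not None and source not in ASSUMED_SOURCES:
        problems.append(
            f"source {source!r} is not one of {sorted(ASSUMED_SOURCES)}")
    return problems


if __name__ == "__main__":
    sample = {"source": "notes.txt", "author": "Tim Cook", "url": "Some fake url"}
    for problem in check_metadata(sample):
        print(problem)
```

Under these assumptions, using the filename as source would be flagged. If the server falls back to default metadata when validation fails, that would be consistent with the output above, although I have not confirmed that this is what happens.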
My question is: how do I get the metadata stored, and why is it returning None for so many fields? I should also mention that, for the line:
"metadata": (None, json.dumps(metadata), "application/json")
if I change None to anything else, say "testing", I end up with the error below when I try to upsert the files:
Error: 422 b'{"detail":[{"loc":["body","metadata"],"msg":"str type expected","type":"type_error.str"}]}'
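To see what changing that first tuple element does, the request can be built without sending it and the multipart body inspected. This is a minimal sketch, assuming the same requests library as above. The helper name multipart_body and the sample payload are hypothetical. My reading of the 422 (that a part carrying a filename is treated by the server as a file upload rather than a string field) is an assumption, not something I have confirmed.

```python
import json

import requests


def multipart_body(metadata_filename):
    """Build (but do not send) the upload request and return its body as text.

    metadata_filename is the first element of the "metadata" tuple:
    None in the original code, or a string such as "testing".
    """
    files = {
        # Hypothetical sample file contents, for illustration only.
        "file": ("example.txt", b"hello", "text/plain"),
        "metadata": (metadata_filename,
                     json.dumps({"author": "Tim Cook"}),
                     "application/json"),
    }
    # prepare() encodes the multipart body locally; nothing goes over the network.
    prepared = requests.Request(
        "POST", "http://0.0.0.0:8000/upsert-file", files=files).prepare()
    return prepared.body.decode("utf-8")


if __name__ == "__main__":
    print(multipart_body(None))
    print(multipart_body("testing"))
```

Comparing the two printed bodies shows whether the metadata part gains a filename in its Content-Disposition header when None is replaced with a string.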