I have a function which accepts a file path. It's as below:
def document_loader(doc_path: str) -> Optional[Document]:
    """ This function takes in a document in a particular format and
    converts it into a Langchain Document Object

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")
PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.
Now, when I call the function with a hardcoded file path string, as below:
document_loader('/Users/Documents/hack/data/abc.pdf')
The function works fine and is able to read the PDF file.
But now, if I let a user upload their PDF file via Streamlit's file_uploader(), as below:
uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
print(st.session_state.uploaded_file)
if uploaded_file is not None:
    filename = st.session_state.uploaded_file.name
    print(os.path.abspath(st.session_state.uploaded_file.name))
    document_loader(f'"{os.path.abspath(filename)}"')
I get the error:
ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url
This statement print(os.path.abspath(st.session_state.uploaded_file.name))
prints out the same path as the hardcoded one.
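Two things about that path may be worth checking, and they can be reproduced with the standard library alone. The error message shows the path with double quotes around it, which suggests the quotes added by the f-string have become part of the path itself. Also, os.path.abspath() only joins a name onto the current working directory and never checks that a file exists there. The snippet below is a minimal sketch of both effects; the temporary directory and the file names are illustrative stand-ins, not the real upload.

```python
import os
import tempfile

# Create a real file so there is something valid to compare against.
# (The directory and file name here are illustrative stand-ins.)
work_dir = tempfile.mkdtemp()
real_path = os.path.join(work_dir, "abc.pdf")
with open(real_path, "wb") as fh:
    fh.write(b"%PDF-1.4\n")

# The same path wrapped in literal double quotes, as in f'"{...}"'.
quoted_path = f'"{real_path}"'

print(os.path.isfile(real_path))    # the plain path points at the file
print(os.path.isfile(quoted_path))  # the quotes are part of the name, so no match

# abspath() only builds a string from the current working directory;
# it never checks whether the file is actually there.
missing = os.path.abspath("not_uploaded_here.pdf")
print(missing, os.path.isfile(missing))
```

So a printed path that looks right does not by itself show that the uploaded file is on disk at that location.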
Note: Streamlit is currently running on localhost on my laptop, and I am the "user" uploading a PDF via the locally running Streamlit app.
Edit1:
So as per @MAtchCatAnd I added tempfile and it WORKS. But with an issue:
The function that receives the temp file path re-runs every time the user interacts with the app. This is because the temp file path changes on each rerun, so the function is executed again even though I have decorated it with @st.cache_data.
The uploaded PDF file stays the same, so I don't want the function to re-run, as each run incurs some cost.
How can I fix this, given that Streamlit has deprecated st.cache and its allow_output_mutation=True parameter?
Here's the code:
@st.cache_data
def document_loader(doc_path: str) -> Optional[Document]:
    """ This function takes in a document in a particular format and
    converts it into a Langchain Document Object

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
if uploaded_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
    print(temp_file_path)
    custom_qa = document_loader(temp_file_path)
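One possible way around the changing temp path is to key the cache on the uploaded bytes, which stay the same across reruns, and to create the temp file inside the cached function. The sketch below uses only the standard library so it runs without Streamlit: functools.lru_cache stands in for @st.cache_data, and parse_pdf and load_from_bytes are hypothetical names, with parse_pdf acting as a placeholder for the PyPDFLoader call. These are assumptions for illustration, not the original code.

```python
import functools
import os
import tempfile

call_count = 0  # counts how often the expensive step really runs


def parse_pdf(path: str) -> int:
    """Hypothetical stand-in for PyPDFLoader(path).load(); returns the file size."""
    global call_count
    call_count += 1
    return os.path.getsize(path)


@functools.lru_cache(maxsize=8)  # stand-in for @st.cache_data
def load_from_bytes(data: bytes) -> int:
    # The temp file is created inside the cached function, so its random
    # name never becomes part of the cache key; only `data` does.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return parse_pdf(tmp_path)
    finally:
        os.remove(tmp_path)  # clean up, since delete=False was used


pdf_bytes = b"%PDF-1.4 dummy content"
first = load_from_bytes(pdf_bytes)
second = load_from_bytes(pdf_bytes)  # same bytes: served from the cache
print(first, second, call_count)
```

In the Streamlit app the equivalent call would presumably be load_from_bytes(uploaded_file.getvalue()) with @st.cache_data in place of lru_cache. Since st.cache_data hashes the function arguments, identical bytes should hit the cache even though a new temp file name is generated on every rerun.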