I have a function that accepts a file path:

def document_loader(doc_path: str) -> Optional[Document]:
        """ This function takes in a document in a particular format and 
        converts it into a Langchain Document Object 
        
        Args:
            doc_path (str): A string representing the path to the PDF document.

        Returns:
            Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
        """
        
        # try:
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
        print("Document loader done")

PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.

When I call the function with a hardcoded file path string, as below:

document_loader('/Users/Documents/hack/data/abc.pdf')

the function works fine and reads the PDF.

But when I let a user upload their PDF through Streamlit's file_uploader(), as below:

uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")
print(st.session_state.uploaded_file)

if uploaded_file is not None:
    filename = st.session_state.uploaded_file.name
    print(os.path.abspath(st.session_state.uploaded_file.name))
    document_loader(f'"{os.path.abspath(filename)}"')

I get the error:

ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url

This statement print(os.path.abspath(st.session_state.uploaded_file.name)) prints out the same path as the hardcoded one.

Note: Streamlit is currently running on localhost on my laptop, and I am the "user" uploading a PDF through the locally running Streamlit app.

Edit 1:

Following @MathCatsAnd's suggestion, I added tempfile and it works, but with one issue:

The function that receives the temp file path re-runs every time the user interacts with the app. The temp file path changes on every rerun, so the function's argument changes and it runs again even though I decorated it with @st.cache_data.

The uploaded PDF stays the same, so I don't want the function to re-run, because every run incurs a cost.

How can I fix this? I see that Streamlit has deprecated st.cache and its allow_output_mutation=True parameter.

Here's the code:

@st.cache_data
def document_loader(doc_path: str) -> Optional[Document]:
        """ This function takes in a document in a particular format and 
        converts it into a Langchain Document Object 

        Args:
            doc_path (str): A string representing the path to the PDF document.

        Returns:
            Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
        """

        # try:
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
        print("Document loader done")

uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")

if uploaded_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
        print(temp_file_path)

    custom_qa = document_loader(temp_file_path)
Baktaawar
  • You have an extra double quote inside your f-string - just do f'{os.path.abspath(filename)}' or heck remove the quotes because the f-string is completely superfluous if all it has is to produce the contained value, i.e. `document_loader(os.path.abspath(filename))` will do – metatoaster May 29 '23 at 01:47
  • You are creating a path that contains double-quote marks at the start and end - which doesn't match any actual file on your disk. If you wanted to use a f-string, it would simply be `f'{os.path.abspath(filename)}'` - but it's absurd to use a f-string at all when the contents is a single expression: just use `os.path.abspath(filename)`. – jasonharper May 29 '23 at 01:47
  • but the hardcoded string has double quotes in the file path name. I am trying to emulate that – Baktaawar May 29 '23 at 01:53
  • ok, I tried without f string and I get the same error: ValueError: File path /Users/Documents/hack/data/abc.pdf is not a valid file or url – Baktaawar May 29 '23 at 01:56
  • Well ... that is a >different< problem. Notice that the error message is different!! Perhaps the `filename` value is incorrect? Or the file is not being uploaded to the current directory. `abspath` doesn't check that it exists. – Stephen C May 29 '23 at 02:57

1 Answer

The object returned by st.file_uploader is a "file-like" object inheriting from BytesIO.

From the docs:

The UploadedFile class is a subclass of BytesIO, and therefore it is "file-like". This means you can pass them anywhere where a file is expected.

While the returned object does have a name attribute, it has no path. It exists only in memory and is not associated with a real, saved file. Although Streamlit may be run locally, it has a client-server structure in which the Python backend is usually on a different computer from the user's. The file_uploader widget is therefore not designed to provide any access or pointer to the user's file system.
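To see why the path in the question looked right but still failed, here is a minimal sketch using only the standard library. The literal "abc.pdf" is a stand-in for uploaded_file.name; the point is that os.path.abspath builds a path from the current working directory without ever touching the disk.

```python
import os

# "abc.pdf" stands in for uploaded_file.name, which is a bare file name.
name = "abc.pdf"

# abspath joins the name onto the current working directory.
# It does not check whether a file exists at the resulting path.
candidate = os.path.abspath(name)

print(candidate)
print(candidate == os.path.join(os.getcwd(), name))

# Whether the path exists depends entirely on what happens to be
# in the working directory, not on what the user uploaded.
print(os.path.exists(candidate))
```

The printed path only matches a real file if a file of that name already sits in the directory the app was started from.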

You should do one of the following:

  1. use a method that allows you to pass a file buffer instead of a path,
  2. save the file to a new, known path,
  3. use tempfiles
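As a rough sketch of option 2, the helper below writes an upload to a directory you choose and returns the resulting path. The names save_upload and upload_dir are hypothetical, and the sketch assumes only that the upload has the name attribute and getvalue() method used in the question. A BytesIO with a name attached stands in for the real uploaded file so the example runs without Streamlit.

```python
import io
import tempfile
from pathlib import Path


def save_upload(uploaded_file, upload_dir) -> str:
    """Write an in-memory upload into upload_dir and return the saved path."""
    upload_dir = Path(upload_dir)
    upload_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final component of the reported name.
    target = upload_dir / Path(uploaded_file.name).name
    target.write_bytes(uploaded_file.getvalue())
    return str(target)


# Stand-in for the object st.file_uploader returns: a BytesIO with a name.
fake_upload = io.BytesIO(b"%PDF-1.4 example bytes")
fake_upload.name = "abc.pdf"

# A fresh temporary directory plays the role of the known upload folder.
saved_path = save_upload(fake_upload, tempfile.mkdtemp())
print(saved_path)
```

In the app, the returned string could be passed to document_loader in place of the path built with os.path.abspath. Because the path depends only on the file name, re-uploading the same file yields the same path.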

Here is a brief example of working with temp files (another question about them may also be helpful):

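import streamlit as st
import tempfile
import pandas as pd

file = st.file_uploader('Upload a file', type='csv')

if file is not None:
    # Copy the in-memory upload to a real file on disk.
    # delete=False keeps the file after the with-block closes it.
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        # getvalue() returns every byte regardless of the read position
        tf.write(file.getvalue())
        tf_path = tf.name
    st.write(tf_path)
    # The path can now go to any function that expects a file path.
    df = pd.read_csv(tf_path)
    st.write(df)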
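With delete=False, nothing removes the temp file afterwards, so a long-running app can accumulate them. One possible way to handle that, sketched below with the standard library only, is a small context manager that yields a real path and deletes the file when the block ends. The name as_temp_path is hypothetical, and the bytes passed in stand in for what uploaded_file.getvalue() would return.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temp_path(data: bytes, suffix: str = ""):
    """Yield the path of a temp file holding data, then delete the file."""
    tf = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tf.write(data)
        # Close before yielding so other code can open the path.
        tf.close()
        yield tf.name
    finally:
        tf.close()  # harmless if already closed
        os.remove(tf.name)


# The bytes literal stands in for an uploaded file's contents.
with as_temp_path(b"a,b\n1,2\n", suffix=".csv") as path:
    with open(path) as fh:
        first_line = fh.readline()

print(first_line.strip())
print(os.path.exists(path))
```

This only suits loaders that finish reading inside the with-block. If the loader reads lazily, the file has to outlive the block and this pattern does not apply.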

Response to Edit 1

I would remove the caching and instead rely on st.session_state to store your results.

Create a spot in session state for the object you want at the beginning of your script

if 'qa' not in st.session_state:
    st.session_state.qa = None

Have your function return the object you want

def document_loader(doc_path: str) -> Optional[Document]:
    loader = PyPDFLoader(doc_path)
    return loader # or return loader.load(), whichever is more suitable

Check for results in session state before running the document loader

uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")

if uploaded_file is not None and st.session_state.qa is None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
        print(temp_file_path)

    st.session_state.qa = document_loader(temp_file_path)

custom_qa = st.session_state.qa

# put a check on custom_qa before continuing, either "is None" with  
# stop or "is not None" with the rest of your code nested inside
if custom_qa is None:
    st.stop()

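Add a way to reset by passing on_change=clear_qa to the file uploader, so that a new upload clears the stored result

def clear_qa():
    st.session_state.qa = None

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file", on_change=clear_qa)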
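As a variation on the session-state check, the stored result can be keyed on a hash of the uploaded bytes, so the loader runs again only when the content actually changes. This is a sketch under assumptions, not part of the original answer: load_once and the "qa_digest" key are hypothetical names, and a plain dict stands in for st.session_state, which supports the same membership test and item access.

```python
import hashlib


def load_once(state, data: bytes, loader):
    """Call loader(data) only when data differs from the previous call."""
    digest = hashlib.sha256(data).hexdigest()
    if "qa_digest" not in state or state["qa_digest"] != digest:
        state["qa"] = loader(data)
        state["qa_digest"] = digest
    return state["qa"]


# A dict and a counting function stand in for session state and the
# expensive document loader.
calls = []


def fake_loader(data: bytes) -> int:
    calls.append(data)
    return len(data)


state = {}
first = load_once(state, b"same pdf bytes", fake_loader)
second = load_once(state, b"same pdf bytes", fake_loader)
third = load_once(state, b"different pdf bytes", fake_loader)
print(len(calls))
```

In the app, loader would be a function that writes the bytes to a temp file and calls document_loader on that path, so the changing temp file name no longer decides whether the work is repeated.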
MathCatsAnd
  • Very good answer. Thanks a ton. However, I tried Tempfiles earlier, but it didn't work. pdf_bytes = uploaded_file.read() with tempfile.NamedTemporaryFile(delete=False) as tmp_file: tmp_filename = tmp_file.name tmp_file.write(pdf_bytes) – Baktaawar May 29 '23 at 02:39
  • It's a little hard to read your example with the formatting, but I added a simplified example of working with tempfiles. Maybe you've got the same structure already, but I thought it would be easier to write it out in code format. Let me know if the relevant modifications to your file type aren't working with that. – MathCatsAnd May 29 '23 at 03:04
  • I updated the questions with Edit on what code I wrote using tempfile and while that part works, it is rerunning the same function everytime even if the pdf file doesn't change. Pls check the code I added there – Baktaawar May 29 '23 at 03:17
  • I don't see anything with tempfiles in the question. All I see is the incorrect usage trying to get the path from a file in memory. `uploaded_file.name` is just a string with the filename, including the file extension. It makes no sense to call `os.path.abspath(uploaded_file.name)`. This would only coincidentally work if the file was already in your working directory. Did you forget to save your edit? Streamlit's basic structure is to rerun with every interaction, so if you don't want something to rerun, store a result in session state and check in line for that result ahead of that process. – MathCatsAnd May 29 '23 at 03:32
  • I have no idea why the Edit didn't get saved. I re-wrote the Edit. Pls check – Baktaawar May 29 '23 at 03:41
  • I've added a response to your edit. – MathCatsAnd May 29 '23 at 04:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253871/discussion-between-mathcatsand-and-baktaawar). – MathCatsAnd May 29 '23 at 04:57