
I'm building a Django app that lets users upload a CSV via a form using a FileField. Once the CSV is uploaded I use Pandas' read_csv(filename) to read it in so I can do some processing on it with Pandas.

I've recently started learning the really useful Dask library because the uploaded files can be larger than memory. Everything works fine with Pandas' pd.read_csv(filename), but when I try to use Dask's dd.read_csv(filename) I get the error "'InMemoryUploadedFile' object has no attribute 'startswith'".

I'm pretty new to Django, Pandas and Dask. I've searched high and low and can't find this error associated with Dask anywhere on Google.

Here is the code I'm trying to use (just the relevant bits... I hope):

Inside forms.py I have:

class ImportFileForm(forms.Form):
    file_name = forms.FileField(label='Select a csv', validators=[validate_file_extension, file_size])

Inside views.py

import pandas as pd
import codecs
import dask.array as da
import dask.dataframe as dd

from dask.distributed import Client
client = Client()

def import_csv(request):

    if request.method == 'POST':
        form = ImportFileForm(request.POST, request.FILES)
        if form.is_valid():

            utf8_file = codecs.EncodedFile(request.FILES['file_name'].open(), "utf-8")

            # IF I USE THIS PANDAS LINE IT WORKS AND I CAN THEN USE PANDAS TO PROCESS THE FILE
            #df_in = pd.read_csv(utf8_file)

            # IF I USE THIS DASK LINE IT DOES NOT WORK AND PRODUCES THE ERROR
            df_in = dd.read_csv(utf8_file)

And here is the error output I'm getting:

AttributeError at /import_data/import_csv/
'InMemoryUploadedFile' object has no attribute 'startswith'

/home/username/projects/myproject/import_data/services.py in save_imported_doc
    df_in = dd.read_csv(utf8_file) …
/home/username/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read
            **kwargs …
/home/username/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas
        **(storage_options or {}) …
/home/username/anaconda3/lib/python3.7/site-packages/dask/bytes/core.py in read_bytes
    fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs) …
/home/username/anaconda3/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths
        path = cls._strip_protocol(urlpath) …
/home/username/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py in _strip_protocol
        if path.startswith("file://"): …
/home/username/anaconda3/lib/python3.7/codecs.py in __getattr__
        return getattr(self.stream, name)

2 Answers

I finally got it working. Here's a Django-specific solution, building on the answer from @mdurant who thankfully pointed me in the right direction.

By default Django keeps uploaded files under 2.5MB in memory, so Dask can't access them the way Pandas does, because Dask expects a path to a file in actual storage. However, when a file is over 2.5MB, Django writes it to a temp folder, and that location can be retrieved with the temporary_file_path() method. That temp file path can then be passed directly to Dask. The Django docs have some really useful information about how uploads are handled behind the scenes: https://docs.djangoproject.com/en/3.0/ref/files/uploads/#custom-upload-handlers.
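
A rough sketch of handling both cases looks something like this (just an illustration rather than the code I ended up using; the hasattr check is one way to detect a TemporaryUploadedFile, and the form/field names are the ones from the question):

import pandas as pd
import dask.dataframe as dd

def import_csv(request):

    if request.method == 'POST':
        form = ImportFileForm(request.POST, request.FILES)
        if form.is_valid():

            uploaded = request.FILES['file_name']

            if hasattr(uploaded, 'temporary_file_path'):
                # TemporaryUploadedFile (over 2.5MB by default): Dask can read
                # straight from the temp file on disk
                df_in = dd.read_csv(uploaded.temporary_file_path())
            else:
                # InMemoryUploadedFile (2.5MB or under): no path on disk, so
                # fall back to Pandas and wrap the result for Dask
                df_in = dd.from_pandas(pd.read_csv(uploaded), npartitions=1)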

If you can't predict the size of user-uploaded files in advance (as in my case) and a file might come in under 2.5MB, you can instead change FILE_UPLOAD_HANDLERS in your Django settings so that every upload is written to a temp storage folder regardless of size, meaning Dask can always access it.

Here is how I changed my code, in case it's helpful for anyone else in the same situation.

In views.py

def import_csv(request):

    if request.method == 'POST':
        form = ImportFileForm(request.POST, request.FILES)
        if form.is_valid():

            # temporary_file_path() shows Dask where to find the file on disk
            df_in = dd.read_csv(request.FILES['file_name'].temporary_file_path())

And adding the setting below in settings.py makes Django always write uploaded files to temp storage, whether they're under 2.5MB or not, so they can always be accessed by Dask:

FILE_UPLOAD_HANDLERS = ['django.core.files.uploadhandler.TemporaryFileUploadHandler',]

It seems you are not passing a file on disc, but some Django-specific buffer object. Since you are expecting large files, you probably want to tell Django to stream the uploads directly to disc and give you the filename for Dask; i.e., is request.FILES['file_name'] actually somewhere in your storage? The error message seems to suggest not, in which case you need to configure Django (sorry, I don't know how).

Note that Dask can deal with in-memory file-like objects such as io.BytesIO, using the MemoryFileSystem, but this isn't very typical, and won't help with your memory issues.
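
For what it's worth, I believe something like the sketch below would work via fsspec's in-memory filesystem, but again this keeps everything in memory (the "memory://uploaded.csv" name is arbitrary, and this isn't code from the question):

import fsspec
import dask.dataframe as dd

# copy the uploaded bytes into fsspec's in-memory filesystem
csv_bytes = request.FILES['file_name'].read()
with fsspec.open("memory://uploaded.csv", "wb") as f:
    f.write(csv_bytes)

# Dask can then read it back through the memory:// protocol
df_in = dd.read_csv("memory://uploaded.csv")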

  • Thanks for getting back to me so quickly! I think you're right that request.FILES['file_name'] is temporarily held somewhere so you can access it before it's written to storage. I'll see if I can get Django to do as you've suggested as that makes perfect sense now you've mentioned it. Do you know why I'm able to access it using Pandas though? – data101 Dec 20 '19 at 16:13
  • Pandas accepts an arbitrary file-like object, but dask wants to glob a filename and find its size to be able to partition amongst workers. – mdurant Dec 20 '19 at 16:20
  • Ah okay I see. I guess as the Django app is pulling the file into somewhere within itself temporarily anyway it shouldn't be a problem to temporarily write the file to storage within the app so Dask can read it and then push it to S3 (I think!). I'll see if I can figure out how to do this and let you know if it worked. Thanks so much for your help I think you've pointed me in the right direction! – data101 Dec 20 '19 at 16:27
  • You were exactly right! I was able to find a way to get Django to write all uploaded files to storage as you suggested (turns out it actually does this by default for all files over 2.5MB anyway). There is also a Django specific command that gives the temporary saved location/filepath in storage which can then be put straight into dd.read_csv(). Thank you so much for your answer as I would never have worked out what was going on behind the scenes by myself! I've put a Django specific answer too with examples of the code I used to get it working in case it helps anyone else stuck on this too. – data101 Dec 21 '19 at 17:21