
I'm attempting to upload a CSV file to this site (http://www.ipm.ucdavis.edu/WEATHER/textupload.cgi). However, I've run into a few issues, and I think they stem from an incorrect MIME type (maybe).

I'm attempting to manually post the file via urllib2, so my code looks as follows:

import sys
import urllib
import urllib2
import mimetools, mimetypes
import os, stat
from cStringIO import StringIO

#============================
# Note: I found this recipe online. I can't remember where exactly though.. 
#=============================

class Callable:
    def __init__(self, anycallable):
        self.__call__ = anycallable

# Controls how sequences are encoded by urlencode(). If true, elements may be
# given multiple values by assigning a sequence.
doseq = 1

class MultipartPostHandler(urllib2.BaseHandler):
    handler_order = urllib2.HTTPHandler.handler_order - 10 # needs to run first

    def http_request(self, request):
        data = request.get_data()
        if data is not None and type(data) != str:
            v_files = []
            v_vars = []
            try:
                for (key, value) in data.items():
                    if type(value) == file:
                        v_files.append((key, value))
                    else:
                        v_vars.append((key, value))
            except TypeError:
                systype, value, traceback = sys.exc_info()
                raise TypeError, "not a valid non-string sequence or mapping object", traceback

            if len(v_files) == 0:
                data = urllib.urlencode(v_vars, doseq)
            else:
                boundary, data = self.multipart_encode(v_vars, v_files)

                contenttype = 'multipart/form-data; boundary=%s' % boundary
                if(request.has_header('Content-Type')
                   and request.get_header('Content-Type').find('multipart/form-data') != 0):
                    print "Replacing %s with %s" % (request.get_header('content-type'), 'multipart/form-data')
                request.add_unredirected_header('Content-Type', contenttype)

            request.add_data(data)

        return request

    def multipart_encode(vars, files, boundary = None, buf = None):
        if boundary is None:
            boundary = mimetools.choose_boundary()
        if buf is None:
            buf = StringIO()
        for(key, value) in vars:
            buf.write('--%s\r\n' % boundary)
            buf.write('Content-Disposition: form-data; name="%s"' % key)
            buf.write('\r\n\r\n' + value + '\r\n')
        for(key, fd) in files:
            file_size = os.fstat(fd.fileno())[stat.ST_SIZE]
            filename = fd.name.split('/')[-1]
            contenttype = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
            buf.write('--%s\r\n' % boundary)
            buf.write('Content-Disposition: form-data; name="%s"; filename="%s"\r\n' % (key, filename))
            buf.write('Content-Type: %s\r\n' % contenttype)
            # buffer += 'Content-Length: %s\r\n' % file_size
            fd.seek(0)
            buf.write('\r\n' + fd.read() + '\r\n')
        buf.write('--' + boundary + '--\r\n\r\n')
        buf = buf.getvalue()
        return boundary, buf
    multipart_encode = Callable(multipart_encode)

    https_request = http_request

import cookielib

cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),
                              MultipartPostHandler)

opener.addheaders = [(
        'User-agent',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6'
    )]


params = {"FILENAME": open("weather_scrape.csv", 'rb'),
          'CGIREF': '/calludt.cgi/DDFILE1',
          'USE': 'MODEL',
          'MODEL': 'CM',
          'CROP': 'APPLES',
          'METHOD': 'SS',
          'UNITS': 'E',
          'LOWTHRESHOLD': '50',
          'UPTHRESHOLD': '88',
          'CUTOFF': 'H',
          'COUNTY': 'AL',
          'ACTIVE': 'Y',
          'FROMMONTH': '3',
          'FROMDAY': '15',
          'FROMYEAR': '2013',
          'THRUMONTH': '5',
          'THRUDAY': '13',
          'THRUYEAR': '2013',
          'DATASOURCE': 'FILE'
          }

response = opener.open("http://www.ipm.ucdavis.edu/WEATHER/textupload.cgi", params)

Now, when I POST this, everything seems fine until I click the submit button on the page that the first POST returns. I then get this error message:

ERROR (bad data) in file 'weather.csv' at line 135.

Data record = [--192.168.117.2.1.4404.1368589639.796.1--]

Too few values found. Check delimiter specification.
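That data record looks a lot like the closing MIME boundary produced by mimetools.choose_boundary(), wrapped in --...--, which would suggest the CGI is reading part of the multipart envelope as if it were rows of the CSV. If it helps to see exactly what the recipe sends, its encoder can be called directly and the tail of the body inspected; a quick, untested sketch (using the MultipartPostHandler recipe defined above):

# Untested debugging sketch: build a minimal body by hand and look at its tail.
# The closing boundary should appear on its own line, after the CRLF that
# terminates the file data.
boundary, body = MultipartPostHandler.multipart_encode(
    [('USE', 'MODEL')],
    [('FILENAME', open('weather_scrape.csv', 'rb'))])
print repr(body[-300:])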

Now, upon inspecting the POST request that gets made when I go through the same steps in the browser, I notice that the Content-Type of the file part is very specific, namely:

------WebKitFormBoundaryfBp6Jfhv7LlPZLKd
Content-Disposition: form-data; name="FILENAME"; filename="weather.csv"
Content-Type: application/vnd.ms-excel

I'm not entirely sure if the Content-Type is what's causing the error, but it's the first thing I'm trying to rule out (as I don't know what is actually going wrong). I don't see any way to set the content type via urllib2, so after some googling, I stumbled upon urllib3.
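As an aside, before switching libraries: the recipe above picks each file part's Content-Type with mimetypes.guess_type(), falling back to application/octet-stream. If the MIME type really is the culprit, one untested way to make it match what the browser sends is to register the mapping before building the request:

# Untested tweak: make mimetypes.guess_type() report the type the browser
# uses for .csv uploads, so the recipe's file part gets that Content-Type.
import mimetypes
mimetypes.add_type('application/vnd.ms-excel', '.csv')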

urllib3 has a built-in file-posting capability, but I'm not entirely sure how to use it.

Filepost

urllib3.filepost.encode_multipart_formdata(fields, boundary=None)

    Encode a dictionary of fields using the multipart/form-data MIME format.

    Parameters:

    fields – Dictionary of fields or list of (key, value) or (key, value, MIME type)
    field tuples. The key is treated as the field name, and the value as the body of
    the form-data bytes. If the value is a tuple of two elements, then the first
    element is treated as the filename of the form-data section and a suitable MIME
    type is guessed based on the filename. If the value is a tuple of three elements,
    then the third element is treated as an explicit MIME type of the form-data
    section. Field names and filenames must be unicode.

    boundary – If not specified, then a random boundary will be generated using
    mimetools.choose_boundary().

urllib3.filepost.iter_fields(fields)

    Iterate over fields.

    Supports list of (k, v) tuples and dicts.
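Reading that docstring, the value of a file field apparently needs to be the file's contents rather than an open file object, optionally wrapped in a (filename, data) or (filename, data, MIME type) tuple so the part gets a filename and content type. A minimal, untested sketch of what that would presumably look like here:

import urllib3

csv_data = open('weather_scrape.csv', 'rb').read()

fields = {
    # Plain form fields: the value is used as the body of the part as-is.
    'CGIREF': '/calludt.cgi/DDFILE1',
    'USE': 'MODEL',
    # ... remaining plain fields unchanged ...
    # File field: (filename, data) lets urllib3 guess the type from the name;
    # a third element forces an explicit type, e.g. what the browser sends.
    'FILENAME': ('weather.csv', csv_data, 'application/vnd.ms-excel'),
}

body, content_type = urllib3.filepost.encode_multipart_formdata(fields)
# body is the encoded multipart payload; content_type is the
# 'multipart/form-data; boundary=...' header value to send with it.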

Using this library, I tried encoding the values as described in the docs, but I'm getting errors.

Initially, just to test things out, I tried passing the fields as a dict.

params = {"FILENAME" : open("weather.csv", 'rb'),
            'CGIREF' : '/calludt.cgi/DDFILE1',
            'USE':'MODEL',
            'MODEL':'CM',
            'CROP':'APPLES',
            'METHOD': 'SS',
            'UNITS' : 'E',
            'LOWTHRESHOLD': '50',
            'UPTHRESHOLD': '88',
            'CUTOFF':'H',
            'COUNTY':'AL',
            'ACTIVE':'Y',
            'FROMMONTH':'3',
            'FROMDAY':'15',
            'FROMYEAR': '2013',
            'THRUMONTH':'5',
            'THRUDAY':'13',
            'THRUYEAR':'2013',
            'DATASOURCE' : 'FILE'
            }

values = urllib3.filepost.encode_multipart_formdata(params)

However, this raises the following error:

    values = urllib3.filepost.encode_multipart_formdata(params)
  File "c:\python27\lib\site-packages\urllib3-dev-py2.7.egg\urllib3\filepost.py", line 90, in encode_multipart_formdata
    body.write(data)
TypeError: 'file' does not have the buffer interface

Not sure what caused that, I then tried passing in a list of (key, value, mimetype) tuples, but that also throws an error:

params = [
        ("FILENAME" , open("weather_scrape.csv"), 'application/vnd.ms-excel'),
        ('CGIREF' , '/calludt.cgi/DDFILE1'),
        ('USE','MODEL'),
        ('MODEL','CM'),
        ('CROP','APPLES'),
        ('METHOD', 'SS'),
        ('UNITS' , 'E'),
        ('LOWTHRESHOLD', '50'),
        ('UPTHRESHOLD', '88'),
        ('CUTOFF','H'),
        ('COUNTY','AL'),
        ('ACTIVE','Y'),
        ('FROMMONTH','3'),
        ('FROMDAY','15'),
        ('FROMYEAR', '2013'),
        ('THRUMONTH','5'),
        ('THRUDAY','13'),
        ('THRUYEAR','2013'),
        ('DATASOURCE', 'FILE')
        ]

values = urllib3.filepost.encode_multipart_formdata(params)



ValueError: too many values to unpack
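Judging by that traceback, the installed version unpacks each list element as exactly (key, value), so a three-element (key, value, mimetype) tuple blows up. As in the dict sketch above, the filename and MIME type apparently have to be nested inside the value instead; an untested sketch of the list form:

import urllib3

params = [
    # File field: the value itself is a (filename, data, mimetype) tuple.
    ('FILENAME', ('weather.csv',
                  open('weather_scrape.csv', 'rb').read(),
                  'application/vnd.ms-excel')),
    ('CGIREF', '/calludt.cgi/DDFILE1'),
    # ... remaining plain fields stay as (key, value) pairs ...
]

body, content_type = urllib3.filepost.encode_multipart_formdata(params)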
    You may want to use `requests`, it uses `urllib3` but provides you with a far nicer high-level API to work with. – Martijn Pieters May 15 '13 at 19:54
  • The particular urllib3 error you're getting is because of `open("weather_scrape.csv", 'rb')` in your dict example. Try doing `open("weather_scrape.csv").read()` instead. – shazow May 15 '13 at 20:29

1 Answer


If you wanted to use urllib3 for this, it would look something like this:

import urllib3

http = urllib3.PoolManager()

headers = urllib3.make_headers(user_agent='Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6')
url = "http://www.ipm.ucdavis.edu/WEATHER/textupload.cgi"
csv_data = open("weather_scrape.csv").read()

params = {
    "FILENAME": csv_data,
    'CGIREF': '/calludt.cgi/DDFILE1',
    'USE': 'MODEL',
    'MODEL': 'CM',
    'CROP': 'APPLES',
    'METHOD': 'SS',
    'UNITS' : 'E',
    'LOWTHRESHOLD': '50',
    'UPTHRESHOLD': '88',
    'CUTOFF': 'H',
    'COUNTY': 'AL',
    'ACTIVE': 'Y',
    'FROMMONTH': '3',
    'FROMDAY': '15',
    'FROMYEAR': '2013',
    'THRUMONTH': '5',
    'THRUDAY': '13',
    'THRUYEAR': '2013',
    'DATASOURCE' : 'FILE',
}

response = http.request('POST', url, params, headers)

I couldn't test this with your target url and csv data set, so it may have some small bugs in it. But that's the general idea.
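One hedged tweak worth trying on top of this: because the raw CSV bytes are passed as the field value, the encoded FILENAME part carries no filename or explicit MIME type. If the CGI turns out to care about the filename="weather.csv" / application/vnd.ms-excel headers seen in the browser capture, the field could instead be given as a tuple (assuming the installed urllib3 supports the (filename, data, content_type) value form):

# Hypothetical variation on the params dict above: attach a filename and an
# explicit MIME type to the file part before calling http.request().
params['FILENAME'] = ('weather.csv', csv_data, 'application/vnd.ms-excel')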

  • read() reads the entire file. This could be an issue for very large files. Is there a way to pass a file or some such object (may be your own class) and have http.request() read and stream the data on the fly? – Thiagarajan Hariharan Jun 21 '16 at 21:15
  • @ThiagarajanHariharan The response object is a file-like object, so you can use normal buffered io primitives to copy it, like io.BufferedReader in the third example here: https://urllib3.readthedocs.io/en/latest/#usage Also take a look at https://stackoverflow.com/questions/27387783/how-to-download-a-file-with-urllib3 – shazow Jun 23 '16 at 00:31
  • I was referring to the request. Doesn't open('file').read() return the entire contents of the file? – Thiagarajan Hariharan Jun 24 '16 at 00:57
  • Ah, yes. The problem is that the HTTP protocol requires knowing things about the file ahead of time (like its total encoded length), so streaming reads is tricky. I have an old old branch where we tried to implement it, but it got pretty messy and hasn't been merged yet. :/ Related issue: https://github.com/shazow/urllib3/issues/51 – shazow Jun 24 '16 at 14:46