Introduction
Background
I'm writing a script to upload stuff including files using the multipart/form-data
content type defined in RFC 2388. In the long run, I'm trying to provide a simple Python script to do uploads of binary packages for github, which involves sending form-like data to Amazon S3.
Related
This question has already asked about how to do this, but it is without an accepted answer so far, and the more useful of the two answers it currently has points to these recipes which in turn build the whole message manually. I am somewhat concerned about this approach, particularly with regard to charsets and binary content.
There is also this question, with its currently highest-scoring answer suggesting the MultipartPostHandler
module. But that is not much different from the recipes I mentioned, and therefore my concerns apply tho that as well.
Concerns
Binary content
RFC 2388 Section 4.3 explicitely states that content is expected to be 7 bit unless declared otherwise, and therefore a Content-Transfer-Encoding
header might be required. Does that mean I'd have to Base64-encode binary file content? Or would Content-Transfer-Encoding: 8bit
be sufficient for arbitrary files? Or should that read Content-Transfer-Encoding: binary
?
Charset for header fields
Header fields in general, and the filename
header field in particular, are ASCII only by default. I'd like my method to be able to pass non-ASCII file names as well. I know that for my current application of uploading stuff for github, I probably won't need that as the file name is given in a separate field. But I'd like my code to be reusable, so I'd rather encode the file name parameter in a conforming way. RFC 2388 Section 4.4 advises the format introduced in RFC 2231, e.g. filename*=utf-8''t%C3%A4st.txt
.
My approach
Using python libraries
As multipart/form-data
is essentially a MIME type, I thought that it should be possible to use the email
package from the standard python libraries to compose my post. The rather complicated handling of non-ASCII header fields in particular is something I'd like to delegate.
Work so far
So I wrote the following code:
#!/usr/bin/python3.2
import email.charset
import email.generator
import email.header
import email.mime.application
import email.mime.multipart
import email.mime.text
import io
import sys
class FormData(email.mime.multipart.MIMEMultipart):
def __init__(self):
email.mime.multipart.MIMEMultipart.__init__(self, 'form-data')
def setText(self, name, value):
part = email.mime.text.MIMEText(value, _charset='utf-8')
part.add_header('Content-Disposition', 'form-data', name=name)
self.attach(part)
return part
def setFile(self, name, value, filename, mimetype=None):
part = email.mime.application.MIMEApplication(value)
part.add_header('Content-Disposition', 'form-data',
name=name, filename=filename)
if mimetype is not None:
part.set_type(mimetype)
self.attach(part)
return part
def http_body(self):
b = io.BytesIO()
gen = email.generator.BytesGenerator(b, False, 0)
gen.flatten(self, False, '\r\n')
b.write(b'\r\n')
b = b.getvalue()
pos = b.find(b'\r\n\r\n')
assert pos >= 0
return b[pos + 4:]
fd = FormData()
fd.setText('foo', 'bar')
fd.setText('täst', 'Täst')
fd.setFile('file', b'abcdef'*50, 'Täst.txt')
sys.stdout.buffer.write(fd.http_body())
The result looks like this:
--===============6469538197104697019==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: form-data; name="foo"
YmFy
--===============6469538197104697019==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: form-data; name*=utf-8''t%C3%A4st
VMOkc3Q=
--===============6469538197104697019==
Content-Type: application/octet-stream
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: form-data; name="file"; filename*=utf-8''T%C3%A4st.txt
YWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJj
ZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVm
YWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJj
ZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVm
YWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJjZGVmYWJj
ZGVmYWJjZGVmYWJjZGVm
--===============6469538197104697019==--
It does seem to handle headers reasonably well. Binary file content will get base64-encoded, which might be avoidable but which should work well enough. What worries me are the text fields in between. They are base64-encoded as well. I think that according to the standard, this should work well enough, but I'd rather have plain text in there, just in case some dumb framework has to deal with the data at an intermediate level and does not know about Base64 encoded data.
Questions
- Can I use 8 bit data for my text fields and still conform to the specification?
- Can I get the email package to serialize my text fields as 8 bit data without extra encoding?
- If I have to stick to some 7 bit encoding, can I get the implementation to use quoted printable for those text parts where that encoding is shorter than base64?
- Can I avoid base64 encoding for binary file content as well?
- If I can avoid it, should I write the
Content-Transfer-Encoding
as8bit
or asbinary
? If I had to serialize the body myself, how could I use the(email.header
package on its own to just format header values?email.utils.encode_rfc2231
does this.)- Is there some implementation that already did all I'm trying to do?
These questions are very closely related, and could be summarized as “how would you implement this”. In many cases, answering one question either answers or obsoletes another one. So I hope you agree that a single post for all of them is appropriate.