
I am making a REST API using FastAPI and Python. For example, here is an endpoint that takes an uploaded file and returns an array with the length of each line.

from fastapi import APIRouter, UploadFile
import io

router = APIRouter()

@router.post('/api/upload1')
async def post_a_file(file: UploadFile):    
    result = []
    f = io.TextIOWrapper(file.file, encoding='utf-8')
    while True:
        s = f.readline()
        if not s: break
        result.append(len(s))
    return result

However this fails with error...

f = io.TextIOWrapper(file.file, encoding='utf-8')
AttributeError: 'SpooledTemporaryFile' object has no attribute 'readable'

If I change it to

    f = file.file
    while True:
        s = f.readline().decode('utf-8')

then it works, but that is stupid, because reading a "line" of bytes doesn't make sense.

What is the right way to do this?

EDIT: As I learned (see comments), it is not wrong to read a "line" of bytes, because the line break characters (either 0x0A or 0x0D0A) are encoded the same way in UTF-8 and other ASCII-compatible character sets.

John Henckel
  • You can iterate over the file line by line with `result = [len(line) for line in file.file]`, but if the file is opened in binary mode *(which seems to be the case)* you will get each line as `bytes`, which you should `.decode()` if you want to work with strings. – Olvin Roght May 26 '23 at 21:23
  • It's not really _stupid_ per se either, since you're explicitly asking "split my binary data by values that match what we use to express a new line" when calling `readline`; the concept of a "line" can still have meaning without decoding the content of that line. – MatsLindh May 26 '23 at 22:03
  • You might find [this answer](https://stackoverflow.com/a/70657621/17865804) and [this answer](https://stackoverflow.com/a/73443824/17865804) helpful – Chris May 27 '23 at 05:41
  • Also, you might want to have a look at [`bytes.splitlines()`](https://docs.python.org/3/library/stdtypes.html#bytes.splitlines) – Chris May 27 '23 at 05:41
  • @MatsLindh I should have said "naive". The naive approach would be to split on 0x0A. However, without decoding it is impossible to tell whether the 0x0A is a single-byte character (line feed) or part of a larger multibyte character (which is not a line feed). Or is there some rule in UTF-8 that 0x0A can never be part of a multibyte character? – John Henckel May 29 '23 at 15:47
  • [There is such a rule - anything up to 0x7F will never appear in a multibyte UTF-8 codepoint](https://en.wikipedia.org/wiki/UTF-8#Encoding). The most significant bit is _always_ set for each byte in a multi-byte UTF-8 codepoint. You could argue that there might be encodings where this is not true, but actually most have some sort of variant of the same thing (including SHIFT-JIS where [anything below 0x20 won't appear in multi-byte encodings](https://stackoverflow.com/questions/724247/newline-control-characters-in-multi-byte-character-sets)). Newlines are simply ingrained in low level charsets – MatsLindh May 29 '23 at 19:59
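
The UTF-8 rule described in the comments can be checked with a short sketch. The helper name `multibyte_bytes_have_high_bit` and the sample strings are made up for illustration; the code only spot-checks a few characters rather than proving the property for every code point.

```python
def multibyte_bytes_have_high_bit(text: str) -> bool:
    """Return True if every byte of every multi-byte UTF-8 character
    in `text` is 0x80 or above (most significant bit set)."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# Illustrative sample containing 2-byte and 3-byte characters.
sample = "naïve 日本語 €\n"
high_bit_ok = multibyte_bytes_have_high_bit(sample)

# Because 0x0A never occurs inside a multi-byte sequence, splitting the
# raw bytes on b"\n" yields the same lines as splitting the decoded text.
data = "héllo\nwörld\n日本語\n".encode("utf-8")
byte_lines = [chunk.decode("utf-8") for chunk in data.split(b"\n")]
text_lines = data.decode("utf-8").split("\n")
same_lines = byte_lines == text_lines
```

This is why reading a "line" of bytes and decoding it afterwards gives the same result as decoding first, at least for UTF-8 input.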

1 Answer

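The problem is that the object FastAPI exposes as UploadFile.file is file-like (a SpooledTemporaryFile), but on Python versions before 3.11 it does not have all the attributes of a regular file object. io.TextIOWrapper looks up readable on the stream it wraps, which is why you see that error.

Wrapping the stream in io.TextIOWrapper, as you did, is still a reasonable way to apply the encoding, and it works on Python 3.11 and later, where SpooledTemporaryFile fully implements the io base classes. Because file.file is an ordinary synchronous file object, iterate over it with a plain for loop instead of while True; async with and async for do not apply to it.

This is the updated code:

from fastapi import APIRouter, UploadFile
import io

router = APIRouter()

@router.post('/api/upload1')
async def post_a_file(file: UploadFile):
    # Needs Python 3.11+, where SpooledTemporaryFile provides readable()
    text = io.TextIOWrapper(file.file, encoding='utf-8')
    try:
        # Iterating a text wrapper yields one decoded line at a time
        result = [len(line) for line in text]
    finally:
        # Release the underlying file so the wrapper does not close it
        text.detach()
    return result

  • io.TextIOWrapper decodes the uploaded bytes as UTF-8 while reading.

  • The for loop (here a list comprehension) iterates over the file line by line and collects each line's length in the result list.

  • detach() separates the wrapper from the upload's file, so closing the file is left to FastAPI.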
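
For Python versions where wrapping the upload in io.TextIOWrapper fails as described in the question, a version-independent alternative is to iterate over the binary file and decode each line, as suggested in the comments. The following is a minimal sketch, not FastAPI-specific: the helper name `line_lengths` is hypothetical, and a `tempfile.SpooledTemporaryFile` stands in for `file.file` so the example runs without a server.

```python
import tempfile
from typing import Iterable, List


def line_lengths(binary_lines: Iterable[bytes], encoding: str = "utf-8") -> List[int]:
    """Return the length in characters (not bytes) of each line.

    `binary_lines` is any iterable of byte lines, such as a file opened
    in binary mode. The trailing newline is counted, matching what
    len(readline()) would report.
    """
    return [len(raw_line.decode(encoding)) for raw_line in binary_lines]


# Stand-in for an uploaded file: the same kind of object as UploadFile.file.
with tempfile.SpooledTemporaryFile() as upload:
    upload.write("héllo\nwörld\n".encode("utf-8"))
    upload.seek(0)
    demo_lengths = line_lengths(upload)

print(demo_lengths)  # [6, 6]
```

Inside the endpoint this would be called as `line_lengths(file.file)`. Each line is decoded only after it has been split on the newline byte, which is safe for UTF-8 because 0x0A never appears inside a multi-byte character.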

I hope this helps.