Users can upload either a UTF-8 or a UTF-16 CSV file: the attendance report they download from Microsoft Teams. I wrote the following code to handle both encodings, but it has a strange issue that I am unable to fix.

excel_file = request.FILES['excel-file']
try:
    print('try case')
    df = pd.read_csv(excel_file)
    csv_type = 'a'
    print(df)
except UnicodeDecodeError:
    print('except case')
    from io import BytesIO
    df = pd.read_csv(BytesIO(excel_file.read().decode('UTF-16').encode('UTF-8')), sep='\\')
    csv_type = 'b'
    print(df)
except Exception as e:
    print("Incorrect CSV file format", e)

The `try` branch handles the UTF-8 file and the `except UnicodeDecodeError` branch handles the UTF-16 file. Each branch works fine when I run it on its own, but the code fails once I combine them in a try/except block. With the code above, the UTF-8 case works, but the UTF-16 case raises a `No columns to parse from file` error. If I move the except code into the try block, UTF-16 works, but that code also runs for UTF-8 files and produces wrong data. How can I handle both cases? I also haven't found a way to detect the file's encoding.

Sujil Devkota
  • Can you just pass `encoding="utf-16"` to [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)? – Michael Delgado Sep 26 '22 at 06:15
  • That didn't work for me, and it isn't actually the problem. Each branch works when I run it separately: if I upload a UTF-8 file and run only the try code, without the try/except block, it works. If I upload a UTF-16 file and run only the except code, without the try/except block, it works. Inside the try/except block, though, it fails. – Sujil Devkota Sep 26 '22 at 06:27
  • Can you provide an example? The suggestion @MichaelDelgado gave worked for me here, as did this: `with open(excel_file, mode='r', encoding='utf-16') as file: df = pd.read_csv(BytesIO(file.read().encode('UTF-8')), sep=None)`. – Ingwersen_erik Sep 26 '22 at 10:38
  • For the Teams attendance file I have access to, there are multiple sections in addition to the list of attendees. Once it is cleaned to include only the attendees, reading it with `pd.read_csv` and the encoding parameter, as suggested above, works. You also asked whether the encoding of a file can be determined. See this [SO thread](https://stackoverflow.com/a/33819765/2886158), where `chardet.detect(f.read())` is used for that. – Heelara Sep 26 '22 at 12:23
  • @Ingwersen_erik The code you mention works if I run it directly in the terminal. But this is a file upload on a Django site, where the uploaded file is passed as an in-memory object, so the code I wrote for UTF-16 works fine for this case. The main issue is just that something changes when `read_csv` executes first in the try block, which causes the error in the except section. Please read the question again, thank you. – Sujil Devkota Sep 26 '22 at 15:08
  • @Heelara Thank you for your response. In my case the users don't clean anything; they want to upload the file exactly as they download it from Teams. There are three types of attendance files they can download: from the chat, from the attendance menu, and from the Teams admin site. Yes, determining the encoding would solve my case. I also found another library, magic, and fixed it using that. But I'd also like to know how I can flush or copy the `excel_file` variable separately, because once `read_csv` has run on it, it causes the problem; `try except` doesn't solve this issue, only `if else` does. – Sujil Devkota Sep 26 '22 at 15:14
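
On the "flush or copy the `excel_file` variable" point raised in the comments: a likely cause of the `No columns to parse from file` error is that the failed `pd.read_csv` call in the try block has already advanced the upload's read position, so the later `excel_file.read()` returns little or nothing. Below is a minimal sketch of one way around that, not a tested fix for the Teams files themselves. The helper name `read_upload_text` is hypothetical, and it assumes the UTF-16 files start with a byte-order mark (BOM) and that the upload object supports `seek()` and `read()`.

```python
import codecs
import io


def read_upload_text(uploaded):
    """Return the decoded text of an uploaded file (hypothetical helper).

    Assumes UTF-16 files start with a BOM; anything else is treated as UTF-8.
    """
    # Rewind first: an earlier, failed parse may have consumed the stream.
    uploaded.seek(0)
    raw = uploaded.read()

    # A UTF-16 BOM is the two bytes FF FE (little-endian) or FE FF (big-endian).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")  # the codec consumes the BOM itself

    # "utf-8-sig" also accepts plain UTF-8 and drops an optional UTF-8 BOM.
    return raw.decode("utf-8-sig")


# Stand-in for the upload: a UTF-16 stream that was already partly read.
sample = "Name\tDuration\nAna\t10\n"
stream = io.BytesIO(sample.encode("utf-16"))
stream.read(7)  # simulates the read position left behind by a failed parse
print(read_upload_text(stream) == sample)
```

With pandas, the returned text could then be parsed in a single call, for example `pd.read_csv(io.StringIO(read_upload_text(excel_file)), sep=None, engine="python")`, so no try/except on the encoding is needed. Reading the bytes once and deciding by BOM also avoids running the UTF-16 path on UTF-8 files.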
