
Someone helped me write a program that converts PDF files to CSV, but it doesn't specify an encoding. Here is the code:

import os
import glob
import tabula

path="/Users/username/Downloads/"
for filepath in glob.glob(path+'*.pdf'):
    name=os.path.basename(filepath)
    tabula.convert_into(input_path=filepath, 
                        output_path=path+name+".csv",
                        pages="all")

How can I get the CSV files written with utf-8 or cp1252 encoding?

Thanks for helping

Error I'm getting:

[image: Error]

Kenny
  • PDFs are binary files. You can't expect to be able to decode them with any text encoding, because they're not strictly text. – Brian61354270 Jan 25 '23 at 02:10

1 Answer


You can use the chardet library to detect the encoding of the files generated by tabula, and then use pandas to rewrite them in the encoding you want.

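import glob

import chardet
import pandas as pd

path = "/Users/username/Downloads/"
# Re-encode every CSV that tabula wrote into the folder.
for filepath in glob.glob(path + '*.csv'):
    # Read the raw bytes so chardet can guess the current encoding.
    with open(filepath, 'rb') as f:
        result = chardet.detect(f.read())
    # Load with the detected encoding, then overwrite the file as UTF-8.
    df = pd.read_csv(filepath, encoding=result['encoding'])
    df.to_csv(filepath, index=False, encoding='utf-8')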
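
If you'd rather not parse the CSV at all, a plain-text re-encode should also work once you know the source encoding (for example from the chardet guess). This is only a sketch: `reencode_file` is a helper name I made up for illustration, and the demo file and its contents are invented. It uses nothing but the standard library.

```python
import os
import tempfile


def reencode_file(filepath, src_encoding, dst_encoding="utf-8"):
    """Hypothetical helper: rewrite a text file in place in a new encoding."""
    # newline="" leaves the original line endings alone, which matters for CSV.
    with open(filepath, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(filepath, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)


if __name__ == "__main__":
    # Demo on a throwaway file so nothing in Downloads is touched.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.csv")
        with open(demo, "wb") as f:
            f.write("name,city\r\nRené,Zürich\r\n".encode("cp1252"))
        reencode_file(demo, "cp1252", "utf-8")
        with open(demo, "rb") as f:
            print(f.read().decode("utf-8"))
```

Pass `dst_encoding="cp1252"` if you want Windows-1252 output instead. Note that writing cp1252 raises `UnicodeEncodeError` if the text contains a character that encoding cannot represent.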
Lahcen YAMOUN