How to read Arabic text from PDF using Python script

Question

I have a Python script that reads PDF files and converts them to text files.

The problem occurs when I try to read Arabic text from PDF files. I know that the error is in the encoding process, but I don't know how to fix it.

The script processes Arabic PDF files, but the resulting text file is empty and this error is displayed:

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in <module>
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)

            list.append(t)

m=len(list)
print (m)
i=0
while i<=m-1:

    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail

    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1

Is there an exception or does the script exit silently? Does it work as expected for PDFs that contain only text written with Latin script? — lenz, Dec 20 '17 at 09:01
@lenz The script works as expected, with no error, on non-Arabic content, but when it comes to Arabic it converts the PDF to an empty text file. – Rany Fahed, Dec 20 '17 at 09:17
Oh I see. You have to write `content = content.encode('utf-8')` on line 68. String methods never modify strings in-place, you always have to capture the return value. — lenz, Dec 20 '17 at 10:48
Rany, did this work? Because once you fixed your code, I suggest you delete this post, since it's very unlikely to help future readers. Your problem turned out to have nothing to do with encoding, Arabic, or PDF – it's simply a bug that shows up when the content contains non-ASCII characters. — lenz, Dec 20 '17 at 16:35
@lenz The error is gone, but the converted file is still empty. – Rany Fahed, Dec 21 '17 at 06:43

score 1 · Accepted Answer · answered Dec 21 '17 at 07:41

You have a couple of problems:

content.encode('utf-8') doesn't do anything. The return value is the encoded content, but you have to assign it to a variable. Better yet, open the file with an encoding, and write Unicode strings to that file. content appears to be Unicode data.

Example (works for both Python 2 and 3):

import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
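
To see why the bare content.encode('utf-8') call in the question has no effect, here is a minimal sketch in Python 3 syntax. The sample string is an assumption chosen for illustration and is not taken from the asker's PDF.

```python
# Hypothetical sample text: five Arabic letters written as escapes.
content = u"\u0645\u0631\u062d\u0628\u0627"

# encode() returns a new bytes object. Calling it without capturing
# the result leaves `content` exactly as it was.
content.encode("utf-8")
still_text = isinstance(content, type(u""))

# Capturing the return value is what actually yields encoded data.
encoded = content.encode("utf-8")

print(still_text)                 # content is still a text string
print(type(encoded).__name__)     # the encoded copy is a separate object
print(len(content), len(encoded)) # characters vs. UTF-8 bytes
```

Each of these Arabic letters occupies two bytes in UTF-8, so the encoded copy is twice as long as the text it came from.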

If you don't close the file properly, you may see no content because the file is not flushed to disk. You have f.close not f.close(). It's better to use with, which ensures the file is closed when the block exits.

Example:

import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)

In Python 3, you don't need to import and use io.open but it still works. open is equivalent. Python 2 needs the io.open form.
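
As a rough check that this approach preserves non-ASCII text, the following sketch writes a Unicode string with io.open inside a with block and reads it back. The helper names write_text and read_text, the sample string, and the temporary file path are all assumptions made for illustration.

```python
import io
import os
import tempfile


def write_text(path, content):
    """Write a Unicode string as UTF-8; the with block closes the file."""
    with io.open(path, "w", encoding="utf8") as f:
        f.write(content)


def read_text(path):
    """Read a UTF-8 text file back into a Unicode string."""
    with io.open(path, "r", encoding="utf8") as f:
        return f.read()


# Hypothetical sample text standing in for the extracted PDF content.
sample = u"\u0645\u0631\u062d\u0628\u0627\n"
out_path = os.path.join(tempfile.mkdtemp(), "sample.txt")

write_text(out_path, sample)
round_trip = read_text(out_path)

print(round_trip == sample)       # the text survives the round trip
print(os.path.getsize(out_path))  # the file is not empty on disk
```

If the file on disk is still empty after a change like this, the extraction step itself is probably returning an empty string, which is a separate problem from encoding.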

Mark Tolonen


I used your answer to fix my code. Now it converts the Arabic PDF text into a TXT file, but with unreadable characters. – Rany Fahed Dec 21 '17 at 08:11
@RanyFahed What software do you use to inspect the text file? The viewer/editor might be using the wrong encoding. – lenz Dec 21 '17 at 18:30
@RanyFahed Also since it looks like you are on Windows, many Windows programs assume a localized encoding such as Windows-1252 on U.S. Windows. You can use `utf-8-sig` to write a byte order mark (BOM) signature and some programs recognize this to know to use UTF-8. – Mark Tolonen Dec 21 '17 at 21:30
@lenz For PDF files I am using PDF Complete; for TXT files I am using Notepad++. – Rany Fahed Dec 22 '17 at 06:00
@MarkTolonen I did not understand your comment. – Rany Fahed Dec 22 '17 at 06:02
@RanyFahed Use `with io.open(name, 'w', encoding='utf-8-sig')` and see if Notepad++ is able to display the text properly. – lenz Dec 22 '17 at 09:23
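
For readers who also found the BOM suggestion unclear, here is a small sketch of what the utf-8-sig codec does when writing. The sample string and the temporary file path are assumptions for illustration only.

```python
import io
import os
import tempfile

# Hypothetical sample text standing in for the extracted PDF content.
sample = u"\u0645\u0631\u062d\u0628\u0627"
bom_path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# utf-8-sig writes the three-byte UTF-8 signature before the text.
with io.open(bom_path, "w", encoding="utf-8-sig") as f:
    f.write(sample)

# Read the raw bytes to inspect what actually landed on disk.
with io.open(bom_path, "rb") as f:
    raw = f.read()

print(raw[:3] == b"\xef\xbb\xbf")        # the file starts with the BOM
print(raw[3:].decode("utf-8") == sample)  # the rest is plain UTF-8
```

Editors that look for this signature can then pick UTF-8 instead of falling back to a localized code page.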

score 0 · Answer 2 · answered Nov 19 '21 at 15:25

You can use another library called pdfplumber instead of pyPdf or PyPDF2:

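import arabic_reshaper
import pdfplumber
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    # Pages are zero-indexed, so pages[10] is the eleventh page
    my_page = pdf.pages[10]
    page_text = my_page.extract_text()
    # Convert Arabic letters to their contextual (connected) forms
    reshaped_text = arabic_reshaper.reshape(page_text)
    # Reorder right-to-left text so it displays correctly
    bidi_text = get_display(reshaped_text)
    print(bidi_text)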
