I want to save my OCR output to a text file so that each line of text appears on its own row. Currently the file contains the literal characters \n between lines instead of actual line breaks.

from PIL import Image 
import pytesseract 
import sys 
from pdf2image import convert_from_path 
import os 



PDF_file = "F:/ABC/Doc_1.pdf"

pages = convert_from_path(PDF_file, 500) 
image_counter = 1

for page in pages: 
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG') 
    image_counter = image_counter + 1

filelimit = image_counter-1
outfile = "F:/ABC/intermediate_steps/out_text.txt"


f = open(outfile, "a") 

for i in range(1, 2): 

    filename = "page_"+str(i)+".jpg"
    import pytesseract 
    pytesseract.pytesseract.tesseract_cmd = r"\ABC\opencv-text-detection\Tesseract-OCR\tesseract.exe"
    from pytesseract import pytesseract
    text = str(((pytesseract.image_to_string(Image.open(filename)))))  
    text = text.replace('-\n', '')   
    #text = text.splitlines()
    f.writelines("Data Extracted from next page starts now.")
    f.writelines(str(text.encode('utf-8')))

f.close() 

For example, the expected output is:

ABC
DEF
GHI

Current output:

ABC\nDEF\nGHI\n

Current and Expected Output

marc_s
gaurav2141
  • I don't get your question. What's the issue? – m02ph3u5 Jul 28 '19 at 15:37
  • @m02ph3u5, I want the extracted output saved in a text file where rows are not delimited by a literal **\n**; each new line should be on its own row, without the \n. I have included an image in the question. I hope it helps. – gaurav2141 Jul 28 '19 at 15:42
  • What are the exact contents of `text`? Also, why do you use `writelines` instead of `write` if it's just a string? – m02ph3u5 Jul 28 '19 at 15:43
  • It's some data extracted from a PDF document, @m02ph3u5 – gaurav2141 Jul 28 '19 at 15:45
  • @m02ph3u5 Neither writelines nor write works for me. – gaurav2141 Jul 28 '19 at 16:13

1 Answer

When you do

f.writelines(str(text.encode('utf-8')))

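text.encode('utf-8') returns a bytes object, and calling str() on a bytes object returns its printable representation (b'...'), in which every newline character is written as the two literal characters \ and n. That is why the file shows \n instead of line breaks. Write the string itself instead: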

f.writelines(text)
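
To illustrate the difference, here is a minimal sketch. The sample string stands in for the OCR result and is an assumption, not the asker's actual data:

```python
# Sample text standing in for the OCR result (assumed for illustration).
text = "ABC\nDEF\nGHI\n"

# str() on a bytes object returns its repr: the newlines become the two
# characters backslash and "n", and the result is wrapped in b'...'.
escaped = str(text.encode("utf-8"))
print(escaped)

# The original string still contains real newline characters.
print(text, end="")
```

The first `print` shows everything on a single row, while the second shows three separate rows.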
herculanodavi
  • If I don't encode, it throws an error: UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 0: character maps to <undefined> – gaurav2141 Jul 28 '19 at 16:11
  • You could try [this](https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters) – herculanodavi Jul 29 '19 at 17:07
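
A common way to avoid this `UnicodeEncodeError` without calling `encode` yourself is to open the output file with an explicit encoding, so that Python does the conversion while writing. The sketch below assumes that approach; the temporary path and the sample text (which includes the '\ufb01' ligature from the error message) are placeholders, not the asker's real file or data:

```python
import os
import tempfile

# Sample text containing the 'fi' ligature (U+FB01) that the default
# Windows 'charmap' codec could not encode (assumed for illustration).
text = "\ufb01rst line\nsecond line\n"

# Hypothetical output location; replace with the real output path.
out_path = os.path.join(tempfile.mkdtemp(), "out_text.txt")

# encoding="utf-8" lets the file object encode the string on write,
# so the newlines stay real line breaks in the file.
with open(out_path, "a", encoding="utf-8") as f:
    f.write("Data extracted from next page starts now.\n")
    f.write(text)

# Read the file back to confirm each line ended up on its own row.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(len(lines))
```

Using a `with` block also closes the file automatically, so a separate `f.close()` is not needed.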