
I am dealing with text and PDF files of 5 KB or less. If the file is a text file, I get it from the form and read its contents into a string to summarize:

 file = file.readlines()
 file = ''.join(file)
 result = summarize(file, num_sentences)

That is easily done, but for a PDF file it turns out not to be that easy. Is there a way to get the sentences of a PDF file as a string, as I did with my txt file, in Python/Django?

pynovice
  • This is a possible duplicate of this question: http://stackoverflow.com/questions/2481945/how-to-read-line-by-line-in-pdf-file-using-pypdf – halflings Apr 10 '13 at 10:39
  • Yes, probably. But I already tried the solution suggested in that question. It couldn't return the content of the whole file as a string. – pynovice Apr 10 '13 at 10:41
  • Maybe you could say that in your question and describe what exactly went wrong (an error message? wrong content?) so we can help you! – halflings Apr 10 '13 at 10:52
  • You can use this app: http://www.unixuser.org/~euske/python/pdfminer/index.html – catherine Apr 10 '13 at 11:45

2 Answers


I don't think it's possible to read PDFs the way you are reading txt files. You need to convert the PDF into text first (see "Python module for converting PDF to text") and then process it. You can also use this recipe to convert PDF to text easily: http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
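The convert-then-process idea can be sketched as one helper that returns a string for either file type, so the summarizing code stays the same. This is only an illustration: `file_to_text` and `extract_pdf_text` are hypothetical names, the PDF conversion is passed in as a callable so any converter can be plugged in, and text uploads are assumed to be UTF-8.

```python
def file_to_text(filename, data, extract_pdf_text):
    """Return the contents of an uploaded file as one string.

    filename: original name of the upload, used only to pick a branch
    data: raw bytes of the upload
    extract_pdf_text: callable taking PDF bytes and returning a string
    """
    if filename.lower().endswith(".pdf"):
        return extract_pdf_text(data)
    # Assumption: plain-text uploads are UTF-8; undecodable bytes are replaced.
    return data.decode("utf-8", errors="replace")


def fake_extractor(pdf_bytes):
    # Stand-in for demonstration only; a real one would call a PDF library.
    return "text pulled from the PDF"


print(file_to_text("notes.txt", b"Plain text body.", fake_extractor))
print(file_to_text("paper.PDF", b"%PDF-1.4 ...", fake_extractor))
```

Checking the extension is the simplest test; a stricter version might instead check that the data starts with the `%PDF` signature.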

scottydelta

In Django you can do this:

views.py :

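import io

import PyPDF2
from django.shortcuts import render

def upload_pdf(request):
    content = []  # one extracted string per page
    if request.method == 'POST' and request.FILES.get('myfile'):
        # read the uploaded file into memory and wrap it in a file-like object
        pdfFileObj = request.FILES['myfile'].read()
        # these are the PyPDF2 1.x/2.x names; PyPDF2 3.0 renamed them
        pdfReader = PyPDF2.PdfFileReader(io.BytesIO(pdfFileObj))
        for i in range(pdfReader.numPages):
            page = pdfReader.getPage(i)
            content.append(page.extractText())
    # depends on what you want to do with the pdf parsing results
    return render(request, .....)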

html part:

<form method="post" enctype="multipart/form-data" action="/url">
    {% csrf_token %}
    <input type="file" name="myfile"> <!-- the name must match the key used in request.FILES['myfile'] -->
    <button class="butto" type="submit">Upload</button>
</form>

In plain Python you can simply do this:

import PyPDF2

fileName = "path/test.pdf"
content = []  # one extracted string per page

# open in binary mode; the with-block closes the file afterwards
with open(fileName, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    for i in range(pdfReader.numPages):
        page = pdfReader.getPage(i)
        content.append(page.extractText())
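
Both snippets leave `content` as a list with one string per page, while the question asks for a single string. A minimal sketch of that last step follows; `pages_to_text` is a hypothetical helper name, and collapsing whitespace is an assumption about what the summarizer prefers, since extracted PDF text often carries stray line breaks.

```python
import re


def pages_to_text(pages):
    """Join per-page strings into one whitespace-normalized string."""
    # Treat a page with no extractable text (None or "") as empty.
    joined = " ".join(page or "" for page in pages)
    # Collapse runs of spaces and newlines into single spaces.
    return re.sub(r"\s+", " ", joined).strip()


# Sample data standing in for the `content` list built above.
content = ["First page. It has two sentences.\n", "Second\npage here."]
text = pages_to_text(content)
print(text)
```

The resulting `text` can then be passed on in the same way as the string built from the txt file, for example `summarize(text, num_sentences)`.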
Oumab10