
I have a PDF file. It contains four columns, and none of the pages have grid lines. The data is students' marks.

I would like to run some analysis on this distribution (histograms, line graphs, etc.).

I want to parse this PDF file into a spreadsheet or an HTML file (which I can then parse very easily).

The link to the pdf is:

Pdf

This is a public document and is openly available on this domain to anyone.

Note: I know that this can be done by exporting the file to text from Adobe Reader and then importing it into LibreOffice Calc or Excel, but I want to do this using a Python script.

Kindly help me with this issue. Specs: Windows 7, Python 2.7.

Kenly
IcyFlame
  • Does it have to be parsed as a PDF? For example, I was able to create your data as tab delimited just using my favorite text editor by pasting from the PDF and doing a few replaces: http://pastebin.com/ih6tKMpH – Sean Johnson Sep 12 '13 at 04:38
  • Yeah! I know we can do this by exporting it as text from adobe and then import it into excel. But i wanna do it using a script! – IcyFlame Sep 12 '13 at 04:42
  • Related: http://stackoverflow.com/questions/1848464/advanced-pdf-parsing-using-python-extracting-text-without-tables-etc-whats – Sean Johnson Sep 12 '13 at 04:44
  • You copied the data from the pdf and pasted it? Or did you export the data as text from some pdf reader? @Sean Johnson – IcyFlame Sep 12 '13 at 04:45
  • I literally just copied and pasted it from the PDF into my text editor, and ran a few replaces to get the fields to be tab delimited for easy parsing. – Sean Johnson Sep 12 '13 at 04:47
  • Did you try using the standard recommendation for this task ([PDFMiner](http://www.unixuser.org/~euske/python/pdfminer/index.html))? – elyase Sep 12 '13 at 04:49
  • :) @SeanJohnson - which is to say you did exactly as IcyFlame suggested, you used Excel's export as text to the clipboard feature (select cell range and copy - puts it in the clipboard as both text and table) and pasted it into a text editor (which ignores the table in the clipboard and gets the text). :) – Jesse Chisholm Dec 12 '15 at 19:27
  • Does this answer your question? [Python module for converting PDF to text](https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text) – rdmolony Apr 29 '21 at 15:28

1 Answer


Use PyPDF2:

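# Note: PyPDF2 3.x renamed these to PdfReader, reader.pages[0] and extract_text().
from PyPDF2 import PdfFileReader

# Open in binary mode, since the reader works on raw bytes.
with open('CT1-All.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    # Extract the text of the first page and split it into lines.
    contents = reader.getPage(0).extractText().split('\n')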

When you print contents, it will look like this (I have trimmed it here):

[u'Serial NoRoll NoNameCT1 Marks (50)111MA20026KARADI KALYANI212AR10029MUKESH K
MAR5', u'312MI31004DEEPAK KUMAR7', u'413AE10008FADKE PRASAD DIPAK27', u'513AE10
22RAHUL DUHAN37', u'613AE30005HIMANSHU PRABHAT26.5', u'713AE30019VISHAL KUMAR39
, u'813AG10014HEMANT17', u'913AG10028SHRESTH KR KRISHNA37.51013AG30009HITESH ME
RA33.5', u'1113AG30023RACHIT MADHUKAR40.5', u'1213AR10002ACHARY SUDHEER11', u'1
13AR10004AMAN ASHISH20.5', u'1413AR10008ANKUR44', u'1513AR10010CHUKKA SHALEM RA
U11.5', u'1613AR10012DIKKALA VIJAYA RAGHAVA20.5', u'1713AR10014HRISHABH AMRODIA
1', u'1813AR10016JAPNEET SINGH CHAHAL19.5', u'1913AR10018K VIGNESH42.5', u'2013
R10020KAARTIKEY DWIVEDI49.5', u'2113AR10024LAKSHMISRI KEERTI MANNEY49', u'2213A
10026MAJJI DINESH9.5', u'2313AR10028MOUNIKA BHUKYA17.5', u'2413AR10030PARAS PRA
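Since the columns come out fused into one string per row, they still have to be split apart before any analysis. The following is only a sketch of one way to do that with a regular expression. It assumes every record is a serial number, a roll number of the form two digits, two letters, five digits, an upper-case name, and then the marks; that layout is inferred from the output above, not taken from the PDF itself. `ROW` and `parse_rows` are hypothetical names. Strings that do not look like exactly one record (the header, or two records fused together) are returned separately instead of being guessed at, because the boundary between one row's marks and the next row's serial number is ambiguous.

```python
import re

# Assumed row layout: serial number, roll number (2 digits, 2 letters,
# 5 digits), upper-case name, marks. Inferred from the sample output only.
ROW = re.compile(r"^(\d+)(\d{2}[A-Z]{2}\d{5})([A-Z][A-Z ]*?)(\d+(?:\.\d+)?)$")


def parse_rows(contents):
    """Split extracted strings into (serial, roll, name, marks) tuples.

    Returns (rows, leftovers). leftovers holds every string that does not
    look like exactly one record, so it can be inspected by hand.
    """
    rows, leftovers = [], []
    for text in contents:
        match = ROW.match(text.strip())
        if match is None:
            leftovers.append(text)
            continue
        serial, roll, name, marks = match.groups()
        rows.append((int(serial), roll, name.strip(), float(marks)))
    return rows, leftovers


# A few intact strings copied from the output above.
sample = [
    u'312MI31004DEEPAK KUMAR7',
    u'413AE10008FADKE PRASAD DIPAK27',
    u'613AE30005HIMANSHU PRABHAT26.5',
    u'1113AG30023RACHIT MADHUKAR40.5',
]
rows, leftovers = parse_rows(sample)
for row in rows:
    print(row)
```

The resulting tuples can be written out with the standard `csv` module and opened in Calc or Excel. It is worth checking `leftovers` and comparing the serial numbers against the row count, since anything that does not fit the assumed layout is set aside rather than parsed.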
Burhan Khalid