I want to extract text from pdf file using Python and PYPDF package. This is my pdf fie and this is my code:
import PyPDF2
opened_pdf = PyPDF2.PdfFileReader('test.pdf', 'rb')
p=opened_pdf.getPage(0)
p_text= p.extractText()
# extract data line by line
P_lines=p_text.splitlines()
print P_lines
My problem is P_lines cannot extract data line by line and results in one giant string. I want to extract text line by line to analyze it. Any suggestion on how to improve it? Thanks! This is the string that code returns:
[u'Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)** Information is based on the maximum potential for concentration and thus the total may be over 100%* Total Water Volume sources may include fresh water, produced water, and/or recycled water0.01271%72.00%7732-18-5Water0.00071%4.00%1310-73-2Sodium Hydroxide0.00424%24.00%533-74-4DazomatBiocidePumpcoPlexcide 24L0.00828%75.00%Organic phosphonic acid salts0.00276%25.00%67-56-1Methyl AlcoholScale InhibitorPumpcoPlexaid 6730.00807%30.00%7732-18-5Water0.00188%7.00%Polyethoxylated alcohol surfactants0.00753%28.00%9003-06-9Ammonium Salts0.00941%35.00%64742-47-8Petroleum DistillateFriction ReducerPumpcoPlexslick 9210.05029%60.00%7732-18-5Water0.03353%40.00%7647-01-0Hydrogen ChlorideHydrochloric AcidPumpcoHCL9.84261%100.00%14808-60-7Crystaline SilicaProppantPumpcoSand90.01799%100.00%7732-18-5WaterCommentsMaximumIngredientConcentrationin HF Fluid(% by mass)**MaximumIngredientConcentrationin Additive(% by mass)**Chemical AbstractService Number(CAS #)IngredientsPurposeSupplierTrade NameHydraulic Fracturing Fluid Composition:2,608,032Total Water Volume (gal)*:7,595True Vertical Depth (TVD):GasProduction Type:NAD27Long/Lat Projection:32.558525Latitude:-97.215242Longitude:Ole Gieser Unit D 6HWell Name and Number:XTO EnergyOperator Name:42-439-35084API Number:TarrantCounty:TexasState:12/10/2010Fracture DateHydraulic Fracturing Fluid Product Component Information Disclosure']