Hello,

I have a PDF file that is basically a form. I want to extract all the field names as column names, together with the information associated with each field, and then save the result to an Excel file.

Please help me understand how this can be done.

Thank you in advance.

snapshot of pdf

arvin

1 Answer

You could use the pdfplumber package. Its crop function lets you specify a bounding box that defines the area to extract from. There is a whole list of other functions that are useful for extracting text, form fields, and so on. For example: https://github.com/jsvine/pdfplumber#extracting-form-values

Combining this with some regex selection and matching is often useful. I would advise that you give it a go; it is actually not as difficult as you might expect.

Example extracting text from PDF:

import pdfplumber

# pdf_file is the path to your PDF
with pdfplumber.open(pdf_file) as pdf:
    first_page = pdf.pages[0]
    rows = first_page.extract_text().split('\n')

You can then put the data into a DataFrame using the pandas package. Once it is in that format, writing it to Excel is straightforward.
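As a rough sketch of that step: the code below assumes the extracted lines look like "Label: value", which may not hold for your form, so treat the separator and the sample lines as hypothetical. The function names `rows_to_records` and `save_records_to_excel` are made up for this illustration.

```python
def rows_to_records(rows, sep=":"):
    """Turn 'Label: value' text lines into {'field': ..., 'value': ...} dicts."""
    records = []
    for row in rows:
        label, found, value = row.partition(sep)
        # Skip lines that have no separator or no label in front of it
        if found and label.strip():
            records.append({"field": label.strip(), "value": value.strip()})
    return records


def save_records_to_excel(records, path):
    """Write the records to an .xlsx file (needs pandas plus openpyxl)."""
    import pandas as pd  # imported here so the parsing step works without pandas

    pd.DataFrame(records, columns=["field", "value"]).to_excel(path, index=False)


# Hypothetical sample lines standing in for `rows` from the snippet above
sample_rows = ["Name: Jane Doe", "Page 1 of 6", "Email: jane@example.com"]
records = rows_to_records(sample_rows)
print(records)
```

Calling `save_records_to_excel(records, "form.xlsx")` would then write one row per field. If your lines use a different layout, this is the place where the regex matching mentioned above would come in.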


EDIT:

Based on new information, it looks like you are dealing with an XFA-based PDF. In my initial attempts, I was not able to use pdfplumber as described above. My suggestion is to use PyPDF2.

Extract the XML from the PDF document, and then use that to get the information you need. I would say that regex would still be an appropriate method here.

Code to extract the XML from the XFA-based PDF:

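import PyPDF2 as pypdf

def findInDict(needle, haystack):
    """Recursively search nested dict-like PDF objects for the key `needle`."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Some entries cannot be resolved; skip them
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

# Written against PyPDF2 < 3.0; in 3.0+ these names became
# PdfReader, get_object() and get_data()
pdfobject = open('Form CHG-1-010216.pdf', 'rb')
pdf = pypdf.PdfFileReader(pdfobject)
xfa = findInDict('/XFA', pdf.resolved_objects)
# Index 7 held the "datasets" stream in this particular file
xml = xfa[7].getObject().getData()

print(xml)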

(code source)
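The index 7 is specific to this file. When /XFA is stored as an array, it alternates part names and streams, so a small helper can pair them up and let you pick a part by name instead of by position. This is only a sketch: `xfa_parts` is a made-up name, the demo list is a stand-in for the real `xfa` value, and it assumes /XFA is an array rather than a single stream.

```python
def xfa_parts(xfa_array):
    """Pair up an alternating [name, stream, name, stream, ...] list as {name: stream}."""
    names = [str(name) for name in xfa_array[0::2]]
    return dict(zip(names, xfa_array[1::2]))


# Stand-in list; with the real file you would pass the `xfa` value found above
demo = ["preamble", "<stream 0>", "config", "<stream 1>", "datasets", "<stream 2>"]
parts = xfa_parts(demo)
print(list(parts))
```

With the real file, something like `xfa_parts(xfa)["datasets"].getObject().getData()` should give the same bytes as `xfa[7]` did here, and printing the part names shows which other sections are available.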

Here is the XML extracted from the datasets section of the PDF:

b'\n<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="Form8_Dtls"\n><frm:Form8_Dtls xmlns:frm="http://www.mit.gov.in/eGov/BackOffice/schema/Form"\n><frm:Form8\n><frm:CIN dd:minOccur="0" dd:nullType="exclude"\n/><frm:GLN dd:minOccur="0" dd:nullType="exclude"\n/><frm:EmailID dd:minOccur="0" dd:nullType="exclude"\n/><frm:ChargeType dd:minOccur="0" dd:nullType="exclude"\n/><frm:Applicant dd:minOccur="0" dd:nullType="exclude"\n/><frm:SrnForForm dd:minOccur="0" dd:nullType="exclude"\n/><frm:ChargeID 
dd:minOccur="0" dd:nullType="exclude"\n/><frm:SrnForForm2.28 dd:minOccur="0" dd:nullType="exclude"\n/><frm:Beyond30Within300 dd:minOccur="0" dd:nullType="exclude"\n/><frm:Beyond300 dd:minOccur="0" dd:nullType="exclude"\n/><frm:ReasonDelay dd:minOccur="0" 
dd:nullType="exclude"\n/><frm:WhthrChrgARCorAssignd dd:minOccur="0" dd:nullType="exclude"\n/><frm:WhthrChrghldrAuth dd:minOccur="0" dd:nullType="exclude"\n/><frm:UnCldShrCptl dd:minOccur="0" dd:nullType="exclude"\n/><frm:Improperty dd:minOccur="0" dd:nullType="exclude"\n/><frm:AnyIntrstImproperty dd:minOccur="0" dd:nullType="exclude"\n/><frm:BookDebts dd:minOccur="0" dd:nullType="exclude"\n/><frm:MovProperty dd:minOccur="0" dd:nullType="exclude"\n/><frm:FloatngChrg dd:minOccur="0" dd:nullType="exclude"\n/><frm:CallsMadeNotPaid dd:minOccur="0" dd:nullType="exclude"\n/><frm:Ship dd:minOccur="0" dd:nullType="exclude"\n/><frm:Goodwill dd:minOccur="0" dd:nullType="exclude"\n/><frm:PatentLicence dd:minOccur="0" dd:nullType="exclude"\n/><frm:tradeMark dd:minOccur="0" dd:nullType="exclude"\n/><frm:Copyright dd:minOccur="0" dd:nullType="exclude"\n/><frm:Others dd:minOccur="0" dd:nullType="exclude"\n/><frm:OthersSpec dd:minOccur="0" dd:nullType="exclude"\n/><frm:ConsrtmInvld dd:minOccur="0" dd:nullType="exclude"\n/><frm:JointChrgInvld dd:minOccur="0" dd:nullType="exclude"\n/><frm:NoOfChargeHolders 
dd:minOccur="0" dd:nullType="exclude"\n/><frm:CategoryBank dd:minOccur="0" dd:nullType="exclude"\n/><frm:IfCategoryOthers dd:minOccur="0" dd:nullType="exclude"\n/><frm:ChargeHolderDetails dd:minOccur="0"\n><cdt:Cin xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes" dd:minOccur="0" dd:nullType="exclude"\n/><cdt:ChrgHldrName xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes" dd:minOccur="0" dd:nullType="exclude"\n/><cdt:OptionalName xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes" dd:minOccur="0" dd:nullType="exclude"\n/><cdt:ChargeHldrAddress xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n><cdt:AddressLn\n><cdt:FirstLine\n/><cdt:SecondLine dd:minOccur="0" dd:nullType="exclude"\n/></cdt:AddressLn\n><cdt:City\n/><cdt:State\n/><cdt:Country dd:minOccur="0" dd:nullType="exclude"\n/><cdt:CountryName dd:minOccur="0" dd:nullType="exclude"\n/><cdt:Pincode dd:minOccur="0" dd:nullType="exclude"\n/><cdt:Telephone dd:minOccur="0" dd:nullType="exclude"\n/><cdt:Fax dd:minOccur="0" dd:nullType="exclude"\n/><cdt:Email dd:minOccur="0" dd:nullType="exclude"\n/><cdt:Mbl dd:minOccur="0" dd:nullType="exclude"\n/></cdt:ChargeHldrAddress\n></frm:ChargeHolderDetails\n><frm:InstrumentDesc dd:minOccur="0" dd:nullType="exclude"\n/><frm:InstrumentCrtModDate dd:minOccur="0" dd:nullType="exclude"\n/><frm:WhthrChrgCrMod dd:minOccur="0" dd:nullType="exclude"\n/><frm:FrgnChargeRcptDate dd:minOccur="0" dd:nullType="exclude"\n/><frm:AmtSecured 
dd:minOccur="0" dd:nullType="exclude"\n/><frm:AmtSecChrgInWords dd:minOccur="0" dd:nullType="exclude"\n/><frm:AmtSecChrgFrgnCurrncyDetails dd:minOccur="0" dd:nullType="exclude"\n/><frm:TermsAndConditions dd:minOccur="0"\n><frm:RateOfInt dd:minOccur="0" dd:nullType="exclude"\n/><frm:TermsOfPaymnt dd:minOccur="0" dd:nullType="exclude"\n/><frm:Margin dd:minOccur="0" dd:nullType="exclude"\n/><frm:ExtntOperatnChrg dd:minOccur="0" dd:nullType="exclude"\n/><frm:Others dd:minOccur="0" dd:nullType="exclude"\n/></frm:TermsAndConditions\n><frm:ExstngChrgAcqDtls dd:minOccur="0"\n><frm:InstrDate dd:minOccur="0" dd:nullType="exclude"\n/><frm:InstrDescr dd:minOccur="0" dd:nullType="exclude"\n/><frm:DateofAcq dd:minOccur="0" dd:nullType="exclude"\n/><frm:ChrgAmnt dd:minOccur="0" dd:nullType="exclude"\n/><frm:PropChrgPartclrs dd:minOccur="0" dd:nullType="exclude"\n/></frm:ExstngChrgAcqDtls\n><frm:PropParticlars dd:minOccur="0" dd:nullType="exclude"\n/><frm:NewPropParticlars dd:maxOccur="10" dd:minOccur="0" dd:nullType="exclude"\n/><frm:PropOwnCmp dd:minOccur="0" dd:nullType="exclude"\n/><frm:PropRegisteredName dd:minOccur="0" dd:nullType="exclude"\n/><frm:DateOfLatestMod dd:minOccur="0" dd:nullType="exclude"\n/><frm:PartclrsPresntMod dd:minOccur="0" dd:nullType="exclude"\n/><frm:BoardResNo dd:minOccur="0" dd:nullType="exclude"\n/><frm:AuthSigReslnDt dd:minOccur="0" dd:nullType="exclude"\n/><frm:DesignationOne dd:minOccur="0" dd:nullType="exclude"\n/><frm:DIN dd:minOccur="0" dd:nullType="exclude"\n/><frm:DesignationTwo dd:minOccur="0" dd:nullType="exclude"\n/><frm:DesignationThree dd:minOccur="0" dd:nullType="exclude"\n/><frm:CharteredOrCostOrCompSec dd:minOccur="0" dd:nullType="exclude"\n/><frm:AssociateorFellow dd:minOccur="0" dd:nullType="exclude"\n/><frm:MembershipnumberorCertificate dd:minOccur="0" dd:nullType="exclude"\n/><frm:CertificateNo dd:minOccur="0" dd:nullType="exclude"\n/><frm:strBankName dd:minOccur="0" dd:nullType="exclude"\n/><frm:CondonationFlag dd:minOccur="0" 
dd:nullType="exclude"\n/><frm:HTF dd:minOccur="0" dd:nullType="exclude"\n/><frm:IPresNumber dd:minOccur="0" dd:nullType="exclude"\n/><frm:FormId dd:minOccur="0" dd:nullType="exclude"\n/><frm:VersionNo dd:minOccur="0" dd:nullType="exclude"\n/><frm:Form_Language dd:minOccur="0" dd:nullType="exclude"\n/><frm:BoPreFilldataForm dd:minOccur="0"\n><cdt:DateOfFiling xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n/><cdt:DateOfSigning xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n/><cdt:eFormSRN xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n/><cdt:MngmtDispute xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes" dd:minOccur="0" dd:nullType="exclude"\n/></frm:BoPreFilldataForm\n><frm:MngmtDispute dd:minOccur="0" dd:nullType="exclude"\n/><frm:LSI dd:minOccur="0" dd:nullType="exclude"\n/><frm:HostVersion dd:minOccur="0" dd:nullType="exclude"\n/><frm:HostAppName dd:minOccur="0" dd:nullType="exclude"\n/><frm:TotalPageNo dd:minOccur="0" dd:nullType="exclude"\n/><frm:EfmUniqueID dd:minOccur="0" dd:nullType="exclude"\n/><frm:AttachmentNames dd:minOccur="0" dd:nullType="exclude"\n/></frm:Form8\n></frm:Form8_Dtls\n></dd:dataDescription\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="CINDataConngetCINLLPINDetailsRequestDD"\n><CINDataConn\n><soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:getCINLLPINDetails xmlns:impl="http://prefill.eforms.userinterface.mydca.dca21.com"\n><impl:strCINLLPIN dd:nullType="xsi"\n/></impl:getCINLLPINDetails\n></soap:Body\n></CINDataConn\n></dd:dataDescription\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="CINSplitDataConn1getChrgHolderAddressWitCondRequestDD"\n><CINSplitDataConn1\n><soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:getChrgHolderAddressWitCond 
xmlns:impl="http://prefill.eforms.userinterface.mydca.dca21.com"\n><impl:strCompanyID dd:nullType="xsi"\n/></impl:getChrgHolderAddressWitCond\n></soap:Body\n></CINSplitDataConn1\n></dd:dataDescription\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="ForeignCmpnyDataConngetForeignCompanyDetailsNewRequestDD"\n><ForeignCmpnyDataConn\n><soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:getForeignCompanyDetailsNew xmlns:impl="http://prefill.eforms.userinterface.mydca.dca21.com"\n><impl:strCompanyID dd:nullType="xsi"\n/></impl:getForeignCompanyDetailsNew\n></soap:Body\n></ForeignCmpnyDataConn\n></dd:dataDescription\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="PrescrutinyServiceserviceForm8RequestDD"\n><PrescrutinyService\n><soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:serviceForm8 xmlns:impl="http://prefill.eforms.userinterface.mydca.dca21.com"\n><impl:objForm8PrescDDto dd:nullType="xsi"\n><tns2:DBoardResDate xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:DDateOfLatestMod xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:DTempAcqstnDate xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:extntOperatnChrg xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:margin xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:others xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:partclrsPresntMod xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:propParticlars xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:rateOfInt xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:termsOfPaymnt xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:whthrChrgCrMod 
xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:DInstrCrtEvdDate xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:DInstrCrtModDate xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:DTempFrgnCharge xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:dupFlag xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:formID xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:formVersion xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:presNumber xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strChargeType xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strChrgCIN xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strChrgHldrName xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strCIN xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strCountryCode xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strDesignation xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strDINMembrshpNo xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strPinCode xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/><tns2:strStateCode xmlns:tns2="http://dto.eforms.business.mydca.dca21.com" dd:nullType="xsi"\n/></impl:objForm8PrescDDto\n></impl:serviceForm8\n></soap:Body\n></PrescrutinyService\n></dd:dataDescription\n><xfa:data\n><frm:Form8_Dtls 
xmlns:frm="http://www.mit.gov.in/eGov/BackOffice/schema/Form"\n><frm:Form8\n><frm:CIN\n>U37100DL2004PTC128960</frm:CIN\n><frm:EmailID\n>vwcpl@yahoo.com</frm:EmailID\n><frm:ChargeType\n>CRTN</frm:ChargeType\n><frm:Applicant\n>Company</frm:Applicant\n><frm:Beyond30Within300\n>Yes</frm:Beyond30Within300\n><frm:ReasonDelay\n>Due to some DSC problems</frm:ReasonDelay\n><frm:UnCldShrCptl\n>2ONE</frm:UnCldShrCptl\n><frm:Improperty\n>2MMP</frm:Improperty\n><frm:AnyIntrstImproperty\n>2ONE</frm:AnyIntrstImproperty\n><frm:BookDebts\n>2ONE</frm:BookDebts\n><frm:MovProperty\n>2OVP</frm:MovProperty\n><frm:FloatngChrg\n>2ONE</frm:FloatngChrg\n><frm:CallsMadeNotPaid\n>2ONE</frm:CallsMadeNotPaid\n><frm:Ship\n>2ONE</frm:Ship\n><frm:Goodwill\n>2ONE</frm:Goodwill\n><frm:PatentLicence\n>2ONE</frm:PatentLicence\n><frm:tradeMark\n>2ONE</frm:tradeMark\n><frm:Copyright\n>2ONE</frm:Copyright\n><frm:Others\n>2THS</frm:Others\n><frm:OthersSpec\n>movable / immovable fixed assets</frm:OthersSpec\n><frm:ConsrtmInvld\n>NO</frm:ConsrtmInvld\n><frm:JointChrgInvld\n>NO</frm:JointChrgInvld\n><frm:NoOfChargeHolders\n>1</frm:NoOfChargeHolders\n><frm:CategoryBank\n>NATB</frm:CategoryBank\n><frm:ChargeHolderDetails\n><cdt:ChrgHldrName xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n>Others</cdt:ChrgHldrName\n><cdt:OptionalName xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n>State Bank of Patiala</cdt:OptionalName\n><cdt:ChargeHldrAddress xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n><cdt:AddressLn\n><cdt:FirstLine\n>MCG Pitampura</cdt:FirstLine\n><cdt:SecondLine\n>A-102, D-Mall, Netaji Subash 
Place</cdt:SecondLine\n></cdt:AddressLn\n><cdt:City\n>Delhi</cdt:City\n><cdt:State\n>DL</cdt:State\n><cdt:Country\n>IN</cdt:Country\n><cdt:CountryName\n>INDIA</cdt:CountryName\n><cdt:Pincode\n>110034</cdt:Pincode\n><cdt:Email\n>nimishdel@gmail.com</cdt:Email\n></cdt:ChargeHldrAddress\n></frm:ChargeHolderDetails\n><frm:InstrumentDesc\n>Agreement of Loan Cum Hypothecation&#xD;Letter of Arrangement&#xD;Agreement of Mortgage</frm:InstrumentDesc\n><frm:InstrumentCrtModDate\n>2015-12-30</frm:InstrumentCrtModDate\n><frm:WhthrChrgCrMod\n>NO</frm:WhthrChrgCrMod\n><frm:AmtSecured\n>6000000.00</frm:AmtSecured\n><frm:AmtSecChrgInWords\n>Rupees Sixty Lacs  only</frm:AmtSecChrgInWords\n><frm:TermsAndConditions\n><frm:RateOfInt\n>Term Loan of Rs.60.00 Lakh - @ 3.10 % above base rate, present 
effective rate 12.75 % p. a. with monthly rests.</frm:RateOfInt\n><frm:TermsOfPaymnt\n>The Term Loan of Rs.60.00 Lakh shall be repayable in 26 quarterly installments of Rs.2,25,000/- each, commencing from June, 2016 (last installment of Rs.4,00,000/- )</frm:TermsOfPaymnt\n><frm:Margin\n>41.19 %</frm:Margin\n><frm:ExtntOperatnChrg\n>100 percent.</frm:ExtntOperatnChrg\n><frm:Others\n>THE ABOVE IS TO SECURE THE FOLLOWING CREDIT FACILITIES GRANTED TO THE COMPANY :-&#xD;1. Term Loan             - Rs.60.00 Lakh</frm:Others\n></frm:TermsAndConditions\n><frm:Form_Language\n>ENGL</frm:Form_Language\n><frm:BoPreFilldataForm\n><cdt:DateOfFiling xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n/><cdt:DateOfSigning xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n/><cdt:eFormSRN xmlns:cdt="http://www.mit.gov.in/eGov/BackOffice/schema/ComplexDataTypes"\n/></frm:BoPreFilldataForm\n><frm:HostVersion\n>22.00220191</frm:HostVersion\n><frm:HostAppName\n>Reader</frm:HostAppName\n><frm:TotalPageNo\n>6</frm:TotalPageNo\n><frm:EfmUniqueID\n>Form89JOK7KCC3A8WGXDDT7VVA3ZKNRG</frm:EfmUniqueID\n><frm:LSI\n>1</frm:LSI\n><frm:ExstngChrgAcqDtls\n/><frm:NewPropParticlars\n>Hypothecation of all movable / immovable fixed assets of the company created out of the Term Loan.</frm:NewPropParticlars\n><frm:NewPropParticlars\n>Equitable mortgage of Shop No. 
CSC - 39, DDA Market, A Block, Saraswati Vihar, Delhi -110 034.</frm:NewPropParticlars\n><frm:PropOwnCmp\n>NO</frm:PropOwnCmp\n><frm:BoardResNo\n>03</frm:BoardResNo\n><frm:AuthSigReslnDt\n>2015-12-14</frm:AuthSigReslnDt\n><frm:DesignationOne\n>DIRT</frm:DesignationOne\n><frm:DIN\n>05153044</frm:DIN\n><frm:DesignationTwo\n>AACCS0143D</frm:DesignationTwo\n><frm:CharteredOrCostOrCompSec\n>CA</frm:CharteredOrCostOrCompSec\n><frm:AssociateorFellow\n>FW</frm:AssociateorFellow\n><frm:MembershipnumberorCertificate\n>508508</frm:MembershipnumberorCertificate\n><frm:CertificateNo\n>508508</frm:CertificateNo\n><frm:AttachmentNames\n>2014.pdf,2900.pdf</frm:AttachmentNames\n><frm:HTF\n>NO</frm:HTF\n><frm:IPresNumber\n>0</frm:IPresNumber\n><frm:FormId\n>Form8</frm:FormId\n><frm:VersionNo\n>30</frm:VersionNo\n></frm:Form8\n><CompanyName_C\n>VEERA WASIR CONSULTANTS PRIVATE LIMITED</CompanyName_C\n><ExtractedVersion\n>30</ExtractedVersion\n><Hidden_FormLanguage\n/><CompanyAdd_C\n>9 CSC DDA MARKETA-BLOCK\nSARASWATI VIHAR\nDELHI\nDelhi\nINDIA\n110085</CompanyAdd_C\n><hiddenEmailID\n/><Hidden_L\n>Agreement of Hyp.pdf:2014:Agreement of Mortgage.pdf:2900</Hidden_L\n><Err_C\n/><isDupFlag\n>NO</isDupFlag\n><PrescruitnyErr_N\n>-1</PrescruitnyErr_N\n><BOFiling_errMsg\n/><BOFilingFlag\n>NO</BOFilingFlag\n><CheckForm_C\n>NO</CheckForm_C\n></frm:Form8_Dtls\n></xfa:data\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="ChargeHoldersDataConngetBankDetailsRequestDD"\n><ChargeHoldersDataConn\n><soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:getBankDetails xmlns:impl="http://prefill.eforms.userinterface.mydca.dca21.com"\n><impl:strFormId dd:nullType="xsi"\n/></impl:getBankDetails\n></soap:Body\n></ChargeHoldersDataConn\n></dd:dataDescription\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="GetBOFilingDtlsgetFilingDtlsRequestDD"\n><GetBOFilingDtls\n><soap:Body 
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:getFilingDtls xmlns:impl="http://common.userinterface.backoffice.dca21.com"\n><eFilingInDDto dd:nullType="xsi"\n><formId dd:nullType="xsi"\n/><formUniqueId dd:nullType="xsi"\n/></eFilingInDDto\n></impl:getFilingDtls\n></soap:Body\n></GetBOFilingDtls\n></dd:dataDescription\n><dd:dataDescription xmlns:dd="http://ns.adobe.com/data-description/" dd:name="NewCINDataConngetCINLLPINDetails_SCRequestDD"\n><NewCINDataConn\n><soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"\n><impl:getCINLLPINDetails_SC xmlns:impl="http://prefill.eforms.userinterface.mydca.dca21.com"\n><impl:strCINLLPIN\n/></impl:getCINLLPINDetails_SC\n></soap:Body\n></NewCINDataConn\n></dd:dataDescription\n></xfa:datasets\n>'

You can also extract from the form section:

xml2=xfa[13].getObject().getData()

print(xml2)
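To get from the datasets XML to field names and values, an XML parser may be easier to work with than regex. The sketch below uses the standard-library `xml.etree.ElementTree` to flatten everything under `xfa:data` into a dictionary. The helper names (`local_name`, `flatten`, `xfa_fields`), the dotted key format, and the " | " joiner for repeated fields are choices made for this illustration, and the sample bytes are a shortened, made-up stand-in for the real `xml` value.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xfa:datasets root element shown above
XFA_DATA = "{http://www.xfa.org/schema/xfa-data/1.0/}data"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(element, prefix="", out=None):
    """Collect leaf elements as {'Parent.Child': text}; repeated fields are joined with ' | '."""
    if out is None:
        out = {}
    for child in element:
        key = prefix + local_name(child.tag)
        if len(child):  # element has children, so descend into it
            flatten(child, key + ".", out)
        else:
            text = (child.text or "").strip()
            out[key] = f"{out[key]} | {text}" if key in out else text
    return out


def xfa_fields(xml_bytes):
    """Return the filled-in fields found under <xfa:data> as a flat dict."""
    root = ET.fromstring(xml_bytes)
    data = root.find(XFA_DATA)
    if data is None:
        raise ValueError("no <xfa:data> element found")
    return flatten(data)


# Shortened, hypothetical sample in the same shape as the output above
sample = b"""<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<xfa:data><frm:Form8 xmlns:frm="http://example.invalid/form">
<frm:CIN>U123</frm:CIN>
<frm:Terms><frm:Margin>41.19 %</frm:Margin></frm:Terms>
<frm:Prop>first</frm:Prop><frm:Prop>second</frm:Prop>
</frm:Form8></xfa:data></xfa:datasets>"""

fields = xfa_fields(sample)
print(fields)
```

Dotted keys keep fields apart that share a name in different groups, such as `Others` in this form. With pandas and openpyxl installed, `pd.DataFrame([fields]).to_excel("form.xlsx", index=False)` would then write a single row with the field names as column headers.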
ScottC
  • I've tried using this, but for some reason, the PDF won't load. Can you help me with this? I'm new to this. How can I send the PDF file to you? – arvin Oct 15 '22 at 04:55
  • Maybe upload here: https://send.internxt.com/ and then post the link. – ScottC Oct 15 '22 at 05:12
  • https://send.internxt.com/download/8c31a3e5-6a26-40d3-967a-5471a91157ca?code=59b3a8e45df62163a874d3dead3ee83e4e40a7720f597a949ec6f19832121c20 – arvin Oct 15 '22 at 06:10
  • now how to convert this into pandas dataframe? – arvin Oct 15 '22 at 08:16
  • I think first we have to convert it into pandas dataframe then only we can save it into excel – arvin Oct 15 '22 at 08:17
  • Is your plan just to dump it into excel ? Or do you plan to process it first ? – ScottC Oct 15 '22 at 09:07
  • dump it into excel with all field and values associated with it – arvin Oct 15 '22 at 09:18
  • That will take some effort. But I will leave that to you :) You have an output in bytes - you can convert that to string with `xml.decode("utf-8")`. Then you have the fun of extracting the info you need. If you split by a specific character, you will end up with a list. And then it is simple to convert a list to a dataframe. – ScottC Oct 15 '22 at 09:26
  • OK, sure, I'll try :) – arvin Oct 15 '22 at 09:42
  • With the above code, not all of the fields are coming through – arvin Oct 15 '22 at 09:47
  • If it is not getting all fields , then I am out of options. Sorry. – ScottC Oct 15 '22 at 09:51
  • Hey hiee...can you tell me how to convert JSON to CSV..? – arvin Oct 17 '22 at 08:07