I have extracted the text of some invoice PDFs using PyPDF2. I want to convert this text into a JSON file that contains only the important fields.
The output should look something like this:
#PurchaseOrder
{
"doctype":"PO",
"orderingcompany":"Demo Company",
"suppliercompany":"Demo Company",
"shipto":"Test Customer",
"ponum":"PO1234",
"podate":"01-01-2019",
"totalamount":"$1234.50",
"currency":"SGD"
}
A sample of the text I obtained from one PDF is:
PACE MEMBERSHIP WARE HOUSE
4115 Whispering Pines Circle
Grand Prairie, TX 75051
972
336
7141
56929268
PURCHASE ORDER
TO:
Elmer A. Hua
A+ Investments
1223 Cerullo Road
Lexington, KY 40507
[Phone Number]
SHIP TO:
Laurel Yan
Pace Membership Warehouse
4115 Whispering Pines Circle
Grand Prairie, TX 75051
972
336
7141
P.O. NUMBER:
PO/18
19081
[The P.O. number must appear on all related correspondence, shipping papers, and invoices]
P.O DATE
REQUISITIONER
SHIPPED VIA
F.O.B. POINT
TERMS
7/15/2006
QTY
UNIT
DESCRIPTION
UNIT PRICE
TOTAL (SGD)
100.00
1
Interlock Drifit Round Neck, ILRN
13.50
1,350.00
SUBTOTAL
1,350.00
SALES TAX
200.00
1.
Please send two copies of your invoice.
2.
Enter this order in accordance with the prices, terms, delivery method, and specifications listed above.
3.
Please notify us immediately if you are unable to ship as specified.
4.
Send all correspondence to:
Laurel Yan
4115 Whispering Pines Circle
Gra nd Prairie, TX 75051
972
336
7141
56929268
SHIPPING AND HANDLIN G
OTHER
TOTAL
1,550.00
Authorized by Laurel Yan
7/15/2006
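One rough way to approach this conversion is with label-based rules and regular expressions over the extracted lines. The sketch below is only illustrative and uses the standard library alone. The function names (`parse_purchase_order`, `lines_after`, `first_match_after`) are hypothetical, and the field mapping is an assumption based on the sample above: the first line is taken as the ordering company, the second line under "TO:" as the supplier company, the second line under "SHIP TO:" as the ship-to name, and the pieces of the P.O. number are joined with a space because the original separator is not visible in the extracted text.

```python
import json
import re

# Patterns are assumptions based on the sample text, not a general invoice grammar.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")      # e.g. 7/15/2006
AMOUNT_RE = re.compile(r"^\d{1,3}(?:,\d{3})*\.\d{2}$")   # e.g. 1,550.00
CURRENCY_RE = re.compile(r"TOTAL \(([A-Z]{3})\)")        # e.g. TOTAL (SGD)


def lines_after(lines, label, count):
    """Return up to `count` lines following the first line equal to `label`."""
    if label not in lines:
        return []
    start = lines.index(label) + 1
    return lines[start:start + count]


def first_match_after(lines, label, pattern):
    """Return the first regex match found on any line after `label`, or None."""
    if label not in lines:
        return None
    for line in lines[lines.index(label) + 1:]:
        match = pattern.search(line)
        if match:
            return match.group(0)
    return None


def parse_purchase_order(text):
    """Map the extracted text to the target dictionary (hypothetical rules)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    to_block = lines_after(lines, "TO:", 2)
    ship_block = lines_after(lines, "SHIP TO:", 2)

    # Collect P.O. number pieces until the bracketed note that follows them.
    po_parts = []
    for line in lines_after(lines, "P.O. NUMBER:", 3):
        if line.startswith("["):
            break
        po_parts.append(line)

    currency = CURRENCY_RE.search(text)
    return {
        "doctype": "PO" if "PURCHASE ORDER" in lines else None,
        "orderingcompany": lines[0] if lines else None,
        "suppliercompany": to_block[1] if len(to_block) > 1 else None,
        "shipto": ship_block[1] if len(ship_block) > 1 else None,
        "ponum": " ".join(po_parts) or None,
        "podate": first_match_after(lines, "P.O DATE", DATE_RE),
        "totalamount": first_match_after(lines, "TOTAL", AMOUNT_RE),
        "currency": currency.group(1) if currency else None,
    }


def to_json(text):
    """Serialize the parsed fields as a JSON string."""
    return json.dumps(parse_purchase_order(text), indent=2)
```

Calling `to_json(text)` on the extracted string would give a JSON object with the same keys as the target above. Rules like these are tied to one layout, so a differently formatted invoice would likely need its own labels and patterns.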