I am attempting to parse a text file using python and regex to construct a specific pandas data frame. Below is a sample from the text file I am parsing and the ideal pandas DataFrame I am seeking.
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
The ideal output should look something like:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
With my limited understanding of creating regex statements, this is the closest I have come to successfully isolating the desired text: regex tester for USDA data
I have been trying to replicate the solution from How to parse complex text files using Python?1 where applicable but my regex experience is severely lacking. Any help you can provide will greatly appreciated!