I'm working on a project where we need to extract a company name, city, state, and dollar amount from a paragraph of text. This information usually appears at the beginning of the paragraph. I've been using a regex to find the first dollar sign (which marks the amount we want) and then splitting the preceding text on commas, since we know the order the fields come in. For example:
company name, city, state, amount $123,456,653
We've run into cases where there can be any number of companies, each followed by its city and state, before the dollar amount.
Example: company name 1, city, state, company name 2, city, state, amount $123,456,653
There are also cases where the company name is given, but the next field is not the city; instead it is the name the company is operating as.
Example: company name 1, company name 1 longer, city, state, amount $123,456,653
And finally, we have seen some cases where there may be a statement saying how many companies are being given a dollar amount, followed by all of the company names.
Example (snippet): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx
Usually, the paragraph will look like this (70-80% of the time):
L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx
I'm wondering if anyone can suggest Python libraries or a better way of extracting this text. One idea I had was to take each extracted value (after splitting on commas) and check it against some kind of API to see whether it is a city or a state. That would tell us which position in the list we are at and what should come next (e.g. the state).
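To make the lookup idea concrete, here is a rough sketch of what I have in mind. It assumes a local set of state names instead of a real API; US_STATES (deliberately partial) and group_records are hypothetical names made up for illustration. A field that matches a state closes the current record, the field before it is taken as the city, and anything earlier is treated as one or more company names.

```python
# Partial set for illustration only; a real list would cover every state and territory.
US_STATES = {"Maryland", "Ohio", "Texas", "Washington", "District of Columbia"}


def group_records(fields):
    """Group comma-separated fields into records of names, city, and state.

    A field that is a known state closes the current record, but only when
    at least a name and a city are already pending. That guard keeps a city
    such as "Washington" from being mistaken for the state.
    """
    records = []
    pending = []
    for field in fields:
        field = field.strip()
        if field in US_STATES and len(pending) >= 2:
            city = pending.pop()
            records.append({"names": pending, "city": city, "state": field})
            pending = []
        else:
            pending.append(field)
    # pending holds any leftover fields that never reached a state
    return records, pending


fields = ["company name 1", "company name 1 longer", "Millersville", "Maryland",
          "company name 2", "Wilmington", "Ohio"]
records, leftover = group_records(fields)
for record in records:
    print(record)
```

This would cover both the multiple-company case and the "operating as" case, but it still depends on the state list being complete and on names not containing a state as a separate comma-delimited field.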
This is the current regex I am using: r'([^$]*),.*?\$([0-9,]+)'
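For reference, this is roughly how the pattern is applied to the typical case, as a minimal sketch. The sample text is the "usual" sentence from above with the truncated tail left off, and the variable names are only for illustration.

```python
import re

# The pattern from above, closed with its trailing quote.
PATTERN = re.compile(r'([^$]*),.*?\$([0-9,]+)')

text = ("L-3 Chesapeake Sciences Corp., Millersville, Maryland, "
        "is being awarded a $43,094,331 fixed-price-incentive")

match = PATTERN.search(text)

# Group 1 is everything up to the last comma before the first dollar sign.
fields = [part.strip() for part in match.group(1).split(",")]

# Group 2 is the amount; drop the thousands separators before converting.
amount = int(match.group(2).replace(",", ""))

print(fields)
print(amount)
```

Splitting on commas works here because none of the fields contains a comma. A name written like "Acme, Inc." would be split in two, which is part of why the other cases break.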