Extracting text from paragraphs using Python

Question

I'm working on a project where we want to extract a company name, city, state, and dollar amount from a block of text in a paragraph. Usually, this information will be at the beginning of the paragraph, and I've been using a regex to find the first dollar sign (which would be the amount we are extracting), and finding the text between each comma since we know which order the text comes in. For example:

company name, city, state, amount $123,456,653

We've run into cases where there could be any number of companies, each followed by its city and state, before the dollar amount.

Example: company name 1, city, state, company name 2, city, state, amount $123,456,653

There could also be cases where the company name is given, but the next piece of info is not the city but rather the name the company operates as (xxx).

Example: company name 1, company name 1 longer, city, state, amount $123,456,653

And finally, we have seen some cases where there may be a statement saying how many companies are being given a dollar amount, followed by all of the company names.

Example (snippet): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx

Usually, the paragraph will look like this (70-80% of the time):

L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx

Just wondering if anyone has some suggestions on libraries for python or a better way of extracting the specific text. I thought about implementing some type of API that would take the extracted value (after separating by comma) and run it by checking if it is a city or state, and then we could potentially have an idea as to which position in the list the data is and what might be next up (state).

This is the current regex I am using: r'([^$]*),.*?\$([0-9,]+)'

Wow. This is ambitious. I personally have doubts that regex will work well here, because regex requires some kind of standardization. If there are varying orders, especially regarding city names, this will be difficult. First off though, you should post more samples. Secondly, it might be nice if you posted what your desired output would be... — FailSafe, Mar 24 '19 at 00:37
Secondly, in your 70-80% example, is `L-3` typical? In a paragraph blob, you'd need something that primes the regex to know that what is captured in a group represents a company name, as distinct from just other words — FailSafe, Mar 24 '19 at 00:48
The last example (L-3) was a snippet as well, but within a paragraph, this type of information is usually within the first sentence about 70-80% of the time. Company name first, city, state, and the dollar amount. - What about a text analytics library ? — dataviews, Mar 24 '19 at 00:50
I would recommend you use a database of city/company names (anything that constantly changes), alongside a regex. As @FailSafe mentioned, there may be varying orders of data. I also recommend splitting strings into [ngrams](https://en.wikipedia.org/wiki/N-gram) and using a regex pattern against them. You would need a database of ngrams to start with. I have one, but it may not be suited exactly to your task; however, there are certainly big company names in it. Either way you'll need to use more than just regex. — GKE, Mar 24 '19 at 01:49
A text analysis library may work,but honestly that's beyond my capability :( — FailSafe, Mar 24 '19 at 01:56
@FailSafe A text analysis library might be an overkill, I still think the best way is to use an ngrams database in combination with regex. — GKE, Mar 24 '19 at 02:29
Maybe. But I've gotta tell you, based on the criteria, you'll need to right several regexes, or hopefully find something consistent that will precede the company name. I'm shuddering imagining trying to solve this. Nightmares — FailSafe, Mar 24 '19 at 02:32
lol, you and me both. I think for the 70-80% of the time I can capture what I need, but there may need to be some manual entry from the user. — dataviews, Mar 24 '19 at 02:38
lol. I thought @GKE was you replying to me for a second, dataviews. Anyway, ugh, good luck. Ah man, above I wrote "right" instead of "write" — FailSafe, Mar 24 '19 at 02:43
@dataviews Here's the [database](https://github.com/gloriankosi/ngram-sql-database) that may be of help, good luck. — GKE, Mar 24 '19 at 04:07
You write: _we want to extract a company name, city, state, and dollar amount_. What is your desired output format when there are a number of companies > 1? — Armali, Mar 26 '19 at 07:31
@GKE Hey I might have thought of a solution. What are your thoughts? So we know that the order can be: COMPANY NAME, CITY, STATE or COMPANY NAME, Doing Business as Name, City, State. Do you think we could measure how far apart the city positions are from one another? For instance, the first city is captured at position 1, whereas the city in the second example I show here is at position 2. So essentially we could say, hey, in the second string captured, city is 3 positions away from the last city captured, and therefore we know how many pieces of info go to which company extracted from the text — dataviews, Apr 16 '19 at 02:08
@FailSafe Hey I might have thought of a solution. What are your thoughts? So we know that the order can be: COMPANY NAME, CITY, STATE or COMPANY NAME, Doing Business as Name, City, State. Do you think we could measure how far apart the city positions are from one another? For instance, the first city is captured at position 1, whereas the city in the second example I show here is at position 2. So essentially we could say, hey, in the second string captured, city is 3 positions away from the last city captured, and therefore we know how many pieces of info go to which company extracted from the text — dataviews, Apr 16 '19 at 02:08
@dataviews Might be better to base it on the "state". Because there are only 50 states, you can create a list of states, then create something that essentially guesses `when a state is found search for *dollar sign* or *floating number* within 25 characters of the state and capture at least 40 characters before the state`. But really, without consistent formatting you've got quite a problem on your hands. I'd say to assist us, you should post a few samples of the documents spanning at least 400 characters long so everyone can see what you're building against and how consistent it is. — FailSafe, Apr 16 '19 at 09:36
@dataviews I agree with FailSafe on basing the search on the states. That's not to say your solution won't work at all; it would just take adapting, and this is where you start leaning towards a machine learning and AI solution as opposed to regex only. If we know that in one large blob of text City 1 is located at position 1 and City 2 at position 2, and so on, then you have a very delicate solution that depends solely on the structure of the document. If you decide to go with that kind of implementation, you should have components that handle text not in the expected format. — GKE, Apr 16 '19 at 09:51

score 0 · Answer 1 · answered Sep 09 '19 at 03:13

You can likely design an expression to capture the companies listed in the paragraph, such as:

(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)\s+\(\s*([a-z0-9]{13};?)\s*\)

and add or remove boundaries as you wish; you could write similar expressions for the other cases.

Test

import re

string = """
Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);

"""

expression = r'(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)\s+\(\s*([a-z0-9]{13};?)\s*\)'
matches = re.findall(expression, string)

print(matches)

Output

[(' ABX Air Inc.', ' Wilmington', 'Ohio', 'HTC71119DC002'), (' Air Transport International Inc.', ' Wilmington', 'Ohio', 'HTC71119DC003'), (' Alaska Airlines Inc.', ' Seattle', 'Washington', 'HTC71119DC004'), (' Allegiant Air LLC', ' Las Vegas', 'Nevada', 'HTC71119DC005'), (' American Airlines', ' Fort Worth', 'Texas', 'HTC71119DC006'), (' Amerijet International Inc.', ' Fort Lauderdale', 'Florida', 'HTC71119DC007'), (' Atlas Air Inc.', ' Purchase', 'New York', 'HTC71119DC008;'), (' Delta Air Lines Inc.', ' Atlanta', 'Georgia', 'HTC71119DC009'), (' Federal Express Corp.', ' Washington', 'District of Columbia', 'HTC71119DC010')]
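
The tuples above keep the leading space that followed each comma, and one contract ID carries the stray semicolon from "(HTC71119DC008;)" in the input. A minimal post-processing sketch, assuming you want one dictionary per company, might look like this. The `clean_match` helper and the dictionary keys are hypothetical names chosen for illustration, not part of the original answer.

```python
# Two tuples copied from the output above, used as sample input.
raw_matches = [
    (' ABX Air Inc.', ' Wilmington', 'Ohio', 'HTC71119DC002'),
    (' Atlas Air Inc.', ' Purchase', 'New York', 'HTC71119DC008;'),
]


def clean_match(match):
    """Turn one re.findall tuple into a tidy dictionary (hypothetical helper)."""
    # Strip the whitespace left over from the ", " separators.
    company, city, state, contract_id = (field.strip() for field in match)
    return {
        "company": company,
        "city": city,
        "state": state,
        # Drop the misplaced semicolon captured inside the parentheses.
        "contract_id": contract_id.rstrip(";"),
    }


records = [clean_match(m) for m in raw_matches]
for record in records:
    print(record)
```

Cleaning after the match, rather than tightening the expression, keeps the regex readable and tolerates small punctuation slips in the source text.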

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
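
For the other cases in the question (one company, several companies, or a company followed by an "operating as" name, all before the dollar amount), the comments suggest anchoring on the state name. The following is only a rough sketch of that idea, not a tested solution: it splits the text before the first dollar sign on commas and closes a company record each time a field is a known state. `STATES`, `AMOUNT_RE`, and `parse_award` are hypothetical names, and the state set is deliberately incomplete; you would fill it in yourself.

```python
import re

# Assumption: a hand-maintained set of state names; only a few are listed here.
STATES = {"Maryland", "Ohio", "Texas", "New York", "District of Columbia"}

# First dollar amount in the text, e.g. "$43,094,331".
AMOUNT_RE = re.compile(r"\$([0-9][0-9,]*)")


def parse_award(text):
    """Return (companies, amount) using the text before the first dollar sign."""
    amount_match = AMOUNT_RE.search(text)
    if amount_match is None:
        return [], None
    amount = int(amount_match.group(1).replace(",", ""))

    # Comma-separated fields that precede the dollar amount.
    fields = [f.strip() for f in text[:amount_match.start()].split(",")]

    companies, pending = [], []
    for field in fields:
        # A state closes a record only if a name and a city came before it,
        # so a city such as "Washington" is not mistaken for the state.
        if field in STATES and len(pending) >= 2:
            *names, city = pending
            companies.append({
                "company": names[0],
                "aliases": names[1:],  # e.g. an "operating as" name
                "city": city,
                "state": field,
            })
            pending = []
        else:
            pending.append(field)
    # Leftover fields such as "is being awarded a" are ignored.
    return companies, amount


sample = ("L-3 Chesapeake Sciences Corp., Millersville, Maryland, "
          "is being awarded a $43,094,331 fixed-price-incentive")
print(parse_award(sample))
```

This breaks when a company name itself contains a comma (for example "Acme, Inc."), and when the companies are listed after the amount, as in the "Twenty-five companies" paragraph, so some manual review would still be needed.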
