
I have a bunch of strings like this in a file:

M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985

and I'd like to extract Tufts University, American International College, American University, University of Massachusetts, etc., but not the high schools (it's probably safe to assume that anything containing "Academy" or "High School" is a high school). Any ideas?

Myer

1 Answer


I tested this with preg_match_all in PHP, and it works for the sample text you provided:

 /(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/
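For anyone not using PHP, here is a rough Python sketch of the same idea. It assumes Python's re module (which supports these lookarounds) and uses the three sample lines from the question; the names PATTERN, SAMPLE and extract_schools are made up for illustration.

```python
import re

# The first pattern from the answer, unchanged apart from dropping the PHP delimiters.
PATTERN = re.compile(r"(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)")

# The three sample lines from the question.
SAMPLE = """\
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
"""

def extract_schools(text):
    """Return every match of PATTERN, one line at a time, with whitespace trimmed."""
    names = []
    for line in text.splitlines():
        # finditer + group(0) yields the whole match; findall would return
        # only the captured keyword because the pattern contains a group.
        for match in PATTERN.finditer(line):
            # The match starts right after the comma, so it carries a leading space.
            names.append(match.group(0).strip())
    return names

print(extract_schools(SAMPLE))
# ['Arizona University', 'American International College', 'American University']
```

The strip() call is there because the lookbehind only asserts the comma, so the space after it ends up inside the match.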

This regex will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
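One possible way to do that modification, sketched in Python: consume the surrounding comma (or digit) instead of asserting it, and pull the name out of a capturing group. This variant is my own assumption about what "modified somewhat" could look like, not something tested against the full data; NO_LOOKAROUND and extract_without_lookarounds are hypothetical names.

```python
import re

# Hypothetical lookaround-free variant: the leading comma and the trailing
# comma/digit are consumed, and the name itself is captured in group 1.
NO_LOOKAROUND = re.compile(
    r",\s*([\w\s]*(?:College|University|Institute)[^,\d]*)[,\d]"
)

def extract_without_lookarounds(line):
    """Return the captured names from one line, trimmed of surrounding whitespace."""
    # With exactly one capturing group, findall returns that group's text.
    return [name.strip() for name in NO_LOOKAROUND.findall(line)]

print(extract_without_lookarounds("B.A., American University, Washington, D.C., 1985"))
# ['American University']
```

One caveat with this approach: because the trailing comma is consumed, a second school name that directly follows the first (sharing that comma) would be skipped, whereas the lookaround version would still see it.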


Update: I looked at your linked sample text and updated the regex accordingly:

 /([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/

The first part matches a capital letter followed by one or more characters other than whitespace, commas, or periods, optionally followed by a period, then a whitespace character, then optionally an opening parenthesis. This group is matched zero or more times.

This should get all relevant words preceding the keywords.
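A quick Python sketch of the updated pattern, again assuming the re module and only the three sample lines from the question (the linked sample text is not reproduced here). UPDATED and extract_updated are illustrative names.

```python
import re

# The updated pattern from the answer, split across lines for readability.
UPDATED = re.compile(
    r"([A-Z][^\s,.]+[.]?\s[(]?)*"  # zero or more capitalized words before the keyword
    r"(College|University|Institute|Law School|School of|Academy)"
    r"[^,\d]*(?=,|\d)"             # rest of the name, up to a comma or digit
)

def extract_updated(line):
    """Return the full text of every match in one line."""
    # group(0) is the whole match; the two groups only hold the last
    # preceding word and the keyword.
    return [m.group(0).strip() for m in UPDATED.finditer(line)]

lines = [
    "M.S., Arizona University, Tucson, Az., 1957",
    "B.A., American International College, Springfield, Mass., 1978",
    "B.A., American University, Washington, D.C., 1985",
]
for line in lines:
    print(extract_updated(line))
# ['Arizona University']
# ['American International College']
# ['American University']
```

Since this version no longer relies on a preceding comma, abbreviations like "M.S." and "Az." are rejected by the word group itself: the capital letter must be followed by at least one character that is not a period, comma, or whitespace.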

jisaacstone