I am trying to find mentions of age in a large dataset of messages posted by users on the internet (stored in a .csv file).

I am currently using regular expressions in Python to extract each age and save it in a list.

For example, "I am 20 years old" would add 20 to the list, "He is 30 now" would add 30, and "She is in her fifties" would add 50.

But the problem is that using regular expressions is very slow on a huge dataset, and if the text follows a pattern my expressions don't cover, I cannot get the age. So my question is: is there a better way of doing this? Perhaps some NLP packages/tools in Python? I tried researching whether nltk has something for this, but it doesn't.

P.S. Sorry if the question is unclear; English is not my first language. I have included some of the regular expressions I used below.

```python
import re

# s is the text of one message read from the CSV
m = re.search(r'.*(I|He|She) (is|am) ([0-9]{2}).*',s,re.IGNORECASE)
n = re.search(r'.*(I|He|She) (is|am) in (my|his|her) (late|mid|early)? ?(tens|twenties|thirties|forties|fifties|sixties|seventies|eighties|nineties|hundreds).*',s,re.IGNORECASE)
o = re.search(r'.*(I|He|She) (is|am) (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)?.*',s,re.IGNORECASE)
p = re.search(r'.*(age|is|@|was) ([0-9]{2}).*',s,re.IGNORECASE)
q = re.search(r'.*(age|is|@|was) (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)?.*',s,re.IGNORECASE)
r = re.search(r'.*([0-9]{2}) (yrs|years).*',s,re.IGNORECASE)
# named t rather than s so the match result does not overwrite the input text
t = re.search(r'.*(twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen) ?(one|two|three|four|five|six|seven|eight|nine)? (yrs|years).*',s,re.IGNORECASE)
```
krzna
  • Can you optimize your algorithm, perhaps by precompiling the regexes and compiling the Python script into an executable? Can you provide an example of how you use it and the logic behind it? – AlexanderB Apr 10 '15 at 23:55
  • I will try compiling my script into an executable and see if it's faster, but it seems precompiling the regexes would not be of much use [(see here)](http://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile). I use an if..elif ladder with re.search: if I find a match, I don't scan through the rest of the patterns; I just save the age to an array and continue to the next message read from the CSV. If I don't, I keep scanning the text with the next pattern, and so on. Is there a better way to speed things up? – krzna Apr 11 '15 at 00:12
  • If the code can't be optimized any further, maybe the data can be. Before searching for ages, you can lowercase all letters (so the IGNORECASE flag is no longer needed) and convert words that represent numbers to digits (so the lengthy regexes are no longer needed), then store the result in a temporary CSV. Then run your code on the optimized data. – AlexanderB Apr 11 '15 at 00:38
  • Thank you for the suggestions. Preprocessing did help speed things up. But is there a tool/package that would do this for me or a way I could avoid regex to achieve this? Because most of my data is from public forums on the internet, there is a lot of variation in the way people state their age, and a lot of data passes through uncaught when I use regex :( – krzna Apr 13 '15 at 16:45
  • @krzna that's mainly related to how much data you want to catch -- this sounds like an NLP problem, so I don't think there's a pre-rolled solution out there, and it's not usually so easy to put together your own. The regex approach seems like a reasonable alternative to pulling out the NLP machinery, but it will probably miss more stuff. With respect to speeding the process up, have you considered [parallelizing](http://bit.ly/1IYLpJV) your program? This is a case where it would almost certainly help out. [p.s.](http://bit.ly/1FP7Dxn) – Dan Apr 14 '15 at 02:09
  • It might be a use case for https://github.com/WojciechMula/pyahocorasick – amirouche Mar 21 '19 at 22:26
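
The preprocessing idea from the comments (lowercase once, turn number words into digits, then search) might look roughly like the sketch below. It is only an illustration under assumptions: the lookup tables, `words_to_digits`, and `extract_age` are hypothetical names, and the age patterns are a small subset of the ones in the question. The leading and trailing `.*` are dropped because `re.search` already scans the whole string, and the separate patterns are merged into one compiled alternation so each message is searched once.

```python
import re

# Hypothetical lookup tables for this sketch; extend them as needed.
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
DECADES = {"twenties": 20, "thirties": 30, "forties": 40, "fifties": 50,
           "sixties": 60, "seventies": 70, "eighties": 80, "nineties": 90}
SMALL = {**TEENS, **UNITS}

# Matches "twenty", "twenty five", "twenty-five", or a single small number word.
NUMBER_WORDS = re.compile(
    r"\b(?:(%s)(?:[ -](%s))?|(%s))\b"
    % ("|".join(TENS), "|".join(UNITS), "|".join(SMALL))
)

# One combined pattern, run on text that is already lowercased and digitized.
AGE = re.compile(
    r"\b(?:i am|i'm|he is|she is|age|aged) (\d{1,3})\b"
    r"|\b(\d{1,3}) ?(?:yrs?|years?)\b"
    r"|\bin (?:my|his|her) (?:early |mid |late )?(%s)\b" % "|".join(DECADES)
)


def words_to_digits(text):
    """Lowercase the text and replace number words up to 99 with digits."""
    def repl(match):
        if match.group(1):  # a tens word, optionally followed by a unit word
            return str(TENS[match.group(1)] + UNITS.get(match.group(2), 0))
        return str(SMALL[match.group(3)])
    return NUMBER_WORDS.sub(repl, text.lower())


def extract_age(message):
    """Return the first age found in the message, or None if nothing matches."""
    match = AGE.search(words_to_digits(message))
    if match is None:
        return None
    if match.group(3):  # "in her fifties" style
        return DECADES[match.group(3)]
    return int(match.group(1) or match.group(2))


if __name__ == "__main__":
    for msg in ("I am 20 years old", "He is 30 now", "She is in her fifties"):
        print(msg, "->", extract_age(msg))
```

This still misses any phrasing the pattern does not cover and can produce false positives (for example "I am 5 minutes late"), so it addresses the speed concern more than the coverage one.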

2 Answers

See the question "Extracting a person's age from unstructured text in Python", particularly the answer that uses AllenNLP, which appears to be just what you're asking for.

chaos

I'd recommend training a neural network with three multiclass classifier heads, one each to predict the ones, tens, and hundreds digits of the age.
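
The answer does not specify an architecture, so the following is only a sketch of the part that would be the same in any framework: how an age could be split into three 10-way class labels for training, and how the three heads' outputs could be recombined at prediction time. The function names are hypothetical, and the network itself is deliberately left out.

```python
def age_to_digit_labels(age):
    """Split an age in 0..999 into (hundreds, tens, ones) class labels."""
    if not 0 <= age <= 999:
        raise ValueError("age must be between 0 and 999")
    return age // 100, (age // 10) % 10, age % 10


def argmax(scores):
    """Index of the largest score in a sequence."""
    return max(range(len(scores)), key=scores.__getitem__)


def digit_scores_to_age(hundreds_scores, tens_scores, ones_scores):
    """Take the most likely digit from each head and recombine into an age."""
    return (100 * argmax(hundreds_scores)
            + 10 * argmax(tens_scores)
            + argmax(ones_scores))


if __name__ == "__main__":
    print(age_to_digit_labels(57))
```

Each head would then be trained as an ordinary 10-class classifier on its own label.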

Lerner Zhang