0

I am having issues capturing integers and dates correctly with regular expressions.

Integers

int_test = "Today is 6/28/2017 with 17.5 percent chance of rain"

int_pattern = re.findall(r'\d[0-9].*', int_test)

The problem with this regular expression is that it captures "6, 28, 2017, 17, and 5" from int_test. I have not been able to find a way to capture only integers surrounded by whitespace.

Dates

date_test = "Today is 6/28/2017 or June/28/2017 or 28/June/2017 or Jun/28/2017 or 28-Jun-2017"

date_pattern = re.findall(r'\d.*[- /]\d+', date_test)

For this one, I have already written code to support either "/" or "-" between the date parts. I can successfully capture any digits before or after the "/" or "-", but I need a way to capture any number of characters before or after the "/" or "-" in the sentence.

Any help would be greatly appreciated!

rmahesh
  • "ONLY if there is no letter, digit, or character either left or right of the captured integer.", what does this mean? Surely if there is no character either left or right of the captured integer the string is only one character long? – Tom Wyllie Jun 28 '17 at 15:40
  • Please add the expected output: what exactly you want to capture in both cases. – streetturtle Jun 28 '17 at 15:45
  • @TomWyllie So for example, I only want to capture the integer when there is no single letter [A-Z] or [a-z] and no other symbol next to it (in this example, that would be "/" for dates and "." for floats). I am having a problem specifically EXCLUDING those other single characters and symbols. – rmahesh Jun 28 '17 at 15:51
  • @streetturtle I expect to capture only integers that have no other character or digit directly before or after them. For the int_test string, nothing should be captured, because there is a "/" or "." before or after every single integer. If the string was "There are 60 minutes in an hour", only the 60 should be captured. – rmahesh Jun 28 '17 at 15:53
  • Have you tried using any of the online regex testers to play around with your pattern? like regex101.com? – wwii Jun 28 '17 at 15:55
  • Regular expressions work with strings; there is no notion of data types or integers, there are only digits. This is why I am asking. So you define your rules for the "integers". From your description it seems that an integer should be `\s\d+\s`. – streetturtle Jun 28 '17 at 15:56
  • @wwii Yes, I have actually predominately used regex101.com to play around with it, I attribute much of my progress to solely that. Just stuck with these few parts left to complete that I am lost on. – rmahesh Jun 28 '17 at 15:57
  • Why not just write three different patterns? – wwii Jun 28 '17 at 15:58
  • @streetturtle I didn't want to blow up the initial post with the description of my project, but I will quickly summarize. I need to do ETL operations on Pandas data frames. One of the main things I am going to do is first convert all data frames to strings, iterate through one column at a time, and run regexes to detect data types (so if a value has the pattern I described above, it would be an int). Then I will count the number of items that match each pattern. If that count equals the total number of rows in the column, I convert the entire column to that pattern's data type. If not, to string. – rmahesh Jun 28 '17 at 16:00
  • @wwii I thought about doing that, but I am still having issues with matching the words (March, Mar etc) before or after any of the "/" or "-". – rmahesh Jun 28 '17 at 16:02
  • `[int(s) for s in Dates.split("+") if s.lstrip("-").isdigit()]` extracts all the integers in your dates – khelili miliana Jun 28 '17 at 16:02
  • Can ```"capture integers ONLY if there is no letter, digit, or character either left or right of the captured integer"``` be re-phrased as ```"capture integers surrounded by whitespace"```? – wwii Jun 28 '17 at 16:05
  • @wwii Just edited and changed it, thank you. – rmahesh Jun 28 '17 at 16:07
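
The column-typing idea described in the comments above could be sketched roughly like this. It is only an illustration under some assumptions: the column is already a plain list of strings (no pandas involved), each cell is tested as a whole with `fullmatch`, and the names `INT_RE`, `DATE_RE` and `infer_column_type` are made up for the example.

```python
import re

# Hypothetical patterns: a cell counts as an int if it is digits only,
# and as a date if it looks like part/part/yyyy or part-part-yyyy.
INT_RE = re.compile(r'\d+')
DATE_RE = re.compile(
    r'(?:\d{1,2}|[a-zA-Z]{2,8})[/-](?:\d{1,2}|[a-zA-Z]{2,8})[/-]\d{4}'
)


def infer_column_type(values):
    """Return 'int' or 'date' if every cell matches that pattern, else 'string'."""
    cells = [str(v).strip() for v in values]
    for name, pattern in (("int", INT_RE), ("date", DATE_RE)):
        # Count the cells that match this pattern in full.
        matches = sum(1 for cell in cells if pattern.fullmatch(cell))
        # Convert only when the count equals the number of rows.
        if cells and matches == len(cells):
            return name
    return "string"


print(infer_column_type(["1", "22", "333"]))             # int
print(infer_column_type(["6/28/2017", "28-Jun-2017"]))   # date
print(infer_column_type(["1", "17.5"]))                  # string
```

Testing whole cells with `fullmatch` sidesteps the "surrounded by whitespace" problem, since a cell like "17.5" simply fails the integer pattern.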

3 Answers

1

Here is the regex for integers: `\s(\d+)\s`. It uses a capturing group, which you can refer to in order to get the digits without the surrounding whitespace.
Demo: https://regex101.com/r/eefnS1/1
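
In Python this pattern might be used as below. The lookaround variant `(?<!\S)\d+(?!\S)` is not part of the answer; it is an assumed alternative that checks the neighbouring characters without consuming them, so integers at the start or end of the string, or integers sharing a single space, are also found. The string `edge_test` is a made-up sample.

```python
import re

# Pattern from the answer: one whitespace character on each side of the digits.
SPACED_INT = re.compile(r'\s(\d+)\s')

# Assumed alternative: "not preceded by a non-space, not followed by a non-space".
LOOKAROUND_INT = re.compile(r'(?<!\S)\d+(?!\S)')

int_test = "Today is 6/28/2017 with 17.5 percent chance of rain"
hour_test = "There are 60 minutes in an hour"
edge_test = "12 34 56"  # hypothetical: integers at both edges, sharing spaces

spaced_none = SPACED_INT.findall(int_test)
spaced_hour = SPACED_INT.findall(hour_test)
spaced_edge = SPACED_INT.findall(edge_test)
lookaround_none = LOOKAROUND_INT.findall(int_test)
lookaround_edge = LOOKAROUND_INT.findall(edge_test)

print(spaced_none)       # []
print(spaced_hour)       # ['60']
print(spaced_edge)       # ['34']
print(lookaround_none)   # []
print(lookaround_edge)   # ['12', '34', '56']
```

With `\s(\d+)\s` the trailing space of one match is consumed and cannot serve as the leading space of the next, which is why only '34' comes back for `edge_test`.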

And here is the regex for dates:

(\d{1,2}|[a-zA-Z]{2,8}) # day or month
(?:[\/-]{1})            # separator
(\d{1,2}|[a-zA-Z]{2,8}) # day or month
(?:[\/-]{1})            # separator
(\d{4})                 # year

Demo: https://regex101.com/r/fo11qf/1/
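
A commented, multi-line pattern like the one above can be used in Python with the `re.VERBOSE` flag. The following is a sketch rather than a drop-in solution: the `{1}` quantifier and the `\/` escape are dropped because they are not needed in Python, and the name `DATE_RE` is just for illustration.

```python
import re

# re.VERBOSE lets the pattern span several lines and carry # comments.
DATE_RE = re.compile(r"""
    (\d{1,2}|[a-zA-Z]{2,8})  # day or month
    [/-]                     # separator
    (\d{1,2}|[a-zA-Z]{2,8})  # day or month
    [/-]                     # separator
    (\d{4})                  # year
""", re.VERBOSE)

date_test = ("Today is 6/28/2017 or June/28/2017 or 28/June/2017 "
             "or Jun/28/2017 or 28-Jun-2017")

# group(0) is the whole matched date; findall returns the captured parts.
full_dates = [m.group(0) for m in DATE_RE.finditer(date_test)]
parts = DATE_RE.findall(date_test)

print(full_dates)
# ['6/28/2017', 'June/28/2017', '28/June/2017', 'Jun/28/2017', '28-Jun-2017']
print(parts[0])
# ('6', '28', '2017')
```

Because the pattern contains capturing groups, `re.findall` returns tuples of the groups; `finditer` with `group(0)` is one way to get the full date strings instead.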

streetturtle
  • When loading the code into PyCharm, the regex for integers is picking up '2' but is not picking up '345' in the example you demonstrated on regex101. But the regex for dates works perfectly! – rmahesh Jun 28 '17 at 16:16
  • @rmahesh Note that the *global* flag is set on the regex101 example. – SamWhan Jun 28 '17 at 16:38
  • @ClasG I come from more of a stats background and am not the most proficient in coding. Could you possibly explain what that means, and why it would cause the regex to not run in PyCharm? – rmahesh Jun 28 '17 at 16:40
  • I don't *speak* python, but search and you'll find. [Check this for example](https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python) – SamWhan Jun 28 '17 at 16:47
  • @ClasG Thank you, will do. – rmahesh Jun 28 '17 at 17:26
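
For readers wondering about the *global* flag mentioned in the comments: Python has no such flag. Roughly speaking, `re.search` behaves like a non-global match (first hit only), while `re.findall` and `re.finditer` return every hit. A small sketch, using a made-up `sample` string rather than the text from the regex101 demo:

```python
import re

pattern = re.compile(r'\s(\d+)\s')
sample = "order 2 shipped with 345 items"  # hypothetical text

# search stops at the first match, like a regex without the global flag.
first = pattern.search(sample)
first_only = first.group(1) if first else None

# findall scans the whole string, like a regex with the global flag set.
all_matches = pattern.findall(sample)

print(first_only)    # 2
print(all_matches)   # ['2', '345']
```
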
1
\b\w+[/-]\w+[/-]\d{2,4}\b

This will capture all of your dates and is a bit more efficient, but it will also capture other things like foo/bar/1111.
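
One possible way to weed out false positives such as foo/bar/1111 is to pass each candidate through `datetime.strptime`. This is a sketch and not part of the answer: the list `CANDIDATE_FORMATS` and the helper `parse_date` are assumptions made for the example, and month names are parsed according to the current locale (English by default).

```python
import re
from datetime import datetime

# Loose pattern from the answer: word/word/digits or word-word-digits.
LOOSE_DATE = re.compile(r'\b\w+[/-]\w+[/-]\d{2,4}\b')

# Hypothetical list of accepted layouts; extend it to suit your data.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%B/%d/%Y", "%d/%B/%Y", "%b/%d/%Y", "%d-%b-%Y")


def parse_date(text):
    """Return a datetime if text fits one of the formats, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


sample = "Today is 28-Jun-2017, see foo/bar/1111"
candidates = LOOSE_DATE.findall(sample)
real_dates = [c for c in candidates if parse_date(c) is not None]

print(candidates)   # ['28-Jun-2017', 'foo/bar/1111']
print(real_dates)   # ['28-Jun-2017']
```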

wwii
0

I believe a regex like this is what you're looking for: \s(\d+)\s

Tom Wyllie
  • I need to capture integers surrounded by whitespace. In the example I gave in my post, nothing should be returned, but all of the integers next to the "/" or "." are being returned. I need to figure out a way to capture only integers with whitespace before and after them. – rmahesh Jun 28 '17 at 16:28
  • I now understand the first part of your question. I am still confused about the second, but I suspect streetturtle has already answered what you want for that part. – Tom Wyllie Jun 28 '17 at 16:32
  • Yes Tom, streetturtle has answered that portion already. His answer for the integers is almost perfect; it just isn't picking up the last integer in the example he linked when I run the same thing in PyCharm. – rmahesh Jun 28 '17 at 16:34