2

I have lines I am iterating through that look like this:

random text and A08524SDD here (00-04) more random text
lame text (junk data) more text (08-12) more text 4000 5553
random text and numbers 44553349 (2008) 
random text (2005) junk text (junk)
nothing important (13-15) not important (not important)

I am trying to figure out how to pull ONLY the dates (a range or a single year) from the parentheses, without pulling the other random junk that also appears in parentheses.

Currently using this, but it is returning the random text as well:

date = re.findall('\(([^)]+)', line)

Edit: I am iterating over the text one line at a time; it is not one single string. I have a for loop that searches each line and tries to extract the date range from it. Also, there are random numbers included in the random text, so I cannot just search for ##-## or #### in the entire string. The match has to be enclosed in ()'s.

Edit2: @CarySwoveland has answered my original question. As a bonus, I do have a few lines that look like this; it would be nice if they could be handled as well.

random text and numbers 44553349 (2008 important text) 
random text (2005 important text) junk text (junk) 55555555 (08-09 important text)
nothing important (13-15) not important (not important)(2008 important text)

In the lines with more than one () that starts with a ##-## or a ####, I need to grab all of them WITH the text. Out of about 35,000 lines of text, only 50 or so have these random issues, and I do not mind fixing them by hand. But if a solution exists, it would be nice to implement.

THANK YOU TO ALL WHO HAVE POSTED! THIS HAS HELPED ME OUT GREATLY!!!!

Lzypenguin
  • will it always be a 4 digit number or 2 digits separated by - ? – Derek Eden May 16 '20 at 05:44
  • @DerekEden Yes, for the most part. It will always be (####) or (##-##). There are a few situations where it is (##-## text), and pulling the entire (##-## text) would be fine, as would just ##-##. But those are so few that I can fix them manually if I need to. – Lzypenguin May 16 '20 at 05:46

3 Answers

2

As per both your question and your added comments I would suggest the following pattern:

(?<=\()\d\d-?\d\d.*?(?=\))

This caters for all the patterns of interest: (####), (##-##), (##-## text) and also (#### text).


From left to right:

  • (?<=\() - Positive lookbehind for an opening parenthesis
  • \d\d-?\d\d - Two digits followed by an optional hyphen and two more digits
  • .*? - Match any characters other than newlines, as few as possible (non-greedy)
  • (?=\)) - Positive lookahead to check for a closing parenthesis

If you want to be more explicit about what may sit between the 4th digit and the closing parenthesis, you could extend the pattern. For example (?<=\()\d\d-?\d\d(?:\s\w+)?(?=\)), where a non-capturing group (?:...) checks for a whitespace character \s followed by one or more word characters \w+. The trailing ? makes that group optional, and it is followed by the same positive lookahead as above.
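A quick sketch of how that stricter variant behaves on one of the sample lines. As written, (?:\s\w+)? admits only one word after the date, so a two-word suffix such as "2008 important text" is skipped. The `many_words` pattern below is an assumed variation (not part of the pattern above) that swaps the ? for * to allow any number of words.

```python
import re

# Stricter pattern from the paragraph above: at most ONE word after the date.
one_word = r'(?<=\()\d\d-?\d\d(?:\s\w+)?(?=\))'
# Assumed variation for illustration: allow zero or more words after the date.
many_words = r'(?<=\()\d\d-?\d\d(?:\s\w+)*(?=\))'

line = 'nothing important (13-15) not important (not important)(2008 important text)'

print(re.findall(one_word, line))    # ['13-15']
print(re.findall(many_words, line))  # ['13-15', '2008 important text']
```

Which of the two fits depends on how much free text you expect inside the parentheses.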

Don't forget: when using these patterns in Python, make sure to write them as raw strings.
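A small illustration of why the r prefix matters. Escapes such as \d and \( happen to pass through a normal string unchanged (recent Python versions warn about them), but an escape like \b is silently turned into a different character. The sample line is taken from the question.

```python
import re

# In a normal string '\b' is a single backspace character; in a raw string
# it stays as backslash + 'b', which the regex engine reads as a word boundary.
print(len('\b'), len(r'\b'))  # 1 2

# Raw string: every backslash reaches the regex engine untouched.
pattern = r'(?<=\()\d\d-?\d\d.*?(?=\))'
print(re.findall(pattern, 'random text and numbers 44553349 (2008)'))  # ['2008']
```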

Note: I escaped both the opening and closing parentheses in the lookarounds with a backslash, i.e. \( and \), to use them as literals. Not doing so would prematurely open or close another (non-)capturing group!


A Python example:

import re

lines = ['random text and A08524SDD here (00-04) more random text',
         'lame text (junk data) more text (08-12) more text 4000 5553',
         'random text and numbers 44553349 (2008)',
         'random text (2005) junk text (junk)',
         'nothing important (13-15) not important (not important)',
         'random text and numbers 44553349 (2008 important text)',
         'random text (2005 important text) junk text (junk) 55555555 (08-09 important text)',
         'nothing important (13-15) not important (not important)(2008 important text)']

for line in lines:
    print(re.findall(r'(?<=\()\d\d-?\d\d.*?(?=\))', line))

Returns:

['00-04']
['08-12']
['2008']
['2005']
['13-15']
['2008 important text']
['2005 important text', '08-09 important text']
['13-15', '2008 important text']
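If you would rather get the date and the trailing text as separate values, capture groups are one option. This is only a sketch under my own assumptions: the pattern below is a variation that matches the parentheses themselves instead of using lookarounds, and it insists on exactly #### or ##-## for the date.

```python
import re

# Group 1: the date (#### or ##-##); group 2: whatever follows it up to ')'.
pattern = re.compile(r'\((\d{4}|\d{2}-\d{2})([^)]*)\)')

line = 'random text (2005 important text) junk text (junk) 55555555 (08-09 important text)'

# findall returns one (date, text) tuple per match when there are two groups.
results = [(date, text.strip()) for date, text in pattern.findall(line)]
print(results)  # [('2005', 'important text'), ('08-09', 'important text')]
```

For a plain (13-15) the second element is simply an empty string.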
JvdV
  • I don't quite get regex, and trying to add this to test, and I am not getting any results. I added these lines and get nothing returned: date = r'(?<=\()\d\d-?\d\d(?=\D*\))' print(date) – Lzypenguin May 16 '20 at 06:36
  • 1
    penguin, you need to escape `(` and `)` in the lookarounds. – Cary Swoveland May 16 '20 at 06:38
  • When I added yours, I got nothing returned for any of my lines. – Lzypenguin May 16 '20 at 06:51
  • @JvdV I now have this working, and it is returning the years perfectly, and in the situations where there is more than one year, it is returning both, which is great. But it is not capturing any of the text. When I add text in the demo, it doesn't grab it either. – Lzypenguin May 16 '20 at 07:06
  • @Lzypenguin, so you actually ***do*** want the text to be included? Please try again with updated answer. – JvdV May 16 '20 at 07:19
  • 1
    @JvdV You are AWESOME!!! Thank you so much! This works perfectly and does everything I have asked! Thank you so much for taking the time to help me with this!!! – Lzypenguin May 17 '20 at 04:52
1

You can use the following regular expression.

(?m)(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))


Python's regex engine performs the following operations.

(?m)           multiline mode
(?<=\()        match is preceded by '(' (positive lookbehind)
(?:            begin non-capture group
  \d{4}        match 4 digits          
  |            or
  \d{2}-\d{2}  match 2 digits, a hyphen, 2 digits
)              end non-capture group
(?=\))         match is followed by ')' (positive lookahead)
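For completeness, a minimal sketch of the same pattern in Python, written with re.VERBOSE so the comments can live inside the pattern. The name date_in_parens is just a placeholder, and I left out (?m) here because the pattern has no ^ or $ anchors for it to affect.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows '#' comments.
date_in_parens = re.compile(r"""
    (?<=\()                  # preceded by an opening parenthesis
    (?:\d{4}|\d{2}-\d{2})    # 4 digits, or 2 digits, a hyphen, 2 digits
    (?=\))                   # followed by a closing parenthesis
""", re.VERBOSE)

lines = ['random text and A08524SDD here (00-04) more random text',
         'lame text (junk data) more text (08-12) more text 4000 5553',
         'random text and numbers 44553349 (2008)',
         'random text (2005) junk text (junk)',
         'nothing important (13-15) not important (not important)']

matches = [date_in_parens.findall(line) for line in lines]
print(matches)  # [['00-04'], ['08-12'], ['2008'], ['2005'], ['13-15']]
```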
Cary Swoveland
0

Does something like this work for you?

This assumes strings is a list of your lines.

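import re

def getter(string):
    # Return the first "(####)" or "(##-##)" in the line, or None if there is none
    match = re.search(r'\(\d{4}\)|\(\d{2}-\d{2}\)', string)
    return match.group() if match else None

list(map(getter, strings))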

output:

['(00-04)', '(08-12)', '(2008)', '(2005)', '(13-15)']

As per your edit: if you are looping, just apply the function to each line inside the loop.
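A rough sketch of what that loop could look like. The getter here is a variant that returns None for lines without a date, and the second sample line is made up to show that case; stripping the parentheses at the end is optional.

```python
import re

def getter(string):
    # First "(####)" or "(##-##)" in the line, or None when the line has no date
    match = re.search(r'\(\d{4}\)|\(\d{2}-\d{2}\)', string)
    return match.group() if match else None

lines = ['random text (2005) junk text (junk)',
         'no date on this line (junk)',  # hypothetical line without a date
         'nothing important (13-15) not important (not important)']

dates = []
for line in lines:
    found = getter(line)
    if found is not None:
        # Drop the surrounding parentheses, keeping just the date
        dates.append(found.strip('()'))

print(dates)  # ['2005', '13-15']
```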

Derek Eden