
How can I extract the STOP_DATE value from this long string in Python?

GROUP                  = TEMPORALINFORMATION

OBJECT                 = PRODUCTIONDATETIME
  NUM_VAL              = 1
  VALUE                = "2015-07-19T18:29:43Z"
END_OBJECT             = PRODUCTIONDATETIME

OBJECT                 = START_DATE
  NUM_VAL              = 1
  VALUE                = "2015-07-11T20:17:22Z"
END_OBJECT             = START_DATE

OBJECT                 = STOP_DATE
  NUM_VAL              = 1
  VALUE                = "2015-07-11T21:03:52Z"
END_OBJECT             = STOP_DATE

END_GROUP              = TEMPORALINFORMATION
JRodDynamite
mikitk
try this worst one `re.findall('[\s\S]*STOP_DATE[\s\S]*VALUE[\s\S]*=([\s\S]*)END_OBJECT[\s\S]*STOP_DATE[\s\S]*', string)[0].strip()` – itzMEonTV Feb 24 '17 at 11:01

3 Answers


As others have shown, you can do this with a single-line regular expression, but this line-by-line version is clearer:

import re
input_data="""  GROUP                  = TEMPORALINFORMATION\n\n    OBJECT                 = PRODUCTIONDATETIME\n      NUM_VAL              = 1\n      VALUE                = "2015-07-19T18:29:43Z"\n    END_OBJECT             = PRODUCTIONDATETIME\n\n    OBJECT                 = START_DATE\n      NUM_VAL              = 1\n      VALUE                = "2015-07-11T20:17:22Z"\n    END_OBJECT             = START_DATE\n\n    OBJECT                 = STOP_DATE\n      NUM_VAL              = 1\n      VALUE                = "2015-07-11T21:03:52Z"\n    END_OBJECT             = STOP_DATE\n\n  END_GROUP              = TEMPORALINFORMATION
"""

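def find_stop_date(s):
    """Return the VALUE inside the STOP_DATE object, or None if absent."""
    in_stop_date = False
    result = None
    for line in s.split("\n"):
        line = line.strip()
        # Track whether we are between OBJECT = STOP_DATE and its END_OBJECT.
        if re.search(r"^OBJECT.*=.*STOP_DATE", line):
            in_stop_date = True
        if re.search(r"^END_OBJECT.*=.*STOP_DATE", line):
            in_stop_date = False
        if in_stop_date:
            # Raw string avoids invalid-escape warnings for \s.
            re_result = re.search(r"VALUE\s*=\s*(.*)", line)
            if re_result:
                # Drop the surrounding double quotes from the value.
                result = re_result.group(1).strip('"')

    return result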

result = find_stop_date(input_data)
if result:
    print("Found: {}".format(result))
else:
    print("not found")
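
The same line-by-line idea can be generalized to collect every OBJECT's VALUE in one pass. This is only a sketch: `parse_objects` and the abbreviated `sample` string are hypothetical names introduced here, and it assumes each OBJECT block holds a single VALUE line and that blocks are not nested.

```python
import re

# Hypothetical sample input, abbreviated from the question's data.
sample = '''
OBJECT                 = START_DATE
  NUM_VAL              = 1
  VALUE                = "2015-07-11T20:17:22Z"
END_OBJECT             = START_DATE

OBJECT                 = STOP_DATE
  NUM_VAL              = 1
  VALUE                = "2015-07-11T21:03:52Z"
END_OBJECT             = STOP_DATE
'''

def parse_objects(text):
    """Map each OBJECT name to its VALUE (surrounding quotes removed)."""
    values = {}
    current = None  # name of the OBJECT block we are inside, if any
    for line in text.splitlines():
        line = line.strip()
        start = re.match(r"OBJECT\s*=\s*(\S+)", line)
        if start:
            current = start.group(1)
            continue
        if re.match(r"END_OBJECT\b", line):
            current = None
            continue
        value = re.match(r"VALUE\s*=\s*(.*)", line)
        if current and value:
            values[current] = value.group(1).strip('"')
    return values

objects = parse_objects(sample)
print(objects.get("STOP_DATE"))  # 2015-07-11T21:03:52Z
```

Using `objects.get(...)` returns None for a missing object instead of raising, which mirrors the "not found" branch above.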
heroworkshop

You can use this regex:

STOP_DATE.+?VALUE\s*=\s*\"(.+?)\"

The Python commands:

import re

regex = r"STOP_DATE.+?VALUE\s*=\s*\"(.+?)\""

match = re.search(regex, test_str, re.DOTALL)
print(match.group(1))

where test_str is the variable holding your string.

The result:

2015-07-11T21:03:52Z
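
Note that `re.search` returns None when nothing matches, so calling `.group(1)` directly raises AttributeError on input without a STOP_DATE block. A minimal guarded sketch follows; the helper name `extract_stop_date` and the abbreviated `test_str` are assumptions made for illustration.

```python
import re

# Abbreviated sample standing in for test_str from the answer.
test_str = '''OBJECT                 = STOP_DATE
  NUM_VAL              = 1
  VALUE                = "2015-07-11T21:03:52Z"
END_OBJECT             = STOP_DATE'''

regex = r"STOP_DATE.+?VALUE\s*=\s*\"(.+?)\""

def extract_stop_date(text):
    """Return the quoted STOP_DATE value, or None when the pattern does not match."""
    match = re.search(regex, text, re.DOTALL)
    return match.group(1) if match else None

print(extract_stop_date(test_str))         # 2015-07-11T21:03:52Z
print(extract_stop_date("no dates here"))  # None
```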


Sven Hohenstein

Sven's answer can be refined. My pattern runs about 5x faster (counted in regex steps), and the DOTALL flag can be omitted: STOP_DATE[^"]+"([^"]+)

import re

test_str = '''GROUP                  = TEMPORALINFORMATION

    OBJECT                 = PRODUCTIONDATETIME
      NUM_VAL              = 1
      VALUE                = "2015-07-19T18:29:43Z"
    END_OBJECT             = PRODUCTIONDATETIME

    OBJECT                 = START_DATE
      NUM_VAL              = 1
      VALUE                = "2015-07-11T20:17:22Z"
    END_OBJECT             = START_DATE

    OBJECT                 = STOP_DATE
      NUM_VAL              = 1
      VALUE                = "2015-07-11T21:03:52Z"
    END_OBJECT             = STOP_DATE

    END_GROUP              = TEMPORALINFORMATION'''

print(re.search(r'STOP_DATE[^"]+"([^"]+)', test_str).group(1))
# 2015-07-11T21:03:52Z

The performance boost comes from using two greedy negated character classes instead of dots.

Since the desired substring is the only double-quoted value to follow STOP_DATE, the double quotes are the only characters that need to be identified.

If your actual data has other double-quoted values and you need to match on VALUE explicitly, you can use STOP_DATE[^"]+VALUE[^"]+"([^"]+) instead. It takes about 2.5 times as many steps as my previous pattern, but is still about 2x faster than Sven's.
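
One caveat worth checking against your own data: because `[^"]+` cannot cross a double quote, neither negated-class pattern can skip a quoted field that sits between STOP_DATE and VALUE. The sketch below compares the three patterns. The `variant` input with a quoted CLASS field is hypothetical and does not come from the question, and `first_group` is a helper introduced here.

```python
import re

short_pattern = r'STOP_DATE[^"]+"([^"]+)'
value_pattern = r'STOP_DATE[^"]+VALUE[^"]+"([^"]+)'
# Equivalent to the lazy pattern from the earlier answer; needs re.DOTALL.
lazy_pattern = r'STOP_DATE.+?VALUE\s*=\s*"(.+?)"'

def first_group(pattern, text, flags=0):
    """Return capture group 1 of the first match, or None if there is none."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None

# Same layout as the question: VALUE is the first quoted field.
original = '''OBJECT                 = STOP_DATE
  NUM_VAL              = 1
  VALUE                = "2015-07-11T21:03:52Z"
END_OBJECT             = STOP_DATE'''

# Hypothetical variant: a quoted CLASS field precedes VALUE.
variant = '''OBJECT                 = STOP_DATE
  CLASS                = "1"
  VALUE                = "2015-07-11T21:03:52Z"
END_OBJECT             = STOP_DATE'''

for name, text in (("original", original), ("variant", variant)):
    print(name,
          first_group(short_pattern, text),
          first_group(value_pattern, text),
          first_group(lazy_pattern, text, re.DOTALL))
```

On the question's layout all three return the date. On the variant, the short pattern returns the CLASS value, the VALUE pattern finds no match, and the lazy DOTALL pattern still returns the date.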

mickmackusa