Extract data between pound signs

Question

Hi, I am parsing XML files to grab SQL text and parameters. I need to pull out the strings that lie between two # signs. For example, if this is my text:

CASE WHEN TRIM (NVL (a.SPLR_RMRK, ' ')) = '' OR TRIM (NVL (a.SPLR_RMRK, ' ')) IS NULL THEN '~' ELSE a.SPLR_RMRK END AS TXT_DESCR_J, 'PO' AS TXT_TYP_CD_J FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.P_PO_RCPT_DTL a, (SELECT PO_RCPT_DTL_KEY, ETL_CRT_DTM FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.#jp_PoRcptDtl_Src# WHERE ETL_UPDT_DTM > TO_DATE ('#jp_EtlPrcsDt#', 'YYYY-MM-DD:HH24:MI:SS'))

I want ps_RDW_Conn.jp_RDW_SCHEMA_NAME, ps_RDW_Conn.jp_RDW_SCHEMA_NAME, jp_PoRcptDtl_Src and jp_EtlPrcsDt to be printed out.

The code I have so far is:

for eachLine in testFile:
    print re.findall('#(*?)#', eachLine)

This gives me the following error:

nothing to repeat.

Any help or suggestions are greatly appreciated!

Possible duplicate: http://stackoverflow.com/questions/5869650/python-regex-strange-behavior — Dunno, Jun 17 '14 at 20:15
Your original sample text had some new line characters. Everything is on one line now? — HeyWatchThis, Jun 17 '14 at 20:34

HeyWatchThis · Accepted Answer · 2014-06-17T20:54:00.297

Unlike in bash glob patterns, the * in a regular expression is not a wildcard character. Instead it says "repeat the thing before me 0 or more times".

In your regular expression, the * came directly after the opening parenthesis, so it had nothing to modify, and you saw the complaint "nothing to repeat".
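
As a minimal sketch of the point above, written for Python 3 rather than the Python 2 used in the question, the snippet below compiles both patterns and reports what happens. The helper name `try_compile` is made up for this illustration and is not part of the `re` module.

```python
import re

def try_compile(pattern):
    """Return the compiled pattern, or the error message if it is invalid.

    Hypothetical helper for illustration only.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        return str(exc)

# '*' comes directly after '(' here, so there is nothing for it to repeat.
broken = try_compile(r'#(*?)#')

# With '.' in front, '*?' has something to repeat and the pattern compiles.
fixed = try_compile(r'#(.*?)#')

print(broken)         # an error message that mentions "nothing to repeat"
print(fixed.pattern)  # the pattern text of the successfully compiled regex
```

The exact wording of the message may differ between Python versions, so it is safer to look for the phrase "nothing to repeat" than to compare the whole string.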

On the other hand, if you supply a . symbol for * to modify, testing with one line as an example,

import re

eachLine = '#ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.P_PO_RCPT_DTL a, (SELECT PO_RCPT_DTL_KEY, '

re.findall('#(.*?)#', eachLine)

We get,

['ps_RDW_Conn.jp_RDW_SCHEMA_NAME']
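
To check the same pattern against more of the text from the question, here is a small Python 3 sketch. The names `PLACEHOLDER`, `extract_placeholders` and `sql_text` are assumptions made for this example, and `sql_text` is only the tail of the SQL shown in the question.

```python
import re

# Non-greedy match of everything between one '#' and the next '#'.
PLACEHOLDER = re.compile(r'#(.*?)#')

def extract_placeholders(lines):
    """Collect every #...# name from an iterable of lines, in order."""
    names = []
    for line in lines:
        names.extend(PLACEHOLDER.findall(line))
    return names

# Tail of the SQL text from the question, kept on one line.
sql_text = (
    "FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.P_PO_RCPT_DTL a, "
    "(SELECT PO_RCPT_DTL_KEY, ETL_CRT_DTM "
    "FROM #ps_RDW_Conn.jp_RDW_SCHEMA_NAME#.#jp_PoRcptDtl_Src# "
    "WHERE ETL_UPDT_DTM > TO_DATE ('#jp_EtlPrcsDt#', 'YYYY-MM-DD:HH24:MI:SS'))"
)

names = extract_placeholders([sql_text])
for name in names:
    print(name)
```

This should print the four names asked for in the question, including the repeated schema name. Passing an open file object instead of the list would work the same way, since the function only iterates over lines.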

Some more detail. I'm not sure if this is what you intended, but your *? is actually well placed. *? is interpreted as a single quantifier which says "repeat the thing before me 0 or more times, but take as little as possible".

So this ends up having a similar effect to what @tobias_k suggests in the comments: it prevents multiple groups from being absorbed into one.

>>> line = 'And here is # some interesting code #, where later on there are #fruit flies# ?'
>>> re.findall('#(.*)#', line)
[' some interesting code #, where later on there are #fruit flies']
>>> re.findall('#(.*?)#', line)
[' some interesting code ', 'fruit flies']

For reference, see the "Repeating Things" section of the Regular Expression HOWTO on docs.python.org.

+1 Don't know why the downvote... however, I'd suggest using `#([^#]+)#`, so it does not accidentally select more than one group. — tobias_k, Jun 17 '14 at 20:26
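
For comparison, here is a rough Python 3 sketch of how the negated character class from this comment behaves next to the non-greedy pattern. On single-line input the two give the same groups. They differ when a placeholder spans a line break, because `.` does not match a newline by default while `[^#]` does. The sample strings are invented for this illustration.

```python
import re

lazy = re.compile(r'#(.*?)#')       # non-greedy, '.' stops at a newline
negated = re.compile(r'#([^#]+)#')  # one or more characters that are not '#'

# Single-line input: both patterns find the same two groups.
one_line = 'x #alpha# y #beta# z'
lazy_one = lazy.findall(one_line)
negated_one = negated.findall(one_line)

# A placeholder split across a line break (made-up sample text).
two_lines = '#a\nb# #c#'
lazy_two = lazy.findall(two_lines)
negated_two = negated.findall(two_lines)

print(lazy_one, negated_one)
print(lazy_two, negated_two)
```

Note that `[^#]+` needs at least one character, so an empty pair `##` is not reported as an empty name, whereas `#(.*?)#` would return an empty string for it.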

score 0 · Answer 2 · answered Jun 17 '14 at 20:16

Your regex is not working as intended because you are using both * (0 or more) and ? (0 or 1) to modify the thing before it, but a) there is nothing before it, and b) you should use either * or ?, not both.

If you mean to capture ## or #anything#, then use the regex #(.*)#.

score -1 · Answer 3 · answered Jun 17 '14 at 20:17

Try to escape ( and ). r'\(.*?\)' should work.

for eachLine in testFile:
    print re.findall(r'\(.*?\)', eachLine)

answered Jun 17 '14 at 20:17

Christian Berendt
