I am trying to match a multiline pattern by running a shell command from Python.

The match works when I run the command directly in the shell, but I cannot get the same command to work through Python's subprocess.call or os.system.

My file looks something like this:

(CELL
  (CELLTYPE "NAND_2X1")
  (INSTANCE U2)
  (DELAY
    (ABSOLUTE
    (IOPATH A1 ZN (0.02700::0.02700) (0.01012::0.01012))
    (IOPATH A2 ZN (0.02944::0.02944) (0.00930::0.00930))
    )
  )
)

Now, I am trying to extract this:

  (INSTANCE U2)
  (DELAY
    (ABSOLUTE
    (IOPATH A1 ZN (0.02700::0.02700) (0.01012::0.01012))
    (IOPATH A2 ZN (0.02944::0.02944) (0.00930::0.00930))
    )
  )

using this regex:

pcregrep -M -n 'INSTANCE U2((?!^\)).*\n)+' sdf/c1_syn_buf2.sdf

where U2 is the search string and sdf/c1_syn_buf2.sdf is the file name.

In Python, I have defined a function to which I will pass the search string and the file name as I have to do this operation multiple times.

I am unable to successfully execute this as a shell command using something like:

>>>b = subprocess.call(['pcregrep','-M','-n','INSTANCE '+arg, '\)((?!^\).*\n)+ '+file ])
pcregrep: Failed to open \)((?!^\).*
)+ /home/sanjay/thesis/code/sdf/c7552_syn_buf0.sdf: No such file or directory

When I hard-code the argument (U2 in this case) and the file name, I get the desired output.

EDIT: If pcregrep is not friendly enough, here is the equivalent awk command:

awk '/INSTANCE U2/,/^)\n?/' sdf/c1_syn_buf2.sdf

Returns the same.

Can someone please help me with this?

sanjay
  • I think you just have a typo around: .*\n)+ '+file It perhaps should be .*\n)+ ', file with a comma instead of a plus – rkh Sep 19 '14 at 20:52
  • @rkh Its not a typo, the stuff in parentheses is basically saying `(.*\n)+`, I have added the `(?!\))` to indicate that matching needs to be done only until a line that starts with `)` is encountered. – sanjay Sep 19 '14 at 20:55
  • The comma separating arg from '\) should also be a +, I think, so that the element in the list is a single argument to the subprocess call – rkh Sep 19 '14 at 20:57

3 Answers

Just looking at your original command line, and formatting the call to one arg per line, should it not be this?

b = subprocess.call(
['pcregrep',
    '-M',
    '-n',
    'INSTANCE {}\)((?!^\)).*\n)+ '.format(arg),
    file ])

I am not so sure about the parentheses and the backslashes. Those are always a bit tricky in regexes. You might have to fiddle with them a bit to get exactly what you want (look in the Python documentation for r'' raw string literals).

rkh
  • I did not get this one, I would like to have the `arg` in the search string – sanjay Sep 19 '14 at 21:03
  • the format line will put arg in the search string. If arg is set to "Hello", then 'INSTANCE {}\)((?!^\).*\n)+ '.format(arg) is equal to 'INSTANCE Hello\)((?!^\).*\n)+ ' (plus or minus some backslashes) – rkh Sep 19 '14 at 21:05
  • hmm...I actually just tried it and now pcregrep reports the error: `>>> b = subprocess.call(['pcregrep','-M','-n','INSTANCE {}\)((?!^\).*\n) '.format(arg),file ]) pcregrep: Error in command-line regex at offset 25: missing )` – sanjay Sep 19 '14 at 21:06
  • I wish there was a simple backtick or exec method for Python like perl! – sanjay Sep 19 '14 at 21:08
  • looking at your original regex, I think I forgot an extra ')' in the middle. I have just edited. Hopefully the backslashes are correct... – rkh Sep 19 '14 at 21:14
  • note: `len('\n') == 1` and `len(r'\n') == 2`. Use `r''` literals if you want to pass `\n` to `pcregrep`. – jfs Sep 20 '14 at 12:37

Looks like I need to use format specifiers %s

It works when I use:

b = subprocess.check_output("pcregrep -M -n 'INSTANCE '%s'((?!^\)).*\n)+' {} ".format(file) %arg,shell=True)

With this, I get the exact match into the variable b

I am passing the argument using %s and the file name using the {} .format method
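
If the shell really is wanted, the same command can be built with a single formatting style. This is a minimal sketch, not a definitive fix: `build_shell_command` is a hypothetical helper name, and it assumes `shlex.quote` from the standard library is acceptable for quoting the pattern and the path.

```python
import shlex

def build_shell_command(instance, sdf_path):
    """Return one shell command string for pcregrep (hypothetical helper)."""
    # Raw string: the backslashes must reach pcregrep unchanged.
    pattern = r'INSTANCE {}((?!^\)).*\n)+'.format(instance)
    # shlex.quote wraps each piece so the shell treats it as a single word.
    return 'pcregrep -M -n {} {}'.format(shlex.quote(pattern),
                                         shlex.quote(sdf_path))

cmd = build_shell_command('U2', 'sdf/c1_syn_buf2.sdf')
print(cmd)
```

The resulting string could then be passed as `subprocess.check_output(cmd, shell=True)`.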

sanjay
  • you should use either `%` or `.format` string formatting. Using both at the same time looks ugly. – jfs Sep 20 '14 at 12:28

To run the shell command:

$ pcregrep -M -n 'INSTANCE U2((?!^\)).*\n)+' sdf/c1_syn_buf2.sdf

in Python:

from subprocess import check_output as qx

output = qx(['pcregrep', '-M', '-n', r'INSTANCE {}((?!^\)).*\n)+'.format(arg),
             path_to_sdf])

Two things to note:
  • use an r'' literal, or double all backslashes
  • pass each shell argument as a separate list item
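
Both points can be checked without running pcregrep at all. Here is a small sketch, where `build_argv` is a hypothetical helper name and not part of any library:

```python
def build_argv(instance, sdf_path):
    """Build the argument list for subprocess (hypothetical helper)."""
    # r'' keeps \) and \n as backslash sequences for pcregrep to interpret.
    pattern = r'INSTANCE {}((?!^\)).*\n)+'.format(instance)
    # One list item per argument: the pattern and the file name stay separate.
    return ['pcregrep', '-M', '-n', pattern, sdf_path]

argv = build_argv('U2', 'sdf/c1_syn_buf2.sdf')
print(argv[3])  # the pattern exactly as pcregrep will receive it

# Without r'', Python turns \n into a real newline before pcregrep sees it.
print(len('\n'), len(r'\n'))
```

The list returned by `build_argv` can be handed straight to `check_output`, with no `shell=True` and no extra quoting.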

Also, you don't need pcregrep, you could search the file in Python:

import re
from mmap import ACCESS_READ, mmap

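with open(path_to_sdf, 'rb') as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    # arg = re.escape(arg) # call it if you want to match arg verbatim
    # (?:...) is non-capturing so findall() returns the whole match;
    # no re.DOTALL, otherwise .* would run past the end of the line
    output = re.findall(r'INSTANCE {}(?:(?!^\)).*\n)+'.format(arg).encode(), s,
                        flags=re.MULTILINE)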

mmap is used to accommodate files that do not fit in memory. It also might run faster on Windows.
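
For a quick check of the pattern without a file on disk, here is a minimal sketch on an in-memory string. The sample text is the cell from the question. `extract_instance` is a hypothetical name. Using a non-capturing group with only `re.MULTILINE` is an assumption about the intended behaviour: return the whole block and stop at the first line that starts with `)`.

```python
import re

SAMPLE = """(CELL
  (CELLTYPE "NAND_2X1")
  (INSTANCE U2)
  (DELAY
    (ABSOLUTE
    (IOPATH A1 ZN (0.02700::0.02700) (0.01012::0.01012))
    (IOPATH A2 ZN (0.02944::0.02944) (0.00930::0.00930))
    )
  )
)
"""

def extract_instance(text, instance):
    """Return the block from 'INSTANCE <name>' up to the closing ')' line."""
    # re.escape matches the instance name verbatim.
    pattern = r'INSTANCE {}(?:(?!^\)).*\n)+'.format(re.escape(instance))
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(0) if match else None

block = extract_instance(SAMPLE, 'U2')
print(block)
```

When the instance name is not present, `extract_instance` returns `None` instead of raising.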

jfs
  • @Sebastian I'm fairly new to python, hence was using Unix commands..thanks for the mmap approach, it works perfectly! Thanks for the perl like approach to qx – sanjay Sep 21 '14 at 13:19