
TLDR: Is there a clean way to make a list of entries from subprocess.check_output(['pcregrep', '-M', '-e', pattern, file])?

I'm using python's subprocess.check_output() to call pcregrep -M. Normally I would separate results by calling splitlines() but since I'm looking for a multiline pattern, that won't work. I'm having trouble finding a clean way to create a list of the matching patterns, where each entry of the list is an individual matching pattern.

Here's a simple example file I'm pcregrep'ing:

module test_module(
    input wire in0,
    input wire in1,
    input wire in2,
    input wire cond1,
    input wire cond2,
    output wire out0,
    output wire out1
);

assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
              cond2 ? in1 || in2 :
              in0;

Here's (some of) my python code

#!/usr/bin/env python
import subprocess, re

output_str = subprocess.check_output(['pcregrep', '-M', '-e', r"^\s*assign\s+\bout0\b[^;]+;",
                                     "/home/<username>/pcregrep_file.sv"]).split(';')

# Print out the matches
for idx, line in enumerate(output_str):
    print "output_str[%d] = %s" % (idx, line)

# Clear out the whitespace list entries                           
output_str = [line for line in output_str if re.match(r'\S+', line)]

Here is the output

output_str[0] = 
assign out0 = (in0 & in1 & in2)
output_str[1] = 
assign out1 = cond1 ? in1 & in2 :
              cond2 ? in1 || in2 :
              in0
output_str[2] = 

It would be nice if I could do something like

output_list = subprocess.check_output(['pcregrep', '-M', '-e', <pattern>, <file>]).split(<multiline_delimiter>)

without creating garbage to clean up (whitespace list entries), or even to have a delimiter to split() on that is independent of the pattern.

Is there a clean way to create a list of the matching multiline patterns?

mgoblue92
  • I don't see any reason to use an external tool, why you don't use the re module? – Casimir et Hippolyte Mar 21 '17 at 22:05
  • fair point, I just have more experience using grep, pcregrep etc. than using re to grep files. I also thought pcregrep might be more optimized for this and performance will (eventually) be a factor. – mgoblue92 Mar 21 '17 at 22:07
  • Stop dreaming about performances and try with the language regex engine to see if it does the job and if the time it takes to do that is acceptable for your needs. After, and only after (when you have made all your possible to refine your pattern or to find an other way with the language), try to use external tools. – Casimir et Hippolyte Mar 21 '17 at 22:24
  • gotcha. I generally like to take performance into account as I'm starting out so I have potentially less work to do down the road, but you're right that the regex machine might be more than sufficient. – mgoblue92 Mar 21 '17 at 22:27
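The suggestion in the comments above is easy to verify: the stdlib re module handles this multiline pattern on its own. A minimal sketch using an inline copy of the example text, so no file or external process is needed:

```python
import re

text = """assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
              in0;
"""

# re.MULTILINE makes ^ match at the start of every line; [^;]+ crosses
# newlines freely because it is a character class, not '.'.
matches = re.findall(r'^\s*assign\s+\bout0\b[^;]+;', text, re.MULTILINE)
print(matches)  # one list entry per full (possibly multiline) match
```

Each element of `matches` is a complete match including the terminating semicolon, so there are no empty entries to filter out afterwards.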

2 Answers


Per Casimir et Hippolyte's comment, and the very helpful post How do I re.search or re.match on a whole file without reading it all into memory?, I read the file with mmap instead of making an external call to pcregrep, and used re.findall(pattern, data, re.MULTILINE).

Full solution (which only slightly modifies the referenced post)

#!/usr/bin/env python
import re, mmap

filename = "/home/<username>/pcregrep_file.sv"
with open(filename, 'r+') as f:
    data = mmap.mmap(f.fileno(), 0)
    output_str = re.findall(r'^\s*assign\s+\bout0\b[^;]+;', data, re.MULTILINE)
    for i, l in enumerate(output_str):
        print "output_str[%d] = '%s'" % (i, l)

which creates the desired list.
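One caveat: under Python 3 the same approach needs two small adjustments, because mmap objects expose bytes: the file should be opened in binary mode and the pattern must be a bytes literal. A self-contained sketch (it writes the sample to a temporary file purely for illustration):

```python
import re, mmap, tempfile, os

sample = b"""module test_module(
    input wire in0
);

assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
              in0;
"""

# Write the sample to a temporary file so the snippet is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, "rb") as f:  # binary mode: mmap yields bytes
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The pattern must also be bytes when matching against an mmap.
    matches = [m.decode() for m in
               re.findall(rb'^\s*assign\s+\bout\d\b[^;]+;', data, re.MULTILINE)]
    data.close()

os.remove(path)
print(matches)
```

Decoding each match back to str keeps the resulting list usable as ordinary text.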

mgoblue92

Don't do that. If you can't use the Python regular expression module for some reason, just use the Python bindings for pcre.

zmbq
  • Note that most of the time, when you feel constrained by the re module, you can use the regex module: https://pypi.python.org/pypi/regex that has all features of your dreams (including the most of the pcre features). – Casimir et Hippolyte Mar 21 '17 at 22:33