I need to extract some data from a string like this (VHDL code):

entBody = """entity pci_bfm is                                                            
  generic(                                                                   
    G_INST_NAME            : string          := "PCI_BFM";                   
    G_HANDLE_NO            : rpciBfmHandleNo := 0;                           
    G_IDSEL_POS_EXT_TARGET : idsel_pos       := 30;                          
    G_IDSEL_POS_INT_TARGET : idsel_pos       := 29                           
    );                                                                       
  port(                                                                      
    i_tb_stop  : in    boolean;                       -- Testbench global sto
    o_clk      : out   std_logic;                     -- PCI clock.          
    o_rstn     : out   std_logic;                     -- PCI reset.          
    o_idsel    : out   std_logic;                     -- Initialization devic
    i_reqn     : in    std_logic;                     -- Request. The reqn in
    o_gntn     : out   std_logic;                     -- Grant. The gntn onpu
    io_ad      : inout std_logic_vector(31 downto 0); -- Address/data bus. Th
    io_cben    : inout std_logic_vector(3 downto 0);  -- Command/byte enable.
    io_par     : inout std_logic;                     -- Parity. The par sign
    io_framen  : inout std_logic;                     -- Frame. The framen si
    io_irdyn   : inout std_logic;                     -- Initiator ready. The
    io_devseln : inout std_logic;                     -- Device select. Targe
    io_trdyn   : inout std_logic;                     -- Target ready. The tr
    io_stopn   : inout std_logic;                     -- Stop. The stopn sign
    io_perrn   : inout std_logic;                     -- Parity error. The pe
    i_serrn    : in    std_logic;                     -- System error. The se
    i_intan    : in    std_logic;                     -- Interrupt A. The int
    o_lockn    : out   std_logic                      -- Locked operations. R
    );                                                                       
end entity pci_bfm;"""

The VHDL comments are not all the same length; I truncated them to make the example easier to read.

I want to extract everything between 'port(' and the last ');' (the one that closes the port declarations). Of course, real VHDL declarations may not be as neatly indented and formatted as they are here.

I have a Python 2.7.x regex for this:

import re

# re.DOTALL lets . match newlines, so the capture can span several lines
pattern = re.compile(r"port\s*\((.*?)\s+\)\s*;", re.DOTALL)
match3 = pattern.search(entBody)
ports = match3.group(1)

It works as long as the closing ); is preceded by whitespace, i.e. it does not come immediately after the last declaration. The following input will not match:

entBody2 = """entity QSPI_FLASH_SPANSION_S25FL_BFM is
  generic
    (
      G_INST_NAME : string  := "QSPI_FLASH_SPANSION_S25FL_BFM";
      G_HANDLE_NO : integer := 2
      );
  port (
    tb_stop : in    boolean;                       -- Testbench global stop.
    sclk    : in    std_logic;
    csn     : in    std_logic;
    sdat    : inout std_logic_vector(3 downto 0));
end;"""

If I change my regex a little bit like this:

pattern = re.compile(r"port\s*\((.*?)\s*\)\s*;", re.DOTALL)  # \s* instead of \s+

then, on the first example, the match ends at 'io_ad : inout std_logic_vector(31 downto 0', which is wrong.

I was wondering whether a regex can do a search like this, i.e. count the opening parentheses and stop only when every parenthesis has been closed.

If there is no simple way, I will do a simple string search using string functions to solve it.
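
Python's built-in re module has no recursive patterns, so it cannot count nested parentheses on its own. A minimal sketch of the string-function fallback mentioned above: locate the port keyword, then walk forward counting parentheses until the depth returns to zero. The function name extract_port_block and the sample text are made up for illustration, and the sketch assumes that 'port (' does not appear earlier inside a comment and that string literals contain no parentheses.

```python
import re

def extract_port_block(text):
    """Return the text between 'port (' and its matching ')', or None."""
    # VHDL keywords are case-insensitive, hence re.IGNORECASE.
    start = re.search(r"\bport\s*\(", text, re.IGNORECASE)
    if start is None:
        return None
    depth = 1                 # the '(' matched above is already open
    i = start.end()
    while i < len(text):
        if text.startswith("--", i):
            # VHDL comment: skip to end of line so its parentheses are ignored
            newline = text.find("\n", i)
            i = len(text) if newline == -1 else newline
            continue
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[start.end():i]
        i += 1
    return None               # parentheses never balanced


# Hypothetical sample: closing ');' right after the last declaration,
# plus an unbalanced parenthesis inside a comment.
sample = """port (
    sclk : in    std_logic;  -- clock (rising edge
    sdat : inout std_logic_vector(3 downto 0));
end;"""
ports = extract_port_block(sample)
```

The returned string keeps the comments; only the scan skips over them, so a stray parenthesis in a comment does not disturb the count.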

Thank you.

Mihai Hangiu
  • *I truncated them to be easier to read* - are you sure you preserved the same format? How can you define the leading/trailing boundaries and the content inside? – Wiktor Stribiżew May 07 '16 at 08:46

2 Answers

You want the pattern to match any character, including newlines. To do that, put \s and \S together in a character class: [\s\S].

\s matches any whitespace character.

\S matches any non-whitespace character.

import re
match3 = re.search(r"port\s*\(([\s\S]+?)\)\s*;\s*\n", entBody)

Alternatively, pass the re.S flag, which makes . match any character, including newlines:

match3 = re.search(r"port\s*\((.+?)\)\s*;\s*\n", entBody, re.S)
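
A quick check of the two spellings above, using a shortened, made-up port list (snippet is an assumption, not the original entity). Without the flag the dot stops at the first newline, so nothing is found; [\s\S] and re.S (an alias of re.DOTALL) produce the same capture.

```python
import re

# Hypothetical three-line port list, just enough to span newlines.
snippet = "port(\n  a : in bit;\n  b : out bit\n);\n"

# Plain dot: cannot cross the newline right after 'port(' -> no match.
no_flag = re.search(r"port\((.+?)\);", snippet)

# Character class: \s plus \S together cover every character.
char_class = re.search(r"port\(([\s\S]+?)\);", snippet)

# re.S flag: the dot itself now matches newlines too.
dotall = re.search(r"port\((.+?)\);", snippet, re.S)
```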
mkHun
  • A word of warning: This will capture all text between `port(` and _the last closing parens `)` in the entire text_. While it may work in this specific case, I doubt this is a good solution for real world scenarios. All you need to do to break this solution is to move the `generic` section after the `port` section. – Aran-Fey May 07 '16 at 09:12
  • @Rawing Thank you for your comment; I have edited the post. – mkHun May 07 '16 at 09:21

You can use the following regex:

/port\s*\((.+)\)\s*;/s

Breaking it down:

port            # matches the characters "port" literally (case sensitive)
\s*             # matches any whitespace character [\r\n\t\f ], zero or more times
\(              # matches the character ( literally
(.+)            # capturing group: matches any character, one or more times (greedy)
\)              # matches the character ) literally
\s*             # matches any whitespace character [\r\n\t\f ], zero or more times
;               # matches the character ; literally

s               # modifier: single line; the dot also matches newline characters
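
The /.../s notation is Perl-style; in Python the s modifier corresponds to re.DOTALL. Below is a sketch of the same pattern written with re.VERBOSE, so the breakdown becomes inline comments. The names PORT_RE, text and later_text are made up for illustration. Because .+ is greedy, the capture runs to the last ');' in the input, which is fine when the port list comes last but over-captures when another parenthesized statement follows it.

```python
import re

PORT_RE = re.compile(r"""
    port    # the keyword, case sensitive
    \s*     # optional whitespace
    \(      # opening parenthesis of the port list
    (.+)    # greedy capture: runs up to the LAST ')' that precedes a ';'
    \)      # closing parenthesis
    \s*     # optional whitespace
    ;       # terminating semicolon
""", re.DOTALL | re.VERBOSE)

# Port list is the last parenthesized section: the capture is what we want.
text = "port (\n  d : inout std_logic_vector(3 downto 0));\nend;"
ports = PORT_RE.search(text).group(1)

# Something parenthesized follows the port list: the capture runs too far.
later_text = "port (a : in bit);\nsignal s : bit_vector(1 downto 0);"
over = PORT_RE.search(later_text).group(1)
```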

UPDATE: If something else can follow the port(...) section, you can try the following regex:

/port\s*\((.*?)(?:\)\s*;\s*\w)/s
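
Translated to Python (re.DOTALL for the s modifier), the updated pattern can be tried on two made-up inputs; UPDATED_RE, ok and early are hypothetical names. It is only a heuristic: the lazy capture stops at the first ');' that is followed by a word character, so a vector port that ends its line without a trailing comment cuts the capture short.

```python
import re

UPDATED_RE = re.compile(r"port\s*\((.*?)(?:\)\s*;\s*\w)", re.DOTALL)

# The only ');' followed by a word character is the one closing the port list.
ok = "port (\n  d : inout std_logic_vector(3 downto 0));\nend;"
ok_ports = UPDATED_RE.search(ok).group(1)

# Here '0);' is followed by the next port name, so the match stops early.
early = ("port (\n"
         "  d : in std_logic_vector(3 downto 0);\n"
         "  q : out std_logic);\n"
         "end;")
early_ports = UPDATED_RE.search(early).group(1)
```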

AKS
  • This only works if `port` is the last section in the code. If, for example, the `generic` section was below the `port` section, this wouldn't work. Also, you might want to allow whitespace between the closing parens and the semicolon. – Aran-Fey May 07 '16 at 09:17
  • @Rawing Thank you. I will work on the modifications you mentioned. – AKS May 07 '16 at 09:23