What regular expression should I use for this case?

Question

I want to create a regular expression that matches only the lines that start with a date (ignoring the others) and the lines that contain the word "Prefix". What should the regular expression look like?

I have the following structure in my txt file:

                                                        Prefix : 0051601

    Data     Material                                       No. OS  Hist. Nr/Controle        Quant.       Vlr.Unit.            Vlr.Total 
 ----------------------------------------------------------------------------------------------------------------------------------------
 13/01/2008  00101050 Lampada farol H5 24V                          003   4863                2,000        9,870556              19,7411 
                                                                                        ====== Total dia 13/01/2008 ======
                                                                     Entradas :                                                         
                                                                     Saídas   :               2,000                              19,7411
                                                                     -------------------------------------------------------------------

And the primary code is:

import glob, os
import re

os.chdir("./txtfiles/")

for file in glob.glob("*.txt"):

    with open(file) as f:
        content = f.readlines()
        # not working, just for test purpose
        result = re.match(r'Prefix', content, re.M|re.I)
        if result:
            print(content)
        else:
            print("no match found!")

See [this question](https://stackoverflow.com/questions/180986/what-is-the-difference-between-re-search-and-re-match). Also you do not need a regex for just checking the presence of a substring in a string. — Paolo, Sep 03 '18 at 20:41
Not yet @ThomasAyoub, I'm trying it right now! I know the logic, I just don't know the syntax for Python regex — Eduardo Humberto, Sep 03 '18 at 20:41
Why do you think you need a regular expression? The data appears to be formatted in strict columns. Why not just look at the first 10 bytes, and if they are blank then ignore the line? — Bryan Oakley, Sep 03 '18 at 21:10
You are welcome. @ThomasAyoub How is your comment useful? OP has shared his code, which includes his attempt to solve his own problem, which is more than can be said for most questions on this website. What's the point in asking that? — Paolo, Sep 03 '18 at 21:11
@ed_deut: if the problem is simply that you "just don't know the syntax for python regex", then you can get your answer by reading the docs, it's very complete: https://docs.python.org/3/library/re.html — Bryan Oakley, Sep 03 '18 at 21:11

score 1 · Answer 1 · answered Sep 03 '18 at 21:08

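How about the following, without re? It assumes that the lines beginning with a date are the only ones with / at positions 2 and 5 once leading whitespace is stripped:

    with open(file) as f:
        for line in f:
            stripped = line.lstrip()
            # slices (not indexes) avoid an IndexError on blank or short lines
            if stripped[2:3] == stripped[5:6] == '/' or 'Prefix' in line:
                print(line)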

score 1 · Accepted Answer · answered Sep 04 '18 at 02:05

You could use this regex to identify those lines.
Use re.findall to get all of the matching lines.

r"(?im)(?:^[^\S\r\n]*\d+/\d+/\d+|.*\bprefix).*"

https://regex101.com/r/rAl3r6/1
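For what it's worth, here is a minimal sketch of how that pattern might be used with re.findall. The SAMPLE text is a shortened, hypothetical version of the file from the question, and matching_lines is just an illustrative helper name, not part of any library. Because the pattern contains only a non-capturing group, findall returns each whole matched line.

```python
import re

# Pattern from the answer: (?i) ignores case, (?m) lets ^ match at every line start.
PATTERN = re.compile(r"(?im)(?:^[^\S\r\n]*\d+/\d+/\d+|.*\bprefix).*")

# Hypothetical sample, abbreviated from the layout shown in the question.
SAMPLE = (
    "                    Prefix : 0051601\n"
    "\n"
    "    Data     Material            Quant.\n"
    " ------------------------------------\n"
    " 13/01/2008  00101050 Lampada farol H5 24V   2,000\n"
    "          ====== Total dia 13/01/2008 ======\n"
    "          Entradas :\n"
)


def matching_lines(text):
    """Return every line that starts with a date or contains 'Prefix'."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    for line in matching_lines(SAMPLE):
        print(line)
```

In the question's loop you would read the whole file with f.read() rather than f.readlines(), since the pattern expects a single string. Note that the "Total dia 13/01/2008" line is skipped because its date is not at the start of the line.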
