Python: how to extract text between two markers multiple times throughout a text file?

Question

I am having trouble extracting portions of text from a text file. Using Python 3, I have the format below repeated throughout the whole file:

    integer stringOfFilePathandName.cpp string integer
    ...not needed text...
    ...not needed text...
    singleInteger( zero or one)
    ---------------------------------
    integer stringOfFilePathandName2.cpp string integer
    ...not needed text...
    ...not needed text...
    singleInteger( zero or one)
    ---------------------------------

The number of unwanted text lines varies between occurrences of the pattern. I need to save stringOfFilePathandName.cpp and the singleInteger value, if possible in a dictionary like {stringOfFilePathandName: (0 or 1)}.

The text also contains files with extensions other than .cpp, which I do not need. Also, I do not know the file's encoding, so I read it as binary.

My issue shares features with the problems addressed at the links below:

Python read through file until match, read until next pattern

https://sopython.com/canon/92/extract-text-from-a-file-between-two-markers/ - which I don't quite comprehend

python - Read file from and to specific lines of text - this I have tried to copy, but it worked for only one instance. I need to repeat this process throughout the file.

Currently I have tried this, which works for a single occurrence:

import re

fileRegex = re.compile(r".*\.cpp")
dictOfFiles = {}

with open('txfile', "rb") as fin:
    filename = None
    # Find the first line that contains a .cpp path, then stop.
    for line in fin:
        match = re.search(fileRegex, str(line))
        if match:
            filename = match.group().lstrip("b'")
            break
    # Continue from that point and record every line that is just 0 or 1.
    for line in fin:
        value = str(line).lstrip("b'").rstrip("\\n'")
        if value == "0" or value == "1":
            dictOfFiles[filename] = value

    del filename

My thinking is that a similar process which iterates through the file is needed. Up till now, the approach I followed was line by line. Possibly it would be better to just save the whole text to a variable and then extract. Any thoughts are welcome; this has been bugging me for quite a while...

Per request, here's the text file: https://raw.githubusercontent.com/CGCL-codes/VulDeePecker/master/CWE-119/CGD/cwe119_cgd.txt

Are you OK with `{b'stringOfFilePathandName.cpp': b'0', b'stringOfFilePathandName2.cpp': b'1'}` output? Or do you want to have UTF8 strings in the result, like `{'stringOfFilePathandName.cpp': '0', 'stringOfFilePathandName2.cpp': '1'}`? — Wiktor Stribiżew, Jun 07 '19 at 11:17
@WiktorStribiżew yes, it's fine I can strip it later, thank you — Nikos H., Jun 07 '19 at 11:52
@Nikos You cannot strip `b` prefix, you need to re-encode the values. See my answer below how to do that. — Wiktor Stribiżew, Jun 07 '19 at 11:53

Wiktor Stribiżew · Accepted Answer · 2019-06-07T14:13:06.220

3

You may use

import re

fileRegex = re.compile(rb"^\d+\s+(\S+\.cpp)\s.*(?:\r?\n(?![01]\r?$).*)*\r?\n([10]+)\r?$", re.M)
dictOfFiles = []
with open(r'txfile', 'rb') as fin:
    # findall returns (filename, value) byte-string pairs; decode both and convert the value to int
    dictOfFiles = [(k.decode('utf-8'), int(v.decode('utf-8'))) for k, v in fileRegex.findall(fin.read())]

Then, print(dictOfFiles) returns

[('stringOfFilePathandName.cpp', 0), ('stringOfFilePathandName2.cpp', 1), ...]

See the regex demo.

NOTES

You need to read the whole file into memory for this multiline regex to work, hence I am using fin.read().
When you read a file in binary mode, CRs are not removed, hence I added \r? (an optional CR) before each \n.
To convert byte strings to Unicode strings, we need to call .decode('utf-8') on the results.
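
To illustrate the last two notes, here is a minimal sketch using a made-up byte string (`raw` is hypothetical sample data, not a line from the real file, and UTF-8 is an assumption since the file's encoding is unknown). It shows why calling `str()` on bytes and then stripping `b'` is fragile, while `.decode()` returns the actual text:

```python
# Hypothetical sample line, as it would come from a file opened with "rb".
raw = b"bob.cpp\r\n"

# str() on bytes gives the repr, including the b'...' wrapper and escapes.
as_repr = str(raw)

# lstrip("b'") removes any leading 'b' or "'" characters, not a fixed prefix,
# so it also eats the first letter of this filename.
stripped = as_repr.lstrip("b'")

# decode() converts the bytes to text; strip() then drops the trailing CRLF.
decoded = raw.decode("utf-8").strip()

print(as_repr)    # the repr, with the b'...' wrapper
print(stripped)   # starts with "ob.cpp": the leading "b" of the name is gone
print(decoded)    # bob.cpp
```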

Regex details (in case you need to adjust it later):

^ - start of a line (due to re.M, ^ matches line start positions)
\d+ - 1+ digits
\s+ - 1+ whitespaces
(\S+\.cpp) - Group 1: 1+ non-whitespace chars and then .cpp
\s - a whitespace
.* - 0+ chars other than line break chars as many as possible
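(?:\r?\n(?![01]\r?$).*)* - 0 or more repetitions of a line break followed by the rest of a line, as long as that line is not just 0 or 1 (that is what the negative lookahead (?![01]\r?$) checks); this skips the unwanted lines
\r?\n - a CRLF or LF linebreak
([10]+) - Group 2: 1+ chars that are each 1 or 0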
\r? - an optional CR
$ - end of line.
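
The breakdown above can be tried out on a small in-memory sample. This is only a sketch: `record_re`, `sample` and `pairs` are names made up for the illustration, the sample data is invented, and the pattern is the same one as in the answer, just split across lines with re.X so each part can carry a comment.

```python
import re

# Same pattern as in the answer, in verbose form (re.X ignores the layout whitespace).
record_re = re.compile(rb"""
    ^\d+\s+(\S+\.cpp)\s.*      # header line; group 1 is the .cpp path
    (?:\r?\n(?![01]\r?$).*)*   # following lines that are not a lone 0 or 1
    \r?\n([10]+)\r?$           # the label line; group 2
""", re.M | re.X)

# Invented sample with CRLF line endings and one non-.cpp record.
sample = (
    b"1 dir/a.cpp func 12\r\n"
    b"int x = 0;\r\n"
    b"0\r\n"
    b"---------------------------------\r\n"
    b"2 dir/b.cpp func 7\r\n"
    b"memcpy(dst, src, n);\r\n"
    b"char buf[10];\r\n"
    b"1\r\n"
    b"---------------------------------\r\n"
    b"3 dir/c.c func 3\r\n"
    b"return;\r\n"
    b"0\r\n"
    b"---------------------------------\r\n"
)

pairs = [(path.decode("utf-8"), int(label.decode("utf-8")))
         for path, label in record_re.findall(sample)]
print(pairs)  # [('dir/a.cpp', 0), ('dir/b.cpp', 1)]
```

The `.c` record is skipped because its header line never satisfies `(\S+\.cpp)`.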

edited Jun 07 '19 at 14:13

answered Jun 07 '19 at 11:21


I get different total occurrences when running the _filename_ regex (essentially checking how many times .cpp appears) and your solution (9765 vs 3361). Maybe I omitted or misinterpreted the text format. For replication, I added the text file link to the initial post. – Nikos H. Jun 07 '19 at 13:38
@Nikos The pattern is correct, you just have repeating filenames. If they are not unique, do not use a dictionary, use a list of tuples. I updated the code. – Wiktor Stribiżew Jun 07 '19 at 14:12
Oh of course, such ignorance on my part...thank you very much. If I may ask something more, could you please provide a guide/extra resources regarding regex? – Nikos H. Jun 07 '19 at 14:16
@Nikos I do not know your level of regex knowledge, so I can only suggest doing all lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). Also, [rexegg.com](http://rexegg.com) is worth having a look at. – Wiktor Stribiżew Jun 07 '19 at 14:17
@NikosH. Hi, I have started uploading some [regex videos on Youtube](https://www.youtube.com/channel/UCFeq5T-LNtqpVrn_rcJ9hFw), feel free to check them out if you want to learn more about regex. As I am a novice in Youtubing, I'd be grateful for any suggestions. – Wiktor Stribiżew Jul 20 '21 at 12:37
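
On the repeated-filename point from the comments above: a dictionary keeps only the last value stored under each key, so it can end up with fewer entries than there are matches. A small sketch with made-up pairs; grouping the labels with `collections.defaultdict` is just one option and is not something the answer itself proposes:

```python
from collections import defaultdict

# Hypothetical (filename, label) pairs, with one filename repeated.
pairs = [("a.cpp", 0), ("b.cpp", 1), ("a.cpp", 1)]

# Building a dict silently overwrites the first "a.cpp" entry.
as_dict = dict(pairs)
print(len(pairs), len(as_dict))  # 3 2

# Keeping a list of labels per filename preserves every occurrence.
grouped = defaultdict(list)
for name, label in pairs:
    grouped[name].append(label)
print(dict(grouped))  # {'a.cpp': [0, 1], 'b.cpp': [1]}
```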

Tim Biegeleisen · Answer 2 · 2019-06-07T11:01:37.453

2

One possibility would be to use re.findall with a regex pattern which can cope with matches spanning more than one line:

import re

input = """1 file1.cpp blah 3
           not needed
           not needed
           2
           ---------------------------------
           9 file1.cpp blah 5
           not needed
           not needed
           3
           ---------------------------------"""
matches = re.findall(r'(\w+\.cpp).*?(\d+)(?=\s+--------)', input, re.DOTALL)
print(matches)

This prints:

[('file1.cpp', '2'), ('file1.cpp', '3')]

This answer assumes that you can tolerate reading the entire file into memory, and then making one pass with re.findall. If you can't do that, then you will need to continue with your current parsing approach.
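
If reading the whole file into memory is not an option, the line-by-line approach can be extended to handle every record. What follows is only a sketch under stated assumptions: each record starts with its header line, separator lines begin with at least five dashes, and the label is a line containing only 0 or 1. `iter_records`, `HEADER_RE` and `sample` are names invented for this example.

```python
import re

# Header line: "<integer> <path ending in .cpp> ..." (same idea as the regexes above).
HEADER_RE = re.compile(rb"^\d+\s+(\S+\.cpp)\s")

def iter_records(lines):
    """Yield (filename, label) pairs from an iterable of byte lines."""
    expect_header = True   # assumption: a header is the first line of each record
    filename = None        # set only when the current record is a .cpp one
    label = None           # last line seen that was exactly 0 or 1
    for raw in lines:
        line = raw.strip()                 # removes \n, \r\n and padding
        if line.startswith(b"-----"):      # assumption: separators start with dashes
            if filename is not None and label is not None:
                yield filename.decode("utf-8"), label
            expect_header, filename, label = True, None, None
        elif expect_header:
            match = HEADER_RE.match(line)
            filename = match.group(1) if match else None
            expect_header = False
        elif line in (b"0", b"1"):
            label = int(line)

# Made-up sample; with a real file, pass the object returned by open(path, "rb") instead.
sample = [
    b"1 file1.cpp blah 3\n", b"not needed\n", b"0\n", b"-----------\n",
    b"9 other.c blah 5\n", b"not needed\n", b"1\n", b"-----------\n",
    b"4 file2.cpp blah 8\r\n", b"not needed\r\n", b"1\r\n", b"-----------\r\n",
]
print(list(iter_records(sample)))  # [('file1.cpp', 0), ('file2.cpp', 1)]
```

Because it is a generator over lines, only one line is held in memory at a time, and the result can be collected into a list of tuples or a dictionary as needed.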

edited Jun 07 '19 at 11:01

answered Jun 07 '19 at 10:59


It won't work with the current code because OP is reading the file line by line. `input` must hold the whole file contents. Also, it won't return a dictionary (though it is easy to adjust). – Wiktor Stribiżew Jun 07 '19 at 11:01
@WiktorStribiżew Just noticed that...I added a caveat to my answer. – Tim Biegeleisen Jun 07 '19 at 11:02
I use decode from Wiktor's answer and it seems to work. As you mention, I first read the whole file into a variable and then run .findall. – Nikos H. Jun 07 '19 at 12:42
