I am having trouble extracting portions of text from a text file. I am using Python 3, and the format below repeats throughout the whole file:
integer stringOfFilePathandName.cpp string integer
...not needed text...
...not needed text...
singleInteger( zero or one)
---------------------------------
integer stringOfFilePathandName2.cpp string integer
...not needed text...
...not needed text...
singleInteger( zero or one)
---------------------------------
The number of unwanted text lines varies from one occurrence of the pattern to the next. I need to save stringOfFilePathandName.cpp and the singleInteger value, if possible to a dictionary, like {stringOfFilePathandName: (0 or 1)}.
The text also contains file names with extensions other than .cpp, which I do not need. Also, I do not know the file's encoding, so I read it as binary.
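To make the header line I am after concrete, here is a minimal sketch. It assumes the header fields are separated by whitespace and that the path itself contains no spaces. `parse_header` is a hypothetical helper name, and decoding as latin-1 is only an assumption chosen because it never fails on arbitrary bytes, not because I know the real encoding.

```python
def parse_header(raw_line):
    """Return the .cpp path from a header line, or None for any other line."""
    # latin-1 maps every byte to a character, so this decode cannot raise
    parts = raw_line.decode("latin-1").split()
    # Assumed shape: integer, path, string, integer
    if (
        len(parts) >= 4
        and parts[0].isdigit()
        and parts[-1].isdigit()
        and parts[1].endswith(".cpp")
    ):
        return parts[1]
    return None
```

Header lines that name a file with another extension, and the unwanted text lines, would both come back as None under these assumptions.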
My issue shares features with the problems addressed at the links below:
Python read through file until match, read until next pattern
https://sopython.com/canon/92/extract-text-from-a-file-between-two-markers/ - which I don't quite understand
python - Read file from and to specific lines of text - this I have tried to copy, but it worked for only one instance. I need to repeat this process throughout the file.
Currently I have tried this, which works for a single occurrence:
import re

fileRegex = re.compile(r".*\.cpp")
dictOfFiles = {}
with open('txfile', "rb") as fin:
    filename = None
    # Find the first line that mentions a .cpp file.
    for line in fin:
        match = fileRegex.search(str(line))
        if match:
            filename = match.group().lstrip("b'")
            break
    # Keep reading from there until the first line that is just 0 or 1.
    for line in fin:
        value = str(line).lstrip("b'").rstrip("\\n'")
        if value == "0" or value == "1":
            dictOfFiles[filename] = value
            break
My thinking is that a similar process that iterates through the whole file is needed. So far I have been working line by line. Possibly it would be better to read the whole text into a variable and then extract from that. Any thoughts are welcome; this has been bugging me for quite a while...
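To show the kind of iteration I have in mind, here is a rough line-by-line sketch rather than a tested solution. It assumes every block ends with a line of at least five dashes, that the header is the first non-blank line of a block, and that the 0/1 value is the last non-blank line before the dashes. `collect_labels`, `HEADER` and `SEPARATOR` are names I made up for illustration.

```python
import re

# Assumed header shape: integer, path ending in .cpp, string, integer
HEADER = re.compile(rb"^\d+\s+(\S+\.cpp)\s+\S+\s+\d+\s*$")
# Assumed block separator: a line made only of dashes
SEPARATOR = re.compile(rb"^-{5,}\s*$")


def collect_labels(lines):
    """Map each .cpp path to the 0/1 value that closes its block."""
    result = {}
    current = None          # path of the current block, if it is a .cpp block
    last = b""              # most recent non-blank line seen in the block
    at_block_start = True   # True until the first non-blank line of a block

    def close_block():
        # Store the block only if it had a .cpp header and ends in 0 or 1
        if current is not None and last in (b"0", b"1"):
            result[current] = int(last)

    for raw in lines:
        if SEPARATOR.match(raw):
            close_block()
            current, last, at_block_start = None, b"", True
            continue
        if not raw.strip():
            continue        # ignore blank lines
        if at_block_start:
            match = HEADER.match(raw)
            if match:
                current = match.group(1).decode("latin-1")
            at_block_start = False
        last = raw.strip()

    close_block()           # handle a final block with no trailing separator
    return result
```

I would call it as `collect_labels(fin)` on a file opened with `"rb"`, since iterating a binary file yields one bytes line at a time. One caveat I can see: if the same path appears in several blocks, a dictionary keyed by path keeps only the last value.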
Per request, here's the text file: https://raw.githubusercontent.com/CGCL-codes/VulDeePecker/master/CWE-119/CGD/cwe119_cgd.txt