How to find a string from a ZIP file using Python

Question

I am trying to search for a string in a ZIP file that contains several files. If the string is present in a file, that file has to be moved to a specific location.

import zipfile,os,shutil

f = []
files = 'Contains given substring'
os.chdir(r'C:\Users\Vishali\Desktop\PY\POC')

archive = zipfile.ZipFile('PY.zip')
print(archive.namelist())

for n in archive.namelist():
    print(n)

    f1 = archive.open(n,'r')
    re = f1.readlines()
    print(files)
    print(re)
    if files in re:
        shutil.copy(n,r'C:\Users\Vishali\Desktop\PY\s')
        f.append(f1)

print(f)

However, even when the string is present in a file, it is not detected, and `f` remains an empty list.

What do you want to check for? If `re` contains a string that contains the given substring or if one of the strings in `re` _exactly equals_ the `files` string? Or maybe something else? – ForceBru, Aug 19 '19 at 18:11
Suppose 'Zap.zip' is my ZIP file and it contains three files named 'first.txt', 'second.txt' and 'third.txt'. I want to check which file contains the string I am searching for. For example, if I am searching for the string 'hello', which is present in 'second.txt', I want to print the name of the file that contains the string and also move that file to a specific location. – Lucazade, Aug 19 '19 at 18:15
Currently `files in re` checks whether the exact string is an element of the `re` list; it is not a substring match. – C.Nivs, Aug 19 '19 at 18:16
@forcebru: this is the error I got when I used `read()` in `if files in f1.read():`: `TypeError: a bytes-like object is required, not 'str'`. `r = f1.readlines()` returns a list, and I have not been able to find a way to search for a string in that list. – Lucazade, Aug 19 '19 at 18:22
@nameless13, then `files` should be `bytes`, like: `files = b"the actual thing"` – ForceBru, Aug 19 '19 at 18:24
Rename your variables. The names you've given them do not appear to reflect what they represent. This makes understanding the intentions of your code much more difficult. – jpmc26, Aug 19 '19 at 18:25
I have used this question as an example in [a discussion](https://meta.stackoverflow.com/q/388663/1394393) about a common, larger issue facing this community. – jpmc26, Aug 19 '19 at 22:41
It is not clear how you want to handle line endings. Are newline characters forbidden in your search string? Or can a search string include them and match across multiple lines? If they can include them, do they have to match exactly, or do you need to normalize them somehow? – jpmc26, Aug 20 '19 at 01:05
Please in code questions give a [mre]--cut & paste & runnable code; example input with desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. – philipxy, Aug 20 '19 at 08:01
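
The two points raised in the comments above (members of a ZIP archive are read as `bytes`, and `in` on a list tests whole elements rather than substrings) can be reproduced with a minimal sketch. It builds a small in-memory archive for illustration; the member names and contents are hypothetical, and the search text is assumed to be plain ASCII.

```python
import io
import zipfile

# Build a small in-memory archive (hypothetical member names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("first.txt", "nothing here\n")
    archive.writestr("second.txt", "line one\nsay hello there\n")

# The search term is bytes because members opened from a ZipFile yield bytes.
needle = b"hello"

with zipfile.ZipFile(buffer) as archive:
    with archive.open("second.txt") as member:
        lines = member.readlines()

# Membership on a list compares whole elements, so no line "equals" the needle.
exact_match = needle in lines

# A substring test has to be applied to each line individually.
substring_match = any(needle in line for line in lines)

print(exact_match, substring_match)
```

Note that a per-line test like this cannot match a search term that spans a line break, which is the question about line endings raised above.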

score -1 · Accepted Answer · edited Aug 20 '19 at 11:43

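`re` is a list of `bytes` lines, so `files in re` only tests whether some whole line equals `files`; it never performs a substring match. I am incorporating feedback from @jpmc26 into my original answer.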

Change this:

if files in re:
    shutil.copy(n,r'C:\Users\Vishali\Desktop\PY\s')
    f.append(f1)

to this:

decode = ''
for lines in re:
    decode = decode + lines.decode('utf-8')
if files in decode:
    shutil.copy(n,r'C:\Users\Vishali\Desktop\PY\s')
    f.append(f1)

This decodes the lines retrieved by `zipfile` (assuming the file is UTF-8 encoded) and avoids the escape sequences that calling `str()` on the list would introduce, which could otherwise have caused false positives.
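
One caveat: `shutil.copy(n, ...)` copies a file from the filesystem, so it only works if a file named `n` already exists outside the archive. The following is a rough sketch of an alternative, not the method used in this answer: it reads each member with `ZipFile.read()`, decodes it once, and extracts matching members with `ZipFile.extract()`. The helper name `extract_matching_members`, the UTF-8 default, and the demo file names are all assumptions made for illustration.

```python
import os
import tempfile
import zipfile


def extract_matching_members(zip_path, needle, dest_dir, encoding="utf-8"):
    """Extract each member whose decoded text contains needle.

    Hypothetical helper (not part of zipfile). Returns the matching names.
    """
    matches = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # read() returns the whole member as bytes; decode it once.
            # Undecodable bytes are replaced rather than raising an error.
            text = archive.read(info).decode(encoding, errors="replace")
            if needle in text:
                # Extract straight from the archive instead of shutil.copy().
                archive.extract(info, path=dest_dir)
                matches.append(info.filename)
    return matches


# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "PY.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("first.txt", "nothing to see\n")
        archive.writestr("second.txt", "well, hello there\n")
        archive.writestr("third.txt", "goodbye\n")

    dest_dir = os.path.join(tmp, "s")
    found = extract_matching_members(zip_path, "hello", dest_dir)
    extracted = sorted(os.listdir(dest_dir))

print(found, extracted)
```

Reading the whole member at once keeps the original line breaks, so a search term containing a newline can still match. For very large members, a streaming approach may be preferable.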

answered Aug 19 '19 at 18:23 by brazosFX (edited by Peter Mortensen)

`str` is not the proper mechanism to combine a list of strings. – jpmc26 Aug 19 '19 at 18:26
Sure is sufficient to evaluate the truth of an if statement though. The answer is correct. – brazosFX Aug 19 '19 at 18:27
It introduces extra characters that are not part of the original content, such as quotes and commas. This can result in a false positive. It is not correct. – jpmc26 Aug 19 '19 at 18:28
@jpmc26 - I am assuming you are right. I see the extra output in the string. I will update the answer or comment with a more ideal way once I find it. Thanks. Or add it to a comment and save me the search? ...and give my vote back! I am a new contributor and you are supposed to be nice! Ha ha – brazosFX Aug 19 '19 at 18:37
You don't need to take my word for it. Just observe the output of `str` on a list in the REPL. – jpmc26 Aug 19 '19 at 18:40
The results are even worse if the file contains non-ASCII, control characters, or backslashes. It generates escape sequences in the string you're checking against. For example, `str(['\\'])` doubles the slashes in the repr. – jpmc26 Aug 19 '19 at 20:21
@jpmc26, updated answer after some research. Upvote if you approve, otherwise please comment. Thx for the help. – brazosFX Aug 19 '19 at 22:24
Dropping the newlines can also create false positives. Consider searching for `'abc'` when the file contains `'ab\nc'`. Also, concatenation is [generally not a good way of combining many strings](https://stackoverflow.com/a/52561012/1394393). You may also want to see some things I noted in the Meta discussion I linked above. – jpmc26 Aug 19 '19 at 23:12
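
The pitfalls discussed in this comment thread can be reproduced in a few lines. This is only an illustrative sketch with made-up sample data: it compares `str()` on a list of lines, concatenation with newlines stripped, and a plain `b"".join(...)` followed by a single decode.

```python
# Sample data standing in for the result of readlines() on a ZIP member.
lines = [b"ab\n", b"c\\d\n"]  # the second line contains one backslash

# str() on a list yields its repr: brackets, quotes, commas and escape
# sequences that are not part of the file content.
as_repr = str(lines)

# Stripping the newlines glues "ab" and "c" together.
stripped = "".join(line.decode("utf-8").rstrip("\n") for line in lines)

# Joining the raw lines and decoding once preserves the original content.
content = b"".join(lines).decode("utf-8")

print(as_repr)
print("abc" in stripped, "abc" in content)
```

With this data, searching the repr for `', b'` or for a doubled backslash succeeds even though neither occurs in the file, and searching the stripped text for `abc` succeeds even though the file has a line break between `ab` and `c`.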
