Using RegEx and Reading files inside of a .EGG File?

Question

newemail = 'test@gmail.com'

import zipfile
import re

egg = zipfile.ZipFile('C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg')
file = egg.open('scraping_tool/settings.py')

text = file.read().decode('utf8')
emailregex = re.compile(r'[A-Za-z0-9-.]+@[A-Za-z0-9-.]+')
newtext = emailregex.sub(newemail,text)
newtext = newtext.encode('utf8')

file.close()
egg.close()

egg = zipfile.ZipFile('C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg', 'w')
file = egg.open('scraping_tool/settings.py', 'w')
file.write(newtext)

file.close()
egg.close()

I'm a week into programming, so let me know if anything I say doesn't make sense. What I'm trying to do right now is get the email out of a .py file inside an egg file.

In the interactive shell I was able to retrieve the text with txt = file.read(), but once I get match objects and regex involved I start getting errors like "can not get strings from byte objects".

I tried reading Stack Overflow questions about the errors, but I'm still too new to decipher what they are talking about and might need it dumbed down a bit more. I understand that the zip file is messing up how regex works with strings, but I'm not sure how to fix it.

EDIT: Bonus Question about encoding

score 0 · Accepted Answer · answered Jan 25 '20 at 08:33

When you .read() an entry from a zipfile, you get bytes. Nothing automatically detects whether a zip entry is a binary file or a text file; you have to make that decision yourself.

In order to convert bytes into a string, you must decode them, which requires knowing the text encoding the file was saved in. If you don't know for sure, UTF-8 (utf8) and Windows-1252 (cp1252) are common encodings to try. You'll know that you've picked the right encoding when special/accented characters look right in the result:

txt = file.read().decode('utf8')
print(txt)

Once you're working with strings, the "can not get strings from byte objects" error won't occur anymore.
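
To tie the decode step to the regex from the question, here is a minimal sketch. It builds a small in-memory archive so it runs without a real .egg file; the entry contents and the old address are made up for illustration, and UTF-8 is assumed as the encoding.

```python
import io
import re
import zipfile

# Build a tiny in-memory archive so the example runs on its own.
# The file contents below are hypothetical, not taken from a real egg.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('scraping_tool/settings.py', "EMAIL = 'old@example.com'\n")

with zipfile.ZipFile(buffer) as archive:
    raw = archive.read('scraping_tool/settings.py')  # bytes, as stored in the zip
    text = raw.decode('utf8')                        # str, assuming UTF-8

# Same character set as the pattern in the question, with the hyphen moved last.
email_regex = re.compile(r'[A-Za-z0-9.-]+@[A-Za-z0-9.-]+')
new_text = email_regex.sub('test@gmail.com', text)   # str in, str out

# Encode again before writing back, since zip entries hold bytes.
new_bytes = new_text.encode('utf8')
```

The regex is compiled from a str pattern, so it has to be applied to the decoded text rather than to the raw bytes.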

Tomalak


thanks, I actually tried this before when googling solutions and it didn't work, not sure what I changed this time but it works, great, thank you! – Jan 25 '20 at 08:40
I assume I just do .encode('utf8') if I want to write it? – Jan 25 '20 at 08:42
Exactly. Encoding turns a string back into bytes. (BTW, if you have tried things that didn't work, add them to your question (next time). People will usually also explain what was wrong in your attempt.) – Tomalak Jan 25 '20 at 08:44
thank you! Follow-up question while trying to wrap this code up: I edited the code to what I have now. When I try to write the file, it currently replaces the entire egg file with my newly edited settings.py. I assume this is because I'm not zipping it back up correctly, and I haven't figured out the syntax for that. Any way you could help me out with that? – Jan 25 '20 at 09:07
General tip: It's better to leave your original question alone instead of overwriting it with your progress. This way the answers don't get out of sync with what you have asked, so the next person reading this can make sense of the thread. – Tomalak Jan 25 '20 at 09:21
Other than that, have a look at https://stackoverflow.com/questions/4653768/overwriting-file-in-ziparchive – Tomalak Jan 25 '20 at 09:25
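
For context on the follow-up: opening an existing archive with mode 'w' starts a new, empty archive, which is why only settings.py survives. One common workaround is to copy every entry into a new archive and swap in the changed one. The sketch below is only an illustration of that idea; replace_entry is a hypothetical helper, and the in-memory archives and file contents are made up so the example runs on its own.

```python
import io
import zipfile

def replace_entry(source, target, entry_name, new_data):
    """Copy all entries from source to target, using new_data for entry_name."""
    with zipfile.ZipFile(source) as zin, zipfile.ZipFile(target, 'w') as zout:
        for item in zin.infolist():
            if item.filename == entry_name:
                zout.writestr(item, new_data)                 # the edited file
            else:
                zout.writestr(item, zin.read(item.filename))  # unchanged copy

# Hypothetical stand-in for the egg: two entries in an in-memory archive.
source = io.BytesIO()
with zipfile.ZipFile(source, 'w') as archive:
    archive.writestr('scraping_tool/settings.py', b"EMAIL = 'old@example.com'\n")
    archive.writestr('scraping_tool/__init__.py', b'')

target = io.BytesIO()
replace_entry(source, target, 'scraping_tool/settings.py',
              b"EMAIL = 'test@gmail.com'\n")

with zipfile.ZipFile(target) as archive:
    names = archive.namelist()
    settings = archive.read('scraping_tool/settings.py')
```

With a real egg, source and target would be two different file paths, and the original could be replaced with the new file (for example via os.replace) once the copy has finished without errors.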
