Replace \n in mht file

Question

I'm trying to open and process an mht file and scrape the dealer location data from it. Whenever I hit a website with 'tricky' HTML formatting, I keep running into the same problem. It turns:

a href="http://www.google.com/maps?s=123 main st"......

into

a href="http://www.=
google.com/maps?=12=
 3 main st"

Nothing I have tried so far has restored the line to its original form, so I still can't pull the address out.

a = a.replace(r'=\n', '')

or

a = a.replace(r'\n', '')

or even tried,

a = a.replace(r'[0D]', '')

and just tried,

a = a.sub(r'\n', '')

and all I got was the error "'str' object has no attribute 'sub'". It does the same thing with or without the 'r' in the code.

Nothing has worked thus far. How do I replace the =\n that always pops up whenever I look at an mht file?

I am reading the file with:

a = open('Filename.mht', 'r')
b = a.read()
a.close()

Can you show us the code you're using to obtain the mht file, and how you open it? — Bill Bell, Dec 28 '16 at 17:57

score 0 · Answer 1 · edited May 23 '17 at 10:30


Doing str = str.replace("\n", "") works for me. So if you do

string = '''a href="http://www.=
google.com/maps?=12=
3 main st''' 
string = string.replace("\n", "")

print(string)
'a href="http://www.=google.com/maps?=12=3 main st'

That should work. This post might help explain why.
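One difference from the attempts in the question is worth spelling out. With the r prefix, r'=\n' is three characters ('=', a backslash, and 'n'), so it never matches a real line break. Also, sub is a function in the re module, not a string method, which is why a.sub(...) raises the error quoted above. A minimal sketch using the sample from the question (the variable names are only illustrative):

```python
import re

# The sample from the question, with real newline characters in it.
text = 'a href="http://www.=\ngoogle.com/maps?=12=\n3 main st"'

# A raw string keeps the backslash: '=', '\', 'n' (3 chars), not '=' + newline.
raw_len = len(r'=\n')
plain_len = len('=\n')

# The raw pattern is never found by str.replace, so nothing changes.
unchanged = text.replace(r'=\n', '')

# Without the r prefix, the '=' and the newline are removed together.
joined = text.replace('=\n', '')

# re.sub is called on the module; inside a regex, \n does mean a newline,
# and \r? also covers Windows-style line endings.
joined_re = re.sub(r'=\r?\n', '', text)

print(joined)
```

Removing '=\n' as a unit, rather than '\n' alone, also avoids leaving stray '=' characters in the URL.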

EDIT: Just tested that, it does work.
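Note that the printed result above still contains the '=' characters. A trailing '=' followed by a line break is how quoted-printable encoding marks a soft line break, and MHT files are MIME documents that commonly use that encoding. Assuming that is the case here, the standard-library quopri module can undo both the soft breaks and escapes such as =3D. The sample bytes below are a hypothetical quoted-printable version of the URL from the question, and UTF-8 is an assumed charset:

```python
import quopri

# Hypothetical quoted-printable form of the link from the question:
# '=3D' encodes a literal '=', and '=' at end of line is a soft line break.
encoded = b'a href=3D"http://www.=\ngoogle.com/maps?s=3D12=\n3 main st"'

# decodestring takes bytes and returns bytes with soft breaks removed
# and =XX escapes decoded.
decoded_bytes = quopri.decodestring(encoded)

# Assumption: the part is UTF-8; a real file states its charset in its headers.
decoded = decoded_bytes.decode('utf-8')

print(decoded)
```

For a whole .mht file, the headers and part boundaries are not themselves quoted-printable content, so parsing the file with the standard email package and decoding each part is probably the more robust route.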


answered Dec 28 '16 at 18:02

Pike D.


score 0 · Answer 2 · answered Dec 28 '16 at 19:32


I think I found the workaround. The .read() call was causing the issue, though I'm not sure why. I changed it to readlines() and then joined the lines back into a single string, and it works fine now with one small exception: gotta hate the '.' when you're trying to use re.findall. At least I think that is what is causing the program to hang right now.
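A rough sketch of what that line-by-line approach might look like, not the exact code used. The helper name join_soft_breaks and the sample lines are made up for illustration. It also shows re.escape, which makes '.' and '?' match literally in a pattern; an unescaped '.' matches any character except a newline.

```python
import re

def join_soft_breaks(lines):
    """Rejoin lines, dropping the trailing '=' that marks a soft line break."""
    parts = []
    for line in lines:
        stripped = line.rstrip('\r\n')
        if stripped.endswith('='):
            # Continuation line: drop the '=' and do not add a newline.
            parts.append(stripped[:-1])
        else:
            parts.append(stripped + '\n')
    return ''.join(parts)

# Stand-in for what readlines() would return for the sample in the question.
sample_lines = ['a href="http://www.=\n', 'google.com/maps?=12=\n', '3 main st"\n']
text = join_soft_breaks(sample_lines)

# re.escape turns '.' and '?' into literal matches; the group captures
# everything up to the closing quote.
pattern = re.escape('http://www.google.com/maps?') + r'([^"]*)'
matches = re.findall(pattern, text)

print(matches)
```

With a real file, the open file object can be passed directly, since iterating over it yields lines: join_soft_breaks(open('Filename.mht')).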


confused


Do you still need help? – Pike D. Dec 28 '16 at 20:41
