
I'm trying to open and process .mht files to scrape dealer location data. Whenever I run into a website with 'tricky' HTML formatting, I keep running into the same problem. It turns:

a href="http://www.google.com/maps?s=123 main st"......

into

a href="http://www.=
google.com/maps?=12=
 3 main st"

Nothing I have tried so far has restored the line to its original form, so I still can't pull the address out.

a = a.replace(r'=\n', '')

or

a = a.replace(r'\n', '')

or even tried,

a = a.replace(r'[0D]', '')

and just tried,

a = a.sub(r'\n', '')

and all I got was the error "'str' object has no attribute 'sub'". The result is the same with or without the 'r' prefix in the code.

Nothing has worked thus far. How do I remove the =\n that always pops up whenever I look at an mht file?

I am using

a = open('Filename.mht', 'r')
b = a.read()
a.close()
confused

2 Answers


Doing string = string.replace("\n", "") works for me. So if you do

string = '''a href="http://www.=
google.com/maps?=12=
3 main st''' 
string = string.replace("\n", "")

print(string)
'a href="http://www.=google.com/maps?=12=3 main st'

That should work. This post might help and explain why.

EDIT: Just tested that, it does work.
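
Note that the output above still contains the '=' characters, so the URL is not fully restored. Below is a minimal sketch of one way to drop the '=' together with the line break. It assumes each trailing '=' is a soft line break (as in quoted-printable encoding, which .mht files commonly use) and that the file may use either LF or CRLF line endings, which would explain why replacing '=\n' alone has no effect. The function name join_soft_breaks is made up for illustration.

```python
import re

def join_soft_breaks(text):
    """Remove '=' followed by a line ending (LF or CRLF)."""
    # \r? makes the carriage return optional, so both styles match.
    return re.sub(r'=\r?\n', '', text)

# The broken snippet from the question, written as one string literal.
broken = 'a href="http://www.=\ngoogle.com/maps?=12=\n3 main st"'

print(join_soft_breaks(broken))
# a href="http://www.google.com/maps?=123 main st"
```

Note that sub is a function of the re module (re.sub), not a string method, which is why a.sub(...) raises an AttributeError.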

Pike D.

I think I found the workaround. The .read() call was causing the issue, though I'm not sure why. I changed it to readlines() and then joined the lines back into one string, and it works fine now with one small exception: gotta hate the '.' when you're trying to re.findall... at least I think that is what is causing the program to hang up right now.
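
For anyone hitting the same thing, here is a hedged sketch of an alternative to rejoining lines by hand. It assumes the trailing '=' marks are quoted-printable soft line breaks, in which case the standard-library quopri module can decode the text directly. The sample bytes, the helper name decode_qp, and the URL prefix are illustrative only; a real .mht file is a multi-part MIME container, so this would apply to the body of a quoted-printable part rather than the whole file. re.escape takes care of the literal '.' and '?' in the pattern.

```python
import quopri
import re

def decode_qp(raw_bytes):
    """Decode quoted-printable bytes and return text."""
    decoded = quopri.decodestring(raw_bytes)
    # UTF-8 is an assumption; replace undecodable bytes instead of failing.
    return decoded.decode('utf-8', errors='replace')

# Hypothetical sample: '=3D' encodes a literal '=', and '=' at the end
# of a line is a soft line break.
sample = b'a href=3D"http://www.=\r\ngoogle.com/maps?s=3D12=\r\n3 main st"'
html = decode_qp(sample)

# Escape the fixed prefix so '.' and '?' are matched literally.
prefix = re.escape('http://www.google.com/maps?s=')
addresses = re.findall(prefix + r'([^"]*)', html)

print(html)
print(addresses)
```

When reading from disk, opening the file in binary mode ('rb') keeps the line endings untouched before decoding.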

confused