
I am wondering whether there is a way to save the HTML to my file in a specific structured format. Right now the output of this script is just a jumble of letters and numbers. Could it be structured as IP:port pairs, for example: 111.111.111.11:111 222.222.222.22:22?

Any help is appreciated!

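import re
import urllib.request

ans = True

while ans:
    print("""
      - Menu Selection -
      1. Automatic
      2. Automatic w/Checker
      3. Manual
      4. Add to list
      5. Exit
      """)
    # Ask inside the loop so the menu is shown once per selection
    ans = input('Select Option : ')

    if ans == "1":
        try:
            with urllib.request.urlopen('http://www.mywebsite.net') as response:
                # read() returns bytes; decode instead of str() to avoid the b'...' wrapper
                html = response.read().decode('utf-8', errors='replace')
            # Note: this only removes a lowercase letter directly followed by an uppercase one
            html = re.sub(r'([a-z][A-Z])', '', html)
            with open('text.txt', 'a') as f:
                f.write(html)
            print('Data(1) saved.')
            ans = True
        except Exception:
            print('Error on first fetch.')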
dexray
  • Use an HTML parser such as `BeautifulSoup`. – Alex Hall May 22 '16 at 21:04
  • How would you like the dots and colons ordered relative to the numbers? Do you have a way in mind to keep that intact? – minocha May 22 '16 at 21:04
  • In the format of an IP. – dexray May 22 '16 at 21:07
  • @dexray can you please give a sample input and output, with a detailed example? The intended result is unclear. – minocha May 22 '16 at 21:12
  • Input = a text file that contains the following: fdsfdsfdsf123.123.123.123:123fdds125.125.125.125:125fdsfdfdsfdsfsdf. I want my output to be: 123.123.123.123:123 (newline) 125.125.125.125:125 – dexray May 22 '16 at 21:19

1 Answer


Going by the sample given in the comments on the question:

Input - fdsfdsfdsf123.123.123.123:123fdds125.125.125.125:125fdsfdfdsfdsfsdf

Output - 123.123.123.123:123 (newline) 125.125.125.125:125

If html is the input string, replace every character that is not a digit, a dot, or a colon with a newline, then drop the empty pieces:

filtered_alpha = re.sub(r'[^0-9.:]', '\n', html)
multiple_ips = filter(None, filtered_alpha.split("\n"))
print("\n".join(multiple_ips))

This will give you the intended output for that sample.

If you are specifically looking for just IP addresses, you can refer to the post by @MarkByers, where he gives a pattern that matches the dotted address only (without the port):

ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', html)
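To pull out the address and the port together, the same idea can be extended with a `:port` suffix. The following is only a minimal sketch: the pattern, the `IP_PORT_RE` name, and the `extract_ip_ports` helper are my own assumptions for illustration, not something from the linked post, and the pattern does not check that each number is in the 0-255 range.

```python
import re

# Assumed pattern (illustrative): four dot-separated groups of 1-3 digits,
# a colon, then a port of 1-5 digits. The lookarounds stop a match from
# starting or ending in the middle of a longer run of digits.
IP_PORT_RE = re.compile(r'(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}(?!\d)')


def extract_ip_ports(text):
    """Return every IP:PORT substring found in text, in order of appearance."""
    return IP_PORT_RE.findall(text)


sample = "fdsfdsfdsf123.123.123.123:123fdds125.125.125.125:125fdsfdfdsfdsfsdf"
pairs = extract_ip_ports(sample)

# One pair per line; the same joined string could be written to a file instead.
print("\n".join(pairs))
```

Note that this only finds pairs that appear literally as IP:PORT in the text. If a page puts the address and the port in separate HTML elements, nothing will match and the page would need to be parsed first.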

minocha
  • Thank you very much friend! – dexray May 22 '16 at 21:35
  • @dexray - no problem :) – minocha May 22 '16 at 21:36
  • I still have a small issue. Is there a way I can separate each IP with spaces or line breaks? – dexray May 22 '16 at 21:45
  • If all those IPs are in a list, you can use " ".join(list_name) or "\n".join(list_name) to put spaces or newlines between them, as in the `"\n".join(multiple_ips)` call in my example. – minocha May 22 '16 at 21:47
  • This ended up not fixing my issue. I had tested with a website whose HTML was just IPs and ports. With other websites, nothing I tried returns output in the IP:PORT format. – dexray May 23 '16 at 00:58