I have a text file that contains a list of website links, like this:

test.txt:

http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/

I want to write a simple Python script that extracts only the site names, each cut to at most 8 characters (no name longer than 8 characters). The output should look like this:

output.txt:

site1
site2325
site3eiu
site4

I have written some code:

txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print './done'

But I don't know how to split() on more than one delimiter in a single line. I also tried the re module, but I couldn't work out how to write the code for it.

Can someone please help me write this script? :(

Ali1331

3 Answers


You can achieve this using a regular expression, as below.

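import re

no = 8  # maximum number of characters to keep
# Match the scheme (http or https) and an optional "www." prefix
regex = r"https?://(?:www\.)?"
text = "http://site232546ee.com/"
match = re.match(regex, text)
start = match.end(0)  # index of the first character after the prefix
end = start + no
string1 = text[start:end]
# If the name is shorter than `no` characters, cut it at the first dot
end = string1.find('.')
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)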
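
The same idea can be applied to every URL from the question rather than to a single string. What follows is only a sketch, assuming Python 3; the helper name `short_name` and the constant `MAX_LEN` are made up for illustration and do not come from the answer above.

```python
import re

MAX_LEN = 8  # maximum number of characters to keep, as asked in the question

# Scheme, optional "www." prefix, then the first host label
# (everything up to the next dot, slash or colon).
PATTERN = re.compile(r"^https?://(?:www\.)?([^./:]+)")

def short_name(url, max_len=MAX_LEN):
    """Return the first host label of `url`, cut to `max_len` characters.

    Returns None when the line does not look like an http(s) URL.
    """
    match = PATTERN.match(url.strip())
    if match is None:
        return None
    return match.group(1)[:max_len]

urls = [
    "http://www.site1.com/",
    "http://site232546ee.com/",
    "https://www.site3eiue213.org/",
    "http://site4.biz/",
]
names = [short_name(u) for u in urls]
print(names)  # ['site1', 'site2325', 'site3eiu', 'site4']
```

Slicing with `[:max_len]` never fails on short names, so no separate check for the position of the dot is needed.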
Abhijit

You said you want to extract site names with 8 characters, but the output.txt example shows truncated parts of domain names. If you instead want to keep only the domain names that have eight or fewer characters, here is a solution.

Step 1: Get all the domain names.

import tldextract
import pandas as pd

list_u = ('http://www.site1.com/', 'http://site232546ee.com/', 'https://www.site3eiue213.org/', 'http://site4.biz/')

text_s = ''
for l in list_u:
    # tldextract splits each URL into subdomain, domain and suffix
    extracted = tldextract.extract(l)
    text_s += extracted.domain + ' '

print(text_s)  # gives a string of domain names delimited by whitespace

Step 2: Filter domain names with 8 or fewer characters.

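word = text_s.split()
lent = [len(x) for x in word]

# One row per domain name, with its length in a second column
word_len_list = pd.DataFrame(
    {'words': word,
     'char_length': lent,
     })
# Keep only the rows whose name has 8 or fewer characters
print(word_len_list[word_len_list.char_length <= 8])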

Output looks like this:

   words  char_length
0  site1            5
3  site4            5
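
For a filter this small, a plain list comprehension is probably enough and avoids the pandas dependency. This is a sketch only: the `domains` list is hard-coded to the names Step 1 would produce, so that the example runs without tldextract, and `MAX_LEN` is an illustrative name.

```python
# Domain names as Step 1 would produce them (hard-coded here as an assumption,
# so the sketch runs without tldextract installed).
domains = ["site1", "site232546ee", "site3eiue213", "site4"]

MAX_LEN = 8  # keep names with this many characters or fewer

short_domains = [d for d in domains if len(d) <= MAX_LEN]
print(short_domains)  # ['site1', 'site4']
```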

Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written

SKD

Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since "http://www." occurs only once in the text. Try splitting at the newline character \n instead. That way you get a list of all the URLs.
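
A small sketch of the difference, assuming Python 3 and with the contents of test.txt inlined as a string so it runs on its own:

```python
# The contents of test.txt from the question, inlined for the example.
text = (
    "http://www.site1.com/\n"
    "http://site232546ee.com/\n"
    "https://www.site3eiue213.org/\n"
    "http://site4.biz/\n"
)

# Splitting on "http://www." finds that exact string only once,
# so the result is an empty string plus everything after it.
pieces = text.split("http://www.")
print(len(pieces))  # 2

# Splitting on newlines gives one entry per URL; blank lines are dropped.
urls = [line.strip() for line in text.splitlines() if line.strip()]
print(urls)
```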

Then, for each URL, you'll want to strip the front and the back. You can do that with a regular expression, although it can be quite hard depending on what you want to be able to strip off. See here.

When you have found a regular expression that works for you, check the length of each domain and, using an if statement, write to the file only those domains that satisfy your condition (if len(domain) <= 8: f.write(domain)).
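
Put together, the steps might look like the sketch below. It assumes Python 3 and swaps the regular expression for the standard library's `urllib.parse.urlsplit`; the helpers `first_label` and `write_short_domains` are hypothetical names, and the demo writes to a temporary directory instead of the real test.txt and output.txt.

```python
import os
import tempfile
from urllib.parse import urlsplit

def first_label(url):
    """Hypothetical helper: host name without a leading 'www.', up to the first dot."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".")[0]

def write_short_domains(in_path, out_path, max_len=8):
    """Copy to out_path the domains from in_path that have at most max_len characters."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            domain = first_label(url)
            if domain and len(domain) <= max_len:
                dst.write(domain + "\n")

# Demo in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "test.txt")
    out_path = os.path.join(tmp, "output.txt")
    with open(in_path, "w") as f:
        f.write(
            "http://www.site1.com/\n"
            "http://site232546ee.com/\n"
            "https://www.site3eiue213.org/\n"
            "http://site4.biz/\n"
        )
    write_short_domains(in_path, out_path)
    with open(out_path) as f:
        result = f.read().split()

print(result)  # ['site1', 'site4']
```

Note that this keeps only names that already fit in 8 characters; it does not cut longer names down, which is what the sample output in the question shows.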

Community
fhdrsdg
  • I couldn't find the solution in the link you provided – Ali1331 Sep 16 '14 at 16:13
  • The thing is that there are so many top level domains that it's very hard to have a regular expression that takes out all of them, especially since there are tld's like `.co.uk` with a period in them. To make it easier, you could try using a package such as [tldextract](https://pypi.python.org/pypi/tldextract) or [tld](https://pypi.python.org/pypi/tld). See the first two answers of [this question](http://stackoverflow.com/questions/14406300/python-urlparse-extract-domain-name-without-subdomain). – fhdrsdg Sep 16 '14 at 16:45
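
To illustrate the point in the last comment, here is a sketch of a naive rule (take the label just before the last dot) and how it breaks on a suffix such as `.co.uk`. It assumes Python 3; `naive_domain` is a made-up helper and `www.example.co.uk` is an illustrative URL that does not appear in the question.

```python
from urllib.parse import urlsplit

def naive_domain(url):
    """Naive guess: the label just before the last dot of the host name."""
    labels = urlsplit(url).hostname.split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]

print(naive_domain("http://www.site1.com/"))      # site1
print(naive_domain("http://www.example.co.uk/"))  # co  (not "example")
```

Packages such as tldextract handle this by consulting a list of known public suffixes rather than relying on the position of the dots.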