I am trying to write a function as follows:
def get_urls(*urls, restrictions=None):
    # Here there should be some code that
    # iterates through the urls and builds
    # a dictionary where the keys are the
    # respective urls and their values are
    # lists of the possible extensions. The
    # function should return that dictionary.
First, an explanation. Suppose I have a site, www.example.com, and it has only the following pages: www.example.com/faq, www.example.com/history, and www.example.com/page/2. This is how the function would be used:
In[1]: site = 'http://example.com'
In[2]: get_urls(site)
Out[2]: {'http://example.com':['/faq','/history','/page/2']}
I have spent hours researching, and so far this seems impossible! Am I missing some module that can do this? Is there one that exists, but not in Python? If so, in what language?
Now you are probably wondering why there is a restrictions=None parameter, so here is why: I want to be able to add restrictions on what counts as an acceptable url. For example, restrictions='first' could make it return only pages that are one '/' deep. Here is an example:
In[3]: get_urls(site,restrictions='first')
Out[3]: {'http://example.com':['/faq','/history']}
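To illustrate what I mean by that restriction, it could be something like a simple filter on the number of path segments (this is only a sketch of the idea, and filter_paths is a hypothetical helper name, not code I have):

```python
def filter_paths(paths, restrictions=None):
    # Hypothetical helper: keep only paths that match the restriction.
    # With restrictions='first', keep paths containing exactly one '/',
    # i.e. top-level pages like '/faq' but not '/page/2'.
    if restrictions == 'first':
        return [p for p in paths if p.count('/') == 1]
    return list(paths)

filter_paths(['/faq', '/history', '/page/2'], restrictions='first')
# → ['/faq', '/history']
```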
I don't need to keep listing ideas for restrictions, but you can see the need for them! Some sites, especially social networks, have some crazy add-ons for every picture, and weeding those out is important while keeping the original page that contains all the photos.
So yes, I have absolutely no code for this, but that is because I have no clue what to do! Still, I think I have made clear what I need to be able to do, so: is this possible? If yes, how? If no, why not?
EDIT:
So after some answers and comments, here is some more info. I want to be given a url, not necessarily a domain, and get back a dictionary with the original url as the key and a list of all of that url's extensions as the value. Here is an example with my previous 'example.com':
In[4]: site = 'http://example.com/page'
In[5]: get_urls(site)
Out[5]: {'http://example.com/page':['/2']}
The crawling examples and Beautiful Soup are great, but if some url is not directly linked from any of the pages, then I can't find it. Yes, that is generally not a concern, but I would like to be able to!
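For reference, the crawling approach I mentioned can be sketched with just the standard library (html.parser instead of Beautiful Soup). The page fetching is left out here, and this still only discovers pages that are linked somewhere, which is exactly the limitation I am asking about:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    # Collects the href attribute of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_paths(base_url, html):
    # Return the paths of all links on the page that stay on base_url's host.
    parser = LinkExtractor()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    paths = []
    for link in parser.links:
        absolute = urljoin(base_url, link)
        parsed = urlparse(absolute)
        if parsed.netloc == base_host and parsed.path not in ('', '/'):
            paths.append(parsed.path)
    return paths

# A real get_urls would fetch each page (e.g. with urllib.request),
# run extract_paths on the HTML, and recurse into the discovered pages.
page = '<a href="/faq">FAQ</a> <a href="/history">History</a> <a href="http://other.com/x">off-site</a>'
extract_paths('http://example.com', page)
# → ['/faq', '/history']
```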