I am quite new to regex. I tried to solve this myself for some time but couldn't come up with a solution. (I am using Python 2.7.)
I have a list of Tumblr links taken from posts and notes. They look like this:
"http://TumblrUsername.tumblr.com/post/hello/notes/somemoresutff/464654"
What I want to do is keep only the "http://TumblrUsername.tumblr.com/" part and drop the rest, so that I can compile a list of Tumblr users.
My code is below. My question is: how do I extract just the part I want?
import urllib
import requests
import lxml
from bs4 import BeautifulSoup

def find_notes():
    # Open the output file in a with-block so it is closed even on errors
    with open('output.txt', 'w') as out_file:
        f = requests.get('http://fullthrottleauto.tumblr.com/post/132323884114/treunenthibault-ferrari-599xx-evo-as-i-love')
        soup = BeautifulSoup(f.text, "lxml")
        # Walk every anchor tag that actually has an href attribute
        for post_note in soup.find_all('a', href=True):
            print post_note['href']
            returnline = str(post_note['href'])
            if '.tumblr.com/' in returnline:
                ## I need to do some thing here to extract "only the http://username.tumblr.com/"
                out_file.write(returnline + '\n')

find_notes()
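
For reference, here is a minimal sketch of what the missing step might look like, not a definitive answer. It shows two possible approaches: a regex anchored at the start of the URL, and the standard library's `urlparse`. The names `TUMBLR_BASE`, `tumblr_base_regex` and `tumblr_base_urlparse` are hypothetical, made up for illustration. The regex assumes every link has exactly one subdomain label (the username) made of letters, digits and hyphens.

```python
import re

try:
    from urlparse import urlparse  # Python 2.7
except ImportError:
    from urllib.parse import urlparse  # Python 3

# Hypothetical pattern: scheme, a single subdomain label, then ".tumblr.com/"
TUMBLR_BASE = re.compile(r'^https?://[A-Za-z0-9-]+\.tumblr\.com/')


def tumblr_base_regex(url):
    """Return the 'http://username.tumblr.com/' prefix, or None if absent."""
    match = TUMBLR_BASE.match(url)
    return match.group(0) if match else None


def tumblr_base_urlparse(url):
    """Same result without a regex: rebuild the URL from scheme and host."""
    parts = urlparse(url)
    if not parts.netloc.endswith('.tumblr.com'):
        return None
    return '{0}://{1}/'.format(parts.scheme, parts.netloc)


link = "http://TumblrUsername.tumblr.com/post/hello/notes/somemoresutff/464654"
print(tumblr_base_regex(link))
print(tumblr_base_urlparse(link))
```

Inside the loop, the result of either function could be written to the file in place of `returnline`. Collecting the results in a `set` first would avoid writing the same user more than once.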