How do I match Tumblr urls from a text file with Regex and Python

Question

I am quite new to regex. I tried to solve this myself for some time but couldn't come up with a solution. (I am trying to do this with Python 2.7.)

I have a list of Tumblr links from posts and notes. They look like this:

"http://TumblrUsername.tumblr.com/post/hello/notes/somemoresutff/464654"

What I want to do is keep only the "http://TumblrUsername.tumblr.com/" part and discard the rest, so that I can compile a list of Tumblr users.

My code looks like this, but my question is: how do I select only the part I want?

import urllib
import requests
import lxml
from bs4 import BeautifulSoup


def find_notes():

    file = open('output.txt', 'w')

    f = requests.get('http://fullthrottleauto.tumblr.com/post/132323884114/treunenthibault-ferrari-599xx-evo-as-i-love')

    soup = BeautifulSoup(f.text, "lxml")

    for post_note in soup.find_all('a', href=True):

        print post_note['href']
        returnline = str(post_note['href'])

        if '.tumblr.com/' in returnline:
            # I need to do something here to extract only "http://username.tumblr.com/"
            file.write(returnline + '\n')


find_notes()

Thanks for the reply. There is no specific code at the moment. Let me post what I have so far. — MaxE, Nov 02 '15 at 09:45
[Here](https://docs.python.org/2/library/re.html) is the documentation. Take a look at the `.*` part, the `.+?` part and the `re.findall()` part, then try something before you ask a question here. — Remi Guan, Nov 02 '15 at 09:47
`result = re.findall("http://TumblrUsername.tumblr.com", subject, re.IGNORECASE)` — Learner, Nov 02 '15 at 09:47
@SIslam I think that `TumblrUsername` is not fixed here, it's a username. So maybe `re.findall(r'http://.+?\.tumblr\.com', string)`. Or just extract the username part: `re.findall(r'http://(.+?)\.tumblr\.com', string)`. — Remi Guan, Nov 02 '15 at 09:50
Check this [Example on Regex101](https://regex101.com/r/uE1mI4/2) — benjamin, Nov 02 '15 at 09:54
BTW, about `file = open('output.txt', 'w')`, what about [closing it](http://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important) after `file.write`? — Remi Guan, Nov 02 '15 at 10:04
Thank you so much for your help, guys! Especially @SIslam: I didn't really know about the startswith() and endswith() functions. — MaxE, Nov 02 '15 at 13:09
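
The regex suggestions in the comments above can be put together as a small helper. This is only a sketch: the name `extract_blog_url` is hypothetical, and the pattern assumes that a blog name consists of letters, digits and hyphens. It also treats any subdomain (including `www`) as a username, so adjust it if that matters for your data.

```python
import re

# Assumption: a Tumblr blog name is made of letters, digits and hyphens.
# The trailing (?:/|$) makes sure the host really ends at ".tumblr.com".
TUMBLR_BLOG = re.compile(r'(https?://[A-Za-z0-9-]+\.tumblr\.com)(?:/|$)')


def extract_blog_url(url):
    """Return 'http://<name>.tumblr.com/' for a Tumblr link, else None."""
    match = TUMBLR_BLOG.match(url)  # match() anchors at the start of the string
    if match is None:
        return None
    return match.group(1) + '/'
```

Inside the question's loop, this could replace the placeholder comment: call `extract_blog_url(returnline)` and write the result to the file only when it is not `None`.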

Learner · Accepted Answer · 2015-11-02T10:32:00.990

The code below works without regex, so why use one? It prints the links and appends them to the file at the given path.

import requests
from bs4 import BeautifulSoup


def find_notes():

    f = requests.get('http://fullthrottleauto.tumblr.com/post/132323884114/treunenthibault-ferrari-599xx-evo-as-i-love')

    # The "lxml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(f.text, "lxml")

    # Open in append mode; the with block closes the file when the loop ends.
    with open(r"C:\Users\USER_NAME\Desktop\output.txt", 'ab') as data_file:
        # Only look at anchors marked rel="nofollow".
        for post_note in soup.find_all('a', {'rel': 'nofollow'}):
            href = post_note.get('href', '')
            # Keep links that are exactly a blog root, e.g. http://name.tumblr.com/
            if href.startswith('http') and href.endswith('.tumblr.com/'):
                print href
                data_file.write(href + '\n')


find_notes()

It prints:

http://jambo077.tumblr.com/
http://jambo077.tumblr.com/
http://thelordlux.tumblr.com/
http://thelordlux.tumblr.com/
http://dp0d.tumblr.com/
http://dp0d.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://dp0d.tumblr.com/
http://dp0d.tumblr.com/
http://fraggreen.tumblr.com/
http://fraggreen.tumblr.com/
http://amazingcars.tumblr.com/
http://kennylayy.tumblr.com/
http://kennylayy.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://coco2280.tumblr.com/
http://coco2280.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://devrimdeniz3.tumblr.com/
http://devrimdeniz3.tumblr.com/
http://nicholasembly.tumblr.com/
http://nicholasembly.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://nicholasembly.tumblr.com/
http://nicholasembly.tumblr.com/
http://geee22.tumblr.com/
http://geee22.tumblr.com/
http://donymadero.tumblr.com/
http://donymadero.tumblr.com/
http://avromen.tumblr.com/
http://avromen.tumblr.com/
http://carbonmotors.tumblr.com/
http://carbonmotors.tumblr.com/
http://blackdragonheartrider.tumblr.com/
http://blackdragonheartrider.tumblr.com/
http://travelerintheworldofdreams.tumblr.com/
http://travelerintheworldofdreams.tumblr.com/
http://evo-dreaming.tumblr.com/
http://evo-dreaming.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://kareem121.tumblr.com/
http://kareem121.tumblr.com/
http://hotmenandhotcars.tumblr.com/
http://hotmenandhotcars.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://schnixon.tumblr.com/
http://schnixon.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://schnixon.tumblr.com/
http://schnixon.tumblr.com/
http://mikeawwr.tumblr.com/
http://mikeawwr.tumblr.com/
http://joshke1.tumblr.com/
http://joshke1.tumblr.com/
http://banginscrew.tumblr.com/
http://banginscrew.tumblr.com/
http://smiley-sj.tumblr.com/
http://smiley-sj.tumblr.com/
http://char1ie1000.tumblr.com/
http://char1ie1000.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://char1ie1000.tumblr.com/
http://char1ie1000.tumblr.com/
http://relentless-haedons.tumblr.com/
http://relentless-haedons.tumblr.com/
http://metinpurde.tumblr.com/
http://metinpurde.tumblr.com/
http://superkingchris.tumblr.com/
http://superkingchris.tumblr.com/
http://16frango16.tumblr.com/
http://16frango16.tumblr.com/
http://franck-brevet.tumblr.com/
http://franck-brevet.tumblr.com/
http://car1ba.tumblr.com/
http://car1ba.tumblr.com/
http://trezio.tumblr.com/
http://trezio.tumblr.com/
http://molounhuevofrito.tumblr.com/
http://molounhuevofrito.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://rebeccabum.tumblr.com/
http://rebeccabum.tumblr.com/
http://madv8.tumblr.com/
http://madv8.tumblr.com/
http://jrcs87lol.tumblr.com/
http://jrcs87lol.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://jrcs87lol.tumblr.com/
http://jrcs87lol.tumblr.com/
http://alegasta.tumblr.com/
http://alegasta.tumblr.com/
http://ericj3love.tumblr.com/
http://ericj3love.tumblr.com/
http://frostfiree.tumblr.com/
http://frostfiree.tumblr.com/
http://bull58.tumblr.com/
http://bull58.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://fumihirokoyama.tumblr.com/
http://fumihirokoyama.tumblr.com/
http://thethatnoelguysstuff.tumblr.com/
http://thethatnoelguysstuff.tumblr.com/
http://fullthrottleauto.tumblr.com/
http://thethatnoelguysstuff.tumblr.com/
http://thethatnoelguysstuff.tumblr.com/
http://peachedme.tumblr.com/
http://peachedme.tumblr.com/
http://il-salice-errante.tumblr.com/
http://il-salice-errante.tumblr.com/
http://fajhr.tumblr.com/
http://fajhr.tumblr.com/
http://jah-eras.tumblr.com/
http://jah-eras.tumblr.com/
http://fullthrottleauto.tumblr.com/
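
The output above lists most blogs more than once. If the goal is a list of distinct users, the links can be de-duplicated before writing them out. A minimal sketch, assuming the links have been collected into a list first; the helper name `unique_in_order` is hypothetical:

```python
def unique_in_order(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

A plain `set(urls)` would also remove duplicates, but it does not preserve the order in which the blogs appeared on the page.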
