Find command in python catches only first line

Question

I'm trying to grab the magnet link from the following HTML:

rawdata = ''' <div class="iaconbox center floatright">
            <a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a>               <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a>                                <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ &#39;name&#39;: &#39;Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK&#39;, &#39;extension&#39;: &#39;mkv&#39;, &#39;magnet&#39;: &#39;magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&#39; }"></div>
            <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
            <a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
        </div> '''

Using this command

rawdata[rawdata.find("<")+1:rawdata.find(">")]

Gives me

div class="iaconbox center floatright"

But when I try to find the magnet link

rawdata[rawdata.find("href="magnet:?")+1:rawdata.find(""")]

It gives me

' '

What I actually want it to give me

magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

This is easy in the shell, but it has to be done in Python itself.

score 1 · Answer 1 · answered Jun 12 '16 at 15:10


Try single quotes around the search strings, so the embedded double quote does not terminate them:

rawdata[rawdata.find('href="magnet:?')+1:rawdata.find('"')]

HenryM

score 1 · Answer 2 · answered Jun 12 '16 at 15:26


It's better to use a regular expression.

import re

rawdata = '''your rawdata......'''
regex = re.compile('href="(.+)" class="icon16')
magnet_href = regex.search(rawdata).group(1)
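A slightly stricter variant of the same idea, as a sketch rather than a definitive pattern: it anchors on `magnet:?` and uses `[^"]+`, so the match cannot run past the closing quote of the attribute. The shortened `rawdata` and the name `MAGNET_RE` are made up for illustration, and `html.unescape` is added on the assumption that you want `&` rather than `&amp;` in the result.

```python
import re
from html import unescape

# Shortened, hypothetical stand-in for the rawdata in the question.
rawdata = '''<div>
    <a class="icon16" href="https://example.invalid/page.html" title="Verified Torrent"></a>
    <a title="Torrent magnet link" href="magnet:?xt=urn:btih:ABC123&amp;dn=example" class="icon16 askFeedbackjs"></a>
</div>'''

# [^"]+ matches everything up to (not including) the next double quote.
MAGNET_RE = re.compile(r'href="(magnet:\?[^"]+)"')

match = MAGNET_RE.search(rawdata)
# search() returns None when nothing matches, so guard before calling group().
magnet_href = unescape(match.group(1)) if match else None
print(magnet_href)
```

This still treats the HTML as plain text, so it will break if the attribute is written with single quotes or split across lines.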

diracccc.lu

score 1 · Answer 3 · answered Jun 12 '16 at 15:27

First of all, as pointed out by HenryM, you need to use single quotes or escape the " to make the strings valid.

Second, find() always returns the index of the first occurrence of the substring. So you will find the first " in the whole string, not the one ending the link. To fix this, pass a start position as the second argument to find(). It has to be passed positionally, because str.find() does not accept keyword arguments.

Additionally, find() gives you the index where the match begins, not where it ends, so you need to add the length of the href=" prefix to the start index. Adding the length of the whole query would also skip the magnet:? part you want to keep. The code would look something like this:

start = rawdata.find('href="magnet:?') + len('href="')
end = rawdata.find('"', start)
link = rawdata[start:end]
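The find()-based approach from this answer can be wrapped in a small function. This is a minimal sketch: the function name `extract_magnet` and the shortened `rawdata` are hypothetical, and the `html.unescape` call is an addition that assumes you want `&` instead of `&amp;` in the result, as in the question's expected output.

```python
from html import unescape

# Shortened, hypothetical stand-in for the rawdata in the question.
rawdata = '''<div>
    <a class="icon16" href="https://example.invalid/page.html" title="Verified Torrent"></a>
    <a title="Torrent magnet link" href="magnet:?xt=urn:btih:ABC123&amp;dn=example" class="icon16"></a>
</div>'''

def extract_magnet(html_text):
    """Return the first magnet link in html_text, or None if there is none."""
    prefix = 'href="'
    pos = html_text.find(prefix + 'magnet:?')
    if pos == -1:                       # find() returns -1 when nothing matches
        return None
    start = pos + len(prefix)           # skip href=" but keep magnet:?
    end = html_text.find('"', start)    # first double quote at or after start
    if end == -1:
        return None
    return unescape(html_text[start:end])  # turn &amp; back into &

link = extract_magnet(rawdata)
print(link)
```

Checking for -1 matters here: without it, a missing link would silently produce a slice of unrelated text instead of an error.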

score 1 · Answer 4 · edited May 23 '17 at 11:58

The input data is an HTML fragment. You should not be using regular expressions to parse it.

Use a parser instead. Here is a working sample using the BeautifulSoup HTML parser:

from bs4 import BeautifulSoup


rawdata = ''' <div class="iaconbox center floatright">
    <a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a>               <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a>                                <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ &#39;name&#39;: &#39;Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK&#39;, &#39;extension&#39;: &#39;mkv&#39;, &#39;magnet&#39;: &#39;magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&#39; }"></div>
    <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
    <a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''

soup = BeautifulSoup(rawdata, "html.parser")
print(soup.find("a", title="Torrent magnet link")["href"])

Prints:

magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

If I needed to find multiple instances of the magnet link, would findall() do the job here? - SamFlynn, Jun 12 '16 at 17:18
@SamFlynn yes, sure, use the `find_all()` method and get the `href` attribute of every element found, in a loop. - alecxe, Jun 12 '16 at 17:27
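
For the multiple-links case raised in the comments, here is a sketch of the same "collect every matching `href`" idea. Note that it swaps in the standard library's `html.parser.HTMLParser` instead of BeautifulSoup's `find_all()`, so that it runs without third-party packages. The class name `MagnetCollector` and the sample data are made up for illustration.

```python
from html.parser import HTMLParser

class MagnetCollector(HTMLParser):
    """Collect the href of every <a> tag that points at a magnet link."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already
        # unescaped, so &amp; has been turned into & by the parser.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("magnet:?"):
            self.links.append(href)

# Hypothetical fragment with two magnet links and one ordinary link.
rawdata = '''<div>
    <a title="Torrent magnet link" href="magnet:?xt=urn:btih:AAA111&amp;dn=first"></a>
    <a title="Download torrent file" href="https://example.invalid/first/"></a>
    <a title="Torrent magnet link" href="magnet:?xt=urn:btih:BBB222&amp;dn=second"></a>
</div>'''

collector = MagnetCollector()
collector.feed(rawdata)
for link in collector.links:
    print(link)
```

Filtering on the `href` prefix rather than the `title` text is a design choice in this sketch; matching on `title="Torrent magnet link"`, as the answer does, would work equally well for this page.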
