
I'm trying to extract a link from some JavaScript. The JavaScript part isn't relevant anymore because I transformed it into a string (text).

Here is the script part:

<script>                
                     
     setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
                
    
                $(function() {
                    $("#whats_new_panels").bxSlider({
                        controls: false,
                        auto: true,
                        pause: 15000
                    });
                });
                setTimeout(function(){
                    $("#download_messaging").hide();
                    $("#next_button").show();
                }, 10000);
            </script>

Here is what I do:

import re

def get_link_from_text(text):
   text = text.replace('\n', '')
   text = text.replace('\t', '')
   text = re.sub(' +', ' ', text)

   search_for = re.compile("href[ ]*=[ ]*'[^;]*")
   debug = re.search(search_for, text)

   return debug

What I want is the href link, and I kind of get it, but for some reason the result only looks like this:

<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>

and not like I want it to be

<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">

So my question is how to get the full link and not only a part of it.

Might the problem be that re.search isn't returning longer strings? I tried altering the regex, and I even tried matching the link piece by piece, but it still returns only the part shown above.

Kuchen
  • You just needed to access the value, `re.search(search_for, text).group()`. See [a related thread](https://stackoverflow.com/questions/48675282/pythons-match-line-in-sre-sre-match-output-can-it-show-the-full-match). – Wiktor Stribiżew Aug 02 '18 at 07:21
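A minimal sketch of the suggestion in the comment above, assuming the script has already been reduced to a single string (the `text` value here is just the relevant line from the question). Printing the match object only shows a shortened preview of the matched text, while `.group()` returns the whole match:

```python
import re

# Assumed input: the relevant line of the script from the question, as a string.
text = "setTimeout(\"location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';\", 2000);"

pattern = re.compile(r"href[ ]*=[ ]*'[^;]*")
match = pattern.search(text)

# repr(match) contains only a shortened preview of the matched text.
preview = repr(match)

# .group() returns the complete matched string (guard against no match).
full = match.group() if match else None
print(full)
```

The regex itself is unchanged from the question; only the way the result is read differs.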

1 Answer


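Your regex already matches the whole link; the match object's printed representation only shows a shortened preview of the match. I've modified your code slightly to use findall, which returns the matched strings themselves, and for me it now returns the complete string you want.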

import re

text = """
<script>                

setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);


    $(function() {
        $("#whats_new_panels").bxSlider({
            controls: false,
            auto: true,
            pause: 15000
        });
    });

    setTimeout(function(){
        $("#download_messaging").hide();
         $("#next_button").show();
    }, 10000);
</script>
"""

def get_link_from_text(text):
   text = text.replace('\n', '')
   text = text.replace('\t', '')
   text = re.sub(' +', ' ', text)

   search_for = re.compile("href[ ]*=[ ]*'[^;]*")
   debug = search_for.findall(text)

   print(debug)

get_link_from_text(text)

Output:

["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]
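If only the URL is wanted, without the leading `href = '` and the trailing quote, one possible variation is to put a capturing group around the part between the quotes. This is a sketch, not part of the code above: the helper name `extract_link` and the adjusted pattern are my own assumptions. Using `\s*` also means the whitespace does not need to be normalised first.

```python
import re

# Hypothetical variation of the pattern above: capture only what sits
# between the single quotes that follow "href =".
HREF_PATTERN = re.compile(r"href\s*=\s*'([^']*)'")


def extract_link(text):
    """Return the first single-quoted value assigned to href, or None."""
    match = HREF_PATTERN.search(text)
    # group(1) is the captured URL; group(0) would be the whole match.
    return match.group(1) if match else None


sample = "setTimeout(\"location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';\", 2000);"
print(extract_link(sample))
```

Returning `None` when nothing matches lets the caller decide how to handle a page without such a redirect.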
ttreis