
I have an unstructured .txt file, like the sample below, that contains many links. Each link is different because it carries its own signature.

Sample of the text file

I want to extract all the links that begin with http:// web.alphorm.com

I used the regex shown below:

matchObj = re.findall(r'(http:// web.alphorm.com/.*&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA)"}',
                      string)

But it doesn't really give me what I want. It shrinks the text and returns the links I am looking for, but along with other unwanted links and text!

What is wrong with it?

martineau
A.oussama

1 Answer


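The .* in your regex is greedy: the engine matches the http://web.alphorm.com/ at the start of the first link, then consumes as much as it can, ending at the &Key-Pair-Id=APKAJF2PMCJPGKXG2GEA of the last link on that line (by default, . does not match a newline). Everything in between, including the other links and text, ends up in a single match.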

Try this:

matchObj = re.findall(r'(http://web.alphorm.com/.*?&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA)"}', string)

Adding the ? makes the .* lazy, so it matches as little as possible and stops at the first &Key-Pair-Id=APKAJF2PMCJPGKXG2GEA"} that follows each link.

Note: I also removed the space between http:// and web.alphorm.com, as I presume that's a typo.
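
To see the difference between the greedy and lazy versions side by side, here is a minimal sketch. The sample text is made up for illustration (the real file's contents are not shown in the question), and the other.example.com link and the Signature values are placeholders. The dots in the host name are also escaped here so they only match literal dots, which is a small tightening and not part of the fix itself.

```python
import re

# The literal suffix every wanted link ends with (taken from the question).
KEY = "&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA"

# Hypothetical sample data standing in for the real text file.
text = (
    '{"file":"http://web.alphorm.com/a.mp4?Signature=abc' + KEY + '"} junk '
    '{"file":"http://other.example.com/x.mp4"} more junk '
    '{"file":"http://web.alphorm.com/b.mp4?Signature=def' + KEY + '"}'
)

prefix = r'(http://web\.alphorm\.com/'
suffix = re.escape(KEY) + r')"}'

# Greedy: .* runs to the LAST occurrence of the suffix on the line.
greedy = re.findall(prefix + r'.*' + suffix, text)

# Lazy: .*? stops at the FIRST occurrence of the suffix after each link.
lazy = re.findall(prefix + r'.*?' + suffix, text)

print(len(greedy))  # 1 (one long match that swallows the unwanted link)
print(len(lazy))    # 2 (one match per wanted link)
for link in lazy:
    print(link)
```

With this sample, the greedy pattern returns a single string that includes the other.example.com link and the junk text, while the lazy pattern returns the two web.alphorm.com links separately.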

jschnurr