Different links extraction from a text file?

Question

My problem is that I have an unstructured .txt file, like the one below, that contains many links. The links all differ because each one carries its own signature:

Sample of the text file

What I want is to extract all the links that begin with http:// web.alphorm.com

I used the regex shown below:

matchObj = re.findall(r'(http:// web.alphorm.com/.*&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA)"}',
                      string)

But it doesn't really give me what I want. It shrinks the text file and gives me the links I'm searching for, but along with other undesirable links and text!

What is wrong with it?

Do you really have a space between `http://` and `web.alphorm.com`? — Casimir et Hippolyte, Jun 18 '17 at 01:28
Please [edit] your question and put some actual sample data from the text file in it. See [**_Discourage screenshots of code and/or errors_**](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). — martineau, Jun 18 '17 at 01:44

score 2 · Accepted Answer · answered Jun 18 '17 at 03:12

The .* in your regex is greedy, meaning the regex engine will match the http://web.alphorm.com/ of the first link and the &Key-Pair-Id=APKAJF2PMCJPGKXG2GEA of the last link, along with everything in between.

Try this:

matchObj = re.findall(r'(http://web.alphorm.com/.*?&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA)"}', string)

The addition of the ? will make the matching lazy, matching as little as possible.

Note: I also removed the space between http:// and web.alphorm.com, as I presume that's a typo.
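
To see the difference between the greedy and lazy versions, here is a minimal sketch. The sample text is made up (the real file was only posted as a screenshot), so the paths, the Signature values and the other.example.com link are assumptions for illustration only. I have also escaped the dots in the host name so they match literal dots, which the patterns above don't do.

```python
import re

# Hypothetical sample data; the actual file contents were not posted as text.
KEY = "&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA"
text = (
    '{"file":"http://web.alphorm.com/video/a.mp4?Signature=abc' + KEY + '"},'
    ' some unrelated text {"file":"http://other.example.com/x.mp4"},'
    '{"file":"http://web.alphorm.com/video/b.mp4?Signature=def' + KEY + '"}'
)

# Greedy: .* runs to the end of the text and backtracks only as far as the
# LAST occurrence of the key, so everything in between is swallowed.
greedy = re.findall(
    r'(http://web\.alphorm\.com/.*&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA)"}', text
)

# Lazy: .*? stops at the FIRST occurrence of the key after each link start.
lazy = re.findall(
    r'(http://web\.alphorm\.com/.*?&Key-Pair-Id=APKAJF2PMCJPGKXG2GEA)"}', text
)

print(len(greedy))  # one long match spanning both links and the text between
for link in lazy:   # two separate links
    print(link)
```

With this sample, the greedy pattern returns a single string that also contains the unrelated link, while the lazy pattern returns the two web.alphorm.com links on their own.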
