
I have a text file of links obtained by scraping, and I need a regular expression to extract them from the file. The links share the same structure but differ in length, for example:

https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html

and this:

https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html

I can write a regex for the base domain, but the title part after it is giving me a tough time. Mine is:

[h][t][t][p][s]:\/\/[w][w][w].[c][n][b][c].[c][o][m]\/[2][0][1][5-8] 

for https://www.cnbc.com/2016/10/11/, but I don't know how to extend it to cover the different words that follow in each link.

jackson
  • I have tried something of my own [here](https://regex101.com/r/8iIuYL/2). You can also refer to [this](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url). – nice_dev May 27 '18 at 07:34
  • You should really read up on the basics of regular expression syntax. Most of those square brackets are totally unnecessary, but then you've left unescaped `.`s that match any character. – jonrsharpe May 27 '18 at 08:04

2 Answers


You are overcomplicating things,

https?://\S+?cnbc\.com\S+

will probably do, see https://regex101.com/r/ci3O1I/1/ for a demo.
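As a minimal sketch of how this pattern could be applied in Python (the language the asker appears to be using, judging by the comments further down), `re.findall` returns every match as a plain string. The `sample_text` value is a hypothetical stand-in for the contents of the scraped file.

```python
import re

# Pattern from the answer above: http or https, any prefix such as "www.",
# then cnbc.com followed by the rest of the URL up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+?cnbc\.com\S+")

# Hypothetical sample standing in for the scraped file's contents.
sample_text = (
    "https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html\n"
    "https://www.cnbc.com/2016/10/12/billionaire-richard-branson-"
    "learned-a-key-business-lesson-playing-tennis.html\n"
    "https://example.com/not-a-match.html\n"
)

# findall returns the matched text itself, so links of any length come back whole.
links = URL_PATTERN.findall(sample_text)

for link in links:
    print(link)
```

Because `\S+` stops only at whitespace, the length of the title part does not matter.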

Jan
  • https://www.cnbc.com/ali-montag/?page=31 is also matched in the links file, but it is not wanted. – jackson May 27 '18 at 07:54

You can simplify your regex to something like this:

preg_match("/https?:\/\/www\.cnbc\.com\/201[5-8].*/", $string, $match);

This matches addresses that start with http or https,
and then any link whose path begins with a year from 2015 to 2018.

See here how it works:
https://www.phpliveregex.com/p/o7p
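A roughly equivalent sketch in Python, in case that is the language actually in use. It is an adaptation rather than a direct port: it writes the scheme as `https?` and uses `\S*` in place of `.*` so a match stops at whitespace. The `candidates` list is hypothetical sample data. Requiring the year right after the domain also leaves out non-article pages such as `/ali-montag/?page=31`.

```python
import re

# Scheme, fixed host, then a path that must begin with a year from 2015 to 2018.
YEAR_LINK = re.compile(r"https?://www\.cnbc\.com/201[5-8]/\S*")

# Hypothetical sample links: one dated article and one author page.
candidates = [
    "https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html",
    "https://www.cnbc.com/ali-montag/?page=31",
]

# fullmatch requires the whole string to fit the pattern.
article_links = [url for url in candidates if YEAR_LINK.fullmatch(url)]

print(article_links)
```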

Andreas
  • `<_sre.SRE_Match object; span=(1596553, 1596664), match='https://www.cnbc.com/2018/03/02/heres-the-most-al>` is what I get after printing. How do I increase the match length? It does not print the full link. – jackson May 27 '18 at 08:03
  • What is that? I don't understand what that is. – Andreas May 27 '18 at 08:04
  • That is how the links are printed in Python after matching them from a file. – jackson May 27 '18 at 08:06
  • <_sre.SRE_Match object; span=(1596553, 1596664), match='https://www.cnbc.com/2018/03/02/heres-the-most-al> <_sre.SRE_Match object; span=(1596665, 1596747), match='https://www.cnbc.com/2018/03/02/how-gary-cohn-got> <_sre.SRE_Match object; span=(1596748, 1596840), match='https://www.cnbc.com/2018/03/02/how-richard-brans> <_sre.SRE_Match object; span=(1596841, 1596917), match='https://www.cnbc.com/2018/03/02/how-the-nra-might> – jackson May 27 '18 at 08:07
  • But the match length is not enough to hold the whole URL. If you know about this, it would be convenient for me. – jackson May 27 '18 at 08:08
  • It doesn't help if you paste 50 of those strings here. I do not understand what it is! Your question stated two simple links; now you add "stuff" and talk about length. I don't get it. – Andreas May 27 '18 at 08:11
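
On the shortened output in the comments above: printing a Python match object shows only an abbreviated representation of the matched text, while the match itself is complete (the first `span` quoted above covers 111 characters). A minimal sketch, assuming the year-restricted pattern and a hypothetical one-line `text`, that reads the full link with `group(0)`:

```python
import re

# Assumed pattern: cnbc.com links whose path starts with a year from 2015 to 2018.
YEAR_LINK = re.compile(r"https?://www\.cnbc\.com/201[5-8]/\S*")

# Hypothetical file contents with one long link.
text = (
    "https://www.cnbc.com/2016/10/12/billionaire-richard-branson-"
    "learned-a-key-business-lesson-playing-tennis.html\n"
)

for match in YEAR_LINK.finditer(text):
    print(match)           # abbreviated representation of the match object
    print(match.group(0))  # the complete matched link

# Collect the complete links as plain strings.
full_links = [m.group(0) for m in YEAR_LINK.finditer(text)]
```

`YEAR_LINK.findall(text)` would give the same list of strings directly.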