
I'm trying to scrape a web page that contains a proxy list. I have managed to scrape the proxies and ports, but I'm stuck on replacing the table cell boundary between each proxy and its port with ":". Here is my regex:

(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?:\s+|\s*<\/td><td>\s*)(\d{2,5})

And here is the scraped page's HTML:

<tr><td>35.199.100.7</td><td>8080</td><td>US</td><td class='hm'>United States</td><td>elite proxy</td><td class='hm'>no</td><td class='hx'>yes</td><td class='hm'>1 second ago</td></tr><tr><td>163.172.181.29</td><td>80</td><td>FR</td><td class='hm'>France</td><td>elite proxy</td><td class='hm'>no</td><td class='hx'>no</td><td class='hm'>1 second ago</td></tr><tr><td>178.213.144.238</td><td>41258</td><td>RU</td><td class='hm'>Russian Federation</td><td>elite proxy</td><td class='hm'>no</td><td class='hx'>yes</td><td class='hm'>1 second ago</td></tr><tr><td>142.93.79.212</td><td>3128</td><td>CA</td><td class='hm'>Canada</td><td>anonymous</td><td class='hm'>no</td><td class='hx'>no</td><td class='hm'>1 second ago</td></tr><tr>

Here is my test page: https://www.phpliveregex.com/p/oPW. Can somebody help me? Thank you.

Pat
    This may help you: (?<=\d)<\/td><td>(?=\d). However, it can fail: it will match any </td><td> as long as it is surrounded by digits. The reason is that you should not parse HTML with regex, ever. See https://stackoverflow.com/questions/1732348 – Jorge.V Aug 07 '18 at 05:20

1 Answer


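Try this: \d+(?:\.\d+){3}\K<\/td><td>(?=\d+) and replace the match with ":". Note that \K (which drops the already-matched IP from the match) is a PCRE feature, so this works in PHP but not in every regex engine.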
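For anyone not on PCRE, here is a minimal sketch of the same replacement in Python. Python's standard re module has no \K, so this version captures the IP in a group and writes it back followed by ":". That is a swapped-in technique, not the exact pattern above. The sample string is the first two rows from the question, trimmed to three cells each, and the variable names are only illustrative.

```python
import re

# Sample rows from the question, trimmed to the first three cells.
html = (
    "<tr><td>35.199.100.7</td><td>8080</td><td>US</td></tr>"
    "<tr><td>163.172.181.29</td><td>80</td><td>FR</td></tr>"
)

# No \K in Python's re: capture the IP and put it back in the replacement.
# The lookahead makes sure the next cell starts with a digit (the port).
pattern = re.compile(r"(\d+(?:\.\d+){3})</td><td>(?=\d+)")
joined = pattern.sub(r"\1:", html)

# Once the cell boundary is replaced, ip:port pairs are easy to pull out.
proxies = re.findall(r"\d+(?:\.\d+){3}:\d{2,5}", joined)
print(proxies)
```

This only touches a cell boundary that sits between a dotted quad and a digit, so the boundaries between the other cells are left alone.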


Your data seems to be a subset of the web page, or a "pre-filtered" web page. In that case, I don't think it is wrong to use a regular expression, as the input data is simple.

However, the question is: how did you get to that data? Probably with several other regexes. That's where things can go wrong, as Jorge said in the comments.

Unless this is a throw-away script, you really should consider rewriting the whole thing using an HTML parser.
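As a rough idea of what the parser route could look like, here is a sketch using Python's standard-library html.parser. It assumes the layout shown in the question, where the first two cells of each row are the IP and the port. ProxyTableParser and extract_proxies are hypothetical names made up for this example, not part of any library.

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collects the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell texts per <tr>
        self._row = None     # the row currently being filled
        self._in_td = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self.rows.append(self._row)
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data


def extract_proxies(html):
    """Return "ip:port" strings, assuming cells 0 and 1 are IP and port."""
    parser = ProxyTableParser()
    parser.feed(html)
    parser.close()
    # Rows with fewer than two cells (e.g. a dangling <tr>) are skipped.
    return [f"{r[0].strip()}:{r[1].strip()}" for r in parser.rows if len(r) >= 2]


sample = (
    "<tr><td>35.199.100.7</td><td>8080</td><td>US</td>"
    "<td class='hm'>United States</td></tr>"
    "<tr><td>142.93.79.212</td><td>3128</td><td>CA</td>"
    "<td class='hm'>Canada</td></tr><tr>"
)
result = extract_proxies(sample)
print(result)
```

Because the cells are read by position rather than by matching the markup text, extra attributes or whitespace in the tags should not break it the way they can break a regex.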

Julio