2

I'm mining some data from Yahoo RSS, but it seems to be causing a memory leak? It's quite bizarre. I data-mine multiple sources with the same code, but the Yahoo RSS feed is the only one that overloads the memory. This is a dumbed-down version of the code, but basically, if you run this in multiple instances, it'll eventually crash the server because it'll run out of memory:

while(1) {
   $get_rss = file_get_contents("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US");
}
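
To check whether the growth is inside the PHP process itself, here is a minimal diagnostic sketch (the logging is my addition for illustration, not part of my real script):

while(1) {
   $get_rss = file_get_contents("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US");
   // Real allocated memory and its peak; if these stay flat while the
   // server's free memory still shrinks, the growth is outside the PHP heap.
   echo memory_get_usage(true) . " / " . memory_get_peak_usage(true) . "\n";
}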

However, if you run the same code with a different source, it runs absolutely fine and stays stable, such as:

while(1) {
   $get_rss = file_get_contents("http://www.marketwatch.com/news/headline/getheadlines?ticker=AAPL&countryCode=US&dateTime=&docId=&docType=2007&sequence=bb90b87a-9f6f-4d70-9a1d-b052088523f5&messageNumber=0&count=10&channelName=%2Fnews%2Fpressrelease%2Fcompany%2Fus%2Faapl&topic=&_=1460832767208");
}

Can anyone explain this behavior to me? I find it quite bizarre. I usually use a cURL method for pulling the URL contents, but I switched to file_get_contents to see if it acted the same, and it does. I've tried SimpleXML; it shows the same behavior as well. I don't understand it.
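
For reference, the cURL method I normally use looks roughly like this (simplified sketch):

$ch = curl_init("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US");
// Return the response body as a string instead of echoing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$get_rss = curl_exec($ch);
curl_close($ch);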

Do RSS files get cached in memory? I don't see how or why that would happen. Any help or knowledge about this issue would be appreciated.

Sven Kahn
  • what is the error that you get? I.e., when you write "it'll crash the server eventually because it'll run out of memory", what crash is it? Is it really the server that crashes, or is it the PHP script? – carmel Jun 13 '16 at 12:01

3 Answers

3

As @NoPro suggested, your way of fetching the RSS feed is quite non-standard and may be treated as an attack, especially if you run it in a loop. Second, the file_get_contents manual states that it may fail:

An E_WARNING level error is generated if filename cannot be found, maxlength is less than zero, or if seeking to the specified offset in the stream fails.

Since you are facing the problem only with one particular Yahoo feed, I wonder if you are being rate-limited or throttled by Yahoo's servers. If they delay the packet chunks, file_get_contents waits a little longer while the partially received response keeps residing in memory.
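
If throttling is the cause, you could at least bound how long each call may block by passing a stream context with a timeout; a sketch (an assumption on my part, not tested against Yahoo's servers):

// Give up on the request if the server stalls for more than 5 seconds.
$context = stream_context_create(array(
   'http' => array('timeout' => 5)   // seconds; also applies to https:// URLs
));
$get_rss = file_get_contents("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US", false, $context);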

A Bash script can be less resource-consuming than running it in the browser. Unfortunately, when I ran this on my local machine, I didn't see the memory issue. Why not run it on another machine or a server with a different IP pool?
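
For completeness, a sketch of the shell-out approach (the exact command is my assumption; it requires wget on the machine):

$url = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US";
// wget does the transfer in its own short-lived process;
// shell_exec() returns its complete output as one string.
$get_rss = shell_exec("wget -qO- " . escapeshellarg($url));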

Ashish Nayyar
  • I tried SimpleXML as well, same result. The only thing that doesn't cause a memory leak when pulling the RSS feed, it seems, is doing something like $get_feed = exec("sh get.sh $SYMBOL"); with the bash script using wget. Also, they don't rate-limit the RSS feeds. But delayed packets make some sense. I will try it on a different server, yes – Sven Kahn Jun 10 '16 at 15:22
  • Did it work for you? I think you won't see the issue on a different server. – Ashish Nayyar Jun 13 '16 at 08:40
2

Yahoo's RSS is 34 KB, while your sample from the other source is 12 KB. However, that's too small to cause memory errors; the problem lies somewhere else. And no, file_get_contents only fetches the specified file as a sequence of bytes; it doesn't care what the file is, and it doesn't cache anything.
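
You can verify both points yourself with a quick sketch (the sizes will of course vary over time):

$yahoo = file_get_contents("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US");
// Every call returns a fresh, independent string; nothing is cached between calls.
echo strlen($yahoo) . " bytes\n";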

Jehy
  • It's really quite weird. I don't get this problem with any other kind of sources. I don't see why it would do this. I run 1000's of instances of this kind of thing with my other sources and resources are stable. It's running on a server with 128GB of ram as well, so specs are not an issue. – Sven Kahn Jun 06 '16 at 12:54
  • @SvenKahn please post your full code here; there is no problem in this piece of code. – Jehy Jun 06 '16 at 12:57
  • it's not an issue with the code. I've narrowed it down: for some reason, pulling an RSS feed causes a memory leak. Even if you run the first code above in 300 instances, it'll eventually crash the server; if you run the second bit of code, it won't. But I added the full code. I've removed all the detailed parsing stuff and narrowed it down to the point where it pulls the RSS feed contents. It's only using this stupid parsing stuff because I switched away from the SimpleXML load, because I thought that was the issue. – Sven Kahn Jun 06 '16 at 19:15
  • `f_g_c` just slurps in some bytes. it couldn't care less what those bytes are. that url could be spitting out html, text, 2000-year-long .gif loops of kittens, etc.. it's all just bytes. – Marc B Jun 06 '16 at 19:23
  • I figured it didn't care, but I have no idea why it's doing this. Run the above code in 200-300 instances from ssh, and watch htop, you'll see memory get eaten up nonstop – Sven Kahn Jun 06 '16 at 19:25
  • Your array can be eating memory: `$rss_md5[$symbol] = md5($rss_lastBuild);` - for each $symbol you store a large md5 hash in memory. – Jehy Jun 06 '16 at 19:28
  • I use the same method with a different data source and it runs great like that. I'll look at that a bit more but not sure that's it. – Sven Kahn Jun 06 '16 at 19:32
  • @Jehy nope, I removed that again and let it run, still acting the same. I appreciate you trying to help, not sure why this is getting minimal views. – Sven Kahn Jun 06 '16 at 20:07
  • If you run while(1) { $get_rss = file_get_contents("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US"); } in a couple 100 instances/threads you'll see over time it just eats up the memory, I really can't explain it – Sven Kahn Jun 06 '16 at 20:29
  • @SvenKahn I tried your loop and... Suddenly, you were right! More memory got consumed during script execution. And I even found a good related answer here: http://stackoverflow.com/a/3136639/1727132 – Jehy Jun 06 '16 at 20:42
  • I would advise running your script with fewer iterations to avoid too much memory consumption. You can specify a limit in the SQL query and launch the PHP scripts in an external bash loop; those will consume less memory and free it when they exit. – Jehy Jun 06 '16 at 20:47
  • You can also use less memory if you bind variables in the SQL query instead of building one big string. That can make a really huge difference. – Jehy Jun 06 '16 at 20:50
  • That link is interesting, I just don't totally understand why it only does this with this data source. I mine data from a bunch of other places with this same kind of setup with no problems at all. I tested using "$get_rss = exec("wget -qO- 'https://feeds.finance.yahoo.com/rss/2.0/headline?s=".$symbol."&region=US&lang=en-US'");" and it doesn't consume the memory; it gets to about 10.5G and stays there, which is totally acceptable for me, except it eats CPU. I will take into consideration what you're saying and see if putting that together will fix my problem; if so, I'll accept the answer – Sven Kahn Jun 06 '16 at 20:52
  • Your bash script idea gives me a good idea: instead of running the PHP in a while loop, run a bash script in a while loop calling the PHP file once, maybe, if that makes sense. But then it will actually cause more MySQL load, because it won't have the whole $rss_md5[$symbol] to prevent it from trying to add it to the database. Ugh. – Sven Kahn Jun 06 '16 at 20:54
  • Just don't call the PHP script too often; a database connection is a very "expensive" operation. – Jehy Jun 06 '16 at 20:56
  • It needs to be near real-time, unfortunately, but perhaps a bash script is the route; I will have to investigate – Sven Kahn Jun 06 '16 at 21:03
  • there are memory leaks; variables should be unset after use (see the sketch below) – PauAI Nov 15 '16 at 23:13
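
Putting the suggestions from these comments together, a sketch (my combination of them, not verified to remove the leak): bound the iterations, drop the buffer explicitly, and let the process exit so the OS reclaims everything:

$url = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US";
for ($i = 0; $i < 100; $i++) {   // bounded run instead of while(1)
   $get_rss = file_get_contents($url);
   // ... parse / store ...
   unset($get_rss);       // drop the reference immediately
   gc_collect_cycles();   // collect any leftover reference cycles
}
// Restart from an outer bash loop, e.g.:  while true; do php fetch.php; done
// ("fetch.php" is a placeholder name for the script above.)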
1

Your loop looks a bit caveman-style. If you keep polling an https URL without any delay from one instance, it may be considered bad behaviour; but if you are continuously polling the same resource from hundreds of instances simultaneously, it is very annoying and may even be considered a DDoS attack.

But what may really be the reason for your high memory consumption is the SSL handshake.
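
If that is the case, reusing a single cURL handle should help, since cURL can keep the connection (and TLS session) alive between requests instead of redoing the handshake each time; a sketch (my suggestion, not tested against this feed):

$ch = curl_init("https://feeds.finance.yahoo.com/rss/2.0/headline?s=AAPL&region=US&lang=en-US");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
while (1) {
   // The handle, and with it the open connection, is reused on every iteration.
   $get_rss = curl_exec($ch);
   sleep(1);   // be polite: don't hammer the feed without any delay
}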

NoPro
  • I tried SimpleXML as well, same result. The only thing that doesn't cause a memory leak when pulling the RSS feed, it seems, is doing something like $get_feed = exec("sh get.sh $SYMBOL"); with the bash script using wget. Also, I tried both their https/http with the same result. – Sven Kahn Jun 10 '16 at 15:24