1

Real-world problem: I'm generating a page dynamically. The page is an XML document that the user retrieves (with curl, file_get_contents, or whatever can be done with server-side scripting). Once the user makes the request, he starts waiting and I start retrieving a large set of data from the db and building an XML document from it (using the PHP DOM objects). Once I'm done I fire the "print $document->saveXML()". It takes about 8 minutes to create this 40-megabyte document, and only when it is ready do I serve the page/document. Now I have a user who has a 60-second connection timeout: he said I need to send at least an octet every 60 seconds. How can I achieve such a thing?

Since it's useless to post 23987452 lines of code that nobody is going to read, I'll explain the script that serves this page as very rough pseudo-code:

  • grab all the data from the db: an enormous set of rows
  • create a DOMDocument object
  • loop through each row and add a node element to the DOMDocument to contain a piece of data
  • call $dom->saveXML() to get the document as a string
  • print the string so the user retrieves an XML document

1) I can't send real data early, since it is an XML document and it has to begin with "<?xml..." so as not to mess up the parser.

2) The user can't change their firewall/server configuration

3) "Buy a more powerful server" is not an option for me

4) I tried using ob_start() at the top of the script and then, at the beginning of each loop iteration, "header("Transfer-Encoding: chunked"); ob_flush();", but nothing: no output arrives before the 8 minutes are up.

Help me guys!!

Damiano Barbati
  • You are going to have to generate the XML as you go, outputting it as you go. Either that, or output it to a file in the background, and serve up the file when ready. – Brad Sep 29 '11 at 14:47
  • I already serve the file once it is ready. I need to serve it in chunks or something like that. But I can't output it as I go, because the DOMDocument class in PHP is an object, and once you call saveXML() it generates the XML with its closing tags! – Damiano Barbati Sep 29 '11 at 14:51

2 Answers

1

I would

  • Generate a random value

  • Start the XML generating script as a background process (see e.g. here)

  • Make the generating script write the XML into a file with the random value as the name when the script is done

  • Frequently poll for the existence of that file, e.g. using Ajax requests every 10 seconds, until it's there. Then fetch the XML from the file.
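The steps above could be sketched in PHP roughly as follows. This is only an illustration under assumptions: the function names, the `.part` suffix and the `generate.php` script name are hypothetical, and launching the background process this way assumes a Unix-like shell.

```php
<?php
// Hypothetical helpers; names, paths and generate.php are assumptions.

// Step 1: an unguessable job token, also used as the file name.
function new_job_token(): string
{
    return bin2hex(random_bytes(16)); // 32 hex characters
}

function job_path(string $dir, string $token): string
{
    // Accept only hex tokens so a token can never point outside $dir.
    if (!ctype_xdigit($token)) {
        throw new InvalidArgumentException('invalid token');
    }
    return rtrim($dir, '/') . '/' . $token . '.xml';
}

// Step 2: launch the generator without waiting for it (Unix shell assumed).
function start_job(string $token): void
{
    exec('php generate.php ' . escapeshellarg($token) . ' > /dev/null 2>&1 &');
}

// Step 3: called by the generator when it is done. Writing to a temporary
// name and renaming means a poller never sees a half-written file.
function publish_job(string $dir, string $token, string $xml): void
{
    $final = job_path($dir, $token);
    file_put_contents($final . '.part', $xml);
    rename($final . '.part', $final);
}

// Step 4: what each poll checks.
function job_is_ready(string $dir, string $token): bool
{
    return is_file(job_path($dir, $token));
}
```

The rename at the end is a design choice rather than part of the answer: it keeps "the file exists" equivalent to "the document is complete".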

Pekka
  • Wait a moment: it's not me who has the timeout problem. I need to send something to the other side, to the user who is fetching the page from his own computer. – Damiano Barbati Sep 29 '11 at 14:49
  • @hysoka yeah, I understand. This should prevent that kind of timeout – Pekka Sep 29 '11 at 14:50
  • I'm sorry Pekka: thanks for helping me but I still can't get it. It's not a web page: you access it through a URL, but obviously no browser can open a 40 MB page and nobody is going to fetch this file with a browser: they're going to use file_get_contents($url) or, more likely, curl from a terminal. So what should I do? – Damiano Barbati Sep 29 '11 at 14:54
  • @hysoka ah, I understand now. Not sure what can be done in that case... Except maybe send meaningless filler comments every 30 seconds or so while the document is generated in the background. I think most services require a callback URL that *they* will call when they're done, but that would require your users to completely change the way their scripts work – Pekka Sep 29 '11 at 14:59
0

You can send padding and still have it be valid XML. Trivial examples include whitespace in a lot of places, or comments. Once you've sent the XML declaration, you could start a comment, and keep sending padding:

<?xml version="1.0"?>
<!-- this comment to prevent timeouts:
     30
     60
     90 
     ⋮

or whatever, the exact data doesn't matter of course.
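A minimal PHP sketch of this padding idea, assuming the document is built with DOMDocument as in the question. The function names, the `rows`/`row` element names and the `$emit` callback are made up for illustration. The padding text must never contain `--`, since that is not allowed inside an XML comment.

```php
<?php
// Remove the leading XML declaration that DOMDocument::saveXML() emits,
// so that the declaration sent earlier stays the only one.
function strip_xml_declaration(string $xml): string
{
    return preg_replace('/^<\?xml[^>]*\?>\s*/', '', $xml, 1) ?? $xml;
}

// Build the document row by row. $emit is called with the declaration and
// the start of a comment first, then with a padding line whenever at least
// $interval seconds have passed, and finally with the end of the comment.
// Returns the document body without its declaration.
function build_xml_with_keepalive(iterable $rows, callable $emit, int $interval = 30): string
{
    $dom  = new DOMDocument('1.0', 'UTF-8');
    $root = $dom->appendChild($dom->createElement('rows'));

    $emit("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!-- keep-alive padding:\n");
    $last = time();

    foreach ($rows as $row) {
        $node = $dom->createElement('row');
        $node->appendChild($dom->createTextNode((string) $row));
        $root->appendChild($node);

        if (time() - $last >= $interval) {
            $emit(" still working\n");
            $last = time();
        }
    }

    $emit("-->\n");
    return strip_xml_declaration($dom->saveXML());
}

// Possible use in the real script (not run here):
// $emit = function (string $chunk): void {
//     echo $chunk;
//     if (ob_get_level() > 0) { ob_flush(); }
//     flush();
// };
// echo build_xml_with_keepalive($rows, $emit);
```

Whether the padding actually reaches the client in time also depends on any buffering between PHP and the network (output buffers, compression, a proxy), which this sketch does not address.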

That's the easy solution. The better solution is to make that generation run in the background, and e.g., use AJAX to poll the server every 10s to check if it's done. Or to implement an alternate notification method (e.g., email a URL when the document is ready).

If this isn't a browser accessing, you may want a trivially simple API: Have one request to start generating the document, and another to fetch it. The one to fetch it may return "not ready yet" as e.g., an HTTP status code 500, 503, or 504. Then the script requesting should retry later. (For example, with curl, the --retry option will do this).
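The "fetch" half of such an API might look like the PHP sketch below. It is an assumption-laden illustration: the function names are hypothetical, the finished document is assumed to live in a file, and the `Retry-After` header is an extra hint to the client that the answer itself does not mention.

```php
<?php
// Hypothetical helper: decide what the "fetch" request should answer.
// Returns the status code and headers; the body is sent separately.
function fetch_response(string $file, int $retryAfter = 30): array
{
    if (is_file($file)) {
        return [
            'status'  => 200,
            'headers' => [
                'Content-Type: application/xml',
                'Content-Length: ' . filesize($file),
            ],
        ];
    }
    // Not ready yet: tell the client to come back later.
    return ['status' => 503, 'headers' => ['Retry-After: ' . $retryAfter]];
}

// Send the response for real (to be called from the fetch endpoint).
function send_fetch_response(string $file): void
{
    $response = fetch_response($file);
    http_response_code($response['status']);
    foreach ($response['headers'] as $header) {
        header($header);
    }
    if ($response['status'] === 200) {
        readfile($file); // streams the file without loading it into memory
    }
}
```

Keeping the decision in a function that only returns data makes it easy to check without a web server; only `send_fetch_response()` touches the actual HTTP response.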

derobert
  • Padding can't be used: the declaration is printed at the end of the script by the saveXML() function. Before the end there's nothing to print, and nothing must be printed. It's not a browser accessing it: the user is a portal with automated procedures that fetch the URL I send them and get the data, so nothing can be changed on the user's server side. What happens if I keep returning HTTP status code 500? Does the request keep retrying, or do you need to say so explicitly through a parameter like the one you mentioned for curl? – Damiano Barbati Sep 29 '11 at 16:27
  • @hysoka44: You certainly can send padding. `DOMDocument::saveXML` returns a string. So you can strip off the `<?xml … ?>` declaration it generates, and instead send your own (much earlier). You could also try the `LIBXML_NOXMLDECL` flag, but that may not work (documentation is sort of conflicting). As far as curl's default behavior, I don't believe --retry is the default. – derobert Sep 29 '11 at 16:32