I am building a Dockerised record-playback system to help me record websites, so I can design scrapers against a local version rather than the real thing. This means that I do not swamp a website with automated requests, and it has the added advantage that I do not need to be connected to the web to work.

I have used the Java-based WireMock internally, which records from a queue of site scrapes using Wget. I am using the WireMock API to read various pieces of information from the mappings it records.

However, I have spotted from a mapping response that domain information does not seem to be recorded (except where it is in response headers by accident). See the following response from __admin/mappings:

{

    "result": {
        "ok": true,
        "list": [
            {
                "id": "794d609f-99b9-376d-b6b8-04dab161c023",
                "uuid": "794d609f-99b9-376d-b6b8-04dab161c023",
                "request": {
                    "url": "/robots.txt",
                    "method": "GET"
                },
                "response": {
                    "status": 404,
                    "bodyFileName": "body-robots.txt-j9qqJ.txt",
                    "headers": {
                        "Server": "nginx/1.0.15",
                        "Date": "Wed, 04 Jan 2017 21:04:40 GMT",
                        "Content-Type": "text/html",
                        "Connection": "keep-alive"
                    }
                }
            },
            {
                "id": "e246fac2-f9ad-3799-b7b7-066941408b8b",
                "uuid": "e246fac2-f9ad-3799-b7b7-066941408b8b",
                "request": {
                    "url": "/about/careers/",
                    "method": "GET"
                },
                "response": {
                    "status": 200,
                    "bodyFileName": "body-about-careers-GhVqy.txt",
                    "headers": {
                        "Server": "nginx/1.0.15",
                        "Date": "Wed, 04 Jan 2017 21:04:35 GMT",
                        "Content-Type": "text/html",
                        "Last-Modified": "Wed, 04 Jan 2017 12:52:12 GMT",
                        "Connection": "keep-alive",
                        "X-CACHE-URI": "/about/careers/",
                        "Accept-Ranges": "bytes"
                    }
                }
            },
            {
                "id": "def378f5-a93c-333e-9663-edcd30c936d7",
                "uuid": "def378f5-a93c-333e-9663-edcd30c936d7",
                "request": {
                    "url": "/about/careers/feed/",
                    "method": "GET"
                },
                "response": {
                    "status": 200,
                    "bodyFileName": "body-careers-feed-Fd2fO.xml",
                    "headers": {
                        "Server": "nginx/1.0.15",
                        "Date": "Wed, 04 Jan 2017 21:04:45 GMT",
                        "Content-Type": "application/rss+xml; charset=UTF-8",
                        "Transfer-Encoding": "chunked",
                        "Connection": "keep-alive",
                        "X-Powered-By": "PHP/5.3.3",
                        "Vary": "Cookie",
                        "X-Pingback": "http://www.example.com/xmlrpc.php",
                        "Last-Modified": "Thu, 06 Jun 2013 14:01:52 GMT",
                        "ETag": "\"765fc03186b121a764133349f8b716df\"",
                        "X-Robots-Tag": "noindex, follow",
                        "Link": "<http://www.example.com/?p=2680>; rel=shortlink",
                        "X-CACHE-URI": "null cache"
                    }
                }
            },
            {
                "id": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
                "uuid": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
                "request": {
                    "method": "ANY"
                },
                "response": {
                    "status": 200,
                    "proxyBaseUrl": "http://www.example.com"
                },
                "priority": 10
            }
        ]
    }

}

The only clear recording of a URL is in the final entry against proxyBaseUrl, and given that I had to specify a URL in the console call I am now worried that if I record against a different domain, the domain that each one is from will be lost.

That would mean that in playback mode, WireMock would only be able to play back from one domain, and I'd have to restart it and point it to another cache in order to play back different sites. This is not workable for my use case, so is there a way around this problem?

(I have done a little work with Mountebank, and would be willing to switch to it, though I find WireMock generally easier to use. My limited understanding of Mountebank is that it suffers from the same single-domain problem, though I am happy to be corrected on that. I'd be happy to swap to any robust open-source API-driven recorder HTTP proxy, if dropping WireMock is the only way forward).

halfer

1 Answer

It's possible to serve WireMock stubs for multiple domains by adding a Host header criterion to the request pattern of each stub mapping. Assuming your DNS/hosts file maps all the relevant domains to your WireMock server's IP, this will cause it to behave like virtual hosting on an ordinary web server.

The main issue is that the recorder won't add the Host header to your mappings, so you'd need to do this yourself afterwards, or hack the recorder to do it on the fly.
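As a rough illustration of the "do this yourself afterwards" route, here is a minimal Python sketch that adds a Host header matcher to each recorded mapping. The function name `add_host_matcher` is hypothetical, the sample data is cut down from the mappings in the question, and the sketch assumes the `{"equalTo": ...}` header matcher format from WireMock's request-matching documentation. Writing the modified mappings back to WireMock is not shown.

```python
import copy
import json


def add_host_matcher(mappings, host):
    """Return a copy of `mappings` in which every recorded stub also requires `host`."""
    updated = []
    for mapping in mappings:
        mapping = copy.deepcopy(mapping)
        # Leave the catch-all proxy mapping alone; it is not a recorded stub.
        if "proxyBaseUrl" not in mapping.get("response", {}):
            headers = mapping.setdefault("request", {}).setdefault("headers", {})
            headers["Host"] = {"equalTo": host}
        updated.append(mapping)
    return updated


# Cut-down sample of the mappings shown in the question.
recorded = [
    {
        "request": {"url": "/robots.txt", "method": "GET"},
        "response": {"status": 404, "bodyFileName": "body-robots.txt-j9qqJ.txt"},
    },
    {
        "request": {"method": "ANY"},
        "response": {"status": 200, "proxyBaseUrl": "http://www.example.com"},
        "priority": 10,
    },
]

tagged = add_host_matcher(recorded, "www.example.com")
print(json.dumps(tagged[0]["request"], indent=2))
```

Running this once per recorded domain, with the appropriate host each time, should leave stubs for different sites distinguishable even when their paths collide (for example, two sites that both serve `/robots.txt`).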

I've been considering adding better support for this, so watch this space.

I'd also suggest checking out Hoverfly, which seems to solve this problem pretty well already.

Tom
  • Ah brilliant, thanks Tom - I will try this. Since [recording is per-domain anyway](http://stackoverflow.com/q/41049289/472495), I imagine I can do something to modify the new requests at the end of the record phase. – halfer Jan 06 '17 at 13:44
  • Hoverfly does look interesting, but does not offer a delete-by-id feature in [its HTTP API](http://hoverfly.io/reference/#api), which I think I would need. However, you're right that it seems to record and serve from multiple domains already, so perhaps that would be a good backup plan. – halfer Jan 06 '17 at 17:37
  • Aha, I was stuck previously on how to add filters into the request system, believing them to be simple key-value pairs. I've just spotted how to do it [from here](http://wiremock.org/docs/request-matching/), all nicely documented! – halfer Jan 13 '17 at 22:16
  • I have just used PHP to set up [curl with a proxy](http://stackoverflow.com/a/9247672/472495), and this fetches from WireMock very well indeed! Thanks again. I didn't even need to reset the DNS/hosts file to point to the proxy - specifying a proxy address seemed to be good enough. – halfer Jan 13 '17 at 23:30
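
For readers not using PHP, the proxy approach described in the last comment might look roughly like this in Python. It is only a sketch: the helper name `build_proxy_opener` is hypothetical, and the address `http://localhost:8080` is an assumption about where the WireMock instance is listening.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends plain-HTTP requests through `proxy_url`."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy)


# Assumed address of the local WireMock instance; adjust to your setup.
opener = build_proxy_opener("http://localhost:8080")

# Example use (needs WireMock running, so it is left commented out):
# with opener.open("http://www.example.com/about/careers/") as response:
#     print(response.status, response.read()[:80])
```

Because the client names the proxy explicitly, the request still carries the original site's Host header, which is consistent with the comment's observation that no DNS/hosts file change was needed.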