
I have an API endpoint that I reverse-engineered. I use it to search for a name, and it returns no more than 100 entities per request. But there are 1.3M+ of these entities that I want to fetch.

Here's a sample entity from the response:

{
 "name":"COMPANY NAME",
 "regNo":"100-H",
 "newRegNo":"191101000018",
 "type":"Company"
}

I can search by either name or regNo; there's no minimum character limit for searching. I thought of searching alphabetically, but since it returns no more than 100 entities at once, I cannot fetch the rest that way. So I tried fetching by regNo instead. regNo can range from 1 up to 1000000.

Here's the script I wrote to fetch all entities by their regNo:

const fs = require("fs");

// fetchData(i) is my wrapper around the search endpoint; it searches
// by regNo and resolves to a { entityList } object.
const test = async () => {
  const data = {};
  try {
    const requests = [];
    // Since the API returns no more than 100 entities at once,
    // step the search query by 100 on every loop.
    for (let i = 100; i < 10000; i += 100) {
      requests.push(fetchData(i));
    }
    const result = await Promise.all(requests);

    result.forEach(res => {
      res.entityList.forEach(entity => {
        // Key on the full regNo so duplicates overwrite each other.
        data[entity.regNo] = entity;
      });
    });

    // You can ignore this part
    fs.writeFile("data.json", JSON.stringify(data), err => {
      if (err) console.log(err);
    });
    console.log(Object.keys(data).length);
  } catch (err) {
    console.log(err);
  }
};

It took about 15 seconds to fetch 9100 entities (about 100 loops). Also, every regNo has a one-letter suffix, like 11000-H.

If I fetch 100, it returns something like this:

entityList: [
    {
      name: "COMPANY NAME",
      regNo: '100-H',
      newRegNo: '191101000018',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000-V',
      newRegNo: '193901000021',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '10000-T',
      newRegNo: '197001000604',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '100000-D',
      newRegNo: '198301004377',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000001-W',
      newRegNo: '201001012078',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000002-K',
      newRegNo: null,
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000003-U',
      newRegNo: '201001012079',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000004-V',
      newRegNo: '201001012080',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000005-D',
      newRegNo: '201001012081',
      type: 'Company'
    },
    {
      name: "COMPANY NAME",
      regNo: '1000006-A',
      newRegNo: '201001012082',
      type: 'Company'
    },
 .......

As you can see, it does not return entities from 0 to 99. I am assuming that the highest regNo is 1000000 plus a suffix letter, so if I can fetch from 100 to 1000000 in a loop, I would get about 1M entities. BUT here's the catch: regNo has a suffix letter. Suppose fetching 100 returns 100-A through 199-A; there are still other entities like 100-B, 100-C, etc.

How can I fetch all 1.3M+ entities efficiently, without losing data?

Dilshod Turobov
  • If this is a one-time operation, then what is wrong with the approach you are using? – Ben Aston Feb 17 '20 at 17:41
  • Is the efficiency related to speculating on queries that may probe on no entries (and thus be worthless but unavoidable effort), or just as Patrick87 refers, the performance of the number of probes run? – Jared Farrish Feb 17 '20 at 17:42
  • Does `fetchData(100)` return all the entities with a `regNo` from 0 to 99? – Ben Aston Feb 17 '20 at 17:45
  • @Ben Thank you for your reply. I updated the question to make it clearer. Pls take a look ) – Dilshod Turobov Feb 18 '20 at 06:17
  • You need to enumerate all the possible groups of `regNo`s. For each of them you need to make an API call and append the result to a data structure (or file or database). You need to consider avoiding triggering denial-of-service protections on the server, so rate-limit the calls to have (say) ten requests in flight at any one time. This is the best I can help with the current information. – Ben Aston Feb 18 '20 at 12:41
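
A minimal sketch of the approach described in the last comment, assuming the `fetchData(start)` helper from the question is in scope and resolves to a `{ entityList }` object. The pool size, step, and regNo upper bound are illustrative guesses, and whether a numeric query covers every letter suffix is something only the API can confirm:

const fs = require("fs");

// Assumptions: fetchData(start) is the question's helper; STEP and
// MAX_REG_NO are guesses based on the question.
const POOL_SIZE = 10;       // requests in flight at any one time
const STEP = 100;           // the API returns at most 100 entities per call
const MAX_REG_NO = 1000000; // assumed upper bound on the numeric part of regNo

const fetchAll = async () => {
  const data = {};
  const starts = [];
  for (let i = STEP; i <= MAX_REG_NO; i += STEP) starts.push(i);

  // Worker pool: each worker pulls the next boundary until none remain,
  // so at most POOL_SIZE requests are ever in flight.
  let next = 0;
  const worker = async () => {
    while (next < starts.length) {
      const res = await fetchData(starts[next++]);
      // Keying on the full regNo (letter suffix included) keeps
      // 100-H and 100-B as distinct entries.
      for (const entity of res.entityList) data[entity.regNo] = entity;
    }
  };

  await Promise.all(Array.from({ length: POOL_SIZE }, worker));
  fs.writeFileSync("data.json", JSON.stringify(data));
  console.log(Object.keys(data).length);
};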

1 Answer


OK, it looks like you're dealing with roughly 250 MB of data, and that requests are completing at about 150 ms each (roughly 100 requests in 15 seconds). Assuming you have 100 Mbps of download bandwidth, the theoretical minimum time to pull 250 MB (2,000 Mb) is about 20 seconds.

Since most of the time involved here is spent waiting for network round trips, you could try massively parallelizing this. Instead of having one thread looping, you can loop and create many threads. Up to some point, I'd expect nearly proportional speedup from doing this, since there is almost no compute time here; it's almost exclusively I/O. Past that point, you might begin to exhaust available network resources and get bottlenecked that way again. However, your system should be able to handle many independent requests simultaneously.

Given that your requests are effectively taking ~150 ms each, and 1.3M entities at up to 100 per request means roughly 13,000 requests, you'd be looking at about 2,000 seconds running them one at a time. Add a second stream of requests and I'd be surprised if you didn't see something around 1,000 seconds; use four and you might get down to around 500 seconds. You might need to experiment with a smaller range and toggle the level of concurrency until you see your best throughput.
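
For example, one rough way to find that sweet spot is to time the same small probe range at several concurrency levels before committing to the full run. A sketch, again assuming the question's `fetchData` helper is in scope; the probe range and pool sizes are arbitrary:

// Time a fixed probe range at several concurrency levels.
const timeRun = async (poolSize, starts) => {
  let next = 0;
  const worker = async () => {
    while (next < starts.length) await fetchData(starts[next++]);
  };
  const t0 = Date.now();
  await Promise.all(Array.from({ length: poolSize }, worker));
  return Date.now() - t0;
};

const benchmark = async () => {
  // 50 probe requests covering regNos 100..5000 (arbitrary choice)
  const starts = Array.from({ length: 50 }, (_, i) => (i + 1) * 100);
  for (const poolSize of [1, 2, 4, 8, 16]) {
    console.log(`${poolSize} in flight: ${await timeRun(poolSize, starts)} ms`);
  }
};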

Note: it's possible that the site you're accessing already has (or may quickly install if you start doing the above) rate-limiters to throttle high-volume traffic from single sources. Consider whether the API you're calling can handle this much traffic before you start using it this way.
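
If the server does start throttling, a common mitigation is to back off and retry. A sketch that assumes the underlying HTTP call uses `fetch` and that the server signals throttling with HTTP 429; both are assumptions about this particular API:

// Assumed: a throttled request comes back with HTTP status 429.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const fetchWithBackoff = async (url, retries = 5) => {
  for (let attempt = 0; attempt < retries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res.json();
    // Back off 1s, 2s, 4s, ... before retrying.
    await sleep(1000 * 2 ** attempt);
  }
  throw new Error(`still throttled after ${retries} attempts: ${url}`);
};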

Patrick87
  • The JavaScript will be evaluated on a single thread, but we can probably expect the different fetch calls to be made by a thread pool, behind the scenes inside the runtime? – Ben Aston Feb 17 '20 at 17:38
  • Also relevant: https://stackoverflow.com/questions/12060869/why-is-node-js-only-processing-six-requests-at-a-time – Ben Aston Feb 17 '20 at 17:48
  • @Ben Ah your point is node will already send multiple overlapping requests by default, and you can configure the amount of parallelism as in the answer. But will it really? The part that is doing the `fetchData` is in a plain `for` loop; surely the code is going to block on that since you could use the response to break out of the loop before going to the next iteration? The OP doesn't do that here but could. Does node.js do enough static analysis to make an educated guess or just assume it's supposed to be synchronous? – Patrick87 Feb 17 '20 at 18:00
  • The Node.js API will be invoked synchronously, and the network calls will be hived-off to a thread pool under the hood, with callbacks being triggered asynchronously. Progress through the script will be delayed by things like `Promise.all`, but the main runtime thread of execution will not be blocked due to the asynchronous nature of the fetch API. – Ben Aston Feb 17 '20 at 18:02
  • @Ben I think I get it, `fetchData` doesn't actually send the request, just creates the task that sends the request... it's the `Promise.all` that actually starts the sending – Patrick87 Feb 17 '20 at 18:03
  • ^^^ (or at any rate the `fetchData` is not synchronous) – Patrick87 Feb 17 '20 at 18:17
  • @Patrick87 Thank you for your reply. I updated the question to make it clearer. Pls take a look ) – Dilshod Turobov Feb 18 '20 at 06:17
  • fetchData is a function written by the author. The HTTP request API exposed to JavaScript programs is fetch. The fetch function will be implemented according to the W3C WebAPI specification, but the precise mechanism of implementation will be left to the author of the environment hosting the JavaScript runtime. For Node.js, a thread pool is used IIUC – Ben Aston Feb 18 '20 at 12:44