
In case of HTTP requests like HEAD / GET / POST etc., what information about the client does the server receive?

I know some of the info includes the client IP, which can be used to block a user in case of, let's say, too many requests.

Another useful piece of information would be the User-Agent, which differs for browsers, scripts, curl, Postman, etc. (Of course the client can change the default by setting request headers, but that's alright.)

I want to know which other parameters can be used to identify a client (or define some of its properties). Does the server somehow get the MAC address?

So, is it possible, just from the request, to tell that it is being made by a "bot" (Python or Java code, e.g.) vs. a genuine user?

Assume there is no token or any such secret shared between client and server, so there is no session; each subsequent request is independent.

vish4071

4 Answers


The technique you are describing is generally called fingerprinting; the article covers the relevant properties and techniques. Depending on the use, there are many criticisms of it, as it bypasses a user's intention of being anonymous. In all cases it is a statistical technique, like most analytics.
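
For a rough sense of what header-based fingerprinting looks like on the server, here is a minimal sketch in Node.js/Express that hashes a handful of request properties into an ID. The choice of headers is purely illustrative, and any such fingerprint is statistical and easy to spoof.

    const crypto = require('crypto');
    const express = require('express');

    const app = express();

    // Build a crude fingerprint from a few request properties.
    // The header list here is illustrative, not exhaustive.
    function fingerprint(req) {
      const parts = [
        req.ip,                           // client (or proxy) IP
        req.get('user-agent') || '',      // easily spoofed by the client
        req.get('accept-language') || '',
        req.get('accept-encoding') || '',
      ];
      return crypto.createHash('sha256').update(parts.join('|')).digest('hex');
    }

    app.get('/', (req, res) => {
      res.send(`Fingerprint: ${fingerprint(req)}`);
    });

    app.listen(3000);

Treat the result as a hint for rate limiting or grouping traffic, not as a reliable identity: different clients can collide and the same client can change its headers at will.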

Cadmium
  • I'm trying to stop crawlers on my site, which tend to make upwards of 25k hits per hour and keep at it for a long time. I've found cases where my domain is hit, clearly using multiple proxies in the process. – vish4071 Jan 18 '23 at 13:12
  • fail2ban would be a lot easier; most of those bots that ignore robots.txt will yield a ton of 404s. – chovy Aug 25 '23 at 08:15

Putting your domain behind a service like Cloudflare might help prevent some of those bots from hitting your server. Beyond a service like that, setting up a reCAPTCHA would block bots from accessing any pages behind it.

It would be hard to detect bots using HTTP alone, because they can send you whatever headers they want. These services use other techniques to try to detect and filter out bots while allowing real users to access the site.
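
To sketch the reCAPTCHA route: after the client solves the challenge, the backend verifies the returned token against Google's siteverify endpoint before serving the protected page. A minimal Node.js example, assuming Node 18+ for the global fetch and a hypothetical RECAPTCHA_SECRET environment variable:

    // Verify a reCAPTCHA token server-side before serving a protected page.
    async function verifyRecaptcha(token, remoteIp) {
      const params = new URLSearchParams({
        secret: process.env.RECAPTCHA_SECRET, // assumed to hold your secret key
        response: token,                      // token sent by the client widget
        remoteip: remoteIp,
      });
      const res = await fetch('https://www.google.com/recaptcha/api/siteverify', {
        method: 'POST',
        body: params,
      });
      const data = await res.json();
      return data.success === true; // treat failures as likely bots
    }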

Hmkyriacou
  • 64
  • 1
  • 8

I don't think you can rely on any HTTP request header, because a client might not send it to the server, and/or there might be proxies between the client and the server that strip or alter the request headers.

If you just want to associate a unique ID to an HTTP request, you could generate an ID on your backend. For example, the JavaScript framework Hapi.js computes a request ID using this code:

new Date() + '-' + process.pid + '-' + Math.floor(Math.random() * 0x10000)

You might not even need to generate an ID manually. For example, if your app is on AWS and there is an Application Load Balancer in front of your backend, the incoming request will have the custom header X-Amzn-Trace-Id.
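
For example (a sketch, assuming an Express app running behind an ALB), you could log that header to correlate requests:

    const express = require('express');
    const app = express();

    // Log the trace ID that an AWS Application Load Balancer injects.
    app.use((req, res, next) => {
      const traceId = req.get('x-amzn-trace-id'); // absent when not behind an ALB
      console.log(`trace=${traceId || 'n/a'} ${req.method} ${req.url}`);
      next();
    });

    app.get('/', (req, res) => res.send('ok'));
    app.listen(3000);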

As for distinguishing between requests made by human clients and bots, I think you could adopt a "time trap" approach like the one described in this answer about honeypots for spambots.
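
A rough sketch of that time-trap idea in Node.js/Express (the 3-second threshold is a hypothetical choice): render the form with a server-issued timestamp and reject submissions that come back implausibly fast.

    const express = require('express');
    const app = express();
    app.use(express.urlencoded({ extended: false }));

    const MIN_FILL_TIME_MS = 3000; // hypothetical threshold: humans rarely submit faster

    app.get('/contact', (req, res) => {
      // Embed the render time; a real implementation should sign or encrypt it.
      res.send(`
        <form method="POST" action="/contact">
          <input type="hidden" name="ts" value="${Date.now()}">
          <textarea name="message"></textarea>
          <button>Send</button>
        </form>`);
    });

    app.post('/contact', (req, res) => {
      const renderedAt = Number(req.body.ts);
      if (!renderedAt || Date.now() - renderedAt < MIN_FILL_TIME_MS) {
        return res.status(400).send('Rejected: submitted too quickly (likely a bot).');
      }
      res.send('Thanks!');
    });

    app.listen(3000);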

jackdbd
  • I'm trying to stop crawlers on my site, which tend to make upwards of 25k hits per hour and keep at it for a long time. I've found cases where my domain is hit, clearly using multiple proxies in the process. It's not a form or any dynamic activity happening on my site, so there is no way to "fool" bots into interacting. – vish4071 Jan 18 '23 at 13:11
  • A few months ago I noticed some strange URL requests in GCP Cloud Trace. They had "zgrab" in the user agent, an application-layer network scanner. I did an IP lookup and found out there was already a report on AbuseIPDB. I think services like Cloudflare DNS and Cloud Armor use a database like AbuseIPDB to block requests coming from these abusive IPs. – jackdbd Jan 18 '23 at 21:07

HTTP request headers are not a good way to track users of your site, because users can edit these headers and the server has no way to verify their authenticity. Also, in the case of the IP address, it can change during a session if, for example, the user is on a mobile network.

My suggestion is to use a cookie with a unique, random ID, given to the user the first time they land on a page of your site. Keep in mind that the user can still edit or remove this cookie, so it isn't a perfect method. If you can force the user to log in, then you could track the user with their session token.
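
A minimal sketch of that cookie approach in Node.js/Express (assuming the cookie-parser middleware and a hypothetical cookie name, uid):

    const crypto = require('crypto');
    const express = require('express');
    const cookieParser = require('cookie-parser');

    const app = express();
    app.use(cookieParser());

    // Assign a random ID the first time a client arrives without one.
    app.use((req, res, next) => {
      if (!req.cookies.uid) {
        const id = crypto.randomUUID();
        res.cookie('uid', id, { httpOnly: true, sameSite: 'lax', maxAge: 365 * 24 * 3600 * 1000 });
        req.cookies.uid = id; // make it visible to the rest of this request
      }
      next();
    });

    app.get('/', (req, res) => res.send(`Hello, visitor ${req.cookies.uid}`));
    app.listen(3000);

As noted above, a client that refuses or clears cookies will get a fresh ID on every visit, so this only identifies cooperative clients.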