Questions tagged [sre]

Site Reliability Engineering (SRE) is a reliability-focused implementation of DevOps.

Its highest-level concern is to design, build, and support software with an "ever-watchful eye on system availability, latency, performance, and capacity".

SRE started at Google but has since been adopted by several other companies.

49 questions
4
votes
1 answer

PromQL query to calculate service uptime & downtime from a fixed date

I'm trying to build a basic SRE dashboard in order to learn Prometheus/Grafana. I want to calculate the number of hours the service has been running and the number of hours it's been down since 1 January of the current year so that I can reduce…
user9492428
  • 603
  • 1
  • 9
  • 25
3
votes
1 answer

manage dataproc cluster access using service account and IAM roles

I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project. Let's say I have created a service account named 'data-proc-service-account@my-cloud-project.iam.gserviceaccount.com' and then I…
2
votes
1 answer

conditions to check if Aerospike cluster is being idle

Assuming Aerospike is running, I need some conditions with which to check whether the Aerospike cluster is idle and not being used at all. I tried checking log files, but it also logs the heartbeat, so even if Aerospike is not running it will generate…
Sujay_ks
  • 47
  • 7
2
votes
1 answer

how do I measure error budget consumption for rolling windows?

I have an SLO for one application where 95% of service response times must be less than 450ms over a rolling 24-hour window. I sample once every 60 seconds. Typically my "current service level" is around 96-97%. If the service level falls below 95%…
Miked
  • 21
  • 1
1
vote
1 answer

RBAC for Infrastructure Engineer

I feel this is a rather basic question, but somehow I'm unable to find a good answer. Recently, auditors have been complaining about the Role Based Access Control for our cloud set-up. My team is responsible for the Cloud infrastructure (aka Cloud…
Herman
  • 750
  • 1
  • 10
  • 23
1
vote
1 answer

Can Services in GCP's Monitoring monitor endpoints?

I installed managed Anthos on a GKE cluster. Anthos Service Mesh is working and is displaying my API. Thanks to that, the Services in Monitoring automatically detect my API. This is great as it enables me to easily set SLOs and Error Budget for…
Marcin Kulik
  • 845
  • 1
  • 12
  • 28
1
vote
1 answer

Can TTFB be affected after page load?

In the case of server-side rendering, we know that TTFB is the time between the start of the request and the start of the response. My question is: can the TTFB be affected if the page visually updates due to filters or something but is not a…
user14199036
1
vote
0 answers

What and where is this class 'UniversalScalabilityLawForecast' in Micrometer library?

I'm reading 'SRE with Java Microservices' (O'Reilly): "USL forecasting is a form of “derived” Meter in Micrometer and can be enabled as shown in Example 4-39." Example 4-39. Universal scalability law forecast configuration in…
BY-J
  • 11
  • 2
1
vote
0 answers

What a page and pager mean in SRE context?

I've been reading the Google SRE Book and I've found the words page and pager in multiple places. In this context, what do they mean? See link. Thank you.
Iván Casanova
  • 351
  • 1
  • 6
  • 16
1
vote
0 answers

Is the error budget in GCP UI supposed to rise above 100%?

I have just started using SLOs in GCP and my first SLI seems to be working, but the "error budget" field is way above 100%. All the examples I have seen online sit at 100%, whereas mine seems to float from 700.00% up into the thousands.…
1
vote
1 answer

How to avoid "Positive Feedback Cycle Overload Problem"?

Sometimes, while designing reliable systems, we try to make the system more reliable by adding retries in the event of failure (with feedback mechanisms). This creates the potential for overload, because we may be adding more load to an already…
1
vote
0 answers

SLO compliance report according to google SRE book

I want to create an SLO compliance report like the one the Google SRE workbook shows here: https://landing.google.com/sre/workbook/chapters/implementing-slos/#slo-compliance-report As shown in the description, the numbers in parentheses indicate the number…
zubug55
  • 729
  • 7
  • 27
1
vote
1 answer

How do we measure the site availability?

To measure the availability of a web site / API, should the dependencies also be considered? For instance, assume the payment service is down, but the shopping site is still available. Here the customer is not able to complete the purchase since the…
programmer
  • 249
  • 4
  • 12
0
votes
1 answer

docker unable to delete default network

When I start the docker-compose file all containers are working fine. Docker File: services: db: container_name: postgresql environment: POSTGRES_DB: sonar POSTGRES_PASSWORD: sonar POSTGRES_USER: sonar hostname:…
Mayur Dagdi
  • 11
  • 1
  • 4
0
votes
0 answers

How to put Grafana into maintenance mode?

Is there any way to put Grafana in maintenance mode? I want to show the details of the planned maintenance window in the Grafana UI for all the users. How can we do it? Where can we show text like the below? The Observability platform will be on…
SHC
  • 487
  • 1
  • 6
  • 19