
I have a Lambda function that writes metrics to CloudWatch. While it writes metrics, it also generates some logs in a log group.

INFO:: username: simran+test@example.com ClinicID: 7667 nodename: MacBook-Pro-2.local

INFO:: username: simran+test2@example.com ClinicID: 7667 nodename: MacBook-Pro-2.local

INFO:: username: simran+test@example.com ClinicID: 7668 nodename: MacBook-Pro-2.local

INFO:: username: simran+test3@example.com ClinicID: 7667 nodename: MacBook-Pro-2.local

I would like to query AWS logs for the past x hours, where x could be anywhere between 12 and 24 hours, based on any of the params.

For example:

  1. Query CloudWatch logs in the last 5 hours where ClinicID=7667
  2. Query CloudWatch logs in the last 5 hours where ClinicID=7667 and username='simran+test@example.com'
  3. Query CloudWatch logs in the last 5 hours where username='simran+test@example.com'

I am using boto3 in Python.

– systemdebt

4 Answers


You can get what you want using CloudWatch Logs Insights.

You would use the start_query and get_query_results APIs: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html

To start a query, you would use the following (this covers use case 2 from your question; 1 and 3 are similar):

import boto3
from datetime import datetime, timedelta
import time

client = boto3.client('logs')

# Parse the fields out of @message, then filter on them
query = "fields @timestamp, @message | parse @message \"username: * ClinicID: * nodename: *\" as username, ClinicID, nodename | filter ClinicID = 7667 and username='simran+test@example.com'"

log_group = '/aws/lambda/NAME_OF_YOUR_LAMBDA_FUNCTION'

# start_query takes Unix timestamps (in seconds) for the time range
start_query_response = client.start_query(
    logGroupName=log_group,
    startTime=int((datetime.today() - timedelta(hours=5)).timestamp()),
    endTime=int(datetime.now().timestamp()),
    queryString=query,
)

query_id = start_query_response['queryId']

response = None

# Poll until the query leaves the 'Scheduled' and 'Running' states
# (see danialk's comment below about the 'Scheduled' status)
while response is None or response['status'] in ('Scheduled', 'Running'):
    print('Waiting for query to complete ...')
    time.sleep(1)
    response = client.get_query_results(
        queryId=query_id
    )

The response will contain your data in the following format (plus some metadata):

{
  'results': [
    [
      {
        'field': '@timestamp',
        'value': '2019-12-09 17:07:24.428'
      },
      {
        'field': '@message',
        'value': 'username: simran+test@example.com ClinicID: 7667 nodename: MacBook-Pro-2.local\n'
      },
      {
        'field': 'username',
        'value': 'simran+test@example.com'
      },
      {
        'field': 'ClinicID',
        'value': '7667'
      },
      {
        'field': 'nodename',
        'value': 'MacBook-Pro-2.local\n'
      }
    ]
  ]
}
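
If you would rather work with each result row as a plain dict instead of the field/value pairs above, a small helper can flatten them. A minimal sketch (the helper name is my own):

def rows_to_dicts(results):
    """Flatten Logs Insights result rows into plain dicts."""
    return [{col['field']: col['value'] for col in row} for row in results]

rows = rows_to_dicts(response['results'])
# e.g. rows[0]['ClinicID'] == '7667'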
Stephen Ostermiller
  • 23,933
  • 14
  • 88
  • 109
Dejan Peretin
  • 10,891
  • 1
  • 45
  • 54
  • You should pay attention to the `limit` argument of the `start_query` function ([link](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.start_query)), which is 1000 by default. If you work with a high-density log, some events can be omitted without specifying a higher limit. – Konstantin Jul 09 '20 at 10:20
  • Querying using Logs Insights is chargeable. Be cautious of the time frame, and hence the amount of data, the query will analyze. Run the same query in the GUI to see the amount of data it processes. – tarvinder91 Oct 16 '20 at 19:59
  • Note that `response['status']` can also be `Scheduled` in addition to `Running`, so consider adding `or response['status'] == 'Scheduled'` to the `while` conditions. The full list is `Scheduled`, `Running`, `Complete`, `Failed`, `Cancelled`, `Timeout` and `Unknown` from the linked docs. – danialk May 28 '21 at 11:17
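
Putting those last two comments together, here is a sketch of a more defensive polling loop: it treats both `Scheduled` and `Running` as in-progress states and gives up after a deadline (the helper name and timeout are my own choices; you can also pass a higher `limit` to `start_query` as Konstantin suggests):

import time

def wait_for_query(client, query_id, timeout_seconds=60):
    """Poll get_query_results until the query leaves Scheduled/Running."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = client.get_query_results(queryId=query_id)
        # Complete, Failed, Cancelled, Timeout and Unknown are terminal states
        if response['status'] not in ('Scheduled', 'Running'):
            return response
        time.sleep(1)
    raise TimeoutError(f'Query {query_id} did not finish within {timeout_seconds}s')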

You can achieve this with the CloudWatch Logs client and a little bit of coding. You can also customize the conditions, or use the json module for a precise result.

EDIT

You can use describe_log_streams to get the streams. If you want only the latest, just set limit=1; if you want more than one, use a for loop to iterate over all the streams while filtering, as shown below.

import boto3

client = boto3.client('logs')

## For the latest stream
stream_response = client.describe_log_streams(
    logGroupName="/aws/lambda/lambdaFnName",  # Can be dynamic
    orderBy='LastEventTime',
    descending=True,  # Newest first; orderBy alone is ascending (see comment below)
    limit=1           # Just the latest stream
)

# describe_log_streams returns a list of streams, so index into it
latest_log_stream_name = stream_response["logStreams"][0]["logStreamName"]

response = client.get_log_events(
    logGroupName="/aws/lambda/lambdaFnName",
    logStreamName=latest_log_stream_name,
    startTime=12345678,  # epoch milliseconds
    endTime=12345678,    # epoch milliseconds
)

# Each event["message"] is a plain string, so filter on substrings
for event in response["events"]:
    if "ClinicID: 7667" in event["message"]:
        print(event["message"])
    elif "username: simran+test@example.com" in event["message"]:
        print(event["message"])
    # more if/else conditions as needed

## For more than one stream, e.g. the latest 5
stream_response = client.describe_log_streams(
    logGroupName="/aws/lambda/lambdaFnName",  # Can be dynamic
    orderBy='LastEventTime',
    descending=True,
    limit=5
)

for log_stream in stream_response["logStreams"]:
    response = client.get_log_events(
        logGroupName="/aws/lambda/lambdaFnName",
        logStreamName=log_stream["logStreamName"],
        startTime=12345678,  # epoch milliseconds
        endTime=12345678,    # epoch milliseconds
    )
    ## For example, searching for ClinicID 7667; can be dynamic
    for event in response["events"]:
        if "ClinicID: 7667" in event["message"]:
            print(event["message"])
        elif "username: simran+test@example.com" in event["message"]:
            print(event["message"])
        # more if/else conditions as needed
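
Since get_log_events returns each message as a raw string, you could also parse the fields out with a regular expression instead of substring checks. A minimal sketch based on the log format from the question (the pattern name is my own):

import re

# Matches lines like:
# INFO:: username: simran+test@example.com ClinicID: 7667 nodename: MacBook-Pro-2.local
LOG_PATTERN = re.compile(
    r"username:\s*(?P<username>\S+)\s+"
    r"ClinicID:\s*(?P<clinic_id>\d+)\s+"
    r"nodename:\s*(?P<nodename>\S+)"
)

for event in response["events"]:
    match = LOG_PATTERN.search(event["message"])
    if match and match.group("clinic_id") == "7667":
        print(match.group("username"), match.group("nodename"))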

Let me know how it goes.

– Sanny Patel
  • Thanks for your response. Log stream names are generated automatically by the Lambda function, so I do not know them in advance. How should I go about it? – systemdebt Dec 09 '19 at 08:35
  • @SannyPatel My AWS logs are as below; how can I capture the entire JSON? INFO 2020-11-27 10:30:09,510 [[reltiodatagateway-1.0.0-SNAPSHOT].callDnB-Main-Flow.stage1.03] org.mule.api.processor.LoggerMessageProcessor: 80916b10-309b-11eb-ab19-0242ac110002 JSON Output After calling DnB API. Response Details { "jobId": 5754492016394240, "success": "OK", "message": "Scheduled" } – Kranthi Sama Nov 27 '20 at 15:32
  • Create your Lambda and define a `FunctionName` explicitly. This will allow you to create the Log Group dynamically. Another (worse) option is to query CloudFormation for the Lambda name and then build the log group name based on that. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html#cfn-lambda-function-functionname – Jmoney38 Dec 06 '21 at 15:23
  • `orderBy='LastEventTime'` by itself doesn't do anything. You also need to specify `descending`, e.g. `descending=True`, to get the latest event. – West Jun 30 '22 at 09:19

I used awslogs. If you install it, you can run the command below; --watch will tail the new logs.

awslogs get /aws/lambda/log-group-1 --start="5h ago" --watch

You can install it using

pip install awslogs

To filter, you can do:

awslogs get /aws/lambda/log-group-1  --filter-pattern '"ClinicID=7667"' --start "5h ago" --timestamp

It supports multiple filter patterns as well.

awslogs get /aws/lambda/log-group-1  --filter-pattern '"ClinicID=7667"' --filter-pattern '" username=simran+test@abc.com"' --start "5h ago" --timestamp

References:

  • awslogs
  • awslogs on PyPI

– Arun Kamalanathan

The easiest way is to use awswrangler:

import boto3
import awswrangler as wr
from datetime import datetime, timedelta

# Must define a default session for wrangler to work
region = "us-east-1"  # example region; use your own
boto3.setup_default_session(region_name=region)

# For example, the last 5 hours
from_timestamp = datetime.now() - timedelta(hours=5)
to_timestamp = datetime.now()

df = wr.cloudwatch.read_logs(
    log_group_names=["loggroup"],
    start_time=from_timestamp,
    end_time=to_timestamp,
    query="fields @timestamp, @message | sort @timestamp desc | limit 5",
)

You can pass a list of the log groups needed, along with start and end times. The output is a pandas DataFrame containing the results.

FYI, under the hood, awswrangler uses the same boto3 calls as in @Dejan's answer.
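
To match the original question, you could push the filtering into the Insights query itself, reusing the parse/filter string from the accepted answer. A sketch under that assumption:

# Parse and filter inside the Insights query, as in the accepted answer
query = (
    'fields @timestamp, @message '
    '| parse @message "username: * ClinicID: * nodename: *" '
    'as username, ClinicID, nodename '
    '| filter ClinicID = 7667'
)

df = wr.cloudwatch.read_logs(
    log_group_names=["loggroup"],
    start_time=from_timestamp,
    end_time=to_timestamp,
    query=query,
)

# If the parsed fields come back as DataFrame columns, you can also
# filter after the fact, e.g.:
# df = df[df["ClinicID"] == "7667"]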

– HagaiA