AWS CloudWatch Tutorial: Complete Guide to Master Cloud Monitoring
AWS CloudWatch serves as the comprehensive monitoring and observability service for AWS resources and applications, making it essential for maintaining reliable, performant cloud infrastructure. This CloudWatch tutorial provides complete guidance from fundamental concepts through advanced implementation patterns, enabling effective monitoring, troubleshooting, and optimization of AWS environments.
Understanding CloudWatch capabilities transforms how teams operate cloud infrastructure—from reactive firefighting to proactive monitoring, from manual troubleshooting to automated remediation, from guesswork to data-driven optimization. CloudWatch collects metrics, logs, and events from AWS services and custom applications, providing unified visibility across your entire cloud environment.
This comprehensive tutorial covers CloudWatch metrics, alarms, logs, dashboards, insights, events, and automation patterns with practical examples and best practices. Whether you’re new to CloudWatch or optimizing existing implementations, this guide delivers the knowledge needed to leverage CloudWatch effectively for robust AWS operations.
What is AWS CloudWatch?
Before diving into specific features, understanding what CloudWatch is and how it fits into AWS architecture provides essential context.
CloudWatch Overview
AWS CloudWatch is a monitoring and observability service that provides data and actionable insights for AWS resources, applications, and services running on AWS and on premises. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, giving you a unified view of your resources, applications, and services.
Core Capabilities:
Metrics: Collect and track numeric time-series data from AWS services and custom applications. Monitor CPU utilization, network traffic, disk I/O, custom business metrics, and thousands of other data points.
Alarms: Set thresholds on metrics and trigger notifications or automated actions when those thresholds are breached. Enable proactive response to operational issues before they impact users.
Logs: Collect, monitor, and analyze log data from applications, AWS services, and infrastructure. Search, filter, and create metrics from log data for comprehensive observability.
Dashboards: Create customizable visualizations combining multiple metrics, logs, and alarms into unified operational views. Share dashboards across teams for coordinated monitoring.
Events: Respond to state changes in AWS resources through event-driven automation. Trigger Lambda functions, SNS notifications, or other actions based on events.
Insights: Query and visualize log data at scale using a purpose-built query language. Perform sophisticated log analysis for troubleshooting and investigation.
Why Use CloudWatch?
Unified Monitoring: Single service monitoring all AWS resources rather than deploying separate monitoring tools for each service type.
Automatic Integration: AWS services automatically send metrics to CloudWatch without configuration. Start monitoring immediately upon resource creation.
Scalability: Handles monitoring data from single instances to massive fleets without infrastructure management or capacity planning.
Cost Efficiency: Pay only for metrics, logs, and features consumed. Free tier covers basic monitoring for most small-to-medium workloads.
Automation: Trigger automated responses to operational issues through alarms and events, reducing manual intervention and response time.
Troubleshooting: Correlate metrics, logs, and traces for comprehensive troubleshooting. Identify root causes faster with unified visibility.
CloudWatch Metrics
Metrics form the foundation of CloudWatch monitoring, representing time-series data about AWS resources and applications.
Understanding Metrics
Metric Definition: A metric represents a time-ordered set of data points published to CloudWatch. Metrics exist only in the region where they’re created and cannot be deleted (they expire after 15 months).
Metric Components:
Namespace: Container for metrics, organizing them by service (AWS/EC2, AWS/RDS, Custom/MyApp).
Metric Name: Name of the measurement (CPUUtilization, NetworkIn, DiskReadOps).
Dimensions: Name/value pairs that identify specific resource (InstanceId=i-1234567890abcdef0).
Timestamp: Time the measurement occurred.
Value: The measurement value.
Unit: Unit of measurement (Percent, Bytes, Count, Seconds).
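To make these components concrete, the sketch below (metric and dimension names are illustrative, not taken from any AWS service) assembles a single data point in the shape the boto3 `put_metric_data` API expects; the namespace is supplied separately on the call itself:

```python
from datetime import datetime, timezone

def build_datapoint(name, value, unit, dimensions):
    """Assemble one CloudWatch data point with the components above:
    metric name, value, unit, timestamp, and identifying dimensions."""
    return {
        'MetricName': name,
        'Value': value,
        'Unit': unit,
        'Timestamp': datetime.now(timezone.utc),
        'Dimensions': [{'Name': k, 'Value': v} for k, v in dimensions.items()],
    }

point = build_datapoint('OrdersProcessed', 42, 'Count',
                        {'Environment': 'Production'})
# The namespace is passed separately on the API call, e.g.:
# boto3.client('cloudwatch').put_metric_data(
#     Namespace='CustomApp/Orders', MetricData=[point])
```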
AWS Service Metrics
AWS services automatically publish metrics to CloudWatch. Here are common metrics by service:
Amazon EC2:
Namespace: AWS/EC2
Key Metrics:
- CPUUtilization: Percentage of allocated compute units in use
- NetworkIn/NetworkOut: Network bytes received/sent
- DiskReadOps/DiskWriteOps: Completed disk read/write operations
- StatusCheckFailed: Instance or system status check failures
Dimensions:
- InstanceId: i-1234567890abcdef0
- InstanceType: t3.micro
- ImageId: ami-12345678
Default: 5-minute intervals (free)
Detailed: 1-minute intervals (paid)
Amazon RDS:
Namespace: AWS/RDS
Key Metrics:
- CPUUtilization: Database CPU usage percentage
- DatabaseConnections: Number of database connections in use
- FreeableMemory: Available RAM in bytes
- ReadLatency/WriteLatency: I/O operation latency
- DiskQueueDepth: Outstanding I/O requests
Dimensions:
- DBInstanceIdentifier: mydb-instance
- DatabaseClass: db.t3.micro
Amazon S3:
Namespace: AWS/S3
Key Metrics:
- BucketSizeBytes: Total bucket storage
- NumberOfObjects: Total object count
- AllRequests: Total request count
- 4xxErrors/5xxErrors: HTTP error counts
Dimensions:
- BucketName: my-bucket
- StorageType: StandardStorage
AWS Lambda:
Namespace: AWS/Lambda
Key Metrics:
- Invocations: Function invocation count
- Duration: Execution time in milliseconds
- Errors: Failed invocation count
- Throttles: Throttled invocation count
- ConcurrentExecutions: Concurrent execution count
Dimensions:
- FunctionName: my-function
- Resource: function-version or alias
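These service metrics can be read back programmatically with the GetMetricStatistics API. A minimal boto3 sketch (the function name my-function is a placeholder) that builds the request for Lambda invocation counts; the AWS call is kept inside an uninvoked `main()` since it needs credentials:

```python
from datetime import datetime, timedelta, timezone

def invocations_request(function_name, hours=1, period=300):
    """Parameters for GetMetricStatistics: the sum of Lambda Invocations
    over the last `hours`, in `period`-second buckets."""
    end = datetime.now(timezone.utc)
    return {
        'Namespace': 'AWS/Lambda',
        'MetricName': 'Invocations',
        'Dimensions': [{'Name': 'FunctionName', 'Value': function_name}],
        'StartTime': end - timedelta(hours=hours),
        'EndTime': end,
        'Period': period,
        'Statistics': ['Sum'],
    }

def main():
    import boto3  # requires AWS credentials
    cw = boto3.client('cloudwatch')
    resp = cw.get_metric_statistics(**invocations_request('my-function'))
    for dp in sorted(resp['Datapoints'], key=lambda d: d['Timestamp']):
        print(dp['Timestamp'], dp['Sum'])
# main()  # uncomment to run against your AWS account
```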
Custom Metrics
Publish custom application and business metrics using AWS CLI, SDKs, or CloudWatch Agent.
Publishing via AWS CLI:
# Single metric data point
aws cloudwatch put-metric-data \
--namespace "CustomApp/Orders" \
--metric-name "OrdersProcessed" \
--value 42 \
--timestamp "2024-12-29T10:00:00Z" \
--dimensions Environment=Production,Region=US-East
# Multiple dimensions
aws cloudwatch put-metric-data \
--namespace "CustomApp/Performance" \
--metric-name "ResponseTime" \
--value 234 \
--unit Milliseconds \
--dimensions API=GetUser,Method=POST,Environment=Prod
Publishing via Python (boto3):
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

# Publish single metric
cloudwatch.put_metric_data(
    Namespace='CustomApp/Sales',
    MetricData=[
        {
            'MetricName': 'Revenue',
            'Value': 12450.00,
            'Unit': 'None',
            'Timestamp': datetime.now(timezone.utc),
            'Dimensions': [
                {'Name': 'Store', 'Value': 'Store-123'},
                {'Name': 'Region', 'Value': 'Northeast'}
            ]
        }
    ]
)

# Publish multiple metrics in one call
cloudwatch.put_metric_data(
    Namespace='CustomApp/Performance',
    MetricData=[
        {
            'MetricName': 'PageLoadTime',
            'Value': 1.234,
            'Unit': 'Seconds',
            'Timestamp': datetime.now(timezone.utc)
        },
        {
            'MetricName': 'ErrorCount',
            'Value': 3,
            'Unit': 'Count',
            'Timestamp': datetime.now(timezone.utc)
        }
    ]
)
Publishing via CloudWatch Agent:
The CloudWatch agent collects system-level metrics and custom metrics from EC2 and on-premises servers.
Agent Configuration (JSON):
{
  "metrics": {
    "namespace": "CustomApp/System",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          {
            "name": "cpu_usage_idle",
            "rename": "CPU_IDLE",
            "unit": "Percent"
          }
        ],
        "totalcpu": false,
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": [
          {
            "name": "used_percent",
            "rename": "DISK_USED",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60,
        "resources": ["/", "/data"]
      },
      "mem": {
        "measurement": [
          {
            "name": "mem_used_percent",
            "rename": "MEMORY_USED",
            "unit": "Percent"
          }
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
Metric Statistics and Math
Statistics: Aggregations of metric data over specified periods.
Available Statistics:
- Average: Mean value
- Sum: Total of all values
- Minimum: Lowest value
- Maximum: Highest value
- SampleCount: Number of data points
Extended Statistics:
- Percentiles: p50 (median), p90, p95, p99
- Custom percentiles: p99.9, p99.99
Metric Math: Perform calculations across multiple metrics.
Examples:
# Calculate error rate percentage
m1 = ErrorCount metric
m2 = RequestCount metric
ErrorRate = (m1 / m2) * 100
# Calculate average response time
m1 = TotalResponseTime metric
m2 = RequestCount metric
AvgResponseTime = m1 / m2
# Calculate network throughput
m1 = NetworkIn metric
m2 = NetworkOut metric
TotalThroughput = (m1 + m2) / 300 # Convert to bytes per second
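Metric math expressions are evaluated server-side through the GetMetricData API. The sketch below (the function name is a placeholder) expresses the error-rate calculation above as metric data queries; only the math result is returned because the raw metrics set ReturnData to false:

```python
def error_rate_queries(function_name, period=300):
    """GetMetricData queries: two raw Lambda metrics plus a math
    expression computing the error rate as a percentage."""
    def metric(name):
        return {
            'Namespace': 'AWS/Lambda',
            'MetricName': name,
            'Dimensions': [{'Name': 'FunctionName', 'Value': function_name}],
        }
    return [
        {'Id': 'm1', 'ReturnData': False,
         'MetricStat': {'Metric': metric('Errors'),
                        'Period': period, 'Stat': 'Sum'}},
        {'Id': 'm2', 'ReturnData': False,
         'MetricStat': {'Metric': metric('Invocations'),
                        'Period': period, 'Stat': 'Sum'}},
        {'Id': 'errorRate', 'Expression': '(m1 / m2) * 100',
         'Label': 'Error rate (%)'},
    ]

def main():
    from datetime import datetime, timedelta, timezone
    import boto3  # requires AWS credentials
    end = datetime.now(timezone.utc)
    resp = boto3.client('cloudwatch').get_metric_data(
        MetricDataQueries=error_rate_queries('my-function'),
        StartTime=end - timedelta(hours=1), EndTime=end)
    print(resp['MetricDataResults'])
# main()  # uncomment to run against your AWS account
```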
CloudWatch Alarms
Alarms monitor metrics and trigger actions when thresholds are breached, enabling proactive response to operational issues.
Creating Alarms
Alarms consist of several key components defining when and how they trigger.
Alarm States:
- OK: Metric within defined threshold
- ALARM: Metric breached threshold
- INSUFFICIENT_DATA: Not enough data to determine state
Alarm Components:
Threshold: Numeric value defining the alarm boundary
Evaluation Period: Number of data points to evaluate
Datapoints to Alarm: How many periods must breach before triggering
Comparison Operator: How to compare the metric to the threshold (>, <, >=, <=)
Missing Data Treatment: How to handle missing data points
Creating Alarms via Console
Example: High CPU Alarm
Configuration:
- Metric: AWS/EC2 CPUUtilization
- Dimension: InstanceId = i-1234567890
- Statistic: Average
- Period: 5 minutes
- Threshold: 80%
- Comparison: Greater than
- Datapoints: 2 out of 3
- Missing Data: Treat as breaching
Behavior:
- Evaluates last 3 periods (15 minutes)
- Alarms if 2 of 3 periods exceed 80% CPU
- Missing data counts as breach
Creating Alarms via AWS CLI
# CPU utilization alarm
aws cloudwatch put-metric-alarm \
--alarm-name "HighCPU-i-1234567890" \
--alarm-description "CPU exceeds 80%" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--dimensions Name=InstanceId,Value=i-1234567890 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
--treat-missing-data breaching
# Disk space alarm
aws cloudwatch put-metric-alarm \
--alarm-name "LowDiskSpace-i-1234567890" \
--metric-name disk_used_percent \
--namespace CWAgent \
--statistic Average \
--period 300 \
--threshold 85 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--dimensions Name=InstanceId,Value=i-1234567890 Name=path,Value=/data \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
# Lambda error rate alarm
aws cloudwatch put-metric-alarm \
--alarm-name "HighErrorRate-MyFunction" \
--metric-name Errors \
--namespace AWS/Lambda \
--statistic Sum \
--period 300 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--dimensions Name=FunctionName,Value=MyFunction \
--alarm-actions arn:aws:sns:us-east-1:123456789012:lambda-alerts
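The same alarms can be created from Python. A sketch mirroring the first CLI example (the instance id and SNS topic ARN are the placeholder values used throughout this section):

```python
def high_cpu_alarm(instance_id, topic_arn):
    """put_metric_alarm parameters mirroring the CPU alarm CLI example."""
    return {
        'AlarmName': f'HighCPU-{instance_id}',
        'AlarmDescription': 'CPU exceeds 80%',
        'MetricName': 'CPUUtilization',
        'Namespace': 'AWS/EC2',
        'Statistic': 'Average',
        'Period': 300,
        'Threshold': 80,
        'ComparisonOperator': 'GreaterThanThreshold',
        'EvaluationPeriods': 3,
        'DatapointsToAlarm': 2,
        'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
        'AlarmActions': [topic_arn],
        'TreatMissingData': 'breaching',
    }

def main():
    import boto3  # requires AWS credentials
    boto3.client('cloudwatch').put_metric_alarm(
        **high_cpu_alarm('i-1234567890',
                         'arn:aws:sns:us-east-1:123456789012:alerts'))
# main()  # uncomment to run against your AWS account
```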
Creating Alarms via CloudFormation
Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-HighCPU'
      AlarmDescription: CPU utilization exceeds 80%
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref MyEC2Instance
      AlarmActions:
        - !Ref AlertTopic
      TreatMissingData: breaching

  DatabaseConnectionsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-HighDBConnections'
      MetricName: DatabaseConnections
      Namespace: AWS/RDS
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref MyDBInstance
      AlarmActions:
        - !Ref AlertTopic
Composite Alarms
Combine multiple alarms using AND/OR logic for sophisticated alerting.
Example: Application Health Alarm
# Create individual alarms
aws cloudwatch put-metric-alarm --alarm-name "HighCPU" ...
aws cloudwatch put-metric-alarm --alarm-name "HighMemory" ...
aws cloudwatch put-metric-alarm --alarm-name "HighErrorRate" ...
# Create composite alarm
aws cloudwatch put-composite-alarm \
--alarm-name "ApplicationUnhealthy" \
--alarm-description "Multiple health indicators failing" \
--alarm-rule "ALARM(HighCPU) OR ALARM(HighMemory) OR ALARM(HighErrorRate)" \
--actions-enabled \
--alarm-actions arn:aws:sns:us-east-1:123456789012:critical-alerts
Alarm Actions
Configure actions triggered when alarms change state.
Available Actions:
SNS Notifications: Send emails, SMS, or HTTP/S endpoints
--alarm-actions arn:aws:sns:us-east-1:123456789012:my-topic
Auto Scaling: Trigger scale-up or scale-down actions
--alarm-actions arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:policy-id
EC2 Actions: Stop, terminate, reboot, or recover instances
--alarm-actions arn:aws:automate:us-east-1:ec2:stop \
arn:aws:automate:us-east-1:ec2:terminate \
arn:aws:automate:us-east-1:ec2:reboot \
arn:aws:automate:us-east-1:ec2:recover
Systems Manager Actions: Execute automation documents
--alarm-actions arn:aws:ssm:us-east-1:123456789012:opsitem:1
CloudWatch Logs
CloudWatch Logs collects, monitors, and analyzes log data from applications and AWS services.
Log Concepts
Log Groups: Containers for log streams sharing retention, permissions, and encryption settings.
Log Streams: Sequences of log events from same source (e.g., single application instance).
Log Events: Individual log entries with timestamp and message.
Retention: How long CloudWatch stores logs (1 day to 10 years, or indefinitely).
Sending Logs to CloudWatch
CloudWatch Logs Agent (Legacy):
# Install agent
sudo yum install -y awslogs
# Configure /etc/awslogs/awslogs.conf
[/var/log/messages]
datetime_format = %b %d %H:%M:%S
file = /var/log/messages
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = /var/log/messages
# Start agent
sudo service awslogs start
CloudWatch Unified Agent (Current):
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/application.log",
            "log_group_name": "/aws/app/application",
            "log_stream_name": "{instance_id}/{hostname}",
            "timestamp_format": "%Y-%m-%d %H:%M:%S"
          },
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/aws/nginx/access",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%d/%b/%Y:%H:%M:%S %z"
          }
        ]
      }
    }
  }
}
Application Code (Python boto3):
import boto3
from datetime import datetime

cloudwatch_logs = boto3.client('logs')
log_group = '/aws/application/myapp'
log_stream = 'instance-123'

# Ensure log group and stream exist
try:
    cloudwatch_logs.create_log_group(logGroupName=log_group)
except cloudwatch_logs.exceptions.ResourceAlreadyExistsException:
    pass

try:
    cloudwatch_logs.create_log_stream(
        logGroupName=log_group,
        logStreamName=log_stream
    )
except cloudwatch_logs.exceptions.ResourceAlreadyExistsException:
    pass

# Send log events (timestamps are milliseconds since the epoch)
cloudwatch_logs.put_log_events(
    logGroupName=log_group,
    logStreamName=log_stream,
    logEvents=[
        {
            'timestamp': int(datetime.now().timestamp() * 1000),
            'message': 'Application started successfully'
        },
        {
            'timestamp': int(datetime.now().timestamp() * 1000),
            'message': 'Processing user request: user_id=12345'
        }
    ]
)
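A small helper keeps the put_log_events payload tidy: timestamps must be milliseconds since the epoch and events must be in chronological order. This sketch reuses the log group and stream names from above:

```python
import time

def to_log_events(messages):
    """Turn plain strings into the event dicts put_log_events expects,
    with millisecond timestamps in ascending order."""
    now_ms = int(time.time() * 1000)
    return [{'timestamp': now_ms + i, 'message': m}
            for i, m in enumerate(messages)]

def main():
    import boto3  # requires AWS credentials
    boto3.client('logs').put_log_events(
        logGroupName='/aws/application/myapp',  # names reused from above
        logStreamName='instance-123',
        logEvents=to_log_events(['startup complete', 'ready to serve']))
# main()  # uncomment to run against your AWS account
```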
Lambda Functions (Automatic):
Lambda automatically sends logs to CloudWatch Logs:
import json

def lambda_handler(event, context):
    # Print statements automatically go to CloudWatch Logs
    print(f"Event received: {json.dumps(event)}")

    # Process event (process_event is application code defined elsewhere)
    result = process_event(event)
    print(f"Processing completed: {result}")

    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }
Searching and Filtering Logs
Filter Patterns:
# Simple text search
ERROR
# Multiple terms (OR)
ERROR WARN FATAL
# Multiple terms (AND)
ERROR "user not found"
# Field extraction
[timestamp, request_id, level, message]
[*, *, level=ERROR, *]
# JSON logs
{ $.level = "ERROR" }
{ $.responseTime > 1000 }
{ $.userId EXISTS }
CLI Examples:
# Search for errors in last hour
aws logs filter-log-events \
--log-group-name "/aws/lambda/MyFunction" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"
# Search with field extraction
aws logs filter-log-events \
--log-group-name "/aws/app/application" \
--filter-pattern '[timestamp, request_id, level=ERROR, message]'
# Tail logs (real-time)
aws logs tail "/aws/lambda/MyFunction" --follow
Log Insights
CloudWatch Logs Insights provides a purpose-built query language for analyzing log data.
Query Examples:
# Find all errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Count errors by type
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message /ERROR: (?<errorType>.*?) -/
| stats count() by errorType
| sort count desc
# Calculate Lambda durations from REPORT lines
fields @timestamp, @requestId, @duration
| filter @type = "REPORT"
| stats avg(@duration), max(@duration), pct(@duration, 95) by bin(5m)
# Analyze Lambda cold starts
fields @timestamp, @message
| filter @type = "REPORT"
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<initDuration>.*?) ms/
| stats count(), avg(initDuration), max(initDuration)
# Top slow requests
fields @timestamp, request.method, request.path, response.time
| filter response.time > 1000
| sort response.time desc
| limit 20
Saved Queries:
# Save frequently used query
aws logs put-query-definition \
--name "ErrorAnalysis" \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m)' \
--log-group-names "/aws/app/*"
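Insights queries can also be run from code: start_query kicks off an asynchronous query and get_query_results is polled until it completes. A sketch reusing the ErrorAnalysis query string from above (log group name is the placeholder used earlier):

```python
def error_analysis_query():
    """The ErrorAnalysis query string saved above."""
    return ('fields @timestamp, @message '
            '| filter @message like /ERROR/ '
            '| stats count() by bin(5m)')

def main():
    import time
    import boto3  # requires AWS credentials
    logs = boto3.client('logs')
    q = logs.start_query(
        logGroupName='/aws/app/application',
        startTime=int(time.time()) - 3600,
        endTime=int(time.time()),
        queryString=error_analysis_query())
    while True:
        result = logs.get_query_results(queryId=q['queryId'])
        if result['status'] in ('Complete', 'Failed', 'Cancelled'):
            break
        time.sleep(1)  # Insights queries run asynchronously
    for row in result.get('results', []):
        print(row)
# main()  # uncomment to run against your AWS account
```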
Metric Filters
Extract metrics from log data for monitoring and alerting.
Create Metric Filter:
# Count error occurrences
aws logs put-metric-filter \
--log-group-name "/aws/app/application" \
--filter-name "ErrorCount" \
--filter-pattern "ERROR" \
--metric-transformations \
metricName=ApplicationErrors,metricNamespace=CustomApp,metricValue=1,defaultValue=0
# Extract response times
aws logs put-metric-filter \
--log-group-name "/aws/nginx/access" \
--filter-name "ResponseTime" \
--filter-pattern '[host, ident, authuser, date, request, status, bytes, referer, agent, response_time]' \
--metric-transformations \
metricName=ResponseTime,metricNamespace=Nginx,metricValue=$response_time,unit=Milliseconds
# Count specific errors
aws logs put-metric-filter \
--log-group-name "/aws/lambda/MyFunction" \
--filter-name "TimeoutErrors" \
--filter-pattern "Task timed out" \
--metric-transformations \
metricName=TimeoutCount,metricNamespace=Lambda/Errors,metricValue=1
Log Retention and Storage
Set Retention:
# Set retention to 30 days
aws logs put-retention-policy \
--log-group-name "/aws/lambda/MyFunction" \
--retention-in-days 30
# Retention options: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365,
# 400, 545, 731, 1827, 3653 days, or never expire
Export to S3:
# Create export task
aws logs create-export-task \
--log-group-name "/aws/app/application" \
--from $(date -u -d '7 days ago' +%s)000 \
--to $(date -u +%s)000 \
--destination my-log-bucket \
--destination-prefix "logs/application/"
CloudWatch Dashboards
Dashboards provide customizable visualizations of metrics, logs, and alarms.
Creating Dashboards
Via Console:
- Navigate to CloudWatch → Dashboards
- Create Dashboard
- Add widgets (Line, Number, Gauge, etc.)
- Configure metrics for each widget
- Arrange and resize widgets
- Save dashboard
Via CLI:
# Create dashboard with multiple widgets
aws cloudwatch put-dashboard \
--dashboard-name "ApplicationMonitoring" \
--dashboard-body file://dashboard.json
dashboard.json:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "EC2 CPU Utilization",
        "yAxis": {
          "left": {"min": 0, "max": 100}
        }
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/Lambda", "Invocations", {"stat": "Sum"}],
          [".", "Errors", {"stat": "Sum"}],
          [".", "Throttles", {"stat": "Sum"}]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "us-east-1",
        "title": "Lambda Metrics"
      }
    },
    {
      "type": "log",
      "properties": {
        "query": "SOURCE '/aws/lambda/MyFunction' | fields @timestamp, @message | filter @message like /ERROR/",
        "region": "us-east-1",
        "title": "Recent Errors"
      }
    }
  ]
}
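Dashboard bodies can also be assembled programmatically. A minimal sketch (widget titles and metric choices are illustrative) that builds a body in the same schema and could be passed to put_dashboard:

```python
import json

def metric_widget(title, metrics, stat='Average', region='us-east-1'):
    """One line-chart widget in the dashboard-body schema shown above."""
    return {
        'type': 'metric',
        'properties': {'metrics': metrics, 'period': 300,
                       'stat': stat, 'region': region, 'title': title},
    }

body = json.dumps({'widgets': [
    metric_widget('EC2 CPU Utilization', [['AWS/EC2', 'CPUUtilization']]),
    metric_widget('Lambda Errors', [['AWS/Lambda', 'Errors']], stat='Sum'),
]})

def main():
    import boto3  # requires AWS credentials
    boto3.client('cloudwatch').put_dashboard(
        DashboardName='ApplicationMonitoring', DashboardBody=body)
# main()  # uncomment to run against your AWS account
```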
Dashboard Best Practices
Organize by Audience:
- Operations: System health, resource utilization
- Development: Application metrics, error rates
- Business: User activity, transaction volumes
Use Appropriate Visualizations:
- Line charts: Trends over time
- Number widgets: Current values, totals
- Gauges: Percentages, capacity
- Pie charts: Distributions
- Bar charts: Comparisons
Include Context:
- Add alarm widgets showing current alarm states
- Include log insights widgets for error tracking
- Use annotations for deployments or incidents
- Add markdown widgets for documentation
CloudWatch Events / EventBridge
EventBridge (evolved from CloudWatch Events) enables event-driven architecture by responding to AWS service events and custom application events.
Event Patterns
AWS Service Events:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}
Scheduled Events (defined with a schedule expression on the rule rather than an event pattern):
- rate(5 minutes)
- cron(0 20 * * ? *)
Creating Rules
Respond to EC2 State Changes:
aws events put-rule \
--name "EC2StateChange" \
--event-pattern '{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"]
}'
aws events put-targets \
--rule "EC2StateChange" \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:HandleEC2StateChange"
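A target Lambda for the EC2StateChange rule above might look like the following sketch (the field names match the EC2 state-change event schema; the handler logic is illustrative):

```python
def lambda_handler(event, context):
    """Pull the instance id and new state out of the EventBridge envelope."""
    detail = event.get('detail', {})
    instance_id = detail.get('instance-id', 'unknown')
    state = detail.get('state', 'unknown')
    print(f"Instance {instance_id} entered state: {state}")
    return {'instance_id': instance_id, 'state': state}

# Sample event in the shape EventBridge delivers for EC2 state changes
sample = {
    'source': 'aws.ec2',
    'detail-type': 'EC2 Instance State-change Notification',
    'detail': {'instance-id': 'i-1234567890abcdef0', 'state': 'terminated'},
}
print(lambda_handler(sample, None))
```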
Scheduled Lambda Execution:
aws events put-rule \
--name "DailyCleanup" \
--schedule-expression "cron(0 2 * * ? *)"
aws events put-targets \
--rule "DailyCleanup" \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:CleanupOldData"
Use Cases
Automated Backup:
Trigger: Daily at 2 AM
Action: Lambda function creating RDS snapshots
Auto-Remediation:
Trigger: EC2 instance status check failure
Action: Lambda function restarting instance
Security Automation:
Trigger: IAM policy change
Action: SNS notification + Lambda logging change
Cost Optimization:
Trigger: Weekday 6 PM
Action: Lambda stopping non-production instances
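The cost-optimization pattern above can be sketched as a Lambda handler; the Environment tag values dev and staging are assumptions, so adapt the filters to your own tagging scheme:

```python
def nonprod_filters():
    """DescribeInstances filters selecting running instances whose
    Environment tag marks them non-production (tag values assumed)."""
    return [
        {'Name': 'instance-state-name', 'Values': ['running']},
        {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
    ]

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    ec2 = boto3.client('ec2')
    reservations = ec2.describe_instances(
        Filters=nonprod_filters())['Reservations']
    ids = [i['InstanceId'] for r in reservations for i in r['Instances']]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {'stopped': ids}
```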
Best Practices
Monitoring Strategy
Layered Monitoring:
Infrastructure Layer:
- CPU, memory, disk, network
- Instance health checks
- Resource saturation
Application Layer:
- Request rates, latencies
- Error rates, exceptions
- Business transactions
User Experience:
- Page load times
- API response times
- Availability from user perspective
Alarm Design
Avoid Alarm Fatigue:
- Set appropriate thresholds (not too sensitive)
- Use composite alarms for complex conditions
- Implement alarm suppression during maintenance
- Route alarms by severity (critical vs. warning)
Actionable Alarms:
- Each alarm should have clear response procedure
- Include context in alarm descriptions
- Link to runbooks or documentation
- Test alarm actions regularly
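One way to test alarm actions is SetAlarmState, which forces a state transition without waiting for a real metric breach; CloudWatch returns the alarm to its actual state on the next evaluation (alarm name below is illustrative):

```python
def set_alarm_state_params(alarm_name):
    """SetAlarmState parameters forcing an alarm into ALARM so its
    configured actions fire."""
    return {
        'AlarmName': alarm_name,
        'StateValue': 'ALARM',
        'StateReason': 'Manual test of alarm actions',
    }

def main():
    import boto3  # requires AWS credentials
    boto3.client('cloudwatch').set_alarm_state(
        **set_alarm_state_params('HighCPU-i-1234567890'))
# main()  # uncomment to run against your AWS account
```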
Cost Optimization
Free Tier Awareness:
- Basic monitoring metrics (5-minute) at no charge
- 10 custom metrics and 10 alarms
- 1 million API requests
- 5 GB log ingestion and 5 GB archived log storage
- 3 dashboards (up to 50 metrics each)
Cost Reduction Strategies:
- Use basic (5-minute) monitoring when sufficient
- Set appropriate log retention
- Delete unused log groups
- Use metric filters instead of ingesting all logs
- Archive old logs to S3
Log Management
Structured Logging:
{
  "timestamp": "2024-12-29T10:30:00Z",
  "level": "ERROR",
  "requestId": "abc123",
  "userId": "user456",
  "message": "Database connection failed",
  "error": {
    "type": "ConnectionError",
    "code": "ECONNREFUSED"
  }
}
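Structured logs like the one above are straightforward to emit from Python's standard logging module. A minimal sketch (field names mirror the example; extend as needed) of a formatter producing one JSON object per line:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line, roughly matching
    the structured-log shape shown above (field names illustrative)."""
    def format(self, record):
        entry = {
            'timestamp': self.formatTime(record, '%Y-%m-%dT%H:%M:%SZ'),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        if record.exc_info:
            entry['error'] = {'type': record.exc_info[0].__name__}
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger('myapp')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.error('Database connection failed')
```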
Log Levels:
- DEBUG: Detailed diagnostic information
- INFO: General informational messages
- WARN: Warning messages for degraded conditions
- ERROR: Error events allowing continued operation
- FATAL: Severe errors requiring immediate attention
Conclusion
AWS CloudWatch provides comprehensive monitoring and observability capabilities essential for reliable, performant AWS operations. From basic metric collection through sophisticated log analysis and automated remediation, CloudWatch enables teams to understand system behavior, detect issues quickly, and respond automatically to operational challenges.
Key Takeaways:
Metrics: Foundation of monitoring—collect from AWS services automatically and publish custom metrics for application-specific insights.
Alarms: Proactive notifications and automated responses when metrics breach thresholds—critical for operational reliability.
Logs: Comprehensive logging from applications, AWS services, and infrastructure—essential for troubleshooting an