Monitoring your customer-facing websites and critical API endpoints makes you aware of serious issues that need to be resolved urgently. Add uptime monitoring that checks these endpoints at regular intervals and raises an alert if an endpoint stops responding or returns an unexpected response.
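A minimal sketch of such a check in Python, using only the standard library. The function names, the 10-second timeout and the optional body-text check are illustrative assumptions, not part of any particular monitoring product. A scheduler would call check_endpoint at the chosen interval and forward any returned reason to an alerting channel.

```python
import urllib.error
import urllib.request


def classify_response(status, body, expected_status=200, expected_text=None):
    """Return None if the response looks healthy, otherwise a short reason."""
    if status != expected_status:
        return f"unexpected status {status} (wanted {expected_status})"
    if expected_text is not None and expected_text not in body:
        return f"response body is missing {expected_text!r}"
    return None


def check_endpoint(url, timeout=10.0, expected_status=200, expected_text=None):
    """Fetch url once and return None if healthy, otherwise a failure reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; treat it as a response with that status.
        status, body = exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout and similar.
        return f"no response: {exc}"
    return classify_response(status, body, expected_status, expected_text)
```

For example, check_endpoint("https://example.com/health", expected_text="ok") returns None while the (placeholder) endpoint answers 200 with "ok" in the body, and a reason string otherwise.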
Disks often fill up because of growing log files, and this is especially dangerous because disk usage is hard to keep track of.
You can monitor disk space usage using the guides for AWS, Google Cloud Platform or Microsoft Azure below.
Alternatively, you can create a script that checks disk space and raises an alert via a webhook if disk space utilisation goes above a certain threshold (usually 80-90%). A cron job should execute this script at regular intervals (at least once a day).
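One way such a script could look in Python, as a sketch under stated assumptions: the JSON payload shape, the function names and the webhook URL are hypothetical, so adapt them to whatever your alerting tool expects. The percentage is computed as used/total and may differ slightly from what df reports on filesystems with reserved blocks.

```python
import json
import shutil
import urllib.request


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def build_alert(path, percent, threshold):
    """Return a webhook payload if percent is at or above threshold, else None."""
    if percent < threshold:
        return None
    # Assumed payload shape; many chat webhooks accept a JSON "text" field.
    return {"text": f"Disk usage on {path} is {percent:.1f}% (threshold {threshold}%)"}


def send_webhook(url, payload, timeout=10.0):
    """POST payload as JSON to url and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


def check_disk(path, threshold, webhook_url):
    """Check one path and send an alert if needed; return True if alerted."""
    payload = build_alert(path, disk_usage_percent(path), threshold)
    if payload is None:
        return False
    send_webhook(webhook_url, payload)
    return True
```

A cron entry would then call something like check_disk("/", 85, "https://example.com/hooks/alerts"), where the URL is a placeholder for your own webhook.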
Keep track of utilisation and load across your infrastructure by monitoring CPU and memory usage.
High CPU utilisation can lead to programs slowing down or freezing altogether.
High memory utilisation can cause performance bottlenecks and leave your website and apps unable to handle more users.
You should also raise alerts when your network I/O usage spikes, whether because of user load or suspicious network activity.
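As a rough illustration, the sketch below reads CPU load and memory usage on a Linux host and reports which metrics crossed a threshold. It makes several assumptions: the 1-minute load average divided by the CPU count is used as a proxy for CPU utilisation (it is not the same measurement), memory usage is derived from the MemTotal and MemAvailable fields of /proc/meminfo, and the metric names and thresholds are made up for the example. Network I/O is not covered here.

```python
import os


def load_percent(load_1min, cpu_count):
    """Express the 1-minute load average as a percentage of available CPUs."""
    return load_1min / cpu_count * 100


def memory_used_percent(meminfo_text):
    """Compute used memory (%) from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total * 100


def breached(metrics, thresholds):
    """Return the sorted names of metrics at or above their threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if value >= thresholds.get(name, float("inf"))
    )


def collect_metrics():
    """Gather current metrics. Linux only: needs getloadavg and /proc/meminfo."""
    load_1min, _, _ = os.getloadavg()
    with open("/proc/meminfo", encoding="ascii") as handle:
        memory = memory_used_percent(handle.read())
    return {
        "cpu_load": load_percent(load_1min, os.cpu_count() or 1),
        "memory": memory,
    }
```

Calling breached(collect_metrics(), {"cpu_load": 90, "memory": 85}) returns the names of the metrics that should trigger an alert, or an empty list when everything is within limits.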
Keep track of important errors and exceptions in your web and mobile apps that affect your customers.
Configure error monitoring to raise alerts based on the importance and frequency of the errors.
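Alerting on both importance and frequency can be sketched as a per-severity counter over a sliding time window. The class below is a hypothetical illustration: the severity names, the limits and the 5-minute default window are assumptions, and timestamps are plain seconds supplied by the caller.

```python
from collections import defaultdict, deque


class ErrorRateAlerter:
    """Signal an alert when a severity reaches its limit within a time window."""

    def __init__(self, limits, window_seconds=300):
        # limits maps severity -> number of errors in the window that
        # should trigger an alert, e.g. {"critical": 1, "warning": 20}.
        self.limits = limits
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, severity, timestamp):
        """Record one error; return True if an alert should be raised."""
        events = self._events[severity]
        events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        limit = self.limits.get(severity)
        return limit is not None and len(events) >= limit
```

With limits such as {"critical": 1, "warning": 3}, a single critical error alerts immediately, while warnings alert only once three of them occur inside the window.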
Cron jobs form the backbone of your system and take care of important tasks such as database backups and user data management.
Cron job failures can often go unnoticed and cause havoc. Keep track of them and raise alerts when necessary.
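One common way to stop failures from going unnoticed is to run each job through a small wrapper that records the exit code and reports anything other than success. The sketch below is an assumption about how that could look, not a standard tool: run_job and main are hypothetical names, and the alert is simply printed to stderr where a real setup would call a webhook or pager.

```python
import subprocess
import sys
import time


def run_job(command, timeout=None):
    """Run command (a list of arguments) and return a report dictionary."""
    started = time.monotonic()
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        exit_code, stderr = result.returncode, result.stderr
    except subprocess.TimeoutExpired:
        exit_code, stderr = None, f"timed out after {timeout}s"
    except OSError as exc:
        # The command could not be started at all (missing binary, permissions).
        exit_code, stderr = None, f"could not start: {exc}"
    return {
        "command": command,
        "ok": exit_code == 0,
        "exit_code": exit_code,
        "duration_seconds": round(time.monotonic() - started, 3),
        "stderr_tail": stderr[-2000:],
    }


def main(argv):
    """Run the job given in argv; return 0 on success, 1 on failure."""
    if not argv:
        print("usage: cron_wrapper.py COMMAND [ARGS...]", file=sys.stderr)
        return 2
    report = run_job(argv)
    if not report["ok"]:
        # Placeholder alert: replace with a webhook call or similar.
        print(f"ALERT: cron job failed: {report}", file=sys.stderr)
        return 1
    return 0
```

In a crontab, the wrapper would sit in front of the real command, so that a failed backup script produces an alert instead of a silent non-zero exit.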
Poor application performance leads to a bad user experience and drives users away from your website and apps.
Monitor the performance of your apps and alert your engineering and ops teams when performance thresholds are crossed.
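A performance threshold is often expressed as a percentile of response times rather than an average, so that a slow minority of requests is not hidden. The sketch below assumes a nearest-rank 95th percentile over latency samples in milliseconds; the function names, the choice of percentile and the 500 ms threshold in the example are illustrative assumptions.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (pct in 0-100) of samples."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(samples_ms, threshold_ms, pct=95):
    """Return an alert message if the pct-th percentile exceeds threshold_ms."""
    value = percentile(samples_ms, pct)
    if value <= threshold_ms:
        return None
    return f"p{pct} latency {value:.0f} ms exceeds {threshold_ms} ms"
```

For instance, if 10 out of 100 requests take 2000 ms and the rest take 100 ms, latency_alert(samples, 500) reports a breach even though the average is well under 500 ms.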