There are different ways to monitor a system to ensure its performance, stability, reliability and resilience. Along with the "usual suspects", we also use end-to-end tests for this purpose.
The usual suspects
We are dealing with a web application, so the usual ways to monitor are
Those issues are generally easy to detect, measure and compare to baselines. They typically concern the non-user-facing part of the application, and on-call technicians are automatically alerted if something goes wrong.
These events are harder to define since it is not as clear when something is in a state of error. For instance, if fewer log-ins happen in the current hour than in the previous one, is it because the application's authentication is broken or is it just the usual decrease at night? The recipients of alerts are also harder to define, since it is not clear at first glance who is responsible for fixing the problem. The cause can be a network infrastructure error, failing hardware, a bug in the application code, a database outage, or even planned downtime.
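One way to reduce that ambiguity is to compare the current hour with the same hour on previous days instead of with the hour before it. The following is a minimal Python sketch of that idea, not a description of our actual alerting: the function name, the 50% threshold and the sample numbers are all illustrative assumptions.

```python
from statistics import median


def is_login_drop_suspicious(current_count, same_hour_history, min_ratio=0.5):
    """Return True if the current hour's log-ins fall well below the
    typical count for this hour of day.

    same_hour_history: log-in counts for the same hour on previous days.
    min_ratio: hypothetical threshold; below this fraction of the
               typical count, the drop is treated as suspicious.
    """
    if not same_hour_history:
        # Without comparative data no judgment is possible.
        return False
    typical = median(same_hour_history)
    return current_count < typical * min_ratio


# Assumed sample data: 3 a.m. usually sees about 40 log-ins.
night_history = [38, 42, 40, 41, 39]
print(is_login_drop_suspicious(35, night_history))  # False: ordinary night-time dip
print(is_login_drop_suspicious(5, night_history))   # True: far below the usual level
```

Note that this kind of check only works once enough history has been collected, and the threshold still has to be tuned by hand.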
Reusing end-to-end tests
There is a fourth kind of monitoring in place, which reuses the end-to-end tests that usually run on staging environments: synthetic monitoring.
Wikipedia describes it as
[...] a monitoring technique that is done by using an emulation or scripted recordings of transactions. Behavioral scripts (or paths) are created to simulate an action or path that a customer or end-user would take on a site, application or other software (or even hardware).
We use a test scenario that is part of our usual automated checks suite. It reflects a typical happy path for a customer and ensures that the vital parts of our application work. This test covers the complete system, including all services, data sources and other dependencies. Since it starts at the UI level and simulates user actions, it is a good indicator that real users can properly interact with the core functions.
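The structure of such a check can be sketched in a few lines of Python. This is only an illustration under assumptions: the step names and the placeholder functions are hypothetical, and a real check would drive a browser against the UI instead of calling empty functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    passed: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_happy_path(steps: List[Tuple[str, Callable[[], None]]]) -> CheckResult:
    """Run the steps in order and stop at the first one that raises."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any failure means the user is blocked
            return CheckResult(passed=False, failed_step=name, error=str(exc))
    return CheckResult(passed=True)


# Hypothetical placeholder steps standing in for real UI interactions.
def open_start_page() -> None:
    pass


def log_in() -> None:
    pass


def complete_core_action() -> None:
    # Simulated failure so the example shows what a failed run looks like.
    raise RuntimeError("confirmation page did not load")


result = run_happy_path([
    ("open start page", open_start_page),
    ("log in", log_in),
    ("complete core action", complete_core_action),
])
print(result.passed, result.failed_step)  # False complete core action
```

The check needs no baseline or threshold: either every step of the path succeeds, or the run fails and names the step where it stopped.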
These tests run continuously, day and night, and notify the QA team if something fails. This makes sense for us since this team has the most knowledge of the different issue types and can quickly delegate to other teams if necessary.
It is vital that issue reports can be reached from outside the company network, since failures can happen outside of office hours. Notifications also have to include enough information for the issue to be identified and acted upon.
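As a rough illustration of what "enough information" could mean, the following Python sketch assembles a notification payload. The field names and sample values are assumptions made for this example, and how the payload is delivered (e-mail, chat, paging service) is deliberately left out.

```python
import json
from datetime import datetime, timezone


def build_alert(environment, scenario, failed_step, error, occurred_at=None):
    """Collect what the recipient needs to identify and act on the failure."""
    occurred_at = occurred_at or datetime.now(timezone.utc)
    return {
        "summary": f"[{environment}] {scenario} failed at step '{failed_step}'",
        "environment": environment,
        "scenario": scenario,
        "failed_step": failed_step,
        "error": error,
        "occurred_at": occurred_at.isoformat(),
    }


# Hypothetical failure, with a fixed timestamp to keep the example reproducible.
alert = build_alert(
    environment="production",
    scenario="customer happy path",
    failed_step="log in",
    error="login form not found",
    occurred_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
)
# Serialized payload, ready to hand to a channel that is reachable off-site.
print(json.dumps(alert, indent=2))
```

Putting the environment and the failed step into the summary line means the recipient can judge the severity from the notification itself, without first logging in to an internal system.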
Since we started using all of these forms of monitoring together, we have found that synthetic monitoring is generally the fastest indicator of a failure, as it does not need any comparative data to notice one.
Instead, the criterion is very simple: if a user cannot use the application, it is a failure!