So, I have this friend. (No, really, it’s my friend, it’s not me, I set up my own Nagios server.) She’s a DBA with no responsibility for anything outside of a bunch of SQL Servers. Nagios wakes her up in the middle of the night if the web server goes down.
If you page people in the middle of the night over things that aren’t their responsibility, you’re just training them to ignore their pagers. I once worked with someone who was, according to legend, the only person at [name of company redacted] ever to successfully flush a pager. (And they didn’t even have Nagios at that time!)
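To make the idea concrete, here is a minimal sketch in Python of "only page the people who own the thing that broke." Nagios itself handles this through contact groups on host and service definitions; the `OWNERS` map, the team names, and `who_to_page` below are all made up for illustration and are not part of Nagios.

```python
# Hypothetical ownership map: which team is responsible for which kind of service.
OWNERS = {
    "sql-server": {"dba"},
    "web-server": {"webops"},
}


def who_to_page(service_type, on_call):
    """Return the on-call people whose team owns the failing service.

    on_call maps a person's name to their team, e.g. {"alice": "dba"}.
    A service type nobody owns pages nobody (you would want to fix the map).
    """
    owning_teams = OWNERS.get(service_type, set())
    return sorted(
        person for person, team in on_call.items() if team in owning_teams
    )


if __name__ == "__main__":
    on_call = {"alice": "dba", "bob": "webops"}
    # The web server is down: the DBA stays asleep.
    print(who_to_page("web-server", on_call))
```

With that map, a dead web server pages the web person and a dead SQL Server pages the DBA, which is the whole point.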
I feel the same way about people who receive daily “CRITICAL!!!” emails that their servers’ drives are 98% full. Nagios is supposed to tell you about things that are unusual. If your SQL Servers typically use 96% of their RAM (mine do), don’t turn off warnings and keep only the critical notifications, and don’t put up with daily emails saying that the servers are using too much RAM. Raise the thresholds to sane numbers that indicate an unusual condition. What do you think happens when, buried in the slew of daily “CRITICAL!!!” emails, there’s a disk that usually isn’t 100% full, or a service down, or a memory leak? Nobody notices. You don’t want your slew of “Situation Normal: All Frelled Up” emails; you want to know when something unusual is occurring.
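One way to arrive at “sane numbers” is to look at what the server normally does and set the thresholds a bit above that. The following is a rough sketch, not anything Nagios provides: `suggest_thresholds`, the sigma multipliers, and the 0.5-point minimum spread are all assumptions I picked for illustration.

```python
from statistics import mean, pstdev


def suggest_thresholds(samples, ceiling=100.0, warn_sigmas=3.0, crit_sigmas=5.0):
    """Suggest (warning, critical) percentages from historical usage samples.

    samples: past usage readings in percent, e.g. RAM or disk usage.
    The thresholds sit a few standard deviations above the normal level,
    capped at `ceiling`, so only unusual readings trigger a notification.
    """
    if not samples:
        raise ValueError("need at least one sample")
    baseline = mean(samples)
    # Assumed floor on the spread, so a perfectly flat history does not
    # put the thresholds right on top of the baseline.
    spread = max(pstdev(samples), 0.5)
    warning = min(baseline + warn_sigmas * spread, ceiling)
    critical = min(baseline + crit_sigmas * spread, ceiling)
    return round(warning, 1), round(critical, 1)


if __name__ == "__main__":
    # A SQL Server that always sits at 96% RAM.
    print(suggest_thresholds([96, 96, 96, 96]))
```

For the server that always sits at 96%, this suggests warning at 97.5% and critical at 98.5%, so the daily email goes away but a real change still gets through. Treat the output as a starting point and adjust by hand.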
If you’re like me, you resist this. “Dammit, my C: drive should be at least 20% free!” There comes a time when you have to accept that a number isn’t attainable and work from there.