Dell Open Manage Event ID Monitoring

Monitoring Event IDs with Dell Open Manage

All right, it’s time to set up some kickass event ID monitoring. You’ve installed Dell Open Manage / Server Administrator on all of your physical hosts, and want to make sure you are aware if anything breaks or is about to explode! You’ve been searching for hours trying to find the Event IDs that actually matter — you’re in the right place!

The alerting system you use is up to you: Kaseya, Labtech, Nagios, Pandora FMS, Zabbix, whatever. What matters is getting the correct Event IDs and a matching description filter. Alerting by Event ID alone may give you a flood of alerts, since there are only so many low ID numbers to go around and other sources log them too. The description filter narrows the match to events from OpenManage.

These logs are generated across BOTH Application and System Event Logs, so be sure you are capturing both categories of Event IDs.
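As a rough illustration of pairing an Event ID with a description filter across both logs, here is a minimal Python sketch. The three-rule subset, the `should_alert` name, and the case-insensitive matching are assumptions made for the example, not how any particular alerting tool behaves.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of rules: Event ID -> description filter.
RULES = {
    1004: "*Thermal shutdown*",
    2048: "*Device failed*",
    2094: "*Predictive*",
}

# OpenManage events land in both of these Windows logs.
WATCHED_LOGS = {"Application", "System"}


def should_alert(log_name, event_id, description):
    """Return True when the log, the Event ID, and the filter all match."""
    if log_name not in WATCHED_LOGS:
        return False
    pattern = RULES.get(event_id)
    if pattern is None:
        return False
    # Lower-case both sides for a case-insensitive match (an assumption;
    # your alerting tool may treat case differently).
    return fnmatchcase(description.lower(), pattern.lower())
```

An event with ID 1004 whose description has nothing to do with thermal shutdown is ignored, which is the whole point of the second filter.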

I’ve done the hard work of looking through all 500+ of Dell’s individually paged Event ID descriptions — visible here.

Below is a massive list of the ones that matter (or at least, the ones I think matter): any warnings, errors, or critical alerts related to hardware health. That covers storage (RAID disks, rebuilds, hot spares, SMART, controller battery, etc.), memory (bit errors, ECC failures, failed sticks, etc.), CPU (failed processors, temperature, etc.), and power supplies (redundancy, device failure, cord unplugged, etc.). You may want more logs to look at, but I tried to pick anything that could lead to degraded performance or failure.


Take my list to get you started. All event descriptions should have wildcards (*) so the description does not require an exact match; otherwise, one letter off and you don’t get an alert. Enjoy the code, and let me know if it helped you out! 🙂
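If your alerting tool expects regular expressions instead of `*` wildcards (check its docs; this is an assumption about your setup), the filters can be converted mechanically. The `wildcard_to_regex` helper below is a hypothetical name and only a sketch of one way to do it.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a '*'-style filter into a compiled, case-insensitive regex."""
    # Escape each literal piece, then put '.*' wherever a '*' was.
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces), re.IGNORECASE | re.DOTALL)


# With the wildcards, extra text around the phrase still matches.
loose = wildcard_to_regex("*rebuild failed*")
print(bool(loose.fullmatch("Physical disk Rebuild failed: disk 0:1")))

# Without them, the description has to be the exact phrase.
strict = wildcard_to_regex("rebuild failed")
print(bool(strict.fullmatch("Physical disk rebuild failed")))
```

The first check prints `True` and the second prints `False`, which is the "one letter off" problem in miniature.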

Dell Open Manage Event ID Cheat

Event ID		Description Filter
1004			*Thermal shutdown*	
1053			*Temperature sensor*	
1054			*Temperature sensor*	
1104			*Fan sensor*	
1153			*Voltage sensor*	
1154			*Voltage sensor*	
1203			*Current sensor*	
1204			*Current sensor*	
1305			*Redundancy*	
1306			*Redundancy*	
1353			*Power supply*	
1354			*Power supply*	
1403			*Memory*	
1404			*Memory*	
1405			*Memory*	
1501			*AC power*	
1503			*AC power*	
1504			*AC power*	
1505			*AC power*	
1552			*Log size*	
1554			*Log size*	
1555			*Log size*	
1604			*Processor*	
1703			*Battery*	
1704			*Battery*	
1705			*Battery*	
2048			*Device failed*	
2049			*disk removed*	
2051			*disk degraded*	
2056			*Virtual disk failed*	
2057			*degraded*	
2076			*Consistency failed*	
2081			*reconfiguration failed*	
2082			*rebuild failed*	
2083			*rebuild failed*	
2094			*Predictive*	
2100			*Temperature*	
2102			*Temperature exceeded*	
2106			*SMART*	
2107			*SMART*	
2108			*SMART*	
2109			*SMART*	
2110			*SMART*	
2112			*Enclosure was shut down*	
2122			*Redundancy degraded*	
2123			*Redundancy lost*	
2126			*sector reassign*	
2129			*BGI failed*	
2145			*Controller battery*	
2146			*Bad block*	
2146			*DR0*	
2147			*DR0*	
2147			*Bad block*	
2148			*Bad block*	
2149			*Bad block*	
2150			*Bad block*	
2169			*controller battery*	
2187			*ECC error*	
2201			*hot spare failed*	
2203			*hot spare failed*	
2272			*uncorrectable media*	
2273			*punctured*	
2289			*ECC error*	
2290			*ECC error*	
2310			*permanently degraded*	
2312			*power supply*	
2313			*power supply*	
2318			*battery*	
2319			*ECC error*	
2320			*ECC error*	
2321			*ECC error*	
2324			*AC power supply cable*	
2340			*uncorrectable errors*	
2342			*inconsistent parity*	
2346			*Error on PD*	
2347			*rebuild failed*	
2348			*rebuild failed*	
2349			*bad disk block*	
2350			*unrecoverable disk media*	
2367			*Rebuild is not possible*	
2384			*hot spare*	
2385			*hot spare*	
2387			*bad block medium*	
2396			*uncorrectable multiple medium*	
2397			*uncorrectable errors*	
2402			*Disk Power status*	
2405			*Command timeout*	
2416			*medium error*	
2417			*medium error*	
2434			*wear-out limit*	
2436			*read-only mode*	
2441			*critical temperature*	
2442			*degraded*	
2443			*Data loss*	
2900			*cache device*	
2901			*inaccessible*	
2911			*cached LUN*	
2930			*caching*	
1				*device*	
20				*Device*IO failed*	
4098			*returning error*
7				*bad block*
11				*controller error*	
52				*predicted that it will fail*
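If you would rather load the table above into a script than type each rule into a GUI, a parser along these lines could work. It is a sketch that assumes the rows stay tab-separated as they are in this post; `parse_rules` and `CHEAT_SHEET` are names made up for the example. Note that some IDs (2146 and 2147) have more than one filter, so each ID maps to a list.

```python
from collections import defaultdict

# A few rows copied from the table, tab-separated as in the post.
CHEAT_SHEET = """\
Event ID\t\tDescription Filter
2146\t\t\t*Bad block*
2146\t\t\t*DR0*
2147\t\t\t*DR0*
20\t\t\t\t*Device*IO failed*
"""


def parse_rules(text):
    """Map each Event ID to the list of description filters given for it."""
    rules = defaultdict(list)
    for line in text.splitlines():
        # Split on tabs and drop the empty fields left by repeated tabs.
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        event_id, pattern = int(fields[0]), fields[1]
        if pattern not in rules[event_id]:  # ignore repeated rows
            rules[event_id].append(pattern)
    return dict(rules)


rules = parse_rules(CHEAT_SHEET)
print(rules[2146])
```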

 
