Monday, March 26, 2007

IA: Logging and Error Messages

In an enterprise you have various groups of users that access a varying number of systems. These users include, but are not limited to:
  1. Admins
  2. Power Users
  3. Management (Execs.)
  4. Middle management
  5. Clerks
  6. Testers
  7. Developers
  8. External Users
  9. Monkeys (e.g., hackers, or others who shouldn't be accessing the system)
These groups do not all have the same level of access, which makes error reporting much more difficult. Some error messages contain data that is only pertinent to a certain group of users, while other users may only get a message telling them to contact their local system administrator.
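As a rough illustration of the point above, here is a minimal Python sketch of tailoring one error to different audiences. The group names, the `AppError` class, the `DETAILED_GROUPS` set, and the `message_for` function are all hypothetical, made up for this example; a real system would pull group membership from its own access control.

```python
from dataclasses import dataclass

# Hypothetical: the groups allowed to see technical detail.
DETAILED_GROUPS = {"admin", "developer", "tester"}


@dataclass
class AppError:
    code: str    # identifier for what caused the error
    detail: str  # technical description, not meant for every audience


def message_for(error: AppError, group: str) -> str:
    """Return the text a member of `group` should see for `error`."""
    if group in DETAILED_GROUPS:
        # Technical users get the identifier and the full detail.
        return f"[{error.code}] {error.detail}"
    # Everyone else gets a reference code and a pointer to help.
    return (
        f"Something went wrong (reference {error.code}). "
        "Please contact your local system administrator."
    )


if __name__ == "__main__":
    err = AppError("E1042", "Null value in orders.py line 118")
    print(message_for(err, "developer"))
    print(message_for(err, "clerk"))
```

Defaulting unknown groups to the generic message means an unexpected visitor (the "monkeys" in the list) never sees internal detail.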

In class we came up with a list of what a good error message should consist of:
  • Dialog box
  • Identifier that tells you what caused the error
  • Attention-directing icon
  • Possible actions to take
  • Log a copy of the message
  • Action buttons (undo, restore, etc.)
  • Auto report to the developer
  • Description, context (ability to reproduce the error, timestamp)
You don't necessarily need to log every error message, just the important ones (e.g., don't log every time someone submits a form with a null field in it). The message should tie back to a specific location in the actual program, such as a line number. Each message should also carry an access level that determines which groups are allowed to see it, and a timestamp to make troubleshooting easier.
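The requirements above (an identifier, a program location, an access level, a timestamp, and logging only the important errors) could be sketched in Python roughly like this. The `ErrorReport` fields, the `LOG_THRESHOLD` value, and the helper names are assumptions for illustration, not something we defined in class; only the standard `logging` module is real.

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical cutoff: anything below ERROR is not worth logging.
LOG_THRESHOLD = logging.ERROR


@dataclass
class ErrorReport:
    identifier: str      # tells you what caused the error
    description: str     # context needed to reproduce it
    location: str        # where in the program, e.g. "orders.py:118"
    severity: int        # a standard logging level
    audience: frozenset  # groups permitted to view this message
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def should_log(report: ErrorReport) -> bool:
    """Only important errors are logged, e.g. not every empty form field."""
    return report.severity >= LOG_THRESHOLD


def can_view(report: ErrorReport, group: str) -> bool:
    """Check whether a user group is allowed to see this message."""
    return group in report.audience


def record(report: ErrorReport, logger: logging.Logger) -> bool:
    """Log the report if it is important enough; return whether it was logged."""
    if not should_log(report):
        return False
    logger.log(
        report.severity,
        "%s %s at %s: %s",
        report.timestamp.isoformat(),
        report.identifier,
        report.location,
        report.description,
    )
    return True
```

A validation error such as an empty form field would be created with `logging.INFO` and skipped by `record`, while a failed database write would be created with `logging.ERROR` and written to the log.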

Personal Application
I've dealt a lot with error messages in the last few jobs I've worked. Currently I work with the university's monitoring group, making sure that all the systems on campus are being monitored correctly and that the correct notifications are sent to the appropriate people when something goes wrong. I guess you could say I'm part of the error message team at BYU.

Specifically, I write the scripts that monitor the thousands of servers on campus and report what is going on to the Operations Center. I haven't written all several thousand of them, but I've written a good share. Part of the process is also making sure that documentation exists, so that when an alert goes off for one of the checks I've written, the appropriate people are contacted and the correct actions are taken.

For instance, we were recently asked to write a check that will better monitor the state of the LDAP connections to the campus databases. My co-worker and I had to work with the DBAs to understand the process they wanted to monitor. We then wrote and tested the script, and confirmed the error messages with the DBAs. Next, we have to document how the check is to be handled, test it, and move it into production.
