Monday, January 7, 2008

Glitches...

We were finally able to resolve the intermittent search engine problems with a fully distributed solution. Technically, it's quite simple: one master search server, many slave servers, and the search load spread across the slaves.
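
For the curious, here is a minimal sketch of what "spreading the load" can look like. It assumes plain round-robin rotation, which is only one of several possible strategies and not necessarily the one we run in production. The class name and the host names are made up for illustration.

```python
import itertools


class SearchPool:
    """Hands out slave search servers in round-robin order (illustrative only)."""

    def __init__(self, slaves):
        if not slaves:
            raise ValueError("at least one slave server is required")
        self._slaves = list(slaves)
        # itertools.cycle repeats the list endlessly, one host at a time.
        self._rotation = itertools.cycle(self._slaves)

    def next_slave(self):
        """Return the slave that should handle the next search query."""
        return next(self._rotation)


# Hypothetical host names, used only for this example.
pool = SearchPool(["search1", "search2", "search3"])
picked = [pool.next_slave() for _ in range(4)]
```

Each query goes to the next slave in the list, so no single machine carries all the search traffic.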

With the search problem out of the way, we were blindsided by another problem this past Sunday afternoon, which caused more than 2 hours of downtime for the entire site. Usually we are able to resolve most problems within minutes, but due to the weekend timing and general bad luck, we were not able to rectify the situation any quicker.

The underlying problem was a single point of failure in our caching framework. Caching is the practice of storing frequently requested content so that it does not have to be regenerated on every request. Almost all pages on the site are cached for performance and practical reasons. A load-balanced, fail-over system has been put into place to get rid of this single point of failure once and for all.
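
As a rough illustration of the fail-over idea, the sketch below tries each cache node in turn and falls back to regenerating the page if none of them can be reached. Everything here is an assumption made for the example: the in-memory backend stands in for a real cache server, and the class and method names are hypothetical rather than part of our actual framework.

```python
class InMemoryBackend:
    """Stand-in for one cache server; set `up = False` to simulate an outage."""

    def __init__(self):
        self.store = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise ConnectionError("cache backend is down")
        return self.store.get(key)

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("cache backend is down")
        self.store[key] = value


class FailoverCache:
    """Tries each backend in order so that one dead node cannot take the site down."""

    def __init__(self, backends):
        self.backends = list(backends)

    def get_or_generate(self, key, generate):
        for backend in self.backends:
            try:
                value = backend.get(key)
            except ConnectionError:
                continue  # this node is unreachable, try the next one
            if value is None:
                # Cache miss: build the content once and store it.
                value = generate()
                try:
                    backend.set(key, value)
                except ConnectionError:
                    pass  # still serve the page even if it could not be stored
            return value
        # Every backend is down: regenerate rather than fail the request.
        return generate()


primary, secondary = InMemoryBackend(), InMemoryBackend()
cache = FailoverCache([primary, secondary])
renders = []


def render_home():
    renders.append(1)  # count how often the page is actually regenerated
    return "<html>home</html>"


first = cache.get_or_generate("/home", render_home)   # miss, rendered once
second = cache.get_or_generate("/home", render_home)  # served from primary
primary.up = False                                    # simulate the outage
third = cache.get_or_generate("/home", render_home)   # secondary takes over
```

The point of the design is that losing one cache node costs a few extra page renders instead of taking down the whole site.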

The second issue is making sure staff are alerted to critical problems. Before, we had email notification only, but since Sunday we have implemented the ever-popular text-message solution as well. If and when any part of our network is down, the appropriate staff will be both emailed and text-messaged. Once notified, our staff can remotely fix almost all software problems within minutes. However, the critical part is making sure we are alerted as soon as possible to potential problems.
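
To give a flavor of the alerting logic, here is a small sketch that sends the same alert through every channel and keeps going if one of them fails, so a broken mail server cannot stop the text message from going out. The channel functions are placeholders; a real setup would plug in an actual mail sender and SMS gateway, which are not shown here.

```python
def alert_staff(message, channels):
    """Send `message` through every channel; return the channels that failed.

    `channels` maps a channel name to a callable that takes the message.
    """
    failures = {}
    for name, send in channels.items():
        try:
            send(message)
        except Exception as exc:
            # Record the failure but keep notifying the remaining channels.
            failures[name] = exc
    return failures


# Placeholder channels, for illustration only.
delivered = []


def send_email(message):
    delivered.append(("email", message))


def send_text(message):
    delivered.append(("text", message))


def broken_channel(message):
    raise RuntimeError("gateway unreachable")


failed = alert_staff(
    "Cache server is not responding",
    {"email": send_email, "pager": broken_channel, "text": send_text},
)
```

Here `failed` contains only the "pager" entry, while the email and the text message are still delivered.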