Site outage (update)

Discussion in 'The Garage' started by CK5, Mar 29, 2004.

  1. CK5

    CK5 In my underwear Administrator Premium Member GMOTM Winner Author

    Joined:
    May 19, 1999
    Posts:
    21,623
    Likes Received:
    701
    Location:
    CO
    Official word from our host:

    Hello,

    We're very VERY sorry about this.. we had a network outage this morning between the hours of 2AM and 10AM PST (GMT -8). A failure by our primary and secondary routers caused sporadically slow or unresponsive connections to all sites and services.

    We apologize for it taking so long for the service to be restored.. we were on it almost immediately but it was a long painful process. Even now that the network is working again we're still looking into what caused the problem in the first place.

    Then, possibly due to the sudden flood of backed up network requests, once the network did come back up, immediately about 20 of our shared webhosting servers crashed in unison.. keeping some sites down a little bit longer as we rebooted them. At this point, everything should be okay again.

    We actually ordered a brand new router just last week and ironically it hadn't arrived yet. This new router should definitely mark the end of these sorts of network troubles. It's been a bad two weeks, we know.. :frown:

    Also, we're sorry about not being able to get an announcement out immediately.. we attempted to but had such a hard time reaching our own internal services we couldn't. Only once things were almost better were we able to get in and announce what was going on (a couple of hours ago).

    Also note, any emails sent during the outage should still get through, but may be delayed a bit and still coming in now.

    I apologize again for the downtime, I promise this sort of problem will NOT happen again! We really appreciate you sticking this out with us, and understand the level of inconvenience and damage non-working email and web can cause. Our goal is to put March 2004 behind us as quickly as possible.

    Sincerely,
    Josh
     
  2. denver75k5

    denver75k5 1/2 ton status

    Joined:
    May 15, 2001
    Posts:
    1,017
    Likes Received:
    0
    Location:
    Somewhere In The Woods
    Re: Site outage

    CRUD!, screw it lets go wheelin!!!!
    :thumb: :thumb: :thumb:
     
  3. thezentree

    thezentree 3/4 ton status

    Joined:
    Sep 19, 2003
    Posts:
    7,198
    Likes Received:
    0
    Location:
    NC
    Re: Site outage

    of course the site goes down on a sick day...glad to know it wasn't my crappy computer :whistling:
     
  4. jwduke

    jwduke 1/2 ton status

    Joined:
    Aug 25, 2001
    Posts:
    721
    Likes Received:
    0
    Location:
    Ice Cream Capital of the World, IA.
    Re: Site outage

    Thanks for telling us, so after you get thru this headache, the first round is on me Steve!
     
  5. skratch

    skratch 1/2 ton status

    Joined:
    Oct 28, 2003
    Posts:
    1,646
    Likes Received:
    0
    Location:
    Gorveport, OH
    Re: Site outage

    I take it this means they aren't using Cisco routers!

    There really is no excuse for losing two primary routers in the first place, so this is kinda funny. Irritating but funny.

    Thanks for the update Steve.
     
  6. 84gmcjimmy

    84gmcjimmy 1 ton status

    Joined:
    Dec 3, 2003
    Posts:
    12,838
    Likes Received:
    0
    Location:
    B.C. CANADA
    Re: Site outage

    Thank goodness I read this. I tried to get on earlier and it wouldn't let me, so I thought my computer was messed up. My other computer crashed the other day and needs a new hard drive or something. Anything can happen :mad:
    p.s. thanks for the update Steve!
     
  7. Z3PR

    Z3PR Banned

    Joined:
    Mar 30, 2002
    Posts:
    19,217
    Likes Received:
    0
    Location:
    Everywhere
    Re: Site outage

    I was suffering from CK5 withdrawals earlier today. Not fun, wouldn't want to have to go through that again. :bow:
     
  8. big83chevy4x4

    big83chevy4x4 3/4 ton status

    Joined:
    Mar 22, 2002
    Posts:
    6,587
    Likes Received:
    0
    Location:
    Sheridan, Michigan
    Re: Site outage

    yea i actually had to work on the garage instead of being on here :doah: :whistling:
     
  9. az-k5

    az-k5 1/2 ton status

    Joined:
    Oct 15, 2003
    Posts:
    2,774
    Likes Received:
    0
    Location:
    Phoenix AZ
    Re: Site outage

    Ya I made it to school and work on time, and in the same day, cause I wasn't able to get on. Man I was happy when I got home tonight to see the forums again. :grin:
     
  10. CK5

    CK5 In my underwear Administrator Premium Member GMOTM Winner Author

    Joined:
    May 19, 1999
    Posts:
    21,623
    Likes Received:
    701
    Location:
    CO
    Re: Site outage

    The latest:

    Hello,

    The severe network outage that occurred throughout the day on Monday, March 29, 2004 was due to a failure of several parts of our system as a whole, both technical and procedural. We have already begun taking steps to improve our service and strengthen our network to better handle any future outages.

    We now know the network outage was the result of a malicious Denial of Service attack aimed at a website hosted on our servers. According to our network graphs, the first wave of the attack started at around 2:30am PST (GMT -8). It then resumed at around 5am PST, and continued in earnest from there with only a few hours of network availability until it was resolved completely at around 5pm PST. Due to the nature of the problem itself and the complexity of our redundant multiple router setup, it was not immediately apparent that we were under attack. Ordinarily you can immediately tell because the router itself is overloaded. In this case however, it was so overwhelmed that it completely malfunctioned and the ordinary diagnostic tools did not paint a full picture of the situation. We were also unable to access our own network graphs due to the scale of the attack so it was assumed a hardware failure of the router was responsible. As a result, many misdirected solutions were attempted and failed while the mental pressure on our network engineers continued to mount.

    After several solutions failed, the scope was broadened and we finally realized the true nature of the situation. To resolve the Denial of Service attack, we worked with our multiple redundant upstream network providers to divert and eliminate the malicious traffic before it reached our network. Some network providers are more responsive than others so this didn't happen on all of them simultaneously, but the largest chunk of the attack was diverted very quickly.

    Once network access was restored to our system, several of our servers crashed under the sudden load or other problems stemming from the previous lack of network access. We quickly dealt with those server problems and got all sites back up within an hour or so. That explains why some of your sites were still down while our own sites were up and available. There were also a few outstanding network issues that weren't fully dealt with until later in the evening and some of you were probably affected by that.

    Several announcements were sent out during the course of the day, but they were not immediately delivered due to the network problem. We are dealing with that Catch-22 situation in a way that I will detail below as I outline our steps to prevent this problem in the future.

    We were slower to handle the problem than we should have been for a couple of reasons. Our weekend night staff was not properly prepared to handle the situation. We are now in the course of preparing a procedural guide for this type of network attack and will be thoroughly training all of our night staff. Also, our own network monitoring system was not properly configured to notify us of this particular type of situation and our response time was slower as a result. We already had one type of monitor in development and it did catch the problem early on, but since it is still in development it was not configured to send out proper notifications. Needless to say, development on that monitor will be accelerated!
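
    [Editor's illustration, not part of the host's message.] The host does not say how its monitor works. As a rough sketch of the gap described above (a check that detects a traffic spike but is never wired to a notifier), here is a minimal example. Every name in it (TrafficSpikeMonitor, notify, window, spike_factor) and the sample rates are hypothetical, and a real network monitor would be considerably more involved.

```python
from collections import deque
from typing import Callable, Deque, List


class TrafficSpikeMonitor:
    """Flags samples that exceed a multiple of the recent average rate."""

    def __init__(self, notify: Callable[[str], None],
                 window: int = 5, spike_factor: float = 4.0) -> None:
        self.notify = notify              # hypothetical hook: pager, email, etc.
        self.window = window              # number of samples in the baseline
        self.spike_factor = spike_factor  # how far above baseline counts as a spike
        self.history: Deque[float] = deque(maxlen=window)

    def observe(self, packets_per_sec: float) -> bool:
        """Record one sample; return True if a notification was sent."""
        if len(self.history) == self.window:
            baseline = sum(self.history) / self.window
            if packets_per_sec > self.spike_factor * baseline:
                # Detection alone is not enough: the alert has to go somewhere.
                self.notify(f"possible attack: {packets_per_sec:.0f} pps "
                            f"vs baseline {baseline:.0f} pps")
                return True  # spike samples are kept out of the baseline
        self.history.append(packets_per_sec)
        return False


# Made-up sample rates: five normal readings, then a sudden flood.
alerts: List[str] = []
monitor = TrafficSpikeMonitor(notify=alerts.append)
for rate in [100, 110, 90, 105, 95, 5000]:
    monitor.observe(rate)
```

    Here the notifier just appends to a list. The point of the sketch is only that the detection path and the notification path are separate pieces, and both must be in place for an early catch to shorten the response time.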

    We are also accelerating our plans to deploy new, more powerful routers. Our existing routers were already nearing the end of their service to us and we will now speed them off to an early retirement. The new routers will be able to handle several times more traffic than our existing equipment and will not buckle under the load so easily, if at all. We will not stand by and allow malicious attackers to take advantage of us so easily again. We expect the new routers to be installed and running in the next couple of weeks. We obviously must avoid as much downtime as possible during the transition so it will be handled delicately.

    Many of you mentioned that the lack of information from us compounded the problem severely and we have also already begun taking steps to alleviate that issue in the event that a similar situation should occur in the future. We are setting up an off-network status and updates system. We will publish more detailed information about this in the future. It will consist of a server hosted on a secondary network that will be accessible even when our primary network is not for some reason. The status system will be updated with the latest news during any emergency situation. We will also set it up to allow us to notify all of our customers at secondary email accounts so we can keep you in the loop with all available information. We hope to never have to use this emergency system of course, but we will perform periodic tests of it to ensure it is always functioning properly.

    That covers the network outage from yesterday, but several of you also had questions about our previous central database problem. Our central database is the core of our entire automation system and is what provides you with the extensive self-service control over your accounts you have no doubt come to appreciate. Our central database is vital to our service and it is set up in a fully data redundant manner. Unfortunately, it was not able to handle software failures and we experienced some downtime as a result. We have already taken some initial steps to reduce the potential for problems in the short-term and are in the process of developing our next generation fully-clustered central database. Once that's installed and running, overall performance and stability of our central database will be much improved. Note that our central database is separate from our customer database machines. Those machines were fully up and running throughout the central database problems.

    As before, we will continue to keep close watch on all of our hosting servers and work quickly to eliminate individual problems as they arise. Note that the vast majority of these problems are caused by individual users so anything you can do to reduce the server impact of your own web applications would benefit all of your fellow users.

    Please let us know if you have any more questions about this situation. We are currently a little behind on our support queue, but we are working extra hard to clear it out as quickly as possible. We will get to your questions eventually so bear with us. Thank you for taking the time to read this message and have a good day!
     
  11. shane74

    shane74 1/2 ton status

    Joined:
    Feb 12, 2002
    Posts:
    4,100
    Likes Received:
    0
    Location:
    Vancouver, WA
    Well..chit happens man...Glad it's back up and running now! :waytogo:
     
  12. big83chevy4x4

    big83chevy4x4 3/4 ton status

    Joined:
    Mar 22, 2002
    Posts:
    6,587
    Likes Received:
    0
    Location:
    Sheridan, Michigan
    Re: Site outage

    it was pavementsucks.com, they had to take our site down so they can get ahead in the votes

    :histerical: :histerical:

    some people :shame: :shame:
     
  13. 84gmcjimmy

    84gmcjimmy 1 ton status

    Joined:
    Dec 3, 2003
    Posts:
    12,838
    Likes Received:
    0
    Location:
    B.C. CANADA
    Re: Site outage

    [ QUOTE ]
    it was pavementsucks.com, they had to take our site down so they can get ahead in the votes

    [/ QUOTE ]

    :rotfl: :rotfl: :rotfl: :rotfl: :rotfl: :rotfl: :rotfl:
     
  14. thezentree

    thezentree 3/4 ton status

    Joined:
    Sep 19, 2003
    Posts:
    7,198
    Likes Received:
    0
    Location:
    NC
    Re: Site outage

    so when do we get torches and pitchforks and go lynch em? :angryfire: :hack: :thumb:
     
