CK5 Network

Site outage (update)

CK5

In my underwear
Administrator
Premium Member
GMOTM Winner
Author
Joined
May 19, 1999
Posts
23,586
Reaction score
4,560
Location
CO
Official word from our host:

Hello,

We're very VERY sorry about this.. we had a network outage this morning between the hours of 2AM and 10AM PST (GMT -8). A failure by our primary and secondary routers caused sporadically slow or unresponsive connections to all sites and services.

We apologize for it taking so long for the service to be restored.. we were on it almost immediately but it was a long painful process. Even now that the network is working again we're still looking into what caused the problem in the first place.

Then, possibly due to the sudden flood of backed up network requests, once the network did come back up, immediately about 20 of our shared webhosting servers crashed in unison.. keeping some sites down a little bit longer as we rebooted them. At this point, everything should be okay again.

We actually ordered a brand new router just last week and ironically it hadn't arrived yet. This new router should definitely mark the end of these sorts of network troubles. It's been a bad two weeks, we know.. /forums/images/graemlins/frown.gif

Also, we're sorry about not being able to get an announcement out immediately.. we attempted to but had such a hard time reaching our own internal services we couldn't. Only once things were almost better were we able to get in and announce what was going on (a couple of hours ago).

Also note, any emails sent during the outage should still get through, but may be delayed a bit and still coming in now.

I apologize again for the downtime, I promise this sort of problem will NOT happen again! We really appreciate you sticking this out with us, and understand the level of inconvenience and damage non-working email and web can cause. Our goal is to put March 2004 behind us as quickly as possible.

Sincerely,
Josh
 

denver75k5

1/2 ton status
Joined
May 15, 2001
Posts
1,019
Reaction score
7
Location
Somewhere In The Woods
Re: Site outage

CRUD! Screw it, let's go wheelin!!!!
/forums/images/graemlins/thumb.gif /forums/images/graemlins/thumb.gif /forums/images/graemlins/thumb.gif
 

thezentree

3/4 ton status
Joined
Sep 19, 2003
Posts
7,198
Reaction score
0
Location
NC
Re: Site outage

of course the site goes down on a sick day...glad to know it wasn't my crappy computer /forums/images/graemlins/whistling.gif
 

skratch

1/2 ton status
Joined
Oct 28, 2003
Posts
1,646
Reaction score
1
Location
Gorveport, OH
Re: Site outage

I take it this means they aren't using Cisco routers!

There really is no excuse for losing two primary routers in the first place, so this is kinda funny. Irritating but funny.

Thanks for the update Steve.
 

84gmcjimmy

1 ton status
Joined
Dec 3, 2003
Posts
12,837
Reaction score
0
Location
B.C. CANADA
Re: Site outage

Thank goodness I read this. I went to go on earlier and it wouldn't let me on, so I thought my computer was messed. My other computer crashed the other day and needs a new hard drive or something. Anything can happen /forums/images/graemlins/mad.gif
p.s. thanks for the update Steve!
 

Z3PR

Banned
Joined
Mar 30, 2002
Posts
19,216
Reaction score
0
Location
Everywhere
Re: Site outage

I was suffering from CK5 withdrawals earlier today. Not fun, wouldn't want to have to go through that again. /forums/images/graemlins/bow.gif
 

big83chevy4x4

3/4 ton status
Joined
Mar 22, 2002
Posts
6,587
Reaction score
0
Location
Sheridan, Michigan
Re: Site outage

yea i actually had to work on the garage instead of being on here /forums/images/graemlins/doah.gif /forums/images/graemlins/whistling.gif
 

az-k5

1/2 ton status
Joined
Oct 15, 2003
Posts
2,774
Reaction score
0
Location
Phoenix AZ
Re: Site outage

Ya I made it to school and work on time, and in the same day, cause I wasn't able to get on. Man I was happy when I got home tonight to see the forums again. /forums/images/graemlins/grin.gif
 

CK5

In my underwear
Administrator
Premium Member
GMOTM Winner
Author
Joined
May 19, 1999
Posts
23,586
Reaction score
4,560
Location
CO
Re: Site outage

The latest:

Hello,

The severe network outage that occurred throughout the day on Monday, March 29, 2004 was due to a failure of several parts of our system as a whole, both technical and procedural. We have already begun taking steps to improve our service and strengthen our network to better handle any future outages.

We now know the network outage was the result of a malicious Denial of Service attack aimed at a website hosted on our servers. According to our network graphs, the first wave of the attack started at around 2:30am PST (GMT -8). It then resumed at around 5am PST, and continued in earnest from there with only a few hours of network availability until it was resolved completely at around 5pm PST. Due to the nature of the problem itself and the complexity of our redundant multiple router setup, it was not immediately apparent that we were under attack. Ordinarily you can immediately tell because the router itself is overloaded. In this case however, it was so overwhelmed that it completely malfunctioned and the ordinary diagnostic tools did not paint a full picture of the situation. We were also unable to access our own network graphs due to the scale of the attack so it was assumed a hardware failure of the router was responsible. As a result, many misdirected solutions were attempted and failed while the mental pressure on our network engineers continued to mount.

After several solutions failed, the scope was broadened and we finally realized the true nature of the situation. To resolve the Denial of Service attack, we worked with our multiple redundant upstream network providers to divert and eliminate the malicious traffic before it reached our network. Some network providers are more responsive than others so this didn't happen on all of them simultaneously, but the largest chunk of the attack was diverted very quickly.

Once network access was restored to our system, several of our servers crashed under the sudden load or other problems stemming from the previous lack of network access. We quickly dealt with those server problems and got all sites back up within an hour or so. That explains why some of your sites were still down while our own sites were up and available. There were also a few outstanding network issues that weren't fully dealt with until later in the evening and some of you were probably affected by that.

Several announcements were sent out during the course of the day, but they were not immediately delivered due to the network problem. We are dealing with that Catch-22 situation in a way that I will detail below as I outline our steps to prevent this problem in the future.

We were slower to handle the problem than we should have been for a couple of reasons. Our weekend night staff was not properly prepared to handle the situation. We are now in the course of preparing a procedural guide for this type of network attack and will be thoroughly training all of our night staff. Also, our own network monitoring system was not properly configured to notify us of this particular type of situation and our response time was slower as a result. We already had one type of monitor in development and it did catch the problem early on, but since it is still in development it was not configured to send out proper notifications. Needless to say, development on that monitor will be accelerated!

We are also accelerating our plans to deploy new, more powerful routers. Our existing routers were already nearing the end of their service to us and we will now speed them off to an early retirement. The new routers will be able to handle several times more traffic than our existing equipment and will not buckle under the load so easily, if at all. We will not stand by and allow malicious attackers to take advantage of us so easily again. We expect the new routers to be installed and running in the next couple of weeks. We obviously must avoid as much downtime as possible during the transition so it will be handled delicately.

Many of you mentioned that the lack of information from us compounded the problem severely and we have also already begun taking steps to alleviate that issue in the event that a similar situation should occur in the future. We are setting up an off-network status and updates system. We will publish more detailed information about this in the future. It will consist of a server hosted on a secondary network that will be accessible even when our primary network is not for some reason. The status system will be updated with the latest news during any emergency situation. We will also set it up to allow us to notify all of our customers at secondary email accounts so we can keep you in the loop with all available information. We hope to never have to use this emergency system of course, but we will perform periodic tests of it to ensure it is always functioning properly.

That covers the network outage from yesterday, but several of you also had questions about our previous central database problem. Our central database is the core of our entire automation system and is what provides you with the extensive self-service control over your accounts you have no doubt come to appreciate. Our central database is vital to our service and it is set up in a fully data redundant manner. Unfortunately, it was not able to handle software failures and we experienced some downtime as a result. We have already taken some initial steps to reduce the potential for problems in the short-term and are in the process of developing our next generation fully-clustered central database. Once that's installed and running, overall performance and stability of our central database will be much improved. Note that our central database is separate from our customer database machines. Those machines were fully up and running throughout the central database problems.

As before, we will continue to keep close watch on all of our hosting servers and work quickly to eliminate individual problems as they arise. Note that the vast majority of these problems are caused by individual users so anything you can do to reduce the server impact of your own web applications would benefit all of your fellow users.

Please let us know if you have any more questions about this situation. We are currently a little behind on our support queue, but we are working extra hard to clear it out as quickly as possible. We will get to your questions eventually so bear with us. Thank you for taking the time to read this message and have a good day!
 

shane74

1/2 ton status
Joined
Feb 12, 2002
Posts
4,100
Reaction score
0
Location
Vancouver, WA
Well..chit happens man...Glad it's back up and running now! /forums/images/graemlins/waytogo.gif
 

big83chevy4x4

3/4 ton status
Joined
Mar 22, 2002
Posts
6,587
Reaction score
0
Location
Sheridan, Michigan
Re: Site outage

it was pavementsucks.com, they had to take our site down so they can get ahead in the votes

/forums/images/graemlins/histerical.gif /forums/images/graemlins/histerical.gif

some people /forums/images/graemlins/shame.gif /forums/images/graemlins/shame.gif
 

84gmcjimmy

1 ton status
Joined
Dec 3, 2003
Posts
12,837
Reaction score
0
Location
B.C. CANADA
Re: Site outage

[ QUOTE ]
it was pavementsucks.com, they had to take our site down so they can get ahead in the votes

[/ QUOTE ]

/forums/images/graemlins/rotfl.gif /forums/images/graemlins/rotfl.gif /forums/images/graemlins/rotfl.gif /forums/images/graemlins/rotfl.gif /forums/images/graemlins/rotfl.gif /forums/images/graemlins/rotfl.gif /forums/images/graemlins/rotfl.gif
 

thezentree

3/4 ton status
Joined
Sep 19, 2003
Posts
7,198
Reaction score
0
Location
NC
Re: Site outage

so when do we get torches and pitchforks and go lynch em? /forums/images/graemlins/angryfire.gif /forums/images/graemlins/hack.gif /forums/images/graemlins/thumb.gif
 