Recovering from prolonged outage
tny.im and this website were down for about 42 hours, starting on June 29 at 03:16 UTC.
The problem? BlueVM’s S19-NY server went down, taking with it the server I have/had there (and which I paid for a full year!). Other than this outage, the service had worked fine for three months, – fast network, full resources availability – since I bought it.
S19-NY is still down, without any ETA for it to come back. There’s no information in what conditions it will come back, or *if* it will come back (with the previous contents, at least). BlueVM staff is pretty much unresponsive, other than a guy who sometimes hangs on IRC and says he can’t fix the KVM instance because he doesn’t have access to it.
Of course, I no longer recommend BlueVM and I don’t plan on renewing the server I have with them.
The “solution” to put an end to two days of downtime, was to buy the cheapest SecureDragon OpenVZ server, (OpenVZ! so hard to live without my beloved KVM, I can’t even use davfs2 because there’s no fuse module!) and restore the backups I had (from four hours before the BlueVM instance went down). This has been done, except for the HTTPS certificate of tny.im… that alone is another story:
As I tried to retrieve the existing cert from StartSSL (because, stupid me, automated backups were not copying everything SSL-related, and I didn’t save it locally), I found my authentication certificate had expired. This basically means my StartSSL account is lost, unless I create a new account and ask their staff to link it to the old one. They probably won’t do that without a payment and some ID checks so… out of question. I guess, if the BlueVM server doesn’t come back, that I’ll just create a new StartSSL account and generate a new cert for tny.im. There’s no security issue with this, as the previous certificate has not been compromised (unless BlueVM is collecting certificate private keys from inside their clients’ machines…) and so it doesn’t need to be revoked.
To conclude the HTTPS point: tny.im has the HTTPS service unavailable for now, until I can retrieve the existing certificate from the previous server, or until I get a new one.
Is all the fault in BlueVM’s side? Of course not… I could have lost my love for the money earlier, bought the SecureDragon VPS yesterday already and reduce the downtime by 24 hours. But since I had hope the problems on BlueVM support were just Sunday-related, I thought that by Monday they would have it fixed. They didn’t.
On related news, I’ll take this downtime and new server acquisition as the motivation for setting up a new advanced and redundant system, so that if one server goes down, tny.im (and possibly this blog too) will continue to operate as normal. I have two servers already (assuming the BlueVM one comes up), and I plan on developing a system where firing up new instances of tny.im on any empty server will be really easy. The system will be always prepared to lose any server at any point, without data loss, and restore full service within five minutes. That way, I can add less reliable hosts, perhaps even VPS trials, to the redundant system. This also allows for scaling the service as needed. Sounds ambitious? Of course it won’t happen in a week, but I have the full summer to develop and test the system…
Why don’t I just go with some SaaS that supports scaling? Two reasons: the price is too high, and the tny.im software is not coded in a way that’s compatible with these services. Let me remember you, that while not exactly being a CGI script, tny.im is not coded in one of those fancy modern languages, and even though PHP is not exactly outdated or unmaintained, the quality of the code can make it pretty horrible or pretty good. And the code is… not perfect – it doesn’t use any popular framework, it is based on YOURLS and has many, many hacks feature additions, plus a… very close relationship with the database.
Let me finish by saying that downtime of this kind is something to be expected if I were still hosted by FreeVPS and the like. But believe it or not, on FreeVPS and other sponsorships I’ve never seen a customer service as bad as the one of BlueVM (and it’s hard to remember an outage as big as this one). It is definitely not adequate for a paid service. In addition to S19-NY, they have many other instances down, with similar or worse downtime. The admins don’t appear to be online or reachable in any way, even by other staff members. The latest news/excuse on the S19-NY situation is that IPMI is broken and they are waiting for the provider to fix it… now tell me, does this look like a serious company, or some poor-man’s sponsorship?
EDIT: The BlueVM server is still down. tny.im is now hosted by three servers with round-robin balancing. HTTPS service was restored with a new certificate.
EDIT 7/7/2014: I forgot to update this post in time, but the BlueVM server has been up since three days ago. But I only got to know that the service was restored thanks to a friend of mine, because they didn’t reply to my ticket to inform me about it. Anyway, I don’t plan to renew this server, and BlueVM lost me as a customer (except for some really cheap deal which I’ll use as a personal/development box, and never in production).
On related news, Mirasm – the Tiny Server Redundancy Manager – is mostly finished, only needs some more testing to be put on production servers, managing the new tny.im redundancy system.