Access to our core SQL server for our site CMS has apparently been down for THREE MONTHS, ever since the service done on SQL2K1201 back in March:

==============================
03-10-2013, 01:38 PM
tasha
DiscountASP.NET Staff
>>>sql2k1201 issues
We are currently experiencing disk issues with the "sql2k1201" MS SQL server. Our system administrators are working on the problem and will solve it as soon as possible. We will provide updates as they become available.
=================================
03-10-2013, 02:06 PM
tasha
DiscountASP.NET Staff
The server should be operating normally now.
==================================

However, the SQL server is NOT allowing us to log in, we cannot perform backups and we cannot access our database! This has caused our entire site to revert to a FRESH INSTALL page. We are not getting timely responses to our tickets (i.e. there is no "severity" setting to cause faster responses). An outage and "resolution" that leaves a customer in this state without ever being caught is unacceptable, and we need to resolve this IMMEDIATELY. Please advise!
More Information: Apparently our database is missing from the server under the name we knew it by. Also, if we clear the DB name and try to log in, there is no "default" DB either, so again, we can't log in to see what, if anything, is on the server.
The best way to resolve this type of issue is to contact our support department. I see that you have already contacted us, and we'll go ahead and address the problem on the ticket you opened.
Martino, It has been over an hour since you responded to my ticket, stating that a "Higher level of support" was being engaged. I have not been contacted, nor is the problem resolved. My last reply to your statement above was "What is the SLA on a complete outage of a customer? What can I expect?" I also have heard nothing on that question. Please let me know what is going on.
Since I have heard nothing from anyone in hours, I am left to my own devices... It appears that my DB is stuck in a "RECOVERY PENDING" state - but isn't recovering. Please see the attached image and get back to me ASAP! This is ridiculous.
It is unfortunate and unusual, but I don't know if I would characterize it as ridiculous. I read your helpdesk tickets and I think Martin explained the situation for what it is: a SQL database set to Recovery mode (the problem in your case) does not trigger any of the notifications we get from our system. Your database was the only one affected in that particular way after the SQL server was restarted, and unfortunately we don't monitor every aspect or state of every database - it isn't possible to do so. So if the server is working properly, and your database isn't causing a problem or throwing an error that we do monitor, we rely on you to alert us to a problem. Normally that would happen soon after the server restart and we would get everything ironed out. But apparently you didn't notice that your database was down until now, so that's why it's been unavailable for three months.

We run hundreds of servers with dozens of different O/Ses and versions, and we have tens of thousands of individual users on those servers. System health is measured in a lot of ways, but there is an almost infinite number of things that can't be monitored but can still go very wrong. That's what happened here. I'm sorry it happened to you, and I would like to be able to say I guarantee that nothing like that will ever happen again, but I can't, because that wouldn't be true. Not for us or any other host. But I really am sorry, it was a messed up situation and it certainly isn't something we like to see. We don't want anyone to have anything less than great service.
Michael, thanks for replying. This is a serious issue for us and yes, we dropped the ball on our end too. However, I disagree with your statement that it isn't possible to monitor the state of every database. Monitoring at the level I'm referring to actually is possible, and my clients do it all the time.

Before I get into that, however, I want to state my opinion from a monitoring standpoint: you should monitor at the level of the service you are providing. In other words, if you sold SQL servers as a service, then I would accept the level of monitoring you outline above. But you sell SQL databases as a service, so you should monitor at the database level. Just one example (there are no doubt others) of how you could do it is the "2012 SQL server Management Pack", which has the following ability:

======================================================
Management Pack: SQL Server MP
Version: 6.3.173.0 for SQL Server 2008
Released: 4/2/2012
Publisher: Microsoft

Database Status
Monitor ID: Microsoft.SQLServer.2008.Database.DBStatusMonitor
Description: This monitor checks the status of the database as reported by Microsoft® SQL Server™.
Target: SQL Server 2008 DB
Enabled: Yes

Operational States
Name: State
Database Available: Success
Database Unavailable: Error
Database Recovering/Restoring: Warning <--This would have alerted you
======================================================

Using the information from the quote above, a simple query of the MASTER DB would emulate this level of alerting for whatever system you use. There is a nicely formatted table in a Wiki available here that outlines its use more clearly: http://mpwiki.viacode.com/default.aspx?g=posts&t=100300

Also, to answer the obvious question about how to avoid false alarms when users alter their DB in a way that would cause an alert (e.g. take it offline, remove it, etc.): that merely requires a "Maintenance mode" switch that users must engage before making changes.
DASP simply must enforce its proper use, otherwise that user will lose the ability to be alerted by DASP, etc...

In any case, I understand what you are up against by running a large and nearly fully automated organization like you do. I am very impressed with the level of ability the users have been given by the control panel you have made. I think that since DASP continues to grow, it's likely time to consider leveraging the vast amount of information you have about customers to create better alerts and email notifications about the impact of outages and maintenance events, both planned and unplanned. In my mind, the outage of SQL2K1201 should have played out like this:

1) The outage causes an alert in the monitoring system.
2) The system then executes a query for all the users affected by the problem.
3) The system then prepares a message to them, and presents that message to a human who can then craft an impact statement to the users.
4) Then the human sends it (via email and/or SMS - which is also an email, actually).
5) When resolved, the system drafts another message and allows a person to send it with any additional info desired.

It's clear you have skilled folks there, otherwise this hosting platform wouldn't be as successful as it is. However, I encourage you to consider what I have suggested so that issues like this are avoided. I know I have been wordy here, but the overriding concept I hope to convey is that you should monitor (and alert) at the level of service you provide, not one level higher that avoids monitoring the product sold altogether. Beyond that, it's relatively easy to assemble an impact report that you can leverage to communicate only with the affected customers. I welcome your thoughts. -L
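To make the suggestion concrete, here is a rough Python sketch of the database-level check and the impact report described in this post. It is only an illustration under stated assumptions, not anything DASP runs: the `sys.databases` catalog view and its `state_desc` column are standard SQL Server metadata, but the severity assigned to each state, the function names `find_alerts` and `draft_notices`, the maintenance-mode set, and the database-to-owner lookup are all hypothetical.

```python
from collections import defaultdict

# Query a monitoring job could run against the master database.
# sys.databases and state_desc are standard SQL Server catalog metadata.
STATE_QUERY = "SELECT name, state_desc FROM sys.databases"

# Severity per database state, loosely modeled on the management pack's
# Success / Warning / Error states. The exact mapping is an assumption.
SEVERITY = {
    "ONLINE": "Success",
    "RESTORING": "Warning",
    "RECOVERING": "Warning",
    "RECOVERY_PENDING": "Error",
    "SUSPECT": "Error",
    "EMERGENCY": "Error",
    "OFFLINE": "Error",
}


def find_alerts(rows, in_maintenance=frozenset()):
    """Return (database, state, severity) for every database that is not healthy.

    rows: iterable of (name, state_desc) pairs, e.g. the result of STATE_QUERY.
    in_maintenance: databases whose owners engaged the hypothetical
    "Maintenance mode" switch; these never raise an alert.
    """
    alerts = []
    for name, state in rows:
        if name in in_maintenance:
            continue  # the owner announced changes, so suppress the alert
        # States missing from the table are treated as errors.
        severity = SEVERITY.get(state.upper(), "Error")
        if severity != "Success":
            alerts.append((name, state, severity))
    return alerts


def draft_notices(alerts, owners):
    """Group alerts by owner and draft one message per affected customer.

    owners: hypothetical mapping of database name to a contact address.
    The drafts are meant to be reviewed and sent by a human.
    """
    by_owner = defaultdict(list)
    for name, state, severity in alerts:
        owner = owners.get(name)
        if owner is not None:
            by_owner[owner].append(f"{name}: {state} ({severity})")
    return {
        owner: "Your databases need attention:\n" + "\n".join(lines)
        for owner, lines in by_owner.items()
    }


if __name__ == "__main__":
    sample = [
        ("cms_db", "RECOVERY_PENDING"),
        ("shop_db", "ONLINE"),
        ("test_db", "OFFLINE"),
    ]
    alerts = find_alerts(sample, in_maintenance={"test_db"})
    print(draft_notices(alerts, {"cms_db": "owner@example.com"}))
```

The check works on plain (name, state) pairs, so the same logic applies whichever monitoring system actually runs the query, and only the customers whose databases raised an alert receive a draft.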
I appreciate your researched and comprehensive reply. I'm not a DBA so I have no idea what impact things like 2012 SQL server Management Pack would have on a shared production server. Remember to scale your experience up when you think about using anything on a shared server. Way up. An occasional warning generated by that check on a small scale becomes, for us, thousands of warnings every day that someone has to parse and evaluate and respond to. But that aside, for better or worse an ecosystem has been established in web hosting, and granular monitoring like you're describing isn't done for two reasons: 1) shared web hosting is typically an inexpensive service that isn't expected to - and doesn't claim to - provide an enterprise-level experience, and 2) generally it isn't necessary. The first reason speaks for itself. The second holds because we have tens of thousands of active users. Very little escapes their notice. We often get email from a user about a problem within a minute of us receiving an alert internally. That doesn't happen all the time, and we certainly don't rely on customer feedback to take the place of our monitoring, but it is an extremely, extremely rare circumstance when something goes wrong on the network here and no customer notices it. We'll normally hear from someone within three hours, not days, weeks or months. So your experience is atypical. Which I guess leads to a third point, and that is, it doesn't make sense for us to use our resources to do something that benefits a very small percentage of our users. That might sound counter-intuitive or like bad customer service from your end, I understand that, but in the real world that kind of approach is essential if you want to survive. The bottom line is I think your expectations for our kind of service are - while completely understandable - a little unrealistic.
I would hazard a guess that there are few (or no) large, inexpensive hosts that monitor their customers' sites or databases for problems and alert them when those problems occur. The host's concern is always the overall server health. Now I'm also quite sure there are many hosts who will provide that service - for a considerable fee - but it is not typically something I would expect from any host. And personally, I am a consumer of web hosting services as well as a provider.