What is High Availability? SLA for IaaS: Real Guarantees for Virtual IT Infrastructure. High Availability as a Service.

IT Infrastructure as a Service (IaaS) offerings are becoming increasingly popular with corporate clients and are already used for mission-critical applications. It's time to figure out what the providers of these services guarantee and what responsibility they bear when the virtual IT infrastructure slows down or becomes completely unavailable.

After interviewing leading enterprise-grade IaaS providers, we analyzed their offerings. Here, "enterprise grade" means the following: the cloud platform is deployed in a data center that meets Tier III requirements (an Uptime Institute certificate is not required) and provides a high level of resiliency through High Availability (HA) mechanisms and relocation of virtual machines in the event of a disaster.

AVAILABILITY AND RESPONSE TIME

The main parameters of the IaaS service, which are usually indicated in the SLA, are the level of its availability, the response time to various incidents and the duration of their resolution, as well as the scheme and parameters of compensation in case of downtime.

If you decide to use a virtual IT infrastructure, you can safely count on availability of 99.5% and higher. At least none of the providers we surveyed named a lower figure. Moreover, representatives of many companies emphasized that the value indicated in their answers (see Table 1) is typical and, at the request of the customer, the level of availability can be increased using various technical means.

Typically, enterprise-grade IaaS platforms are hosted in data centers (in-house or third-party) that meet the Tier III fault-tolerance level, which corresponds to 99.98% availability. The availability values for virtual IaaS infrastructures quoted by the providers do not exceed the corresponding characteristic of the physical site, which is quite natural.

The exception is the 99.99% availability provided by Dataline in metro cluster mode. This disaster-resilient cloud option spans two of the company's data centers; for more about the metro cluster, see "Disaster-Resistant Cloud at a 'Non-Cloud' Price," published in the October 2013 issue of the Journal of Network Solutions / LAN.

In principle, a supplier can state arbitrarily high availability in the SLA, even 100%, but then it risks losing more than it earns, because any sane buyer will require a strict compensation scheme for non-fulfillment of the agreed conditions to be included in the contract. So far no standard scheme has been developed - each supplier offers something of its own, so the buyer must evaluate the proposed compensation against the possible financial losses from downtime of IT services.

Many companies offer a refund of a certain share of the monthly payment (as a percentage) for each additional hour of service unavailability beyond what the SLA allows. For example, with an SLA availability level of 99.95% (downtime of no more than 1 hour per month), Inoventica is ready to reimburse 2% of the monthly payment for each additional hour of disconnection from the service. Cloud4Y, in its standard offering, compensates 1% per hour of downtime (calculated from the total cost of services for the full calendar month preceding the incident), but not more than 50% of the cost of the service.
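
For illustration, here is a minimal sketch of how such a refund could be computed; the parameters (1% of the monthly fee per extra hour of downtime, capped at 50%) follow the Cloud4Y example above, and the function name and figures are purely illustrative rather than any provider's actual formula.

    def downtime_compensation(monthly_fee, downtime_hours, allowed_downtime_hours,
                              percent_per_hour=1.0, cap_percent=50.0):
        # Refund only the downtime in excess of what the SLA allows.
        extra_hours = max(0.0, downtime_hours - allowed_downtime_hours)
        refund_percent = min(extra_hours * percent_per_hour, cap_percent)
        return monthly_fee * refund_percent / 100.0

    # 1 hour of allowed downtime, 5 hours actually lost, monthly fee of 100,000:
    print(downtime_compensation(100_000, downtime_hours=5, allowed_downtime_hours=1))  # 4000.0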

A number of providers supplied detailed calculations of how the amount of compensation varies depending on the level of availability (see Table 2). In the event of a significant drop in this level, very substantial compensation is offered. For example, if availability falls below 95%, Onlanta (Lanit Group) reduces the payment for the service by up to 40%, and IT-Grad promises 50% compensation if availability drops below 96.71%. Clearly, providers consider such a deterioration in service quality unlikely.

"We have introduced two independent principles of compensation: for violation of the target service parameters and for violation of the request-handling targets," says Vitaly Mzokov, Head of Cloud Services and Infrastructure Solutions at Servionica (I-Teco Group). "Violation of the target service parameters is compensated on a progressive scale: depending on the actual level of availability, a compensation figure is calculated, expressed as a percentage of the invoice amount for using the service. Compensation for violation of the request-handling targets is calculated from the client's waiting time, to the minute."

According to the practice adopted by Servionica, the types of customer requests, as well as general targets for the maximum response time to requests and the maximum time for solving a problem, are described in the service interaction regulations. And in the SLA itself, these indicators are specified for a specific service.

"Under the contract, the customer can receive several services from us. That is why the regulations describe general indicators with a note: 'The targets specified in the SLA for a specific service override the indicators specified in the regulations.' This is done so that, if necessary, the reaction and resolution times can be adjusted (extended or shortened)," explains Vitaly Mzokov. "We are obliged to respond to requests of any kind within 15 minutes. The maximum resolution time, depending on the type and priority of the request, ranges from 1 hour (for priority 1 incidents) to 48 hours (for requests that require a thorough study of the customer's query - for example, providing information on tariffs and other services, various clarifications and instructions)."

The response time to a request usually depends on its priority. For example, Linxdatacenter uses the following priority levels (a simple encoding of such a matrix is sketched after the list):

  • Critical - the service is completely unavailable and urgent recovery measures are needed; reaction time is 15 minutes, recovery time no more than 4 hours;
  • High - the service is partially unavailable; reaction time up to 1 hour, handled with increased priority;
  • Normal - clarifications on service parameters and routine non-urgent questions; reaction time up to 1 hour, with 24 hours to prepare an answer.
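
As a hedged illustration, such a priority matrix could be kept alongside a ticketing or monitoring tool roughly as follows; the values mirror the Linxdatacenter list above, while the structure and names are assumptions made for the sketch.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PriorityPolicy:
        description: str
        reaction_minutes: int              # time to the first response
        resolution_hours: Optional[float]  # target restore/answer time, where defined

    # Values mirror the priority levels listed above (illustrative encoding).
    SLA_PRIORITIES = {
        "Critical": PriorityPolicy("service completely unavailable", 15, 4.0),
        "High":     PriorityPolicy("service partially unavailable", 60, None),
        "Normal":   PriorityPolicy("non-urgent clarifications", 60, 24.0),
    }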

Table 3 shows another example - the categorization of queries used by Cloud4Y; reaction time - no more than 30 minutes.

They try to work promptly at T-Systems. According to Vsevolod Egupov, sales director for the ICT division of T-Systems RUS, the specialists of this company “in 80% of cases respond within 30 seconds” (!). But, like most of our respondents, he noted that the reaction time depends on the criticality of the situation.

MONITORING TOOLS

It is not enough to specify an attractive availability level and strict compensation schemes in the SLA; the client also needs convenient and effective monitoring tools. And this is where provider approaches differ significantly.

Referring to the practice of Servionica, Vitaly Mzokov notes that clients are more interested in receiving transparent and accurate reporting from the operator than in mastering special tools for self-monitoring. As a rule, Servionica provides monthly reports on an agreed set of parameters, but, at the client's request, the contract may provide for more frequent reporting.

Many companies provide service health reports once a month by default, but can also provide them more often at the customer's request. An example of a report offered by Onlanta is shown in Figure 1. According to Mikhail Lyapin, head of its cloud business, Onlanta is the only company in Russia that provides customers with a cloud availability report at this level of detail. In his view, most service providers get by with statistics on the availability of virtual machines.

A number of companies offer customers an online self-service console. According to Ruslan Zaedinov, Deputy General Director, Head of Data Center and Cloud Computing at Krok, every consumer of the IaaS service has access to such a console with a built-in capability for online monitoring of the functioning of certain components. For example, in the case of virtual machines, the customer's IT specialists can monitor how much the processor is loaded, how the I / O is working, how much memory is occupied, etc. This data is available in real time, as well as - on demand - in the form of statistics for any period.

DO WE NEED TO GUARANTEE PERFORMANCE?

It is obvious that with an increase in the load on the IaaS platform of the provider, the degradation of the performance level of the virtual machine is possible. Service providers are committed to preventing this from happening. All companies agree on this. However, some include performance metrics in the SLA, while others consider such a measure unnecessary.

Here is what Vitaly Slizen, a member of the Board of Directors of Inoventica, says about this: “We do not observe degradation [of productivity] even with an increase in load, since we are timely expanding and modernizing the capacities of data centers. Separately in the SLA, these parameters (VM and storage performance) are not reflected, since their observance is our primary responsibility, regardless of customer requests. " Inoventica specialists constantly monitor all the main parameters of the leased infrastructure facilities, which allows them to quickly receive information about potential problems and predict them in a timely manner.

Igor Drozdov, technical sales support manager at Linxdatacenter: "Our company provides guaranteed computing resources. They are reserved in the cloud and grow as the number of clients increases, so the performance of virtual machines and storage systems remains consistently high. In addition, we perform timely server upgrades and monitor performance with dedicated VMware products."

Orange Business Services is also among the providers that do not regulate performance parameters in the standard SLA. At the same time, according to Dmitry Dorodnykh, head of unified communications and IT products development at Orange Business Services in Russia and the CIS, "if a client requires that certain computing resources be guaranteed for his virtual machines, we use the standard capabilities of modern virtualization platforms, which allow virtual machines to be moved to other servers in the event of resource contention."

Vsevolod Egupov believes that adding performance characteristics to the SLA "does not make sense, since degradation affects the level of service availability regulated by the agreement." At T-Systems, the performance of virtual machines and storage systems is controlled by the capacity management department, its specialists are responsible for preventing its degradation.

There are also quite a few companies that consider it advisable to add performance characteristics to the SLA. Many experts regard storage performance as the narrowest bottleneck in a virtualized IT environment, which is why most providers pay close attention to storage characteristics such as input/output operations per second (IOPS) and disk access time (latency).

Dataline includes performance metrics for storage and virtual machines in each SLA (see Table 4). At the same time, according to Dmitry Tishin, head of the company's service development department, "depending on the requirements the client places on the system landscape, the metrics can be changed." IOPS values are measured with the NetApp DFM monitoring system, and disk access time with the standard tools of the virtualization software (vCenter). In the event of a problem with a virtual machine, the on-call shift and the engineers of the virtualization team are alerted. In addition, Dataline offers monitoring of various parameters at the level of the operating system and the services running in it. If the client uses the company's OS and service administration services, such monitoring is performed by default.

To prevent degradation of virtual machine performance, Dataline specialists apply a set of measures. So, for the cluster, the Distributed Resource Scheduler (DRS) mechanism is used, which monitors the load of physical servers according to the main parameters - if a certain load on the server is reached, some of the virtual machines are automatically moved to another. The redundancy of servers is maintained in the cluster so that the load on the entire cluster is no more than 70%. Within the framework of the concluded service contracts with equipment suppliers, the resource capacities of the clusters can be increased according to the schedule.

Safedata also regulates performance characteristics such as IOPS and MIPS in the SLA. "We cannot reduce performance below the values specified in the SLA," says Anton Antonov, head of sales at Safedata. "If service degradation is observed as the load on physical servers grows, additional standby ESXi hosts are put into operation."

The performance characteristics of the storage subsystem regulated in the Cloud4Y SLA are shown in Table 5. According to Evgeny Bessonov, head of the Cloud4Y marketing department, in case of violation of the guaranteed CPU, HDD, or RAM performance indicators, compensation is stipulated, which is either negotiated separately or paid under standard conditions: 1% of the monthly cost per 1 hour.

"We guarantee the performance of virtual machines at the lower limit, without limiting it from above," says Ruslan Zaedinov. "Thus, if the server where the virtual machine is located has free computing resources in excess of the guaranteed ones, they will be available to the customer." As for storage, at present all Krok clients use a common communication channel to the storage systems. For a long time this was not a problem, but now, to meet the growing needs of customers, the company is migrating cloud storage from Fibre Channel and SATA drives to flash drives with direct access from virtual machines over an Infiniband network. In parallel, software is being implemented to ensure guaranteed throughput of the cloud storage system. The corresponding changes to the SLA will be introduced this fall.

As agreed with the customer, Servionica fixes the performance indicators of individual components of the cloud platform in the SLA of each project. In addition, the agreement specifies how these indicators are measured and how often. "Any operator can write 'guaranteed 100,500 OPS per 1 GB of disk space', but not everyone is able to prove that this criterion is met. We are for the most transparent relationship between the cloud platform operator and its consumer," emphasizes Vitaly Mzokov. In the Servionica SLA, the performance of virtual machines and storage systems is defined by IOPS and latency.

According to Maxim Zakharenko, CEO of the service provider Oblakoteka, the contracts it concludes regulate peak performance indicators so that the load on I/O and network bandwidth does not exceed 80%. Monitoring is carried out using Microsoft SCOM. He notes that different indicators matter for different systems: for websites, response time; for hosted IT infrastructures, peak CPU, memory, and virtual network figures, etc. The company's SLA also includes guaranteed backup parameters and the methods and terms for returning and storing user data ("honest parting").

END-TO-END SLA

No matter how high the reliability of the IaaS platform itself, located in a fault-tolerant data center, the access channels to this platform can become a bottleneck for the customer. The good news is that many of the providers we interviewed offer end-to-end SLAs covering both the IaaS service itself and the access channels. Moreover, according to them, with proper organization and channel redundancy, the availability of the communication links is no lower than that of the platform, so this important characteristic is not reduced in end-to-end SLAs.

However, as Vsevolod Egupov notes, whether the availability level is reduced or maintained depends on how the communication channels are organized: if the channel is redundant, availability does not deteriorate; otherwise, the availability level in the end-to-end SLA is reduced to the availability level of the channel. T-Systems RUS has its own network of data centers around the world; Russian customers are served mainly from data centers in Germany and Austria. The company has signed SLAs with Rostelecom and Beeline and cooperates with other telecom operators.

IaaS providers that are also carriers leverage this advantage. Thus, being an international telecom operator, Orange Business Services concludes end-to-end SLAs covering both IaaS and telecom services. The availability level in such SLAs is 99.95%. But, as Dmitry Dorodnykh explains, this characteristic depends on the client's geographic location: in the Central region, for example, this level is higher than beyond the Urals and in Siberia. The last mile may have its own SLA parameters. The schemes and mechanisms for monitoring SLAs on communication channels have been refined over decades, so monitoring is not a problem for Orange Business Services.

As Vitaly Slizen notes, Inoventica has its own backbone communication channels and a geographically distributed network of data centers, which makes it possible to implement geoclusters. This allows data and services to remain operational even if one of the data centers is physically destroyed. According to him, Inoventica is "the only company on the Russian market providing the full service chain 'Data Center - Channel - Service - Client (AWS)' under an SLA that guarantees a round-trip delay of less than 10 ms and almost zero packet loss." Currently, the comprehensive Inoventica solution is available to customers in five federal districts of the Russian Federation.

Non-carrier IaaS providers actively cooperate with carriers. Thus, Servionica has established SLAs with the telecom operators serving its data center (more than ten large telecom providers). The company carries the terms of these SLAs over into contracts with customers who use communication services, and compliance is monitored by the technical services of the TrustInfo data center. "We specify in our contracts the same SLA parameters as the operators do, that is, we take responsibility for the quality of their work and the uninterrupted provision of communication channels," notes Vitaly Mzokov.

To provide customers with communication channels, Dataline uses the services of telecom operators under a subcontracting scheme. Under this scheme, the company controls quality within the framework of its contract with the operator, while the client receives a comprehensive service from Dataline and deals with only one contractor; the availability level of the comprehensive service does not decrease. Dataline has its own data transmission network in Moscow, where the following characteristics are guaranteed: packet loss of no more than 0.2% and an average network delay of no more than 5 ms.

According to Ruslan Zaedinov, Krok uses wide channels, the bandwidth of which is quite enough for all customers in the cloud. Technically valid guarantees are provided by cross-channel redundancy between different Krok data centers using its own optical ring. For those organizations for which a fixed bandwidth of a communication channel is critical, the company implements an individual connection to the cloud via separate channels with guaranteed throughput or even "dark" optics. Such a connection is most often equipped with individual encryption tools, including certified ones.

So, IaaS services are offered in Russia by a fairly large number of companies, and according to quite understandable and documented (in SLA) rules. The industry has yet to agree on whether or not the performance characteristics of virtual IT infrastructures should be regulated in SLAs, but the guaranteed availability indicators seem to be quite acceptable for even the most demanding enterprise customers. In addition, providers understand customers' need for end-to-end SLAs and are working to improve them.

Alexander Barskov is the leading editor of the Journal of Network Solutions / LAN. You can contact him at:

Author: Stuart Rance.

The availability of IT services is of great importance. When the services the customer needs are not available, they will be dissatisfied. Why should a customer pay for a service that doesn’t exist in reality when he needs it? This is why a consistent service availability metric is often included in a KPI.

IT staff go to great lengths to make sure that the stated goal is achieved and to show the figures in the reports to the customers to prove it. Typically IT companies use percentages for this, for example 99.999%. Unfortunately, this often means that they focus only on the percentage and lose sight of their true goal of being of value to the customer.

Percentage availability problem

One of the simplest ways to calculate availability is based on two quantities. You agree on the time intervals during which the service should be available in the reporting period; this is the agreed service time (AST). You measure the downtime (DT) during this period. Subtract the downtime from the agreed service time and convert the result into a percentage.

If the AST is 100 hours and the downtime is 2 hours, the availability is: 100% × (100 - 2) / 100 = 98%.
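
The same calculation expressed as a minimal sketch (the function name is illustrative):

    def availability_percent(ast_hours: float, downtime_hours: float) -> float:
        # Agreed service time minus downtime, expressed as a percentage of AST.
        return 100.0 * (ast_hours - downtime_hours) / ast_hours

    print(availability_percent(100, 2))  # 98.0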

The problem is that, although this calculation is quite simple, as is collecting the data for it, it is not entirely clear what the resulting figure actually reflects. I'll come back to this a little later.

Worse, from the customer's perspective, you can communicate that you have achieved the agreed goals while leaving them completely dissatisfied.

A meaningful availability report should be based on measures that describe things the customer cares about, such as the ability to send and receive email or to withdraw cash from ATMs - something an overall percentage cannot convey.

Defining availability targets

If you want to measure, document, and report availability in a way that benefits both your organization and your customers, you need to do two things. First, define the context and agree on what "availability" means for you and your customers. To do this, you need to talk to them.

Second, you need to think carefully about a number of practical questions: what you will measure, how you will collect the data, how you will document it, and how you will report your findings.

Communication with customers

Before taking any action, you need to understand what is important to your customers and what impact the loss of availability has on them. This allows you to set realistic goals that take into account technology, budget, and staffing constraints.

But what exactly should you say to your customers? The impact of downtime can be a great starting point for a conversation. Below are five questions you should ask:

  1. What business functions are critical and top priority in protecting against downtime?
  2. How does downtime affect the business?
  3. How does the frequency of downtime affect the business?
  4. What is the impact of downtime on organizational performance?
  5. How do the organization's own customers perceive this forced downtime?

Business Critical Functions

Most IT services support multiple business processes, some of which are critical and others less important. For example, an ATM can support cash dispensing and receipt printing. The ability to dispense cash is critical, while the inability to print a receipt has much less impact.

You need to talk to customers and determine the importance of the various functions to the business. You can create a spreadsheet that highlights the business implications of the downtime of each of these functions. Example:

Table 1 - The importance of services as a percentage

NB: the numbers do not have to add up to 100%

From this table, you can see that this service has no value at all if email cannot be sent and received, and that its value drops to half the normal level if public folders cannot be read. This tells IT to focus on the quality of the email service.

Duration and frequency of downtime

You need to figure out how the frequency and duration of downtime affects the customer's business.

I already mentioned that percentage availability may not be sufficient. When a service that is supposed to be available for 100 hours has 98% availability, this indicates that there have been two hours of downtime. But this can mean one two-hour incident or several shorter ones. The relative impact of a single long incident versus a series of short ones will vary depending on the nature of the business and its processes.

For example, a billing run that lasts two days and must be restarted after any outage will be severely affected by every short outage, while a single long outage may matter much less. On the other hand, a one-minute outage may not affect an online store at all, but a two-hour one can lead to a significant loss of customers. Once you understand the likely business impact of downtime, you can build much more effective infrastructure, applications, and processes that truly help your customer.

Here's an example of how availability can be measured and documented to reflect the fact that the impact of downtime varies:

Table 2 - Outage duration and maximum frequency

If you use a spreadsheet like this when discussing downtime frequency and duration with your customers, these numbers are likely to be much more useful than percentage availability, and they will certainly be of greater value to your customers.

Downtime and productivity

I mentioned that percentage availability is not very useful for communicating with customers about the frequency and duration of downtime. On the other hand, when you are discussing the performance impact of downtime, percentages can come in very handy.

Most incidents do not result in a complete loss of service for all users. Some users may not be affected at all, while others cannot work. Perhaps only one user with a faulty PC cannot access any of the services. You could classify even this as a 100% loss of service, but that would be a completely unattainable goal for IT and could not be a fair measure of availability.

On the other hand, you could say a service is available as long as someone can still access it. However, it doesn't take much imagination to figure out how customers will feel if a service is reported as available when many people simply cannot use it.

One way to determine impact is to calculate the percentage of lost user minutes. To do this:

  • Calculate PotentialUserMinutes. This is the total number of users multiplied by the time they are expected to work. For example, if you have 10 employees working for 8 hours, then PotentialUserMinutes is 10 x 8 x 60 = 4800.
  • Calculate UserOutageMinutes. This is the total number of users who were unable to work multiplied by the time they were unable to work. For example, if an incident prevented 5 employees from working for 10 minutes, then UserOutageMinutes is 50.
  • Calculate percentage availability using a very similar formula to the one we saw earlier

In the given example, we get the following availability: 100% × (4800 - 50) / 4800 ≈ 98.96%.

You can use the same technique to calculate the impact of lost IP telephony availability in a call center in terms of PotentialAgentPhoneMinutes and LostAgentPhoneMinutes; for applications that involve transactions or manufacturing, you can use a similar approach to quantify the business impact of an incident, comparing the number of transactions (or the production volume) expected without downtime against what was actually achieved.
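
Here is a minimal sketch of the user-minutes calculation described above; the function and variable names are illustrative:

    def user_minutes_availability(potential_user_minutes: float,
                                  user_outage_minutes: float) -> float:
        # Availability weighted by how many users were affected and for how long.
        return 100.0 * (potential_user_minutes - user_outage_minutes) / potential_user_minutes

    potential = 10 * 8 * 60   # 10 employees working 8 hours = 4800 user-minutes
    outage = 5 * 10           # 5 employees unable to work for 10 minutes = 50 user-minutes
    print(round(user_minutes_availability(potential, outage), 2))  # 98.96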

Availability measurement and reporting

Once you've agreed and documented availability targets, you need to think about the practical aspects of how to measure and report availability. For instance:

  • What will you measure?
  • How will you collect the data?
  • How will you document and report your findings?

What will be measured

It is very important to measure and report availability in the same terms that define the targets agreed with the customer and that are based on a shared understanding of what availability really means for the customer. The targets should make sense to the customer and ensure that IT efforts are focused on supporting his business.

Typically, these goals are part of a service level agreement (SLA) between IT and the customer, but you need to be careful that the numbers from the SLA do not become your goal. Your real goal is to provide services that meet the expectations of your customers.

How to collect data

There are many different ways to collect data on the availability of IT services. Some of them are simple, but not very accurate, some are quite expensive. You can use only one approach, or combine several of them to create your own reports.

Data collection in technical support

One way to collect availability data is through support. Typically, service personnel determine the impact and duration of each incident on the business, as it is part of incident management. This data can be used to determine the duration of incidents and the number of users affected.

This approach is usually fairly inexpensive. However, it can lead to disputes about the accuracy of the availability data.

Measuring infrastructure and application availability

This approach includes tooling for all the components needed to provide a service and calculating availability based on an understanding of how each component contributes.

It can be very effective, but it can miss small glitches. For example, minor database corruption may prevent some users from performing certain types of transactions. This method can also miss the impact of shared components: one of my clients had recurring email failures due to unreliable DHCP servers at their headquarters, but the IT department did not register this as email downtime.

Dummy customers

Some companies use dummy customers to send known transactions from specific points on the network to check for availability.

In fact, this is a measure of end-to-end availability. Depending on the size and complexity of the network, this approach can be expensive to implement, and it only reports availability from the specific points where the dummy customers are placed. This means that small glitches can be missed - for example, if an incident breaks a certain web browser while the dummy customer uses a different one.

The tools that support this data collection also frequently report service performance and availability, which can be a useful addition.

Instrumenting applications

Some companies add custom code to their applications to monitor end-to-end availability. This makes it possible to actually measure the end-to-end availability of services, provided this was a goal at the time the application was developed. Typically, such instrumentation includes code on both the client and the server side.

If implemented well, it can not only collect availability data, but can also help pinpoint exactly where the failure occurred, which can help increase availability by reducing incident resolution time.

How to document and report your findings

Once you've collected your availability data, you need to think about how to communicate the results to your customers.

Accounting for planned downtime

One aspect of availability measurement and reporting that is often overlooked is planned downtime. If you don't account for planned downtime when designing your availability reports, you risk publishing figures that aren't true.

There are several ways to ensure that planned downtime does not distort the statistics. One is to agree a specific amount of scheduled downtime that is excluded from the availability calculation. Another is to announce planned downtime in advance; for example, some organizations do not count downtime that was announced a month ahead.

Regardless of what you decide to do, it is important that your SLA clearly defines how planned downtime will be accounted for.
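
As an illustration of the first approach mentioned above, here is a hedged sketch in which scheduled maintenance is simply excluded from the agreed service time; the names and figures are illustrative, and the exact treatment should match whatever your SLA defines.

    def availability_excluding_planned(agreed_hours: float,
                                       unplanned_downtime_hours: float,
                                       planned_downtime_hours: float) -> float:
        # Planned maintenance windows are removed from AST, so only unplanned outages count.
        effective_ast = agreed_hours - planned_downtime_hours
        return 100.0 * (effective_ast - unplanned_downtime_hours) / effective_ast

    # A 730-hour month with 4 h of announced maintenance and 2 h of unplanned outage:
    print(round(availability_excluding_planned(730, 2, 4), 2))  # 99.72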

Agreeing the reporting period

Earlier, I talked about the limitations hidden behind percentage availability. Nevertheless, it is and will continue to be widely used. It is therefore important to specify the period over which calculations are performed and reports are produced, as this can be critical to the numbers that end up in your reports.

For example, consider an IT company that has agreed to a 24x7 service and 99% availability. Let's assume there is an eight-hour outage:

  • if we report availability on a weekly basis, then AST (Agreed Service Time) is 24 x 7 hours = 168 hours
  • monthly AST (24 x 365) / 12 = 730 hours
  • quarterly AST (24 x 365) / 4 = 2190 hours

Putting these numbers into the availability equation gives:

  • Weekly availability = 100% x (168-8) / 168 = 95.2%.
  • Monthly Availability = 100% x (730 - 8) / 730 = 98.9%
  • Quarterly availability = 100% x (2190-8) / 2190 = 99.6%

Each is a valid indicator of service availability, but only one indicates that the goal has been met.
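
The period dependence is easy to reproduce; a minimal sketch, assuming the 24x7 service and eight-hour outage from the example above:

    def availability_percent(ast_hours, downtime_hours):
        return 100.0 * (ast_hours - downtime_hours) / ast_hours

    outage_hours = 8
    periods = {"week": 24 * 7, "month": 24 * 365 / 12, "quarter": 24 * 365 / 4}
    for name, ast in periods.items():
        print(f"{name}: AST = {ast:.0f} h, availability = {availability_percent(ast, outage_hours):.1f}%")
    # week: 95.2%, month: 98.9%, quarter: 99.6%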

In conclusion

Almost every IT company I have worked with measures and reports on the availability of its services. Truly effective IT departments work with their customers to optimize their investments and provide an excellent level of availability. Unfortunately, many IT companies focus on the SLA numbers and fail to meet the needs of their customers, even if the reports end up showing acceptable figures.

This is a long article, below are the key points that are covered in it:

  • You don't need to tell the customer that you delivered 98% availability unless you have an understanding of the impact of 2% downtime.
  • Talk to your customers and make sure you understand the impact of any downtime on them and end customers
  • Think about ways to protect your customers' critical business processes
  • Find ways to measure the frequency and duration of downtime, and the impact of downtime on performance that meet your customers' needs
  • Agree, document, and report availability metrics in ways that make sense to your customers and help plan
  • Use the appropriate tools to properly assess availability and report back.

What else could you add to my tips? Please write in the comments.

"Availability", "three nines after the decimal point" - these terms come up often when new IT solutions are discussed. IT architects pitch a new system design to the customer, emphasizing that it has very high availability. The contract is concluded, the system is built, the commissioning certificates are signed, operation begins... It is at the operation stage that the "quality" of the created system is really tested, and that is when disappointment can come. What is hidden behind the magic "nines"? What is actually promised at the design stage? And who is responsible for availability?

Availability: an introduction to the subject

The best way to understand what availability is is to figure out why it is needed. Availability is a characteristic of what the business wants from the IT department. Unfortunately, some business representatives, when asked about the desired availability of IT services, answer something like: "I want everything to always work." In that case, it falls to the IT manager to write the terms of reference for the service, including defining the availability parameters. So, availability is a measure of an IT service that the business consumes and that the IT department provides. The formula for calculating availability is as follows:

Availability = (AST - DT) / AST × 100% = Service or Component Availability (%)

where
AST (agreed service time) - the agreed time for the provision of the service;
DT (actual downtime during agreed service time) - the actual time the service was unavailable during the agreed provision time.

The specifics of calculating availability are easier to understand with a specific example. Let's try to determine the availability of the IT service "online store" for the AAA company located in Moscow, which sells books. At the same time, books and their delivery to any city can be paid for, for example, using a credit card. Obviously, shipping orders will only be processed on weekdays from 9 am to 6 pm.

But what will AST be - the agreed service delivery time? To answer this question, you need to consider that people can place orders in non-working hours, and be sure to take into account the fact that there are 11 time zones in Russia. Therefore, the service must be provided 24 hours a day, 7 days a week.

Now we need to deal with DT - the time when the service may be unavailable. Here one cannot do without negotiations with the business. It is quite possible that four hours of service unavailability once a month would be an adequate choice for this example. However, one nuance must be taken into account: the period of time over which the DT parameter is assessed, that is, the agreed service provision time (AST). The choice of the AST period is a private matter of the contracting parties, the business and the IT department. It is better to take a week or several weeks as such a period, since a month or a year are not constant values (they contain a different number of days). However, psychology matters too: shorter periods can be perceived negatively by the business. In our example, the same availability value corresponds to roughly one hour of downtime per week, and the business might not like the online store being unavailable for an hour every week, even though it may agree to four hours of downtime per month. On the other hand, it is sometimes impossible to operate an IT system without stopping it for a few hours of routine maintenance. Such planned downtime should also be taken into account when choosing DT, which in turn may lead to a revision of the AST parameter.

Based on the above, we choose 4 hours of unavailability of the service once every four weeks. That is, AST = 4 weeks, DT = 4 hours. Then the availability is as follows:

Availability = (24 × 7 × 4-4) / (24 × 7 × 4) × 100% = 99.40%

It is possible that the business will disagree. In that case, you need to find out which option it will accept. Later you can work out two hardware and software configurations with different availability and negotiate with the business based on a comparison of their costs. In general, negotiations with the business and budgeting of the IT service are a separate topic that would take more than one book to cover. So let's assume that in our example the availability has been calculated and agreed, and we can proceed to creating the system.

Note that we determined the required availability before we started working on the solution that provides it - not the other way around, first choosing a solution and then calculating its availability. The terms of reference come first, and the required availability is one of the parameters fixed in them. When the system is put into service, its availability should meet the required value. We therefore advise spelling out in the agreement with the business (SLA, Service Level Agreement) exactly what lies behind the availability figure (in our example: "4 hours of service unavailability one (1) time within four (4) weeks"), so that all parties clearly understand what is really hidden behind the numbers.

Three dimensions of availability

The first thing to understand when choosing a solution is what the availability of an IT service consists of. Many operational frustrations stem from the assumption that the availability of the service the business wants is determined solely by the availability of equipment. In fact, the availability of an IT service is a combination of three components:
1) Reliability;
2) Maintainability;
3) Serviceability.
Let's take a look at each of them.

Reliability

Reliability is the availability of infrastructure or hardware and software complex as a whole, including communications. For example, for an online store, we need a web server, an application server, a DBMS, disk storage, and Internet access. For simplicity, we will assume that the application server software includes a web server and will be installed on one hardware server, the DBMS on the second, and the disk storage is an external disk array.

We start building the infrastructure design. Under each component, we write down its availability parameters. The availability of each component - hereinafter we will use the term "reliability" - must be obtained from the supplier of that component (equipment, software, or service). If for some reason this is impossible (for example, reliability values for software components are usually unknown), the required value will have to be estimated and assigned independently. Each component is a single point of failure, so in the reliability calculation diagram they are connected in series (Fig. 1). Note that this is not a diagram of how the infrastructure components are connected, but only a scheme for calculating reliability.

So, we calculate the reliability. Since the components are connected in series, their reliability values are multiplied:

Reliability = (0.985 x 0.97 x 0.975 x 0.98 x 0.99 x 0.9999 x 0.99) x 100% = 89.47%

This is clearly insufficient compared to the required 99.40%. So we change the design: we add an alternative Internet access provider (Fig. 2) and recalculate. Since the Internet access links are now connected in parallel, the overall reliability is determined as follows:

Reliability of the duplicated (parallel) section = 1 - (1 - R1) × (1 - R2)

Overall reliability = (product of the series components, with the parallel section substituted for the single Internet link) × 100% = 91.72%

I think the principle of "working with the reliability" of the future system has been demonstrated. Note that the example did not include the network infrastructure components and the reliability of connections (for example, between the database server and the disk storage), nor the engineering infrastructure (power supply, air conditioning, etc.), which are also points of failure and should be included in the calculation. The reliability assessment of software components deserves special attention; the main advice here is reasonable conservatism - use software components that have long been used in such solutions and have proven themselves well.
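
A hedged sketch of the series/parallel arithmetic used above; the component values are those from the example, while which specific component is duplicated is assumed only for illustration.

    from math import prod

    def series(reliabilities):
        # Components in series: the system works only if every component works.
        return prod(reliabilities)

    def parallel(reliabilities):
        # Redundant components: the section fails only if all of its components fail.
        return 1 - prod(1 - r for r in reliabilities)

    # The single chain from the example (application server, DBMS, storage, links, ISP, ...):
    chain = [0.985, 0.97, 0.975, 0.98, 0.99, 0.9999, 0.99]
    print(round(series(chain) * 100, 2))  # 89.47

    # Duplicating one weak component (here, hypothetically, the Internet link) replaces
    # its reliability r with parallel([r, r]) and raises the overall figure towards ~91.7%.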

Using the techniques that were briefly discussed above, you can select a solution with the required availability.

Maintainability and Serviceability

Let's move on to the other components of availability: maintainability and serviceability. These two terms are easy to confuse, so it helps to use clearer working definitions: maintainability covers the activities of the organization's own internal IT service; serviceability covers the services provided by external providers.

To clarify the situation, consider the extreme cases. When is maintainability completely absent? When a company outsources its entire IT function. In that case, availability is a combination of reliability and the services of external providers only.

When is serviceability completely absent? This happens, for example, in the FSB, which, for reasons of secrecy, is forced to carry out all activities needed to keep the system operational exclusively with its own IT department; even spare parts are bought independently rather than supplied under a technical support contract. In that case, availability is a combination of the system's reliability and the activities of the organization's internal IT service only.

Clearly, the solution must be chosen at the same time as the maintainability and serviceability schemes are developed. Overall, reliability, maintainability, and serviceability are the three dimensions of availability. Changes in one of them must be compensated by changes in the other two - otherwise the availability of the IT service will change, which can harm the business.

Ways to Manipulate the Availability Components

To understand how all the components of availability can be manipulated, consider another practical example. A company with data centers in two Russian cities, Zelenograd (a satellite city of Moscow) and Irkutsk, acquired two identical turnkey systems. Consequently, their reliability is the same. Both IT systems were covered by identical hardware and software support contracts, which means the services provided by the external vendor - serviceability - were also the same. Yet the availability of the systems differed, and the company began to complain to the supplier about the poor availability of the system in Irkutsk, claiming that one of the solutions was "defective" and demanding an audit.

However, in this case an audit of the solution will most likely not reveal the root cause of the availability "failure", since only one component will be examined - reliability, which should be the same for both systems - while it is the other two components that need to be investigated. Looking at them, two scenarios are possible.

Option 1: hardware failures caused the loss of availability. Because of the geographic location of the data centers, nominally identical hardware support contracts may in practice differ. For instance, the external supplier's service center is located in Moscow, and the technical support contract states that it is valid only on weekdays and that an engineer arrives at the equipment installation site "by the first available train or flight." Obviously, for an engineer leaving Moscow, this time will differ between Zelenograd and Irkutsk.

Possible solutions to the availability problem in this case:

  • change the reliability of the IT system in Irkutsk, for example, put an additional node in the cluster;
  • change the serviceability parameter - create a spare-parts warehouse in Irkutsk, or arrange for the company's own IT specialists to replace faulty components themselves, if this does not contradict the manufacturer's rules.

It also makes sense to check the operating conditions. Examples of typical violations of these conditions:

  • carrying out repair work in rooms where the systems remain switched on, which exposes the equipment to dust - and dust is very dangerous for servers;
  • the use of household air conditioners in server rooms: each type of equipment has its own humidity requirements, household air conditioners are not designed to maintain a specified humidity level, and completely dry air is destructive for the equipment.

Option 2: software failures caused the required level of availability to be missed. In this case, the problem is most likely in the IT service in Irkutsk. Software technical support is provided remotely, so the services do not differ, apart from different service windows relative to local time in different time zones, which usually has no significant effect. The likely reason for the availability "failure" here is a difference in the professionalism of the IT departments - in Irkutsk it is probably lower than in Zelenograd. Possible solutions:

  • raise maintainability to the required level - train the Irkutsk IT staff on the software and hardware products that make up the IT system, organize seminars to transfer the experience of the Zelenograd IT team, copy the operating processes, etc.;
  • compensate for maintainability through serviceability - purchase advanced technical support services, outsourcing services, etc.

Going back to our online store example: what is the best combination of reliability, maintainability, and serviceability? The answer depends on the specific case. For example, one can recommend hosting instead of implementing the entire infrastructure (IT and engineering) in-house. In general, there are the following standard ways of managing availability.

1. Changing reliability:

  • change of the IT solution towards high availability (High Availability) - the use of clusters, the use of equipment with support for "hot" replacement, repeated duplication of potential points of failure, etc .;
  • lease of the entire infrastructure or part of it from external suppliers (hosting, collocation).

2. Change in maintainability (changes in the activities of the IT service of the company):

  • dissemination within the organization of its own best practices in IT management;
  • inviting external consultants to organize processes in the IT department;
  • training of IT personnel.

3. Changing serviceability - changing contracts for IT services with external providers towards a higher level of service, a larger volume of services, a wider area of responsibility of the external service providers, etc. All the techniques for manipulating the three sources and three components of availability cannot be described within one article; however, the main approaches to compensating some components of availability with others have been demonstrated. To further improve your proficiency in this area, you should study practical experience in the design and operation of IT systems.

Changing business views on the provision of IT services leads to the need to implement a process for managing their availability.

In ITIL version 3, the processes for managing the availability and continuity of IT services are considered together (hereinafter referred to as "the process"). The most important key concepts of this combined process are:

availability - the ability of an IT service or its components to perform their functions in a given period of time;

reliability - the ability of an IT service or its components to perform specified functions under given operating conditions;

recoverability - the ability of an IT service or its components to restore operational characteristics that were partially or completely lost as a result of a failure;

serviceability - a characteristic of IT components that describes their placement and parameters so that personnel can act rationally during installation, transportation, preventive maintenance, and repair (this concept is applied to external providers of IT services).

The business has its own understanding of the availability and cost of IT services, and therefore the goal of the process is to ensure the required level of availability while maintaining a certain level of costs. To achieve this goal, the process aims to accomplish the following tasks:

    Planning and development of IT services taking into account the business requirements for the level of availability;

    Optimizing the availability of IT services through cost-effective improvements;

    Reducing the number and duration of incidents affecting the availability of IT services.

In the course of solving these problems, the business requirements for the availability of IT services and IT infrastructure components are fixed; required reports are developed; the levels of availability of IT services are periodically reviewed; an availability plan is formed that defines priorities and reflects measures to improve the availability of IT services. In other words, the process boils down to planning the delivery of IT services, measuring the level of availability and taking actions to improve it.

Planning

During planning, the business requirements for the availability of IT services are formulated, criteria for determining the level of availability and the acceptable downtime of IT services are developed, and certain information security aspects are considered. The business must establish the boundary separating availability from unavailability of an IT service, for example the amount of time an IT service may be disrupted in the event of an IT infrastructure failure.

When designing the availability of IT services, the IT infrastructure is analyzed to identify the most vulnerable components that have no redundancy and that could, in the event of a failure, negatively affect the provision of IT services. In ITIL terminology, these components are called Single Points of Failure (SPOF) and are identified using the Component Failure Impact Analysis (CFIA) method, which is used to assess and predict the impact of IT component failures on an IT service (a simple sketch of such an analysis grid follows the list of goals below). The main goals of CFIA are:

    Identifying points of failure affecting availability;

    Analyzing the impact of component failures on the business and users;

    Determining the dependencies between components and personnel;

    Determining component recovery times;

    Identifying and documenting recovery options.
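
In practice, a CFIA often starts as a simple grid of components versus services; here is a minimal sketch with invented component and service names, intended only to illustrate the idea.

    # Rows are infrastructure components, columns are IT services; "X" means
    # "failure of this component interrupts this service". A component marked "X"
    # for a service and having no redundant twin is a candidate single point of failure.
    services = ["email", "web shop", "billing"]
    cfia = {
        "core switch": ["X", "X", "X"],
        "DB server":   [" ", "X", "X"],
        "mail server": ["X", " ", " "],
    }

    for component, impacts in cfia.items():
        affected = [svc for svc, mark in zip(services, impacts) if mark == "X"]
        print(f"{component}: affects {', '.join(affected) or 'nothing'}")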

For risk analysis, the risk analysis and management method (CCTA Risk Analysis and Management Method, CRAMM) is used, which analyzes possible threats and dependencies of IT components, assesses the likelihood of non-standard situations or emergency events.

To ensure the required level of availability, it is possible to mask the negative impact of planned or unplanned component downtime, to duplicate IT components, and to use means of increasing a component's performance when the load grows, etc. Where specific business functions are highly dependent on the availability of IT services and the reputational damage from downtime is considered unacceptable, a higher availability level is set for the corresponding IT services and additional resources are allocated.

The IT service delivery design ensures that the stated availability requirements are met, but this refers to the stable, operational state of the IT service. However, failures are also possible, therefore, planning for the recovery of IT services is also carried out, including the organization of interaction with the incident management process and the Service Desk; planning and implementation of monitoring systems to detect failures and provide timely notification of them; development of requirements for backup and recovery of hardware, software and data; developing a backup and recovery strategy; defining recovery metrics, etc.

Another aspect of planning is determining maintenance downtime. All IT components must be covered by a maintenance strategy. Depending on the application and on the criticality and importance of the business functions supported by a particular IT component, the frequency and level of maintenance may vary. If a service must be provided in 24x7 mode, an optimal balance must be found between the maintenance requirements of IT components and the business losses from service downtime. Approved maintenance schedules must be documented in Service Level Agreements (SLAs).

Improving the availability of IT services

Why improve availability? There can be many reasons: the quality of IT services does not meet SLA requirements; IT services are provided unstably; the availability of IT services is trending downward; recovery times are unacceptably long; the business requests increased availability.

Improving availability requires reasonable additional expenditure, and certain techniques and technologies are used to identify opportunities for improvement in IT services, including Fault Tree Analysis (FTA) and Systems Outage Analysis (SOA).

Fault tree analysis identifies the chain of events leading to the failure of an IT component or IT service. Graphically, a fault tree (see figure) is a sequence of events that begins with an initiating event, is followed by one or more functional events, and ends with a final state. Depending on the events, the sequences can branch logically.
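
A minimal sketch of how such a fault tree can be evaluated, assuming simple AND/OR gates over independent component-failure probabilities; the structure and numbers are purely illustrative.

    def or_gate(*probabilities):
        # The event occurs if at least one (independent) input event occurs.
        p_none = 1.0
        for p in probabilities:
            p_none *= (1 - p)
        return 1 - p_none

    def and_gate(*probabilities):
        # The event occurs only if all (independent) input events occur.
        p_all = 1.0
        for p in probabilities:
            p_all *= p
        return p_all

    # "Service down" if the single Internet link fails OR both clustered servers fail.
    p_service_down = or_gate(0.01, and_gate(0.02, 0.02))
    print(round(p_service_down, 5))  # 0.0104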

Systems Outage Analysis is a structured approach to identifying the root causes of interruptions in the provision of IT services; it draws on multiple data sources to determine where and why the interruptions occur. The objectives of this analysis are:

    Determination of the root causes of disruptions in the provision of IT services;

    Determining the effectiveness of IT service support;

    Preparation of reports;

    Initiation of the program for the implementation of the accepted recommendations;

    Assessment of availability improvements based on the outage analysis.

Systems outage analysis makes it possible to raise the level of availability without increasing costs, to build up the staff's own skills (avoiding the expense of external consulting on availability improvement), and to define a specific improvement program.

The result of the availability improvement activities is a long-term plan for proactively improving the availability of IT services within financial constraints. The availability plan describes the current and planned levels of availability as well as the actions needed to improve them. Preparing the plan requires the participation of business representatives, the managers of the implemented ITSM processes, representatives of external IT service providers, and the technical support specialists responsible for testing and maintenance. The plan covers up to two years and should contain a detailed description of activities for the next six months. It is reviewed every quarter with minor adjustments and every six months with the possibility of major changes.

Measuring IT Service Availability

From the consumer's point of view, an IT service can be considered available when the vital business functions that use it are performing properly. The main quantitative indicators are availability - the ratio of the time an IT component was actually available to the availability time specified in the service level agreement - and unavailability (in %), which is its complement. These parameters are used within IT and are not very telling from a business point of view, since they do not reflect availability as seen by the business or by users: they can show a high level of availability of individual IT components while the actual availability of the IT service remains low.
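
A minimal sketch of the two indicators just described, using made-up figures for a 24x7 month: availability as the ratio of actual available time to the agreed service time, and unavailability as its complement.

    # Availability and unavailability for one agreed service period (figures illustrative).
    agreed_minutes = 30 * 24 * 60      # agreed 24x7 service time for a 30-day month
    downtime_minutes = 130             # total recorded outage time in that period

    availability = (agreed_minutes - downtime_minutes) / agreed_minutes * 100
    unavailability = 100 - availability
    print(f"Availability:   {availability:.3f}%")    # ≈ 99.699%
    print(f"Unavailability: {unavailability:.3f}%")  # ≈ 0.301%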

Indicators the business does understand include the frequency of IT service outages, their total duration, and the scope of impact of each interruption.

Roles and responsibilities

The process defines the role of the process manager, who guides the process, takes the necessary actions, and is responsible for its operation and development in accordance with regulations and plans. For this role it is recommended to appoint an employee with practical experience of process management, knowledge of ITSM, of the statistical and analytical methods used in IT and of cost management principles, experience of working with personnel, command of negotiation techniques, and so on.

Process implementation

The implementation of any ITSM process is a long and complex project with specific goals and deadlines. Implementing it in-house is difficult: running the project in parallel with daily operations does not allow the team to focus fully on it, and the constant diversion of resources to tasks outside the project ultimately increases costs, pushes deadlines back indefinitely, erodes attention and may even halt the project. In addition, in-house implementation requires expertise in the subject area, which entails costly training.

Like any project, process implementation starts with building the project team, developing project management documents, drawing up a project plan, and so on. At the "pre-design" stage, marketing activities are carried out to familiarize business representatives with ITIL technologies and recommendations and to justify to the business the need to implement a process for managing the availability of IT services.

Once agreement and a positive decision on implementing the process have been obtained, the goals and the boundaries of the process's subject area are defined.

Effect and problems

The main effect of process implementation is that IT services are designed with availability in mind and are operated and managed at an agreed level of availability and cost. Other positive factors include: a single person responsible for the availability of IT services; optimal use of IT infrastructure performance to deliver the required level of availability; a reduction over time in the frequency and duration of IT service outages; and a qualitative shift in the work of IT service providers from fixing errors in service delivery to raising availability.

Potential problems that can negatively influence decision-making on the implementation and operation of the process are usually organizational in nature:

    The existence of a situation where each IT manager is responsible for the availability of the IT systems or components in his area of responsibility, while the overall availability of IT services is not monitored and may be unsatisfactory;

    Refusal to implement the process because the current availability of IT services is considered acceptable;

    Assumptions that if other ITSM processes are in place, availability management will happen automatically;

    Resistance to centralization in IT infrastructure management by IT managers;

    Insufficient authority of the process manager, leading to the inability to perform duties properly.

Evgeny Bulychev (Bulychev@i-teco.ru) - Consultant of the I-Teco Business Consulting department (Moscow).

The idea for this article came out of a conversation with one of our large customers: a colleague told the story of how his company chose an IaaS cloud provider.

The first set of criteria for evaluating a service provider looked roughly like this: a well-known name (brand), a positive track record in cloud services, and a reasonable price. The analysis narrowed the field to several companies that were almost identical by these criteria, and each tried to prove its advantages by pointing to various characteristics of its cloud services.

Vladimir Kurilov, Onlanta company.

So the conversation turned to reliability indicators, and it revolved around comparing the availability levels of the data centers hosting the clouds. It quickly became clear that only two candidates had data centers with 99.98% availability. The choice fell on a foreign cloud service provider - the price won. The colleague explained it simply: "Why pay more for the same reliability figures?"

Given the variety of possible interpretations, let us define how the term "availability" is used in this article: availability is the system's uptime over a certain time interval, expressed as a percentage of that interval. Or, in the classical formulation, "the property of an object to perform a required function under given conditions over a given time interval" - which, in essence, is closer to the well-established concept of system "readiness".

The year of operation that followed this decision showed that the provider experiences minor disruptions in its data center engineering systems during planned switchovers. The availability of the data center stayed within the SLA, since a switchover takes seconds. However, if the customer's information system was not stopped in advance of such a switchover, the database had to be restored from a backup after the failure, which halted employees' work for several hours. Shutting the systems down before a switchover and restarting them afterwards improved the situation somewhat, but still left employees idle for 25-30 minutes, which also drew complaints from users.

A year on, the colleague is renting capacity in another cloud, where the availability of one of the data centers is lower than the figure above, yet downtime has dropped significantly. How is this achieved? What really matters when assessing the reliability of cloud solutions, and what does not? Where can you save, avoiding the risk of overpaying for "nice numbers" rather than actual reliability? How do you identify the cloud service parameters that are critical to the reliability of your application?

I will try to formulate the answers to these questions further.

Application reliability - how it stacks up in the cloud

Application service reliability

If we try to formulate a definition of application reliability, it would sound like this: "Reliability is the property of an application to remain operational over time with all of the functionality built into it."

What determines the performance of the application and how is the reliability of the application related to the availability of the data center?

The application runs on a software platform, which in turn sits on an infrastructure platform hosted by an engineering platform (see Fig.). Together these four levels deliver the "Application Service".


Fig. A simplified example of calculating the availability of the Application Service

As can be seen from the figure, we are dealing with a system of sequential elements, where the failure of any element leads to a failure of the system as a whole.

The availability of such a system (A_s) is defined as the product of the availability values of all its elements:


A_s = A_1 × A_2 × … × A_n,

where A_i is the availability of each serially connected component.

A_s = 0.99995 × 0.99995 × 0.993 × 0.998 ≈ 0.99091, i.e. 99.091%
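
To make the serial calculation reproducible, here is a minimal sketch that multiplies the availabilities of the four levels from the figure and converts the result into annual downtime; the level names and figures simply repeat the example above.

    # Availability of a serial chain: the product of the individual availabilities,
    # then converted into annual downtime hours.
    engineering = 0.99995        # engineering platform
    infrastructure = 0.99995     # infrastructure platform
    software_platform = 0.993    # software platform
    application = 0.998          # application

    a_service = engineering * infrastructure * software_platform * application
    downtime_hours = (1 - a_service) * 365 * 24
    print(f"Application Service availability ≈ {a_service:.5f} ({a_service * 100:.3f}%)")
    print(f"Annual downtime ≈ {downtime_hours:.1f} hours")   # roughly 80 hours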

As the figure shows, the availability of the Application Service differs markedly from the availability of the data center's engineering platform. Converting the availability figures into downtime makes this vivid: while the engineering platform's permissible annual downtime is about 1 hour 45 minutes, the Application Service in this example accumulates roughly 80 hours of annual downtime.

Accordingly, the high availability rate of a data center does not mean the same high reliability of application services operating in this data center.

Reliability of the application software

So, when choosing a service provider, should you focus on the aggregate availability of the application service? Unfortunately, it is not that simple.

It turns out that the software developer can substantially influence an application's reliability (its resilience to failures and load). For example, the reliability of an application in the cloud can be markedly improved by using specialized libraries designed to handle transient faults and delays in executed requests. Applications written in the usual way will show comparatively lower reliability.

One example of such a library from Microsoft is the Transient Fault Handling Application Block (see http://msdn.microsoft.com/en-us/library/hh680934(v=pandp.50).aspx).
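
The Microsoft block mentioned above is a .NET library; as a language-neutral illustration of the same idea, the sketch below wraps a call in retries with exponential backoff. It is only an assumed, minimal pattern, not the Microsoft implementation, and the wrapped call in the usage line is hypothetical.

    import random
    import time

    # Minimal retry-with-backoff wrapper for transient faults (timeouts, dropped
    # connections). Sketch only: real libraries add retry policies, circuit breakers, etc.
    def call_with_retries(operation, retries=5, base_delay=0.5,
                          transient=(TimeoutError, ConnectionError)):
        for attempt in range(1, retries + 1):
            try:
                return operation()
            except transient:
                if attempt == retries:
                    raise                                   # fault is not transient enough
                # exponential backoff with a little jitter before the next attempt
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

    # Usage sketch (hypothetical call):
    # result = call_with_retries(lambda: query_cloud_database("SELECT 1"))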

Reliability of the software platform

The reliability of the software platform - the operating system, drivers and libraries - likewise remains largely "on the developers' side" and, for now, does not depend strongly on the service provider. However, a service provider with a well-thought-out technical support policy can indirectly improve availability.

I mean basic "hygiene" measures. First, a system software update service: it should be part of the service provider's portfolio or, better still, included in the default service price. Second, an anti-virus protection service with a choice of anti-virus products. And third, backup of the customer's virtual servers. These are not the only ways to improve the availability of your Application Service, but they are the most important ones.

Infrastructure platform reliability

This component of reliability depends entirely on the service provider and should be assessed alongside the availability of the data center's engineering platform. Request this parameter from your provider - it is usually not listed in marketing materials - and ask for an explanation of how it was calculated.

Bear in mind, though, that not every service provider will be willing to share such data: the calculation reveals the structural diagram of the infrastructure solution and the equipment used, which is a certain amount of know-how.

However:

  • Ask for a diagram of the functional structure of the infrastructure platform to host your Application Service. It should include:
    • Network infrastructure;
    • Storage area network;
    • Computing infrastructure.
  • Ask to have the points of equipment redundancy marked on this diagram. The specific equipment used need not be indicated.
  • Ask for availability (or readiness) for each level.
  • Calculate availability as the product of the availabilities of the infrastructure platform's elements (a small sketch follows this list).
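
Following the checklist, a brief sketch of the last step: the infrastructure platform is itself a serial chain of network, storage network and compute layers, so its availability is again a product. The three figures are purely illustrative assumptions.

    # Infrastructure platform availability from its three layers (illustrative figures).
    network = 0.9999
    storage_network = 0.9995
    compute = 0.999

    infrastructure_platform = network * storage_network * compute
    print(f"Infrastructure platform availability ≈ {infrastructure_platform:.4f}")
    # This value then enters the overall Application Service product together with
    # the engineering platform, software platform and application levels.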

Now you can determine the availability of your application service fairly accurately. In our experience, 90% of service providers in Russia deliver a total availability of no more than 99%, which means a downtime risk of up to 87 hours a year. These are perfectly normal figures - unless you run business-critical applications for which an hour of downtime costs millions of dollars. If an hour-long stop would be a disaster for your business, there is the remaining 10% of service providers, who deliver enterprise-level service with Application Service availability around 99.99%. How this is achieved is described in the next section.

Solutions for High Availability of Application Service

Ultimately the customer does not care how the SLA for the engineering systems is met; what matters is the availability of his applications' service - that is, the guaranteed recovery time of the application.

The systems we discussed earlier had a serial structure, and the availability we calculated as the product of the individual elements is the technical ceiling such systems can reach. In practice, various additional factors push availability even lower. Remember the story at the beginning of the article about a power interruption lasting seconds that turned into hours of downtime?

Is it possible to increase the availability of an application if the availability parameters of a particular data center are set and cannot be changed?

The answer is you can.

For example, here are two approaches that allow you to do this:

  • Geographically distributed high availability cluster;
  • Recovery of processing in a geographically remote backup data center (Disaster recovery).

Fig. Structural diagram of a geographically distributed high availability cluster


Fig. Block diagram of processing recovery in a geographically remote backup data center

The first approach is ideal from the availability standpoint (service is restored within seconds), but it is expensive and rather difficult to implement. The second approach restores the service from a working copy: it is not as fast, and a small portion of the data will have to be restored manually after a failure, but it costs less and is easier to implement.

In both cases the data centers must be geographically remote from each other, to minimize the chance of shared dependencies - for example, being fed from the same power substations. Recall the blackout in the south-east of Moscow in May 2005 caused by the fire at the Chagino substation, or the New York blackout of 2003. The backup data center should therefore be located well away from the main one.

The two-data-center approach lets us build a system of parallel elements: on the one hand, the main and backup data centers are independent systems; on the other, they form a common platform for the application service - whichever data center the application is running in at the moment, it can move from one to the other.

The fundamental property of a parallel system is that reliability grows as parallel elements are added. The availability of a system of parallel elements can be calculated using the formula:

A_s = 1 - (1 - A_1) × (1 - A_2) × … × (1 - A_n),

where A_s is the total availability of the entire system and A_i is the availability of each component connected in parallel.

For example, let us calculate a geographically distributed high-availability cluster of two data centers, each with 99% availability.

A_s = 1 - (1 - 0.99) × (1 - 0.99) = 0.9999, i.e. 99.99%

That is, two data centers that are far from the most reliable can together provide availability at the level of mission-critical systems.
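
A minimal sketch of the parallel-availability formula with the same two 99% data centers; the helper function name is of course arbitrary.

    # Availability of a parallel system: it is down only when every element is down,
    # so A_s = 1 - product of (1 - A_i).
    def parallel_availability(*availabilities):
        unavailability = 1.0
        for a in availabilities:
            unavailability *= (1.0 - a)
        return 1.0 - unavailability

    a_cluster = parallel_availability(0.99, 0.99)
    print(f"Geo-cluster availability = {a_cluster:.4f} ({a_cluster * 100:.2f}%)")  # 0.9999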

For the option of restoring processing in a geographically remote backup data center, the availability of the application service in the case of a single failure is determined as follows: request from the service provider the guaranteed recovery time of the application service, calculate what fraction of the year it represents, and subtract the result from one. This gives the availability after the first failure. For example, for a system with a 15-minute synchronization interval:

The total number of hours in a year: 365 × 24 = 8760
Guaranteed recovery time = maximum downtime = 15 minutes, or 0.25 hours, which is ≈ 0.003% of the annual time

That is, each failure costs about 0.003% of availability: before any failure the system's availability is 100%, after the first failure 99.997%, after the second 99.994%. Let us do the same for a system with an hourly synchronization interval:

Guaranteed recovery time = maximum downtime = 1 hour, which is ≈ 0.01% of the annual time

Each failure costs about 0.01% of availability: before any failure availability is 100%, after the first failure 99.99%, after the second 99.98%. Devotees of probability theory can go on to estimate the likelihood of a first, second or third failure actually occurring; the result will show that this factor has a negligible effect on the figures obtained. This allows me to recommend the proposed methodology for assessing the availability of cloud services for your applications.
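
A short sketch of that calculation: availability after N failures when each failure costs at most the guaranteed recovery time (taken here as the synchronization interval). The function name and labels are assumptions for illustration; the text above rounds its percentages a little more coarsely.

    # Availability after N failures, each costing at most the guaranteed recovery time.
    HOURS_PER_YEAR = 365 * 24  # 8760

    def availability_after_failures(recovery_hours, failures):
        weight = recovery_hours / HOURS_PER_YEAR      # share of the year lost per failure
        return 1.0 - weight * failures

    for interval_h, label in [(0.25, "15-minute sync"), (1.0, "hourly sync")]:
        for n in (1, 2):
            a = availability_after_failures(interval_h, n)
            print(f"{label}: after {n} failure(s), availability ≈ {a * 100:.3f}%")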

In summary ...

  • Start by assessing how business-critical the application you plan to host in the cloud is. Estimate the cost of application downtime: how much would the absence of the application service cost you?
  • From this, estimate the acceptable downtime per day and per year, and calculate the critical availability level for the application service.
  • Compare the potential cost of downtime with the prices of service providers that offer suitable availability for your applications.
  • When choosing a service provider, give preference to one that can deliver not only the current level of availability but also, as an additional service, an improvement in availability - especially if your business is growing and developing.
  • And stay practical: whatever you are offered, try it for yourself - test it. Theory without practice is of little use to business.