博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
对警报线程池的警报线程_检测和警报SQL Server代理丢失的作业
阅读量:2511 次
发布时间:2019-05-11

本文共 33057 字,大约阅读时间需要 110 分钟。

对警报线程池的警报线程

摘要 (Summary)

While alerting on failed SQL Server Agent jobs is straightforward, being notified when a job is missed is not a simple or built-in feature.

虽然对失败SQL Server代理作业发出警报很简单,但是错过作业时收到通知不是简单或内置的功能。

If the SQL Server Agent is stopped due to an issue with the service, server, or a planned maintenance, any jobs that would have executed will be skipped without a warning or alert.  This could lead to important tasks never occurring.

如果由于服务,服务器或计划的维护问题而导致SQL Server代理停止,则将跳过所有已执行的作业,而不会发出警告或警报。 这可能导致重要任务永不发生。

In this article, we will build a solution that compares expected executions with actual job runs and reports on any differences found.  When automated, this can ensure that important jobs are never missed.

在本文中,我们将构建一个解决方案,将预期执行与实际作业运行进行比较,并报告发现的任何差异。 当自动化时,这可以确保重要的工作不会丢失。

先决条件 (Prerequisites)

This project requires a handful of pieces in order to function:

该项目需要几个步骤才能起作用:

  1. A process that will capture SQL Server Agent job run history and compare it to the results of the expected executions. That is the piece of the puzzle that we will focus on in this article

    将捕获SQL Server代理作业运行历史并将其与预期执行结果进行比较的过程。 这就是我们将在本文中重点讨论的难题

To simplify getting all of this working, all of the above scripts will be attached to this article so that you can download them and get started quickly.

为了简化所有工作,上面的所有脚本都将附加到本文,以便您可以下载它们并快速上手。

建立解决方案 (Building a Solution)

To find instances of missed SQL Server Agent jobs, we need to read the job history in MSDB and compare the results to the expected job run list. Before starting, we should define boundaries for our research. Typically, we will only ever analyze a relatively limited time span of job history, and the bounds of that date/time range should be given by parameters. For our discussion, we will define the following 2 parameters for this purpose:

若要查找丢失SQL Server代理作业的实例,我们需要读取MSDB中的作业历史记录,并将结果与​​预期的作业运行列表进行比较。 在开始之前,我们应该为研究确定界限。 通常,我们只会分析相对有限的工作历史时间跨度,并且该日期/时间范围的界限应由参数给出。 在我们的讨论中,我们将为此定义以下两个参数:

  • @start_time_local: The start time for the range we will analyze. This time is provided in local time to match up with the data stored in MSDB, which is all based on local server time. If the server is maintained in UTC, then these times will match UTC times as well @start_time_local :我们将分析范围的开始时间。 该时间以本地时间提供,以与存储在MSDB中的数据相匹配,这些数据均基于本地服务器时间。 如果服务器以UTC维护,那么这些时间也将与UTC时间匹配
  • @end_time_local: The end time for the range we will analyze @end_time_local :我们将分析的范围的结束时间

Boundaries are needed to ensure we do not process too much history and exceed the retained history data within MSDB. We also would not benefit from monitoring missed jobs from a week ago. Ideally, we would want to know about this problem sooner rather than later, and more frequently rather than less often. To keep things simple, we’ll examine a half hour of data at a time whenever we run our process. If the job runs at 11:00am, we will check all jobs that ran and should have run between 10:30am and 11:00am.

需要边界以确保我们不会处理太多的历史记录并超过MSDB中保留的历史记录数据。 我们也不会从一周前监视错过的工作中受益。 理想情况下,我们希望尽早而不是以后,并且更频繁而不是更少地了解此问题。 为简单起见,每当我们运行流程时,我们将一次检查半小时的数据。 如果作业在上午11:00运行,我们将检查所有已运行且应在上午10:30至上午11:00之间运行的作业。

Our first task is to collect the job run history for a given server. All of the data needed for this task can be found in MSDB tables:

我们的第一个任务是收集给定服务器的作业运行历史记录。 可以在MSDB表中找到此任务所需的所有数据:

  • Msdb.dbo.sysjobs: Contains a row per job as defined in SQL Server Agent, along with details Msdb.dbo.sysjobs :包含每个作业在SQL Server代理中定义的行以及详细信息
  • Msdb.dbo.sysjobhistory: Contains a row per job and job step as they are executed. Since this table contains both overall job success and step details, the Msdb.dbo.sysjobhistory :每个作业和作业步骤在执行时均包含一行。 由于此表包含总体作业成功和步骤详细信息,因此使用step_id is used to distinguish between each. A step_id进行区分。 step_id of 0 indicates the overall job completion info, whereas other steps indicate the completion details of each step. Errors, as well as completion times are included here step_id为0表示总体作业完成信息,而其他步骤则表示每个步骤的完成详细信息。 错误以及完成时间都包含在此处
  • Msdb.dbo.syscategories: Job categories are defined here, both system categories and customized ones that we create Msdb.dbo.syscategories :此处定义了作业类别,包括系统类别和我们创建的自定义类别
  • Msdb.dbo.sysjobactivity: Contains rows for the next queued future job run for each job. This can be used to check if a job is currently in the processes of executing and the progress it has made Msdb.dbo.sysjobactivity:包含每个作业的下一个排队的未来作业运行的行。 这可用于检查作业当前是否正在执行中以及其进度

Using some of these tables, we can compile a list of all executed jobs present in MSDB’s history:

使用其中一些表,我们可以编译MSDB历史记录中存在的所有已执行作业的列表:

SELECT    *  FROM msdb.dbo.sysjobhistory  INNER JOIN msdb.dbo.sysjobs  ON sysjobs.job_id = sysjobhistory.job_id  INNER JOIN msdb.dbo.syscategories  ON syscategories.category_id = sysjobs.category_id  WHERE sysjobs.enabled = 1

The results provide a row per job execution, as well as a row per step:

结果提供每个作业执行一行,以及每个步骤一行:

If step details are unneeded and all you care about is overall job completion details, then you may filter out all but step_id = 0. Note that the detailed error/warning messages associated with a job will be stored within the row associated with the step that generated an error. For example, if a job fails on step 3 with a foreign key violation, then the specifics of that error will be present in the history row for step_id = 3 only, and not in the overall job data for step_id = 0.

如果不需要步骤详细信息,而您只关心整体作业完成详细信息,则可以过滤除step_id = 0之外的所有内容。请注意,与作业相关的详细错误/警告消息将存储在与步骤相关的行中产生一个错误。 例如,如果某作业在步骤3中因外键冲突而失败,那么该错误的详细信息将仅在step_id = 3的历史记录行中显示,而在step_id = 0的整个作业数据中不存在。

We can adjust the above query to bound our data based on a date range:

我们可以调整以上查询以根据日期范围限制数据:

SELECT    sysjobs.job_id,    sysjobs.name AS job_name,    sysjobhistory.step_id,    *  FROM msdb.dbo.sysjobhistory  INNER JOIN msdb.dbo.sysjobs  ON sysjobs.job_id = sysjobhistory.job_id  INNER JOIN msdb.dbo.syscategories  ON syscategories.category_id = sysjobs.category_id  WHERE sysjobs.enabled = 1  AND msdb.dbo.agent_datetime(run_date, run_time) >= @start_time_local  AND msdb.dbo.agent_datetime(run_date, run_time) <= @end_time_local

By setting bounds based on a start and end time, we can ensure we only consider recent data that is relevant to our work. A job that ran a week ago is no longer needed for researching recent missed jobs, nor do we want to scan the same data more than once. The function msdb.dbo.agent_datetime allows us to quickly convert between the integer representations of dates within MSDB and the DATETIME data type that we will want to use for any work we do.

通过基于开始时间和结束时间设置界限,我们可以确保仅考虑与工作相关的最新数据。 研究最近丢失的工作不再需要一周前运行的工作,我们也不想多次扫描相同的数据。 msdb.dbo.agent_datetime函数使我们能够在MSDB中日期的整数表示形式和我们将要用于任何工作的DATETIME数据类型之间快速转换。

In addition to job history, we also want to verify if any jobs are currently running. If so, we don’t want to inadvertently report them as missed as they should run, but have yet to complete. The following query will return data on any currently running jobs that started during the period of time we are examining:

除了作业历史记录,我们还想验证当前是否正在运行任何作业。 如果是这样,我们不想无意中将它们报告为应该运行而错过了,但尚未完成。 以下查询将返回在我们检查的时间内开始的任何当前正在运行的作业的数据:

SELECT    *  FROM msdb.dbo.sysjobactivity  INNER JOIN msdb.dbo.sysjobs  ON sysjobs.job_id = sysjobactivity.job_id  WHERE sysjobactivity.start_execution_date IS NOT NULL  AND sysjobactivity.stop_execution_date IS NULL  AND sysjobactivity.start_execution_date BETWEEN @start_time_local AND @end_time_local;

Note that we can determine if a job is running based on the existence of an execution start date, but without a stop date. The results will look like this:

注意,我们可以根据执行开始日期的存在来确定作业是否正在运行,而没有停止日期。 结果将如下所示:

With these 2 queries, we can capture a list of all jobs that are running or that have recently executed and completed.

通过这两个查询,我们可以捕获所有正在运行或最近已执行和完成的作业的列表。

The last component needed to begin analyzing missed jobs is a list of expected runtimes based on job schedules. This is a bit of an involved project, so much so that I dedicated . If you already have a solution that returns this data, feel free to insert it into the process in place of mine. Code in this article will utilize the parameter structure of the script introduced above, though, so some slight adjustments will be needed if another stored procedure is to be used with it.

开始分析丢失的作业所需的最后一个组件是基于作业计划的预期运行时间列表。 这是一个涉及到的项目,太多了,因此我专门写 。 如果您已经有返回此数据的解决方案,请随时将其插入到我的位置中。 但是,本文中的代码将利用上面介绍的脚本的参数结构,因此,如果要使用另一个存储过程,则需要进行一些细微的调整。

We’ll briefly introduce the syntax for this stored procedure before moving on to detecting missed jobs:

在继续检测丢失的作业之前,我们将简要介绍该存储过程的语法:

EXEC dbo.generate_job_schedule_data  @start_time_utc = NULL,  @end_time_utc = NULL,  @start_time_local = '1/21/2019 00:00:00',  @end_time_local = '1/21/2019 02:00:00',  @return_summarized_data = 1,  @include_startup_and_idle_jobs_in_summary_data = 0;

6 parameters are defined for the stored procedure dbo.generate_jbo_schedule_data:

为存储过程dbo.generate_jbo_schedule_data定义了6个参数:

  • @start_time_utc: The beginning of the time frame to monitor, in UTC @start_time_utc :要监视的时间范围的开始(UTC)
  • @end_time_utc: The end of the time frame to monitor, in UTC @end_time_utc :要监视的时间范围的结束(UTC)
  • @start_time_local: The beginning of the time frame to monitor, in the local server time zone @start_time_local :要监视的时间范围的开始,在本地服务器时区中
  • @end_time_local: The end of the time frame to monitor, in the local server time zone @end_time_local :要监视的时间范围的结束,在本地服务器时区中
  • @return_summarized_data: When set to 1, will return a row per job with a grouped summary of jobs, run counts, etc…When set to 0, it will return a row per job execution. While the detail data can be useful, it will also become very large on a server with lots of jobs or jobs with very frequent execution. Our solution will summarize results to keep things fast and efficient @return_summarized_data :设置为1时,将为每个作业返回一行,并包含分组的作业摘要,运行计数等…设置为0时,将为每个作业执行返回一行。 尽管详细数据可能有用,但在具有许多作业或执行频率很高的作业的服务器上,详细数据也会变得非常大。 我们的解决方案将汇总结果,以确保事情快速高效
  • @include_startup_and_idle_jobs_in_summary_data: When set to 1, will include a row for jobs configured to run on SQL Server Agent startup or when the server is idle. We do not need this here, so it will be set to 0 @include_startup_and_idle_jobs_in_summary_data :设置为1时,将为配置为在SQL Server代理启动时或服务器空闲时运行的作业包括一行。 我们在这里不需要这个,所以它将被设置为0

The datetime ranges can be provided in local time or UTC time and the proc will manage conversions as needed via the SQL server’s UTC offset. This provides some level of convenience, depending on who or what is going to run this process.

可以使用本地时间或UTC时间提供datetime范围,并且proc将根据需要通过SQL Server的UTC偏移量管理转换。 根据要运行此过程的人员或对象,这会提供一定程度的便利。

With these 3 pieces, we can create a stored procedure that will compare job history to expected job executions and determine if anything is missing and alert on it.

通过这三部分,我们可以创建一个存储过程,该过程将作业历史记录与预期的作业执行进行比较,并确定是否缺少任何内容并对其发出警报。

检测丢失的工作 (Detecting Missed Jobs)

Our task is to create two lists of job runs and compare them. If the expected job run count for any given job is higher than the actual job run count (including in-process executions), then we have a problem that is potentially worth alerting on. For the purposes of this demo, we will analyze a half-hour time period from 5 minutes ago through 35 minutes ago. Not including the immediately preceding 5 minutes reduces the noise from jobs that are executing at the moment that this analysis is performed.

我们的任务是创建两个作业运行列表并进行比较。 如果任何给定作业的预期作业运行计数都高于实际作业运行计数(包括进程内执行),则我们有可能值得提醒的问题。 就本演示而言,我们将分析从5分钟前到35分钟前的半小时时间段。 不包括紧接的前5分钟,可减少执行此分析时正在执行的作业的噪音。

To enumerate the time period that we will process, we’ll begin our process with two parameters:

为了枚举我们将要处理的时间段,我们将从两个参数开始我们的过程:

@start_time_local DATETIME,  @end_time_local DATETIME

These are self-explanatory and bound our data analysis to the specific period of time that we defined above.\

这些是不言自明的,并将我们的数据分析限制在我们上面定义的特定时间段内。\

To store data for this time period, we’ll create two temporary tables to store these metrics so that we can easily compare the two:

为了存储该时间段内的数据,我们将创建两个临时表来存储这些指标,以便我们可以轻松地比较这两个指标:

CREATE TABLE #job_schedule_metrics  ( job_id UNIQUEIDENTIFIER,    job_name VARCHAR(128) NOT NULL,    first_job_run_time_utc DATETIME NOT NULL,    last_job_run_time_utc DATETIME NOT NULL,    first_job_run_time_local DATETIME NOT NULL,    last_job_run_time_local DATETIME NOT NULL,    count_of_events_during_time_period INT NOT NULL,    job_schedule_description VARCHAR(250) NOT NULL  );

This table will contain the output of the stored procedure dbo.generate_job_schedule_data and will be summarized so that we only get a row per job, rather than a full list of all possible executions. This will be the full list of SQL Server Agent job executions that were supposed to occur during a given time period.

该表将包含存储过程dbo.generate_job_schedule_data的输出,并将对其进行汇总,以便我们仅对每个作业获得一行,而不是所有可能执行的完整列表。 这将是应该在给定时间段内执行SQL Server代理作业执行的完整列表。

CREATE TABLE #job_run_history_metrics  ( job_id UNIQUEIDENTIFIER,    job_name VARCHAR(128) NOT NULL,    run_count INT NOT NULL);

This temporary table will also contain a row per job for each execution that actually occurred and was logged in MSDB for a given time period.

对于给定时间段内在MSDB中实际发生并记录的每个执行,该临时表还将为每个作业包含一行。

Let’s start out by collecting metrics on job history:

让我们开始收集关于工作历史的指标:

WITH CTE_STEP_COUNTS AS (    SELECT      sysjobs.job_id,      sysjobs.name AS job_name,      sysjobhistory.step_id,      COUNT(*) AS job_step_count    FROM msdb.dbo.sysjobhistory    INNER JOIN msdb.dbo.sysjobs    ON sysjobs.job_id = sysjobhistory.job_id    INNER JOIN msdb.dbo.syscategories    ON syscategories.category_id = sysjobs.category_id    WHERE sysjobs.enabled = 1    AND msdb.dbo.agent_datetime(run_date, run_time) >= @start_time_local    AND msdb.dbo.agent_datetime(run_date, run_time) <= @end_time_local    GROUP BY sysjobs.name, sysjobs.job_id, sysjobhistory.step_id)  INSERT INTO #job_run_history_metrics    (job_id, job_name, run_count)  SELECT    CTE_STEP_COUNTS.job_id,    CTE_STEP_COUNTS.job_name,    MAX(CTE_STEP_COUNTS.job_step_count) AS run_count  FROM CTE_STEP_COUNTS  GROUP BY CTE_STEP_COUNTS.job_name, CTE_STEP_COUNTS.job_id;

This query provides a bit of additional detail to ensure that jobs in progress are not included (yet). By counting the occurrence of each completed step in each job, we can determine which is most frequent, which ultimately represents the number of times a job was executed, even if some subsequent steps are still awaiting execution.

该查询提供了一些额外的详细信息,以确保不包括正在进行的作业(尚未)。 通过计算每个作业中每个已完成步骤的发生次数,我们可以确定哪个频率最高,即使最终的某些步骤仍在等待执行,它也最终代表了作业执行的次数。

We still want to include running jobs as an addition to this data. This provides an optimistic interpretation of job data, assuming by default that a job in progress will complete. This is acceptable as our goal is to report missed jobs and not failed/cancelled jobs. We also automatically include manual job runs in this count as an assumption that if an operator executes a job outside of the bounds of a typical schedule that it should count towards the history for a given time period.

我们仍然希望将正在运行的作业作为此数据的补充。 在默认情况下,假设正在进行的作业将完成,这提供了作业数据的乐观解释。 这是可以接受的,因为我们的目标是报告丢失的作业,而不是失败/取消的作业。 我们还自动将手动作业运行包括在此计数中,以作为假设,如果操作员在典型计划的范围之外执行作业,则该作业应计入给定时间段的历史记录。

The following query will check exclusively for SQL Server Agent jobs that are currently executing and if so increment the job count by 1:

以下查询将专门检查当前正在执行SQL Server代理作业,如果是,则将作业计数增加1:

MERGE INTO #job_run_history_metrics AS TARGET  USING (    SELECT      sysjobactivity.job_id,      sysjobs.name AS job_name    FROM msdb.dbo.sysjobactivity    INNER JOIN msdb.dbo.sysjobs    ON sysjobs.job_id = sysjobactivity.job_id    WHERE sysjobactivity.start_execution_date IS NOT NULL    AND sysjobactivity.stop_execution_date IS NULL    AND sysjobactivity.start_execution_date BETWEEN @start_time_local AND @end_time_local    GROUP BY sysjobactivity.job_id, sysjobs.name) AS SOURCE  ON (SOURCE.job_id = TARGET.job_id)  WHEN MATCHED    THEN UPDATE      SET run_count = run_count + 1  WHEN NOT MATCHED BY TARGET    THEN INSERT      (job_id, job_name, run_count)    VALUES  (      SOURCE.job_id,      SOURCE.job_name,      1);

Since any given job may only be executing at any given point in time, incrementing by one is sufficient for our needs.

由于任何给定的作业只能在任何给定的时间点执行,因此增加一个就足以满足我们的需求。

Populating expected scheduled job metrics is a simple task as we have already tackled a stored procedure to provide this data:

填充预期的计划作业指标是一项简单的任务,因为我们已经解决了存储过程以提供以下数据:

INSERT INTO #job_schedule_metrics    (job_id, job_name, first_job_run_time_utc, last_job_run_time_utc, first_job_run_time_local,       last_job_run_time_local, count_of_events_during_time_period, job_schedule_description)  EXEC dbo.generate_job_schedule_data @start_time_local = @start_time_local, @end_time_local = @end_time_local, @return_summarized_data = 1,  @include_startup_and_idle_jobs_in_summary_data = 0;

This populates the temp table with a row per job that indicates how many times the job should run during the time period specified. Being able to encapsulate the logic of this process into a single stored procedure and reuse it as-needed here is a big convenience and is worth implementing in this fashion, regardless of how you crunch this data.

这将在临时表中填充每个作业的行,该行指示在指定的时间段内该作业应运行多少次。 能够将此过程的逻辑封装到单个存储过程中,并按需要在此处重新使用它,这是一个很大的便利,并且无论您如何处理这些数据,都值得以这种方式实现。

At this point, we have everything we need and can compare our two data sets to determine if any jobs are missed:

至此,我们拥有了所需的一切,并且可以比较我们的两个数据集以确定是否遗漏了任何作业:

SELECT    job_schedule_metrics.*,    ISNULL(job_run_history_metrics.run_count, 0) AS run_count  FROM #job_schedule_metrics job_schedule_metrics  LEFT JOIN #job_run_history_metrics job_run_history_metrics  ON job_run_history_metrics.job_id = job_schedule_metrics.job_id  WHERE job_run_history_metrics.job_id IS NULL  OR (job_run_history_metrics.job_id IS NOT NULL AND job_run_history_metrics.run_count < job_schedule_metrics.count_of_events_during_time_period);

We left join our job history to ensure that we account for both jobs that have missed some runs and those that missed all runs (and therefore have no history data to join).

我们保留了加入工作历史记录的权限,以确保我们对错过了某些运行的工作和错过了所有运行的工作进行了说明(因此没有要加入的历史记录数据)。

If no jobs are missed, then this query will return no data and an IF EXISTS check against it will return 0. If jobs are missed, then we will get a set of data that indicates when the job should have run, how many times it should have run, and the actual number of executions. Here is a sample of missed job execution data:

如果没有遗漏任何作业,则此查询将不返回任何数据,并且针对该查询的IF EXISTS检查将返回0。如果遗漏了作业,则我们将获得一组数据,该数据指示作业应在何时运行,运行了多少次应该已经运行,以及实际的执行次数。 这是错过的作业执行数据的示例:

We can see that on my test server there are 2 jobs that should have executed, but did not. We include the schedule details as well, which assists in diagnosing what should have happened (but did not). Lastly, if we want, we can dump the missed job data into a temporary table and construct an HTML email based on that data:

我们可以看到,在我的测试服务器上,应该执行2个作业,但没有执行。 我们还包括时间表详细信息,这有助于诊断应该发生什么(但没有发生)。 最后,如果需要,我们可以将错过的工作数据转储到临时表中,并根据该数据构造HTML电子邮件:

DECLARE @profile_name VARCHAR(MAX) = 'Default Public Profile';    DECLARE @email_to_address VARCHAR(MAX) = 'ed@myemail.com';    DECLARE @email_subject VARCHAR(MAX);    DECLARE @email_body VARCHAR(MAX);     SELECT @email_subject = 'Missed Job Alert: ' + ISNULL(@@SERVERNAME, CAST(SERVERPROPERTY('ServerName') AS VARCHAR(MAX)));      SELECT @email_body = 'At least one job run was missed on ' + ISNULL(@@SERVERNAME, CAST(SERVERPROPERTY('ServerName') AS VARCHAR(MAX))) + ' during the period ' + CAST(@start_time_local AS VARCHAR(MAX)) + ' to ' + CAST(@end_time_local AS VARCHAR(MAX)) + ':   
  
    
    
    
    
  
';    SELECT @email_body = @email_body + CAST((SELECT missed_jobs.job_name AS 'td', '',                      missed_jobs.job_schedule_description AS 'td', '',                      CAST(missed_jobs.count_of_events_during_time_period AS VARCHAR(MAX)) AS 'td', '',                      CAST(ISNULL(missed_jobs.run_count, 0) AS VARCHAR(MAX)) AS 'td'    FROM #missed_jobs missed_jobs    ORDER BY missed_jobs.job_name ASC, missed_jobs.job_schedule_description    FOR XML PATH('tr'), ELEMENTS) AS VARCHAR(MAX));     SELECT @email_body = @email_body + '
Job Name Job Schedule Description Expected job run count Actual job run count
';    SELECT @email_body = REPLACE(@email_body, '', '');     EXEC msdb.dbo.sp_send_dbmail      @profile_name = @profile_name,      @recipients = @email_to_address,      @subject = @email_subject,      @body_format = 'html',      @body = @email_body;

This will provide a nicely formatted email that includes all of the columns returned above. If email is your alerting method of choice, then this is a good start on turning missed job metrics into an email. Here is an example of what that email will look like:

这将提供格式精美的电子邮件,其中包括上面返回的所有列。 如果电子邮件是您选择的警报方法,那么这是将错过的工作指标转换为电子邮件的良好起点。 这是该电子邮件的外观示例:

Of course, email is not the only way to alert on missed jobs. Some alternate ideas include:

当然,电子邮件不是警报丢失作业的唯一方法。 一些替代的想法包括:

  • This data could be sent to a reporting server for regular processing

    可以将这些数据发送到报告服务器进行常规处理
  • A monitoring service can consume this data and return missed job runs when found

    监视服务可以使用此数据,并在找到后返回错过的作业运行
  • A central server can execute the checks against MSDB on a set of target servers, thus removing the need for decentralized alerts

    中央服务器可以在一组目标服务器上针对MSDB执行检查,从而消除了分散警报的需求
  • Powershell can manage this process

    Powershell可以管理此过程

Depending on your database environment, there may be other possibilities as well. Tailor your own processes to fit the data and make best use of your existing tools.

根据您的数据库环境,还可能存在其他可能性。 量身定制自己的流程以适应数据并充分利用现有工具。

此过程的目标工作是什么? (What are Target Jobs for This Process?)

Not all jobs are created equal and we do not need to monitor every single job for completeness. Only critical jobs for which a missed run is problematic should be considered.

并非所有工作都是平等创建的,我们不需要监视每个工作的完整性。 仅应考虑错过错过的问题的关键工作。

For example, missing a transaction log backup can typically be tolerated as the next run of this job will catch up where the previous one left off and continue normally from that point on. Missing a full backup, though, can result in adverse side-effects, such as bloated differential backups and unnecessarily long backup chains.

例如,通常可以容忍丢失事务日志备份,因为该作业的下一次运行将追赶上一次停止的工作,并从该点开始正常继续。 但是,缺少完整的备份会导致不利的副作用,例如过分的差异备份和不必要的长备份链。

The following is a short list of jobs that benefit most from a process such as this:

以下是从诸如此类的过程中受益最大的工作的简短列表:

  • Full backup jobs. Possibly differential backups, if infrequent

    完整备份作业。 如果不经常进行差异备份
  • Reports that generate and are expected at specific times

    在特定时间生成和预期的报告
  • ETL processes that move/transform data at specific points in time

    在特定时间点移动/转换数据的ETL流程
  • Data warehousing or analytics tasks

    数据仓库或分析任务

These tasks are all run relatively infrequently and are time-sensitive. A data collection process that needs to take a point-in-time snapshot at a certain time of day will become stale as time passes after its expected runtime. Missed backups or critical maintenance processes can have unintended consequences for performance, business SLAs, RPO (recovery point objective), or RTO (recovery time objective).

这些任务都相对不频繁运行,并且对时间敏感。 需要在一天的特定时间拍摄时间点快照的数据收集过程会随着其预期运行时间的流逝而变得陈旧。 丢失的备份或关键维护流程可能会对性能,业务SLA,RPO(恢复点目标)或RTO(恢复时间目标)产生意想不到的后果。

While all jobs can be monitored, some will benefit more than others. Worse, the noise of low-priority jobs being missed and alerted on can be a distraction when an unexpected outage occurs. Consider categorizing or filtering SQL Server Agent jobs so that only important ones are monitored for missed runs.

虽然可以监控所有作业,但有些作业将比其他作业受益更多。 更糟糕的是,当发生意外中断时,低优先级作业的丢失和警报可能会分散注意力。 考虑对SQL Server代理作业进行分类或筛选,以便仅监视重要的作业是否丢失了运行。

Categories can be added easily via TSQL:

可以通过TSQL轻松添加类别:

EXEC msdb.dbo.sp_add_category      @class = 'JOB',    @type = 'LOCAL',    @name = 'Unmonitored';

The result will be a new category that may be chosen when creating or modifying a job:

结果将是在创建或修改作业时可以选择的新类别:

Categories are stored in msdb.dbo.syscategories and we can view jobs that are associated with that query by joining msdb.dbo.sysjobs to that table:

类别存储在msdb.dbo.syscategories中 ,我们可以通过将msdb.dbo.sysjobs加入该表来查看与该查询关联的作业:

SELECT  *FROM msdb.dbo.sysjobsINNER JOIN msdb.dbo.syscategoriesON syscategories.category_id = sysjobs.category_idWHERE syscategories.name = 'Unmonitored';

The results show the job that I created above:

结果显示了我在上面创建的工作:

Alternatives to this would include:

替代方法包括:

  • Having specific parts of a job’s name indicate how monitoring tools treat them

    作业名称中的特定部分指示监视工具如何对待它们
  • Maintain a metadata table for jobs that should or should receive additional monitoring

    维护应或应该接受其他监视的作业的元数据表
  • Apply monitoring to all jobs indiscriminately

    不加选择地对所有作业进行监控

The correct choice depends on your environment and the style of jobs and loads that are processed. A report server with many infrequent (but important) jobs may benefit from a high level of failure/missed job reporting. A server that runs highly-resilient jobs constantly may not need to monitor for missed runs as they would catch up a soon as the server is returned to its typical operating state.

正确的选择取决于您的环境以及要处理的作业和负载的样式。 具有很多不频繁(但很重要)作业的报表服务器可能会从高水平的失败/丢失作业报告中受益。 不断运行高弹性作业的服务器可能不需要监视错过的运行,因为它们会在服务器返回其典型运行状态后立即赶上。

陷阱 (Gotchas)

As with any monitoring process, we should consider the target set of jobs that we care about most with respect to catching missed jobs. There is no sense in monitoring anything unless there is a valid reason to do so. Here are some considerations for implementing this process (or one like it):

与任何监视过程一样,我们应该在捕获丢失的工作方面考虑我们最关心的目标工作集。 除非有正当理由,否则监视任何事物都是没有意义的。 以下是实现此过程(或类似过程)的一些注意事项:

SQL Server代理长时间不可用 (SQL Server Agent Unavailable for Extended Period of Time)

If a server is unavailable for a long time and is not due to planned maintenance, then we would want to rely on other alerting mechanisms to find out, such as the SQL Server Agent service being unavailable or unreachable.

如果服务器长时间不可用并且不是由于计划的维护而引起的,那么我们将希望依靠其他警报机制来发现,例如SQL Server代理服务不可用或不可用。

We should never rely on an alert for a purpose it is not intended for. A missed job alert should not be used as a method of alerting to a server or service being down.

我们绝不应将警报用于非预期目的。 错过的作业警报不应用作警报服务器或服务已关闭的方法。

SQL Server代理历史记录保留 (SQL Server Agent History Retention)

Job execution history is only as valuable as the amount of data that is retained. You may configure the retention policy via TSQL or by clicking on the properties for SQL Server Agent:

作业执行历史记录仅与保留的数据量一样有价值。 您可以通过TSQL或单击SQL Server代理的属性来配置保留策略:

From here, we can adjust job history retention per job and also for the server as a whole. If job history falls off of the end of our retention period before the missed job alert reads it, then we will see false alerts.

从这里,我们可以调整每个作业以及整个服务器的作业历史记录保留时间。 如果作业历史记录在我们的保留期结束前丢失了,而错过的工作警报将无法读取,那么我们将看到错误的警报。

In general, it is a good idea to maintain more than enough history to ensure that any history data needed for monitoring processes, operators, or for troubleshooting is available.

通常,最好保留足够多的历史记录,以确保监视过程,操作员或故障排除所需的任何历史记录数据可用。

长期不一致的工作 (Long-Running Inconsistent Jobs)

Some jobs are designed to run very often, but may essentially perform no-ops most of the time. The few times that work is needed, the job could run for minutes or hours. This will result in the job either being seen as completed successfully or in-progress. Using a semaphore to initiate a SQL Server Agent job process puts more power into metadata or code, but makes monitoring it difficult.

有些作业被设计为经常运行,但在大多数情况下可能实际上不执行任何操作。 需要几次工作,该工作可能会持续数分钟或数小时。 这将导致该作业被视为成功完成或正在进行中。 使用信号量来启动SQL Server代理作业过程会使元数据或代码具有更多功能,但使监视变得困难。

Any job that runs exceedingly often may be a good candidate to not be monitored for missed jobs. Instead, building them to be resilient may be more valuable. If it fails, we care and will respond, but otherwise can know that it runs often enough to tolerate server outages or maintenance.

经常运行的任何作业都可能是很好的候选者,不会因错过的作业而受到监视。 相反,将它们构建为具有弹性的可能更有价值。 如果失败,我们会在意并会做出响应,但否则可以知道它运行的频率足以承受服务器故障或维护。

失败/取消的工作 (Failed/Cancelled Jobs)

We have covered missed jobs in detail, but could extend monitoring to provide additional metrics on failed jobs. SQL Server Agent allows jobs to be configured with failure actions, so this is a relatively straightforward scenario to alert on. That being said, a server with many jobs may spit out a ton of noise and for a busy server a more centralized approach may be beneficial. If this sounds interesting or useful, take a look at an article I wrote that covers a process to monitor and report on failed jobs:

我们已经详细介绍了错过的工作,但是可以扩展监视范围,以提供有关失败工作的其他指标。 SQL Server代理允许使用失败操作配置作业,因此这是一个相对简单的警报场景。 话虽这么说,具有许多工作的服务器可能会发出很多噪音,而对于繁忙的服务器,更集中的方法可能会有所帮助。 如果这听起来有趣或有用,那么请看一下我写的一篇文章,其中涵盖了监视和报告失败作业的过程:

An email for all job failures in a given time period may be cleaner and easier to manage than an email per failed job. This also provides us the ability to store the details of failures in a table as a permanent resource, in case email or network communications are unavailable at any given point in time.

给定时间段内所有作业失败的电子邮件比每个失败作业的电子邮件更干净,更易于管理。 如果在任何给定的时间点电子邮件或网络通信不可用,这还使我们能够将故障的详细信息存储在表中作为永久资源。

遗失工作的存储 (Storage of Missed Jobs)

If desired, a table may be created to store the details of missed jobs. This provides some additional logging in the event that email, a network, or other resources are unavailable to let us know of missed jobs when they are detected.

如果需要,可以创建一个表来存储丢失的作业的详细信息。 如果电子邮件,网络或其他资源不可用,则在检测到丢失的作业时让我们知道丢失的作业,这可以提供一些额外的日志记录。

结论 (Conclusion)

Missed jobs can be a serious problem when important reports, data loads, or backups fail to execute. By default, SQL Server does not monitor or alert on a job that was missed due to an outage, planned maintenance, or a server problem.

当重要的报表,数据加载或备份无法执行时,丢失的作业可能是一个严重的问题。 默认情况下,SQL Server不会监视由于停机,计划维护或服务器问题而丢失的作业或向其发出警报。

This can be problematic in any database environment, whether in an on-premises version of SQL Server or in Azure. Notifying an operator about a failed job condition is critically important. Equally important is to consider what to do when jobs are missed.

在任何数据库环境中,无论是在内部版本SQL Server中还是在Azure中,这都可能会出现问题。 通知操作员失败的工作条件至关重要。 同样重要的是要考虑错过工作时该怎么办。

As an additional thought experiment, we can consider making jobs more resilient so that they can easily pick up where they left off without missing out on any critical processing. For example, if a full backup does not run as expected, a backup process could detect that and run it at the next available/acceptable time.

作为一个额外的思想实验,我们可以考虑使作业更具弹性,以便他们可以轻松地从中断的地方接回工作,而不会错过任何关键的处理。 例如,如果完全备份未按预期运行,则备份过程可以检测到该情况并在下一个可用/可接受的时间运行它。

By monitoring and alerting on missed jobs, we can plug a gaping hole that is present in many SQL Server environments by default and allow us to manage maintenance, outages, or unusual events with less concern about what might have been missed during those scenarios.

通过监视和提醒丢失的作业,我们可以填补默认情况下许多SQL Server环境中存在的漏洞,并允许我们管理维护,中断或异常事件,而不必担心在这些情况下可能丢失的内容。

检视问题 (Review questions)

Q: How can I exclude specific jobs or groups of jobs?

问:如何排除特定的工作或工作组?

A: Job categories can be used to classify jobs as special with regards to monitoring. Once classified, processes that read or act on job data can automatically filter them out. Other filters can be applied up front, such as those based on schedules, business hours, or priority.

答:作业类别可用于将作业分类为与监控有关的特殊作业。 分类后,读取或处理作业数据的流程可以自动将其过滤掉。 可以预先应用其他过滤器,例如基于计划,工作时间或优先级的过滤器。

Q: A job is indicated as missed, but isn’t! Why?

问:作业被指示为已错过,但不是! 为什么?

A: This is usually the result of jobs with unusually long runtimes. If a job runs long enough to miss its next execution time, then it will appear as missed, even if it isn’t. Similarly, jobs that run indefinitely will miss schedules naturally as well.

答:这通常是作业时间异常长的结果。 如果作业运行时间足够长而错过了下一个执行时间,那么即使没有,它也会显示为已错过。 同样,无限期运行的作业也会自然地错过计划。

Assuming these types of jobs are resilient enough to tolerate these irregularities, then there is no need to monitor them for missed executions and they should be excluded from our process.

假设这些类型的工作具有足够的弹性以容忍这些不正常现象,那么就无需监视它们是否错过了执行,因此应将它们排除在我们的流程之外。

Q: Frequently executing jobs are showing as missed when they are not. What can I do about this?

问:经常执行的作业在未显示时显示为丢失。 我该怎么办?

A: SQL Server Agent maintains a history of job executions per job and overall as configured in the SQL Server Agent properties. If a job runs very often, it is possible that its history is being cleaned up before our process has a chance to read and process it.

答:SQL Server代理会维护每个作业以及整个SQL Server代理属性中配置的作业执行历史记录。 如果一项工作非常频繁地运行,则有可能在我们的过程有机会读取和处理它之前对其历史进行了清理。

There are multiple valid solutions to this, such as:

有多种有效的解决方案,例如:

  1. Adjust job history retention in the SQL Server Agent properties to maintain more job runs

    调整SQL Server代理属性中的作业历史记录保留,以维护更多的作业运行
  2. Adjust our process to run more often

    调整流程以使其更频繁地运行
  3. If the job in question is resilient towards missed runs, then you may simply want to exclude it from monitoring missed runs

    如果有问题的作业可以应对错失的运行,那么您可能只想将其排除在监视错失的运行之外
  4. (Code)

    Download T-SQL code associated with this article

    下载与本文相关的T-SQL代码

    目录 (Table of contents)

    Detecting and Alerting on SQL Server Agent Missed Jobs
    检测和警报SQL Server代理丢失的作业

翻译自:

对警报线程池的警报线程

转载地址:http://zfiwd.baihongyu.com/

你可能感兴趣的文章
SpringMVC_2
查看>>
Hadoop点滴-初识MapReduce(1)
查看>>
[专栏]壹周技术秀(第2期)
查看>>
js JSON
查看>>
该内存不能read 或written数值 叙述
查看>>
少个人保护?我来!——阿里云在ICANN第3届GDD峰会纪实
查看>>
[Spark性能调优] 第四章 : Spark Shuffle 中 JVM 内存使用及配置内幕详情
查看>>
虚拟机安装
查看>>
L1-045 宇宙无敌大招呼
查看>>
[笔记]Android 源码编译
查看>>
EasyUI特殊情况下的BUG整理
查看>>
10个最实用的Linux命令
查看>>
poj2155二维树状数组区间更新
查看>>
iOS开发UI篇—UIWindow简单介绍
查看>>
在sap系统设置纸张打印格式
查看>>
WPF DataGrid绑定到数据源的方法
查看>>
Install FindBugs in Eclipse
查看>>
标准I/O库之二进制I/O
查看>>
信号之不可靠的信号及中断的系统调用
查看>>
让你Android开发更简单
查看>>