Interval queries in SQL Server

Interval queries in SQL Server

In mathematics, the interval is a set of values ​​between some lower value, and some high value. There are different types of intervals, which are required to provide a database, including the time (e.g., sessions, contracts, projects, appointments, validity period), spatial (e.g. road segments) and numeric (for example, temperature ranges).

Starting with SQL Server 2008, spatial intervals can be represented using spatial data types GEOMETRY and GEOGRAPHY and handle these types of data through methods. For spatial queries can also use indexing and optimization.

For other types of slots, in particular a very common temporary type in SQL Server is not provided for special support. Most of these intervals are two attributes that contain lower and upper values, and use predicates those involving these attributes in queries related to the intervals. For ranges of special means of indexing and optimization there.

In preparing the queries related to intervals required to determine the types of relations between intervals. One of the most common queries — check the intersection of two slots (e.g., find two sessions which were active during a certain period of time represented by the inputs and @ 1 and @). Unfortunately, the classic methods used to determine the intersection of intervals and some other relations, the fundamental problems inherent in optimization. As a result, the performance of queries involving intervals, usually very low.

But the situation is not hopeless — Researchers from the University of Munich created a clever model called Relational Interval Tree (RI-tree). With improvements Laurent Martin an opportunity to effectively handle ranges in SQL Server. However, the model and further optimization related to the mathematical calculations that may seem too complicated.

It is possible to build a model in a component of SQL Server database and make it transparent to the user, which is sufficient to apply a simple syntax to create a new kind of index, leaving unchanged requests. The rest — it is SQL Server; Your requests will just run faster. While in SQL Server does not have similar functions, but I hope that Microsoft will react positively to the idea and implement this functionality in a future release.

This article first describes the traditional view intervals in SQL Server, classic slots and requests for basic optimization problems such requests. Next will be described a model RI-tree, and further optimization. At the end we will focus on the potential integration of RI-tree model in SQL Server.

Traditional notions intervals

Traditionally intervals in SQL Server are represented by two attributes that contain lower and upper values ​​of each interval, and an attribute that contains the key. Of course, the table may have other attributes for other purposes. In my examples use a table named Intervals and 10 million strings representing intervals conventional manner.

To represent the lower and upper range values ​​are integers used as a model RI-tree, which is discussed below, is working with integers. The date and time you want to map the integers, such as calculating the difference between a reference point and the target value based on the required accuracy. Of course, the traditional representation of the intervals can be any type that implements a full order.

Note the two indices, defined in Table Intervals. One index is created on the lower column as a key column and covers the upper, and the other created by the upper column as a key column and covers the lower.

Most queries related to intervals, eg in the query that defines the intersection intervals involved two predicates range. For example, if the input range defined variable @ 1 (for lower) and @ and (for upper), the intervals intersecting with the input intervals are those that satisfy the following predicate: lower lt; =u AND upper gt; = @ 1. This is the fundamental problem of optimization — Search index can only be based on a single predicate range. Other predicates range must be treated as a residual. Therefore, the optimizer must be selected for one of the two indexes and perform a search based on one of the main predicate on the key index to select the appropriate line. When viewing the remaining rows in the index pages check another predicate to determine which rows should be returned.

To understand this is very important, so it is useful to dwell on this issue in more detail. Consider the following list sorted by X, Y:

X Y l 10 l 20 l 40 l 50

2 20

2 30

2 50

3 20

3 40

3 50

3 50

3 60

4 U

4 U

4 50

4 60

Suppose a sorted list is the data on the final level of the binary index. Consider the following query filter: X gt; = 3 and Y lt; = 40. On the basis of the verb X gt; = 3, you can use the search index go straight to the final line (3, 20), and see only the rows that satisfy the predicate. But we can not avoid seeing all of the remaining rows in the index, and check the remaining predicate Y lt; = 40 as a residual predicate.

We can see the example of the problem of optimizing queries, search for the intersection table Intervals. First of all, use the following code to turn 10 and STATISTICS STATISTICS TIME:

10 SET STATISTICS ON;

SET STATISTICS TIME ON;

Then perform the following query to find the intervals intersecting with the input interval, which is located approximately in the middle of the full range of:

DECLARE @ 1 AS INT = 5000000,

u AS INT = 5000020;

SELECT id FROM dbo. lntervals

WHERE lower lt; =u AND upper gt; = @ 1 OPTION (RECOMPILE);

I use a query parameter RECOMPILE, since otherwise the value of the variable will not be analyzed. The plan of this query is shown in Figure 1.

Input data in the request interval are located approximately in the middle of the whole range, so actually it does not matter which of the two indices used by the optimizer. In this case, the optimizer chose index idx_ upper. With regard to the optimization problem, note that in the section Seek Predicates found only predicate for a column upper; predicate column is lower under Predicate — is a residual predicate. As a result, about half have to review the final level of the index. On my computer stats for this request was the next — of logical reads: 11256; CPU time: 482 ms. Admittedly, not very high efficiency.

As mentioned above, when searching for an interval in the middle of the whole range does not matter, which index is used by the optimizer. However, the search range in the vicinity of the beginning of the band, of course, better use idx lower, due to the viewing area at the beginning of index entries. Experiment with input data = 1 @ 80 and @ u = 100; use the RECOMPILE, so the optimizer to analyze the variables. Statistics for this request — the logical reading: 3; CPU time: 0 ms. Similarly, when requested, addressed to the end of the range given GOVERNMENTAL, the optimizer chooses idx uppcr, looking small area at the end of the index sheet. For example, try to run a query with the input data @ 1 and @ = 9999900 and 9999920. = Statistics for this request — the logical reading: 3; CPU time: 0 ms.

This experience shows that if the parameters in the stored procedure, and does not specify the RECOMPILE, then the request will be very sensitive to the problems of the analysis parameters. In any case, an important conclusion is that if users do not always ask for only a small area of ​​the entire range, located in the vicinity of its beginning or end, the query performance will decrease due to optimization problems.

RI-tree

Model RI-tree developed Kriegel, CCIP and Seidl, provides very efficient processing of requests for slots. When building the model is calculated for each interval attribute (stored as a column in the table), the two indices are constructed and used to search for new requests intersections and other relations of the interval.

The core model — the virtual binary tree whose nodes represent integer values ​​in the range that you want to reach. For example, if you want to submit to the intervals that start and end in the range of 1 to 31, the virtual tree represented in the virtual base node tree and branch (not yet pay attention to the range shown here and the branch node).

You can use the LOG, to calculate the height of the tree. To cover the range from 1 to the value @ s height binary tree-h = CEILING (LOG (max + 1, 2)). The root of the tree — POWER (2, @ h-l). The reason the tree — a virtual, is that the entire tree is not stored anywhere; nodes are used only when necessary, as will be shown below. Branch node. A fundamental component of the model RI-tree — object, referred to as «branching node» (fork node). It is calculated for each interval to be represented in the database, and is stored together with them. Figure 2 shows how to calculate the branch node for some of the test interval. You pass through the virtual tree starting from the root, branching into two; the first node, discovered within the range — branch node.

In Listing 3, the algorithm based on the R-1 tree in a user-defined function of T-SQL forkNode named for a tree with a height of 31 (ranges from 1 to 2147483647). For example, the call function with inputs 11 and 13, and 12 as the receive branch node on the outlet:

SELECT dbo.forkNode (11,13);

Lack of calculating branch node, as shown in Listing 3 that uses a user-defined function of T-SQL and iterative logic. This combination is very inefficient, especially if you want to calculate the branch node for each interval is stored in the database. To give an idea of ​​the degree of inefficiency, I note that on my computer to fill in the table with 10 million rows took 304 seconds, 297 seconds of which was spent on payment host branch.

The load on the processor in the algorithm iterations to calculate the branch node is proportional to the height of the tree. The model RI-tree Kriegel, CCIP and Seidl used a variant in which the height and range of tree dynamically expands depending on the added slots. But it is necessary to introduce additional parameters for the tree and modify some of them each time you add interval. As a result, only supported single-insertion; moreover, inserts are slow and often give rise to a bottleneck.

Laurent Martin proposed a method of calculating the branch node with a scalar expression. The result was the ability to handle large RI-tree and enable highly efficient multi-insert. Branch node for the given interval — lowest common ancestor boundaries of the interval represented by the values ​​in the columns and lower upper.

For each of the node will accept L = leading zeros before the final level; for example, for node 12 (01100), L = 011.

When a given node X, the leading bits that — L, all the nodes in the left subtree of X are of leading bits L-1 (e.g., for L = 011, L-1 = 010) and all the nodes in the right subtree of X are of leading bits L.

For some nonleaf node X and X-1 ancestor of the same, so for calculation ancestor X value X may be replaced by X 1. If the Z — end node, Z and Z-1 differ only by the last digit; for Z is 1 and for the Z-1-0. Given these circumstances, branch node can be calculated as follows: a comparison of the prefix (lower- 1) and the upper, coupled with a 1 and enlarged trailing zeros.

Perform these calculations in SQL Server by using the following actions (taking as an example the interval [11, 13]):

Let A = (lower -1) L upper

Let B = POWER (2, FLOOR (LOG (A, 2)))

Let C = upper% in Let D = upper — With In step 1 is calculated value (call it A), marking the ranks in the different (lower — 1) and upper as «units». To achieve this, it performs a bitwise XOR operation between the (lower -1) and upper. In our example, IL 13 in binary representation — 01010 N 01101 = 00111.

Step 2 calculated value (call it B), in which the leftmost digit differs (lower- 1) and upper, is set to «one», and all other bits are set to 0. In other words, this step determines the leftmost digit And in that set. Please note that on the basis of the above considerations, the leftmost digit differs (lower- 1) and the upper, will be equal to 0 (lower — 1) and 1 in the upper. In our example, the result of step in binary form — 00100.

In step 3, trailing bits are stored from upper, following the discharge set to 1 in parameter B. In this example 01101 00100 = 01%.

In step 4, in the upper ranks of the final set after discharge will be assigned a value of 0. In our example: 01101-00001 = 01100. Done — we have branch node!

All these operations can be combined in the following formula (shown as a T-SQL expression in the SQL Server 2012) for calculating branch node: upper — upper% POWER (2, FLOOR (LOG ((lower — 1) upper A, 2)))

This expression is a very ingenious way to calculate tion branch node. When I described it to his friend, a specialist in T-SQL, he was so excited that jokingly said that he wanted to be like Martin when he grows up.

Having scalar expression for calculating branch node based on the columns and lower upper, you can implement it as a computed column (called a «node») in a table containing intervals. Based on the model RI-tree require two indices: one for the list of keys (node, lower) and one for the list of keys (node, upper). Please note that depending on your needs, you can add the leading columns in a list of keys for filters on an equal basis in requests, as well as a list of columns in the index include for completeness.

The source code creates a table IntervalsRIT, populates it with test data from the staging table and create a primary key constraint and index-based model RI-tree.

IntervalsRIT completed table and its indexes — a representation of the intervals based on the RI-tree, which replaces the former Intervals table and its indexes. Remember that when using the forkNodc on my computer it took just 297 seconds to calculate the branch nodes to 10m intervals. The optimized formula Martin it took only 6 seconds.

Querying models RI-tree

In IntervalsRIT intervals are stored, along with the node branch in the column named node. Here’s how to send a request to the table to determine the intervals intersecting with some input interval | @ 1 and @ |. The examples in this article uses the input interval [@ 1 = 11, @ u = 13].

The model RI-dersva has three unrelated groups of slots which can intersect the input interval. I call them the left (left), middle (middle) and right (right) groups.

Left group. Consider the path of the virtual tree descending from the root to the @ 1. Let leftNodes — a set of all the nodes w, which are found on the road and to the left of the input range.

For all nodes w in the following statements are true leftNodes.

Interval registered at node d in the left subtree w, can not overlap with the input interval. If that happened, he would have been registered in the ancestor d.

Interval registered w, may belong to one of two types: if upper lt; @ 1, then, obviously, the range does not overlap with the input interval; if the upper gt; @ = 1, then the interval of overlap with the input interval, since it is already known that the lower lt; =u (interval, recorded in the node that is to the left of the input range, obviously begins earlier than the input end interval).

Therefore, the intervals left belonging to the group are registered in the node and in have haveuppcr leftNodes gt; = @ 1.

Listing 5 shows the definition of a table function named leftNodes, which returns the nodes in the above set leftNodes.

The following query returns all intervals in the left panel (registered in the nodes in leftNodes and intersecting with the input interval):

DECLARE @ 1 AS INT = 5000000,

u AS INT = 5000020;

SELECT l.id

FROM dbo.IntervalsRIT AS I JOIN dbo.leftNodes (l,u) AS L ON I.node = L.node AND l.upper gt; @ = 1;

Plan searches idx_ upper index for each node in leftNodes.

It is important that there is now one predicate equality and a range predicate; both can be used as predicates search.

Right group. The right group is symmetrical left. Consider the path of the virtual tree descending from the root to @ and. Let rightNodes — a set of all the nodes w, which occur on the way and the right of the input range.

For all nodes w in the following statements are true rightNodes.

Interval registered at node d in the right subtree w, can not overlap with the input interval. If that happened, he would have been registered in the ancestor d.

Interval registered w, may belong to one of two types: if lower gt; u, then, obviously, the interval does not intersect with the input interval; if lower lt; =u, then the interval of overlap with the input interval, since it is already known that the upper gt; @ = 1 (interval, recorded in the node that is to the right of the interval input obviously finishes after the input starts interval).

Therefore slots belonging to the right group registered in the node and have lower rightNodes lt; =u.

The definition of functions and leftNodes rightNodes defines a table function, called rightNodes, which returns the nodes in the above set rightNodes.

The following query returns all intervals in the right group (registered in the nodes in rightNodes, which intersect with the input interval):

DECLAREl AS INT = 5000000,

u AS INT = 5000020;

SELECT Lid

FROM dbo.IntervalsRIT AS I JOIN dbo.rightNodes (l,u) AS R ON l.node = R.node

AND l.lower lt; =u;

Plan zaprosas right nodes. It is symmetrical plan of the query nodes left, but this time the index is used idxlower. The middle group. Let middleNodcs — a set of nodes w, located in the input range. Any interval registered in w, is lower lt; =u and upper gt; @ = 1; therefore it intersects with the input interval. Use the following query to return all the slots in the middle group:

DECLARE @ 1 AS INT = 5000000,

u AS INT = 5000020;

SELECT id

FROM dbo.IntervalsRIT

WHERE node BETWEEN @ 1 ANDu; The optimizer can use as idxupper, and idx lower for efficient query processing. In this case, the optimizer seems to prefer to use idx upper. The source code from the model RI-tree, combining the results of queries that return crossing in all three groups (left, middle and right).

For combined request was received the following statistics: 81 IntcrvalsRIT + 2 for a table variable returned leftNodes + 2 for a table variable returned rightNodes; CPU time: 16 ms. This is much better than traditional query shown earlier in this article; Statistics remind you of the traditional request: logical reads: 11256; CPU time: 482 ms.

Optimized calculation ancestors. The disadvantage of implementing the functions and leftNodes rightNodes in the previous section is to use iterative logic, is not effective in T-SQL. Introduce excessive costs to find no intersection with the only input interval, and with a set of input intervals. Laurent Martin proposed an elegant solution that uses logic based on sets. This article describes a kind of solution, in my opinion easier to understand. The basic logic is about the same as in the decision of Martin. If there is an input range of | @ 1 and @ |, then leftNodes — @ 1 set of ancestors that are to the left of the input range. Similarly, rightNodes — set of ancestors @ r to the right of the input range. Martin made the following observation concerning the ancestors of any given nodenode. Assume,node = 13 (binary 01101). You can define the parent node, assigning a value of 0 to the right category node, and the category left — the value of 1. This process can be repeated until it reaches the root, to identify all ancestors. For example, the ancestors of 13-14, 12, 8 and 16: 01101 (13)

01110 (14)

01100 (12)

01000 (8)

10000 (16)

To calculate the ancestors of the node uses an auxiliary table named Bit Masks (column values ​​b1 and b3 are given in binary format).

For a virtual tree of height h lines of the table should be filled in which n has a value between 1 and h-1. For example, the code in Listing 7 creates a table and fills it BitMasks values ​​for a tree with a height of h = 31 (n from 1 to 30).

One might ask why there is no table BitMasks column L2. This variety of solutions are used only b1 and b3, but the initial decision of March there is also a column L2.

To calculate the ancestors of the input nodenodc, sent a request to the table BitMasks, using the expressionnodc & N | bs on each level, which, in fact, implements the above logic: at every level, assign a value of 0 category this level and set digits to the left of it a value of 1. However, it is necessary to filter out only the levels at which there are ancestors, namely those in which the right-most bit set to 1 (only bZ), located to the left of the right-most set in one discharge @ node.

In SQL Server to represent integers using the additional code. As indicated in the relevant article of Wikipedia, «an easy way to manually convert a binary number to an additional code — to start with the least significant bit and copy all zeros (moving in the direction of the most significant digit) as long as it met the first one, and invert all remaining level. «

This means that for a given valuenodc rightmost bit set to 1 can be calculated using the expressionnode & — @ node. For filtering only the levels that represent the ancestors of the positive input node, use the following predicate: b3 gt; node & — @ node.

Definition of a table-valued function built Ancestors, which returns all of the ancestors of the input node. To return only the ancestors of the left node, the request is sent to the function and filter only those lines which returned less than the input node node. To return only the right ancestors, the request is sent to the function and filter only those lines in which the returned node greater than the input node.

Note the addition of a filter that excludes nodes of left path, which is below the minimum unit in the table and right nodes of the path that are maximal in the node table. Figure 10 shows the query plan in Listing 9. As a function of Ancestors built, the plan looks BitMasks table directly, without the overhead associated with the function as in the previous case. Statistics of this request — the logical reading: 66 IntervalsRIT + 2 for BitMasks; CPU time: 0 ms.

Integration capabilities in SQL Server

Thanks Rl-tree model and optimize manage quite effectively handle ranges in SQL Server. But the mathematics needed for this solution may be too complex for users. It is possible to build a model in the core SQL Server, making it transparent to the user. This can be achieved in several ways.

One option, to introduce a new type of index (called the index range), which the user must create a fairly simple syntax. SQL Server can calculate the branch node for each interval and build two binary code, such as those described above and idx lower idx_ upper. Optimizer is also needed to find queries based on ranges of predicates in a filter query, and should be performed after the detection of the internal processing of the request. With this approach, the user can only create an index of the interval, and the SQL Server engine will do the rest. Initial inquiries remain unchanged and only run faster. The previously described indices idx_ lower and idx upper represent the simplest form of the required indexes. As described in the Advanced interval queries with the Static Relational Interval Tree, Allen all relationships for the intervals can be processed on the basis of the model RI-tree. Some require one index for a list of keys (node, lower, upper) and one for (node, upper, lower). For queries relating only to the intersections, is enough to define indexes (node, lower) and (node, upper), respectively (so 1NTERSECTS_ ONLY option in the proposed syntax of the index). Also, if the request has additional filters on the basis of equality for the other columns, it is necessary to add to the index as both initial keys. Finally, if you need to return more columns than those indicated in the list of keys and you want to index covers the query, you must add the columns to be included. Therefore, the proposed index interval with all optional elements will look like this: CREATE INDEX myindex ON dbo.lntervals [(fcol1, fcol2, …)] — initial equality-based filters INTERVAL (lower, upper) — Columns interval [INCLUDE (icoll , icol2, …)] — included columns [WITH (INTERSECTS.ONLY = ON)]; — Specifies a list of keys. The internal mechanism of SQL Server will create the following two ordinary B * tree index:

CREATE INDEX myidxl ON dbo.lntervals

([fcol1, fcol2] node, lower [, upper]) [INCLUDE (icoll, icol2, …)];

CREATE INDEX myidx2 ON dbo.lntervals

([fcol1, fcol2] node, upper [, lower]) [INCLUDE (icoll, icol2, …)];

As noted above, the optimizer detects requests slots based on query predicates in the filter and uses these indices.

Remember that although all of the examples used integers in practice often have to work with dates and time intervals. Comparison of date and time values ​​integers converse can lead to significant costs in T-SQL. If you use your own internal implementation of the compiled code in conjunction with any low-level language will allow Microsoft’s do it effectively.

Among other potential additions to the SQL Server on the basis of this model — built-in functions to calculate the branch node and ancestors, types of support date and time and again their more effective implementation in comparison with custom calculations. In addition, these features provide an individual approach for experienced users. As noted above, I have submitted a proposal to expand the functionality to Microsoft.

Faster processing requests from operating intervals

Traditional methods frequently used to handle requests dealing with intervals, especially typical queries intersections inherent fundamental problems that reduce productivity. If the filter has two predicate query range only one can be used as a predicate in the search index on the basis of the binary tree. Thanks model RI — tree and additional optimization will be able to get a search index, based on a single predicate range, which greatly speeds up the processing of requests. However, the model and its optimization can be unnecessarily complicated for users. As a purely technical problem and the solution already exists, it can be fully integrated into the core database, making transparent to users.

Like this post? Please share to your friends: