What is selectivity in SQL? How is selectivity calculated and how does it relate to a database index?
The terms selectivity and cardinality are closely related – in fact, the formula used to calculate selectivity uses the cardinality value. The term selectivity is used when talking about database indexes. This is the formula to use to calculate the selectivity of an index – don’t worry we do explain what it all means below:
How to calculate the selectivity of an index:
Selectivity of index = cardinality/(number of records) * 100%
Note that the number of records is equivalent to the number of rows in the table.
What does selectivity mean?
So, you see the formula and you are thinking that’s great, but what does this actually mean? Well, let’s say we have a table with a “Sex” column which has only two possible values of “Male” and “Female”. Then, that “Sex” column would have a cardinality of 2, because there are only two unique values that could possibly appear in that column – Male and Female. If there are 10,000 rows in the table, then this means that the selectivity of an index on that particular column will be 2/10,000 * 100%, which is .02%.
The key with the selectivity value is that it basically measures how “selective” the values within a given column are – in other words how many different values are available in the given sample set. A selectivity of .02% is considered to be really low, and means that given the number of rows, there is a very small amount of variation in the actual values for that column. In our example “Sex” column,
Why does the database actually care about the selectivity and how does it use it? Well, let’s consider what a low selectivity means. A low selectivity basically means there is not a lot of variation in the values in a column – that there is not a lot of possibilities for the values of a column. Suppose, using the example table that we discussed earlier, that we want to find the names of all the females in the table.
How does selectivity affect usage of a database index
Database query optimizers have to make a decision about whether it would actually make sense to either use the index to find certain rows in a table or to not use the index. This is because there are times when using the index is actually less efficient than just directly scanning the table itself. This is something that you should remember: even if a column has an index created for it, that does not mean the index will always be used, because scanning the table directly without going through the index first could be a better, more efficient, option.
When is better to not use a database index?
So, when exactly is it better to not use a database index? Well, when there is a low selectivity value! Why does a low selectivity mean that using the index is not a good idea? Well, think about it – let’s say we want to run a query that will find the names of all the females in the table – we are of course assuming that there is another column for “Name” in addition to the “Sex” column. If we are searching for all the female rows in a table with 10,000 rows then there is a good chance that 50% of the rows are females, because there really are just two possible values – male and female. Assuming that 50% of the rows are indeed females, then this means that we would have to access the index 5,000 times to find all the female rows. Accessing the index takes time, and consumes resources. If we are accessing the index 5,000 times, it is actually faster to just directly access the table and do a full table scan. So, you can see that the selectivity value was used by the query optimizer to determine whether it was more efficient to use an index or just read the table directly.
What selectivity value determines if an index will be used or not?
It’s really hard to say since that exact value varies from one database to another.
Of course, a high selectivity value means that the index should definitely be used. For example, if we are dealing with a column that has a selectivity of 100%, then all the values in that column are unique. This means that if a query is searching for just one of those values then it makes much more sense to use the index, because it will be far more efficient than risking a full table scan – which is the worst case scenario if the table is searched directly without consulting the index first.