When working with Amazon's Redshift for the first time, it doesn't take long to realize it's different from other relational databases. You have new options like COPY and UNLOAD, and you lose familiar helpers like key constraints. You can work faster with larger sets of data than you ever could with a traditional database, but there's a learning curve to get the most out of it. One area we struggled with when getting started was unhelpful disk full errors, especially when we knew we had disk space to spare. Over the last year, we've collected a number of resources on how to manage disk space in Redshift. We'll share what we've learned to help you quickly debug your own Redshift cluster and get the most out of it.

Make Sure You Know How Much Disk Space You Actually Have

If you're getting a disk full error when running a query, one thing is certain: while running the query, one or more nodes in your cluster ran out of disk space. This could be because the query is using a ton of memory and spilling to disk, or because the query is fine and you simply have too much data for the cluster's hard disks. You can figure out which is the case by checking how much space your cluster is using, with a query against the stv_partitions table along these lines:

select
  sum(used)/1024 as used_gbytes,
  (sum(capacity) - sum(used))/1024 as free_gbytes
from stv_partitions
where part_begin = 0;

Ideally, you won't be using more than 70% of your capacity. Redshift should continue working well even when over 80% of capacity, but it could still be causing your problem. If it looks like you have plenty of space, continue to the next section, but if you're using more than 90%, you definitely need to jump down to the "Encoding" section.
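If the cluster is fuller than you expected, it can also help to see which tables are taking up the space. One quick way, as a sketch, is to list the largest tables via the svv_table_info system view (its size column is reported in 1 MB blocks):

select "table", size as size_mbytes, pct_used
from svv_table_info
order by size desc
limit 20;

The biggest offenders are usually a good place to start looking for data you can unload or delete.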
Joins

If the query that's failing has a join clause, there's a good chance that's what's causing your errors. When Redshift executes a join, it has a few strategies for connecting rows from different tables together. By default, it performs a "hash join": it creates hashes of the join key in each table and distributes them to the other nodes in the cluster. That means each node has to store hashes for every row of the table, and when you're joining large tables, this quickly fills up disk space. However, if the two join keys for the query live on the same node, the whole query can happen in place without using any additional memory. By setting up the tables so their dist keys are the same, you may be able to avoid a disk full error. When setting up dist keys, though, be wary of skew, which is discussed in the next section.
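As a minimal sketch of what that looks like (the users/subscriptions schema here is invented for illustration):

create table users (
  id bigint,
  email varchar(256)
)
diststyle key distkey (id);

create table subscriptions (
  user_id bigint,
  plan varchar(64)
)
diststyle key distkey (user_id);  -- same values as users.id, so a user's
                                  -- subscriptions land on the user's node

With matching dist keys, a join on users.id = subscriptions.user_id can be resolved locally on each node instead of redistributing rows across the cluster.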
If you can't change the dist key, whether because it's optimized for another query, because the new key would cause skew issues, or for some other reason, you may be able to make some changes to your query so it can still be executed. Some queries that use joins only need data from one of the tables but are using the join to verify some piece of information. In those cases, the join can often be replaced by an IN clause and a subquery. For example, a common query for us is to get some piece of information about users with subscriptions. Instead of joining the two tables, we can select users whose ids are in the subscriptions table.
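With the same invented schema as above, the rewrite might look like this:

-- Join version: hashes rows from both tables and can spill to disk.
select u.id, u.email
from users u
join subscriptions s on s.user_id = u.id;

-- Subquery version: only the set of subscriber ids is materialized.
select u.id, u.email
from users u
where u.id in (select user_id from subscriptions);

As a side benefit, the IN form returns each user once, where the join would repeat a user for every subscription they have.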
While holding on to the result of the subquery takes some memory, it is usually much less than what's needed for a hash join.

Create and join subtables

In many cases, we are only retrieving small subsets of data from the tables being joined but are doing a hash join of the whole tables. In those cases, you can create a table, usually a temporary one, that is a subset of the table being joined but has whatever filtering you need already applied. That way, by joining two small tables, the hashes are much smaller.
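A sketch of that pattern, again with the invented schema (the created_at and status columns are assumptions):

-- Filter both sides down first so the join only hashes the rows we need.
create temp table recent_subscriptions as
select user_id, plan
from subscriptions
where created_at > dateadd(day, -30, current_date);

create temp table active_users as
select id, email
from users
where status = 'active';

select u.email, s.plan
from active_users u
join recent_subscriptions s on s.user_id = u.id;

Because the temp tables hold only the filtered rows, the hash join over them needs far less memory and disk than a join of the full tables.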