Is it possible for me to see rate limit errors at the 30 second polling rate, even if my environment isn’t that big?
We have multiple customers with varying environments who poll at the 30 second rate, and they have not reported rate limit issues. AWS has different throttle limits per region but does not publish this information as it is proprietary. We have seen throttle levels in a region fluctuate over time as AWS makes changes to its environment. If Fugue takes 30 seconds or less to poll for data and index it, the remainder of the interval acts as a cool-off period. If this process exceeds 30 seconds, Fugue makes another call immediately, which leaves no time for a cool-off period. To address this specific situation, we may adjust our polling and indexing time from 30 seconds to <X> minutes to give more time for the cool-off period.
If a third-party application such as Splunk or Chef uses the same account to access a service in a region that Fugue supports, it is possible for two calls to be made at the same time. Fugue proactively limits its exposure to throttling by using the back-off and retry approach. This means that if we encounter throttling, we generate a message but continue to try indexing data at a less frequent rate.
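The back-off and retry approach described above can be sketched as follows. This is a minimal illustration, not Fugue's actual implementation: `ThrottledError`, `call_with_backoff`, and the delay parameters are hypothetical names and values chosen for the example, and the randomized delay (jitter) is an added assumption.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling response from an API."""


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Retry `call` on throttling, doubling the delay ceiling each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            # Give up once the final attempt has also been throttled.
            if attempt == max_attempts - 1:
                raise
            # Ceiling grows as base, 2*base, 4*base, ... up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random amount below the ceiling so that several
            # clients do not all retry at the same instant.
            sleep(random.uniform(0, ceiling))
```

Each throttled attempt waits longer on average than the one before it, which is what lets the caller keep indexing at a less frequent rate instead of failing outright.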
Why would you make such a big adjustment from 30 seconds to <X> minutes? Why wouldn’t you start off with an <X> minute polling rate?
Fugue fine-tunes its system for maximum responsiveness and low latency. Fugue follows AWS best practices for using exponential back-off and retries, which is why we typically do not see throttling issues with customers. In cases where customers report issues, however, we want to be extra cautious and minimize the possibility of throttling, so we will change the retry time from 30 seconds to <X> minutes.
How does Fugue work with the AWS API Throttle?
AWS API throttles are typically set “per service per account per region.” Therefore, staying under the API throttle is a complex problem with multiple possible solutions.
Fugue’s strategy is to call serially instead of using concurrent calls. With our serial approach, we make 1 call per service per region per account at a time. Fugue’s indexing daemons (known as “Reflector Descriptors”) are run on a clock interval of 30 seconds. They are kicked off only when the previous run is complete so the interval may be greater than 30 seconds but never less. If you have no resources in a service-region, the indexing typically takes a few seconds. If you have thousands of resources in a service-region, the indexing can take a few minutes.
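The timing rule described above (start the next pass only after the previous one finishes, and never sooner than one interval after it started) can be sketched like this. The function names and the injectable `clock` and `sleep` parameters are assumptions made for illustration; this is not Fugue's code.

```python
import time


def cool_off_seconds(elapsed, interval=30.0):
    """Time left to wait after a pass that took `elapsed` seconds."""
    return max(0.0, interval - elapsed)


def run_serially(index_once, passes, interval=30.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Run `index_once` one pass at a time, at least `interval` apart."""
    for _ in range(passes):
        started = clock()
        index_once()  # one serial pass; no other pass runs concurrently
        wait = cool_off_seconds(clock() - started, interval)
        if wait > 0:
            sleep(wait)  # a fast pass waits out the rest of the interval
        # A pass slower than the interval leaves wait == 0, so the next
        # pass starts immediately with no cool-off period.
```

With these defaults, a pass that takes 5 seconds is followed by a 25 second wait, while a pass that takes 45 seconds is followed by no wait at all.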
Fugue’s indexing of your resources is what enables our drift detection and automated remediation to work. The increase in indexing time means that as you add resources to a particular service in a particular region for a particular account, it takes Fugue longer to detect drift (e.g., someone changing a security group in the AWS Web Console) and correct it.
In a large enough environment, it is still possible for even a single-threaded pass of describe-calls to exhaust the API throttle. This limitation is not specific to Fugue but applies to anything hitting the AWS API, even the AWS Web console, as shown in the image below. Fugue deals with this limitation by ending the indexing job immediately and waiting for the next one to kick off.
Unfortunately, it is not possible to partition off API throttle capacity to particular sessions or users. Therefore, any other tool that examines a service in a region using a specific account counts against the same throttle that applies to Fugue. It is therefore possible for API throttle exhaustion collisions to result between Fugue and other tools. AWS advises tool-makers to use exponential back-off and retries when they experience throttling, but tools vary widely and may or may not follow this recommendation. Some can work side-by-side with Fugue without issue. Others may encounter problems.
The number of AWS API calls seems really high for my service. Why are there so many AWS API calls?
There are typically two types of “describe” API calls in AWS: “describe-all” calls and “list-and-enrich” calls.
EC2 is a great example of the “describe-all” type. The API exposes calls like “describe_vpcs()” and “describe_subnets()”. When you call them, you get all the VPCs and all the subnets in that region for that account. To create an accurate picture of how things are related, you call both and then use the “vpc_id” in any particular subnet’s result to infer what VPC it belongs to. No matter how many VPCs or subnets you have, you only make two calls against the API (setting aside pagination of massive result sets).
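As a rough sketch of the "describe-all" pattern, the function below takes the two responses and groups subnets under their VPC. It assumes the responses are dictionaries shaped like the EC2 results, with a "Vpcs" list carrying "VpcId" and a "Subnets" list carrying "SubnetId" and "VpcId"; the function name and the sample IDs are invented for illustration.

```python
def subnets_by_vpc(vpcs_response, subnets_response):
    """Map each VPC ID to the IDs of the subnets that belong to it."""
    # Start with every known VPC so that empty VPCs still appear.
    grouped = {vpc["VpcId"]: [] for vpc in vpcs_response["Vpcs"]}
    for subnet in subnets_response["Subnets"]:
        # The subnet's own VpcId field tells us which VPC it belongs to.
        grouped.setdefault(subnet["VpcId"], []).append(subnet["SubnetId"])
    return grouped


# Invented sample data standing in for the two API results.
vpcs = {"Vpcs": [{"VpcId": "vpc-a"}, {"VpcId": "vpc-b"}]}
subnets = {"Subnets": [
    {"SubnetId": "subnet-1", "VpcId": "vpc-a"},
    {"SubnetId": "subnet-2", "VpcId": "vpc-a"},
]}
relationships = subnets_by_vpc(vpcs, subnets)
```

The relationship is reconstructed entirely from the two result sets, so no extra call per VPC or per subnet is needed.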
ELB, on the other hand, is a great example of the “list-and-enrich” type. You call “describe_load_balancers()”, which returns a list of every ELB in that region. Then, you make a set of calls for *each* ELB to fill in all its characteristics. In Fugue’s case, it currently uses:
ELB v1 (4 calls all together)
If you have no ELBs, you effectively have one call. If you have 50 ELBs, you have 1 + (50 * 3) = 151 calls. As you can see, the number of API calls grows steeply with resource count for this kind of service.
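The arithmetic above generalizes to one list call plus a fixed number of enrichment calls per resource. A minimal sketch, where the function name and parameter names are illustrative:

```python
def list_and_enrich_calls(resource_count, calls_per_resource=3):
    """API calls for one pass: one list call plus per-resource calls."""
    return 1 + resource_count * calls_per_resource


calls_for_none = list_and_enrich_calls(0)
calls_for_fifty = list_and_enrich_calls(50)
```

Here `calls_for_none` is 1 and `calls_for_fifty` is 151, matching the figures in the text.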
We also need to index these resources in every region. Even with nothing deployed into an account that we’re indexing, the number of calls over a day comes out to:
14 regions x 20 services x 2/min x 1440 minutes per day = 806,400 calls per day globally
That number is an oversimplification of what’s going on in terms of services and their calls, but it does demonstrate that during a normal day, Fugue is working hard to ensure an accurate picture of your environment.
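The daily figure above can be reproduced with a short calculation. The function and parameter names are illustrative, and the model keeps the same simplification as the text: one call per service per region per pass.

```python
def baseline_calls_per_day(regions=14, services=20, passes_per_minute=2):
    """Calls per day for an empty account, one call per service-region pass."""
    minutes_per_day = 24 * 60  # 1440
    return regions * services * passes_per_minute * minutes_per_day


daily_calls = baseline_calls_per_day()
```

With the defaults, `daily_calls` is 806400. Halving the polling rate to one pass per minute halves the total to 403200.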
There are two near-term ways to deal with these conflicts: lower the indexing rate, as we are doing here, or spread your resources across as many different regions and accounts as possible. Because throttles apply per service per account per region, spreading resources increases the aggregate API throttle rate available for indexing your infrastructure. We are also researching and testing other methods of lowering the total number of calls.
This graph shows the number of API calls for a single pass of each of the listed services: EC2, ELB, and ELBv2. It assumes 3 rules per ELBv2 Listener, and each pass is repeated every 30 seconds by default for each of the listed services.