Terraform, AWS, and Idempotency. A bug in Terraform? Or a… | by Jeffery Smith | Jun, 2022

A bug in Terraform? Or a misunderstanding of how a particular stanza works? Or possibly even our own automation around Terraform?

Photograph by James Harrison on Unsplash

Hopefully, this is only part 1 of this series, because it doesn’t really have a satisfying ending so far, but it’s still a story worth sharing. We encountered an error in Terraform that was transient and seemed to go away on its own, most likely a race condition. This post walks through that failure. It’s a bit more technical than some of my other posts, so those who come for the leadership tips might find their eyes glazing over. Your mileage may vary.

A bit about our deployment process first. We use a blue/green deployment strategy in our environment (minus the database). Automation around Terraform is responsible for bringing up the second application stack and deploying code to it. During the creation of that second stack, we received an error message we had never encountered before.

Error: error creating EC2 Launch Template: IdempotentParameterMismatch: Client token already used before.
    status code: 400, request id: 4c6edce7-7497-4884-ab63-f215f9b82f6e
  on ../terraform-asg/main.tf line 33, in resource "aws_launch_template" "launch_template":
  33: resource "aws_launch_template" "launch_template" {

There were a few things that needed to be researched here.

When an action is idempotent, it can be performed multiple times without changing the result beyond the initial application. In practice, this usually means that if I run command X, the command is aware of whether it has already been run in this context and may skip the usual work and instead return a status or a result. A simple example is the mkdir command when you use the -p flag (which tells mkdir to create any missing parent directories and not to report an error if the directory already exists).

If I run that on my local workstation, the following happens.

mkdir -p /tmp/test
ls -l /tmp/|grep test
drwxr-xr-x 2 jeff.smith wheel 64 Jun 9 06:31 test
mkdir -p /tmp/test

The mkdir command created the path /tmp/test and we can see it was created successfully. When I run the command again, it completes successfully with no error message.

That makes this command idempotent. No matter how many times I run this command, I know that in the end I’ll get a successful result and that the path /tmp/test will exist. Now contrast that with plain mkdir.

mkdir /tmp/test2
ls -l /tmp/|grep test2
drwxr-xr-x 2 jeff.smith wheel 64 Jun 9 06:35 test2
mkdir /tmp/test2
mkdir: /tmp/test2: File exists

Now with mkdir, on the second execution I get an error message, which is a different end result than my first execution. The directory exists, but the exit code I get back is different. That makes this a non-idempotent operation. But why do we care? Well, in the second example I have to do a lot more error handling, for starters. And this is a basic example.

When you are doing something like creating infrastructure, a non-idempotent operation that gets repeated could launch more instances than you intended, which is the kind of mistake this IdempotentParameterMismatch error is designed to prevent.
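The same contrast can be sketched in Go, assuming that os.MkdirAll plays the role of mkdir -p and os.Mkdir the role of plain mkdir. The helper names (outcome, createTwice, demo) are made up for this illustration, and everything happens inside a throwaway temp directory.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// outcome holds the errors returned by two identical, back-to-back calls.
type outcome struct{ first, second error }

// createTwice runs the same directory-creating function twice on one path.
func createTwice(mk func(string, os.FileMode) error, path string) outcome {
	var o outcome
	o.first = mk(path, 0o755)
	o.second = mk(path, 0o755)
	return o
}

// demo repeats os.MkdirAll and os.Mkdir inside a scratch directory.
func demo() (outcome, outcome, error) {
	base, err := os.MkdirTemp("", "idempotency-demo")
	if err != nil {
		return outcome{}, outcome{}, err
	}
	defer os.RemoveAll(base)

	// Like `mkdir -p`: the repeat is a no-op.
	all := createTwice(os.MkdirAll, filepath.Join(base, "test"))
	// Like plain `mkdir`: the repeat fails because the directory exists.
	plain := createTwice(os.Mkdir, filepath.Join(base, "test2"))
	return all, plain, nil
}

func main() {
	all, plain, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("MkdirAll, first and second run:", all.first, all.second)
	fmt.Println("Mkdir, first run:", plain.first)
	fmt.Println("Mkdir, repeat reports 'exists':", errors.Is(plain.second, fs.ErrExist))
}
```

The caller of the idempotent version can retry blindly. The caller of the other one has to inspect the error to decide whether the repeat was actually a failure.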

When you make an AWS API call to create infrastructure, nearly every endpoint (to my knowledge) does this in an asynchronous fashion. What this means is that your API call returns immediately, but the work of actually creating the infrastructure you requested is still in progress. Because of this, you usually need some kind of polling mechanism to determine when the operation has completed.
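A polling mechanism of that kind might look roughly like the sketch below. It does not call AWS at all: statusFunc is a hypothetical stand-in for whatever "describe" call reports the resource state, and pollFake simulates a resource that stays pending for a few checks.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// statusFunc reports the current state of a previously requested resource.
// It is a hypothetical stand-in for a real "describe" API call.
type statusFunc func(ctx context.Context) (string, error)

// waitUntilReady polls check at the given interval until it reports "ready",
// returns an error, or ctx is cancelled or times out.
func waitUntilReady(ctx context.Context, interval time.Duration, check statusFunc) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		state, err := check(ctx)
		if err != nil {
			return err
		}
		if state == "ready" {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Interval elapsed; check again.
		}
	}
}

// pollFake waits on a simulated resource that reports "pending" for the
// first pendingCalls checks, giving up after timeoutMillis milliseconds.
func pollFake(pendingCalls, timeoutMillis int) error {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()

	calls := 0
	check := func(context.Context) (string, error) {
		calls++
		if calls <= pendingCalls {
			return "pending", nil
		}
		return "ready", nil
	}
	return waitUntilReady(ctx, 5*time.Millisecond, check)
}

func main() {
	fmt.Println("became ready:", pollFake(3, 1000) == nil)
	fmt.Println("gave up:", pollFake(1000000, 30) != nil)
}
```

The timeout matters as much as the loop: without it, a resource that never becomes ready would block the caller forever.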

A number of AWS API endpoints support idempotency, which lets you specify a client token to uniquely identify the request. If you make the same infrastructure creation call and use the same client token, instead of creating a new instance, it will return the status of the previously requested instance. When creating infrastructure programmatically, this can be a big safety net to avoid creating many copies of the same infrastructure. And that’s where our error comes in.

The error states that we have already used the client token. Digging into the documentation a bit, the more specific meaning is that we changed parameters in the request but reused the client token, meaning the API doesn’t know what our intent actually is. Do we want new infrastructure with these parameters? Or are we expecting already existing infrastructure that matches these parameters? To be safe, it throws this error.
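To make that behavior concrete, here is a toy in-memory model of a server that honors client tokens. It is only a sketch of the idea described above, not how AWS implements it: the request fields, the server type, and the generated IDs are all invented for the illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errParameterMismatch mimics the error we received: the token was already
// used, but with different parameters.
var errParameterMismatch = errors.New(
	"IdempotentParameterMismatch: client token already used with different parameters")

// request is a hypothetical, heavily simplified create call.
type request struct {
	ClientToken  string
	InstanceType string
	VolumeSize   int
}

// server is a toy stand-in for an API that supports client tokens.
type server struct {
	mu      sync.Mutex
	seen    map[string]request // token -> parameters of the first request
	ids     map[string]string  // token -> ID of the resource it created
	created int                // number of resources actually created
}

func newServer() *server {
	return &server{seen: map[string]request{}, ids: map[string]string{}}
}

// create makes a new resource for an unseen token, returns the existing
// resource for an exact replay, and refuses a replay whose parameters differ.
func (s *server) create(req request) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if prev, ok := s.seen[req.ClientToken]; ok {
		if prev != req {
			return "", errParameterMismatch
		}
		return s.ids[req.ClientToken], nil
	}
	s.created++
	id := fmt.Sprintf("lt-%04d", s.created)
	s.seen[req.ClientToken] = req
	s.ids[req.ClientToken] = id
	return id, nil
}

func main() {
	s := newServer()
	first := request{ClientToken: "token-1", InstanceType: "m4.2xlarge", VolumeSize: 100}

	id, _ := s.create(first)
	replay, _ := s.create(first)
	fmt.Println("same resource on replay:", id == replay, "created:", s.created)

	changed := first
	changed.InstanceType = "m5.xlarge"
	_, err := s.create(changed)
	fmt.Println("changed parameters:", err)
}
```

An exact replay is safe and creates nothing new. A replay with different parameters is ambiguous, so the only safe answer is an error.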

The client is always responsible for generating the client token. But in this case, the client is actually Terraform. That created some surprise and relief on our part, since it meant the problem most likely wasn’t in any of our wrapper automation. But we still needed to be sure. The first thing we did was try to find out which client token was used and whether it was actually used twice.

Luckily the error gave us a request ID, which we used to look up the call in CloudTrail. In the requestParameters field of that request we were able to find the ClientToken used.

"requestParameters": {
  "CreateLaunchTemplateRequest": {
    "LaunchTemplateName": "sidekiq-worker-green_staging02-launch_template20220510183630312700000005",
    "LaunchTemplateData": {
      "UserData": "<sensitiveDataRemoved>",
      "SecurityGroupId": [
        {
          "tag": 1,
          "content": "sg-0743bcbd08cadba1d"
        },
        {
          "tag": 2,
          "content": "sg-6daf421e"
        },
        {
          "tag": 3,
          "content": "sg-0818108e24803a418"
        }
      ],
      "ImageId": "<removed>",
      "BlockDeviceMapping": {
        "Ebs": {
          "VolumeSize": 100
        },
        "tag": 1,
        "DeviceName": "/dev/sda1"
      },
      "IamInstanceProfile": {
        "Name": "asg-staging02-20220510183628901400000003"
      },
      "InstanceType": "m4.2xlarge"
    },
    "ClientToken": "terraform-20220510183630312700000006"
  }
}

It looks sufficiently random, and it is in the same format that Terraform generally uses to generate unique values.

It definitely didn’t look like something we generated. We then decided to search all requests in that timeframe to see if any of them had the same client token.

Sure enough, there was a second request, made by another use of the same Terraform module (which we wrote), to generate a second ASG and launch template.

"requestParameters": {
  "CreateLaunchTemplateRequest": {
    "LaunchTemplateName": "biexport-worker-green_staging02-launch_template20220510183630312700000005",
    "LaunchTemplateData": {
      "UserData": "<sensitiveDataRemoved>",
      "SecurityGroupId": [
        {
          "tag": 1,
          "content": "sg-0743bcbd08cadba1d"
        },
        {
          "tag": 2,
          "content": "sg-6daf421e"
        },
        {
          "tag": 3,
          "content": "sg-0818108e24803a418"
        }
      ],
      "ImageId": "<removed>",
      "BlockDeviceMapping": {
        "Ebs": {
          "VolumeSize": 200
        },
        "tag": 1,
        "DeviceName": "/dev/sda1"
      },
      "IamInstanceProfile": {
        "Name": "asg-staging02-20220510183629183700000003"
      },
      "InstanceType": "m5.xlarge"
    },
    "ClientToken": "terraform-20220510183630312700000006"
  }
}

As you can see, there are differences in the request (the launch template name, volume size, instance profile, and instance type), but the client token remains the same. Now we’re starting to freak out and think maybe it’s our code, but we still couldn’t see how.
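The search for reused tokens can also be done mechanically. The sketch below assumes the events have already been exported as a JSON array whose records have the same shape as the excerpts in this post. The event type keeps only two fields, and the sample names and tokens in main are placeholders, not our real data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event keeps only the fields that matter for this check. The layout is an
// assumption modelled on the requestParameters excerpts in this post.
type event struct {
	RequestParameters struct {
		CreateLaunchTemplateRequest struct {
			LaunchTemplateName string
			ClientToken        string
		}
	} `json:"requestParameters"`
}

// parseEvents decodes a JSON array of events.
func parseEvents(raw string) ([]event, error) {
	var events []event
	err := json.Unmarshal([]byte(raw), &events)
	return events, err
}

// reusedTokens maps every client token that appears in more than one event
// to the launch template names that were requested with it.
func reusedTokens(events []event) map[string][]string {
	byToken := make(map[string][]string)
	for _, e := range events {
		req := e.RequestParameters.CreateLaunchTemplateRequest
		if req.ClientToken == "" {
			continue // not a launch template request, or no token recorded
		}
		byToken[req.ClientToken] = append(byToken[req.ClientToken], req.LaunchTemplateName)
	}
	for token, names := range byToken {
		if len(names) < 2 {
			delete(byToken, token)
		}
	}
	return byToken
}

func main() {
	// Placeholder data for illustration only.
	raw := `[
	  {"requestParameters": {"CreateLaunchTemplateRequest": {"LaunchTemplateName": "worker-a", "ClientToken": "token-6"}}},
	  {"requestParameters": {"CreateLaunchTemplateRequest": {"LaunchTemplateName": "worker-b", "ClientToken": "token-6"}}},
	  {"requestParameters": {"CreateLaunchTemplateRequest": {"LaunchTemplateName": "worker-c", "ClientToken": "token-7"}}}
	]`
	events, err := parseEvents(raw)
	if err != nil {
		panic(err)
	}
	for token, names := range reusedTokens(events) {
		fmt.Println(token, "was used by", names)
	}
}
```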

As I mentioned previously, generating the client token is the job of the client from AWS’ perspective. From our perspective, that client is Terraform. We’re not Go experts by any stretch of the imagination on our team (although we’re looking for a few good projects to take it for a spin. We have a lot of interest).

But in order to understand how the client token gets generated, we were going to have to look at the Terraform source code. After a little digging, we came across the code living in the terraform-plugin-sdk as a helper function.

func PrefixedUniqueId(prefix string) string {
	// Be precise to 4 digits of fractional seconds, but remove the dot before the
	// fractional seconds.
	timestamp := strings.Replace(
		time.Now().UTC().Format("20060102150405.0000"), ".", "", 1)
	idMutex.Lock()
	defer idMutex.Unlock()
	idCounter++
	return fmt.Sprintf("%s%s%08x", prefix, timestamp, idCounter)
}

In this function, the author generates a timestamp precise to four fractional digits of a second. It’s possible that multiple executions could land in the same time slice, but the value also gets a counter appended to it.

The counter is guarded by a mutex, so the value of idCounter is shared across calls and the mutex prevents concurrent updates. There should be no way that this function generates the same client token twice. But that doesn’t mean that the code asking for the client token isn’t storing it and possibly reusing it.
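That reasoning is easy to check with a local copy of the helper. The sketch below is modelled on the excerpt above rather than taken from the SDK: the lowercase name, the uint32 counter type, and countDistinct are assumptions made for this illustration. It calls the function from many goroutines at once and counts how many distinct values come back.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// Package-level state, as in the excerpt: one counter shared by all callers
// in this process. The uint32 type is an assumption.
var (
	idMutex   sync.Mutex
	idCounter uint32
)

// prefixedUniqueID is a local copy modelled on the quoted helper.
func prefixedUniqueID(prefix string) string {
	timestamp := strings.Replace(
		time.Now().UTC().Format("20060102150405.0000"), ".", "", 1)
	idMutex.Lock()
	defer idMutex.Unlock()
	idCounter++
	return fmt.Sprintf("%s%s%08x", prefix, timestamp, idCounter)
}

// countDistinct calls prefixedUniqueID from n goroutines concurrently and
// returns the number of distinct IDs produced.
func countDistinct(n int) int {
	ids := make(chan string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ids <- prefixedUniqueID("terraform-")
		}()
	}
	wg.Wait()
	close(ids)

	seen := make(map[string]struct{}, n)
	for id := range ids {
		seen[id] = struct{}{}
	}
	return len(seen)
}

func main() {
	const n = 1000
	fmt.Println("requested:", n, "distinct:", countDistinct(n))
}
```

Even when many calls share the same timestamp, the counter keeps the values apart. That guarantee is per process, since idCounter is a package-level variable, and the sketch says nothing about callers that generate a value once and then reuse it.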

This is where our story ends for the moment. We started to look into how and where the client token was being used, but since we felt strongly that this was going to be related to a Terraform issue of some kind, we had to shift gears for a solution. We weren’t going to run a custom patched version of Terraform. We weren’t going to upgrade on the spot. And we weren’t going to wait until a PR got approved, merged, and released, so that put us on a different remediation path.

Our current fix was to specify a depends_on argument for the two resources in conflict. The other times the error occurred, we noticed it was always these two resources in conflict, so the hope was that the depends_on argument would prevent them from being created in parallel.

So far that hope has paid off and we haven’t seen the error in any environment again. But we plan to continue researching the issue out of nothing more than morbid curiosity. It might lead us to a bug in Terraform, a misunderstanding of how a particular stanza works, or maybe even our own automation around Terraform.

Who knows? If we find it, you’ll be sure to find a Part 2 to this article!
