
The AWS content type journey

15/04/2019

Let me ask you something. Do you know the difference between complex and complicated?

Each HTTP response carries headers, one of them being Content-Type. It is in charge of telling the browser what kind of file the response is carrying. Simple, right? Well…

Picture this: I code, and my work is pushed to a repository on a daily basis. Every time the code is received, a CI pipeline starts, which itself contains jobs. Some of the jobs test my code, some build it and others deploy the resulting files to servers. So far, so good.

In my case, the servers are simple AWS S3 buckets, which are themselves placed behind an AWS CloudFront distribution for CDN and cache handling. So, when I audit the final built, deployed, served and cached production release of my project and find out that the Content-Type header is incorrect on some resources (thank you Dareboost), who is to blame?

The answer is: it's complex.

Figure: The current and incorrect Content-Type

At first, I obviously bet on CloudFront. It is the first one to respond, proxying each request to the S3 buckets. Easy target! But it turns out that it is a good lad as it does not modify anything from the underlying layer.

A quick test, requesting some files straight from the S3 buckets without CloudFront, shows me that the Content-Type header is incorrect there too. So I read the doc and learn that S3 buckets are not responsible for the MIME type of the files they serve. I’d better look at the provider of those files, which by design must send that information. Yeah, right.

At that point, I'm a bit lost. What kind of web server would not know the files it serves on the web? I mean, an S3 bucket is a sort of gigantic hard drive, mounted on a Linux VM which runs an Nginx or something, isn't it? It isn’t…

You can compare a bucket to a relational database table. A spreadsheet. It contains the list of its files, with a MIME type column. And that column has to be filled by the sender of those files. Period.
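To make the "sender fills the column" idea concrete, here is a minimal sketch of an upload that sets the type explicitly. It assumes an S3 client such as the one returned by `boto3.client("s3")`, whose `put_object` call accepts a `ContentType` argument. The helper name `upload_with_type`, the bucket name and the key are made up for illustration.

```python
def upload_with_type(s3_client, bucket, key, body, content_type):
    """Store an object and record its MIME type alongside it.

    s3_client is assumed to behave like boto3.client("s3").
    """
    # The bucket stores whatever ContentType the sender provides and
    # later serves it back as the Content-Type response header.
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=content_type,
    )


# Hypothetical usage, not run here because it needs AWS credentials:
# import boto3
# upload_with_type(boto3.client("s3"), "my-bucket", "index.html",
#                  b"<!doctype html>", "text/html")
```

If the sender passes the wrong value here, the bucket has no way to correct it: it only repeats what it was told.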

So here I am, ready to follow the lead and hunt down the villain, hiding in the dark pipelines of my CI. Kinda fun!

First stop, the deploy job. It gathers artifacts and sends the files using the AWS CLI. So I read the doc again and learn that this tool is indeed supposed to guess the MIME type of the files when it sends them to the S3 buckets. OK, but why is it guessing wrong in my case? I have to dig further into its source code, which is open source and written in Python. Lucky me!


I find the guessing function and see that it relies on a Python standard library module called mimetypes. A quick look at the doc tells me that, in the end, there is no magic. The Python module gets the list of known MIME types from an OS-level definition file. Makes sense.
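For the curious, the guess can be reproduced in a few lines. This is a minimal sketch of the idea, not the AWS CLI's actual code; the helper name `guess_content_type` is my own.

```python
import mimetypes

# Paths where the module looks for OS-level definition files.
# Only the ones that exist on the machine are actually read.
print(mimetypes.knownfiles)


def guess_content_type(filename):
    """Return the guessed MIME type, or None when the extension is unknown."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


print(guess_content_type("index.html"))  # text/html
```

When the extension is not listed anywhere, the function returns None, and the uploader has to fall back on something else.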

As you probably already know, in 2019, CI jobs often run in Docker containers. So I continue my journey by looking at the OS used as the base of the Docker image from which the job container is created. In the end, it turns out to be Debian, which is not known for being the most up-to-date distribution (a strength in a lot of cases). As you can guess by now, its MIME type definition list is incomplete: the one I am looking for is missing.

At that point, I can feel the fix at the tip of my fingers! An upgrade of the OS is not enough so I run to the Debian repositories and find the Sid version of the file (stable enough for my use case), grab it in the Dockerfile of the image and propose a PR. Done!
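The effect of shipping a more complete definition file can be reproduced in isolation. The sketch below uses a throwaway file in the mime.types format and a made-up extension, `.examplefmt`, mapped to `application/x-example`. Both are placeholders, not the type that was missing in my case.

```python
import mimetypes
import os
import tempfile

# A private database, so the experiment does not touch global state.
db = mimetypes.MimeTypes()

# Hypothetical extension that no definition file knows about yet.
before, _ = db.guess_type("asset.examplefmt")

# Write a one-line definition file in the mime.types format:
# "<type> <extension> [<extension> ...]"
with tempfile.NamedTemporaryFile("w", suffix=".types", delete=False) as handle:
    handle.write("application/x-example examplefmt\n")
    definitions_path = handle.name

try:
    # Loading the extra file stands in for a richer OS-level file.
    db.read(definitions_path)
finally:
    os.remove(definitions_path)

after, _ = db.guess_type("asset.examplefmt")
print(before, "->", after)
```

The guess goes from nothing to the declared type as soon as the file is loaded, which is all the new Docker image changes.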

After the PR merge, the release of the new Docker image and the update of my CI definition file to use it, the AWS CLI can guess the right MIME types and send them correctly to the S3 buckets, which serve them to the browsers. The network dev tool now shows the shiny new Content-Type header, and a new Dareboost audit quickly confirms that the issue has been fixed. Done and done!

In the end, our daily production code has to pass through a lot of steps before being consumed by the end users, who, by the way, can be human beings or robots. This journey was a good use case to show that front-end developers have to know more than just trendy frameworks and libraries. In my opinion, we can’t create anything good without pushing the boundaries.