Thursday, 12 February 2015

Post 26: URL encoding

URLs were designed to be as usable and interoperable as possible. For that reason the Internet standards define so-called “unsafe characters”, which should not appear literally in a URL.

Examples for unsafe characters are:
The space “ ”, because spaces can seem to disappear when a URL is printed, and you cannot tell how many space characters there are.
The pound/hash character “#”, because it is reserved for the fragment (we covered what a "fragment" is here already).
The caret “^”, because not all network devices transmit this character correctly.

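Which characters are safe and which are unsafe is defined in RFCs. RFC stands for Request for Comments, a series of documents published by the IETF (Internet Engineering Task Force). The term “unsafe” comes from the older RFC 1738; the current URI specification, RFC 3986, speaks of “reserved” and “unreserved” characters instead. Even where an RFC is officially only a recommendation, it is in practice treated as a standard.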

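RFC 3986 defines the characters that are always safe (it calls them “unreserved”) as the alphanumeric US-ASCII characters plus the hyphen “-”, the period “.”, the underscore “_” and the tilde “~”. A few special characters like the colon “:” and the slash mark “/” are reserved as delimiters and may appear unencoded only in that role.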
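To make the character classes a bit more concrete, here is a minimal Python sketch. It assumes the "unreserved" set from RFC 3986 (letters, digits, hyphen, period, underscore, tilde) as the always-safe characters. The name needs_encoding is a made-up helper for illustration only. Note that reserved delimiters like “:” and “/” are also flagged by it; they may stay unencoded only when they act as delimiters.

```python
import string

# Unreserved characters per RFC 3986, section 2.3:
# letters, digits, "-", ".", "_" and "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def needs_encoding(ch: str) -> bool:
    """Hypothetical helper: True if ch is outside the unreserved set."""
    return ch not in UNRESERVED


# Collect the characters of a file name that would have to be encoded.
flagged = [ch for ch in "^hello world.txt" if needs_encoding(ch)]
print(flagged)  # ['^', ' ']
```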

If you want to transmit one of these unsafe characters, you have to “percent-encode” it, which is also called “URL encoding”. For example, if you want to store the file “^hello world.txt” on the server foo.com, the valid URL would look like: “http://foo.com/%5Ehello%20world.txt”
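In practice you rarely encode by hand. As a small sketch, assuming Python and its standard library function urllib.parse.quote, the file name from the example can be encoded like this (the variable names are only for illustration):

```python
from urllib.parse import quote

filename = "^hello world.txt"

# quote() percent-encodes everything except letters, digits,
# "_", ".", "-", "~" and (by default) the slash "/".
url = "http://foo.com/" + quote(filename)
print(url)  # http://foo.com/%5Ehello%20world.txt
```

Only the file name is passed to quote() here. Encoding the whole URL with the default settings would also turn the colon after “http” into “%3A”.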

As you can see, the caret “^” and the space “ ” have been replaced with “%5E” and “%20” respectively. The two characters after the percent sign “%” are the hexadecimal number of the character in the US-ASCII character table, i.e. “5E” and “20” stand for “^” and “ ” in the US-ASCII table.
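The mapping between a character and its hexadecimal code can be checked in a few lines of Python. This is only a sketch for single US-ASCII characters; percent_encode_char is a made-up helper name, and characters outside US-ASCII would first have to be converted to bytes (usually UTF-8), which is not covered here.

```python
from urllib.parse import unquote


def percent_encode_char(ch: str) -> str:
    """Hypothetical helper: encode one US-ASCII character as %XX."""
    # ord() gives the code in the ASCII table, {:02X} formats it
    # as two upper-case hexadecimal digits.
    return "%{:02X}".format(ord(ch))


print(percent_encode_char("^"))  # %5E
print(percent_encode_char(" "))  # %20

# The other direction: hexadecimal 5E is the caret again.
print(chr(int("5E", 16)))  # ^

# The standard library decodes a whole string at once.
print(unquote("%5Ehello%20world.txt"))  # ^hello world.txt
```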

The full US-ASCII table can be found here.

Source(s):
HTTP Succinctly by Scott Allen Syncfusion
Wikipedia
