Decoding HTTP: Networking Fundamentals for System Designers
Episode #16: Learn the basics about the HTTP protocol before your next System Design Interview.
This is another episode in the series of articles about System design.
If you missed my last article on this series about What happens when you type a URL into your browser, please read it now. You will need it to understand some of the topics here.
This article will discuss in depth the HTTP protocol, where it is used, and how it can benefit you in a System Design interview.
Instead of just providing the facts, I would like to explain why you need to learn some of these topics and where you might use the knowledge you gain here.
As in previous articles, what you learn here is not limited to system design interviews but can be used in your daily job.
I expect this article's audience to be junior software engineers with no prior experience, as well as more experienced developers who want to explore a different point of view. I hope both learn something from how I approach this topic.
Enough with the introduction; let's jump into the meat of this article. We have a lot to cover.
The OSI model
The HTTP protocol, which stands for Hypertext Transfer Protocol, is one of the essential components at the foundation of the web.
There would be no web without the HTTP protocol. No Facebook, Netflix, Uber, LinkedIn, and so on.
Because of its ubiquity, it's fundamental for any software engineer to have a good understanding of it.
In university, I studied all about the network layers of the OSI model. We spent months covering anything from the bottom layers close to the hardware to the top application layers.
My suggestion to a newcomer?
As a Software Engineer, you should focus primarily on layer 7, the application layer, where you can find the HTTP protocol, and layer 4, the transport layer, where you can find protocols like TCP and UDP.
Going deep into Networking fundamentals is outside the scope of this article since it would take hundreds of pages in a book. Having said that, you should explore some of the concepts introduced in this article in more detail.
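For quick reference, these are the seven standard layers of the OSI model, from the bottom (closest to the hardware) to the top; only layers 4 and 7 matter for the rest of this article:
- Layer 1: Physical
- Layer 2: Data Link
- Layer 3: Network
- Layer 4: Transport (TCP, UDP)
- Layer 5: Session
- Layer 6: Presentation
- Layer 7: Application (HTTP)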
The Application layer
In essence, HTTP is a request-response protocol in a client-server computing model.
A web browser, acting as a client, sends an HTTP request to the server, where the web server then responds with an HTTP response. The client always initiates the communication.
Both request and response in HTTP are plain text, and they follow a fixed format made of three components:
- A start-line.
- Headers, in the format Key: Value.
- An optional body, preceded by an empty line.
In the previous article, What happens when you type a URL into your browser, we already discussed what the start-line looks like:
GET /p/the-role-of-content-delivery-networks HTTP/1.1
To refresh your memory, this line is made of three components:
- An HTTP verb, like GET in this case.
- A path to a resource, like /p/the-role-of-content-delivery-networks in this case.
- The HTTP protocol version, like HTTP/1.1 here.
There are many HTTP verbs (also known as methods) available. The most common are:
- GET requests a resource. It usually doesn't have a body, but there are exceptions to this rule.
- POST is the most common verb for sending data back to the server, for example via a web form. This verb is used both to create and to modify existing resources.
- PUT is similar to POST but should only be used to modify existing HTTP resources.
- DELETE, as the name says, deletes existing resources.
- HEAD is similar to GET but returns only the headers, not the response body. It is primarily used for testing and metadata retrieval.
There are other HTTP verbs, but they are less common than those listed here.
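To make the verbs concrete, here is a minimal sketch using the Python requests library; the httpbin.org endpoints are used purely as a demo target and stand in for whatever API you are working with.

```python
# A minimal sketch of the most common HTTP verbs using the Python "requests" library.
# httpbin.org is only a demo target; substitute your own API.
import requests

base = "https://httpbin.org"

# GET: request a resource (normally without a body).
resp = requests.get(f"{base}/get", params={"page": 1})
print(resp.status_code)

# POST: send data back to the server, e.g. a web form submission.
requests.post(f"{base}/post", data={"name": "Ada"})

# PUT: modify an existing resource.
requests.put(f"{base}/put", json={"name": "Ada Lovelace"})

# DELETE: remove an existing resource.
requests.delete(f"{base}/delete")

# HEAD: fetch only the headers, without the response body.
resp = requests.head(f"{base}/get")
print(resp.headers.get("Content-Type"))
```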
A good understanding of the HTTP methods is fundamental if you work as a Software Engineer on REST APIs.
A path in an HTTP request uniquely identifies an HTTP resource on a given host server. Paths can also be used in a REST API for versioning, pagination, filtering, and more.
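For example, a hypothetical REST API could encode the version in the path and use the query string for pagination and filtering:

```
GET /v2/articles/123 HTTP/1.1
GET /v2/articles?page=3&limit=20&tag=networking HTTP/1.1
```

The first path pins the API version and identifies a single resource by ID; the second uses query parameters for pagination and filtering.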
Finally, the last part of the start-line identifies which version of the protocol is being used. The most common value is still HTTP/1.1, but the newer versions HTTP/2 and HTTP/3 are also available.
These newer protocols provide many improvements, but unfortunately, they are still not as widespread as HTTP/1.1. More about them later in this article.
Headers in an HTTP request can be used for:
- authentication and authorization
- caching
- metadata about the body of the request or the response, like the text encoding or the binary format
- compression
- and much more
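As a concrete sketch (the host and token below are made up), a request carrying some of these headers could look like this:

```
GET /p/the-role-of-content-delivery-networks HTTP/1.1
Host: newsletter.example.com
Authorization: Bearer <token>
Cache-Control: no-cache
Accept: text/html
Accept-Encoding: gzip, br
```

Here Authorization covers authentication, Cache-Control influences caching, Accept describes the format the client expects for the body, and Accept-Encoding advertises the compression algorithms the client understands.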
I'm planning to talk more about HTTP headers in future articles.
The Transport layer
The HTTP protocol runs on top of a Transport layer protocol called TCP (part of the TCP/IP suite).
This reliable protocol ensures that packets are received in order and without loss as data travels from the client to the server.
TCP is also a connection-oriented protocol. That means a connection must be established before any communication can happen between the client and the server.
A TCP connection is established with a three-way handshake, a process that takes at least one complete round trip between the client and the server. The only way to make it faster is to bring the server geographically closer to the client, which is precisely why CDNs are so popular in System design: they reduce latency for both the three-way handshake and the data transmission when serving static content. I have written an article about the role of CDNs in System Design.
The three-way handshake is considered an expensive operation, which is why the HTTP protocol tries to reuse the same TCP connection as much as possible.
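To see where the handshake happens in code, here is a minimal sketch using Python's standard socket module; example.com is just a convenient public test host.

```python
# A minimal sketch of what sits underneath HTTP/1.1: connect() performs the
# three-way handshake, and the same TCP connection is then reused (keep-alive).
import socket

host = "example.com"  # a convenient public test host

with socket.create_connection((host, 80)) as conn:  # three-way handshake happens here
    request = (
        "GET / HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    )
    conn.sendall(request.encode())  # the request travels over the already-open connection
    print(conn.recv(4096).decode(errors="replace").splitlines()[0])  # e.g. "HTTP/1.1 200 OK"
```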
The front page picture of this article depicts exactly how the TCP connection is established. The picture has been created using a tool called PlantUML, which I discussed in a previous article, Enhancing Software Design with Diagrams as Code.
Future of HTTP protocol
As we said in the previous sections, the most common version of the HTTP protocol is currently HTTP/1.1.
In recent years, two major versions of the protocol have been released: HTTP/2 and HTTP/3.
HTTP/2 introduces a couple of significant changes:
- It solves head-of-line blocking. If a packet in the sequence is lost or delayed, all subsequent packets must wait for the missing packet to be retransmitted and received, even if they have already arrived at the destination. This delay affects every packet lined up behind the lost or delayed one. HTTP/1.1 suffers from this problem and works around it by opening up to six parallel connections to the same domain and spreading resources across multiple domains. HTTP/2 solves the problem at the protocol level with multiplexing, removing the need for the domain sharding workaround (the six parallel connections to the same domain) described above.
- HTTPS enabled by default. HTTPS is an extension to the HTTP protocol that encrypts data transmitted over the wire, making it secure from eavesdroppers. In previous versions of the protocol, HTTPS is highly recommended but not enforced.
- Server Push. In the previous section, we said the client always initiates the communication. With HTTP/2 Server Push, this is no longer always true: the server can initiate the communication instead.
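If you want to see HTTP/2 in action from Python, the httpx library can negotiate it; this is a hedged sketch assuming httpx is installed with its optional HTTP/2 extra and that the target server supports HTTP/2.

```python
# A minimal sketch of an HTTP/2 request with httpx.
# Assumes: pip install "httpx[http2]" and an HTTP/2-capable server.
import httpx

with httpx.Client(http2=True) as client:
    response = client.get("https://www.google.com/")
    # http_version reports what was actually negotiated: "HTTP/2" or "HTTP/1.1".
    print(response.http_version, response.status_code)
```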
HTTP/3 builds on top of HTTP/2 and improves performance even further by using a protocol developed by Google called QUIC, which is based on UDP.
The significant difference between TCP and UDP is that UDP is connectionless and provides no guarantees about the reliability or ordering of packets.
UDP has many modern use cases:
- the DNS protocol, described in the previous article, which resolves hostnames into IP addresses
- transmission of voice and video
- transmission of data in online multiplayer video games
The QUIC protocol in HTTP/3 removes the need to establish a connection with the three-way handshake, yet it still provides reliability guarantees similar to the TCP protocol.
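To make the contrast with TCP concrete, here is a minimal sketch with Python's standard socket module: a UDP sender fires a single datagram with no handshake at all (127.0.0.1:9999 is just a placeholder address, and nothing guarantees the datagram arrives).

```python
# UDP in a nutshell: no connect(), no handshake, no delivery or ordering guarantees.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # a connectionless UDP socket
sock.sendto(b"hello over UDP", ("127.0.0.1", 9999))      # placeholder destination
sock.close()
```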
System Design
The most notable reason to use the HTTP protocol in a distributed system is that it is a stateless protocol.
"Stateless" means that each HTTP request from a client to a server is treated as a completely new and independent transaction.
Being stateless makes the system design of your web application particularly easy.
You can store the state in a database, and the rest of the application can be stateless. To handle an increasing customer load, you can use horizontal scaling by starting multiple identical instances of the application in parallel and putting them behind a Load Balancer.
Being stateless also has limitations; some applications must maintain the user state between requests.
HTTP solves part of this problem with Cookies and sticky sessions.
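As a hedged sketch of the stateless idea, the hypothetical handler below keeps nothing in memory between requests: the client sends a session ID in a cookie on every request, the user state lives in a shared store (a dict stands in for the database), and therefore any identical instance behind the load balancer can serve the request.

```python
# A hypothetical stateless request handler: all user state lives in a shared
# store keyed by a session ID; the dict stands in for a real database or cache.
shared_session_store = {"abc123": {"user": "ada", "cart_items": 3}}

def handle_request(headers: dict) -> str:
    # Read the session ID from the Cookie header that accompanies every request.
    cookie = headers.get("Cookie", "")
    session_id = cookie.split("session_id=")[-1] if "session_id=" in cookie else None

    # Look the user state up in the shared store; the instance itself stays stateless.
    session = shared_session_store.get(session_id)
    if session is None:
        return "HTTP/1.1 401 Unauthorized"
    return f"HTTP/1.1 200 OK (hello {session['user']})"

# Any of the identical application instances behind the load balancer could run this:
print(handle_request({"Cookie": "session_id=abc123"}))
```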
Knowing the difference between layer 7 and layer 4 and their associated protocols is particularly useful in system design when deciding between an ALB (Application Load Balancer), which operates at layer 7, and an NLB (Network Load Balancer), which operates at layer 4, in AWS.
I'll save a deeper discussion of Load Balancers for a future article.
Resources
TCP vs UDP (affiliate marketing link) - a course on system design fundamentals that also covers network essentials, CDNs, Load Balancers, and much more.
System Design Primer - a great free resource for learning more about System Design, mostly aimed at beginners.
Want to connect?
👉 Follow me on LinkedIn and Twitter.
If you need 1-1 mentoring sessions, please check my Mentorcruise profile.