About pastebin.com
Users can paste, write, or store text for a specific period of time, and the same content can be accessed or shared via a unique URL. The idea behind this system is that people should be able to share large amounts of text online with others in a simple and convenient manner.
Functional requirements
- Users should be able to paste their content, and each paste should be accessible via a unique URL.
- Registered users can edit or delete their pastes.
- A paste is removed from the system after 1 year, at which point its URL expires.
Non-functional requirements
- The system should be highly available and reliable for both creating and accessing pastes.
Constraints
- A paste can contain at most 2 MB of content.
- A paste expires and is removed from the system after 1 year.
Estimations
Content will be read more often than it is written. Let's assume that every paste written by 1 user is read by 10 users. This makes the system read-heavy rather than write-heavy.
Traffic: Let's assume we receive 1 million write requests per day. There will then be 10 million read requests per day (write = x, read = 10x). The system must therefore be highly scalable to handle this volume of traffic.
Storage: Let's assume we store text for 1 year. At 1 million writes per day we receive about 30 million write requests per month, so we store about 360 million pastes in 1 year (30 million * 12 months). The total storage required for 1 year is therefore about 900 TB (360 million objects * 2,500 KB each, that is, 2 MB of text + 500 KB for user data and other telemetry). This is a worst-case figure, because it assumes every paste reaches the 2 MB maximum.
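The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the constant names are made up for illustration, decimal units (1 TB = 10^9 KB) and a 30-day month are assumed as in the text, and the per-second rates are an extra derived figure that the estimate itself does not state.

```python
# Back-of-the-envelope estimates using the assumptions from the text.
WRITES_PER_DAY = 1_000_000
READ_WRITE_RATIO = 10            # 1 write is read by 10 users
DAYS_PER_MONTH = 30              # simplification used in the text
MONTHS_PER_YEAR = 12
OBJECT_SIZE_KB = 2_000 + 500     # 2 MB of text + 500 KB of metadata (decimal units)
SECONDS_PER_DAY = 86_400

reads_per_day = WRITES_PER_DAY * READ_WRITE_RATIO
writes_per_year = WRITES_PER_DAY * DAYS_PER_MONTH * MONTHS_PER_YEAR

storage_kb = writes_per_year * OBJECT_SIZE_KB
storage_tb = storage_kb / 1_000_000_000   # 1 TB = 10^9 KB

# Average request rates, not stated in the text but implied by the daily totals.
writes_per_second = WRITES_PER_DAY / SECONDS_PER_DAY
reads_per_second = reads_per_day / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"reads/day:      {reads_per_day:,}")
    print(f"writes/year:    {writes_per_year:,}")
    print(f"storage (TB):   {storage_tb:,.0f}")
    print(f"writes/second:  {writes_per_second:.1f}")
    print(f"reads/second:   {reads_per_second:.1f}")
```

The averages are modest (roughly a dozen writes per second), so the storage volume, not the request rate, is the dominant number in this estimate.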
APIs offered by this system
- CreatePaste(userId, text, expirationInDays) : key – This API creates the paste in the database. If the user is logged in, userId is available; otherwise it is null, since anonymous pastes are also allowed. expirationInDays takes one of these values: {1, 2, 3, 7, 30, 60, 90, 180, 365}.
- ReadPaste(key) : text – This API takes a key as its parameter and returns the text. If the key does not exist, the API returns a 404, which the caller of the API needs to handle.
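To make the two API contracts concrete, here is a minimal in-memory sketch. It is not the real service: the dictionary stands in for the database, the `PasteNotFound` exception stands in for the 404 response, the random key is a placeholder for whatever the key service returns, and the 2 MB limit is interpreted as 2,000,000 bytes of UTF-8. All of those names and choices are assumptions for illustration.

```python
import secrets
import string
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple

ALLOWED_EXPIRATIONS = {1, 2, 3, 7, 30, 60, 90, 180, 365}
MAX_PASTE_BYTES = 2_000_000  # 2 MB constraint, decimal megabytes assumed


class PasteNotFound(Exception):
    """Stands in for the HTTP 404 returned by ReadPaste."""


# Hypothetical stand-in for the database: key -> (user_id, text, expires_at)
_store: Dict[str, Tuple[Optional[str], str, datetime]] = {}


def _placeholder_key(length: int = 6) -> str:
    # Placeholder only; the design obtains keys from a pre-generated pool.
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def create_paste(user_id: Optional[str], text: str, expiration_in_days: int) -> str:
    """Store a paste and return its key. user_id is None for anonymous pastes."""
    if expiration_in_days not in ALLOWED_EXPIRATIONS:
        raise ValueError(f"unsupported expiration: {expiration_in_days}")
    if len(text.encode("utf-8")) > MAX_PASTE_BYTES:
        raise ValueError("paste exceeds the 2 MB limit")

    key = _placeholder_key()
    while key in _store:  # retry on the (unlikely) collision
        key = _placeholder_key()

    expires_at = datetime.now(timezone.utc) + timedelta(days=expiration_in_days)
    _store[key] = (user_id, text, expires_at)
    return key


def read_paste(key: str) -> str:
    """Return the paste text, or raise PasteNotFound for unknown or expired keys."""
    entry = _store.get(key)
    if entry is None or entry[2] <= datetime.now(timezone.utc):
        raise PasteNotFound(key)
    return entry[1]
```

A caller would wrap `read_paste` in a `try`/`except PasteNotFound` block, which corresponds to handling the 404 on the client side.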
Database
There is a relation between a user and the pastes that user creates, so we can use any RDBMS here.
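One possible relational layout is sketched below using Python's built-in sqlite3 module, purely so the schema can be run locally. The table names UserContent and UnUsedKeys come from this design; the Users table and every column name are assumptions made for illustration, and a production system would use a server RDBMS rather than SQLite.

```python
import sqlite3

# Column names are illustrative assumptions; only the table names
# UserContent and UnUsedKeys appear in the design.
SCHEMA = """
CREATE TABLE Users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE UserContent (
    paste_key  TEXT PRIMARY KEY,
    user_id    INTEGER REFERENCES Users(user_id),  -- NULL for anonymous pastes
    content    TEXT NOT NULL,
    created_at TEXT NOT NULL,
    expires_at TEXT NOT NULL
);
CREATE TABLE UnUsedKeys (
    paste_key TEXT PRIMARY KEY
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# An anonymous paste: user_id is NULL.
conn.execute(
    "INSERT INTO UserContent VALUES (?, NULL, ?, ?, ?)",
    ("abc123", "hello", "2024-01-01T00:00:00Z", "2024-01-08T00:00:00Z"),
)
conn.commit()
```

The nullable foreign key on `user_id` is what lets the same table hold both registered and anonymous pastes.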
System design
TrafficManager receives all user requests. Based on the location of the request, it forwards the request to the nearest hosted API instance, that is, the one in the nearest region. We host an instance of the API in every region (one hosting per continent) to keep latency minimal.
Initially, we generate keys offline and store them in the UnUsedKeys table. Whenever PasteAPI receives a request to create a new paste, it forwards the request to KeyGenerationService. This service picks the first key from the UnUsedKeys table and returns it to PasteAPI. A background thread in KeyGenerationService then removes that key from the UnUsedKeys table.
The KeyCleanUp service is responsible for recycling keys. We can run this service at the end of every day. It moves the keys of all expired pastes from the UserContent table back to the UnUsedKeys table.
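The key hand-out and daily clean-up described above can be sketched as follows. The class and function names are hypothetical, plain Python containers stand in for the UnUsedKeys and UserContent tables, and there is one deliberate simplification: the key is removed from the pool at the moment it is handed out rather than later by a background thread, so the same key cannot be given to two requests.

```python
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Deque, Dict, Iterable, List, Tuple


class KeyPool:
    """In-memory stand-in for the UnUsedKeys table."""

    def __init__(self, keys: Iterable[str]) -> None:
        self._unused: Deque[str] = deque(keys)

    def take(self) -> str:
        # Pick the first unused key and remove it in the same step.
        # Raises IndexError when the pool is empty.
        return self._unused.popleft()

    def give_back(self, key: str) -> None:
        self._unused.append(key)

    def __len__(self) -> int:
        return len(self._unused)


def clean_up(
    user_content: Dict[str, Tuple[str, datetime]],
    pool: KeyPool,
    now: datetime,
) -> List[str]:
    """Delete expired pastes and return their keys to the pool.

    user_content maps key -> (text, expires_at) and stands in for UserContent.
    """
    expired = [k for k, (_, expires_at) in user_content.items() if expires_at <= now]
    for key in expired:
        del user_content[key]
        pool.give_back(key)
    return expired
```

A scheduler (for example a daily cron job, which is an assumption here) would call `clean_up` once at the end of each day.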
Both the UnUsedKeys and UserContent tables should be highly available. To achieve this, we can use database replication (master-slave), so that whenever one node goes down, the other node can serve requests. In addition, one node can accept the write requests for creating pastes while the other node serves the read requests.
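The read/write split and the failover behavior can be illustrated with a small routing sketch. The `Node` and `Router` names are hypothetical, and the sketch only decides which node should receive a request; in a real master-slave setup a replica has to be promoted before it can accept writes, which is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A database node; `up` would come from health checks in practice."""
    name: str
    up: bool = True


class Router:
    """Sends writes to the primary and reads to the replica, with failover."""

    def __init__(self, primary: Node, replica: Node) -> None:
        self.primary = primary
        self.replica = replica

    def route(self, is_write: bool) -> Node:
        if is_write:
            preferred, fallback = self.primary, self.replica
        else:
            preferred, fallback = self.replica, self.primary
        if preferred.up:
            return preferred
        if fallback.up:
            # Simplification: assumes the surviving node can take over the role.
            return fallback
        raise RuntimeError("no database node available")
```

With both nodes healthy, `route(is_write=True)` returns the primary and `route(is_write=False)` returns the replica; if either node is marked down, all traffic goes to the other one.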
Caching
To reduce latency, we can use caching in a couple of places.
1) Whenever KeyGenerationService receives a request for a new key, it can return the key from a cache instead of reading it from the database.
2) The other potential place for a cache is on the read path. Whenever a new paste is created, it is highly likely that it will be retrieved by multiple users (10x), so we can place such in-demand pastes in the cache. We can use LRU (least recently used) as the cache eviction policy.
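For the read path, an LRU cache can be sketched in a few lines with `collections.OrderedDict` from the standard library. This is a single-process illustration only; the class name and capacity are assumptions, and a deployed system would more likely use a shared cache service.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Least-recently-used cache mapping paste keys to paste text."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        """Return the cached text, or None on a cache miss."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, text: str) -> None:
        """Insert or update an entry, evicting the least recently used one if full."""
        self._items[key] = text
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._items)
```

On a miss (`get` returns `None`), the caller would read the paste from the database and `put` it into the cache before returning it.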
Offline key generation service
Earlier, we decided to generate keys offline and store them in the UnUsedKeys table. This raises a few questions: why store keys offline instead of generating them when needed, how many keys should we insert initially, and how long should those keys be?
We need a unique key every time we receive a write request. The problem with generating keys on demand is that after generating a key, we need to check whether it is already used by another paste. If it is, we need to regenerate the key, and this process repeats until we get a unique key. For this reason, we generate keys offline.
A paste URL would look like this: sitename.com/{key}. A key can contain capital letters, small letters, and digits, which gives 62 characters in total. If we add any two special characters, the total becomes 64 characters. With this base64 alphabet, an 8-letter key gives 64^8 = approx. 281 trillion strings and a 6-letter key gives 64^6 = approx. 68 billion strings.
In the estimation, we assumed that we receive 1 million write requests daily, so for 1 year we would need about 365 million keys. 6-letter keys would therefore suffice.
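The key-space numbers can be verified, and a batch of unique keys generated offline, with the short sketch below. The two special characters "-" and "_" are an assumption (they are the ones used by the URL-safe base64 alphabet), and the function names are made up for illustration.

```python
import secrets
import string
from typing import Set

# 26 capital + 26 small letters + 10 digits = 62, plus two special characters.
# "-" and "_" are an assumed choice, matching the URL-safe base64 alphabet.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def key_space(length: int) -> int:
    """Number of distinct keys of the given length."""
    return len(ALPHABET) ** length


def generate_keys(count: int, length: int = 6) -> Set[str]:
    """Generate `count` distinct random keys; the set discards duplicates."""
    if count > key_space(length):
        raise ValueError("more keys requested than the key space contains")
    keys: Set[str] = set()
    while len(keys) < count:
        keys.add("".join(secrets.choice(ALPHABET) for _ in range(length)))
    return keys


if __name__ == "__main__":
    print(f"64^6 = {key_space(6):,}")
    print(f"64^8 = {key_space(8):,}")
    print(sorted(generate_keys(3)))
```

Because uniqueness is enforced once, at generation time, the write path never has to run the check-and-regenerate loop described above.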
Telemetries
It is always good to store telemetry such as the visitor's country, the date and time of access, UI widget clicks, change events, etc.
Happy designing!