Update
I apologize to all 3 of you who had actually bookmarked this blog in hopes that I might actually produce something useful, but a lot has happened in the past several months. The birth of my son, Maximilian, and the start of a new adventure. You may have even noticed some of it showing up in interesting places.
I will get to the article on caching shortly. First though, I have to say: it's been so refreshing switching back to Java. PHP does a lot of things well, but running enterprise scale software is definitely not one of them. The subsequent re-introduction to the now-fully-mature Spring IOC and MVC, and Hibernate have been eye-opening to say the least. I had explored both of these products several years ago, and they were clunky in their infancy, but have now fully matured into out-of-the-box enterprise grade software. Their scope and utility has been humbling to someone who has spent many years designing frameworks.
I also need to give credit where credit is due, so thank you Sebastien Stadil, CEO of the awesome Scalr, for all of the help recently. Thanks to Endemol for partnering with us. Thanks to all of the individuals who came out of the woodwork to support Bonnie and myself for the past few months, with the company and our personal lives. It's amazing how open the game development community is, despite what most are led to believe.
Outside of work, I had a very enlightening racing season, travelling around the US with RCD Racing, experiencing new tracks and eventually finishing 3rd in our class. It was truly unfortunate that our good friend Steven Golladay was taken from us. Steve had been a recent fixture in the WERA paddock, and would have raced with Dave and myself full time with RCD this season if it had not been for other circumstances. Instead, he was only able to compete in one of the endurance races, one that I missed. You are missed and remembered Steve. I wish I had been able to ride with you more.
Scaling Data, Part 2
This is my second post in this series on Scaling Data. You can find the first post here.
Simple Non-Relational Databases (Key-Value Stores)
What we've been looking at so far is the traditional way of storing information in a database, using a relational model and using the features of the underlying RDMBS to your advantage where possible. With this approach, the "weight" of computation for data relationships lies mostly on the underlying RDBMS system, although each evolution of our solution has been moving more computation into the application itself. A different approach to data storage is a non-relational database (dubbed NoSQL). I prefer to call them key-value stores, since they are, at their most basic level, large hash-mapping systems. Key-value stores have been around for a while, and used in other industries, but are now attracting attention in the web-application realm due to their speed. Examples of these key-value stores would be systems like Cassandra, Project Voldemort, Tokyo Cabinet, etc. New projects are emerging at a rapid rate, but many of these solutions remain in fairly unstable condition. Key-value systems tend to offer very little in terms of features when compared to traditional RDBMS systems. Most of what you need to know of the API for any of these systems will be very similar to this:
value = get( key ) put( key, value )
That's all you get, but that's all you need.
Complex Non-Relational Databases (Object Databases)
Object Databases are similar to Non-Relational database, although they allow some context. For example, in MongoDB, your object must be in JSON format, but you are able to create indexes on fields within any object that is inserted.
{
id: 21,
name: "person a"
}
{
id: 22,
name: "person b"
}
An index can be created on the above two objects stored in the database on the "name" field, allowing applications using the system to look up objects by name. Examples of Object Databases would be MongoDP, Hadoop HBase, db4o, etc. The lines between simple and complex non-relational databases are blurring as these technologies become more prevalent. Some of these databases even have a variety of simple relational features, although nowhere near as feature-complete as a traditional RDBMS.
Pros and Cons
So why would anyone want to use a non-relational database? Each situation is different, but there are two major advantages.
- Speed. When used correctly, non-relational databases are faster than relational databases.
- Data flexibility. Without defined schemas, your data can continually adapt and evolve.
Touching on point 2, as long as our serialization and deserialization methods support adding and removing fields, we are able to version data very easily. This is usually accomplished just by changing the model structure definition in your code, and maintaining a version field and a method to upgrade between versions. For each situation you have to weigh the advantages with the disadvantages. To name a few:
- Newer projects mean less stability, possible breaking API changes with each update, and potential data loss.
- Fewer enterprise grade features built in (sharding, clustering, replication, access control etc. may have to be implemented on your end).
Reducing the Pizza System to work with Key-Value Stores
It is no coincidence that we have already started reducing the database operations to a level that would work well with a non-relational database. We have been forced to avoid the use of some of the RDBMS' more "expensive" features, such as joins, as we inch ever closer to a system where we no longer require any of these features. Looking at Solution 5, all of our joins and foreign keys are handled in the application itself, not the database. Our goal is to reduce "access" complexity, and consequently increase data complexity. This means we want to reduce most of our database options to the get/put procedures mentioned earlier, but the size of each row will grow quite large, and may contain more than one object.
The next step is to figure out our "super" objects, as these will become our non-relational tables. A super object is something that contains a large chunk of the data that we need to work with, and is usually representative of a completed logical process of the system. There are no tangible rules to determining these objects, a lot of it comes from experience and in-depth knowledge about how this data will be used.
Looking at Solution 5, we have two objects that qualify as super objects: Order and Pizza. Also note that Customer would make sense as a super object if it had additional information, such as a list of addresses or phone numbers rather than just one. Since our solution requires that every Pizza is attached to an Order, we really only have one point of focus with this application, the Order data.
Now, we'll use a "model" class to represent the data that we're going to put in the key-value store. This model class will use serialization and deserialization to save and reconstruct itself in the key-value store. Unfortunately, our application requires a few more advanced features, so we'll have to split up the Order data into two different places.
Solution 6
Relational Databases:
Server 1
Topping: (id, name, price)
Crust: (id, name, price)
Sauce: (id, name, price)
PizzaLocation: (id, name, priceMultiplier)
Server 2
OrderQueue: (orderId, delivery, fulfilled, date)
Key-Value Stores
Server 3
Order_1
Server 4
Order_2
Server 5
Order_3
Server 6
Order_4
Code:
class Order
{
orderId;
customerId;
delivery;
fulfilled;
date;
pizzas = array();
function Order(id)
{
orderId = id;
this.load();
}
function load()
{
shard = orderId % 4;
serverId = shard + 4;
tableName = "Order_" + shard;
data = get (tableName, orderId);
this.unserialize(data);
}
function store()
{
shard = orderId % 4;
serverId = shard + 4;
tableName = "Order_" + shard;
data = this.serialize();
store (tableName, orderId, data);
}
}
allToppings = SELECT * FROM Topping;
allCrusts = SELECT * FROM Crust;
allSauces = SELECT * FROM Sauce;
allPizzaLocations = SELECT * FROM PizzaLocation;
function getToppingById (id) {
foreach topping in allToppings {
if topping.id == id { return topping }
}
}
function getCrustById() { ... }
function getSauceById() { ... }
function getPizzaLocationById { ... }
orderId = SELECT orderId FROM OrderQueue WHERE delivery=true AND fulfilled=false
LIMIT 1 ORDER BY date;
order = new Order(orderId);
orderCustomer = SELECT * FROM Customer WHERE id=[order.customerId];
foreach pizza in order.pizzas {
pizza.displayInfo = getCrustById(pizza.crustId) + getSauceById(pizza.sauceId);
foreach topping in pizza.toppings {
topping.displayInfo = getToppingById(topping.id) + getPizzaLocationById(topping.pizzaLocationId);
}
}
Analysis
At this point, our Order class is our Order table. Some important things to remember at this point:
- Fields can be freely added to a model class and have them save in the key-value store.
- Fields can be removed from a model class, but depending on the language you're using and the serialization method, you may have to catch non-existent properties.
- The Pizza instances are completely contained within the Order class and need to be handled by whatever serialization method has been chosen.
- If data structures need to be adjusted, add a version field and a method to check for upgrades after loading data.
This entire solution hinges on one critical point: serialization and deserialization methods. Most languages have built in functionality for serialization, but I've found that using JSON as a serialization format has several advantages. Particularly in PHP, where serialization is extremely slow, I found that my own JSON encoding and decoding methods were much faster. JSON data is also programming-language-agnostic, you can have one service running in PHP and another in Java, both reading and writing to the same structures. As an added benefit, the data can be passed directly to a front-end renderer, be it a JavaScript website (like Ext-JS) or a Flash game.
One severe disadvantage to Solution 6 is that it completely destroys our ability to run statistics on the raw data. The easiest way to combat this is to add another Relational-Database server running just statistics tables and have the code only INSERT into these tables when something interesting happens. As long as the code doesn't read from those tables, inserts should be fairly quick, and the code doesn't necessarily need to wait for the response from the server to determine if the insert was successful.
Next up: Caching.
Scaling Data, Part 1 – Scaling Relational Databases
This is my first post in a series of discussions on scaling applications. This applies to everything from web sites to corporate intranet software. Most of the context will be in terms of web applications, since that's what I've recently been working on. The first focus on this series is going to be data. Whether the data is generated by your employees or by your users, modern applications tend to have excessive amounts of it. Users expect to see this data in meaningful ways, as do managers, investors, and analysts. They want it all, and they want it now. So the question becomes, how do we handle all of this data as the application continues to grow? Beyond that, how fast can we make it? I'm going to do my best to give you meaningful answers to these questions.
These posts are geared toward developers, and cover a few issues in painful detail, but glance over a few very difficult topics that you'll have to research on your own. I'm going to continue to build on all of these examples and topics in later posts.
Data Relationships
Data gets it value through context. For instance, take the integer 12. By itself, it means next to nothing, unless you have numerophobia, but if I tell you that's how many purchases people have made in the last 10 minutes, or that's how many days overdue your credit card bill is, that number has a relative value beyond it's definition. These simple relationships are the main way we give data context, and database systems have been continually evolving the way we can express the relationships between data for half a century.
If you haven't read it yet, please take some time to look over The Pizza Question, specifically Solution 2.
The standard solution for these types of problem is a Relational Database Management System (RDBMS), something like MySQL, MsSQL, Oracle, PostgreSQL, etc. These systems all have their individual strengths and weaknesses, but are essentially based off of the same ideas of how data should be stored, organized, and accessed. They even share a (mostly) standardized method of accessing these features through the Structured Query Language (SQL).
Solution 2 is almost completely reliant upon the RDBMS to extrapolate meaning from the data. The RDBMS enforces constraints (foreign keys), concatenates data (joins), and handles every operation that we could possibly want to enact on our data. It's a pretty good deal for us, as engineers, to be able to take advantage of all of this management, as we don't have to write any of those pieces of code ever again. This is the tried and true approach that has been used for as long as I have ever known, and it works very well.
Relationship Issues
Solution 2 works in practice, the data seems flexible and the code wasn't too difficult to write, but there is a major issue. The issue is computational complexity, which is usually represented in Big-O notation. Every time we select the oldest unprocessed Order, the database has to search all Order records to find the first one that matches, this is dependent on the number of order records. If it takes .001 seconds to find a matching record from the Order table when there are 1,000 Order records, it could take up to 1 second to find each record at 1 million rows, or 10 seconds at 10 million rows. This is an Order-n complexity, O(n), where n is the number of records that need to be analyzed. This means that as the size of n grows, the time will also grow at a proportional rate. The idea of 10 million rows might seem preposterous for a pizza ordering system, but it would probably make sense if you were building a system for an international pizza conglomerate.
Now there are things much worse than an O(n) complexity, for instance O(n^2) or O(2^n), but there are also ways to much better than O(n), and it's these methods that have to take advantage of when n becomes too large to be handled by traditional means. For instance, a binary-tree index can reduce search time (on the indexed field) to O(log n), and a mapping algorithm can reduce searches to O(1) which is a constant time. These approaches are exactly what the underlying RDBMS attempts to do when you index a column.
The Big-O notation is not an exact measurement, but a rough estimation of the complexity of the operation. As anyone who's attempted to scale an application can tell you, the "trivial" factors like network overhead, process overhead, and context switching can add up so fast it will make your head spin. You must also consider that the database must do a join when you request information spread across multiple rows or tables. All of these things increase the complexity, and therefore increase the response times of the RDBMS, inching you ever closer to that point where things just stop working, or things get so slow that customers take their business elsewhere.
Continuing with the pizza conglomerate example, let's say that you receive 100k unique orders per day. I'm not going to focus on the "static" data, I'll go into caching in a later article, for now I'll focus on user-generated-data. During the order process, we'll estimate that the order information is accessed 3 times on average (1 insert, 1 select, 1 update). That means the database system has to do about 3.4 transactions per second. This is perfectly reasonable with a single RDBMS server with proper configuration. But this doesn't give us the entire picture, for instance: Inserts are usually faster than updates, but they can cause locking issues. There are also external factors, and these are the ones that will cause the most trouble. The site you've built for the pizza conglomerate is doing so well, that the boss decides to buy some traffic on the most popular social network. Anyone talking about pizza, browsing pizza fan groups, or even thinking about pizza is going to see an advertisement for your site, all for a nominal fee. Overnight, the traffic doubles. The RDBMS begins to slow at first, but as the number of records increases, the load increases, and it eventually stops responding. Now we have to figure out why and how to alleviate it.
We know that RDBMS systems utilize RAM to store commonly accessed indexes and tables. Once the size of this data exceeds the size of the physical memory on the server, the RDBMS will need to utilize clever strategies to try and keep access times down, but at some point they will break down and the machine will need to use virtual memory, meaning disk swaps. The problem is that memory issues can cascade. As requests come in, the server still accepts them until it's input buffer fills up, but each request gets queued up. If a request takes too long to elicit a response from the server (say 15 second timeout), "smart" code will usually attempt to get the data once again, because the code needs that data to continue on. This means we now have multiple copies of every query in the queue as they're waiting to be answered. Once the input buffer fills up, the server will stop responding. Even once all of the connections are eliminated, after a warm-up period, the server will still go into a state of "thrashing" where it just can't keep up anymore, even if traffic is reduced.
So what do we do? First, we need to analyze how to reduce the number of accesses, and make sure that we actually need to access the data 3 times per order. This is a topic for caching and optimization, and I'll cover this later. Assuming data access is already optimized, we need more hardware, but how do we utilize it? The easiest possible solution is to move different tables to different servers. But wait, if you do that, most RDBMS systems won't let you do joins across multiple servers, and so far, we're reliant on the RDBMS enforcing data integrity and our example queries from the application use joins. Now what? We have three options:
- modify your data model, reducing data normalization and eliminating constraints, meaning data may have to be duplicated on multiple servers
- modify your code to handle the joins itself, increasing the number of queries each time order data will need to be constructed
- some combination of the above two
The first steps with our pizza model would be to eliminate the foreign keys on the database tables, which means that the RDBMS can no longer enforce data normalization. After we do that, we need to reduce all of our SELECT statements from our code to only access one table at a time. This will increase the number of queries defined in our pseudocode for Solution 2 from 3 to at least 7. This means that we'll need to do more queries, smaller in individual size, but hopefully faster. Now we can move these tables to almost any machine we wish, allowing us to utilize all this fancy new hardware we just purchased. With those fairly major changes, we can get the system back up and responding.
Solution 3
Server 1
Topping: (id, name, price)
Crust: (id, name, price)
Sauce: (id, name, price)
PizzaLocation: (id, name, priceMultiplier)
Server 2
Pizza: (id, name, crustId, sauceId)
PizzaXTopping: (pizzaId, toppingId, pizzaLocationId)
Server 3
Customer: (id, name, address, phone)
Server 4
Order: (id, customerId, delivery, fulfilled, date)
OrderXPizza: (orderId, pizzaId)
And here's what our queries would look like from the application, and some pseudocode to glue everything together.
order = SELECT * FROM Order WHERE delivery=true AND fulfilled=false
LIMIT 1 ORDER BY date;
orderCustomer = SELECT * FROM Customer WHERE id=[order.customerId];
pizzaIds = SELECT pizzaId FROM OrderXPizza WHERE orderId = [order.id];
pizzas = SELECT * FROM Pizza WHERE id IN ([pizzaIds]);
foreach pizza in pizzas {
pizza.displayInfo = SELECT Crust.*, Sauce.* FROM Crust, Sauce
WHERE Crust.id=[pizza.crustId] AND Sauce.id=[pizza.sauceId];
pizza.toppings = SELECT * FROM PizzaXTopping WHERE pizzaId=[pizza.id];
foreach topping in pizza.toppings {
topping.displayInfo = SELECT Topping.*, PizzaLocation.* FROM
Topping, PizzaLocation
WHERE Topping.id=[topping.id] AND PizzaLocation.id=[topping.pizzaLocationId];
}
}
Small Issues
With the above solution running, we notice that the system is immediately more resilient than it was originally, but it almost immediately begins to slow down as traffic increases. Looking at the database servers, we can see there's excessive load on the Server 1, the system hosting Crust, Sauce, Topping, and PizzaLocation. Digging further, we can see that there are a lot of queries going through for Crust and Sauce, but more than double that amount for Topping and PizzaLocation. A quick eureka moment later, you realize that the data in these tables doesn't change very often, therefore we only really need to request it once and we can reuse it.
Solution 4
This solution uses the same database tables as Solution 3.
allToppings = SELECT * FROM Topping;
allCrusts = SELECT * FROM Crust;
allSauces = SELECT * FROM Sauce;
allPizzaLocations = SELECT * FROM PizzaLocation;
function getToppingById (id) {
foreach topping in allToppings {
if topping.id == id { return topping }
}
}
function getCrustById() { ... }
function getSauceById() { ... }
function getPizzaLocationById { ... }
order = SELECT * FROM Order WHERE delivery=true AND fulfilled=false
LIMIT 1 ORDER BY date;
orderCustomer = SELECT * FROM Customer WHERE id=[order.customerId];
pizzaIds = SELECT pizzaId FROM OrderXPizza WHERE orderId = [order.id];
pizzas = SELECT * FROM Pizza WHERE id IN ([pizzaIds]);
foreach pizza in pizzas {
pizza.displayInfo = getCrustById(pizza.crustId) + getSauceById(pizza.sauceId);
pizza.toppings = SELECT * FROM PizzaXTopping WHERE pizzaId=[pizza.id];
foreach topping in pizza.toppings {
topping.displayInfo = getToppingById(topping.id) + getPizzaLocationById(topping.pizzaLocationId);
}
}
Sizable Issues
With Solution 4 up and running, the system seems to respond much better, and handles the current load, and you finally drift off to a well-deserved sleep.
Is that your PDA buzzing at 3am? Now what? One particular database server has stopped responding... the Pizza server. The size of the Pizza table has exceeded the size of the physical RAM. The RDMBS server put up a good fight to try and keep only relevant records in RAM, but these dang customers just keep coming back and ordering pizza and posting links on their profiles to their pizza-rific creations!
The logical next step is to employ a different technique called table sharding to split up a single table across multiple servers. There are a lot of sharding algorithms out there, most of them don't allow for the number of shards to be increased later on (such as shardId = uniqueKey mod shardCount), so make an effort to pick a good number of shards to start with. We decide to split the Pizza table into 4 shards on 4 different servers. The next step is moving all of the data from one table into 4 different tables, and modifying the code so that it can determine which shard it needs to access. Some of this process can be alleviated if the RDBMS system can handle sharding internally, but most likely, these changes are going to require code changes and an import script.
Once your data is migrated and your code is modified once again, the system is able to grow a bit further.
Solution 5
Database:
Server 1
Topping: (id, name, price)
Crust: (id, name, price)
Sauce: (id, name, price)
PizzaLocation: (id, name, priceMultiplier)
Server 2
PizzaXTopping: (pizzaId, toppingId, pizzaLocationId)
Server 3
Customer: (id, name, address, phone)
Server 4
Order: (id, customerId, delivery, fulfilled, date)
OrderXPizza: (orderId, pizzaId)
Server 5
Pizza_1: (id, name, crustId, sauceId)
Server 6
Pizza_2: (id, name, crustId, sauceId)
Server 7
Pizza_3: (id, name, crustId, sauceId)
Server 8
Pizza_4: (id, name, crustId, sauceId)
Code Change:
allToppings = SELECT * FROM Topping;
allCrusts = SELECT * FROM Crust;
allSauces = SELECT * FROM Sauce;
allPizzaLocations = SELECT * FROM PizzaLocation;
function getToppingById (id) {
foreach topping in allToppings {
if topping.id == id { return topping }
}
}
function getCrustById() { ... }
function getSauceById() { ... }
function getPizzaLocationById { ... }
function getPizzaById(id) {
shard = id % 4;
serverId = shard + 4;
tableName = "Pizza_" + shard;
pizza = on( serverId, SELECT * FROM [tableName] WHERE id = [id] );
return pizza;
}
order = SELECT * FROM Order WHERE delivery=true AND fulfilled=false
LIMIT 1 ORDER BY date;
orderCustomer = SELECT * FROM Customer WHERE id=[order.customerId];
pizzaIds = SELECT pizzaId FROM OrderXPizza WHERE orderId = [order.id];
pizzas = array;
foreach pizzaId in pizzaIds {
pizzas += getPizzaById( pizzaId );
}
foreach pizza in pizzas {
pizza.displayInfo = getCrustById(pizza.crustId) + getSauceById(pizza.sauceId);
pizza.toppings = SELECT * FROM PizzaXTopping WHERE pizzaId=[pizza.id];
foreach topping in pizza.toppings {
topping.displayInfo = getToppingById(topping.id) + getPizzaLocationById(topping.pizzaLocationId);
}
}
Analysis
This approach is essentially how RDBMS systems scale horizontally (ignoring the shortcomings of the code above). There are also methods of clustering, which essentially just abstract out how the sharding is done behind the scenes. Replicas are another weapon in the RDBMS arsenal, but often only serve to increase code complexity and generate race conditions when dealing with huge amounts of interdependent user generated content. In this new web 2.0 frontier, telling Jimmy that "Jesse just bought 2 large cheese pizza's on mypizzaorderingsite.com!" via some social network and giving the next 10 people that click the link a 5% discount increases sales by an astronomical figure. With millions of transactions per hour, speed is no longer a technical requirement, it is a necessity for business survival.
The above solution (focusing on the database) combined with some caching and code ninjutsu can take an application far beyond where most applications will need to go, but it is not without issues. As you can see, the further we go with these examples, the more the focus moves to the code and away from the database.
Next time, we'll look at solutions using non-relational databases.
The Pizza Question
Interviewing a candidate for a Software Engineering position is never easy. I've been doing it for years, and I've never been very comfortable with it. You're attempting to derive an impossible amount of information from this person in a very short amount of time. I've taken advice from a few people and articles over the years, learned a few techniques, and I have to say that looking for "complete voids" in specific areas has been a successful tactic for at least eliminating the people who definitely wouldn't fit.
I'm going to elaborate on one of my most successful, and painful, questions that I commonly used in interviews, because I'll continue to use this model during my later scalability articles. Nothing really makes someone as uncomfortable as asking them an open-ended question and giving them a marker and a white-board to draw on. It may seem cruel, but how someone reacts under pressure in time-sensitive situations is a pretty good indicator at how they'll hold up in a critical position.
The question and the solutions are important for the upcoming articles on scalability and I will continue to present evolving solutions as new issues are discussed.
The Question
The goal is fairly simple, we want to construct an ordering system for a pizza restaurant. Good design practices dictate that we make the design flexible enough so that we don't want to have to come back and do maintenance on this system every time the customer wants a new type of pizza (contrary to common business practices). We need to represent a few things with the data: customers, orders, and variations of pizzas.
Solution 1
Let's take a look at a simple solution to this problem:
Customer: (id, name, address, phoneNumber)
Pizza: (id, name, price)
Order: (id, customerId->Customer.id, pizzaId->Pizza.id , delivery)
This gives us a very simple solution to the problem with 3 very simple tables. Each of the tables has an "id" column, which for our purposes is always going to be a unique numeric identifier generated by the database. There are a couple foreign keys used to ensure that rows in the Order table are indeed correct. Let's say I wanted to order a pepperoni pizza and dine-in, the rows in these tables might look something like this:
Pizza: (1, cheese, 5.99), (2, pepperoni, 6.99)
Customer: (1, Kurt, "555 Nowhere St.", "555-123-4567")
Order: (1, 1 [Kurt], 2 [pepperoni], false)
There are some glaring deficiencies with this, but hang in there, I have some points. Let's look at the good points first. It's extremely easy to extrapolate meaning from this data. For example: if I wanted to list all orders for pepperoni pizzas, it's one SQL query with no joins between tables (SELECT * FROM Order WHERE pizzaId=2). The data is also pretty self-explanatory. We all know what a Pizza, a Customer, and Orders are, so it's pretty easy to figure out what each row in the table represents.
Let's take a look at the technical shortcomings of this solution. First, a new row has to be added to the Pizza table for every variation of pizza that could ever be needed. This can turn into a nightmare if you want to have something like half-pepperoni, half-olive, and everything in between. Second, each person can only order one pizza at a time, which is just a bit weird.
Solution 2
So now, let's look at a more-complete solution to this problem with the following tables:
Topping: (id, name, price)
Crust: (id, name, price)
Sauce: (id, name, price)
Pizza: (id, name, crustId->Crust.id, sauceId->Sauce.id)
PizzaLocation: (id, name, priceMultiplier)
PizzaXTopping: (pizzaId->Pizza.id, toppingId->Topping.id, pizzaLocationId->PizzaLocation.id)
Customer: (id, name, address, phone)
Order: (id, customerId->Customer.id, delivery, fulfilled, date)
OrderXPizza: (orderId->Order.id, pizzaId->Pizza.id)
I never said it was perfect, there will be more solutions later. This is a much more complicated solution to the problem than the first, but it's much more flexible, and probably a lot closer to what a production solution would look like. It's somewhere within the realm of a BCNF normalization, and it breaks the data down into seemingly atomic representation of the needed pieces. If I wanted to order two pizzas, one all pepperoni, one half black olives and half sausage, this is what the data might look like:
Topping: (1, pepperoni, 1.00), (2, "black olives", 1.00), (3, sausage, 2.00)
Crust: (1, "hand-tossed 12-inch", 4.00)
Sauce: (1, red, 0.00), (2, alfredo, 0.50), (3, bbq, 1.00)
PizzaLocation: (1, full, 1.0), (2, "left half", 0.5), (3, "right half", 0.5)
Pizza: (25, "kurt's first pizza", 1, 1), (26, "kurt's second pizza", 1, 1)
PizzaXTopping: (25, 1, 1), (26, 2, 3), (26, 3, 2)
Customer: (1, Kurt, "555 Nowhere St.", "555-123-4567")
Order: (301, 1, false, true, 2010-05-01)
OrderXPizza: (301, 25), (301, 26)
The above data definitely provides a solution to the order. It's flexible enough to match more use-cases than the previous solution, but it's also more complicated. In addition, the rows themselves become much more difficult to interpretate directly, such as the row (301, 25), which is pretty meaningless without the context of the related rows from other tables.
Application Context for Solution 2
Let's look at this from the application's point of view. The application needs to read all unfulfilled delivery orders, process them, and mark the "fulfilled" flag on each order. For the sake of argument, we'll say that the processing is all done computationally, maybe this pizza parlor has some awesome new machine that builds the pizzas based on instructions in a new Perl extension called the PCML (pizza construction modeling language). Here's what the pseudo-code for the application might look like:
- order = select an order where delivery=true and fulfilled=false
- pizzas = select all pizzas and their basic information from order
- for each pizza in pizzas, select all toppings and their locations
- convert pizza definition to PCML and send to machine
- print delivery instructions and bill with customer address for delivery dude
The first part is easy, we need the an order and the customer information for that order:
SELECT Order.*, Customer.* FROM
Order, Customer
WHERE Order.delivery=true AND Order.fulfilled=false AND Customer.id=Order.customerId
LIMIT 1 ORDER BY Order.date
Which will give us something like this:
(Order.id:1234, Order.customerId:1, Order.delivery:true, Order.fulfilled:false, Order.date:2010-05-30, Customer.id:1, Customer.name:"Kurt", Customer.address:"555 Nowhere St.", Customer.phoneNumber:"555-123-4567")
Now the next part gets a little more difficult:
SELECT Pizza.*, Crust.*, Sauce.* FROM
Pizza, Crust, Sauce, OrderXPizza
WHERE
OrderXPizza.orderId=1234, OrderXPizza.pizzaID=Pizza.id, Crust.id=Pizza.crustId, Sauce.id=Pizza.sauceId
This will give us a weird looking row that's a combination of the Pizza, Crust, and Sauce tables:
(Pizza.id: 512, Pizza.name:"Joe's pizza", Pizza.crustId:1, Pizza.sauceId:1, Crust.id:1, Crust.name:"hand-tossed 12-inch", Crust.price:4.00, Sauce.id=1, Sauce.name="red", Sauce.price=0.00)
That gives us most of the information about this pizza, but it's still missing the toppings and their locations, which adds another step:
SELECT Topping.* PizzaLocation.* FROM
Topping, PizzaLocation, PizzaXTopping
WHERE PizzaXTopping.pizzaId=512, Topping.id=PizzaXTopping.toppingId, PizzaLocation.id=PizzaXTopping.pizzaLocationId
This gives us one or more rows like the following:
(Topping.id:1, Topping.name:"pepperoni", Topping.price: 1.00, PizzaLocation.id:1, PizzaLocation.name="full", PizzaLocation.priceMultiplier:1)
Now we have all the data, so we can construct the PCML and pass it off to the machine. We also have all information required to print the delivery instructions and the receipt.
Using the Question
This is how I use this question: After presenting the question to the applicant, I'll ask them to flesh out the database tables on the white board, and define the relationships between them. I'll offer advice when they ask for it, or when they don't seem to be able to put their thoughts into words. After the applicant is satisfied with their answer, I'll have them write SQL queries against their tables to get the types of information we looked at from the application standpoint. Sometimes I'll throw in a couple questions relating to statistics reporting, like "how many people ordered pepperoni pizzas in the last 2 weeks" or "what are the addresses of people who have ever ordered a pizza with sausage".
Next, I'll ask the applicant to draw a basic class diagram of how he would represent the data in an application. I'll be looking for some sort of inheritance, even if I have to modify the requirements on the fly a bit to make it relevant to their solution. Depending on time, I might have the applicant flesh out a bit more of the class diagram and program functionality.
Analysis
Neither of the solutions here is particularly complete or graceful, and each have a long list of issues (which I will get into in the scalability series). There will be more people than you would think that apply for senior level positions and offer something similar to Solution 1. I've even had people get outright angry at me during questions like this, one of them told me "I don't want to design anything, just hand me the specifications and I'll write the code." I'm not saying that people who offer Solution 1 are wrong, or that the solution is wrong, but they generally have less knowledge about how to design good solutions to real-world problems. Unless you're a large company or are looking for entry-level positions, they're likely not a good fit.
Those who offer something closer to Solution 2 should be able to elaborate on their solution, at least verbally, and should be able to carry on extended conversations with you about why they did things, or how they would change their design if a requirement of the system changed. A solution similar to Solution 2 doesn't imply that this person is the correct person for this job, but it means they at least understand basic solution design and object-relational-modeling, two skills absolutely necessary for success.
Bonus Question
Just because this is another favorite for interviews: the Fibonacci Question. I'll ask the applicant if he remembers the Fibonacci sequence, at least vaguely. If not, just give them the equation
fib(n) = fib(n-1) + fib(n-2)
If the applicant struggles with the equation, just give it to them, memorization isn't the issue. I've had people sit there for 10 minutes who had remembered the numeric sequence, not the formula itself, and refused to let me give them the formula and tried to derive it themselves.
Ask the applicant to write a piece of code (in whatever language the job requires) that takes in the index of the Fibonacci sequence and returns the value of that index (ie: n -> fib(n)).
If the applicant has been in school recently, they'll probably give you the textbook recursive solution. Some will give you the iterative solution. I consider the iterative more difficult, probably because I learned the recursive first. Whichever one they give you, ask them if they can give you the other solution as well. These don't have to be absolutely perfect, but for the recursive version, they should realize they need a base case. If they don't notice they need a base case, they might have a gap in understanding recursion. If they can't create most of the iterative solution after giving you the recursive solution, they will have difficulty dealing with scaling issues later.
Recursive Solution:
function fib (n) {
if ( n <= 1 ) return n;
return fib(n - 1) + fib(n -2 );
}
Iterative Solution:
function fib (n) {
if ( n <= 1 ) return n;
a = 1; b = 1;
for (i = 3; i<=n; i++) {
c = a+b;
a = b;
b = c;
}
return b;
}
The fun part with this question, I've had people get seriously angry about it and question the relevance of the question to the position they are applying for. The short answer is this is what scaling is about... taking well known and accepted polynomial-time solutions and figuring out how to do them in something less-than-polynomial (log n, or even constant).
Scaling
Next up, I'll talk about scaling Solution 2 well beyond what most people will ever see, where the problems occur, and how to attack them.
Of Hammers and Triangles
I'm working on some pretty detailed articles at the moment, for what will be the first group in a series on Scalability Issues. I thought I'd take a break from these real quick and share some common pitfalls that I've seen many times over the years. These seem to be a little less common-knowledge than the several software specific issues, like the moving target problem.
The Golden Hammer
A variation on an american proverb, this rule states:
When you have a Golden Hammer, everything starts to look like a nail.
This commonly occurs when a new piece of technology is discovered, or a novel way of using an existing technology is successfully applied to a problem. Essentially, every problem is reduced to some variation of this one particular success, and not necessarily by a hammer-happy engineer. The solution may work, but it has potential to be clunky, not well planned out, even unstable.
I'll give you a an example. Recently everyone has been going crazy for NoSQL databases as an alternative to relational databases. I've watched as companies moved entirely to NoSQL solutions, only to move back to RDBMS several months later, after countless hours of work, when all that was really needed was to use this "new" technology to augment the existing system.
The Engineering Triangle
Also known as the "Designer's Triangle", the "Producer's Triangle," etc. This one's pretty simple:
Pick any two.
As a manager, you get to choose any one side of the triangle. The one you choose could determine the long-term outcome of this project. I'll let you figure out which side is the industry standard.
It seems like an an easy guide to follow, there are two correct choices, and one seemingly bad one. The problems usually stem from poor planning from the top end of the company and a fast-paced market. It starts off as one "just get it done" decision that results in a success, and cascades into a company philosophy. Like the Golden Hammer, these rash decisions usually come back to cost more in time wasted and revenue lost than if a few hours, days, or even weeks were spent looking for a proper solution.
Just for proper credit, I didn't learn either of these in some engineering class. I was introduced to these terms by Adam Carnine and Jim McCullough when we were working together.
PacketPal working with SharpPcap 3.1
PacketPal is now working with SharpPcap 3.1. The build was completed using VS2008, but I'll have a Mono build available shortly, as there shouldn't be much, if anything, that needs to be modified between the two build environments.
You can find a .zip file with the built application here, or use subversion to check out the sources here.
Now it's time to work on the really fun projects.
PacketPal update and the Scale of Scaling
So the new version of SharpPcap tries to enumerate all of the defined values for all of the odd fields in different packet structures. For a couple days, I fought with this, trying to get PacketPal to work with these new features. I finally realized that doing these enumerations completely defeats the purpose of PacketPal, which is to let people experiment with the undefined regions of these network protocols. I've got the lib, main forms, Ethernet editor, and ARP editor building and working in VS208 (not checked into svn yet because everything else is broken). After I finish the rest, I'll work on getting everything to build in Mono.
On to scalability discussion: Recently this article about scaling Reddit has been making its way around. There's a bit of good information to glean from the article, but I thought I'd throw out some numbers for comparison. The article talks about scaling Reddit to 270 million pageviews a month. If we say there's 30 days in the average month, that's 9 million pages per day, 370,000 pages per hour. Island Paradise, at its peak of popularity had over 2 million unique users per day, and well over 10 million pages per hour for the peak hours of the day. So for those of you familiar with Reddit, but unfamiliar with the scale of a social game on Facebook, hopefully this will give you some perspective.
Getting started
So I'm starting this off by releasing a lot of stuff that's been sitting dormant for a while on my computer. I'm currently updating my master's project from 2007, Packet Pal, to work with the current versions of libpcap and SharpPcap, as well as running on Mono instead of just .NET. I should have this done in a week or so. It's currently sending and receiving packets with the new libraries, but I need to update the editors. I'm hoping to get some interest in this project, since it is just as much of a challenge now to keep working as it was when I built it, and I still believe it to be relevant and useful.
In other project news, I've recently located and released the much-joked-about BeFunge JavaScript Interpreter, which was related to the other experiments I ran using the BrainFsck JavaScript Interpreter and Compiler. Both of these JavaScript projects are probably not that exciting, just some fun stuff I was playing around with. They did reveal some interesting problems with some larger JavaScript frameworks, such as Ext-JS, when I originally built them, so they may come in useful again one day.
Finally, I've got some big framework news coming up shortly!

