Programming Google App Engine

The Datastore Philosophy

The chapter on datastore queries is now mostly complete and available to Rough Cuts subscribers. The draft posted this morning adds diagrams for the chapter. I'm happy to say that the final edition will feature diagrams by O'Reilly's professional illustrators, and not these temporary images. I'm interested in feedback on the temporary diagrams, as they will be the sketches the illustrators will use as the basis for the final versions.

The queries chapter is the first of several chapters that seek to impart an intuitive understanding of App Engine's philosophy of scalable data. Most of us—myself included—have developed data modeling habits from working with single-server relational databases, many of which predate the web and its demand for immediate answers. Using the App Engine datastore requires a different way of thinking about data to fit its scalable design. But with the right tools, working with the datastore is about as easy as what we're used to for web applications, and the datastore takes care of the scaling for us.

The App Engine datastore is designed to query and retrieve data very quickly, regardless of how much data is stored. Unlike traditional databases, which plan and execute a query across the raw data when the query is made, App Engine calculates in advance the results for every query the application is going to make. When the app performs a query, App Engine simply finds the results in the table of answers (the index) for that query. Defining indexes for queries is a semi-automatic process: you generate the index configuration by running your app in the SDK's development web server, then upload the configuration with your app code.
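To make the "table of answers" idea concrete, here is a minimal in-memory sketch in plain Python. It is not App Engine's implementation or API: the class name PropertyIndex and its methods are hypothetical, and the sketch assumes a single indexed property with string keys. It only illustrates the trade described above, where the work of sorting happens at write time so that a query becomes a lookup followed by a short scan of consecutive rows.

```python
import bisect


class PropertyIndex:
    """Toy index (hypothetical): sorted (value, key) rows, maintained on every write."""

    def __init__(self):
        self._rows = []    # sorted list of (value, key) pairs
        self._values = {}  # key -> value currently in the index

    def put(self, key, value):
        # Drop the stale row for this key, if there is one.
        if key in self._values:
            old_row = (self._values[key], key)
            del self._rows[bisect.bisect_left(self._rows, old_row)]
        # Insert the new row at its sorted position.
        bisect.insort(self._rows, (value, key))
        self._values[key] = value

    def query_range(self, low, high):
        """Return the keys whose value lies in [low, high], in index order."""
        # (low,) sorts before every (low, key) row, so this finds the first match.
        i = bisect.bisect_left(self._rows, (low,))
        keys = []
        while i < len(self._rows) and self._rows[i][0] <= high:
            keys.append(self._rows[i][1])
            i += 1
        return keys
```

In this sketch the cost of a query depends on the number of results rather than the number of stored rows, which is the property the paragraph above attributes to the datastore's indexes.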

Many App Engine data query problems are best solved by taking a page from the datastore's book: calculate and store the answer to the question when the data is written. The query engine alone can't tell you the sum of the numbers stored in a property over thousands of entities, and there's no time during a request for an app to retrieve all of the entities and add up the values. But if the app updates a stored sum each time one of the summed values is updated, the value will be readily and instantly available when needed.
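The stored-sum idea can be sketched in a few lines of plain Python. This is an in-memory illustration only, not datastore code: the ScoreStore class and its method names are hypothetical, and it sidesteps the question of keeping the entity and the sum consistent with each other when several writers are involved.

```python
class ScoreStore:
    """Hypothetical store that keeps a running total alongside the values."""

    def __init__(self):
        self._scores = {}  # key -> score
        self.total = 0     # precomputed answer to "what is the sum?"

    def put_score(self, key, score):
        # Adjust the stored sum by the difference between new and old values.
        old = self._scores.get(key, 0)
        self._scores[key] = score
        self.total += score - old

    def delete_score(self, key):
        # Removing a value subtracts it from the stored sum.
        self.total -= self._scores.pop(key, 0)


store = ScoreStore()
store.put_score("alice", 40)
store.put_score("bob", 25)
store.put_score("alice", 50)  # an update changes the sum by +10
```

Reading store.total never touches the individual scores, so the read costs the same whether there are two values or two million. The price is the extra bookkeeping on every write.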

Most of us learned not to represent a value in more than one place in a database. A layout that represents each datum once, a "normalized" layout, is easy to understand, easy to maintain, and fast to update, and it uses space efficiently. Because answers to questions about the data are calculated when the questions are asked, all answers are up to date. But such queries are not necessarily fast, especially with very large data sets, and they become problematic once the data set outgrows the first database server.

For some data scaling problems, the solution is to "denormalize" the data model: to store and maintain answers to (known) questions about the data before the answers are needed. This comes at a cost in update performance and code complexity, but it's worth it if read performance and scaling are what's important, as they usually are with web applications. Database indexes are not typically considered denormalization because the database maintains them without making demands on the application's code. In the case of App Engine, indexes require only the configuration file. Other kinds of denormalization across entities can be managed with the Python API's object interface, providing a reasonably consistent view of the data to the rest of your application.
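As a rough illustration of hiding denormalization behind an object interface, the following plain-Python sketch copies an author's name onto each post so that displaying a post needs no second lookup. The Author and Post classes are hypothetical and do not use the App Engine Python API. The point is only that the duplication is maintained in one place, a property setter, so the rest of the application sees a consistent view.

```python
class Author:
    """Hypothetical entity whose name is duplicated onto its posts."""

    def __init__(self, name):
        self._name = name
        self.posts = []

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, new_name):
        # Writing the name also refreshes every denormalized copy.
        self._name = new_name
        for post in self.posts:
            post.author_name = new_name


class Post:
    """Hypothetical entity that stores a copy of its author's name."""

    def __init__(self, author, title):
        self.title = title
        self.author_name = author.name  # denormalized copy, read without a join
        author.posts.append(self)


ann = Author("Ann")
first = Post(ann, "Hello")
second = Post(ann, "Indexes")
ann.name = "Anne"  # one slower write keeps both copies current
```

The write to ann.name does more work than it would in a normalized layout, which is the update cost mentioned above. In exchange, each post can be read on its own.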

Understanding how the datastore queries and retrieves data is the first step toward designing a data model that makes the most of the App Engine datastore. The next chapter will focus on how the datastore writes data: performance, consistency guarantees, and how to update multiple entities in a single transaction. Chapters following that will explore the challenges of maintaining sophisticated data models in a schemaless datastore, the tools App Engine includes to make it easy, and techniques for managing denormalized schemas.