Photo by Caspar Camille Rubin on Unsplash
SQL and Data Modeling for Web
18 min read
Table of contents
- I. Interacting with Databases
- II. SQLAlchemy Basics
- III. SQLAlchemy ORM in Depth
- IV. Migrations
- V. CRUD operation and Model View Controller (MVC)
This is a summary of what I learned in Full Stack Web Developer (Nanodegree) on Udacity. This article is about SQL and Data Modeling for Web
I. Interacting with Databases
Here are the three core topics we'll be covering in this article:
1.1. Interacting with a (remote) database:
Backend developers need to interact with databases on a regular basis in order to manipulate and maintain the models behind their web applications.
In working with databases, we'll need to use a Database Management System (DBMS)
A Database Management System (DBMS) is simply software that allows you to interact with a database (access or modify the data in that database).
There are many different Database Management Systems out there, but the particular DBMS we'll be using is called PostgreSQL (or simply Postgres).
1.2. Database Application Programming Interfaces (DBAPIs)
Once we've looked at the basics of interacting with a database, we'll need to understand how to interact with that database from another language or web server framework (such as Python, NodeJS, Ruby on Rails, etc.). This is where DBAPIs come in.
Finally, we'll get some experience working with the widely used psycopg2 library, which will allow us to interact with a database from Python.
psycopg2 is a database adapter that allows us to interact with a Postgres database from Python.
There is a lot going on between all of the components. The diagram below can give you a feel for how it is all connected
2. Relational Databases
- A database is a collection of data
- A database system is a system for storing collections of data in some organized way.
- This is a property that is known as Persistence (allowing access later, after it was created)
- Shared source of truth accessible by many users
- Ability to store many types of data (efficiently)
- Concurrency control (handling multiple DB actions at once)
- All data is stored in tables
- Every table is characterized by a list of columns with data types per column, and its set of data (organized in rows)
- Comes with rules for enforcing data integrity, such as constraints and triggers.
3. Primary keys & Foreign Keys
- The primary key is the unique identifier for the entire row, referring to one or more columns.
- If there are more multiple columns for the primary key, then the set of primary key columns is known as a composite key.
- A primary key in another (foreign) table.
- Foreign keys are used to map relationships between tables.
4. Client-Server Model
In order to build database-backed web applications, we first need to understand how servers, clients, and databases interact.
A major part of this is the client-server model, so let's look at that first. The basic idea is very simple, and looks something like this:
A server is a centralized program that communicates over a network (such as the Internet) to serve clients.
And a client is a program (like the web browser on your computer) that can request data from a server.
When you go to a web page in your browser, your browser (the client) makes a request to the server—which then returns the data for that page
Relational database systems follow a client-server model:
Servers, Clients, Hosts
- In a Client-Server Model, a server serves many clients.
- Servers and clients are programs that run on hosts.
- Hosts are computers connected over a network (like the internet!).
Requests and Responses
- A client sends a request to the server
- The server's job is to fulfill the request with a response it sends back to the client.
- Requests and responses are served via a communication protocol, which sets up the expectations and rules for how the communication occurs between servers and clients.
Relational Database Clients
- A database client is any program that sends requests to a database
- In some cases, the database client is a webserver! When your browser makes a request, the webserver acts as a server (fulfilling that request), but when the webserver requests data from the database, it is acting as a client to that database and the database is the server (because it is fulfilling the request).
Basically, we call things clients when they are making a request and servers when they are fulfilling a request. Since a web server can do both, it sometimes acts as a server and sometimes acts as a client.
In this section, we'll look at the suite of communication protocols that are used to transfer data over the Internet. These communication protocols are most often referred to as TCP/IP, which is an abbreviation that refers to the two main protocols involved—Transmission Control Protocol (TCP) and Internet Protocol (IP).
Takeaways TCP/IP is a suite of communication protocols that are used to connect devices and transfer data over the Internet.
- IP addresses: An IP address identifies the location of a computer on a network.
- Ports: A port is a location on the recipient computer, where data is received.
While an IP address tells you where to find a particular computer, it doesn't tell you specifically where on that computer a particular connection should be made—that's what port numbers are for.
Some port numbers you should know:
- Port 80: The port number most commonly used for HTTP requests. For example, when a client makes a request to a web server, this request is usually sent through port 80.
- Port 5432: The port number used by most database systems; default port for Postgres.
6. Connections and Sessions in TCP/IP
- TCP/IP is connection-based, meaning all communications between parties are arranged over a connection. A connection is established before any data transmission begins.
Over TCP/IP, we'll always need to establish a connection between clients and servers in order to enable communications. Moreover:
Deliveries over the connection are error-checked: if packets arrive damaged or lost, then they are resent (known as retransmission).
Connecting starts a session. Ending the connection ends the session.
- In a database session, many transactions can occur during a given session. Each transaction does work to commit changes to the database (updating, inserting, or deleting records).
Aside: the UDP Protocol The internet also offers the UDP protocol. UDP stands for User Datagram Protocol. UDP is much simpler than TCP: hosts on the network send data (in units called datagrams) without any connections needing to be established.
TCP vs UDP If TCP is like building roads between houses before sending packages between them, then UDP is much like sending over a carrier pigeon from house to house in order to deliver messages: you don't have much control over the pigeon once it flies away -- will the pigeon will fly the right direction, drop your package into the dirt, or encounter attacking bald-eagle during flight. On the other hand, USD has less overhead than TCP / building a highway.
When speed is more important than reliability, especially when applications need to stream very small amounts of information quickly (smaller packages of information mean fewer issues with reliability), then UDP is preferred. A lot of real-time streaming applications, (e.g. live TV streaming, Voice over IP (VoIP)) prefer UDP over TCP. Since UDP does not need to retransmit lost datagrams, nor does it do any connection setup, there are fewer delays over UDP than TCP. TCP's continuous connection is more reliable but has more latency.
- Databases are interacted using client-server interactions, over a network
- Postgres uses TCP/IP to be interacted with, which is connection-based
- We interact with databases like Postgres during sessions
- Sessions have transactions that commit work to the database.
Transactions capture logical bundles of work. Work is bundled into transactions so that in case of system failures, data in your database is still kept in a valid state (by rolling back the entire transaction if any part of it fails). To ensure a database is consistent before and after work is done to it, databases use atomic transactions, and actions like commits and rollbacks to handle failures appropriately. Transactions are, in other words, ACID.
Let's deep dive into what Postgres is
- Postgres is an open-source, general-purpose, and object-relational database management system, considered by many to be the most advanced open-source database system available. It's a relational database system extended with object-oriented features, that works across operating systems.
Object-relational support includes support for arrays (multiple values in a single column), and inheritance (child-parent relationships between tables).
- Built in 1977, it is lauded for being highly stable, requiring minimal effort to maintain compared to other systems.
- Widely used, everywhere: by Apple, Cisco, Etsy, Microsoft, Yahoo, Reddit, Instagram, Uber,... and many others.
- Comprehensive support for the SQL standard.
- Transaction-based: operations on the database are done through atomic transactions.
- Has multi-version concurrency control, avoiding unnecessary locking when multiple writes are happening to the database at once (avoiding waiting times for access to the database)
Postgres lets you have several databases available for reading from and writing to, at once.
- Offers great performance and many indexing capabilities for optimizing query performance
- PostgreSQL is also often just called Postgres.
9. DBAPIs and psycopg2
- provides a standard interface for one programming language (like Python) to talk to a relational database server.
- Is a low-level library for writing SQL statements that connect to a database
- is also known as database adapters
Different DBAPIs exist for every server framework or language + database system
Database adapters define a standard for using a database (with SQL) and using the results of database queries as input data in the given language.
Examples across languages and server frameworks
- For Ruby (e.g. for Sinatra, Ruby on Rails): pg
- For NodeJS: node-postgres
- For Python (e.g. for Flask, Django): pyscopg2
II. SQLAlchemy Basics
Takeaways + A Note on ORMS
- SQLAlchemy is the most popular open-source library for working with relational databases from Python.
- It is one type of ORM library, AKA an Object-Relational Mapping library, which provides an interface for using object-oriented programming to interact with a database.
- Features function-based query construction: allows SQL clauses to be built via Python functions and expressions.
- Avoid writing raw SQL. It generates SQL and Python code for you to access tables, which leads to less database-related overhead in terms of the volume of code you need to write overall to interact with your models.
- Moreover, you can avoid sending SQL to the database on every call. The SQLAlchemy ORM library features automatic caching, caching collections, and references between objects once initially loaded.
The diagram below can give you insight into where SQL Alchemy and ORM are in the big picture.
2. Layers of Abstraction
SQLAlchemy is broken down into various layers of abstraction. Let's go through each layer in depth.
Without SQLAlchemy, we'd only use a DBAPI to establish connections and execute SQL statements. Simple, but not scalable as complexity grows.
SQLAlchemy offers several layers of abstraction and convenient tools for interacting with a database.
SQLAlchemy vs psycopg2
- SQLAlchemy generates SQL statements
- psycopg2 directly sends SQL statements to the database.
- SQLAlchemy depends on psycopg2 or other database drivers to communicate with the database, under the hood.
SQLALchemy lets you traverse through all 3 layers of abstraction to interact with your database.
- Can stay on the ORM level
- Can dive into database operations to run customized SQL code specific to the database, on the Expressions level.
- Can write raw SQL to execute, when needed, on the Engine level: Can more simply use psycopg2 in this case
We'll go over every layer of abstraction in SQLAlchemy and what they offer.
Layers of SQLAlchemy
- The Dialect
- The Connection Pool
- The Engine
- SQL Expressions
- SQLAlchemy ORM (optional)
3. The Dialect
When we're using SQLAlchemy, we can forget generally speaking about the database system that we're using, allowing us to use SQL lite or Postgres or generally switch out the database system whenever we need to. The reason why we can do this is because of the dialects
Dialects allow the flavor of SQL that we're using to get abstracted away from us because the dialect's layer controls the quirks and flavor of the specific database system that we're using.
The dialect is the system SQLAlchemy uses to communicate with various types of DBAPI implementations and databases. The sections that follow contain reference documentation and notes specific to the usage of each backend, as well as notes for the various DBAPIs.
All dialects require that an appropriate DBAPI driver is installed.
4. The connection Pool
With a connection pool, the opening and closing of connections and which connection you are using when you're executing statements within a session are completely abstracted away from you.
As a result of having a connection pool:
- Connections are easily reused after they are started. This avoids the problem of continually opening and closing connections every time we want to make data changes to our database.
- A connection pool also easily handles dropped connections for us, for example, when we have network issues.
- It also helps us avoid doing very many small calls to the database when we're continually assigning changes to the database, which can be very slow.
A connection pool is a technique used to maintain long-running connections in memory for efficient re-use, as well as management for the total number of simultaneous connections an application might use.
5. The Engine
- 1 of 3 main layers for how you may choose to interact with the database.
- Is the lowest level layer of interacting with the database, and is much like using the DBAPI directly. Very similar to using psycopg2, managing a connection directly.
- The Engine in SQLAlchemy refers to both itself, the Dialect and the Connection Pool, which all work together to interface with our database.
- A connection pool gets automatically created when we create an SQLAlchemy engine.
6. SQL Expressions
- Instead of sending raw SQL (using the Engine), we can compose python objects to compose SQL expressions, instead.
- SQL Expressions still involves using and knowing SQL to interact with the database.
7. SQLAlchemy ORM
- Lets you compose SQL expressions by mapping python classes of objects to tables in the database
- Is the highest layer of abstraction in SQLALchemy.
- Wraps the SQL Expressions and Engine to work together to interact with the database
- Will be used in this course, so we can know how to use ORM libraries in general.
Moreover, SQLAlchemy is split into two libraries:
- SQLAlchemy Core
- SQLAlchemy ORM (Object Relational Mapping library). SQLALchemy ORM is offered as an optional library, so you don't have to use the ORM in order to use the rest of SQLAlchemy.
- The ORM uses the Core library inside
- The ORM lets you map from the database schema to the application's Python objects +The ORM persists objects into corresponding database tables]
SQLAlchemy Layers of Abstraction Overview (Diagram)
8. SQLAlchemy Data Types
SQLAlchemy has its own data types that we should become familiar with. In SQLAlchemy, there is a one-to-one parity between an SQLAlchemy datatype and the data type that would be understandable in the semantics of the particular database system that you're linking your SQLAlchemy engine to.
- db.integer, that's the integer type for the database system that we're using.
- db.string, where you can optionally pass in a number that represents the maximum length of that string should be. For Postgress in particular, we're able to specify a variable character string, so we can omit the size variable, so that setting db.string with nothing in it, specifies a varchar data field.
- db.text for longer text
- db.DateTime for date time objects
- large binaries for storing large binary data or pickled Python objects.
III. SQLAlchemy ORM in Depth
SQLAlchemy Object Lifecycle
- Within a session, we create transactions every time we want to commit work to the database.
- Proposed changes are not immediately committed to the database and instead, go through stages to allow for undos.
- The ability to undo is allowed via db.session.rollback()
- Transient: an object exists, it was defined....but not attached to a session or database (yet).
- Pending: Some type of action has occurred but we have not yet decided to make it permanent yet. An object was attached to a session. "Undo" becomes available via db.session.rollback(). This means we can still clear any work that has been done so far. An object stays in this state until a flush happens!
- Flushed: Translating actions(pending changes) into SQL commands that are ready to be committed. Nothing happens in the actual database yet. The only thing that can do that is 'commit'
- Committed: manually called for all pending changes to persist to the database permanently.
A flush takes pending changes and translates them into commands ready to be committed. It occurs: when you call Query. Or on db.session.commit()
A commit leads to persisted changes on the database + lets the db.session start with a new transaction.
When a statement has been flushed already, SQLAlchemy knows not to do the work again of translating actions to SQL statements.
- Migrations deal with how we manage modifications to our data schema, over time.
- Mistakes to our database schema are very expensive to make. The entire app can go down, so we want to: quickly roll back changes, and test changes before we make them
- A Migration is a file that keeps track of changes to our database schema (structure of our database). Offers version control on our schema.
Upgrades and rollbacks
- Migrations stack together in order to form the latest version of our database schema
- We can upgrade our database schema by applying migrations
- We can roll back our database schema to a former version by reverting migrations that we applied
Migrations are changes to the database data and schema that are saved into a file. This file can be applied to rollback or apply those changes
- encapsulate a set of changes to our database schema, made over time.
- are uniquely named
- are usually stored as local files in our project repo, e.g. a migrations/ folder
- There should be a 1-1 mapping between the changes made to our database, and the migration files that exist in our migrations/ folder.
- Our migrations files set up the tables for our database.
- All changes made to our DB should exist physically as part of migration files in our repository.
Migration command-line scripts There are generally 3 scripts needed, for
- migrate: creating a migration script template to fill out; generating a migration file based on changes to be made
- upgrade: applying migrations that hadn't been applied yet ("upgrading" our database)
- downgrade: rolling back applied migrations that were problematic ("downgrading" our database)
Migration library for Flask + SQLAlchemy
- Flask-Migrate is our library for migrating changes using SQLAlchemy. It uses a library called Alembic underneath the hood.
Flask-Migrate & Flask-Script
- Flask-Migrate (flask_migrate) is our migration manager for migrating SQLALchemy-based database changes
- Flask-Script (flask_script) lets us run migration scripts we defined, from the terminal
Steps to get migrations going
- Initialize the migration repository structure for storing migrations
- Create a migration script (using Flask-Migrate)
- (Manually) Run the migration script (using Flask-Script)
Why use migrations?
- We do heavy-handed work, creating and recreating the same tables in our database even for minor changes
- We can lose existing data in older tables we dropped
- Auto-detects changes from the old version & new version of the SQLAlchemy models
- Creates a migration script that resolves differences between the old & new versions
- Gives fine-grain control to change existing tables
This is much better, because
- We can keep existing schema structures, only modifying what needs to be modified
- We can keep existing data
- We isolate units of change in migration scripts that we can roll back to a “safe” db state
V. CRUD operation and Model View Controller (MVC)
1. CRUD operation
CRUD in IT stands for Create, Read, Update and Delete - a set of operations that generally constitutes a fundamental structure of a module.
While there can be additional specific operations in a module, the CRUD operations are considered the essential basic operations on any generic module.
For example, consider a user module in general. The general use cases one can think of are, creating a user (C), fetching a user’s details to display them (R), editing/updating a user (U), or deleting a user (D).
2. Model view controller (MVC)
- MVC stands for Model-View-Controller, a common pattern for architecting web applications
- Describes the 3 layers of the application we are developing
- Models manage data and business logic for us. What happens inside models and databases, capturing logical relationships and properties across the web app objects
- Views handle display and representation logic. What the user sees (HTML, CSS, JS from the user's perspective)
- Controllers: routes commands to the models and views, containing control logic. Control how commands are sent to models and views, and how models and views wound up interacting with each other.
- MVC, or Model-View-Controller, is a design and architectural pattern that breaks up an application into three elements: the model(the data), the view(the graphical interface that users see), and the controller(the code-logic behind the app) It is an industry-standard web development framework used heavily throughout the world.
- Ajax allows web pages to be updated fast through asynchronously exchanging small amounts of data back and forth with the server. It allows only a part of a web page to be updated without reloading the entire page.
- CRUD Any database should be able to (C)reate, (R)ead, (U)pdate, and (D)elete data.
- A session represents all interactions with the database and actually implements a “holding zone” for all the data objects that were affected during this time. They can be finalized (made permanent) by committing the changes, or rolling back if unwanted'
- XMLHttpRequest (XHR) objects are used to interact with servers in order to get data from a URL (or page) without having to do an actual full page refresh. Web pages can update just a small part of a page without interrupting what the user is doing. XMLHttpRequest is used heavily in AJAX programming.
- Migrations are code-based strategies that allow you to manipulate the schema or data in a database after it has already been created and has data in it. They are useful for recording changes, as well as providing a way to "roll back" changes. There can be several migration files "stacked" on top of one another in order.