Skip to content

General knowledge

Created: 2015-08-07 17:23:01 -0700 Modified: 2023-08-07 14:15:16 -0700

These are general notes that aren’t specific to any particular database.

  • In general, if you’re looking to map specific concepts from relational databases, then you’re not going to see the value in NoSQL databases. They’re only comparable at a high level (reading data, writing data, etc.).
  • I don’t think an ORM is generally helpful for this, especially since there’s no concept of a schema, so all of the objects mapped from the database may have random properties or varying types.

In a production environment, here are things you will want out of your database or from your provider of the database-as-a-service:

  • ACID (see below). Note that you may only want consistency in some cases (for example, if I were Facebook, I would only want the amount of “likes” on a photo to be eventually consistent, but I want your password to be consistent).
  • Scalability
  • Security
  • Backups

Amazon supports a whole bunch of databases with Amazson RDS. It looks like they are literally just databases from an API level, meaning that sure, they handle storage, back-up, security, etc. for you on top of that, but your code doesn’t need to change.

https://aphyr.com/tags/Jepsen - this guy has done lots of research and work to show flaws in various popular databases.

ACID (https://en.wikipedia.org/wiki/ACID) - properties of a RELIABLE database.

  • Atomicity - transactions are all or nothing; either one part will fail and it will appear as though nothing changed at all, or everything will succeed and then all changes are applied at once (with nothing able to interrupt this set of changes).
  • Consistency - ensures any transaction brings the database from one valid state to another (this relates to constraints, cascades, and triggers).
  • Isolation - this entails that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially.
    • Not terribly sure how this is different from atomicity. Online resources say that it “specifies when and how the changes implemented in an operation become visible to other parallel operations”.
  • Durability - once a transaction is committed, it will remain so, even in the event of power loss, crashes, or errors. I.e. after a transaction is committed, a durable database could suffer from IMMEDIATE power loss and the data would still be in non-volatile memory.

BASE (“basically available, soft-state, eventual consistency”) (https://en.wikipedia.org/wiki/Eventual_consistency):

This is MOSTLY characterized by just “eventual consistency”, which is a property where it’s enough for the database to EVENTUALLY be in a consistent state. It’s okay to give stale data, and it’s okay to give approximate answers. I don’t think this goes against ACID at all.

CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem)

This theorem states that it is impossible for a distributed computer to simultaneously provide all three of the following guarantees:

  • Consistency - all nodes see the same data at the same time. This is NOT the same as the “Consistency” in ACID.
  • Availability - a guarantee that every request receives a response about whether or not it succeeded
  • Partition tolerance - the system continues to operate despite arbitrary partitioning due to network failures
    • Partition tolerance means that if you have 10 machines and #3, #6, and #7 go down, the system can still operate.

Note that the CAP theorem doesn’t say you can’t have all three of those things, just that you can’t have them simultaneously. A “CP” database is something that has consistency and partition tolerance but not availability.

In databases, if you don’t index a table, then attempting to find, say, an image whose URL is “http://foo.com/bar.png” would require searching through each row/document.

I think this ~28-minute video is the most helpful resource I’ve seen on database normalization, so much so that I bumped all of the notes that were in this section to an “Old normalization notes” section and I’m just going to cover the video contents here.

Note that the video apparently does simplify some concepts to the point where minor nuances are lost, so the notes below do, as well.

  • Why normalize at all?
    • It improves data integrity since you won’t be able to store redundant or incomplete data.
  • It’s not hard
    • I originally thought that going through all of the normalization levels was supposed to be a difficult task, but they’re not that hard, and the way you generally fix a violated normal form is to add more tables and move columns around.
  • 1NF (first normal form)
    • Using row order to convey information is not permitted
      • If you need something to be ordered, make sure the data itself can represent that order. E.g. to get a list of the tallest people, you can just store their heights rather than inserting them into the database in order from tallest to shortest (since we can’t rely on row order to convey information).
    • Mixing data types within the same column is not permitted
      • E.g. don’t allow a column to be either an integer or a string—it should be one or the other. In general, databases don’t even let you do this unless you’re storing something like JSON data.
    • Having a table without a primary key is not permitted
      • PKs uniquely identify rows in a table. A PK can be formed of many columns. Without a PK, you may have multiple rows that represent the same information (e.g. a person being stored with two heights).
    • Repeating groups are not permitted
      • E.g. rather than storing a gamer’s inventory as a difficult-to-query string like “3 shields, 5 arrows”, put that inventory into its own entire table with player_id, item_type, and item_quantity, then make the PK include the first two columns.
  • 2NF
    • Each non-key attribute in the table must be dependent on the entire primary key.
      • E.g. if you added a “player_rating” into the inventory table from the last 1NF bullet point, then it would violate 2NF because if doesn’t depend on the entirety of the PK {player_id, item_type}; it only depends on player_id. As such, it should be in its own table with just player_id and player_rating.
  • 3NF
    • Each non-key attribute in the table must depend on the key, the whole key, and nothing but the key.
      • E.g. suppose you add a “player_skill_level” to the table from the last 2NF bullet, and suppose player_rating is derived from player_skill_level. The player_rating no longer depends on the PK (player_id), so it should be moved to its own table whose PK is “player_skill_level”, which represents a mapping from the skill level to the player_rating.
  • Boyce-Codd Normal Form (BCNF)
    • Same as 3NF, but applies to all attributes, not just non-key attributes.
  • 4NF
    • The only kinds of multivalued dependency allowed in a table are multivalued dependencies on the key.
      • (note: it’s probably faster to just watch the video for this, but I wrote it out anyway)
      • E.g. you have houses with that are available in certain colors and styles, e.g. House A is available in yellow or blue, and it’s available in the style of a condo or a duplex. If you had a table with “model”, “color”, and “style” columns all forming the PK, then you would need four rows to represent the combinations. One day, House A becomes available in red, meaning two rows should be added to the table. However, nothing in the database enforces this, so it’s possible that you only end up adding the red condo and not the red duplex. That’s because house A has two multivalued dependencies: color and style. 4NF says that the only way to have these multivalued dependencies is if they rely on the key, which currently contains all three columns.
        • To fix this, you would split the table in two: model_colors and model_styles. The first would have three rows: [A, yellow], [A, blue], [A, red]. The second would have two rows: [A, condo], [A, duplex]. Joining them would give you six results as you expect, so now you can add/remove colors and styles without any inconsistencies.
  • 5NF
    • It must not be possible to describe the table as being the logical result of joining some other tables together.
      • (it’s faster to just watch the video for the example here)

The table from the Wikipedia reference link is pretty helpful.

I ran into an issue where I wanted to store purchases from different platforms, e.g. Android, iOS, Steam, etc. There were certain common properties like “created_at”, “user_id”, and “cost”, but then there were platform-specific properties. For example, Android gets a purchase token, but iOS gets a receipt. These may take on slightly different forms.

While streaming, people suggested several different resources for something like this:

  • They mentioned that this is similar to the entity-attribute-value model (EAV).
  • This may be a decent time to use a document-storage database like DynamoDB or a JSON blob in MySQL.
  • 6NF was suggested where each table just contains a PK and one other attribute, that way JOINs are trivial.

What I ended up going with based on my specific needs was to have many separate tables: purchases_common, purchases_android, purchases_ios, etc.

TL;DR: you should do 1 query to return 100 items instead of 100 queries to return 1 item each

If you ever need to store large amounts of data (e.g. replay journals in Bot Land), consider storing them as files on S3 instead since they can stay around practically forever that way. I could hash the replay information to form the file name (for easy lookup) or I could just store the file URL in a database for look-up that way.

In general, avoid storing incredibly large objects in a database.

12/14/2015

I asked these while I was streaming on Twitch and people answered me.

  • In async replication, what happens if you read from the slave?
    • JonathanDelgado says: you will get the data, it may be behind from the master if it is async, but sync will always return the correct data
    • So if you have synchronous replication or you don’t care about consistency, then you should spread out your reads, but all writes would go to the master.
  • Should I have async or sync replication?
    • If you want data to always be consistent, sync, if you want speed, async
  • If I go with master-master (where both nodes can accept writes), do I need to load-balance, or does the arbitrator do that?
    • No - you balance reads and writes on the actual NodeJS server, a load-balancer would introduce a single point of failure, so that’s bad
    • When you DO have a variable number of master databases (even just two), you should use a JSON file that the NodeJS servers can read from so that they know the addresses of the databases to be targeted.
  • Does master-master scale horizontally? Can I have 3 or 4 master nodes?
    • Wasabi says yes. They are supporting tens of thousands of users right now and over a billion HTTP requests where most result in SQL execution.
  • If the arbitrator node goes down in a master-master configuration, how is that not a single point of failure?
    • If you start getting large, you add in more arbitrators. Jonathan has one in each datacenter.

EveryJuan: So brief overview on Percona XtraDB Cluster, it’s a write on top of Galera (which is multi-master cluster software based on synchronous replication) for MySQL and replaces the entire NDB system with a multi-master replication. They have stuff like STT for node-catchup and stuff as well as Percona-tools

http://www.hammerdb.com/ can be used for database load-testing.