What is NoSQL?
Why and when should you consider using it/them?
How do you use it/them?
Take a look at (a few, of many possible) examples
"A database system is basically just a computerized record-keeping system."
"The database itself can be regarded as a king of electronic filing cabinet; that is, a repository or container for a collection of computerized data files."
(Introduction to Database Systems, 8th Ed, Date)
"a collection of data, usually pertaining to some reasonably well-defined purpose"
"data has to be stored in a way that lends it certain characteristics"
data must have a known format
data should be stored, retrieved, and modified only by a special program (DBMS)
data should be under transaction control
"A place where data goes to die, never to be seen again except by database administrators who will hoard it gleefully, like Smaug's hoard under the mountain." --Anonymous
Typically it is a software system that provides:
Ability to enter, retrieve, edit, and remove data in some record form
Some degrees of guarantees around consistency and verification
Note that this is a pretty vague definition!
Server-hosted network-aware databases
Embedded or client-hosted databases
Source-code control systems (GitHub!)
Files and file systems
Networked relational databases (RDBMS)
Commercial: Oracle, SQLServer, DB2, Informix, ....
Open-Source: MySQL, Postgres, Maria, Apache Derby, ...
Embedded relational databases
Open-Source: SQLite, BerkeleyDB
Commercial: Microsoft Access (format)
"NoSQL"
MongoDB, CouchDB, Firebase
Redis, Cassandra, HBase, Couchbase
ArangoDB, OrientDB, FaunaDB
Data
Information and data are often synonymous, but not always
Data: what is stored in the database
Information: the meaning of that data
Metadata: data about the data (schema)
Integrated: a unification of facts, with any redundancy partly or wholly eliminated
Shared: among different users
Hardware
Most concern is around storage
Secondarily the CPU for processing
Software
Server processes
Reporting processes
Clients
Human users (direct interrogation)
User programs (bespoke programs)
Administration tools
Reporting tools
External level: the way the data is seen by the clients
Programming language data types (int, String, etc)
Conceptual level: the way the data "feels" as a concept
Commonly called "database models" or "data models"
Internal level: the way the data is stored by the system
Highly optimized to support database functionality
Storage Engine
Query Engine
Transport (Distributed System Layer)
Execution Engine
Transaction/Lock Manager
Buffer/Recovery manager
Manages network interaction
Primary means by which interaction with clients occurs
... in non-local/non-embedded databases
Responsible for peer nodes
... in clustered database scenarios
Provides data segregation, consistency, etc
Language parser
Optimizes storage access
Sometimes called a "query plan" or "execution plan"
a sequence of operations, sometimes in bytecode
Hands query plan off to Execution Engine
Executes the query plan
Collects the results
Some may be local operations
Some may be remote operations
Some may be arbitrary non-relational code
Communicates with the Storage Engine
Manages direct-to-longterm-storage interface
Provides a simpler API to clients
Sometimes pluggable
File formats are usually closely mapped
Interacts with other components
Transaction Manager
Lock Manager
Buffer Manager/Cache Manager
Recovery Manager (operation/transaction log)
Memory- vs Disk-based DBMS
Storage block strategies (column- vs row-oriented, others)
Data vs index files/blocks/structures
B-tree disk storage
Buffering, ordering, mutability (concerns/features)
Infrastructure for efficient data management
... that we can mostly treat as a "black box"
... except for how we "see" the data inside the database
Service for keeping data safe
consistent
secure
resilient
record: collection of data elements
index: an optimized (sorted) collection of keys
key: a means of identifying a record
transaction: carrying out an operation against the database
lock: a mechanism designed to constrain concurrent access
query: a request to the database to retrieve some data
"... an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact."
"The Third Manifesto", p11 (Date, Darwen, 1994)
A particular domain schema or collection of entity types
A particular "shape" to domain data models
Influences storage and relationships between entities
Influences performance of reads, writes, etc
Structures: how is the data formatted ("shaped")?
Constraints: what rules are enforced on the data (if any)?
Operations: what can we do with the data (retrieve, manipulate, etc)?
Relational: relations, tuples, and relvars
If you don't know what these are, you don't know your relational theory
strongly-typed, enforced by the database
Objects
capturing the object graphs that appear in O-O systems
strongly-typed, defined by an O-O language (not external schema)
Key-value pairs
CRUD based solely on the primary key; no joins
weakly- or untyped
Documents
collections of named fields holding data (or more collections)
weakly- or untyped
Graphs
capturing not just graph structures, but the "arcs" between nodes
graph-based query API/language
Columns
think tables, then turn your head 90 degrees
we group by columns, not by rows
Hierarchical
single-rooted strictly "one-way" acyclic graphs
generally, these are XML stores
"Network"
predecessor type to RDBMSs, finding some interest again
collection of values with "pointers" to related data
users manually "follow" the pointers in code
Hybrids of all the above ("multi-model")
any attempt to use one model from a different model is problematic
fundamental assumptions of one not present in the other, and vice versa
called an impedance mismatch
significant loss of functionality
attempts to mitigate through tooling
Object-Relational impedance mismatch is only the most obvious
which we attempt to solve through tooling (O/R-Ms)
Object-Hierarchical, Object-Document, Object-Graph, ...
the "object" on the left-hand side is because of our choice of languages
sort of begs the question: what if we had different languages?
a model of data centered around relational algebra
emphasis is on mathematical operations on relations, sets, and tuples
entities can be related to one another via attributes
most commonly expressed as "tables, columns, and rows"
RDBMSs had great features
Efficient storage
Simple retrieval
Guaranteed data stability
But with more data came... more data
We needed to "partition" data
across multiple database servers/storage engines
but we refused to lose the benefits of RDBMS
Enter "Two-Phase Commit Transactions"
And all was well and good, until...
Before, enterprise apps were internal
Known user base, known loads, known scale
This user base was not likely to change without huge warning
But now, the Web!
After, we began to project our enterprise apps out into the Internet
This meant an unknown and unpredictable user base
With that came an unknown and unpredictable load and scale
"Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you've lost."
Werner Vogels, CTO, Amazon
But why did we care about scale?
Repeat after me: "Contention is the enemy of scalability"
Contention began to take down the existing infrastructure
Traditionally-managed RDBMS'es simply couldn't keep up
ACID has its uses... but we found the edge really quickly
CAP Theorem
Consistency: every read receives the most recent write or an error
Availability: every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partitionability: system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
"Pick two"
RDBMS goes for C + A
Mostly owing to RDBMS embracing ACID
ACID: Atomic, Consistent, Idempotent, Durable
But we can choose A+P or C+P by establishing conventions
Most "NoSQL" databases go for A + P
NoSQLs often embrace BASE
BASE: Basically Available, Soft-state, Eventually consistent
But some keep C + A and just offer different "shapes" to data
PACELC: if partition (P) happens, the trade-off is between availability (A) and consistency (C); Else (E), the trade-off is between latency (L) and consistency (C)
Less known, but more accurate depiction of the problem
semi-structured data
loosely/un- typed
data is arranged into "documents"
(typically) query-by-example retrieval facility
documents are (more or less) independent entities
no strong relationships to one another
no strong definitions on content
semi-structured data
typically name/value pairs and arrays
semi-hierarchical (sort of)
documents usually arranged into named collections
think the filing cabinet in a doctor's or lawyer's office
loosely typed means easy refactoring
no strong relationships makes sharding/scaling easy
document concept "clusters" relevant data together
loosely typed means one typo ruins your whole day
data validation is whatever your code provides
this has serious impact in polyglot scenarios
lack of a standard query language hurts migration
but there is little migration story to these dbs anyway
lack of a query language makes complex queries tricky
query-by-example hurts here
no "hard links" between data elements isolates data
no "joins" in SQL terms means potentially multiple round trips
blog (or other CMS)
bank accounts and transaction history
customer payment history
data attributes stored in columns, collected into rows
rows identified by row keys
but the emphasis is on columns, not rows
little to no "relationships" between rows
or between "tables", for that matter
typically very low-level data
binary arrays/buffers, strings, etc
emphasis is on speed of read/write
duplicating data as convenient
yep; the model is about data access, not abstraction
"what do we need?"
"what do we need when we query it with 'this' thing?"
example: Users
we may want to query users by email or username
so we create two "tables": one keyed on email, one on username
both "tables" store the same data
and rely on code to keep them in sync
this often makes Columnar DBs good candidates for caches
we don't define "types", per se
data is instead stored very "clustered"
tightly-related data elements
create the key, add the data elements, store it
updates: add the data elements to a key, store it
focused entirely on the key
little to no queries along alternate axes
pull back that row or column-collection, nothing else
rows can be flexible in composition
tightly-clustered data stored/fetched in one "bucket"
scale, scale, scale
performance
no relationships
little to no query language support
little to no query capability, period
four fundamental atoms:
nodes/vertices
labels (group nodes into sets)
relationships/arcs (between the nodes)
properties (name/value pairs on nodes and relationships)
how we use those depends on the problem
nodes are often "nouns"
arcs often connect nodes as modifiers or qualifiers
flexible
like document- or hierarchical data models
allows for easy(er) refactoring
data model is "whiteboard friendly"
data model matches the whiteboard exactly
captures data about relationships
in other models, this would need to be modeled
data can be associated with the relationship
cyclic relationships trivial to see, model
nodes can have any number of arc connections
concerns over scale
concerns over query performance
nonstandard query language
lack of schema enforcement (beyond nodes/arcs/etc)
Neo4J (https://neo4j.com)
TitanDB (http://thinkaurelius.github.io/titan/)
flexible data storage engines
Infogrid (http://www.infogrid.org)
an object-oriented model, writ to disk
objects hold data
objects inherit
objects refer to one another
often in cyclic or acyclic graphs
instantiate the object
store it into the OODBMS "object set"
commit the object set
update the object
save (either the object or the "object set")
some OODBMSs "remember" the object, others reqire passing it into the object set again
"dirty flags"
automatic persistence of modified objects
query-by-example
"OQL": "SQL for objects"
out of favor after 2000
query-by-native language
Important note: "fetch depth"
how far deep do we traverse?
no impedance mismatch to O-O languages
no need for O/R-Ms, no mapping, none of it
easy refactoring
just refactor the class model
easy inheritance support
no more "table-per-class/table-per-concrete" joins
not prevalent
in fact, often considered "dead"
concerns over scale
lack of tooling
concerns over query optimization
db4o
Versant
GemStone
buckets into which we drop data items
data items are keys that have a set of attributes
keeps related data together accessed by key
no formal structure beyond "key" and associated data (value)
super-fast retrieval (index is the lookup)
opaque values are entirely flexible to the app
(generally) no query capabilities beyond key-based lookup
opaque values are therefore application-specific
Amazon Dynamo
Redis
database implementation that supports more than one data model natively
integrated relationship support
integrated query support
generally easier/more flexible than "polyglot persistence"
most databases are key-value storage engines under the hood
B-tree implementations
with sorted pointer structures for fast query
so really the multimodel support comes in a few places:
indexing targets
query language support
SQLite
MongoDB
Couchbase
Neo4J
".. a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine."
http://www.sqlite.org
Mostly SQL-92 compliant
ACID Transactions supported
Stored as a single file stored locally
Self-contained
Available as a single C file if necessary
Many operating systems already have it
Test first: In a command-line shell, type sqlite
If you get "not found" you need to install it
If you get "SQLite version..." you already have it
Do you have Homebrew?
Yes: brew install sqlite
No: get Homebrew! Seriously!
SQLite Download page: https://www.sqlite.org/download.html
Linuxbrew: brew install sqlite
Debian/Ubuntu: sudo apt install sqlite
RedHat: sudo yum install sqlite -y
Others: It's probably the "sqlite" package
Package Manager
WinGet: doesn't seem to be supported yet?
Chocolatey: choco install sqlite
Scoop: scoop install sqlite
Otherwise https://www.sqlite.org/download.html
Grab zip file, unzip and adjust PATH
Command-line shell: sqlite3
Should open a "transient in-memory database"
Assuming that works:
.help
: Full list of meta-commands
.open
: Connects to a database file
.backup
: Backup database to a new file
.databases
: List all databases in use
.exit
: Quit the SQLite shell
SQLite Studio: https://sqlitestudio.pl/
DBeaver: https://dbeaver.io/
Neo4J: graph-oriented
data model is basically "just" nodes and arcs
graph API (Visitor) for navigation/query
Java implementation
http://www.neo4j.org
Upshot: perfect for graph analysis and storage
Download: https://neo4j.com/download-neo4j-now/
Package managers:
macOS: brew install neo4j
Linux:
Windows:
neo4j-admin
CLI tool
restricted subset available through neo4j
neo4j
/neo4j-admin server
commands:
status
: return server status
console
: start the server in the foreground
start
: start server in the background
stop
: stop the background server
restart
: restart the background server
http://localhost:7474/browser/ brings up GUI dashboard/UI
this is the Editor, the primary (developer) interface
help topics
connection to Neo4J databases
execute Editor commands (colon-prefixed)
execute Cypher (query language) commands
each command creates a "result frame"
system
: administrative upkeep and management
neo4j
: default database, empty at start
NoSQL
multimodel
distributed
ad-hoc query language
SDKs across mainstream languages
https://couchbase.com
Cloud
https://cloud.couchbase.com/sign-up
Docker
Package managers
Homebrew: brew install --cask couchbase-server-community
Download
https://docs.couchbase.com/server/current/install/get-started.html
Docker download/run
docker run -t --name db -p 8091-8096:8091-8096 -p 11210-11211:11210-11211 couchbase/server:enterprise-7.2.0
Couchbase Server has several dataset samples
Settings | Sample Buckets
travel-sample
gamesim-sample
beer-sample
Choose one, CBS will set it all up
JSON document model
"json types" describe different kinds of JSON documents
SQL-like query syntax
SELECT
document fields
FROM
source
WHERE
criteria
documents stored in collections stored in scopes
fully-qualified names: namespace:bucket.scope.collection
note that free-form documents lack the uniformity of relational tables
this is by design!
Query syntax
SELECT name FROM `gamesim-sample`;
Returns 586 elements in a JSON array
SELECT jsonType FROM 'gamesim-sample';
Returns 586 elements, all of which are either "player", "item", or "monster"
Network database server
http://www.mongodb.org
stores BSON documents
query-by-example ad-hoc query facilities
map/reduce, server-side execution
master/slave replication; sharding
macOS:
Homebrew: brew install mongodb; brew services start mongodb
Linux:
Ubuntu: sudo apt-get install mongodb-org
RedHat: sudo yum install mongodb-org
Windows:
Chocolatey: choco install mongodb
Downloads: http://www.mongodb.org
standalone executables; just unzip and PATH
mongod.conf
(text) file
configuration parameters are same as command-line parameters
Configuration file example
storage: dbPath: ./data security: authorization: disabled
MongoDB often available from your cloud provider
Mongo Atlas: https://www.mongodb.com/atlas/database
a NodeJS console shell running JavaScript
connects to localhost server on default port (unless parameterized otherwise)
most Mongo examples assume the JS client shell
Connect to a running server with the mongo client -- client uses JavaScript and JSON
Current Mongosh Log ID: 644f441765e9cf735506f318 Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.8.1 Using MongoDB: 6.0.5 Using Mongosh: 1.8.1 For mongosh info see: https://docs.mongodb.com/mongodb-shell/ test> show databases admin 40.00 KiB config 84.00 KiB local 40.00 KiB test>
MongoDB has numerous popular language drivers
Some are community-suppoted, some from the company
Select as necessary
This is obviously a high-level view
further investigation is necessary!
prototypes are necessary!
allowing yourself time to fail is necessary!
Who is this guy?
Architect, Engineering Manager/Leader, "force multiplier"
Principal -- Neward & Associates
http://www.newardassociates.com
Educative (http://educative.io) Author
Performance Management for Engineering Managers
Author
Professional F# 2.0 (w/Erickson, et al; Wrox, 2010)
Effective Enterprise Java (Addison-Wesley, 2004)
SSCLI Essentials (w/Stutz, et al; OReilly, 2003)
Server-Based Java Programming (Manning, 2000)