ted.neward@newardassociates.com | Blog: http://blogs.newardassociates.com | Github: tedneward | LinkedIn: tedneward
installation
CAP Theorem
querying
storing
"applications"
Much of "original" computing operated directly on files
Much processing was done "batch" style
run a program at a particular time
process input (files) into output (files)
as an extension of punchcards
data in common systems was organized into "graph" or "network" models
these are not the same terms as we use them today
breaks records and indexes into separate files
data files
fixed-length records
records contain many fields
records appended to end of file
today commonly called "flat files"
index files
records with two fields: key and index number
may support fast-search algorithm (B-trees, etc)
accessed directly by programs
fragile; loss of a single byte could be disastrous
file access required new/repeated code for each program
simultaneous access nearly impossible
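A sketch of what that record-file access looked like (illustrative only: the 32-byte layout, field widths, and helper names are invented):

```javascript
// Sketch of fixed-length-record access. With fixed-length records, record N
// lives at byte offset N * RECORD_SIZE, so lookup is just arithmetic --
// and why the loss of a single byte could be disastrous.
const RECORD_SIZE = 32; // invented layout: 20-byte name + 12-byte number field

function packRecord(name, number) {
  const buf = Buffer.alloc(RECORD_SIZE, " "); // space-padded, like the era's files
  buf.write(name.slice(0, 20), 0);
  buf.write(String(number).slice(0, 12), 20);
  return buf;
}

function readRecord(file, n) {
  // a real program would fs.read() at this offset; here "file" is a Buffer
  const rec = file.subarray(n * RECORD_SIZE, (n + 1) * RECORD_SIZE);
  return { name: rec.toString("utf8", 0, 20).trim(),
           number: rec.toString("utf8", 20).trim() };
}
```

Every program touching the file had to repeat exactly this kind of layout knowledge, which is the maintenance problem the bullet points above describe.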
large server programs would do the batched work
all done under transactional semantics
safer
allowed for more throughput
eased working with multiple files
centralized logic, eased maintenance
"TP monitors" (Tuxedo, ...) became big business
computers got faster
computers got personal
people got impatient
so we moved to more interactive models
Father of the relational database
"A relational model of data for large data banks" (1970)
available at https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
"Future users of large data banks must be protected from having to know how the data is organized in the machine"
proposed a more "mathematical" model of thinking about data, around sets
By the 80s, relational databases were the norm
dBase pioneered the desktop database
others added interactive application/GUI functionality
Paradox, FoxPro, Access, and others
By the 90s, client-server RDBMSs had all the data
Oracle, MS SQL Server, Sybase, Informix, DB2, ...
combined with TP monitors over time
Most companies invested in "DBAs" and "data centers"
"data warehouse", "data lake": progenitors of "big data"
OLAP vs OLTP: analysis vs transaction processing
The 90s brought us object-oriented databases
part of the "object revolution" of the mid-90s
aligned well with OO languages (C++, Java)
Poet, Versant, Gemstone, ...
but the RDBMSs had a decade of entrenchment
In 2000, Eric Brewer presented the CAP Theorem (formally proven in 2002)
Consistency, Availability, Partition tolerance... pick two!
This gave rise to "NoSQL"
CouchDB, MongoDB, Cassandra, ...
scale, scale, scale
Pandora's box was well and fully opened
By 2010, enter cloud
Google Cloud, AWS, Azure, ...
remove the "maintenance and upkeep"
grow as necessary, dynamically
more secure (in theory)
... and open-source databases
relational: MySQL, Postgres, Maria, ...
and non-relational
In the 2020s, we have significant numbers of options
it's quite overwhelming
RDBMSs had great features
Efficient storage
Simple retrieval
Guaranteed data stability
But with more data came... more data
We needed to "partition" data
across multiple database servers/storage engines
but we refused to lose the benefits of RDBMS
Enter "Two-Phase Commit Transactions"
And all was well and good, until...
Before, enterprise apps were internal
Known user base, known loads, known scale
This user base was not likely to change without huge warning
But now, the Web!
After, we began to project our enterprise apps out into the Internet
This meant an unknown and unpredictable user base
With that came an unknown and unpredictable load and scale
"Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you've lost."
Werner Vogels, CTO, Amazon
But why did we care about scale?
Repeat after me: "Contention is the enemy of scalability"
Contention began to take down the existing infrastructure
Traditionally-managed RDBMSes simply couldn't keep up
ACID has its uses... but we found the edge really quickly
CAP Theorem
Consistency: every read receives the most recent write or an error
Availability: every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
"Pick two"
RDBMS goes for C + A
Mostly owing to RDBMS embracing ACID
ACID: Atomic, Consistent, Isolated, Durable
But we can choose A+P or C+P by establishing conventions
Most "NoSQL" databases go for A + P
NoSQLs often embrace BASE
BASE: Basically Available, Soft-state, Eventually consistent
But some keep C + A and just offer different "shapes" to data
PACELC: if a partition (P) occurs, the trade-off is between availability (A) and consistency (C); else (E), the trade-off is between latency (L) and consistency (C)
Less known, but more accurate depiction of the problem
semi-structured data
loosely/un- typed
data is arranged into "documents"
(typically) query-by-example retrieval facility
documents are (more or less) independent entities
no strong relationships to one another
no strong definitions on content
semi-structured data
typically name/value pairs and arrays
semi-hierarchical (sort of)
documents usually arranged into named collections
think the filing cabinet in a doctor's or lawyer's office
loosely typed means easy refactoring
no strong relationships makes sharding/scaling easy
document concept "clusters" relevant data together
loosely typed means one typo ruins your whole day
data validation is whatever your code provides
this has serious impact in polyglot scenarios
lack of a standard query language hurts migration
but there is little migration story to these dbs anyway
lack of a query language makes complex queries tricky
query-by-example hurts here
no "hard links" between data elements isolates data
no "joins" in SQL terms means potentially multiple round trips
blog (or other CMS)
bank accounts and transaction history
customer payment history
CouchDB: document-oriented
http://couchdb.apache.org
JSON documents
JavaScript map/reduce
replication in multiple forms
MVCC (no-lock) writes
CouchDB: document-oriented
no drivers; everything via REST protocol
"Couch apps": HTML+JavaScript+Couch
Erlang implementation
Upshot: best with predefined queries, accumulating data over time
Upshot: Couch apps offer an appserver-less way to build systems
No locks; MVCC instead
Multi-Version Concurrency Control
documents are never locked; instead, each change is written as a new version of the document on top of the old one
each document has a revision number+content-hash
advantage: read requests already in place can finish even in the face of a concurrent write request; new read requests return the written version
result: high parallelization utility
conflicts mean both versions are preserved
leave it to the humans to figure out the rest!
CouchDB's principal I/O is HTTP/REST
GET requests "retrieve"
PUT requests "store" or "create"
POST requests "modify"
DELETE requests "remove"
this isn't ALWAYS true
for the most part, though, it holds
the CouchDB programmer's best friend: curl
or any other command-line HTTP client
good exercise: write your own!
CouchDB's principal data is the document
schemaless key/value nestable store
(essentially a JSON object)
certain keys (_rev, _id) reserved for CouchDB's use
_id is most often a UUID/GUID
fetch one (if you like) from http://localhost:5984/_uuids
database will attach one on new documents
_rev is this document's revision number
formatted as "N-{md5-hash}"
OK to leave blank for new document...
... but must be sent back as part of the stored document
assuming the _rev's match, all is good
if not, "whoever saves a changed document first, wins"
(essentially an optimistic locking scheme)
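That optimistic scheme can be sketched as a tiny in-memory store (hypothetical code, not CouchDB's implementation; real CouchDB derives the hash part of _rev from the document's content, not randomly):

```javascript
// Minimal sketch of CouchDB-style optimistic concurrency
// (TinyStore is an invented, in-memory stand-in).
class TinyStore {
  constructor() { this.docs = new Map(); }
  put(doc) {
    const existing = this.docs.get(doc._id);
    if (existing && existing._rev !== doc._rev) {
      // stale _rev: "whoever saves a changed document first, wins"
      throw new Error("conflict");
    }
    const gen = existing ? parseInt(existing._rev, 10) + 1 : 1;
    // CouchDB formats _rev as "N-{md5-hash}"; a random suffix suffices here
    const saved = { ...doc, _rev: gen + "-" + Math.random().toString(16).slice(2, 10) };
    this.docs.set(doc._id, saved);
    return saved; // the caller must send this _rev back on its next update
  }
  get(id) { return this.docs.get(id); }
}
```

Note that a writer holding a stale _rev fails immediately rather than blocking, which is exactly what lets concurrent readers proceed unhindered.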
Assume this is stored in person.json
{ "_id": "286ccf0edf77bfb6e780be88ae000d0b", "firstname": "Ted", "lastname": "Neward", "age": 40 }
... then insert it into Couch like so:
curl -X PUT http://localhost:5984/{dbname}/286ccf0edf77bfb6e780be88ae000d0b -d @person.json -H "Content-Type: application/json"
Documents can have attachments
attachments are binary files associated with the doc
essentially URLs within the document
simply PUT the document to the desired {_id}/{name}
provide the document's {_rev} as a query param
after all, you are modifying the document
add ?attachments=true to fetch the attachments as binary (Base64) data when fetching the document
Documents can be inserted in bulk
send a POST to http://localhost:5984/{dbname}/_bulk_docs
include array of docs in body of POST
if updating, make sure to include _rev for each doc
"non-atomic" mode: response indicates which documents were saved and which weren't
default mode
"all-or-nothing" mode: all documents will be saved, but conflicts may exist
pass "all_or_nothing": true in the request body
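The request body is just a JSON object with a `docs` array; a sketch of assembling one (`makeBulkBody` is an invented helper, and the document ids/revs are made up for illustration):

```javascript
// Build a _bulk_docs request body; POST the resulting string to
// http://localhost:5984/{dbname}/_bulk_docs
// (makeBulkBody is an invented helper, shown only to illustrate the shape.)
function makeBulkBody(docs, allOrNothing = false) {
  const body = { docs };
  if (allOrNothing) body.all_or_nothing = true; // opt into "all-or-nothing" mode
  return JSON.stringify(body);
}

const payload = makeBulkBody([
  { firstname: "Ted", lastname: "Neward" },       // new doc: no _rev needed
  { _id: "abc123", _rev: "1-deadbeef", age: 41 }, // update: must carry _rev
]);
```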
Retrieving an individual document is just a GET
GET http://localhost:5984/{database}/{_id}
Retrieving multiple documents uses Views
views are essentially predefined MapReduce pairs defined within the database
create a "map" function to select the data from a given document
create a "reduce" function to collect the "map"ped data into a single result (optional)
views are "compiled" on first use
Views can be constrained by query params
?key=...
: return exact row
?startkey=...&endkey=...
: return range of rows
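The view machinery can be imitated outside the server to see the mechanics (a toy sketch: `runView` is invented, and `emit` is passed in as a parameter purely to keep the sketch self-contained, where CouchDB provides it to the map function implicitly):

```javascript
// Toy re-implementation of a CouchDB view: run the map function over every
// document, collect emit()ed rows, sort by key, filter by the query
// parameters, then (optionally) reduce. Invented helper, not CouchDB's API.
function runView(docs, mapFn, reduceFn, { key, startkey, endkey } = {}) {
  let rows = [];
  const emit = (k, v) => rows.push({ key: k, value: v });
  for (const doc of docs) mapFn(doc, emit);
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  if (key !== undefined)      rows = rows.filter(r => r.key === key);     // ?key=...
  if (startkey !== undefined) rows = rows.filter(r => r.key >= startkey); // ?startkey=...
  if (endkey !== undefined)   rows = rows.filter(r => r.key <= endkey);   // ?endkey=...
  return reduceFn ? reduceFn(rows.map(r => r.key), rows.map(r => r.value)) : rows;
}
```

Because rows are sorted by key, the startkey/endkey constraint is a contiguous slice of the index, which is why range queries against a compiled view are cheap.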
Updating an individual document is just a PUT (sending the current _rev back)
PUT http://localhost:5984/{database}/{_id}
Recall: Everything is a document
a document can contain JavaScript code, executed by CouchDB
this is called a "design document"
a design document is a doc whose URL is prefixed with "_design/"
{
  "_id": "_design/example",
  "views": {
    "all_docs": {
      "map": "function(doc) { emit(doc._id, doc._rev) }"
    }
  }
}
CouchDB supports server-side validation
add a "validate_doc_update" member to the _design/ document
function takes three parameters:
"newDoc": the incoming document
"oldDoc": the document currently on disk (if any)
"userCtx": the user and their roles
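A sketch of such a function (the age and deletion rules are invented examples; the member name and the throw-`{forbidden: ...}` convention are CouchDB's):

```javascript
// Sketch of a validate_doc_update function; CouchDB runs it on every write
// to the database, and throwing rejects the write.
// (The business rules themselves are invented for illustration.)
function validate_doc_update(newDoc, oldDoc, userCtx) {
  if (newDoc.age !== undefined && typeof newDoc.age !== "number") {
    throw { forbidden: "age must be a number" };
  }
  if (newDoc._deleted && userCtx.roles.indexOf("_admin") === -1) {
    throw { forbidden: "only admins may delete documents" };
  }
}
```

This is the one place to centralize validation in a polyglot scenario, since every client reaches the database through the same REST front door.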
CouchDB can also show arbitrary pages
add "show" functions to design document
"shows" member in design doc
array of name : function value pairs
functions take 2 params: doc (document) and req (request)
use GET /{db}/_design/{design}/_show/{show}/{id}
URL query parameters are available on request object
view "templates" are also possible, in much the same way
ditto for "lists"
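A show function is simply doc-plus-request in, response out; a local sketch (the "card" name, markup, and greeting parameter are invented):

```javascript
// Sketch of a CouchDB "show" function: render one document as HTML.
// Registered under the design doc's "shows" member, it would answer
//   GET /{db}/_design/{design}/_show/card/{id}
// (the "card" name and markup are invented for illustration).
function card(doc, req) {
  if (!doc) return { code: 404, body: "document not found" };
  // URL query parameters are available on the request object
  const greeting = (req.query && req.query.greeting) || "Hello";
  return { body: "<h1>" + greeting + ", " + doc.firstname + " " + doc.lastname + "</h1>" };
}
```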
CouchApp is a framework/build system for building CouchDB applications (design docs)
It stores views in the CouchDB database
Because CouchDB is a REST-based API, it effectively acts as its own app server
In other words, CouchDB can be a 1-tier application server
a schemaless pseudo-revisioning document store
JavaScript-based application engine
designed for easy replication
CouchDB: The Definitive Guide
J. Chris Anderson et al, O’Reilly, 2011
CouchDB website
http://couchdb.apache.org/
Download: http://couchdb.apache.org/downloads.html
Complete HTTP reference: http://wiki.apache.org/couchdb/Complete_HTTP_API_Reference
CouchApp
Download: https://github.com/couchapp/couchapp/downloads
CouchOne website
Downloads or signup: http://www.couchone.com/get
Architect, Engineering Manager/Leader, "force multiplier"
http://www.newardassociates.com
http://blogs.newardassociates.com
Sr Distinguished Engineer, Capital One
Educative (http://educative.io) Author
Performance Management for Engineering Managers
Books
Developer Relations Activity Patterns (w/Woodruff, et al; APress, forthcoming)
Professional F# 2.0 (w/Erickson, et al; Wrox, 2010)
Effective Enterprise Java (Addison-Wesley, 2004)
SSCLI Essentials (w/Stutz, et al; O'Reilly, 2003)
Server-Based Java Programming (Manning, 2000)