ted.neward@newardassociates.com | Blog: http://blogs.newardassociates.com | Github: tedneward | LinkedIn: tedneward
installation
CAP Theorem
querying
storing
"applications"
Much of "original" computing operated directly on files
Much processing was done "batch" style
run a program at a particular time
process input (files) into output (files)
as an extension of punchcards
data in common systems was organized into "graph" or "network" models
these are not the same terms as we use them today
breaks records and indexes into separate files
data files
fixed-length records
records contain many fields
records appended to end of file
today commonly called "flat files"
index files
records with two fields: key and index number
may support fast-search algorithm (B-trees, etc)
accessed directly by programs
fragile; loss of a single byte could be disastrous
file access required new/repeated code for each program
simultaneous access nearly impossible
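A sketch of what that record-file access looked like (illustrative only: the 32-byte layout, field widths, and helper names are invented):

```javascript
// Sketch of fixed-length-record access. With fixed-length records, record N
// lives at byte offset N * RECORD_SIZE, so lookup is just arithmetic --
// and why the loss of a single byte could be disastrous.
const RECORD_SIZE = 32; // invented layout: 20-byte name + 12-byte number field

function packRecord(name, number) {
  const buf = Buffer.alloc(RECORD_SIZE, " "); // space-padded, like the era's files
  buf.write(name.slice(0, 20), 0);
  buf.write(String(number).slice(0, 12), 20);
  return buf;
}

function readRecord(file, n) {
  // a real program would fs.read() at this offset; here "file" is a Buffer
  const rec = file.subarray(n * RECORD_SIZE, (n + 1) * RECORD_SIZE);
  return { name: rec.toString("utf8", 0, 20).trim(),
           number: rec.toString("utf8", 20).trim() };
}
```

Every program touching the file had to repeat exactly this kind of layout knowledge, which is the maintenance problem the bullet points above describe.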
large server programs would do the batched work
all done under transactional semantics
safer
allowed for more throughput
eased working with multiple files
centralized logic, eased maintenance
"TP monitors" (Tuxedo, ...) became big business
computers got faster
computers got personal
people got impatient
so we moved to more interactive models
Father of the relational database
"A relational model of data for large data banks" (1970)
available at https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
"Future users of large data banks must be protected from having to know how the data is organized in the machine"
proposed a more "mathematical" model of thinking about data, around sets
By the 80s, relational databases were the norm
dBase pioneered the desktop database
others added interactive application/GUI functionality
Paradox, FoxPro, Access, and others
By the 90s, client-server RDBMSs had all the data
Oracle, MS SQL Server, Sybase, Informix, DB2, ...
combined with TP monitors over time
Most companies invested in "DBAs" and "data centers"
"data warehouse", "data lake": progenitors of "big data"
OLAP vs OLTP: analysis vs transaction processing
The 90s brought us object-oriented databases
part of the "object revolution" of the mid-90s
aligned well with OO languages (C++, Java)
Poet, Versant, Gemstone, ...
but the RDBMSs had a decade of entrenchment
In 2000, Eric Brewer presented the CAP Theorem (formally proven in 2002)
Consistency, Availability, Partition tolerance... pick two!
This gave rise to "NoSQL"
CouchDB, MongoDB, Cassandra, ...
scale, scale, scale
Pandora's box was well and fully opened
By 2010, enter cloud
Google Cloud, AWS, Azure, ...
remove the "maintenance and upkeep"
grow as necessary, dynamically
more secure (in theory)
... and open-source databases
relational: MySQL, Postgres, Maria, ...
and non-relational
In the 2020s, we have significant numbers of options
it's quite overwhelming
RDBMSs had great features
Efficient storage
Simple retrieval
Guaranteed data stability
But with more data came... more data
We needed to "partition" data
across multiple database servers/storage engines
but we refused to lose the benefits of RDBMS
Enter "Two-Phase Commit Transactions"
And all was well and good, until...
Before, enterprise apps were internal
Known user base, known loads, known scale
This user base was not likely to change without huge warning
But now, the Web!
After, we began to project our enterprise apps out into the Internet
This meant an unknown and unpredictable user base
With that came an unknown and unpredictable load and scale
"Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you've lost."
Werner Vogels, CTO, Amazon
But why did we care about scale?
Repeat after me: "Contention is the enemy of scalability"
Contention began to take down the existing infrastructure
Traditionally-managed RDBMSes simply couldn't keep up
ACID has its uses... but we found the edge really quickly
CAP Theorem
Consistency: every read receives the most recent write or an error
Availability: every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
"Pick two"
RDBMS goes for C + A
Mostly owing to RDBMS embracing ACID
ACID: Atomic, Consistent, Isolated, Durable
But we can choose A+P or C+P by establishing conventions
Most "NoSQL" databases go for A + P
NoSQLs often embrace BASE
BASE: Basically Available, Soft-state, Eventually consistent
But some keep C + A and just offer different "shapes" to data
PACELC: if a partition (P) occurs, the trade-off is between availability (A) and consistency (C); else (E), the trade-off is between latency (L) and consistency (C)
Less known, but more accurate depiction of the problem
semi-structured data
loosely/un- typed
data is arranged into "documents"
(typically) query-by-example retrieval facility
documents are (more or less) independent entities
no strong relationships to one another
no strong definitions on content
semi-structured data
typically name/value pairs and arrays
semi-hierarchical (sort of)
documents usually arranged into named collections
think the filing cabinet in a doctor's or lawyer's office
loosely typed means easy refactoring
no strong relationships makes sharding/scaling easy
document concept "clusters" relevant data together
loosely typed means one typo ruins your whole day
data validation is whatever your code provides
this has serious impact in polyglot scenarios
lack of a standard query language hurts migration
but there is little migration story to these dbs anyway
lack of a query language makes complex queries tricky
query-by-example hurts here
no "hard links" between data elements isolates data
no "joins" in SQL terms means potentially multiple round trips
blog (or other CMS)
bank accounts and transaction history
customer payment history
CouchDB: document-oriented
http://couchdb.apache.org
JSON documents
JavaScript map/reduce
replication in multiple forms
MVCC (no-lock) writes
CouchDB: document-oriented
no drivers; everything via REST protocol
"Couch apps": HTML+JavaScript+Couch
Erlang implementation
Upshot: best with predefined queries, accumulating data over time
Upshot: Couch apps offer an appserver-less way to build systems
No locks; MVCC instead
Multi-Version Concurrency Control
documents are never locked; instead, each change is written as a new version of the document on top of the old one
each document has a revision number+content-hash
advantage: read requests already in place can finish even in the face of a concurrent write request; new read requests return the written version
result: high parallelization utility
conflicts mean both versions are preserved
leave it to the humans to figure out the rest!
CouchDB's principal I/O is HTTP/REST
GET requests "retrieve"
PUT requests "store" or "create"
POST requests "modify"
DELETE requests "remove"
this isn't ALWAYS true
for the most part, though, it holds
the CouchDB programmer's best friend: curl
or any other command-line HTTP client
good exercise: write your own!
CouchDB's principal data is the document
schemaless key/value nestable store
(essentially a JSON object)
certain keys (_rev, _id) reserved for CouchDB's use
_id is most often a UUID/GUID
fetch one (if you like) from http://localhost:5984/_uuids
database will attach one on new documents
_rev is this document's revision number
formatted as "N-{md5-hash}"
OK to leave blank for new document...
... but must be sent back as part of the stored document
assuming the _rev's match, all is good
if not, "whoever saves a changed document first, wins"
(essentially an optimistic locking scheme)
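That optimistic scheme can be sketched as a tiny in-memory store (hypothetical code, not CouchDB's implementation; real CouchDB derives the hash part of _rev from the document's content, not randomly):

```javascript
// Minimal sketch of CouchDB-style optimistic concurrency
// (TinyStore is an invented, in-memory stand-in).
class TinyStore {
  constructor() { this.docs = new Map(); }
  put(doc) {
    const existing = this.docs.get(doc._id);
    if (existing && existing._rev !== doc._rev) {
      // stale _rev: "whoever saves a changed document first, wins"
      throw new Error("conflict");
    }
    const gen = existing ? parseInt(existing._rev, 10) + 1 : 1;
    // CouchDB formats _rev as "N-{md5-hash}"; a random suffix suffices here
    const saved = { ...doc, _rev: gen + "-" + Math.random().toString(16).slice(2, 10) };
    this.docs.set(doc._id, saved);
    return saved; // the caller must send this _rev back on its next update
  }
  get(id) { return this.docs.get(id); }
}
```

Note that a writer holding a stale _rev fails immediately rather than blocking, which is exactly what lets concurrent readers proceed unhindered.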
Assume this is stored in person.json
{ "_id": "286ccf0edf77bfb6e780be88ae000d0b", "firstname": "Ted", "lastname": "Neward", "age": 40 }
... then insert it into Couch like so:
curl -X PUT http://localhost:5984/{dbname}/286ccf0edf77bfb6e780be88ae000d0b -d @person.json -H "Content-Type: application/json"
Documents can have attachments
attachments are binary files associated with the doc
essentially URLs within the document
simply PUT the document to the desired {_id}/{name}
provide the document's {_rev} as a query param
after all, you are modifying the document
add ?attachments=true to fetch the attachments as binary (Base64) data when fetching the document
Documents can be inserted in bulk
send a POST to http://localhost:5984/{dbname}/_bulk_docs
include array of docs in body of POST
if updating, make sure to include _rev for each doc
"non-atomic" mode: response indicates which documents were saved and which weren't
default mode
"all-or-nothing" mode: all documents will be saved, but conflicts may exist
pass "all_or_nothing": true in the request body
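The request body is just a JSON object with a `docs` array; a sketch of assembling one (`makeBulkBody` is an invented helper, and the document ids/revs are made up for illustration):

```javascript
// Build a _bulk_docs request body; POST the resulting string to
// http://localhost:5984/{dbname}/_bulk_docs
// (makeBulkBody is an invented helper, shown only to illustrate the shape.)
function makeBulkBody(docs, allOrNothing = false) {
  const body = { docs };
  if (allOrNothing) body.all_or_nothing = true; // opt into "all-or-nothing" mode
  return JSON.stringify(body);
}

const payload = makeBulkBody([
  { firstname: "Ted", lastname: "Neward" },       // new doc: no _rev needed
  { _id: "abc123", _rev: "1-deadbeef", age: 41 }, // update: must carry _rev
]);
```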
Retrieving an individual document is just a GET
GET http://localhost:5984/{database}/{_id}
Retrieving multiple documents uses Views
views are essentially predefined MapReduce pairs defined within the database
create a "map" function to select the data from a given document
create a "reduce" function to collect the "map"ped data into a single result (optional)
views are "compiled" on first use
Views can be constrained by query params
?key=...
: return exact row
?startkey=...&endkey=...
: return range of rows
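The view machinery can be imitated outside the server to see the mechanics (a toy sketch: `runView` is invented, and `emit` is passed in as a parameter purely to keep the sketch self-contained, where CouchDB provides it to the map function implicitly):

```javascript
// Toy re-implementation of a CouchDB view: run the map function over every
// document, collect emit()ed rows, sort by key, filter by the query
// parameters, then (optionally) reduce. Invented helper, not CouchDB's API.
function runView(docs, mapFn, reduceFn, { key, startkey, endkey } = {}) {
  let rows = [];
  const emit = (k, v) => rows.push({ key: k, value: v });
  for (const doc of docs) mapFn(doc, emit);
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  if (key !== undefined)      rows = rows.filter(r => r.key === key);     // ?key=...
  if (startkey !== undefined) rows = rows.filter(r => r.key >= startkey); // ?startkey=...
  if (endkey !== undefined)   rows = rows.filter(r => r.key <= endkey);   // ?endkey=...
  return reduceFn ? reduceFn(rows.map(r => r.key), rows.map(r => r.value)) : rows;
}
```

Because rows are sorted by key, the startkey/endkey constraint is a contiguous slice of the index, which is why range queries against a compiled view are cheap.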
Updating an individual document is just a PUT (sending the current _rev back)
PUT http://localhost:5984/{database}/{_id}
Recall: Everything is a document
a document can contain JavaScript code, executed by CouchDB
this is called a "design document"
a design document is a doc whose URL is prefixed with "_design/"
{
  "_id": "_design/example",
  "views": {
    "all_docs": {
      "map": "function(doc) { emit(doc._id, doc._rev) }"
    }
  }
}
CouchDB supports server-side validation
add a "validate_doc_update" member to the _design/ document
function takes three parameters:
"newDoc": the incoming document
"oldDoc": the document currently on disk (if any)
"userCtx": the user and their roles
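A sketch of such a function (the age and deletion rules are invented examples; the member name and the throw-`{forbidden: ...}` convention are CouchDB's):

```javascript
// Sketch of a validate_doc_update function; CouchDB runs it on every write
// to the database, and throwing rejects the write.
// (The business rules themselves are invented for illustration.)
function validate_doc_update(newDoc, oldDoc, userCtx) {
  if (newDoc.age !== undefined && typeof newDoc.age !== "number") {
    throw { forbidden: "age must be a number" };
  }
  if (newDoc._deleted && userCtx.roles.indexOf("_admin") === -1) {
    throw { forbidden: "only admins may delete documents" };
  }
}
```

This is the one place to centralize validation in a polyglot scenario, since every client reaches the database through the same REST front door.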
CouchDB can also show arbitrary pages
add "show" functions to design document
"shows" member in design doc
array of name : function value pairs
functions take 2 params: doc (document) and req (request)
use GET /{db}/_design/{design}/_show/{show}/{id}
URL query parameters are available on request object
view "templates" are also possible, in much the same way
ditto for "lists"
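A show function is simply doc-plus-request in, response out; a local sketch (the "card" name, markup, and greeting parameter are invented):

```javascript
// Sketch of a CouchDB "show" function: render one document as HTML.
// Registered under the design doc's "shows" member, it would answer
//   GET /{db}/_design/{design}/_show/card/{id}
// (the "card" name and markup are invented for illustration).
function card(doc, req) {
  if (!doc) return { code: 404, body: "document not found" };
  // URL query parameters are available on the request object
  const greeting = (req.query && req.query.greeting) || "Hello";
  return { body: "<h1>" + greeting + ", " + doc.firstname + " " + doc.lastname + "</h1>" };
}
```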
CouchApp is a framework/build system for building CouchDB applications (design docs)
It stores views in the CouchDB database
Because CouchDB is a REST-based API, it effectively acts as its own app server
In other words, CouchDB can be a 1-tier application server
a schemaless pseudo-revisioning document store
JavaScript-based application engine
designed for easy replication
CouchDB: The Definitive Guide
J. Chris Anderson et al, O’Reilly, 2011
CouchDB website
http://couchdb.apache.org/
Download: http://couchdb.apache.org/downloads.html
Complete HTTP reference: http://wiki.apache.org/couchdb/Complete_HTTP_API_Reference
CouchApp
Download: https://github.com/couchapp/couchapp/downloads
CouchOne website
Downloads or signup: http://www.couchone.com/get
Architect, Engineering Manager/Leader, "force multiplier"
http://www.newardassociates.com
http://blogs.newardassociates.com
Sr Distinguished Engineer, Capital One
Educative (http://educative.io) Author
Performance Management for Engineering Managers
Books
Developer Relations Activity Patterns (w/Woodruff, et al; APress, forthcoming)
Professional F# 2.0 (w/Erickson, et al; Wrox, 2010)
Effective Enterprise Java (Addison-Wesley, 2004)
SSCLI Essentials (w/Stutz, et al; O'Reilly, 2003)
Server-Based Java Programming (Manning, 2000)