Databases¶

Records | SQL, Embedded SQL, Distributed SQL
Data structures | Redis
KV, Distributed KV | Memcache, BDB, Others
CSV | Hadoop
Dataframes - Spark, Pandas, Druid
JSON | Mongodb, CouchDb, Cassandra
Files | File System

Scaling¶

Vertical
Horizontal

Sharding
Master / Slave
Replication

Data¶

Types: Document, Graph, Table, Dictionary, List, Array, Matrix

Data Science: Crawling, Analytics, Statistics, Transformation, Predictions, Visual Modelling, …
Simulations

ER / Binary / XML / Object / EAV / File
Queries / Commands
Serialization
Storage and Caching
Indexing / Searching / Tagging / Learning
Compression
Caching
Ontology
Analytics
Events / Messages
Logs
ETL
Snapshot + Restore
State / Context / Session
Protocol / Format
Language + Parsing + Validation
Transactions
Lazy Loading / Iterator
Streaming
Memento
Memory Pool
Data Binding
Composite
Inheritance

Concurrency¶

Double Checked Locking
Scheduling
Lock / Monitor
CAP
ACID
BASE
Active Object
Promise
Balking

Distributed Computing¶

dispy
disco
dask
druid
soop
https://codespeak.net/execnet/
https://pythonhosted.org/joblib/index.html

Scaling Tuning Sharding Cassandra

https://www.youtube.com/watch?v=QA-owFLFNcQ&list=PLalrWAGybpB-L1PGA-NfFu2uiWHEsdscD&index=9 https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6 https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6

Ideal backend would combine

Couchbase (rebalancing)
Custom Sqlite (Actordb, rqlite) (simplicity, performance, storage)
Remote Storage (UI)
Couchdb (replication)
Hasura (UI)

And have

Oauth for users / Multiple projects support
Auditing
Performance metrics
Analytics
CRUD / File ( support insert directly)
Query by Map / Reduce and Select
Locking / Transactions
Shardlist / Update dynamic management
Backup
Each shard should redirect to a relevant shard

Connection Pooling¶

Fix max number of connections
Wait indefinitely for new connection
Idle timeout for connections

This means that a previously-created connection will be reused where possible, or a new connection created if all existing ones are still in use (i.e. have outstanding queries whose results have not yet been consumed). In particular, holding on to a reference to a connection will also prevent it from being returned to the pool, so for the most part you should acquire connections when you need them and not memoize or save connections in a global persistent variable.

GQL connections are cached according to use_cache.

https://www.mysqlcalculator.com/ 1mb per connection sql

Connections can’t be shared between processes.

Basically databases are made of btrees. The differenece between mongodb and sql is schema

enforcement.

ACID¶

Atomic: Either it happens or doesn’t happen
Durability: A transaction is permanant and is stored in the non-volatile memory
Isolation: Transactions appear to run sequentially.
Consistency: constraints, cascades, triggers are all followed and atomic.

Concurrency¶

MVCC algorithm - Takes a snaphot to allow reads while writes happen separately - View Serialisability 2 Phase locking - Use locks for allowing consistent reads as well - Conflict serialisability

CAP Theorem¶

Availability is achieved by replicating the data across different machines. Let’s say there is a network problem - partition. You have to pick between consiteny and availability.

BASE - eventual consistensy. Its more performant ~100ms to be consistent.

Practical Errors¶

Network
Disastor
Operating System crash
Hardware crashed
Power cuts

Indexes¶

Nonclustered indexa - Only index has sorted rows. Actual rows are stored in an arbitrary order. Clustered Index - Store rows Compound index - Sorted excel

InnoDB uses a Btree to store values at the leaf nodes. Covering Index already has the values needed.

Data Warehouse¶

OLAP -> Analytics Data OLTP -> Transactional Data

Sharding¶

Split tables to databases
Split users table based on alphabet.
Convert hash to a number and hash the id
Lookup table
Geolocation

Data Retrieval

Get Request
Get Shrad
Aggregate
Session Affinity

Operations that should be automated

Split Shrad Move Shrad

To do the above online you need to lock tables while using replication.

ER Modelling¶

datum = unit of information information = unit of the world = fact data is a plural form information can be a set of combination of datum attribue can be atomic, composite, multiple-valued, derived over a domain and super

database ~ databank analogy to a bank from which application of constraints is clear

dbms is a set of utilities used to build and maintain db’s

data model is a structural and a calculus logic plan by which data are understood and manipulated

the accepted conceptual model consists of entitity types and relationships

entity = self-contained information weak entity = context-dependent information entity set = set of entities relationship set = mapping of affairs among entities of different entity sets ER schema = outline = set of combination of attributes which give out entity sets and relation sets = can be a diagram (see cheatsheet)

the accepted logical model consists of tables table = relationship set | entity set = relation = set of tuples tuple = entity or a mapping relation schema = set of attributes database = set of relation schema

ddl = data definition language create, alter, drop insert, delete, update dml = data manipulation language select+where,project, rename, union, intersection, difference, cross, division, join+where

*set operations work wrt entire table join is implemented using cross and select intersection is implemented using union and difference division = column matching

1-1 does not require a separate table but separation can help speed up 1-M requires the M table to hold a foregn key M-M requires a association table

join and nested queries are used in the 1-M and M-M case

equi join = union of tuples with same foreign key inner join = natural join = equi join + unique columns outer join = place nulls (U) = union of tuples wrt foreign key while preserving the entire table cross join removes the common part

decomposition = cohesion or else annomalies entity constraint = primary key = super attribute = should be unique and not null referenctial integrity constraint = foreign key = foreign tuple must exist sematinc constraints = triggers

strict functional dependency contraint makes common attributes of any two tuples determine anyother attributes semanticlly impossible without a relationship

update anamoly – update at multiple tuples for single change insertion anamoly – can not insert some information deletion anamoly – more information is lost than required

0nf = data should have a primary key 1nf = atomic attribute values. the redundant values are repeated within another table. 2nf = primary key must the the sole primary key (a primary attribute alone should not be a key) 3nf = 2nf + no non-prime attribute is transitively dependant on the prime key (association table) bcnf = 2nf + no non-prime attribute is transitively dependant on any prime key

stored procedures - checking, looping, variables, network transperancy

backup and replication - dump, M-M, M-S(-S-S ..)

security - authentication

optimization - indexing

views

Distributed¶

couchbase cassandra voldemort mongodb immudb dgraph

Memcache¶

get set delete incr decr multiget multiset read

read from cache
else query

write

write to db
refresh key

Doesn’t do any clustering. Just add servers manually. Uses consistent hashing. Expiration + LRU

Memory for an item is not actively reclaimed. If you store an item and it expires, it sits in the LRU cache at its position until it falls to the end and is reused. However, if you fetch an expired item, memcached will find the item, notice that it’s expired, and free its memory. This gives you the common case of normal cache churn reusing its own memory.

Items can also be evicted to make way for new items that need to be stored, or expired items are discovered and their memory reused.

Redis is better at reclaiming unused memory.

BAS¶

Parse Firebase Couchdb p2pDb https://remotestorage.io https://unhosted.org/adventures/16/Our-plan-to-save-the-web.html
https://5apps.com/deploy/home
Couchbase Hasura Pouchdb Nimbus http://deployd.com
https://kuzzle.io
https://getmateria.com/#addons
https://appwrite.io
https://sapphire-db.com/start/main
https://api-platform.com
http://nobackend.org
https://bubble.io/plugins
https://strapi.io
Featherjs https://glitch.com/@glitch

Pandas¶

http://ehneilsen.net/notebook/pandasExamples/pandas_examples.html http://pbpython.com/pandas-pivot-table-explained.html

Explanation of pandas pivot_table function.

relational algebra <-> sql relation <-> table <-> entity set derived relation <-> view tuple <-> row <-> entity && entity type attribute <-> column

domain constraints boolean constraints

primary key <-> entity foreign key <-> weak entity

1st form -> no lists within an attribute if so split 2nd form -> no functional dependencies in a concatenated key if so split and recursively apply 3rd form -> no functional dependencies at all if so separate and use foreign key

File system¶

folder/file basic attributes, arbitrary attributes fileops new save delete move copy link attribute_shit sync or async io deafault caching(os) concurrency(locking) network(nfs) support security provided by jails unions tickets serialisation persistence simple transaction needs a library suuport revision control easy multiple formats – hash, tree, record, xml, document query by grep && find or more sophisticated functions

folder == table file == row link == foreign link

however scripting languages are a must for using filesystem as database

EAV¶

https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
https://www.youtube.com/watch?v=ZoLoIFW1H6g
https://en.wikipedia.org/wiki/Consistent_hashing https://www.linuxjournal.com/article/7451 Pintrest Flickr

Coding Notes

Databases¶

Scaling¶

Data¶

Concurrency¶

Distributed Computing¶

Connection Pooling¶

ACID¶

Concurrency¶

CAP Theorem¶

Practical Errors¶

Indexes¶

Data Warehouse¶

Sharding¶

ER Modelling¶

Distributed¶

Memcache¶

BAS¶

Pandas¶

File system¶

EAV¶