Notes on File-Based Backup and Restoration for CouchDB
All of the varied noSQL databases of recent years are a continuation of a longer line of what used to be called object databases. The point of the exercise is that by abandoning SQL and most of its underpinnings you can perform some tasks more rapidly or in greater volume, but at the cost of flexibility in development. The biggest loss from my perspective is the inability to throw together the equivalent of a safe web SQL tool and a replicated database to allow motivated business owners to answer tough reporting questions on their own. If you can figure out business legal paperwork, then you can certainly learn SQL to the degree needed to find out how your business is doing. That in and of itself is a great reason to stick to SQL databases if you are working with a startup: you'll save a lot of time and effort on the reporting front.
In any case, every noSQL or object database typically emerges from a different problem space, and so they are diverse in their details relating to service management, backup and restoration, replication, security models, and so forth. This periphery is only indirectly connected to the business of working with data, but nonetheless has to be managed in order to keep the lights on and the servers running.
The noSQL database CouchDB is, to a first approximation, a scalable JSON store with a REST API. If arriving from the land of SQL to work with CouchDB, one of the first things you'll notice is that the standard method of dumping data to a file produces a different format from the standard method of importing data from a file. So you cannot export data from a single instance to a file and then import it again as is. Madness! But if you look over one of the better guides to CouchDB you'll see that backup and restoration from file doesn't even merit a heading. Instead replication and clustering are the focus. Nonetheless, sometimes you find yourself in the position of needing to export a CouchDB database and reimport it later.
Export
The following script will produce an archive of JSON files, one per database, that contains all of the documents and design documents from the listed databases in a single CouchDB server.
#!/bin/bash
#
# Export a list of CouchDB databases from the server to JSON files.
# Then bundle the exported files into a tar archive.
#
TIMESTAMP=$(date "+%Y%m%d-%H%M%S")
# Adjust the location as appropriate.
TAR_FILE="/root/couch-backup-${TIMESTAMP}.tar.gz"
SERVER="localhost"
PORT="5984"
DATABASES=("db1" "db2")

FILES=""
for DATABASE in "${DATABASES[@]}"; do
  FILE="/tmp/${DATABASE}.json"
  curl -X GET \
    "http://${SERVER}:${PORT}/${DATABASE}/_all_docs?include_docs=true" \
    > "${FILE}"
  # Build a list of the files to add to the archive.
  FILES="${FILES} ${FILE}"
done

# Tar and gzip the exported files.
tar -zcf "${TAR_FILE}" ${FILES}
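Before relying on an archive produced this way, it is worth confirming that it is intact and contains what you expect. A minimal sketch of the checks, using a throwaway archive built in a temporary directory to stand in for a real backup:

```shell
# Create a sample archive in a temporary directory, then verify it the
# same way you would verify a real couch-backup tarball.
WORK_DIR=$(mktemp -d)
cd "${WORK_DIR}"

# Stand-in for an exported database file.
echo '{"total_rows":0,"offset":0,"rows":[]}' > db1.json
tar -zcf couch-backup.tar.gz db1.json

# gzip -t checks the compressed stream for corruption.
gzip -t couch-backup.tar.gz && echo "gzip stream OK"

# List the files captured in the archive without extracting them.
tar -ztf couch-backup.tar.gz
```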
Convert the Data to an Import Format
The exported JSON for a single database has the following format:
{
  "total_rows": 2,
  "offset": 0,
  "rows": [
    {
      "id": "bar",
      "key": "bar",
      "value": {"rev": "1-4057566831"},
      "doc": {"_id": "bar", "_rev": "1-4057566831", "name": "jim"}
    },
    {
      "id": "baz",
      "key": "baz",
      "value": {"rev": "1-2842770487"},
      "doc": {"_id": "baz", "_rev": "1-2842770487", "name": "trunky"}
    }
  ]
}
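If jq is installed, it gives a quick way to inspect an export before converting it. A sketch, using an inline sample that mirrors the export format above rather than a real export file:

```shell
# Inspect an export with jq: count the rows and list the document IDs.
# The sample JSON here mirrors the export format shown above.
EXPORT_JSON='{"total_rows":2,"offset":0,"rows":[
{"id":"bar","key":"bar","value":{"rev":"1-4057566831"},"doc":{"_id":"bar","_rev":"1-4057566831","name":"jim"}},
{"id":"baz","key":"baz","value":{"rev":"1-2842770487"},"doc":{"_id":"baz","_rev":"1-2842770487","name":"trunky"}}
]}'

# Number of documents in the export.
echo "${EXPORT_JSON}" | jq '.rows | length'
# 2

# The ID of every exported document.
echo "${EXPORT_JSON}" | jq -r '.rows[].id'
# bar
# baz
```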
Imports, however, have one of several similar formats, depending on the outcome you are aiming for. The choice is whether or not to set the _id and _rev properties in each document to be imported. If both are provided then the import will only take effect for those documents that match an existing document's _id and _rev - this is suitable for making updates to an existing database to create new document revisions.
{
  "docs": [
    {"_id": "0", "_rev": "1-62657917", "integer": 0, "string": "0"},
    {"_id": "1", "_rev": "1-2089673485", "integer": 1, "string": "1"},
    {"_id": "2", "_rev": "1-2063452834", "integer": 2, "string": "2"}
  ]
}
If _id is provided without a _rev value then a document will be created with that ID, provided that it doesn't already exist.
{
  "docs": [
    {"_id": "0", "integer": 0, "string": "0"},
    {"_id": "1", "integer": 1, "string": "1"},
    {"_id": "2", "integer": 2, "string": "2"}
  ]
}
If both _id and _rev are omitted then a new document is created with an ID assigned by CouchDB.
{
  "docs": [
    {"integer": 0, "string": "0"},
    {"integer": 1, "string": "1"},
    {"integer": 2, "string": "2"}
  ]
}
One use case for restoration from a backup is to import into a fresh, empty database while maintaining the document IDs but not the revisions. This example script converts the export format to the import format, retaining the _id property but not the _rev property on each document.
#!/bin/bash
#
# Convert files db-name.json to db-name-import.json.
#
# Note that this maintains the _id values for each document from
# export format to import format.
#
DATABASES=("db1" "db2")

for DATABASE in "${DATABASES[@]}"; do
  EXPORT_FILE="${DATABASE}.json"
  IMPORT_FILE="${DATABASE}-import.json"

  echo '{"docs":[' > "${IMPORT_FILE}"

  # Using cat is the only way to get the content with escaped quotes
  # preserved. The pipeline steps are, in order:
  # 1. Remove the header of the export, up to and including "rows":[.
  # 2. Remove the last character of each line.
  # 3. Remove unwanted stuff relating to the exported docs.
  # 4. Remove revision info for the doc.
  # 5. Terminate each line correctly.
  # 6. Deal with the last line, which should have the closing brace.
  cat "${EXPORT_FILE}" | \
    sed 's/{"total_rows":.*,"offset":.*,"rows":\[//' | \
    sed 's/.$//' | \
    sed 's/{"id":.*,"key".*,"value":.*,"doc"://' | \
    sed 's/"_rev":"[^"]*",//' | \
    sed 's/},$/,/' | \
    sed 's/}$//' >> "${IMPORT_FILE}"

  echo "}" >> "${IMPORT_FILE}"
done
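The sed pipeline above is fragile: a document whose data happens to contain one of the matched strings will be mangled. If jq is available, the same conversion can be expressed against the JSON structure rather than the text. A sketch, separate from the scripts above, using an inline sample in place of an export file:

```shell
# Convert the export format to the import format with jq, keeping _id
# but dropping _rev from each document.
EXPORT_JSON='{"total_rows":1,"offset":0,"rows":[
{"id":"bar","key":"bar","value":{"rev":"1-4057566831"},"doc":{"_id":"bar","_rev":"1-4057566831","name":"jim"}}
]}'

# Pull out each row's doc, delete its _rev, and wrap the result
# in the {"docs": [...]} envelope expected by _bulk_docs.
echo "${EXPORT_JSON}" | jq '{docs: [.rows[].doc | del(._rev)]}'
```

For a real export, replace the inline sample with `jq '{docs: [.rows[].doc | del(._rev)]}' "${EXPORT_FILE}" > "${IMPORT_FILE}"`.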
Import
Importing documents into a database is simple enough, but there are some permissions issues to think about if the import contains design documents. If the CouchDB server has at least one administrative user created, then only a server admin or a user assigned as a database admin can import a design document to a given database. This import script assumes the use of a suitably privileged user:
#!/bin/bash
#
# Import documents from the import format files associated
# with the given databases.
#
# The user must be either a server admin or a database admin
# assigned to all of the databases in the list.
ADMIN="import-admin"
PASSWORD="password"
SERVER="localhost"
PORT="5984"
DATABASES=("db1" "db2")

for DATABASE in "${DATABASES[@]}"; do
  IMPORT_FILE="${DATABASE}-import.json"
  curl -d @"${IMPORT_FILE}" \
    -X POST \
    -H 'Content-Type: application/json' \
    "http://${ADMIN}:${PASSWORD}@${SERVER}:${PORT}/${DATABASE}/_bulk_docs"
done
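Note that _bulk_docs reports success or failure per document: the response is a JSON array with one status object per submitted document, where failures carry an error and reason field instead of ok. A sketch of filtering such a response for failures with jq, using an illustrative sample response rather than a live server call:

```shell
# A _bulk_docs response is an array with one entry per document.
# Successful writes have "ok":true; failures have "error" and "reason".
RESPONSE='[{"ok":true,"id":"0","rev":"1-62657917"},
{"id":"1","error":"conflict","reason":"Document update conflict."}]'

# Print the id and error for each document that failed to import.
echo "${RESPONSE}" | jq -r '.[] | select(.error) | "\(.id): \(.error)"'
# 1: conflict
```

In the import script above, piping the curl output through a filter like this makes silent partial failures visible.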