In this chapter, database administrators can learn about the functionality that Sparksee offers to maintain and monitor Sparksee databases.
We place particular emphasis on the recovery functionality, which helps the administrator keep an automatic copy of the database stored and safe at all times.
Sparksee provides functionality for performing a cold backup and restoring a database which has been previously backed up.
During a cold backup, the database is closed or locked and not available to users. The data files do not change during the backup process, so the database is in a consistent state when it is returned to normal operation.
The method Graph#backup performs a full backup by writing all the content of the database into a given file path, and Sparksee#restore creates a new Database instance from a backup file.
The alternative Graph#encryptedBackup and Sparksee#restoreEncryptedBackup methods are also available to create and restore AES-encrypted backup files.
The following code blocks provide an example of this functionality in each supported language:
// perform backup
Graph graph = sess.getGraph();
...
graph.backup("database.gdb.back");
...
sess.close();
// restore backup
Sparksee sparksee = new Sparksee(new SparkseeConfig());
Database db = sparksee.restore("database.gdb", "database.gdb.back");
Session sess = db.newSession();
Graph graph = sess.getGraph();
...
sess.close();
db.close();
sparksee.close();
// perform backup
Graph graph = sess.GetGraph();
...
graph.Backup("database.gdb.back");
...
sess.Close();
// restore backup
Sparksee sparksee = new Sparksee(new SparkseeConfig());
Database db = sparksee.Restore("database.gdb", "database.gdb.back");
Session sess = db.NewSession();
Graph graph = sess.GetGraph();
...
sess.Close();
db.Close();
sparksee.Close();
// perform backup
Graph * graph = sess->GetGraph();
...
graph->Backup(L"database.gdb.back");
...
delete sess;
// restore backup
SparkseeConfig cfg;
Sparksee * sparksee = new Sparksee(cfg);
Database * db = sparksee->Restore(L"database.gdb", L"database.gdb.back");
Session * sess = db->NewSession();
Graph * graph = sess->GetGraph();
...
delete sess;
delete db;
delete sparksee;
# perform backup
graph = sess.get_graph()
...
graph.backup("database.gdb.back")
...
sess.close()
# restore backup
sparks = sparksee.Sparksee(sparksee.SparkseeConfig())
db = sparks.restore("database.gdb", "database.gdb.back")
sess = db.new_session()
graph = sess.get_graph()
...
sess.close()
db.close()
sparks.close()
// perform backup
STSGraph * graph = [sess getGraph];
...
[graph backup: @"database.gdb.back"];
...
[sess close];
[db close];
[sparksee close];
//[sparksee release];
// restore backup
STSSparkseeConfig * cfg = [[STSSparkseeConfig alloc] init];
STSSparksee * sparksee = [[STSSparksee alloc] initWithConfig: cfg];
//[cfg release];
STSDatabase * db = [sparksee restore: @"database.gdb" backupFile: @"database.gdb.back"];
STSSession * sess = [db createSession];
STSGraph * graph = [sess getGraph];
...
[sess close];
[db close];
[sparksee close];
//[sparksee release];
Note that OIDs (object identifiers) for both node and edge objects will be the same when the database is restored; however, type or attribute identifiers may differ.
Take into account that, although the backup operation does not update the database, it works as a writing method. Since Sparksee's concurrency model only accepts one writer transaction at a time (see the 'Processing' section of the 'Graph database' chapter for more details), this operation blocks any other transaction.
Sparksee includes an automatic recovery manager which keeps the database safe against any eventuality. In case of application or system failures, the recovery manager is able to bring the database back to a consistent state on the next restart.
By default, the recovery functionality is disabled, so in order to use it the user must enable and configure the manager. The recovery manager introduces a small performance penalty, so there is always a trade-off between the functionality it provides and a minor decrease in performance.
The configuration includes:
Log file: the recovery log file stores all data pages that have not yet been flushed to disk. It is used for the recovery on the next restart after a failure.
Cache: the maximum size for the recovery cache. Some parts of the recovery log file are stored in this cache; if the cache is too small, some extra I/O will be required. In any case, a small cache should be enough to work properly.
Checkpoint: the checkpoint frequency for the recovery cache. A checkpoint is a safe point which guarantees database consistency. On the one hand, a high frequency increases the number of writes to disk, slowing down the process. On the other hand, a low frequency requires a larger recovery log file and increases the risk of losing information.
This configuration can be performed with the SparkseeConfig class or by setting the values in a Sparksee configuration file. This is explained in detail in the 'Recovery' section of the 'Configuration' chapter.
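For instance, a minimal Java sketch of enabling and configuring the recovery manager programmatically could look like the following. The setter names mirror the recovery configuration values described above, but the exact names and units should be confirmed in the reference manual:

// Enable and configure the recovery manager (a sketch; confirm setter
// names and units in the reference manual)
SparkseeConfig cfg = new SparkseeConfig();
cfg.setRecoveryEnabled(true);                     // turn the recovery manager on
cfg.setRecoveryLogFile("database.gdb.recovery");  // recovery log file path
cfg.setRecoveryCacheMaxSize(1024);                // maximum recovery cache size (assumed units; check manual)
cfg.setRecoveryCheckpointTime(60);                // checkpoint frequency (assumed units; check manual)
Sparksee sparksee = new Sparksee(cfg);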
It is possible to enable the logging of Sparksee activity. The log configuration requires both the log level and the log file path.
This configuration can be performed with the SparkseeConfig class or by setting the values in a Sparksee configuration file. This is explained in detail in the 'Log' section of the 'Configuration' chapter.
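For instance, in Java the log can be configured through SparkseeConfig before creating the Sparksee instance (a minimal sketch; see the 'Configuration' chapter for the equivalent file settings):

// Configure Sparksee activity logging (a sketch)
SparkseeConfig cfg = new SparkseeConfig();
cfg.setLogLevel(LogLevel.Info);   // one of the levels listed below
cfg.setLogFile("sparksee.log");   // path of the log file
Sparksee sparksee = new Sparksee(cfg);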
Current valid Sparksee log levels are defined in the LogLevel enum class. This is the list of values, ordered from least to most verbose:
Off
The log is disabled.
Severe
The log only stores errors.
Warning
Errors and situations which may require special attention are included in the log file.
Info
Errors, warnings and informational messages are stored.
Config
The log includes configuration details of the different components.
Fine
This is the most complete log level; it includes the previous levels of logging plus additional platform details.
Debug
Logs debug information. It only works with a debug version of the library, so it can only be used by developers.
There are two methods to dump a summary of the content of a Sparksee database.
Graph#dumpData writes a summary of the user's data. It contains the attributes and values for each of the database objects, as well as other types of user-oriented information.
Graph#dumpStorage writes a summary of the internal structures. This type of dump is useful for developers.
Both files are written in YAML, a human-readable data serialization format.
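For instance, in Java (a minimal sketch; both methods receive the output file path, and the file names used here are hypothetical):

// Dump summaries of the database content in YAML
Graph graph = sess.getGraph();
graph.dumpData("data_summary.yaml");       // user-oriented summary of objects and values
graph.dumpStorage("storage_summary.yaml"); // internal structures, useful for developers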
Sparksee offers a set of runtime statistics for its different components. Before using each statistics method, it is recommended to check the corresponding class in the reference manual of the chosen programming language.
The class DatabaseStatistics provides general information about the database:
Database size.
Temporary storage database size.
Current number of concurrent sessions.
Cache size.
Total read data since the start.
Total write data since the start.
Use the Database#getStatistics method to retrieve this information.
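For instance, in Java, assuming an open Database instance db (a sketch; the getter names used here are assumptions, so check the DatabaseStatistics class in the reference manual):

// Retrieve general database statistics (getter names are assumptions)
DatabaseStatistics stats = db.getStatistics();
System.out.println("Database size: " + stats.getSize());
System.out.println("Open sessions: " + stats.getSessions());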
The class PlatformStatistics provides general information about the platform where Sparksee is running:
Physical memory size.
Free physical memory size.
Number of CPUs.
The epoch time.
CPU user and system time.
Use the Platform#getStatistics method to retrieve this information.
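For instance, in Java (a sketch; the getter names used here are assumptions, so check the PlatformStatistics class in the reference manual):

// Retrieve platform statistics (getter names are assumptions)
PlatformStatistics pstats = Platform.getStatistics();
System.out.println("Physical memory: " + pstats.getPhysicalMemorySize());
System.out.println("Number of CPUs: " + pstats.getNumCPUs());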
The class AttributeStatistics provides information about a certain attribute:
Number of distinct values.
Number of objects with null and non-null values.
Minimum and maximum values.
Mode value and the number of objects having the mode value.
For numerical attributes (integer, long and double) it also includes:
Mean value.
Variance.
Median.
For string attributes it also includes:
Maximum length.
Minimum length.
Average length.
Use the Graph#getAttributeStatistics method to retrieve this information. Note that the method has a boolean argument that specifies whether basic (TRUE value) or complete (FALSE value) statistics must be retrieved for that data type. Check the reference manual to see which statistics are considered basic.
The administrator may also want to count how many objects have an attribute value within a certain range, in which case the method Graph#getAttributeIntervalCount is the most appropriate.
Note that neither method works for Basic attributes; statistics can only be retrieved for Indexed or Unique attributes. See the 'API' chapter for more details on the attribute types.
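For instance, in Java (a sketch; the type name, attribute name and interval bounds are hypothetical, and the exact signatures should be checked in the reference manual):

// Hypothetical node type and indexed attribute
int type = graph.findType("Person");
int attr = graph.findAttribute(type, "salary");
// TRUE retrieves basic statistics; FALSE retrieves complete statistics
AttributeStatistics astats = graph.getAttributeStatistics(attr, true);
// Count the objects with a value in the interval [1000, 2000]
Value lower = new Value();
lower.setInteger(1000);
Value upper = new Value();
upper.setInteger(2000);
long count = graph.getAttributeIntervalCount(attr, lower, true, upper, true);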
Finally, it is also possible to enable the logging of the cache in order to monitor its activity. By default, cache logging is disabled, so it must be enabled and configured first. This configuration can be performed with the SparkseeConfig class or by setting the values in a Sparksee configuration file. This is explained in detail in the 'Log' section of the 'Configuration' chapter.
The configuration of the cache statistics includes:
The output cache statistics log file. This is a CSV file where columns are separated by semicolons, so it can easily be exported to a spreadsheet for processing.
The snapshot frequency. Some statistics are reset for each snapshot, and the frequency of snapshots can be defined by the user.
The cache statistics log includes:
General platform statistics.
General database statistics.
Group statistics.
Internally, Sparksee clusters data pages into different groups. For each of the available groups, the following set of statistics is available:
Number of requests and hits.
Number of reads and writes.
Number of pages and cached pages.
These statistics are reported twice: once for persistent and once for temporary data.
Note that group 0 accumulates the results of all the other groups.
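For instance, in Java (a sketch; the setter names are assumptions modeled on the other SparkseeConfig settings, so confirm them and the time units in the 'Configuration' chapter):

// Enable the cache statistics log (setter names are assumptions)
SparkseeConfig cfg = new SparkseeConfig();
cfg.setCacheStatisticsEnabled(true);           // turn cache statistics on
cfg.setCacheStatisticsFile("cache_stats.csv"); // semicolon-separated output file
cfg.setCacheStatisticsSnapshotTime(1000);      // snapshot frequency (confirm units)
Sparksee sparksee = new Sparksee(cfg);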
Sparksee computes a checksum every time a page is read from or written to disk. This checksum allows Sparksee to detect external I/O data corruption and report it.
Sparksee uses a 4-byte CRC-32 checksum. The checksum is computed for every page each time it is written to disk and is stored in the page header. When a page is read, its checksum is computed and compared to the one stored in the header. If the two checksums disagree, the page read is retried and the checksums are compared again. If the discrepancy persists, an Unrecoverable Error is reported, as explained in the 'API' chapter.
Additionally, Sparksee provides a mechanism to verify the integrity of a database image with respect to checksums, via the Sparksee object:
Sparksee sparksee = new Sparksee(new SparkseeConfig());
boolean success = sparksee.verifyChecksums("database.gdb");
if(!success) {
...
}
sparksee.close();
Sparksee sparksee = new Sparksee(new SparkseeConfig());
boolean success = sparksee.VerifyChecksums("database.gdb");
if(!success) {
...
}
sparksee.Close();
SparkseeConfig cfg;
Sparksee * sparksee = new Sparksee(cfg);
bool success = sparksee->VerifyChecksums(L"database.gdb");
if(!success)
{
...
}
delete sparksee;
sparks = sparksee.Sparksee(sparksee.SparkseeConfig())
success = sparks.verify_checksums("database.gdb")
if not success:
...
sparks.close()
STSSparkseeConfig * cfg = [[STSSparkseeConfig alloc] init];
STSSparksee * sparksee = [[STSSparksee alloc] initWithConfig: cfg];
//[cfg release];
BOOL success = [sparksee verifyChecksums: @"database.gdb"];
if(!success) {
...
}
[sparksee close];
//[sparksee release];
We distribute Sparksee with a set of maintenance tools.
GDBCheck verifies the checksum integrity of a database image and returns an error code if the image appears to be corrupted.
./GDBCheck [-c <cfg file>] -g <gdb file> [--] [--version] [-h]
Where:
-c <cfg file>, --cfg <cfg file>
Sparksee config file
-g <gdb file>, --gdb <gdb file>
(required) Database file
--, --ignore_rest
Ignores the rest of the labeled arguments following this flag.
--version
Displays version information and exits.
-h, --help
Displays usage information and exits.
GDBConf is a tool used to change the configuration parameters of a given database.
./GDBConf [-d] [-e] [-n] [-k] [-c <cfg file>] -g <gdb file> [--]
[--version] [-h]
Where:
-d, --re
Remove encryption
-e, --ae
Add encryption
-n, --rc
Remove checksums
-k, --ac
Add checksums
-c <cfg file>, --cfg <cfg file>
Sparksee config file
-g <gdb file>, --gdb <gdb file>
(required) Database file
--, --ignore_rest
Ignores the rest of the labeled arguments following this flag.
--version
Displays version information and exits.
-h, --help
Displays usage information and exits.
If encryption is enabled, the key and the IV must also be provided in order to add or remove checksums.
GDBBackup is a tool to create and restore backups.
./GDBBackup [-c <cfg file>] [-r] [-i <hex encoded iv>] [-k <hex encoded
key>] -b <backup file> -g <gdb file> [--] [--version] [-h]
Where:
-c <cfg file>, --cfg <cfg file>
Sparksee config file
-r, --restore
Restore mode (will overwrite the DB file)!
-i <hex encoded iv>, --iv <hex encoded iv>
Backup encryption initialization vector
-k <hex encoded key>, --key <hex encoded key>
Backup encryption key
-b <backup file>, --backup <backup file>
(required) Backup file
-g <gdb file>, --gdb <gdb file>
(required) Database file
--, --ignore_rest
Ignores the rest of the labeled arguments following this flag.
--version
Displays version information and exits.
-h, --help
Displays usage information and exits.
Simple Backup arguments example:
GDBBackup -g sourceDB.gdb -b targetBackup.back
Simple Restore arguments example:
GDBBackup -r -g targetDB.gdb -b sourceBackup.back
sparkseecli is a tool to execute queries on a Sparksee image using any of the supported query languages.
./sparkseecli [-l <algebra|cypher>] [--ro] [--create] [-r <numRows>]
[-e <command>] [-b] [-s <file>] [-c <cfg file>] -g
<filename> [--] [--version] [-h]
Where:
-l <algebra|cypher>, --lang <algebra|cypher>
Query language
--ro
Open the DB in READ ONLY mode
--create
Create a new DB
-r <numRows>, --rows <numRows>
Number of output rows limit
-e <command>, --execute <command>
Execute command
-b, --batch
Batch mode (non interactive)
-s <file>, --script <file>
Script file
-c <cfg file>, --cfg <cfg file>
Sparksee config file
-g <filename>, --gdb <filename>
(required) Database filename
--, --ignore_rest
Ignores the rest of the labeled arguments following this flag.
--version
Displays version information and exits.
-h, --help
Displays usage information and exits.
Multiline commands are allowed. '/' marks the end of a command.
Special commands for the console:
quit Exit the console
CTRL+C Force exit the console
CTRL+A Go to the beginning of the line
CTRL+E Go to the end of the line
UP arrowkey History up
DOWN arrowkey History down
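Simple batch execution arguments example, using only the arguments listed above (the script file name is hypothetical):
sparkseecli -g database.gdb -l cypher -b -s myqueries.txt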
Sparksee uses the disk as a sequential storage unit. Removing information from the database (nodes, edges, types, etc.) does not free the used space, but marks it as removed so it can be reused by subsequent database additions.
However, some restrictions apply when it comes to reusing the space of deleted information. Sparksee clusters the storage used for nodes, edges and other information into storage groups of elements. These storage groups are the minimum unit of allocation and deallocation, and their size depends on the data structure.
For this reason, in order to effectively deallocate the space used by an information item, all the elements allocated in the same storage group must be deallocated. Otherwise, the storage group cannot be deallocated and reused in future database insertions.
As mentioned above, the size of these groups depends on the data structure and the type of information stored. However, the following information can be useful when deciding which nodes and edges to remove in order to effectively free storage space for later reuse:
Sparksee assigns consecutive object identifiers to nodes and edges of the same type. Sparksee preemptively reserves large chunks of identifiers per node/edge type, and nodes and edges created with that type are assigned consecutive identifiers from the reserved chunk. When a chunk is exhausted, a new chunk is reserved. This means that the ids of large groups of nodes and edges, even if not created consecutively, are clustered into chunks of consecutive ids.
Sparksee stores nodes and edges with consecutive ids in the same storage group of a data structure. That is, storage groups contain nodes or edges with consecutive ids, which means that removing nodes or edges only effectively frees space if they have consecutive object identifiers.
The sizes of the storage groups are relatively small. Typically, storage groups of nodes and edges have sizes that are small powers of two; depending on the data structure, they range between 32 and 64 elements. This means that it is not necessary to remove large amounts of nodes or edges in order to effectively free space, which grants the user fine-grained control of the storage used.
In conclusion, removing nodes or edges with consecutive ids is the most effective way to free and reuse storage space. Since ids are assigned consecutively to nodes and edges of the same type as they are created, this means, for instance, that removing the oldest nodes and/or edges of a graph should effectively free their storage for later reuse.
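As an illustrative Java sketch (the type name is hypothetical, and dropping a whole Objects set with Graph#drop is assumed to be available; check the reference manual): removing all the objects of one type drops elements whose ids were assigned from consecutive chunks, so their storage groups become fully empty and reusable.

// Remove all nodes of a hypothetical type to free whole storage groups
int type = graph.findType("OldEvent");
Objects objs = graph.select(type); // objects with ids from consecutive chunks
graph.drop(objs);                  // frees their storage groups for reuse
objs.close();                      // always close Objects instances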
When creating or opening an image with encryption enabled, we recommend providing the key and the IV through the API rather than through the sparksee.cfg configuration file, so as not to expose them in a human-readable medium. The option to pass them through the sparksee.cfg file is meant to be used for testing purposes only.
Unless otherwise specified, you should always be able to upgrade your Sparksee to the next release, but you may not be able to skip version upgrades.
Usually, only a major version number change implies a change in the database file format, but we highly recommend backing up your data before upgrading the Sparksee release. You can check the 'Backup' chapter above, and you can also keep a backup copy of your database files (make the copy while the database is closed).
If the database file format has changed in the new release, the database must be opened at least once without the read-only mode in order to upgrade the files.
In addition to the general considerations for upgrading Sparksee, upgrading from version 5 to 6 requires a few changes in your database settings.
Sparksee version 6.0 is the first release to include checksum verification enabled by default and the new licensing system with mandatory identifiers. So, in order to open a Sparksee 5 database with Sparksee 6, you need to both disable the checksum verification and set up the license identifiers that you should have received.
Disable checksum verification by adding the line "sparksee.storage.checksum=False" to your Sparksee config file, or call the setChecksumEnabled method of SparkseeConfig with the argument value False.
Add the lines sparksee.clientId=YOUR_CLIENT_IDENTIFIER and sparksee.licenseId=YOUR_LICENSE_IDENTIFIER to your Sparksee config file, and remove your old sparksee.license line from it. Alternatively, you can call the equivalent methods of SparkseeConfig.
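Putting both steps together in Java (a sketch; setClientId and setLicenseId are assumed names for the equivalent identifier methods and should be confirmed in the reference manual):

// Open a Sparksee 5 database with Sparksee 6 (a sketch)
SparkseeConfig cfg = new SparkseeConfig();
cfg.setChecksumEnabled(false);               // equivalent to sparksee.storage.checksum=False
cfg.setClientId("YOUR_CLIENT_IDENTIFIER");   // assumed equivalent of sparksee.clientId
cfg.setLicenseId("YOUR_LICENSE_IDENTIFIER"); // assumed equivalent of sparksee.licenseId
Sparksee sparksee = new Sparksee(cfg);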
You can find more information about checksums in the 'Checksums' chapter, and more information about how to set up both the checksums and the license in the 'Configuration' chapter.
Once you have successfully opened a Sparksee 5 database file with Sparksee 6, your database files should have been upgraded to the new format. Then, you can choose to use the new Sparksee 6 features such as checksum verification or database encryption. But first you should add checksums and/or encryption to the database. You can do this using methods of the Sparksee class while the database is closed, but we recommend using the command-line tool GDBConf. More information can be found in the 'Tools' chapter.