Very common operations such as creating the schema or loading data from a CSV file can alternatively be performed by using Sparksee scripts.
A Sparksee script is a plain-text file which contains a set of commands. Each of these commands is interpreted and executed by a parser. Different alternatives for processing a Sparksee script are explained in the 'Scripting' section of the 'API' chapter.
A Sparksee script has the following features:
Although it is case sensitive, the grammar keywords are case insensitive.
** Strings can be quoted** in order to use them as grammar keywords. It is also a way to use blanks or tabs in the grammar keywords. Strings can be quoted using single-quote (
') or double-quoted (
") characters. Examples of quoted strings:
"this is a user string",
'this is a user string'
Everything between the '
#' character and the end of line character is considered a comment and therefore ignored by the parser.
This chapter explains valid Sparksee script commands and gives examples for each of them.
To define the schema of a Sparksee database the scripts include commands in order to create every element on the schema, such as the database itself and the types and attributes.
These are the commands to create and open a Sparksee database:
CREATE GDB alias INTO filename OPEN GDB alias INTO filename
aliasis the name for the new Sparksee graph database.
filenamecorresponds to the file path where the Sparksee graph database will be physically stored.
CREATE also opens the database after its creation. Also, both commands close any other previously opened database.
Examples of use, take into account that everything after the '
#' character is a description of the command:
open gdb "My graph database" into "database.gdb" # opens 'database.gdb' create gdb MyGDB into '/data/mygdb.dbg' # closes 'database.gdb' and opens 'mygdb.dbg'
The rest of commands require having a database open so this should be the first step in the scripts.
The user can create a new node type with the following command. Attributes may also be created in this step, or added later with the specific commands for them:
CREATE NODE type_name "(" [attribute_name (INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT) [BASIC|INDEXED|UNIQUE] [DEFAULT value], ...] ")"
type_name is the name for the new node type.
An optional attribute list can be specified:
attribute_nameis the name for the new attribute.
nullvalue is assumed by default.
This is an example of a valid node type creation:
create node PEOPLE ( ID string unique, NAME string indexed, BIRTH timestamp, AGE integer basic default 0 )
This is the command to create an edge type and, optionally, its attributes:
CREATE [UNDIRECTED] EDGE type_name [FROM tail_node_type_name TO head_node_type_name] "(" [attribute_name (INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT) [BASIC|INDEXED|UNIQUE] [DEFAULT value], ...] ") [MATERIALIZE NEIGHBORS]"
Edge type creation has the same features as in the case of node types but in addition it includes the following parameters:
UNDIRECTEDsets the new edge type as undirected. If not given, directed is assumed.
FROM tail_node_type_name TO head_node_type_namesets the new edge type as restricted from a tail or source node type to a head or destination node type. If not given, non-restricted edge type is assumed. Given node type names must have been created first.
MATERIALIZE NEIGHBORSforces the creation of the neighbor index. For more details about indexing, check the 'Indexing' section of the 'Graph database' chapter
This is an example of a valid edge type creation:
create edge KNOWS from PEOPLE to PEOPLE ( SINCE timestamp indexed, WHERE string default "Barcelona, ES" ) materialize neighbors
An alternative to specifying attributes would be creating them afterwards, using the specific command for that:
CREATE ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name (INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT) [BASIC|INDEXED|UNIQUE] [DEFAULT value]
In this case,
type_name references the parent node or edge type for the new attribute. When it is omitted and the scope is not specified the attribute is created as global.
The same attributes from the previous examples could have also been created as follows:
create node PEOPLE create attribute PEOPLE.ID string unique create attribute PEOPLE.NAME string indexed create attribute PEOPLE.BIRTH timestamp create attribute PEOPLE.AGE integer basic default 0 create edge KNOWS from PEOPLE to PEOPLE materialize neighbors create edge KNOWS.SINCE timestamp indexed create edge KNWOS.WHERE string default "Barcelona, ES"
And with the following commands, they could be created as attributes defined for a node (for all the nodes of the graph), edge (for all the edges of the graph) or global scope (for all nodes and edges of the graph):
create attribute IDENT string unique create attribute global DESCRIPTION string unique create attribute nodes NICKNAME string unique create attribute edges WEIGHT double
In addition, the default value or the index specification can be updated after the attribute definition with the following commands:
SET ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name DEFAULT value INDEX [type_name.]attribute_name [INDEXED|UNIQUE|BASIC]
For example, the default value for the previous
IDENT global attribute and the index specification of the previous
WHERE attribute could be updated as follows:
set attribute IDENT default "" set attribute global DESCRIPTION default "" set attribute nodes NICKNAME default "" set attribute edges WEIGHT default 0.0 index KNOWS.WHERE basic
Moreover, existing node or edge types and attributes can be removed from the database with the following commands respectively:
DROP (NODE|EDGE) type_name DROP ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name
Thus, we could remove all previously added types and attributes as follows:
drop attribute IDENT drop attribute global DESCRIPTION drop attribute nodes NICKNAME drop attribute edges WEIGHT drop attribute PEOPLE.ID drop attribute PEOPLE.NAME drop attribute PEOPLE.BIRTH drop attribute PEOPLE.AGE drop node PEOPLE drop attribute KNOWS.SINCE drop attribute KNWOS.WHERE drop edge KNOWS
Sparksee provides a couple of commands to easily load data from a CSV file into node or edge types that have been previously created in the graph.
The processing of the CSV file assumes that:
Each row of the CSV file corresponds to a new object. Thus, each row creates a new node or edge object.
All elements of the schema required and referenced by
LOAD commands must previously exist.
All previously existing instances and their values will not be removed from the graph.
This is the command to load data from a CSV file into a node type:
LOAD NODES file_name [LOCALE locale_name] COLUMNS column_name [alias_name], ... INTO node_type_name [IGNORE (column_name|alias_name), ....] [FIELDS [TERMINATED char] [ENCLOSED char] [ALLOW_MULTILINE [num]]] [FROM num] [MAX num] [MODE (ROWS|COLUMNS [SPLIT [PARTITIONS num]])] [LOG (OFF|ABORT|logfile)]
file_name is the CSV file path.
LOCALE allows the user to set a specific locale for the CSV file by means of a case in-sensitive string argument. Locale formats are described in the 'Data import' and 'Data export' sections of the 'API' chapter
COLUMNS sets a list of names and optionally an alias for each column of the CSV file. For example, the first element of the list matches the first column of the CSV file, and so on. In case that the name of a column corresponds to an attribute name, that column is used to set values for that attribute, otherwise the column is ignored.
The character '
*' can be used to ignore a column, too.
node_type_name is the name of the node type to be populated.
IGNORE sets a list of columns to be ignored. The column can be referenced by the name or the alias.
By default, no column is ignored.
FIELDS allows the user to specify certain features for reading data from the CSV file:
TERMINATED sets the character to separate fields (columns) in the CSV file. Comma ('
,') character is the default value.
ENCLOSED sets the character to quote strings. Double-quoted ('
"') character is the default value.
Inner quotes of a quoted string can be escaped if they are doubled. Thus, the string
"my string has a " in the middle" can be processed if the inner quoted is doubled as follows:
"my string has a "" in the middle".
ALLOW_MULTILINE specifies that a string may be in multiple lines of characters. Note that those strings must be quoted.
num specifies a maximum number of lines allowed. By default, strings are considered as single-lined.
FROM sets the index row to start loading from, skipping all text previous to that point. By default all rows are loaded. Note that the row index starts at 0, which is the default value.
MAX sets the maximum number of rows to load. A value of 0, which is the default, means unlimited.
MODE sets the load mode, with the following options:
ROWS. The CSV file is read once. This is the default value.
COLUMNS. The CSV file is read twice: once to create the objects and once to set the attribute values.
COLUMNS SPLIT. The CSV file is read N times: once to create the objects and then, once time for each attribute column.
COLUMNS SPLIT PARTITIONS num. The CSV file is read N times: once to create the objects and then, once for each attribute column. Also, each attribute column is logically partitioned, so data column is loaded in
LOG sets how the errors should be managed:
logfile. Errors are dumped into a file, so it does not stop the processing of the CSV file. This is the default value assuming
logfileas the type name.
ABORT. If an error is raised it interrupts the processing of the CSV file.
OFF. Errors are not dumped anywhere and the processing of the CSV file is never interrupted.
With this CSV file:
ID; NAME; SALARY; AGE 1; name1; 1800; 18 2; name2; 1600; 16 3; name3; 2000; 20 4; name4; 2200; 22
The loading of "PEOPLE" nodes using Sparksee scripts would look like the example below:
create gdb FRIENDS into "friends.dbg" create node PEOPLE create attribute PEOPLE.ID string unique create attribute PEOPLE.NAME string indexed create attribute PEOPLE.AGE integer basic default 0 load nodes "people.csv" columns ID, NAME, *, AGE into PEOPLE fields terminated ; from 1 log "people.csv.log"
This is the command to load data from a CSV file into an edge type:
LOAD EDGES file_name [LOCALE locale_name] COLUMNS column_name [alias_name], ... INTO node_type_name [IGNORE (column_name|alias_name), ....] WHERE TAIL (column_name|alias_name) = node_type_name.attribute_name HEAD (column_name|alias_name) = node_type_name.attribute_name [FIELDS [TERMINATED char] [ENCLOSED char] [ALLOW_MULTILINE [num]]] [FROM num] [MAX num] [MODE (ROWS|COLUMNS [SPLIT [PARTITIONS num]])] [LOG (OFF|ABORT|logfile)]
Most of the features are the same as in the case of loading nodes, with the exception of the following
WHERE elements, which are required to specify the source (tail) and destination (head) of the edges:
TAIL shows which column name (or the alias) from the edge corresponds to values of the specified attribute of a node (
That value would be used to retrieve a node using this edge, so it is strongly advisable to define that node attribute as unique.
TAIL tailColumn = PEOPLE.ID means:
tailColumnshows the column to be used to retrieve the tail nodes.
IDattribute of the
tailColumnis used to retrieve a
PEOPLEnode object using the
HEAD is analogous to
TAIL but in this case the retrieved node will be the head (destination) of the new edge.
This is an example of a Sparksee script to load edges from a CSV file with Sparksee scripts. It assumes that "PEOPLE" nodes have already been created:
1;2;2012-01-01;'Barcelona' 2;3;2011-05-01;'NYC,USA' 3;4;2012-01-01;'Paris,France' 2;1;2012-01-01;'Barcelona' 3;2;2011-05-01;'NYC,USA'
In the example, 'tail tailId = PEOPLE.ID' sets the value referenced by the 'TailId' column as the "PEOPLE" "ID" attribute. The value is used to find a "PEOPLE" node which is the tail of the edge (because the attribute "ID" matches). For instance, the first row created would be an edge that goes from a "PEOPLE" node with ID=2 to the "PEOPLE" node with ID=1.
open gdb FRIENDS into "friends.dbg" create edge KNOWS from PEOPLE to PEOPLE materialize neighbors create edge KNOWS.SINCE timestamp indexed create edge KNWOS.WHERE string default "Barcelona, ES" load edges "knows.csv" columns headId, tailId, SINCE, WHERE into KNOWS ignore headId, tailId where tail tailId=PEOPLE.ID head headId=PEOPLE.ID fields terminated ; enclosed ' log off
In order to be able to create attributes with a timestamp the following command must be specified in the scripts:
SET TIMESTAMP FORMAT timestamp_format_string
Valid format fields are:
yy: Year without century interpreted. Within 80 years before or 20 years after the current year. For example, if the current year is 2007, the pattern MM/dd/yy for the value 01/11/12 parses to January 11, 2012, while the same pattern for the value 05/04/64 parses to May 4, 1964.
MM: Month [1..12]
dd: Day of month [1..31]
hh: Hour [0..23]
mm: Minute [0..59]
ss: Second [0..59]
SSS: Millisecond [0..999]
So, valid timestamp formats may be:
MM/dd/yyyy-hh.mm. If not specified, the parser automatically tries to match any of the following timestamp formats:
Valid commands to set a timestamp would be:
set timestamp format "MM/dd/yy" set timestamp format "MM/dd/yyyy-hh.mm"