Very common operations such as creating the schema or loading data from a CSV file can alternatively be performed by using Sparksee scripts.
A Sparksee script is a plain-text file which contains a set of commands. Each of these commands is interpreted and executed by a parser. Different alternatives for processing a Sparksee script are explained in the ‘Scripting’ section of the ‘API’ chapter.
A Sparksee script has the following features:
- Although the scripts are case-sensitive, grammar keywords are case-insensitive.
- **Strings can be quoted** in order to use grammar keywords as user strings, and also to include blanks or tabs in a string. Strings can be quoted using the single-quote (') or double-quote (") characters. Examples of quoted strings: "this is a user string", 'this is a user string'.
- Everything between the '#' character and the end of the line is considered a comment and is therefore ignored by the parser.
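For instance, the following hypothetical snippet (the alias and file name are placeholders) combines a quoted alias containing blanks with comments:

# this whole line is a comment
create gdb 'My graph DB' into "my.gdb"   # the alias is quoted because it contains blanks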
This chapter explains valid Sparksee script commands and gives examples for each of them.
To define the schema of a Sparksee database, scripts include commands to create each element of the schema, such as the database itself, the node and edge types, and their attributes.
These are the commands to create and open a Sparksee database:

CREATE GDB alias INTO filename
OPEN GDB alias INTO filename

- alias is the name for the new Sparksee graph database.
- filename corresponds to the file path where the Sparksee graph database will be physically stored.

Note that CREATE also opens the database after its creation. Also, both commands close any other previously opened database.
Examples of use; take into account that everything after the '#' character is a description of the command:
open gdb "My graph database" into "database.gdb" # opens 'database.gdb'
create gdb MyGDB into '/data/mygdb.dbg' # closes 'database.gdb' and opens 'mygdb.dbg'
The rest of the commands require an open database, so creating or opening one should be the first step in any script.
The user can create a new node type with the following command. Attributes may also be created in this step, or added later with the specific commands for them:
CREATE NODE type_name "("
[attribute_name
(INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT)
[BASIC|INDEXED|UNIQUE]
[DEFAULT value],
...]
")"
- type_name is the name for the new node type.
- An optional attribute list can be specified, where attribute_name is the name for the new attribute. If no DEFAULT value is given, a null value is assumed by default.

This is an example of a valid node type creation:
create node PEOPLE (
ID string unique,
NAME string indexed,
BIRTH timestamp,
AGE integer basic default 0
)
This is the command to create an edge type and, optionally, its attributes:
CREATE [UNDIRECTED] EDGE type_name
[FROM tail_node_type_name TO head_node_type_name] "("
[attribute_name
(INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT)
[BASIC|INDEXED|UNIQUE]
[DEFAULT value],
...]
") [MATERIALIZE NEIGHBORS]"
Edge type creation has the same features as node type creation, but in addition it includes the following parameters:

- UNDIRECTED sets the new edge type as undirected. If not given, directed is assumed.
- FROM tail_node_type_name TO head_node_type_name sets the new edge type as restricted, going from a tail (source) node type to a head (destination) node type. If not given, a non-restricted edge type is assumed. The given node type names must have been created first.
- MATERIALIZE NEIGHBORS forces the creation of the neighbor index. For more details about indexing, check the 'Indexing' section of the 'Graph database' chapter.

This is an example of a valid edge type creation:
create edge KNOWS
from PEOPLE to PEOPLE (
SINCE timestamp indexed,
WHERE string default "Barcelona, ES"
) materialize neighbors
An alternative to specifying attributes would be creating them afterwards, using the specific command for that:
CREATE ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name
(INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT)
[BASIC|INDEXED|UNIQUE]
[DEFAULT value]
In this case, type_name references the parent node or edge type for the new attribute. When it is omitted and no scope (GLOBAL, NODES or EDGES) is specified, the attribute is created as global.
The same attributes from the previous examples could have also been created as follows:
create node PEOPLE
create attribute PEOPLE.ID string unique
create attribute PEOPLE.NAME string indexed
create attribute PEOPLE.BIRTH timestamp
create attribute PEOPLE.AGE integer basic default 0
create edge KNOWS from PEOPLE to PEOPLE materialize neighbors
create attribute KNOWS.SINCE timestamp indexed
create attribute "KNOWS.WHERE" string default "Barcelona, ES"
Note that "KNOWS.WHERE" needs to be between double quotes because "WHERE" is a reserved word of the script parser.
With the following commands, attributes can also be created with node scope (defined for all the nodes of the graph), edge scope (for all the edges of the graph) or global scope (for all nodes and edges of the graph):
create attribute IDENT string unique
create attribute global DESCRIPTION string unique
create attribute nodes NICKNAME string unique
create attribute edges WEIGHT double
In addition, the default value or the index specification can be updated after the attribute definition with the following commands:
SET ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name DEFAULT value
INDEX [type_name.]attribute_name [INDEXED|UNIQUE|BASIC]
For example, the default value for the previous IDENT global attribute and the index specification of the previous WHERE attribute could be updated as follows:
set attribute IDENT default ""
set attribute global DESCRIPTION default ""
set attribute nodes NICKNAME default ""
set attribute edges WEIGHT default 0.0
index "KNOWS.WHERE" basic
Moreover, existing node or edge types and attributes can be removed from the database with the following commands, respectively:

DROP (NODE|EDGE) type_name
DROP ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name

Thus, we could remove all the previously added types and attributes as follows:
drop attribute IDENT
drop attribute global DESCRIPTION
drop attribute nodes NICKNAME
drop attribute edges WEIGHT
drop attribute PEOPLE.ID
drop attribute PEOPLE.NAME
drop attribute PEOPLE.BIRTH
drop attribute PEOPLE.AGE
drop node PEOPLE
drop attribute KNOWS.SINCE
drop attribute "KNOWS.WHERE"
drop edge KNOWS
Sparksee provides a couple of commands to easily load data from a CSV file into node or edge types that have been previously created in the graph.
The processing of the CSV file assumes that:

- Each row of the CSV file corresponds to a new object. Thus, each row creates a new node or edge object.
- All elements of the schema required and referenced by LOAD commands must previously exist.
- Previously existing instances and their values will not be removed from the graph.
This is the command to load data from a CSV file into a node type:
LOAD NODES file_name
[LOCALE locale_name]
COLUMNS column_name [alias_name], ...
INTO node_type_name
[IGNORE (column_name|alias_name), ....]
[FIELDS
[TERMINATED char]
[ENCLOSED char]
[ALLOW_MULTILINE [num]]]
[FROM num]
[MAX num]
[MODE (ROWS|COLUMNS [SPLIT [PARTITIONS num]])]
[LOG (OFF|ABORT|logfile)]
- file_name is the CSV file path.
- LOCALE allows the user to set a specific locale for the CSV file by means of a case-insensitive string argument. Locale formats are described in the 'Data import' and 'Data export' sections of the 'API' chapter.
- COLUMNS sets a list of names and, optionally, an alias for each column of the CSV file. The first element of the list matches the first column of the CSV file, and so on. If the name of a column corresponds to an attribute name, that column is used to set values for that attribute; otherwise the column is ignored. The character '*' can also be used to ignore a column.
- node_type_name is the name of the node type to be populated.
- IGNORE sets a list of columns to be ignored. A column can be referenced by its name or its alias. By default, no column is ignored.
- FIELDS allows the user to specify certain features for reading data from the CSV file:
    - TERMINATED sets the character that separates fields (columns) in the CSV file. The comma (',') character is the default value.
    - ENCLOSED sets the character used to quote strings. The double-quote ('"') character is the default value. Inner quotes of a quoted string can be escaped in the two following ways: (i) doubling the inner quote, so the string "my string has a " in the middle" can be processed if written as "my string has a "" in the middle"; (ii) putting a '\' character before the inner quote, so the same string can be processed if written as "my string has a \" in the middle".
    - ALLOW_MULTILINE specifies that a string may span multiple lines. Note that such strings must be quoted. num specifies the maximum number of lines allowed. By default, strings are considered single-line.
- FROM sets the row index to start loading from, skipping all rows before that point. Row indexes start at 0, which is also the default value (by default, all rows are loaded).
- MAX sets the maximum number of rows to load. A value of 0, which is the default, means unlimited.
- MODE sets the load mode, with the following options:
    - ROWS: the CSV file is read once. This is the default value.
    - COLUMNS: the CSV file is read twice, once to create the objects and once to set the attribute values. (DEPRECATED)
    - COLUMNS SPLIT: the CSV file is read N times, once to create the objects and then once for each attribute column. (DEPRECATED)
    - COLUMNS SPLIT PARTITIONS num: the CSV file is read N times, once to create the objects and then once for each attribute column; in addition, each attribute column is logically partitioned, so its data is loaded in num partitions. (DEPRECATED)
- LOG sets how errors should be managed:
    - logfile: errors are dumped into a file, so they do not stop the processing of the CSV file. This is the default option, with the type name used as the default log file name.
    - ABORT: if an error is raised, it interrupts the processing of the CSV file.
    - OFF: errors are not dumped anywhere and the processing of the CSV file is never interrupted. (DEPRECATED)
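With a CSV file such as the following (the data shown here is hypothetical; what matters is the layout: a header row skipped with FROM 1, fields separated by ';', and a third column that the script ignores):

id;name;city;age
1;John;Barcelona;33
2;Kelly;London;28
3;Mary;Paris;41
4;Alex;Berlin;35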
The loading of “PEOPLE” nodes using Sparksee scripts would look like the example below:
create gdb FRIENDS into "friends.dbg"
create node PEOPLE
create attribute PEOPLE.ID string unique
create attribute PEOPLE.NAME string indexed
create attribute PEOPLE.AGE integer basic default 0
load nodes "people.csv"
columns ID, NAME, *, AGE
into PEOPLE
fields terminated ;
from 1
log "people.csv.log"
This is the command to load data from a CSV file into an edge type:
LOAD EDGES file_name
[LOCALE locale_name]
COLUMNS column_name [alias_name], ...
INTO edge_type_name
[IGNORE (column_name|alias_name), ....]
WHERE
TAIL (column_name|alias_name) = node_type_name.attribute_name
HEAD (column_name|alias_name) = node_type_name.attribute_name
[FIELDS
[TERMINATED char]
[ENCLOSED char]
[ALLOW_MULTILINE [num]]]
[FROM num]
[MAX num]
[MODE (ROWS|COLUMNS [SPLIT [PARTITIONS num]])]
[LOG (OFF|ABORT|logfile)]
Most of the features are the same as in the case of loading nodes, with the exception of the WHERE clause, which is required to specify the source (tail) and destination (head) of the edges:

- TAIL indicates which column (referenced by name or alias) holds values of the specified node attribute (node_type_name.attribute_name). Each value is used to retrieve the node that becomes the tail of the new edge, so it is strongly advisable to define that node attribute as unique. For instance, TAIL tailColumn = PEOPLE.ID means:
    - tailColumn is the column used to retrieve the tail nodes.
    - PEOPLE.ID is the ID attribute of the PEOPLE node type.
    - Each tailColumn value is used to retrieve a PEOPLE node object through its ID attribute.
- HEAD is analogous to TAIL, but in this case the retrieved node will be the head (destination) of the new edge.
This is an example of loading edges from a CSV file with Sparksee scripts. It assumes that the “PEOPLE” nodes have already been created. Given this CSV file (“knows.csv”):
1;2;2012-01-01;'Barcelona'
2;3;2011-05-01;'NYC,USA'
3;4;2012-01-01;'Paris,France'
2;1;2012-01-01;'Barcelona'
3;2;2011-05-01;'NYC,USA'
In the example, ‘tail tailId = PEOPLE.ID’ interprets the value in the ‘tailId’ column as a “PEOPLE” “ID” attribute value. That value is used to find the “PEOPLE” node whose “ID” attribute matches, and that node becomes the tail of the edge. For instance, the first row creates an edge that goes from the “PEOPLE” node with ID=2 to the “PEOPLE” node with ID=1.
open gdb FRIENDS into "friends.dbg"
create edge KNOWS from PEOPLE to PEOPLE materialize neighbors
create attribute KNOWS.SINCE timestamp indexed
create attribute "KNOWS.WHERE" string default "Barcelona, ES"
load edges "knows.csv"
columns headId, tailId, SINCE, "WHERE"
into KNOWS
ignore headId, tailId
where
tail tailId=PEOPLE.ID
head headId=PEOPLE.ID
fields terminated ; enclosed "'"
log off
In order to create attribute values of the timestamp type, the timestamp format can be specified in the scripts with the following command:

SET TIMESTAMP FORMAT timestamp_format
Valid format fields are:
- yyyy: Year
- yy: Year without century, interpreted to be within 80 years before or 20 years after the current year. For example, if the current year is 2007, the pattern MM/dd/yy for the value 01/11/12 parses to January 11, 2012, while the same pattern for the value 05/04/64 parses to May 4, 1964.
- MM: Month [1..12]
- dd: Day of month [1..31]
- hh: Hour [0..23]
- mm: Minute [0..59]
- ss: Second [0..59]
- SSS: Millisecond [0..999]

So, valid timestamp formats may be: MM/dd/yy or MM/dd/yyyy-hh.mm. If not specified, the parser automatically tries to match any of the following timestamp formats:
yyyy-MM-dd hh:mm:ss.SSS
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd
Valid commands to set the timestamp format would be, for example:
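# illustrative examples; any combination of the valid format fields above can be used
set timestamp format "MM/dd/yy"
set timestamp format "MM/dd/yyyy-hh.mm"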