Very common operations such as creating the schema or loading data from a CSV file can alternatively be performed by using Sparksee scripts.
A Sparksee script is a plain-text file which contains a set of commands. Each of these commands is interpreted and executed by a parser. Different alternatives for processing a Sparksee script are explained in the ‘Scripting’ section of the ‘API’ chapter.
A Sparksee script has the following features:
- Although the scripts are case-sensitive, grammar keywords are case-insensitive.
- **Strings can be quoted** in order to use grammar keywords as user strings, and also to include blanks or tabs in a string. Strings can be quoted using the single-quote (') or double-quote (") characters. Examples of quoted strings: "this is a user string", 'this is a user string'.
- Everything between the '#' character and the end of the line is considered a comment and is therefore ignored by the parser.
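For instance, the following hypothetical snippet (the alias and file name are placeholders) combines a quoted alias containing blanks with comments:

# this whole line is a comment
create gdb 'My graph DB' into "my.gdb"   # the alias is quoted because it contains blanks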
This chapter explains valid Sparksee script commands and gives examples for each of them.
To define the schema of a Sparksee database, scripts include commands to create each element of the schema, such as the database itself, the node and edge types, and their attributes.
These are the commands to create and open a Sparksee database:

CREATE GDB alias INTO filename
OPEN GDB alias INTO filename

- alias is the name for the new Sparksee graph database.
- filename corresponds to the file path where the Sparksee graph database will be physically stored.

Note that CREATE also opens the database after its creation. Also, both commands close any other previously opened database.
Examples of use; take into account that everything after the '#' character is a description of the command:
open gdb "My graph database" into "database.gdb" # opens 'database.gdb'
create gdb MyGDB into '/data/mygdb.dbg' # closes 'database.gdb' and opens 'mygdb.dbg'
The rest of the commands require an open database, so creating or opening one should be the first step in any script.
The user can create a new node type with the following command. Attributes may also be created in this step, or added later with the specific commands for them:
CREATE NODE type_name "("
[attribute_name
(INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT)
[BASIC|INDEXED|UNIQUE]
[DEFAULT value],
...]
")"
- type_name is the name for the new node type.
- An optional attribute list can be specified, where attribute_name is the name for the new attribute. If no DEFAULT value is given, a null value is assumed by default.

This is an example of a valid node type creation:
create node PEOPLE (
ID string unique,
NAME string indexed,
BIRTH timestamp,
AGE integer basic default 0
)
This is the command to create an edge type and, optionally, its attributes:
CREATE [UNDIRECTED] EDGE type_name
[FROM tail_node_type_name TO head_node_type_name] "("
[attribute_name
(INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT)
[BASIC|INDEXED|UNIQUE]
[DEFAULT value],
...]
") [MATERIALIZE NEIGHBORS]"
Edge type creation has the same features as node type creation, but in addition it includes the following parameters:

- UNDIRECTED sets the new edge type as undirected. If not given, directed is assumed.
- FROM tail_node_type_name TO head_node_type_name sets the new edge type as restricted, going from a tail (source) node type to a head (destination) node type. If not given, a non-restricted edge type is assumed. The given node type names must have been created first.
- MATERIALIZE NEIGHBORS forces the creation of the neighbor index. For more details about indexing, check the 'Indexing' section of the 'Graph database' chapter.

This is an example of a valid edge type creation:
create edge KNOWS
from PEOPLE to PEOPLE (
SINCE timestamp indexed,
WHERE string default "Barcelona, ES"
) materialize neighbors
An alternative to specifying attributes would be creating them afterwards, using the specific command for that:
CREATE ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name
(INTEGER|LONG|DOUBLE|STRING|BOOLEAN|TIMESTAMP|TEXT)
[BASIC|INDEXED|UNIQUE]
[DEFAULT value]
In this case, type_name references the parent node or edge type for the new attribute. When it is omitted and no scope (GLOBAL, NODES or EDGES) is specified, the attribute is created as global.
The same attributes from the previous examples could have also been created as follows:
create node PEOPLE
create attribute PEOPLE.ID string unique
create attribute PEOPLE.NAME string indexed
create attribute PEOPLE.BIRTH timestamp
create attribute PEOPLE.AGE integer basic default 0
create edge KNOWS from PEOPLE to PEOPLE materialize neighbors
create attribute KNOWS.SINCE timestamp indexed
create attribute "KNOWS.WHERE" string default "Barcelona, ES"
Note that "KNOWS.WHERE" needs to be between double quotes because "WHERE" is a reserved word of the script parser.
With the following commands, attributes can also be created with node scope (defined for all the nodes of the graph), edge scope (for all the edges of the graph) or global scope (for all nodes and edges of the graph):
create attribute IDENT string unique
create attribute global DESCRIPTION string unique
create attribute nodes NICKNAME string unique
create attribute edges WEIGHT double
In addition, the default value or the index specification can be updated after the attribute definition with the following commands:
SET ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name DEFAULT value
INDEX [type_name.]attribute_name [INDEXED|UNIQUE|BASIC]
For example, the default value for the previous IDENT global attribute and the index specification of the previous WHERE attribute could be updated as follows:
set attribute IDENT default ""
set attribute global DESCRIPTION default ""
set attribute nodes NICKNAME default ""
set attribute edges WEIGHT default 0.0
index "KNOWS.WHERE" basic
Moreover, existing node or edge types and attributes can be removed from the database with the following commands, respectively:

DROP (NODE|EDGE) type_name
DROP ATTRIBUTE [GLOBAL|NODES|EDGES] [type_name.]attribute_name

Thus, we could remove all the previously added types and attributes as follows:
drop attribute IDENT
drop attribute global DESCRIPTION
drop attribute nodes NICKNAME
drop attribute edges WEIGHT
drop attribute PEOPLE.ID
drop attribute PEOPLE.NAME
drop attribute PEOPLE.BIRTH
drop attribute PEOPLE.AGE
drop node PEOPLE
drop attribute KNOWS.SINCE
drop attribute "KNOWS.WHERE"
drop edge KNOWS
Sparksee provides a couple of commands to easily load data from a CSV file into node or edge types that have been previously created in the graph.
The processing of the CSV file assumes that:

- Each row of the CSV file corresponds to a new object. Thus, each row creates a new node or edge object.
- All elements of the schema required and referenced by LOAD commands must previously exist.
- Previously existing instances and their values will not be removed from the graph.
This is the command to load data from a CSV file into a node type:
LOAD NODES file_name
[LOCALE locale_name]
COLUMNS column_name [alias_name], ...
INTO node_type_name
[IGNORE (column_name|alias_name), ....]
[FIELDS
[TERMINATED char]
[ENCLOSED char]
[ALLOW_MULTILINE [num]]]
[FROM num]
[MAX num]
[MODE (ROWS|COLUMNS [SPLIT [PARTITIONS num]])]
[LOG (OFF|ABORT|logfile)]
- file_name is the CSV file path.
- LOCALE allows the user to set a specific locale for the CSV file by means of a case-insensitive string argument. Locale formats are described in the 'Data import' and 'Data export' sections of the 'API' chapter.
- COLUMNS sets a list of names and, optionally, an alias for each column of the CSV file. The first element of the list matches the first column of the CSV file, and so on. If the name of a column corresponds to an attribute name, that column is used to set values for that attribute; otherwise the column is ignored. The character '*' can also be used to ignore a column.
- node_type_name is the name of the node type to be populated.
- IGNORE sets a list of columns to be ignored. A column can be referenced by its name or its alias. By default, no column is ignored.
- FIELDS allows the user to specify certain features for reading data from the CSV file:
    - TERMINATED sets the character that separates fields (columns) in the CSV file. The comma (',') character is the default value.
    - ENCLOSED sets the character used to quote strings. The double-quote ('"') character is the default value. Inner quotes of a quoted string can be escaped in the two following ways: (i) doubling the inner quote, so the string "my string has a " in the middle" can be processed if written as "my string has a "" in the middle"; (ii) putting a '\' character before the inner quote, so the same string can be processed if written as "my string has a \" in the middle".
    - ALLOW_MULTILINE specifies that a string may span multiple lines. Note that such strings must be quoted. num specifies the maximum number of lines allowed. By default, strings are considered single-line.
- FROM sets the row index to start loading from, skipping all rows before that point. Row indexes start at 0, which is also the default value (by default, all rows are loaded).
- MAX sets the maximum number of rows to load. A value of 0, which is the default, means unlimited.
- MODE sets the load mode, with the following options:
    - ROWS: the CSV file is read once. This is the default value.
    - COLUMNS: the CSV file is read twice, once to create the objects and once to set the attribute values. (DEPRECATED)
    - COLUMNS SPLIT: the CSV file is read N times, once to create the objects and then once for each attribute column. (DEPRECATED)
    - COLUMNS SPLIT PARTITIONS num: the CSV file is read N times, once to create the objects and then once for each attribute column; in addition, each attribute column is logically partitioned, so its data is loaded in num partitions. (DEPRECATED)
- LOG sets how errors should be managed:
    - logfile: errors are dumped into a file, so they do not stop the processing of the CSV file. This is the default option, with the type name used as the default log file name.
    - ABORT: if an error is raised, it interrupts the processing of the CSV file.
    - OFF: errors are not dumped anywhere and the processing of the CSV file is never interrupted. (DEPRECATED)
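With a CSV file such as the following (the data shown here is hypothetical; what matters is the layout: a header row skipped with FROM 1, fields separated by ';', and a third column that the script ignores):

id;name;city;age
1;John;Barcelona;33
2;Kelly;London;28
3;Mary;Paris;41
4;Alex;Berlin;35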
The loading of “PEOPLE” nodes using Sparksee scripts would look like the example below:
create gdb FRIENDS into "friends.dbg"
create node PEOPLE
create attribute PEOPLE.ID string unique
create attribute PEOPLE.NAME string indexed
create attribute PEOPLE.AGE integer basic default 0
load nodes "people.csv"
columns ID, NAME, *, AGE
into PEOPLE
fields terminated ;
from 1
log "people.csv.log"
This is the command to load data from a CSV file into an edge type:
LOAD EDGES file_name
[LOCALE locale_name]
COLUMNS column_name [alias_name], ...
INTO edge_type_name
[IGNORE (column_name|alias_name), ....]
WHERE
TAIL (column_name|alias_name) = node_type_name.attribute_name
HEAD (column_name|alias_name) = node_type_name.attribute_name
[FIELDS
[TERMINATED char]
[ENCLOSED char]
[ALLOW_MULTILINE [num]]]
[FROM num]
[MAX num]
[MODE (ROWS|COLUMNS [SPLIT [PARTITIONS num]])]
[LOG (OFF|ABORT|logfile)]
Most of the features are the same as in the case of loading nodes, with the exception of the WHERE clause, which is required to specify the source (tail) and destination (head) of the edges:

- TAIL indicates which column (referenced by name or alias) holds values of the specified node attribute (node_type_name.attribute_name). Each value is used to retrieve the node that becomes the tail of the new edge, so it is strongly advisable to define that node attribute as unique. For instance, TAIL tailColumn = PEOPLE.ID means:
    - tailColumn is the column used to retrieve the tail nodes.
    - PEOPLE.ID is the ID attribute of the PEOPLE node type.
    - Each tailColumn value is used to retrieve a PEOPLE node object through its ID attribute.
- HEAD is analogous to TAIL, but in this case the retrieved node will be the head (destination) of the new edge.
This is an example of loading edges from a CSV file with Sparksee scripts. It assumes that the “PEOPLE” nodes have already been created. Given this CSV file (“knows.csv”):
1;2;2012-01-01;'Barcelona'
2;3;2011-05-01;'NYC,USA'
3;4;2012-01-01;'Paris,France'
2;1;2012-01-01;'Barcelona'
3;2;2011-05-01;'NYC,USA'
In the example, ‘tail tailId = PEOPLE.ID’ interprets the value in the ‘tailId’ column as a “PEOPLE” “ID” attribute value. That value is used to find the “PEOPLE” node whose “ID” attribute matches, and that node becomes the tail of the edge. For instance, the first row creates an edge that goes from the “PEOPLE” node with ID=2 to the “PEOPLE” node with ID=1.
open gdb FRIENDS into "friends.dbg"
create edge KNOWS from PEOPLE to PEOPLE materialize neighbors
create attribute KNOWS.SINCE timestamp indexed
create attribute "KNOWS.WHERE" string default "Barcelona, ES"
load edges "knows.csv"
columns headId, tailId, SINCE, "WHERE"
into KNOWS
ignore headId, tailId
where
tail tailId=PEOPLE.ID
head headId=PEOPLE.ID
fields terminated ; enclosed "'"
log off
In order to create attribute values of the timestamp type, the timestamp format can be specified in the scripts with the following command:

SET TIMESTAMP FORMAT timestamp_format
Valid format fields are:
- yyyy: Year
- yy: Year without century, interpreted to be within 80 years before or 20 years after the current year. For example, if the current year is 2007, the pattern MM/dd/yy for the value 01/11/12 parses to January 11, 2012, while the same pattern for the value 05/04/64 parses to May 4, 1964.
- MM: Month [1..12]
- dd: Day of month [1..31]
- hh: Hour [0..23]
- mm: Minute [0..59]
- ss: Second [0..59]
- SSS: Millisecond [0..999]

So, valid timestamp formats may be: MM/dd/yy or MM/dd/yyyy-hh.mm. If not specified, the parser automatically tries to match any of the following timestamp formats:
yyyy-MM-dd hh:mm:ss.SSS
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd
Valid commands to set the timestamp format would be, for example:
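# illustrative examples; any combination of the valid format fields above can be used
set timestamp format "MM/dd/yy"
set timestamp format "MM/dd/yyyy-hh.mm"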