Graph Done Right

Introduction

A graph database uses graph structures for semantic queries, representing and storing data as nodes, edges, and properties. Graph databases are often used where data is hard to model with the tables, rows, and columns of a traditional relational database.

For example, graph databases are used to model social networks of who is friends with whom. Each node can be a person, and each edge can be a relationship, such as whether they’re friends with another person or have liked one of their posts. Data for each node can be their demographics: their age, where they live, etc. With this data, it’s relatively easy to do things like suggest new friends based on who your friends have friended, whose posts they’ve liked, and how similar their demographics are.

Another example is supply chains. Nodes can be locations such as factories, roads, ports, warehouses, and retail stores. Edges can indicate which sites are connected to which, essentially forming a supply chain “road graph.” Once modeled, it’s easy to find the shortest path through a supply chain, or which parts of the supply chain are poorly connected and thus have potential bottlenecks.

Most real-world interactions form a graph. Cellular networks are simply graphs of cell phones and the cell towers to which they connect. Data centers are graphs of networking hardware, servers, applications, and the connections between them. Customer 360 data is a graph of all customers and the various touchpoints — online and off — they’ve had with a company.

In each of these examples, companies are making cellular service more reliable, applications snappier, and customer service more satisfying, all thanks to graph data models stored in graph databases.

A quick aside to clear up any potential confusion: when we say “graph” in this document, we don’t mean a graph of a function, such as a line chart or bar chart. That information is best stored in a relational database or spreadsheet. Instead, we mean a graph as defined by graph theory in computer science, where we model relationships between pairs of objects.

Graph Basics

To summarize the examples above, graph databases store objects — also called nodes or vertices. The relations between the nodes are called edges. These nodes and edges form a network of data points called a “graph.” Using the graph data model allows you to represent data alongside the inherent connections that exist within it. A graph database efficiently leverages this representation with built-in graph queries, referred to as traversals.

One last point about nodes and edges: each has a set of properties. In the social media example above, each user can have properties, such as an age of 21 years old or a hometown of Chicago. In the supply chain example, each location might have latitude and longitude properties. These properties are schema-free, meaning we can easily add new properties to any node or edge. For example, some users might list their favorite movie; this becomes a new property for that user node. While ArangoDB is natively schema-free, if your use case does require a defined schema, you can enforce it by enabling the built-in JSON Schema validation. In ArangoDB, schema validation offers several levels of configuration and validation control and complies with the popular JSON Schema specification.

Graph Components

A graph consists of nodes and edges.

Nodes

In a graph database, each object is called a node. We represent nodes as circles and edges as lines or arcs. The terms node and vertex are synonymous. Nodes may be:

  • Connected with more than one other node via multiple edges
  • Connected to themselves
  • Disconnected from the graph, having no connecting edges
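The three cases above can be illustrated with a small sketch. This is plain Python rather than ArangoDB code, and the node names and edge list are hypothetical; it simply classifies nodes by how they are connected.

```python
from collections import defaultdict

# Hypothetical nodes and directed edges (source, target).
nodes = ["alice", "bob", "carol", "dave"]
edges = [("alice", "bob"), ("alice", "carol"), ("carol", "carol")]

def classify(nodes, edges):
    """Return nodes with several neighbors, nodes with self-loops, and isolated nodes."""
    neighbors = defaultdict(set)
    self_loops = set()
    for source, target in edges:
        if source == target:
            self_loops.add(source)  # edge from a node to itself
        else:
            neighbors[source].add(target)
            neighbors[target].add(source)
    multi = {n for n in nodes if len(neighbors[n]) > 1}
    isolated = {n for n in nodes if not neighbors[n] and n not in self_loops}
    return multi, self_loops, isolated

multi, self_loops, isolated = classify(nodes, edges)
```

Here "alice" is connected to more than one other node, "carol" is also connected to herself, and "dave" has no connecting edges.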

Edges

The connections between the nodes are called edges. In other words, edges store information about the relationships between the nodes. They can have properties just like other documents, but they uniquely describe the relationship between nodes.

Graph databases offer specialized algorithms to analyze the relationships among data. The simplest algorithm is a graph traversal (also known as graph search): the process of visiting nodes in a graph, checking or updating each one, beginning at a defined start node and following edges until a defined depth is reached.

Why Graph?

Graphs are a good data model for representing relationships in data. In many real-world cases, a graph is a natural data model. It captures relations and, using JSON, can store complex data on edges and nodes.

A graph database excels at navigational queries. A crucial requirement for a graph database is that its query language implements traversal algorithms such as breadth-first or depth-first traversal, shortest path(s), k paths, and more. The fundamental capability behind these algorithms is rapid access to the list of all outgoing or incoming edges of a node.

Breadth-first algorithms explore each node at the present depth before moving on to other nodes. If it’s likely that you are looking for a node close to your starting node, a breadth-first search is likely to work best. For example, when building social capabilities into an app, looking for friends-of-friends (just two levels deep) calls for a breadth-first search.
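The friends-of-friends case can be sketched as a depth-limited breadth-first search. This is an illustrative Python sketch, not ArangoDB code; the friendship graph and the function names are hypothetical.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
friends = {
    "ann": ["ben", "cat"],
    "ben": ["ann", "dan"],
    "cat": ["ann", "dan", "eve"],
    "dan": ["ben", "cat", "fay"],
    "eve": ["cat"],
    "fay": ["dan"],
}

def bfs_depths(graph, start, max_depth):
    """Breadth-first search: finish each depth level before going deeper."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth_of[person] == max_depth:
            continue  # stop expanding at the depth limit
        for friend in graph[person]:
            if friend not in depth_of:
                depth_of[friend] = depth_of[person] + 1
                queue.append(friend)
    return depth_of

def friends_of_friends(graph, person):
    """People exactly two hops away, i.e. not already direct friends."""
    depths = bfs_depths(graph, person, max_depth=2)
    return sorted(name for name, depth in depths.items() if depth == 2)

suggestions = friends_of_friends(friends, "ann")
```

Because the queue processes all direct friends before any friend-of-friend, the search never needs to look further than two levels from the start node.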

Depth-first algorithms explore each path as far as possible before backtracking and exploring another. If a graph has many edges from each node, a breadth-first search might consume too much memory, necessitating a depth-first search. Depth-first search might also be better when exploring deep within a graph, such as when following chains of transactions to detect money laundering cycles.
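As a minimal sketch of the cycle example, the following Python function uses a depth-first search to look for a chain of transfers that leads back to the starting account. The account names and the transfer graph are hypothetical, and this is not how a graph database implements traversals internally.

```python
# Hypothetical directed graph of money transfers: account -> recipients.
transfers = {
    "acct1": ["acct2"],
    "acct2": ["acct3", "acct5"],
    "acct3": ["acct4"],
    "acct4": ["acct1"],
    "acct5": [],
}

def find_cycle(graph, start):
    """Depth-first search for a path that leads back to start.

    Returns the cycle as a list of nodes, or None if there is none.
    """
    def dfs(node, path, visited):
        for neighbor in graph.get(node, []):
            if neighbor == start:
                return path + [start]  # closed the loop
            if neighbor not in visited:
                visited.add(neighbor)
                found = dfs(neighbor, path + [neighbor], visited)
                if found:
                    return found
        return None  # dead end: backtrack

    return dfs(start, [start], {start})

cycle = find_cycle(transfers, "acct1")
```

The search follows one chain of transfers as deep as it goes and only backtracks at a dead end, so besides the set of visited nodes it keeps just the current path rather than a whole depth level.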

Shortest path algorithms find the shortest path from one node to another. Finding the shortest route is a common task in the real world; for instance, determining which train route to take. If we go from Paris to Berlin, we can take many different trains, but their routes have different travel times, or weights in graph terms. It might be cheaper to take a route with more stops (hops, in graph speak), and that route may even cover a shorter distance overall but take longer. The time it takes to travel from one stop to the next is the weight of the edge between those two stations, and the weight of a path is the sum of its edge weights. You can refine your search by saying you only want paths below a certain weight (total travel time). Or, you can consider the route’s cost and filter your results further. Refining your search this way is straightforward with graph traversals and is an example of where a graph database shines.
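A weighted shortest path search can be sketched with Dijkstra's algorithm, one common choice for graphs with non-negative weights. The rail network and travel times below are invented for illustration and do not reflect real timetables.

```python
import heapq

# Hypothetical rail network: station -> [(next station, travel time in hours)].
rail = {
    "Paris": [("Brussels", 1.5), ("Frankfurt", 4.5)],
    "Brussels": [("Cologne", 2.0)],
    "Cologne": [("Berlin", 4.5)],
    "Frankfurt": [("Berlin", 4.5)],
    "Berlin": [],
}

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: return (total weight, path) or (inf, [])."""
    queue = [(0.0, start, [start])]
    best = {start: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return float("inf"), []

hours, route = shortest_path(rail, "Paris", "Berlin")
```

In this made-up network the route with more hops (via Brussels and Cologne) has the lower total weight. A limit on total travel time can then be applied by comparing the returned weight against a threshold.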

K path algorithms generalize shortest path algorithms by finding additional paths through a graph. These alternate paths may be the same length as or longer than the shortest path. K path algorithms can be useful in supply chain problems when looking for alternative (if slightly more expensive) routes to ship goods.
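The idea of alternative routes can be sketched by enumerating every simple path and keeping the k cheapest. This brute-force approach is only suitable for small graphs and is not the algorithm a database would use at scale; the shipping network and costs are hypothetical.

```python
# Hypothetical shipping network: site -> {next site: shipping cost}.
routes = {
    "factory": {"port": 4, "warehouse": 2},
    "warehouse": {"port": 1, "store": 7},
    "port": {"store": 3},
    "store": {},
}

def all_simple_paths(graph, node, goal, path=None, cost=0):
    """Yield (cost, path) for every path from node to goal without repeated nodes."""
    path = (path or []) + [node]
    if node == goal:
        yield cost, path
        return
    for neighbor, weight in graph[node].items():
        if neighbor not in path:
            yield from all_simple_paths(graph, neighbor, goal, path, cost + weight)

def k_cheapest_paths(graph, start, goal, k):
    """Return the k lowest-cost simple paths, cheapest first."""
    return sorted(all_simple_paths(graph, start, goal))[:k]

alternatives = k_cheapest_paths(routes, "factory", "store", 2)
```

The first result is the shortest path; the second is the next-best alternative, which could serve as a fallback if a site on the cheapest route becomes unavailable.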

ArangoDB as a Graph Database

In ArangoDB, each edge has a single direction; it can’t point both ways simultaneously. This model is also known as a directed graph.

Edges are always directed, but users can ignore the direction (follow in ANY direction) when they walk through the graph or follow edges in the reverse direction (INBOUND) instead of going in the direction they point to (OUTBOUND).
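The effect of the three direction keywords can be illustrated with a short Python sketch. It mimics the semantics described above on a hypothetical list of directed edges; it is not AQL and does not use an ArangoDB API.

```python
# Hypothetical directed edges stored as (from, to) pairs.
edges = [("a", "b"), ("b", "c"), ("d", "b")]

def neighbors(edges, node, direction):
    """Return the neighbors of node, following edges OUTBOUND, INBOUND, or ANY."""
    if direction not in ("OUTBOUND", "INBOUND", "ANY"):
        raise ValueError(f"unknown direction: {direction}")
    found = set()
    for source, target in edges:
        if direction in ("OUTBOUND", "ANY") and source == node:
            found.add(target)  # follow the edge the way it points
        if direction in ("INBOUND", "ANY") and target == node:
            found.add(source)  # follow the edge in reverse
    return sorted(found)

outbound = neighbors(edges, "b", "OUTBOUND")
inbound = neighbors(edges, "b", "INBOUND")
either = neighbors(edges, "b", "ANY")
```

The stored edges never change; only the way the traversal reads them does.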

Graph Store

In ArangoDB, data models can be implemented by storing a JSON document for each node and a JSON document for each edge. Edges are kept in special edge collections, which ensure that every edge has _from and _to attributes. These attributes reference the starting and ending nodes of the edge and thereby also define the direction of the relationship. ArangoDB indexes the _from and _to attributes with a special hash index (the edge index), which allows constant-time lookup of a node’s outgoing and incoming edges and lets ArangoDB process graph queries very efficiently.
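To show why hashing on _from and _to makes edge lookups fast, here is a toy Python stand-in for an edge index. The EdgeIndex class, the edge documents, and the "type" property are hypothetical; the real index is implemented inside the database and will differ in detail.

```python
from collections import defaultdict

# Hypothetical edge documents; _from and _to hold the IDs of the connected nodes.
edge_docs = [
    {"_from": "users/ann", "_to": "users/ben", "type": "friend"},
    {"_from": "users/ann", "_to": "posts/1", "type": "liked"},
    {"_from": "users/ben", "_to": "posts/1", "type": "liked"},
]

class EdgeIndex:
    """Toy stand-in for an edge index: two hash maps keyed on _from and _to."""

    def __init__(self, edges):
        self.by_from = defaultdict(list)
        self.by_to = defaultdict(list)
        for edge in edges:
            self.by_from[edge["_from"]].append(edge)
            self.by_to[edge["_to"]].append(edge)

    def outgoing(self, node_id):
        """Edges that start at node_id, found with one hash lookup."""
        return self.by_from.get(node_id, [])

    def incoming(self, node_id):
        """Edges that end at node_id, found with one hash lookup."""
        return self.by_to.get(node_id, [])

index = EdgeIndex(edge_docs)
ann_edges = index.outgoing("users/ann")
post_likes = index.incoming("posts/1")
```

Each traversal step becomes a dictionary lookup whose cost does not depend on the total number of edges, which is the property a traversal relies on.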

Arango Query Language (AQL)

AQL is the query language used in ArangoDB that allows users to express document queries, key/value lookups, graph queries, full-text search, and arbitrary combinations of these. The example below demonstrates combining a full-text search for sci-fi movies with a graph traversal to retrieve metadata:

// Full-text search: rank movie descriptions, boosting matches on "galaxy"
FOR d IN v_imdb
  SEARCH ANALYZER(
    d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
    BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5),
    'text_en'
  )
  SORT BM25(d) DESC
  LIMIT 10
  // Graph traversal: follow one inbound edge to find each movie's director
  FOR vertex, edge, path IN 1..1 INBOUND d imdb_edges
    FILTER path.edges[0].$label == "DIRECTED"
    RETURN DISTINCT {
      "director": vertex.name,
      "movie": d.title
    }

Identifying Graph Use Cases

Graph databases tend to be most appropriate for highly connected data. For instance, if you find yourself using a relational database, and your queries have many JOINs, that’s an indicator to consider a graph database. Another indicator is when your relational queries need to follow JOINs on multiple levels. Yet another is when you’re trying to uncover hidden patterns in your data.

In other words, graph databases are most useful when the connections between data are just as interesting, if not more so, than the data itself. This contrasts with relational data, where what’s in each row — such as a customer ID or a transaction amount — is what’s most interesting.

Use Cases for Graph Databases

Specific use cases include:

Identity and Access Management

When determining who can see which information in an organization, a manager often has permission to view data about their team. For instance, each sales manager can see the travel expenditures of their team, each sales director can see the expenditures of their managers and those managers’ teams, and so on. Accountants, however, can cut across these hierarchies, for example to audit a set of sales teams by viewing their travel purchases. This web of permissions is best represented as a graph, and getting it right is crucial to giving access to all appropriate employees and no one else.
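A permission check of this kind can be sketched as a reachability question on a graph. The Python example below uses a hypothetical permission graph in which an edge means "may view the expenses of"; the role names are invented for illustration.

```python
# Hypothetical permission graph: viewer -> people they may view directly.
can_view = {
    "director": ["manager_a", "manager_b"],
    "manager_a": ["rep_1", "rep_2"],
    "manager_b": ["rep_3"],
    "accountant": ["rep_2", "rep_3"],  # audit access that cuts across teams
}

def visible_to(graph, viewer):
    """Everyone whose expenses the viewer may see, following edges transitively."""
    seen = set()
    stack = list(graph.get(viewer, []))
    while stack:
        person = stack.pop()
        if person not in seen:
            seen.add(person)
            stack.extend(graph.get(person, []))
    return seen

director_view = visible_to(can_view, "director")
accountant_view = visible_to(can_view, "accountant")
```

The director inherits visibility through the managers, while the accountant's edges grant access to specific people in different teams without granting access to the managers themselves.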

Fraud Detection

Detecting fraud involves complex pattern matching that also considers the graph structure of connections (e.g., an unusual number of connections between different entities, accounts, IP addresses, etc.), as well as statistical analysis, associative queries, and joins. In many cases, this can sensibly be modeled by a graph structure that assembles and integrates a huge amount of data.

Knowledge Graph

Enterprise Knowledge Graphs (EKGs) have been on the rise and are valuable tools for harmonizing internal and external data relevant to an organization. EKGs bring data into a common semantic model to improve enterprise operational efficiency and increase business units’ competitive advantage.

Research

Research teams use graphs to uncover and catalog valuable insights across projects. Citation networks describe the contributions within individual papers but can also connect to large EKGs as described above. For example, one research organization has nodes for each portion of the human genome, each published medical research paper, and each author, with edges describing which authors wrote which papers on which genome segment. This makes it easier for researchers to collaborate with others working on similar genomic topics.

Recommendation Engine

There are many different approaches and techniques for generating recommendations, and most work well with graph databases. To enrich graph traversals, one approach stores inferences derived from machine learning activities as document attributes, usually along graph edges. It is also possible to eliminate the need for complex ML pipelines and instead use built-in AQL functions and graph algorithms to offer on-the-fly recommendations using just the data stored in the database.
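An on-the-fly recommendation can be as simple as a short traversal: from a user to the items they liked, to other users who liked the same items, to those users' other items. The Python sketch below illustrates that idea with hypothetical users and items; it is one possible scoring scheme, not a built-in AQL function.

```python
from collections import Counter

# Hypothetical "liked" edges: user -> set of items.
likes = {
    "ann": {"m1", "m2"},
    "ben": {"m1", "m2", "m3"},
    "cat": {"m2", "m4"},
    "dan": {"m5"},
}

def recommend(likes, user, limit=3):
    """Score unseen items by how many liked items their fans share with user."""
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # items both users liked
        if overlap == 0:
            continue  # no shared taste, ignore this user
        for item in theirs - mine:
            scores[item] += overlap
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked][:limit]

picks = recommend(likes, "ann")
```

No trained model is involved here; the ranking comes entirely from the connections already stored in the graph.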

Network and IT Operations

Computer networks, the associated hosts and their components, and virtualizations of software-defined infrastructure form a graph. Managing such an infrastructure involves queries about the graph structure as well as queries about sets of hosts and similar entities.

Social Media Management

Social networks are the prime example of large, highly connected graphs. They typically involve graph algorithms and graph traversal queries.

Traffic Management

Street networks are naturally modeled as a graph. Traffic flow data produces a high volume of time-based data that is closely related to the street network. Making good decisions about traffic management involves querying all this data and running intelligent algorithms using aggregations, graph traversals, and joins.

Graph Done Right

ArangoDB is more than a graph database: it adds document and key/value store capabilities, full-text search, integrations for machine learning (ML), and more.

Features Beyond Graph

ArangoDB provides access to a rich set of features that includes:

ArangoGraph Insights Platform

ArangoGraph Insights Platform (ArangoGraph) is a cloud-based, next-generation graph data and analytics platform that natively integrates graph, JSON, search, and machine learning. It is a fully-managed service that allows users to take advantage of the complete functionality of an ArangoDB cluster deployment without running or managing the system in-house.

ArangoGraph runs in the data centers of your preferred cloud provider: Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure. This ensures that your databases are always available, up-to-date, and encrypted.

ArangoSearch

ArangoSearch is a search and similarity ranking engine integrated natively into ArangoDB and AQL. It supports relevance-based searching, phrase and prefix matching, complex Boolean searches, fuzzy search capabilities, and query-time relevance tuning. You can combine ArangoSearch with all supported data models in a single query. Many specialized language analyzers are available out of the box (e.g., English, German, French, Chinese, and Spanish).

Pregel

Pregel is a system for large-scale graph processing. This system can perform distributed graph processing without needing distributed global locking. Distributed graph processing enables users to conduct online analytical processing directly on graphs stored in ArangoDB. ArangoDB implements Pregel to discover hidden patterns, identify communities, and perform in-depth analytics of large graph data sets.
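The vertex-centric style of computation associated with Pregel can be sketched in a few lines. The Python example below finds connected components by passing labels between neighbors in rounds ("supersteps"). It runs in a single process on a hypothetical graph and is only a conceptual illustration, not ArangoDB's Pregel API.

```python
# Hypothetical undirected graph as an adjacency list.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["e"],
    "e": ["d"],
}

def connected_components(graph):
    """Label propagation in supersteps: each vertex adopts the smallest label it receives."""
    label = {vertex: vertex for vertex in graph}
    # Initial step: every vertex sends its own label to its neighbors.
    inbox = {vertex: [] for vertex in graph}
    for vertex in graph:
        for neighbor in graph[vertex]:
            inbox[neighbor].append(label[vertex])
    # Keep running supersteps until no vertex has messages left to process.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if messages and min(messages) < label[vertex]:
                label[vertex] = min(messages)
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(label[vertex])
        inbox = outbox
    return label

components = connected_components(graph)
```

Within a superstep, each vertex reads only its own inbox and writes only its own label, so vertices can be processed independently. That independence is what lets this style of computation be distributed without global locking.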

Kube-Arangodb

The ArangoDB Kubernetes Operator (kube-arangodb) is a set of operators deployed in a Kubernetes cluster to:

  • Manage ArangoDB database deployments
  • Provide PersistentVolumes on local storage nodes for optimal storage performance
  • Configure ArangoDB datacenter to datacenter replication

ArangoGraphML

ArangoGraph Insights Platform offers support for both analytics tasks and graph-powered machine learning. ArangoGraphML is backed by the graph capabilities of ArangoDB.

These graph capabilities are especially useful in a machine-learning platform for feature engineering. They enable users to combine different data aspects into features that can be used by machine learning frameworks such as TensorFlow or PyTorch to train models. ArangoGraphML offers a simple interface for accessing machine learning frameworks and tools. In a production-grade machine learning infrastructure, ArangoGraphML provides support for common metadata storage across the entire machine learning lifecycle, enabling reproducibility, monitoring, and auditing for machine learning models.

Graphs at Scale

The graph capabilities of ArangoDB are similar to a property graph database, but they provide more flexibility in data modeling because nodes and edges are both full JSON documents.

As an application grows, so does the size of its graph. To make sure graph traversals stay as performant as possible, even when sharded across multiple servers in a cluster, ArangoDB provides EnterpriseGraph and SmartGraphs. For larger datasets, EnterpriseGraph and SmartGraphs reduce the number of network hops needed by intelligently sharding data.
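Why the sharding strategy matters can be shown with a toy comparison. The Python sketch below counts how many edges cross shard boundaries under two placement rules: one that ignores the graph structure and one that keeps related vertices together. The vertices, the "region" attribute, and both placement functions are hypothetical simplifications and do not describe how EnterpriseGraph or SmartGraphs are implemented.

```python
# Hypothetical vertices with a "region" attribute, plus edges between them.
region_of = {"u1": "eu", "u2": "eu", "u3": "eu", "u4": "us", "u5": "us", "u6": "us"}
edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
         ("u4", "u5"), ("u5", "u6"), ("u3", "u4")]

def shard_by_key(vertex):
    """Stand-in for hashing the key: placement ignores the graph structure."""
    return int(vertex[1:]) % 2

def shard_by_region(vertex):
    """Placement by a shared attribute: related vertices land on the same shard."""
    return {"eu": 0, "us": 1}[region_of[vertex]]

def cross_shard_edges(edges, shard_of):
    """Count edges whose endpoints live on different shards (one network hop each)."""
    return sum(1 for source, target in edges if shard_of(source) != shard_of(target))

hops_by_key = cross_shard_edges(edges, shard_by_key)
hops_by_region = cross_shard_edges(edges, shard_by_region)
```

In this toy data, placement by key splits five of the six edges across shards, while placement by region splits only the single edge that connects the two regions.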