AQL on ArangoDB and Cypher on Neo4j

Technical Comparison – Multi-Model Data Handling And Performance with Distributed Datasets

Introduction

In the exploration of query languages designed for graph databases, ArangoDB Query Language (AQL) and Cypher, the query language for Neo4j, emerge as pivotal tools tailored for distinct operational paradigms. This technical paper delves into a comparative analysis of AQL and Cypher, highlighting scenarios where AQL demonstrates superior capabilities. Two domains where AQL distinctly outperforms Cypher are highlighted:

Multi-Model Data Handling

AQL’s adeptness in managing multiple data models, ranging from graphs and documents to key-value pairs, within a single query framework is unparalleled. This feature becomes indispensable in applications where graph data must be integrated with other data types, enhancing the fluidity and comprehensiveness of data manipulation. A scenario showcases the joining of graph-based relational data with document-based user profiles, epitomizing AQL’s versatility in handling complex join operations with great flexibility.

Performance on Large, Sharded Datasets

The architectural design of ArangoDB, with its innate support for efficient sharding and data distribution, confers upon AQL a significant advantage in the execution of queries over large, distributed datasets. This capability is critical in large-scale environments where data is voluminous and dispersed, necessitating a query language that minimizes performance degradation while ensuring data integrity and speed. Through an example, we demonstrate AQL’s proficiency in managing complex queries within a distributed database framework, affirming its superiority in scalability and performance optimization.

Let’s dive deeper by first considering the differences between ArangoDB’s AQL and Neo4j’s Cypher in the area of Multi-Model Data Handling.


Multi-Model Data Handling

Combining graph-based relationship data with document-based user profiles in a single query is a scenario where AQL on ArangoDB’s multi-model capabilities can shine.

Scenario #1: Query Flexibility & Control

Let’s illustrate this with an example where the objective is to (i) find users who have completed transactions with one another and also (ii) display some of those users’ profile information from the document store.

Assume we have a database with two collections:

  • Users: A document collection containing user profiles.
  • Transactions: An edge collection representing transactions between users.

We want to find all users who have transacted with a specific user and display their names and email addresses along with the transaction details.

In the following AQL Example, note the greater flexibility for fine-tuning queries:

FOR user IN Users
  FILTER user._key == "specific-user-key"
  FOR transaction IN Transactions
    FILTER transaction._from == user._id
    FOR targetUser IN Users
      FILTER targetUser._id == transaction._to
      RETURN {
        fromUser: user.name,
        toUser: targetUser.name,
        toUserEmail: targetUser.email,
        transactionDetails: transaction.details
      }

In this AQL query:

  • We start by selecting a specific user from the Users collection.
  • Then, we find all transactions where this user is the sender.
  • Next, for each of these transactions, we find the receiving user in the Users collection.
  • Finally, we return the names and emails of both the sender and receiver, along with the transaction details.
  • Each FOR and FILTER statement in AQL can be finely tuned. This allows for more complex data processing logic.
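Since the AQL query is essentially an explicit nested-loop join over two collections, the same logic can be sketched in plain Python, with in-memory lists standing in for the Users and Transactions collections (the sample documents and the key "u1" are illustrative, not part of the original scenario):

```python
# Plain-Python sketch of the nested FOR/FILTER join above.
# The documents and the key "u1" are illustrative sample data.

users = [
    {"_key": "u1", "_id": "Users/u1", "name": "Alice", "email": "alice@example.com"},
    {"_key": "u2", "_id": "Users/u2", "name": "Bob", "email": "bob@example.com"},
]
transactions = [
    {"_from": "Users/u1", "_to": "Users/u2", "details": "invoice #1"},
]

def transacted_with(users, transactions, specific_key):
    results = []
    for user in users:                       # FOR user IN Users
        if user["_key"] != specific_key:     # FILTER user._key == "..."
            continue
        for tx in transactions:              # FOR transaction IN Transactions
            if tx["_from"] != user["_id"]:   # FILTER transaction._from == user._id
                continue
            for target in users:             # FOR targetUser IN Users
                if target["_id"] != tx["_to"]:
                    continue
                results.append({             # RETURN { ... }
                    "fromUser": user["name"],
                    "toUser": target["name"],
                    "toUserEmail": target["email"],
                    "transactionDetails": tx["details"],
                })
    return results
```

Each if/continue pair plays the role of a FILTER; adding another condition at any loop depth is a one-line change, which is exactly the kind of flexibility the AQL version exposes.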

Next, let’s look at the equivalent Cypher example, noting the reduced flexibility for fine-tuning queries:

MATCH (user:User)-[transaction:TRANSACTION]->(targetUser:User)
WHERE user.id = 'specific-user-id'
RETURN user.name AS fromUser,
       targetUser.name AS toUser,
       targetUser.email AS toUserEmail,
       transaction.details AS transactionDetails

In the Cypher query:

  • A MATCH clause is used to find a pattern where a user is connected to another user through a transaction.
  • We filter for transactions starting from a specific user.
  • We return the names and emails of both users involved in the transaction, along with transaction details.
  • The WHERE clause in Cypher provides filtering capabilities, but it does not offer the same level of flexibility as multiple FILTER statements in AQL.

So why is AQL better in this case?

First, greater flexibility in data retrieval: AQL’s approach is more flexible when dealing with different data types (documents and edges) within the same query. It allows for more complex traversals and conditions.

Each FOR and FILTER statement in AQL can be finely tuned, allowing additional filtering or computation at various stages of the query. The alternative in Cypher would be stitching together multiple, highly complex queries to achieve the same result.
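To see why staged filtering matters, here is a small plain-Python sketch: a cheap predicate applied at an early stage (as an AQL FILTER placed between FOR loops allows) shrinks the input of the later, more expensive join stage. The amounts, tiers, and 100-unit threshold are all illustrative assumptions:

```python
# Sketch of staged filtering: a cheap filter applied early (like an AQL
# FILTER placed between FOR loops) shrinks the input of the later join.
# Amounts, tiers, and the 100-unit threshold are illustrative.

transactions = [
    {"from": "u1", "to": "u2", "amount": 500},
    {"from": "u1", "to": "u3", "amount": 5},
    {"from": "u2", "to": "u3", "amount": 900},
]
profiles = {"u2": {"tier": "gold"}, "u3": {"tier": "basic"}}

# Stage 1: filter before joining, so only 2 of the 3 rows reach the join.
large = [t for t in transactions if t["amount"] >= 100]

# Stage 2: join only the surviving rows against the profile documents.
enriched = [{**t, "recipientTier": profiles[t["to"]]["tier"]} for t in large]
```

The earlier the filter runs, the fewer rows the join has to touch; placing FILTER statements between FOR loops is the AQL way of expressing that choice explicitly.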

Next, better support for complex relationships: AQL handles complex relationships and conditions more natively, especially in scenarios where different data models (like documents and graphs) must be combined in a single query, which is precisely where ArangoDB’s multi-model capabilities come into play.

Finally, AQL provides more control over how each step of the query is executed, which can be beneficial for optimizing queries based on specific data structures and requirements. The FOR loops in AQL explicitly define how each collection is accessed and joined. This provides fine-grained control over the query execution, allowing for optimizations based on the specific data model and structure.

On the other hand, the MATCH statement in Cypher is more declarative and abstracts away the join logic. While this can appear to be simpler, it offers less control over how the database executes the joins, which can significantly impact performance in complex or large datasets, especially across distinct data types such as document, graph, and search.

Developers who are new to graph databases but experienced with other query languages will therefore often find AQL easier to learn and work with, as they will recognize the fine-grained filtering and control features they are accustomed to.

Scenario #2: Readability and Maintainability in a Fraud Example

Imagine a dataset where transactions are recorded along with accounts and their owners. We’re interested in identifying accounts involved in a pattern of behavior indicative of fraud: specifically, accounts that have made transactions to multiple other accounts in a short period, where those recipient accounts have then quickly transferred money to accounts outside of the country.

Assume we have a database with the following:

  • Collections: Accounts, Transactions
  • Graph: Edges in the Transactions collection connecting Accounts

In the following AQL Example, note how AQL’s step-by-step approach can dissect complex queries into manageable parts, enhancing clarity and maintainability:

FOR account IN Accounts
  LET outboundTransactions = (
    FOR transaction IN Transactions
    FILTER transaction._from == account._id
      AND transaction.date > DATE_SUBTRACT(DATE_NOW(), 7, 'days')
    COLLECT recipient = transaction._to WITH COUNT INTO numTransfers
    FILTER numTransfers > 1
    RETURN recipient
  )

  LET internationalTransfers = (
    FOR recipientId IN outboundTransactions
    FOR transaction IN Transactions
    FILTER transaction._from == recipientId
      AND transaction.isInternational == true
      AND transaction.date > DATE_SUBTRACT(DATE_NOW(), 7, 'days')
    COLLECT accountId = transaction._from WITH COUNT INTO numInternational
    FILTER numInternational > 0
    RETURN accountId
  )

  FILTER LENGTH(internationalTransfers) > 0
  RETURN account

Let’s look at each step of this query in detail:

  • Step 1: Iterate over each account in the Accounts collection.
  • Step 2: For each account, collect the recipients that received more than one transaction from this account in the last 7 days.
  • Step 3: For each recipient identified in step 2, find if there are subsequent international transactions in the last 7 days.
  • Step 4: Filter accounts that have at least one recipient meeting criteria in step 3 and return these accounts.
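The four steps above can be sketched in plain Python, with the two LET subqueries becoming two list-processing stages. The accounts, transactions, and dates below are illustrative sample data, not part of the original scenario:

```python
from collections import Counter
from datetime import date, timedelta

# Plain-Python sketch of the two LET subqueries above; the accounts,
# transactions, and dates are illustrative sample data.
transactions = [
    {"from": "A", "to": "B", "date": date(2024, 1, 8), "international": False},
    {"from": "A", "to": "B", "date": date(2024, 1, 9), "international": False},
    {"from": "B", "to": "X", "date": date(2024, 1, 9), "international": True},
]

def suspicious_accounts(accounts, transactions, today, window_days=7):
    cutoff = today - timedelta(days=window_days)
    flagged = []
    for account in accounts:
        # Subquery 1: recipients that received more than one recent transfer.
        recent = [t for t in transactions
                  if t["from"] == account and t["date"] > cutoff]
        counts = Counter(t["to"] for t in recent)
        recipients = {r for r, n in counts.items() if n > 1}
        # Subquery 2: did any such recipient then transfer internationally?
        if any(t["from"] in recipients and t["international"] and t["date"] > cutoff
               for t in transactions):
            flagged.append(account)
    return flagged
```

Because each stage has a name (recent, counts, recipients), the fraud logic can be read, tested, and extended stage by stage, mirroring the maintainability argument made for the AQL version.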

Next, let’s look at the equivalent Cypher example, noting how its compactness can obscure the logic of a multi-step query.

Assume we have a Neo4j database with the following:

  • Nodes: Account
  • Relationships: MADE_TRANSACTION connecting Account nodes, with properties on the relationships (e.g., date, isInternational)

MATCH (a:Account)-[t:MADE_TRANSACTION]->(b:Account)
WHERE t.date > date() - duration({days: 7}) AND NOT t.isInternational
WITH a, b, COUNT(*) AS transfers
WHERE transfers > 1
MATCH (b)-[t2:MADE_TRANSACTION]->(c:Account)
WHERE t2.date > date() - duration({days: 7}) AND t2.isInternational
RETURN DISTINCT a AS SuspiciousAccount

Let’s look at each step of this Cypher query in detail:

  • Step 1: Find transactions made by an account to another within the last 7 days that are not international.
  • Step 2: Count these transactions per recipient and keep the pairs where an account made more than one transfer to the same recipient.
  • Step 3: From those recipient accounts, find if there have been subsequent international transactions in the last 7 days.
  • Step 4: Return the originating accounts of transactions that meet the fraud pattern criteria.

So why is AQL better in this case?

First, AQL’s verbose approach breaks the query down into smaller, understandable parts. This is particularly useful in dissecting the complex logic required for fraud detection, and it offers clarity for the different stakeholders, within or across development teams, who are responsible for fraud detection and remediation. The result is a query that is easier to maintain and understand.

While the Cypher syntax in this example appears concise, in complex scenarios like fraud detection involving multiple steps and conditions it can obscure the logic, making the query far harder to debug and extend.


Performance on Large, Sharded Datasets

Executing complex queries on a distributed database with minimal performance degradation is an area where the design and optimization capabilities of a query language become crucial, and it is one where AQL holds a clear advantage over Cypher. AQL is designed to work efficiently with ArangoDB’s distributed architecture, which can handle sharded data across multiple nodes.

Scenario: Analyzing Transaction Patterns across a Distributed Database

Consider a scenario where we need to analyze transaction patterns across a distributed database. The goal is to find users who have participated in a high volume of transactions in a short period, which could be indicative of suspicious activity.

FOR transaction IN Transactions
  FILTER DATE_DIFF(transaction.date, DATE_NOW(), 'day') < 30
  COLLECT userId = transaction.userId WITH COUNT INTO transactionCount
  FILTER transactionCount > 100
  LET userInfo = DOCUMENT("Users", userId)
  RETURN {
    user: userInfo.name,
    email: userInfo.email,
    numberOfTransactions: transactionCount
  }

In this AQL query:

  • We iterate over Transactions, which could be distributed across multiple shards.
  • A FILTER first restricts the scan to transactions from the last 30 days; this must happen before the COLLECT, since the per-transaction variable goes out of scope once the results are aggregated.
  • We use COLLECT to aggregate the remaining transactions by user, counting them.
  • A second FILTER keeps users with more than 100 transactions.
  • LET is used to retrieve each matching user’s document from the Users collection.
  • The RETURN clause outputs the user’s name, email, and transaction count.

The equivalent Cypher query:

MATCH (user:User)-[transaction:TRANSACTION]->()
WHERE transaction.date > date() - duration({days: 30})
WITH user, COUNT(transaction) AS transactionCount
WHERE transactionCount > 100
RETURN user.name AS Name, user.email AS Email, transactionCount

In the Cypher query:

  • We use a MATCH clause to find users and their transactions.
  • WHERE filters the matched transactions to the past 30 days; as in AQL, this must precede the aggregation, because transaction is no longer in scope after WITH.
  • WITH is used to count the remaining transactions per user.
  • A second WHERE keeps users with more than 100 transactions.
  • The RETURN clause outputs the user’s name, email, and transaction count.
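To make the aggregate-then-filter-then-fetch shape of the AQL pipeline concrete, here is a plain-Python sketch using collections.Counter in place of COLLECT ... WITH COUNT. The sample users and transaction volumes are illustrative:

```python
from collections import Counter

# Sketch of the COLLECT ... WITH COUNT / FILTER / DOCUMENT() pipeline.
# Sample users and transaction volumes are illustrative.
transactions = (
    [{"userId": "u1"} for _ in range(120)]
    + [{"userId": "u2"} for _ in range(5)]
)
users = {"u1": {"name": "Alice", "email": "alice@example.com"},
         "u2": {"name": "Bob", "email": "bob@example.com"}}

# COLLECT userId = transaction.userId WITH COUNT INTO transactionCount
counts = Counter(t["userId"] for t in transactions)

# FILTER transactionCount > 100, then fetch the user document (DOCUMENT()).
report = [{"user": users[uid]["name"],
           "email": users[uid]["email"],
           "numberOfTransactions": n}
          for uid, n in counts.items() if n > 100]
```

Note that the user documents are only fetched for the handful of users that survive the count threshold, which is the property the next section builds on.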

So why is AQL better in this case?

Optimized for Sharded Collections: AQL is designed to work efficiently with ArangoDB’s distributed nature, which supports sharded collections. The FOR loop and COLLECT operation in AQL can be highly optimized for accessing and aggregating data across shards, which is crucial for performance in distributed environments.

Aggregation and Filtering Efficiency: The AQL snippet performs aggregation (COLLECT with COUNT) and then filters based on the aggregated count. This two-step process is inherently optimized in AQL for distributed databases, ensuring that aggregation can scale with the data.
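As a simplified illustration of why this two-step process scales, consider a map-reduce style sketch in which each shard counts locally and only the small partial aggregates travel to a coordinator for merging. This is a conceptual model of per-shard aggregation, not ArangoDB's literal execution plan:

```python
from collections import Counter

# Conceptual map-reduce sketch: each shard counts its own rows, and only
# the small partial Counters cross the network to be merged. This models
# why per-shard aggregation scales; it is not ArangoDB's literal plan.
shard_1 = [{"userId": "u1"}, {"userId": "u2"}, {"userId": "u1"}]
shard_2 = [{"userId": "u1"}, {"userId": "u2"}]

def local_count(shard):
    # Runs where the shard lives: aggregate locally.
    return Counter(t["userId"] for t in shard)

def merge(partials):
    # Runs on the coordinator: combine the small partial aggregates.
    total = Counter()
    for p in partials:
        total += p
    return total

totals = merge([local_count(shard_1), local_count(shard_2)])
```

The network cost is proportional to the number of distinct users per shard rather than the number of raw transactions, which is what keeps the aggregation cheap as the dataset grows.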

Subsequent Filtering on Aggregated Data: AQL allows for additional filtering on the aggregated data (FILTER transactionCount > 100) before fetching user details (LET userInfo = DOCUMENT("Users", userId)), which can reduce overhead in distributed database environments by minimizing the amount of data transferred over the network.

Flexible Date Filtering: The AQL query explicitly restricts transactions to the last 30 days (FILTER DATE_DIFF(transaction.date, DATE_NOW(), 'day') < 30) before aggregating them, which ensures that only relevant transactions enter the aggregation. This can optimize performance by reducing unnecessary calculations and data retrieval.

Direct Document Fetch: The use of DOCUMENT(Users, userId) in AQL for fetching user information is optimized for document stores like ArangoDB. This can be more efficient in a distributed setup than traversing relationships in graph databases, especially when dealing with large volumes of data.

Reasons why Cypher fails in the area of distributed databases:

The Cypher query performs match and count operations within a single statement, which is concise but does not offer the same level of optimization for sharded or distributed data as AQL. This problem is compounded by the fact that Neo4j, unlike ArangoDB, cannot shard and distribute multiple data structures across a single cluster. For applications that require that capability, the architectural gap settles the comparison before query syntax even comes into play.

Conclusion

AQL on ArangoDB offers an unparalleled ability to manage multiple data models, including graphs, documents, and key-value pairs, within a single query, making it indispensable for applications that require the integration of different data types. This capability enhances the fluidity and comprehensiveness of data manipulation, as demonstrated by its effectiveness in joining graph-based relational data with document-based user profiles. Such versatility underscores AQL’s superior handling of complex join operations.

Furthermore, ArangoDB’s architectural design, emphasizing efficient sharding and data distribution, grants AQL a significant edge in executing queries over large, distributed datasets. This is particularly vital in large-scale environments, where AQL’s design minimizes performance degradation and ensures data integrity and speed, showcasing its scalability and performance optimization in managing complex queries within a distributed database framework.