Introduction
Enterprise Knowledge Graphs (EKGs) are increasingly widely adopted and are valuable tools for harmonizing internal and external data relevant to an organization into a common semantic model, improving operational efficiency for the enterprise and competitive advantage for its business units. However, EKGs can be difficult to develop and sustain, can suffer from scalability issues, and can be difficult for business units to consume.
This white paper describes some of these challenges and how the flexible data representation of a multi-model graph can address them (see Figure 1). Native multi-model databases offer the flexibility to operationalize EKGs for business data consumers. They can also ease the challenges of harmonizing data into EKGs, because the data models commonly used by producers and consumers of data, such as tables, documents, and key-values, are first-class citizens alongside graphs in the EKG.
What is an Enterprise Knowledge Graph?
Knowledge graphs have been instrumental in creating trillions of dollars in wealth for companies like Google, Apple, Facebook, Twitter, Microsoft, LinkedIn, eBay, and Alibaba, which developed their own technology stacks to support knowledge graphs. By contrast, EKGs are developed on open-source and commercial graph database products to harmonize an organization’s content, data, and information assets in terms of industry or enterprise-specific knowledge models.
An EKG is a representation of an organization’s knowledge domain and artifacts that is understood by both humans and machines. It is a collection of references to your organization’s knowledge assets, content, and data that leverages a data model to describe the people, places, and things and how they are related.
Not all graphs are EKGs. Enterprises may have many business knowledge graph (BKG) solutions deployed, and it is important to note that bespoke knowledge graphs built to address a specific business need (for example, next best action, recommendation, or impact analysis) are not EKGs. BKGs are built to support narrow business use cases, whereas EKGs are developed to supply high-quality harmonized data to multiple business units and to address multiple use cases. In the next section, we discuss the challenges and opportunities in leveraging EKGs to support business use cases.
Most EKGs are developed using the W3C semantic web standards. These standards offer a rich, interchangeable knowledge representation capable of capturing the subtle nuances in meaning of the disparate data harmonized into EKGs. In addition, there are standard domain ontologies for industry verticals, for example, the Financial Industry Business Ontology (FIBO) for financial services and the National Information Exchange Model (NIEM) for government.
In the next section, we describe how the power, complexity, and data formats of semantic EKGs make them difficult to operationalize: business users struggle with the complex EKG knowledge representations, their tools use conventional data formats and do not connect to EKGs, and their developers are unfamiliar with the specialized technology.
Similarly, the enterprise struggles to harmonize and fuse disparate, heterogeneous data into EKGs.
EKG Challenges and Opportunities
EKGs contain valuable, high-quality data harmonized from multiple data sources. The advantage to business units is that an EKG eliminates the time and effort of integrating data sources to support high-value business use cases; for example, the curated data could serve as feature vectors for training, validating, and operating machine learning (ML) algorithms.
Current EKG solutions harmonize multiple disparate, heterogeneous source systems in terms of an enterprise conceptual model or ontology. The raw data is usually staged on distributed storage (Hadoop/HDFS, S3), and a middleware cluster is then used to extract, transform, and load (ETL) the data into a graph database cluster. EKGs then support enterprise applications such as enterprise search, and they also need to extract and transform the EKG data into a variety of formats (documents, tables, key-value, and graph) to support business applications.
EKGs often fail to realize their full potential for two reasons. First, enterprises struggle with the complex multi-source data logistics needed to harmonize data into graphs for an EKG. Second, business users struggle with the complex and unfamiliar knowledge graph representations and the lack of tooling needed to consume them. Organizations can expend massive effort harmonizing dozens to hundreds of data sources into an EKG, while solving data governance issues such as data provenance and preservation of entitlements, only to face challenges in the last mile: getting their business units to leverage the high-quality curated EKG data that can drive high-value, data-driven ML use cases.
The essence of the problem is that the “all or nothing” conversion of data to graph causes an impedance mismatch (see Figure 2) between source data representations and EKGs and between EKGs and the way business units would like to consume and process their data with their tools. Multi-model based EKGs reduce data impedances by allowing diversity of representation in the knowledge graph, which allows agile incremental harmonization to graph as well as minimal transformation to data when needed by the consuming business units.
The Challenge of Harmonizing Data to Graph
Enterprises need to harmonize a large number of disparate data sources. In general, the more relevant data sources that are harmonized, the greater the potential value to the enterprise.
However, the cost of harmonizing data to the graph can increase exponentially with the number of data sources. This is why enterprises are eager to find ways to automate data harmonization and to apply agile methodologies to provide data harmonization based on needs.
Complex knowledge representations are needed to represent the nuances of disparate data and normalize to a graph structure. All relevant source data consumed and syndicated by the knowledge graph needs to be transformed to graph structure in a single model graph database. Mapping source data to these complex knowledge graph representations requires time, effort, and knowledge.
The resulting EKGs can stress the performance-at-scale capabilities of graph databases and require huge amounts of resources. The truth is that there will always be more data than single-model graph databases are able to scale to, particularly when you consider the practical scale of data housed in key-value and document stores.
As a native multi-model database, ArangoDB is able to blend key-value, document, join, and graph data models in a way that allows them to scale while simplifying the graph representations needed. Special features such as SmartGraphs and SmartJoins allow efficient execution of graph and join-type queries against distributed data.
For example, cybersecurity information in an enterprise grows at a rate of many trillions of edges per year when represented as a pure graph. The same enterprise cybersecurity graph could be represented with billions of edges when graph, documents, and joins are combined.
Enterprises looking for ways to reduce the effort needed to develop and maintain EKGs often ask questions such as:
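The reduction in edge count comes from keeping attribute data inside documents instead of exploding every field into its own edge. The following is a minimal sketch of that idea in plain Python, not ArangoDB code; the event record, its field names, and the choice of which fields count as relationships are all assumptions made for illustration, and the resulting ratio is not meant to reproduce the trillions-versus-billions figure above.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical security event; every field name here is illustrative only.
event: Dict[str, Any] = {
    "id": "evt-1",
    "src_host": "host-a",
    "dst_host": "host-b",
    "port": 443,
    "protocol": "tcp",
    "bytes": 1520,
    "timestamp": "2020-01-01T00:00:00Z",
}

# Assumption: only these fields point at other entities worth traversing.
RELATIONSHIP_FIELDS = {"src_host", "dst_host"}

Edge = Tuple[str, str, Any]


def pure_graph_edges(evt: Dict[str, Any]) -> List[Edge]:
    """Pure-graph form: every field becomes its own edge from the event node."""
    return [(evt["id"], key, value) for key, value in evt.items() if key != "id"]


def multi_model_edges(evt: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Edge]]:
    """Multi-model form: relationships become edges, the rest stays in one document."""
    document = {k: v for k, v in evt.items() if k not in RELATIONSHIP_FIELDS}
    edges = [(evt["id"], k, evt[k]) for k in sorted(RELATIONSHIP_FIELDS)]
    return document, edges


pure = pure_graph_edges(event)
document, blended = multi_model_edges(event)
# For this one event: 6 edges in the pure-graph form, 2 in the multi-model form.
print(len(pure), len(blended))
```

The more descriptive attributes each record carries, the larger the gap between the two representations becomes, since only fields that reference other entities need to become edges.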
- Can we automatically classify, map, and transform source data to knowledge graph?
- Can we automatically refactor EKGs when the source schemas or conceptual models change?
- Can we search over source, knowledge graph, and curated data?
No practical solutions exist yet for automating data harmonization to a graph. This paper instead challenges two key assumptions underlying EKGs: that the EKG must be a monolithic graph model, and that all data must be converted to a graph to be useful. Relaxing these assumptions by allowing the EKG to contain other data models can reduce EKG deployment and sustainment effort and increase the potential scale of EKGs. It also allows EKG development and sustainment to be more dynamic and agile.
Knowledge graphs that permit other data models allow staging data and graphs to exist in the same database, so graph harmonization can be deferred until it is needed and tackled in an agile and iterative way.
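One way to picture deferred harmonization is a single store that holds raw staged documents next to the graph and promotes one source at a time. The sketch below is an in-memory illustration under stated assumptions: the class `MultiModelStore`, its methods, and the sample sources and field names are hypothetical and do not correspond to any product API.

```python
from typing import Dict, List, Tuple


class MultiModelStore:
    """Toy store holding staged documents and a graph side by side (hypothetical)."""

    def __init__(self) -> None:
        self.staged: Dict[str, List[dict]] = {}       # raw documents per source
        self.vertices: Dict[str, dict] = {}           # harmonized vertex documents
        self.edges: List[Tuple[str, str, str]] = []   # (from, label, to)

    def stage(self, source: str, records: List[dict]) -> None:
        """Land raw records without any graph transformation."""
        self.staged.setdefault(source, []).extend(records)

    def harmonize(self, source: str, key_field: str, link_field: str, label: str) -> int:
        """Promote one staged source to vertices and edges; return edges created.

        Simplification: calling this twice for the same source duplicates edges.
        """
        created = 0
        for record in self.staged.get(source, []):
            key = record[key_field]
            self.vertices[key] = record
            target = record.get(link_field)
            if target is not None:
                # Placeholder vertex until the target's own source is harmonized.
                self.vertices.setdefault(target, {"_stub": True})
                self.edges.append((key, label, target))
                created += 1
        return created


store = MultiModelStore()
store.stage("crm", [{"id": "c1", "name": "Acme", "parent": "c0"},
                    {"id": "c2", "name": "Beta"}])
store.stage("tickets", [{"id": "t1", "customer": "c1"}])

# Only the CRM source is needed for the current use case, so only it is promoted.
created = store.harmonize("crm", key_field="id", link_field="parent", label="subsidiary_of")
```

The ticket data remains queryable as staged documents and can be harmonized in a later iteration if a use case calls for it.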
The Challenge of Making EKGs Easily Consumable
The complex knowledge representations that are needed to represent the nuances of the data and normalize to a graph structure are also an impediment to business users.
Business users struggle with the complex representations and unfamiliar data formats used in knowledge graphs and the lack of tooling needed to consume them. Common EKG questions are:
- Does it work with the tools I am using?
- Will my developers know how to use it?
- How do I find relevant data?
- How do I bound the data I want?
- How do I get the data in the format that I need?
The essence of the challenge is that there is an impedance mismatch between EKGs and the way business units would like to consume and process their data with their tools. In a perfect world everyone would work with graph data, but graphs are the exception, not the rule.
For example, a business might need all of the trades from January 2017 to December 2019 for politically exposed customers and their direct family members, and require this data to be delivered in a JSON document collection with a particular document structure. They do not want to learn or use a graph query language to do this. What they want is a data shopping experience: they visit the EKG store, search the EKG shopping catalog for data using faceted filters, receive recommendations for data sets and for data that complements their own, and then specify how and when they want it delivered.
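The trade request above combines a graph step (finding direct family members), a filter step (the date range), and a document step (shaping the output). The following sketch illustrates that combination in plain Python over in-memory data. All records, field names, and the output document structure are invented for illustration, and the date range is assumed to be inclusive from 1 January 2017 to 31 December 2019.

```python
import json
from datetime import date
from typing import Dict, List, Set, Tuple

# Hypothetical sample data.
customers: Dict[str, dict] = {
    "c1": {"name": "A", "politically_exposed": True},
    "c2": {"name": "B", "politically_exposed": False},
    "c3": {"name": "C", "politically_exposed": False},
}
family_edges: List[Tuple[str, str]] = [("c1", "c2")]  # undirected, direct family only
trades: List[dict] = [
    {"trade_id": "t1", "customer": "c1", "date": "2017-03-01", "amount": 100},
    {"trade_id": "t2", "customer": "c2", "date": "2019-12-31", "amount": 250},
    {"trade_id": "t3", "customer": "c3", "date": "2018-06-15", "amount": 75},
    {"trade_id": "t4", "customer": "c1", "date": "2020-01-02", "amount": 40},
]


def exposed_with_family() -> Set[str]:
    """Graph step: exposed customers plus their one-hop family members."""
    exposed = {cid for cid, c in customers.items() if c["politically_exposed"]}
    relatives = set()
    for a, b in family_edges:
        if a in exposed:
            relatives.add(b)
        if b in exposed:
            relatives.add(a)
    return exposed | relatives


def curated_trades(start: date, end: date) -> List[dict]:
    """Filter trades by customer set and date, then shape them as documents."""
    wanted = exposed_with_family()
    docs = []
    for trade in trades:
        traded_on = date.fromisoformat(trade["date"])
        if trade["customer"] in wanted and start <= traded_on <= end:
            docs.append({
                "tradeId": trade["trade_id"],
                "tradedOn": trade["date"],
                "amount": trade["amount"],
                "customer": {"id": trade["customer"],
                             "name": customers[trade["customer"]]["name"]},
            })
    return docs


docs = curated_trades(date(2017, 1, 1), date(2019, 12, 31))
print(json.dumps(docs, indent=2))
```

The consumer sees only the resulting JSON documents; the family traversal stays behind the delivery interface.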
Multi-Model Enterprise Knowledge Graph
Multi-model enterprise knowledge graphs (MMEKGs) can alleviate many of the issues described earlier by allowing users to blend and manage source, EKG, and curated data representations in one ecosystem.
Reduced Time and Cost
MMEKGs allow graph transformation to be delayed until needed. Multi-model graphs also tend to reduce the size of graphs because they allow edges and vertices to contain documents. This allows EKGs to be developed using agile iterative processes.
Reduced Computing Resources
EKG solutions often require separate data systems for staging, graph ETL, graph management, and delivering data to consuming business units. MMEKGs can eliminate the impedance mismatch between source data, knowledge graph, and curated business data, allowing all of the data to be managed in one system. This reduces transformation latencies, makes all data searchable, and avoids the cost of running separate clusters for staging, transformation, graphs, and business applications.
Ease of Use
A multi-model approach makes source data, knowledge graph, and business application data searchable and findable in the same data system. Business users can consume the data in their own formats, without having to understand the complex enterprise graph models.
Enterprise users can search for source data as well as curated data.
Data Lineage/Provenance
With data staged, transformed, and delivered in the same multi-model system, it is much easier to keep track of data lineage.
Enhance Existing EKGs
Enterprises that have RDF EKGs can preserve the effort put into them and leverage them in an MMEKG. Multi-model databases can ingest RDF ontologies and RDF EKGs because the multi-model graph is a superset of the labeled directed graphs that RDF is based on. Similarly, multi-model graphs subsume property graphs, making it easy to absorb property graph-based EKGs.
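As a rough illustration of how RDF data can map onto a multi-model graph, the sketch below folds literal-valued triples into vertex documents and turns resource-valued triples into labeled edges. This is one possible mapping chosen for illustration, not a description of any product's RDF import. The `Literal` marker class, the `ex:` identifiers, and the `id` field are assumptions, and multi-valued properties are deliberately not handled.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Literal:
    """Marks an object value as an RDF literal instead of a resource."""
    value: Any


# Hypothetical triples: (subject, predicate, object).
triples: List[Tuple[str, str, Any]] = [
    ("ex:acme", "rdf:type", "ex:Company"),
    ("ex:acme", "ex:name", Literal("Acme Corp")),
    ("ex:acme", "ex:founded", Literal(1999)),
    ("ex:acme", "ex:subsidiaryOf", "ex:globex"),
]


def to_multi_model(
    rdf_triples: List[Tuple[str, str, Any]],
) -> Tuple[Dict[str, dict], List[Tuple[str, str, str]]]:
    """Convert triples into vertex documents plus labeled directed edges."""
    vertices: Dict[str, dict] = {}
    edges: List[Tuple[str, str, str]] = []
    for subject, predicate, obj in rdf_triples:
        doc = vertices.setdefault(subject, {"id": subject})
        if isinstance(obj, Literal):
            # Simplification: a repeated predicate would overwrite the earlier value.
            doc[predicate] = obj.value
        else:
            vertices.setdefault(obj, {"id": obj})
            edges.append((subject, predicate, obj))
    return vertices, edges


vertices, edges = to_multi_model(triples)
# Four triples become three vertices and two edges in this mapping.
print(len(vertices), len(edges))
```

Keeping every triple as an edge is equally valid and stays closer to the original RDF; folding literals into documents simply shows how the same data can take fewer edges in a multi-model form.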
Conclusion
Multi-model is an enabling technology for EKGs. The multi-model EKG provides a scalable, efficient, and effective means of operating and operationalizing EKGs. Multi-model databases also provide an efficient, business-friendly way to operationalize existing EKGs developed using RDF or property graphs. The benefits include streamlining and accelerating multi-source data ingestion and harmonization, accelerating EKG operationalization by the business, enabling greater scale by blending models, and reducing the EKG ecosystem footprint to a single data system, in contrast to current approaches that require complex orchestration of systems of systems.