A recent client project has again brought to my attention two important benefits of W3C standards around semantic technologies and graph data:
- Stability, in the sense of portability of data and queries
- A choice of products between which I can move my data and queries
- And thus investment protection that proprietary graph databases do not offer
tl;dr
Why RDF? Because it protects your investment!
A concrete example: data migration to a new product after 6 years took one day of effort. Migration of the existing queries: probably less than 4% needed changes. The data model stays the same, and the APIs and queries for customers remain unchanged. The customer could choose between three products to meet the changed requirements.
Stability
The point here is that the data itself represents the value that needs to be preserved. Significant resources have likely gone into structuring, collecting, merging, validating, and maintaining this data.
To preserve this value, a stable data format is important. I want to still be able to read this data in 10 years; otherwise my investment is lost, or at least significant additional resources are needed to convert the data into whatever format is currently opportune or fashionable.
This is where a standard like RDF helps. Yes, it evolves slowly. And that is precisely why I can still access it and work with it 10 years later, with different tools. My investment is protected.
In addition:
Many products support RDF
There are now quite a few products that work with the standards around RDF.
These include various triplestores, or graph databases. They all offer different features and focus areas, which make them suitable for different applications.
Whether I need particularly fast rule processing, incremental validation of updates even for huge amounts of data, a connection to unstructured data or full-text search engines, transactions, time-based queries, location-based queries, high availability, or bindings for my preferred programming language: for almost every requirement there is an offering that covers it.
And the data will still be in a standardized format that can be processed in another application or database five years from now.
In addition, there is a rich selection of tools and methods for working with this data: user interfaces, various inference systems, visualization libraries, SQL-to-RDF adapters, and much more. The range spans open source, freeware, and high-priced enterprise-level products.
There is also plenty of training material, examples, code, and business and scientific articles, and for many problems a solution has already been formulated. Anyone who wants to get into the field will find more than enough material, both for training staff and for self-study.
If I compare this to Neo4J, for example: one vendor, one query language, one database implementation, a limited set of tools. This may seem like an advantage at first: no choice also means no learning and decision process once Neo4J has been chosen. But it also means no choice when new capabilities are required further down the road. And it means transforming the data when the new capabilities require a new database.
An example
My current customer is looking for a new triplestore for a dataset of 600 million triples that is expected to grow to 900 million.
In the current application, every time the data is updated, more than 600 queries are run to validate the data.
To test the alternatives, I exported the 600 million triples from the existing database into a standardized format (NQuads), transformed them into another standardized format (TriG) using an existing tool, and imported them into three different triplestores.
This went off without a hitch. Effort for data migration in three different products: one day.
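The conversion step itself requires nothing exotic, since both ends are standardized formats. As a minimal sketch of the idea (file names are placeholders, and at 600 million triples a streaming converter is the more realistic choice), the step looks like this with Python and rdflib:

```python
# Sketch only: convert an N-Quads export into TriG with rdflib.
# File names are placeholders; a streaming converter is preferable at this data volume.
from rdflib import Dataset

ds = Dataset()
ds.parse("export.nq", format="nquads")
ds.serialize(destination="export.trig", format="trig")
```

No product-specific export or import code is needed; any RDF toolkit that reads N-Quads and writes TriG will do.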
Then I sent the existing 600 validation queries - unchanged - to all three alternative products. All 600 validations were accepted and processed; only for 20-25 of them did the different stores claim errors in the - valid - data.
This is a more than good result: over 96% of the queries run on the new stores completely unchanged.
The investment is thus preserved: the effort to adapt these 20-odd queries is very manageable, and there was essentially no effort to convert the data.
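Replaying the existing queries against a candidate store needs only a few lines of code. The following sketch assumes a local SPARQL endpoint, a directory of query files, and that the validations are ASK queries; these details are placeholders, not specifics from the project:

```python
# Sketch only: send the existing validation queries, unchanged, to a candidate
# store's SPARQL endpoint and record which ones report problems.
# Endpoint URL, directory layout, and the ASK-query assumption are placeholders.
from pathlib import Path
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:3030/dataset/sparql")
endpoint.setReturnFormat(JSON)

failed = []
for query_file in sorted(Path("validations").glob("*.rq")):
    endpoint.setQuery(query_file.read_text())
    result = endpoint.query().convert()
    # An ASK validation returns {"boolean": true} when the data passes the check.
    if not result.get("boolean", False):
        failed.append(query_file.name)

print(f"{len(failed)} validations reported problems: {failed}")
```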
Summary
Thanks to the decision 6 years ago to use W3C standards, my customer can today - with no cost for data migration and minimal cost for adapting existing queries (<5% changed):
- increase the performance of the existing solution by an order of magnitude
- increase robustness and resilience by switching to an HA configuration
- simplify the existing architecture
- significantly reduce the scope of homegrown services
The customer's costs are limited to one-time costs for actual improvements: the queries are converted to constraints, and the architecture is simplified to work with a smaller number of self-written services.
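Assuming the constraints are expressed in SHACL - which fits the incremental-validation requirement mentioned above - such a converted check might look roughly like the following sketch; the shape, file names, and the use of pyshacl are illustrative only:

```python
# Sketch only: one way a validation query can be restated as a SHACL constraint
# and checked with pyshacl. The shape and file names are made-up examples.
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix ex:  <http://example.org/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:ProductShape a sh:NodeShape ;
        sh:targetClass ex:Product ;
        sh:property [
            sh:path ex:price ;
            sh:minCount 1 ;
            sh:datatype xsd:decimal
        ] .
""", format="turtle")

data = Graph().parse("data.ttl", format="turtle")

conforms, _, report_text = validate(data, shacl_graph=shapes)
print("data conforms" if conforms else report_text)
```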