Why doesn’t the Database and Semantic Web research community interact more?

I was in Chicago to meet with colleagues from the Graph Query Language task force at the Linked Data Benchmark Council (LDBC) so we could have an impromptu face-to-face meeting (great progress towards our graph query language proposal!). They were in Chicago attending one of the main academic database conferences: SIGMOD/PODS. I was able to take a quick look at papers, demos and tutorials.

I left with the following question: Why doesn’t the Database and Semantic Web research community interact more? The cross pollination, in my opinion, is minimal. It should be much bigger. A couple of examples:

If you go to two conferences of your field in a year, consider swapping one conference to attend another conference in a different field. For example, for the Semantic Web community, if you attend ISWC and ESWC, consider swapping one of those to attend SIGMOD or VLDB. Same for the database community.  VLDB 2017 will be in Munich from August 28th to September 1, 2017.

I made a list of papers from SIGMOD/PODS (research papers, demos and tutorials) that I believe are relevant to the Semantic Web community. The SIGMOD and PODS papers are available online

PODS Papers

SIGMOD Papers

SIGMOD Demos

SIGMOD Tutorials

P.S. For the travel and points geeks. Last minute travel to Chicago was really expensive. Over $500 USD.  I was able to use 25000 miles and pay just $10 USD. And I even got upgraded to first class!

Smart Data and Graphorum Conference Trip Report

I attended the Smart Data-Graphorum Conference (January 30 – February 1) in the Bay Area (actually Redwood City). This conference series originally was called Semantic Technology (SemTech) Conference and I have been presenting at it since 2010.

This year, the conference had a cozy feeling with ~250 attendees. I gave two talks:

  • Graph Query Languages: Similar to my Graph Data Texas talk, I gave an update from the Graph Query Language task force at the LDBC. The latest discussions were incorporated in this talk. We have been discussing the idea of having a paths as a datatype and also its own table ( a table for Nodes, Edges and Paths). Additionally, there are two notions of projection: relational vs graph. The slides provide some examples. This is still on going work.

  • Virtualizing Relational Databases as Graphs: a multi-model approach: In this talk I discussed how relational databases can be virtualized as RDF Graphs by using the W3C RDB2RDF standards: Direct Mapping and R2RML. I argue that graphs are cool, and ask if relational databases are cool? If you are  deciding to move from a relational database to a graph database, you should understand the tipping point. I believe virtualization is a viable option to keep your data in a relational database while continuing to take advantage of graph features. However, that may not always be the case.

 

Additional highlights of the conference

  • I was glad to see a lot of friendly faces. I feel very lucky to that I can always have a chat with Deborah McGuinness and Michael Uschold, two legends in ontologies. It’s always great to see Souri Das from Oracle (and all the Oracle folks from the semantic technology group) and discuss how the W3C RDB2RDF standards are doing. We both agree that we did a good job with that standard and gave a pat on our own backs 🙂 Also great to see Peter Haase, Dean Allemang, Atanas Kiryakov, Bart van Leeuwen, Jans Aasman, Dave McComb and many more.
  • Michael Uschold and I discussed the pragmatics of part-of and has-label semantics. For some situations you want to be generic. For example, it’s easier for a user to just use “has label” for any thing, instead of having to know the exact type of “has label” for a specific thing. Now I understand many of the modeling decisions made in gist. I argue that from a database point of view, query performance is better if you have more specific properties, unless you have some sort of semantic query optimizations.
  • Cambridge Semantics gave a presentation on their in-memory analytics graph database. They presented results using the LUBM benchmark where they claim to have blown Oracle away. Important to note that they used 4x the hardware. Atanas Kiryakov, Ontotext’s CEO was in the audience and rightfully asked why they didn’t use a more up to date benchmark given that LUBM is from 2007. It seems that everybody has been using LUBM (since 2007) so in order to compare to others, they continue to use LUBM. Hopefully they will start using the LDBC benchmarks!
  • I have been aware that Marklogic markets themselves as a document and graph database. I now understand how they represent things underneath the hood. Each entity, with their corresponding attributes and values are represented in a document (key-values). The relationships between the entities are represented as RDF triples.  This makes a lot of sense to me and I can imagine how this can improve query performance to a certain degree.
  • Brian Sletten gave a great talk on JSON-LD. I wish all web developers could see this presentation in order to understand the value of Linked Data. Even though Brian was not able to give his talk on the new W3C upcoming standard SHACL, the Shapes Constraint Language, his slides left a lasting impression. This is the best definition I have ever seen for the Open World Assumption!

  • It was great to see Emil Eifren, Neo Technologies’ CEO again. We discussed history of RDF and Semantic Web (I didn’t know he was a very early user of Jena!). We seem to be in agreement that RDF is great technology for data integration. Anything else graph related, he argues that you should use Neo4J. Not surprising 😛 I was also glad to see that Neo4j is starting to work on formalizing the semantics of Cypher, including making it a closed query language.

This was a great couple of days and hopefully next year we will have more people!

A Data Weekend in Austin

On the weekend of January 14-15, I attended Data Day Texas, Graph Day Texas and Data Day Health in Austin and gave three talks.

Do I need a Graph Database: This talk came out of a Q/A during a happy hour after a talk I gave at a meetup in Seattle. We were discussing when to use a Graph Database? What type of graphs should you use: RDF or Property Graph.

 

Graph Query Languages: This talk gave an update on the work we have been doing in the Graph Query Language (GQL) task force at the Linked Data Benchmark Council (LDBC). The purpose of the GQL task force is to study query languages specifically for the Property Graph data model because there is a need for a standard syntax and semantics of a query language. One of the main points I was arguing in this talk is the need of a closed language: graphs in, graphs out. One can argue that a reason for success of relational databases is because the query language is closed (tables in, tables out). With this principle, queries can be composed (i.e. views!). This talk was well received and generated a lot of interesting discussion, specially when Emil Eifrem, Neo Technologies’ CEO is in the room.  An interesting part of the discussion was if we are too early for standardization. Emil stated that we need standardization now because their clients are asking for it. I stated that graph databases today are in the mid 1980’s of relational databases, so time is about right to start the discussion. Andrew Donoho said I was too optimistic. He thinks we are in the late 70s and we are too early. I will be giving this talk next week at the Smart DataGraphorum conference, with some updated material. Special thanks to Marcelo Arenas, Renzo Angles and specially Hannes Voigt for helping me organize these slides.

Semantic Search Applied to Healthcare: In this talk, I introduced how we are identifying patients who are in need of Left Ventricular Assist Devices (LVADs) using Ultrawrap, the semantic data virtualization technology developed at Capsenta. This talk presented a use case with the Ohio State University Wexner Medical Center. Patients are being missed through traditional chart pull methods. Our approach has resulted in ~20% increase in detection over previously known population at OSU, which is a mature institution. This talk will also be given at the Smart Data conference.

Main highlights of the conference:

  • Emil Eifrem, CEO of Neo Technology gave the keynote. It was nice to learn the use cases where Neo4j is being used: Real-time recommendation, Fraud detection, Network and IT operations, Master Data Management, Graph-Based Search and Identity & Access Management. It was not clear why were graphs specifically used because these are use cases that have been around for a long time and have been addressed using traditional technologies. Emil ended talking about a “connected enterprise”, meaning integrating data across silos using graphs. If you take a look at my Do I need a graph database talk,  you will see that I argue to use RDF for data integration, not Property Graphs.
  • Luca Garulli, the founder and CEO of OrientDB gave a talk focusing on the need of a multi model database like OrientDB. In his talk, he argued for many features which Neo4J apparently didn’t support. Not long after, there was a good back-and-forth twitter discussion between Emil and Luca. Emil was correcting Luca. Seems like this talk may need to be updated. An interesting take away for me: how do you benchmark a multi model database?
  • Many talks about “I’m in relational, how do I get to property graphs”. All of them at an introductory level. Given that we have studied very well the problem of relational to RDF, this should be a problem that can be address quickly and efficiently.
  • Standards was a big topic, one of the reasons my Graph Query Language talk was well received. Neo4j is pushing for OpenCypher to become the standard, while in fact, one could argue that Gremlin is already the defacto standard. Before this weekend, I wasn’t aware of anybody implementing OpenCypher. Apparently there are now 10 OpenCypher implementation including Bitnine, Oracle and SAP HANA.
  • Bitnine: they are implementing a PropertyGraph DB on top of Postgres and using OpenCypher as the query language. They are NOT translating OpenCypher to SQL. Instead, they are doing the translation to relational algebra internally. I enjoyed the brief discussion with Kisung Kim, Bitnine’s CTO. Apparently they have already benchmarked with LDBC and did very well. Looking forward to seeing public results. Bitnine is open source.
  • Take a look at sql2gremlin.com
  • grakn.ai looks interesting. Need to take a closer look.
  • Cray extended the LUBM benchmark and added a social network for the students.
  • Property Graphs is what comes to mind when people thing about graph databases. However, an interesting observation is that the senior folks in the room prefer RDF than Property Graphs. We all agreed that RDF is more mature than Property Graph databases.
  • “Those who do not learn history are doomed to repeat it.” It is crucial to understand what has been done in the past in order to not re-invent the wheel. I feel lucky that early on in grad school, my advisor pushed me to read pre-pdf papers. It was great to meet this weekend with folks like Darrel Woelk and Misty Nodine who used to be part of MCC. A lot of the technologies we are seeing today has roots back to MCC. For example, we discussed how similar graph databases are to object oriented databases. On twitter, Emil seemed to disagree with me. Nevertheless we had an interesting twitter discussion.
  • Check out JanusGraph, a graph database, which if I understood correctly, is  a fork from Titan. Titan hasn’t been updated in over a year because the folks behind it are now at DataStax.

Thanks to Lynn Bender and co. for organizing such an awesome event! Can’t wait for it to happen in Austin next year. Recordings of the talks will start to show up on the Global Data Geek youtube channel.