Discussion Forum

Question on query performance

Here is a requirement:

  1. An instance of an entity is created in some workspace.
  2. Only certain people have access to the workspace.
  3. However, access can be granted to other people in other workspaces.

We can implement the above in a few ways:

  1. Have the notion of Workspace as an attribute on the entity
  2. Connect instances to the relevant workspaces.

Assuming that workspaces are not frequently changed and, for all practical purposes, remain constant, is it better to model the workspace as an attribute on the entity itself or to manage it as a relation? There could be several million (eventually billions of) records in the database. Query performance is very important.

From a general, Grakn-specific performance optimization standpoint, which of the options is better?

It kind of depends on the ratio of people to workspaces and the cardinality between them. If we have many people, each with exactly one workspace-membership relation, it would probably be fine to use a relation. Using an attribute will encourage the query planner to start its search from the attribute in some cases, and if the workspace attribute is very highly shared (hundreds of thousands of edges), that might end up being a slow query.
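To make the ratio argument concrete, here is a rough back-of-the-envelope sketch in plain Python. It does not talk to Grakn at all; the function names and the 100,000 threshold are assumptions for illustration, with the threshold taken loosely from the "hundreds of thousands of edges" figure above.

```python
# Assumed cut-off, loosely based on the figure mentioned in this thread.
SUPERNODE_THRESHOLD = 100_000


def average_fan_out(record_count: int, workspace_count: int) -> float:
    """Average number of records attached to each distinct workspace value."""
    if workspace_count <= 0:
        raise ValueError("workspace_count must be positive")
    return record_count / workspace_count


def looks_like_supernode(record_count: int, workspace_count: int,
                         threshold: int = SUPERNODE_THRESHOLD) -> bool:
    """True when the average workspace would carry at least `threshold` edges."""
    return average_fan_out(record_count, workspace_count) >= threshold
```

This only looks at the average. A skewed distribution can still produce a few very heavily shared workspaces even when the average looks safe.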

On the other hand, attribute ownership is 1-hop when optimised under the hood, whereas relations are, for now, always implemented as edge-node-edge (this will also be optimised under the hood in the future), so you’re saving space and some query time using attributes.
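The 1-hop versus edge-node-edge difference can be pictured with a toy in-memory graph. This is only an illustration of the hop counts described above, not Grakn's actual storage layout, and all node names are made up.

```python
from collections import deque


def hops(graph, start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Attribute model: the record points straight at the attribute value.
attribute_model = {"record-1": ["workspace-name:alpha"]}

# Relation model: the record reaches the workspace through a relation node.
relation_model = {
    "record-1": ["membership-1"],
    "membership-1": ["workspace:alpha"],
}

attribute_hops = hops(attribute_model, "record-1", "workspace-name:alpha")
relation_hops = hops(relation_model, "record-1", "workspace:alpha")
```

Here `attribute_hops` is 1 and `relation_hops` is 2, which is the extra traversal per lookup being referred to.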

All that being said, it might be best just to try it out. It should be pretty easy to write a script that loads a lot of users and workspaces, connects them, and then runs some queries on top of this :slight_smile:
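A minimal sketch of the data-generation and timing half of such a script could look like the following. It is plain Python with hypothetical helper names. Loading the pairs into Grakn and running real queries is deliberately left out, since that depends on the client version you use.

```python
import random
import time
from collections import Counter


def generate_memberships(user_count, workspace_count, seed=0):
    """Return (user, workspace) pairs, one workspace per user, reproducibly."""
    rng = random.Random(seed)
    return [
        (f"user-{i}", f"workspace-{rng.randrange(workspace_count)}")
        for i in range(user_count)
    ]


def workspace_sizes(memberships):
    """Count how many users ended up in each workspace."""
    return Counter(workspace for _, workspace in memberships)


def time_call(fn, *args, repeats=5):
    """Best wall-clock time in seconds over `repeats` calls of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

You would pass whatever function issues your query to `time_call`, once against the attribute-based schema and once against the relation-based one, using the same generated data for both.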

But as a product, you should have some performance characteristics and therefore some guidelines. Instead of us discovering things as we go, such information should be published, based on usage by other customers, internal performance measurements, etc.

We’re working on it, @Kgyk! We’ll be publishing a thorough benchmarking report (and improvements) in 3-4 months’ time! :slight_smile:

Hey, this might be the answer to my slow performance problems. In my model everything connects to one specific entity, the log, and I had never realised that the degree of connectivity may impact performance.

So instead of having a top-level owner that owns everything, is it a better design technique to distribute the linking through a hierarchy? Should I redesign on this basis for better performance?

It could matter if you are creating supernodes, @modeller, that is, nodes with multiple hundreds of thousands of relations attached.

There’s a trade-off here between modeling and performance. We tend to encourage users to design the best model to suit their domain first, and then verify whether the performance matches. If we begin by writing the schema obfuscated to meet predicted performance requirements, we lose one of the main benefits of Grakn!

However, you are correct: we should have general guidelines in our documentation on how to write performant schemas, and these will be added. For now, some tips: avoid building data supernodes; attributes are more performant than relations to entities (for now, optimisations still incoming); shorter-hop queries are more performant and likely to be planned better; and transitivity in reasoning can be quite expensive sometimes!
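One way to check for supernodes before they hurt is to count the degree of each node in an exported edge list. The sketch below is plain Python over `(node, node)` pairs. The function name is hypothetical and the default threshold is an assumption based on the "100k's of relations" figure in this thread.

```python
from collections import Counter


def find_supernodes(edges, threshold=100_000):
    """Return {node: degree} for nodes with at least `threshold` edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node: d for node, d in degree.items() if d >= threshold}


# Tiny example shaped like the "everything connects to the log" model.
example_edges = [("log", f"event-{i}") for i in range(5)]
example_edges.append(("event-0", "user-1"))
example_supernodes = find_supernodes(example_edges, threshold=5)
```

With the small threshold used here, only `log` is reported, because every event hangs off it. That is the pattern to look for in the real data at a much larger scale.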


It’s more about unintended, or emergent, consequences, which may not be obvious as one starts the design but may come into play later on. These are often a question of degree, as you say, and may therefore be difficult to describe due to inherent variability.

Hence, best-practice guidelines that explain the reasoning behind the trade-offs, what to look out for, and ways of testing to be sure I don’t go there would be greatly appreciated. Guidelines that give instructions without the rationale that exposes the trade-offs are not that useful.