Inserting relations using the Python API

Hello!

I am trying to insert some (240k) reasonably complex relations (they connect entities and relations and have attributes) into TypeDB with the Python API. I found when trying to load them that the system locks up if I try to insert more than about 250 at once without committing. Why is this happening? Does it have to do with the configuration of my server?
I am just using the standard config of v2.8.0. Any thoughts? :slight_smile:

Thanks!
awkh

I guess I could rephrase that and just ask: what is the fastest way to insert data, and how does the config of the TypeDB server affect this?

Hi @awkh thanks for posting!

This is a known issue, the cause being that the work that needs to be done at commit time increases non-linearly with the number of concepts written in the transaction. For the foreseeable future, please reduce the size of your write transactions by committing more frequently, and imports should run smoothly. If this is not the case then naturally please write back with more info.

In general for loading large data sets, particularly from CSVs, consider using TypeDB Loader, which is a community-led project for loading data with a lot of useful features, supported by TypeDB-OSI.

Hello James,

Thank you for the advice. Regarding TypeDB Loader, the data I am inserting comes from some messy Excel sheets that need a lot of cleaning up prior to inserting. The data is also reasonably complex, so it would not be straightforward to spit it out as .csv files before loading. Would you recommend I try to do this anyway? I have already written my own functions to handle it.

I have started inserting in batches of 200 at a time, each of which currently takes a few seconds. It will take a while for 240k. What surprised me is that the server RAM maxes out after a while when doing this and the system crashes. This happens in the middle of a session, so the database often ends up corrupted. Do you have any suggestions as to how to deal with this? I am trying to increase the amount of RAM available to my JVM, and I have increased the allocation for data and indexing in the TypeDB server. I will also look at parallelisation, but that does not solve the problem with the RAM.

Thanks again! :slight_smile:

@awkh the nature of the crash you’re describing should never happen! We never expect to see data corruption. Please get in touch with us by DM on the Vaticle Discord server so that we can work to reproduce the issue and get this fixed.

Meanwhile, yes it’s understandable to write your own scripts for loading data rather than trying to create CSVs to load in. This can be justified when there are complex interdependencies between elements in your data.

My suggestion to avoid the issue (which we should still fix) is to decrease the batch size significantly more. Try 50 or even 10. Opening transactions in TypeDB doesn’t carry a high cost, and they are designed to do small and incremental pieces of work before they are closed or committed.
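
As a minimal sketch of that pattern (names like queries, serverAddress and databaseName are placeholders for your own setup, not code from this thread):

from typedb.client import TypeDB, SessionType, TransactionType

BATCH_SIZE = 50  # small batches keep the commit-time work bounded

def load_in_batches(queries, serverAddress, databaseName):
    with TypeDB.core_client(serverAddress) as client:
        with client.session(databaseName, SessionType.DATA) as session:
            for start in range(0, len(queries), BATCH_SIZE):
                # one short-lived write transaction per batch, committed immediately
                with session.transaction(TransactionType.WRITE) as tx:
                    for query in queries[start:start + BATCH_SIZE]:
                        list(tx.query().insert(query))  # consume the stream so the insert completes
                    tx.commit()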

Hope this helps!
James

Hello James,

Thank you for the response. Before I contact you on the Discord server, I will attempt to reproduce the corruption here. I realise that breaking ACID would be a big deal for TypeDB. To be honest, a couple of times right at the beginning the system would freeze, and when I then force quit the server it would not start and load the database again. As a result, I began to delete the DB directory on my drive manually before restarting the server, creating a new DB, writing the schema and then beginning to rewrite the data to it.

Regarding the batch size, I reduced it to 50 yesterday but the system still kept crashing. Also, given that I am trying to insert 240k relations and each batch of 50 is taking >5 seconds, it would take a good long time to load the data. My machine has 16GB of RAM, but I am not enough of a devops guy to know how to set up the environment properly to get this done right. I am pretty surprised that the system is having trouble dealing with data from a single Excel sheet though… Have you got any further advice for me?

A

Honestly, I’m surprised by both the slowness and the crashes and corruption you’re seeing. I don’t have any further advice without seeing what you’re doing in more detail. You are welcome to share your import code and queries here on the forum, or DM it if it’s proprietary/private. This is unusual behaviour so I can’t say yet what could be causing it.

Hello James,
I hope you had a good weekend.
I sent you a DM with the necessary code to reproduce the issue. Might you have time to have a look at it? I need this problem solved reasonably fast, or unfortunately I will have to find another solution. We are hoping to roll TypeDB out more broadly across our organisation.
A

Hi @awkh,

Apologies I couldn’t reply over the weekend. Yes, this is a high priority and we’ll be discussing it this morning. Thank you very much for the reproducible example. It makes sense that this issue only occurs in one very specific pattern of writing relations to TypeDB, because otherwise we would have seen it already. There must be one bad code path that your example should be able to identify.


Hi @awkh !
I’ve had a look into your data loading and sample data and tried to reproduce the issue on my local machine (16GB Mac, using TypeDB 2.8.1).

I have been able to reproduce the Python segfault, but not to crash the TypeDB server. However, I do have a couple of pointers to things that could be causing your issues.

Firstly, it seems like your loader is structured rather inefficiently, using small methods that constantly open clients and sessions to do their work. Best practice is to open a single client and a long-lived session, and to reuse these throughout the data load. I think doing so would resolve your Python segfault, which may stem from the fact that creating a new client/session too many times can hit a Python memory or thread limit.
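
For illustration, that restructuring could look roughly like this (the database name, address and batch variable are placeholders, not your actual code):

from typedb.client import TypeDB, SessionType, TransactionType

def insert_batch(session, insert_strings):
    # helpers receive the shared session instead of opening their own client and session
    with session.transaction(TransactionType.WRITE) as tx:
        for insert_string in insert_strings:
            list(tx.query().insert(insert_string))
        tx.commit()

batches_of_insert_strings = []  # placeholder: fill with lists of TypeQL insert strings

with TypeDB.core_client("localhost:1729") as client:                  # opened once per program
    with client.session("my_database", SessionType.DATA) as session:  # opened once per load
        for batch in batches_of_insert_strings:
            insert_batch(session, batch)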

Secondly, I added some logging to the loading of the loadTimeCourseDataFlag section of the code, which prints out the number of inserts that a single match-insert query actually performed. In general, when inserting relations, this should be a single relation, because you match a specific set of role players and then create a single relation between them.

However, my printouts look like this:

inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 2314
inserted: 2314
inserted: 189
inserted: 1
inserted: 1
inserted: 1
inserted: 2314
inserted: 2314
inserted: 189
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
...

This means some of your match-inserts are creating far more relations than you expect.

In fact, by the time Python crashed, running a count over the number of relations in the database from the console gave:

> transaction test_database data read
test_database::data::read> match $x isa relation; count;
                           
2172435

2.1 million relations is far larger than the expected number (~240k).

The best and fastest way to proceed is actually to keep all your data cleaning code and use it to create cleaned CSV data files. We then recommend using TypeDB Loader to load these CSV files, since it automatically handles transactions and parallelisation (which you haven’t touched yet!), and it is quite well tested. If the excessive number of relations is coming from a bug in your own loading code and not from a property of the data itself, then TypeDB Loader will resolve all of the issues above!

Alternatively, if you wish to proceed with your own code, you’ll need to at minimum:

  1. use a single client and session for writing the data
  2. ensure your match-insert statements are creating the expected number of relations (see the sketch after this list)
  3. [if you want faster loading] parallelise the data loading
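
To illustrate point 2, the usual cause of over-insertion is a match clause that is not specific enough: if the match finds N combinations of role players, the insert runs N times and creates N relations. A sketch with made-up type and attribute names (not your actual schema):

# Over-matching: every ($s, $m) pair that satisfies the match gets its own new relation.
loose_query = """
match $s isa sample; $m isa measurement, has unit "mg";
insert (subject: $s, value: $m) isa observation;
"""

# Pinned down: each role player is identified by a key, so the match yields exactly
# one pair and the insert creates exactly one relation.
pinned_query = """
match
  $s isa sample, has sample-id "S-001";
  $m isa measurement, has measurement-id "M-042";
insert (subject: $s, value: $m) isa observation;
"""

Counting the answers returned by each match-insert, as in the logging above, then confirms a single insert per query.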

Hopefully this information clears up what your next steps will be!


Hello Joshua,

Thank you for your time. :slight_smile:

I am very glad that the problem is at my end and not yours. I guess I am new to all this after all. Haha!

Thanks for the help debugging! I was getting so frustrated. I will go and have a dig around and figure out what is wrong. There seems to be some sort of pattern to the excess inserts, so I should be able to work that out. Can I ask how you logged that information? How did you check how many inserts were made?

Thanks again for your time! And I hope you had a good holiday.

A

You can get the number of inserts in one of your functions, for example, like this:

from typedb.client import TypeDB, SessionType, TransactionType

def insertBatchOfDataIntoDatabase(arrayOfInsertStrings):
    with TypeDB.core_client(serverIP) as client:
        with client.session(databaseName, SessionType.DATA) as session:
            with session.transaction(TransactionType.WRITE) as writeTransaction:
                for insertString in arrayOfInsertStrings:
                    # consuming the result stream gives one answer per inserted concept map
                    inserted = len(list(writeTransaction.query().insert(insertString)))
                    print(f"inserted: {inserted}")
                writeTransaction.commit()

Good luck!

Hello Joshua and James,

I hope you had good weekends.

Is there a way I can enforce that entities with multiple attributes are unique? I think that is what is causing my problems. E.g. measurements in my schema have a quantity and a unit. I want quantity and unit separately to be non-unique, but the entity that has them together in a certain combination (quantity, unit) to be unique.

Or do I have to filter before loading?

/A

@awkh the feature you’re describing is a composite key, and TypeDB doesn’t support composite keys at this point. You could filter before loading, or add loading logic that first matches for an entity based on quantity and unit, and inserts that entity (and its attributes) only if it’s missing.
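
A rough sketch of that check-then-insert logic, assuming a reusable write session (the measurement, quantity and unit names are made up to match the description, not the real schema):

from typedb.client import TransactionType

def insert_measurement_if_missing(session, quantity, unit):
    match_query = f'match $m isa measurement, has quantity {quantity}, has unit "{unit}"; get $m;'
    insert_query = f'insert $m isa measurement, has quantity {quantity}, has unit "{unit}";'
    with session.transaction(TransactionType.WRITE) as tx:
        # only insert if no measurement with this exact (quantity, unit) combination exists yet
        if not list(tx.query().match(match_query)):
            list(tx.query().insert(insert_query))
            tx.commit()

Note that this check is only safe within a single loading process; parallel loaders could still race and insert duplicates, so filtering the data up front remains the more robust option.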

Hello James,

That is the conclusion I have come to, and I think the cause of quite a lot of my grief. I am implementing filtering prior to loading.

Thanks for the support! :slight_smile:

/awkh