Hi @awkh !
I’ve had a look into your data loading and sample data and tried to reproduce on my local machine (16GB Mac, using TypeDB 2.8.1).
I was able to reproduce the Python segfault, but not the TypeDB server crash. However, I do have a couple of pointers to what could be causing your issues.
Firstly, it seems your loader is structured rather inefficiently: it uses small methods that constantly open new clients and sessions to do their work. Best practice is to open a single client and a long-lived session, and reuse them throughout the data load. I think doing so would resolve your Python segfault, which may stem from the fact that creating too many clients/sessions can hit a Python memory or thread limit.
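As a sketch of that shape (the client calls in the comments follow the 2.x Python driver, `typedb-client`; `build_insert_query` and the batch size of 500 are placeholders, not from your code):

```python
# Reuse ONE client and ONE session for the whole load; open a short-lived
# WRITE transaction per batch of rows instead of a client per method call.
from itertools import islice

def chunked(rows, batch_size):
    """Yield successive batches of at most `batch_size` rows."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch

# With the TypeDB Python client (2.x), the load loop then looks roughly like:
#
#   from typedb.client import TypeDB, SessionType, TransactionType
#
#   with TypeDB.core_client("localhost:1729") as client:
#       with client.session("test_database", SessionType.DATA) as session:
#           for batch in chunked(rows, 500):
#               with session.transaction(TransactionType.WRITE) as tx:
#                   for row in batch:
#                       tx.query().insert(build_insert_query(row))
#                   tx.commit()
```

The key point is that the client and session live for the whole load, while each transaction stays small enough to commit quickly.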
Secondly, I added some logging to the `loadTimeCourseDataFlag` section of the code, which prints out the number of relations that a single `match-insert` query actually inserted. In general, when inserting relations, this should be a single relation, because you match a specific set of role players and then create a single relation between them.
However, my printouts look like this:
```
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 2314
inserted: 2314
inserted: 189
inserted: 1
inserted: 1
inserted: 1
inserted: 2314
inserted: 2314
inserted: 189
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
inserted: 1
...
```
This means some of your `match-insert` queries are creating far more relations than you expect.
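The usual cause is an under-constrained `match` clause: the insert runs once per answer to the match, so if the role players are identified by non-unique attributes, you get one relation per combination of answers. For illustration only (these schema names are made up, not taken from your code):

```typeql
# Under-constrained: if 7 genes share this pathway and 331 samples share
# this batch, a single run of this query inserts 7 x 331 relations.
match
  $g isa gene, has pathway "p53-signalling";
  $s isa sample, has batch "B1";
insert (measured: $g, sample: $s) isa measurement;

# Constrained: each role player is matched by a unique key, so the match
# has exactly one answer and exactly one relation is inserted.
match
  $g isa gene, has gene-id "ENSG00000141510";
  $s isa sample, has sample-id "S0042";
insert (measured: $g, sample: $s) isa measurement;
```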
In fact, by the time Python crashed, a count over the number of relations in the database from console gave:

```
> transaction test_database data read
test_database::data::read> match $x isa relation; count;
2172435
```
2.1 million relations is far larger than the expected number (~240k).
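To narrow down which relation type is exploding, you can run the same count per type from console (`measurement` here is a placeholder; substitute each of your relation types in turn):

```typeql
match $x isa measurement; count;
```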
The best and fastest way to proceed is actually to keep all your data cleaning code and use it to produce cleaned CSV files. We then recommend loading these CSVs with TypeDB Loader, which automatically handles transaction management and parallelisation (which you haven't touched yet!) and is quite well tested. If the excessive number of relations comes from a bug in your own loading code rather than a property of the data itself, TypeDB Loader will resolve all of the issues above!
Alternatively, if you wish to proceed with your own code, you’ll need to at minimum:
- use a single client and session for writing the data
- ensure your `match-insert` statements create the expected number of relations
- [if you want faster loading] parallelise the data loading
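For the last point, a minimal sketch of batch-level parallelism with a thread pool; `run_batch` is a stub standing in for "open one write transaction on the shared session, run the batch's match-inserts, commit":

```python
# Parallelise at the batch level: one shared session, one transaction per
# worker task. `run_batch` is a placeholder for the real transaction logic.
from concurrent.futures import ThreadPoolExecutor

def run_batch(batch):
    # Stand-in for:
    #   with session.transaction(TransactionType.WRITE) as tx:
    #       for row in batch:
    #           tx.query().insert(build_insert_query(row))
    #       tx.commit()
    return len(batch)

def load_parallel(batches, workers=4):
    """Run batches concurrently and return the total number of rows loaded."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_batch, batches))
```

Keep each transaction small; the parallelism should come from many concurrent transactions, not from one giant one.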
Hopefully this information clears up what your next steps should be!