Discussion Forum

Batch insertion through Python API

Hi ,

I am trying to insert a fairly large dataset into typedb. I have inserted the entities first and now inserting their relationships. I am using Python API for this but the insertion is taking more than a whole day to insert.
I have looked into parallelization and batch insertion concepts but still not sure how to incorporate that.
I am attaching the code snippet for insertion. Can someone review it and tell me how to incorporate batch and parallelization into that?

with TypeDB.core_client(“localhost:1729”,10) as client:
with client.session(“KG”, SessionType.DATA) as session:
for x in Relationlines:
if len(x.strip()) != 0 :
with session.transaction(TransactionType.WRITE) as write_transaction:
insert_iterator = write_transaction.query().insert(x)
write_transaction.commit()
print(‘Inserted’ + x )

session.close()
client.close()

hi Bilal1,

I too am interested in this topic, I can tell you about batch, but i am unfamiliar with the parallelisation.

Normally, when inserting records from data, like a csv or json, one sets up a loop where one generates a match/insert statement with the data in it. Traditionally one generates a commit message for every statement. The advice here is that one can do 20 to 50 iterations and collect the query strings into a larger string, which is then committed.

I have not yet tried this, although i will do it soon, but i am also interested in the parallelisation

I’m not the Python guy but with the experience with the Julia TypeDBClient we solved the problem as follows:

  • If you use a CSV file load it into the RAM and wrangle the data until you have only the necessary items.
  • Build for every line which should be inserted a statement and omit the insert clause.
  • Split the resulting array into chunks of 50 - 100 statements each.
  • Use one insert clause and concatenate the items of the chunks.
  • Send the construct you get from the concatenation in 1 insert call to the server.

Sadly :grinning: I could provide the Julia code but in Python I wouldn’t guaranty for anything. :grinning:

Best Frank

1 Like

Aah, thats a key point @FrankUrbach , so i dont have tweny match/insert clauses and concatenate them, instead i concatenate all of the matches into an uber match, and all of the inserts into an uber insert. Obviously, then all of the variable names need to be different in this case, I assume you add the index to each variable name as you step through, to provide that uniqueness.

Very illuminating, thanks

Also @FrankUrbach , there is a difference between inserting a piece of data, with an insert clause only, and inserting a relation with, for example, four roles. In the first case i do as you say, in the second case i have to separate out the match clauses and the insert clauses, and concatenate them separately.

This is good to know, as i was planning on concatenating multiple match/insert statements togther, not separated, and this would have caused an error i assume

You haven’t to differ each variable. E.g. you have to insert an entity with the name person the insert row would looks like this:
insert
$_ isa person, has first-name Frank, has birth-date xx.mm.yyyy;
$_ isa person, has first-name Anton, has birth-date xx.mm.yyyy;

and so on.
This will free you from determining every time a new variable, which isn’t necessary in such an environment because you didn’t work with it further.

HTH Frank

clever @FrankUrbach , so because you redefine the variable each line, you dont have to use a different variable name. Thats nice work, i didn’t think of that.

what if you have 20 match/insert statements, you would at least have to have uniques on the match variables, no?

Thats not my found. It’s what I saw elsewhere tried it out and it worked like a charm. :grinning:
In the case of match insert I would divide the match and the insert. You will get a full concept map for each match. With this concept map you should be able to define the needed insert statements which you can use as stated in an earlier post. This I haven’t tried personally but celeste from the discord channel have done this in this fashion. If I understood the feedback correctly it worked as it should be.

Best, Frank