Escape hell: handling string values

Hi there,
I am ingesting a large corpus of data that contains a string-typed field of ASCII text. The text is highly varied and includes all sorts of commas, quotes, double quotes, back/forward slashes and so on.
I get a lot of failures when inserting because it is not even clear how to escape a simple double quote (I use a backslash, but then I get the backslash back in the stored value too).
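For example (a simplified illustration with a made-up note/text schema, not my actual data), this is the kind of thing that trips me up:

    # hypothetical note/text schema, just to show the shape of the problem
    value = 'she said "hello"'
    query = f'insert $n isa note, has text "{value}";'
    # -> fails to parse, because the inner double quotes are not escaped

    query = 'insert $n isa note, has text "she said \\"hello\\"";'
    # -> parses, but when I read the attribute back the backslashes are still there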
I am thinking there could be two ways to solve this:

  • we enable BASE64 encoding for string values, together with a type marker (ASCII/UTF)
  • we enable a sort of raw insertion mode via the data API instead of the query API

I think the first option is simpler to implement (sketched below).
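To be concrete, the idea is that the value travels through the query layer as plain base64, which contains only characters that never need escaping, and decoding on the way out recovers the original bytes. A client-side sketch of the round trip (not an existing TypeDB feature, just what I have in mind):

    import base64

    raw = 'a "tricky" value with a back\\slash, commas, and more'
    encoded = base64.b64encode(raw.encode("ascii")).decode("ascii")
    # 'encoded' contains only [A-Za-z0-9+/=], so it can be embedded in a query
    # string with no escaping concerns; decoding recovers the original bytes
    assert base64.b64decode(encoded).decode("ascii") == raw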
@james.williams

Hi, thank you for reporting this.

We’re looking at this internally and aiming to align with the standard SQL & NoSQL ways of handling strings (i.e. backslashes that escape characters aren’t stored in the database).

Based on what you’ve said in your post, only double quotes and backslashes that aren’t intended to escape characters should be an issue for you right now. Have you had a look at using our concept API? Here’s a link: Concept API | Vaticle.

Hello James,
that is correct; for now my most common use cases involve URL strings (backslashes, ampersands, etc.) and text strings with single/double quotes.
Can you elaborate more on the Concept API?
I don’t see an entity method that would allow me to insert a thing programmatically.

There are some examples of inserting data using the concept API in the Python client tests - see test_get_attributes_by_value and below:
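In the meantime, here is roughly what inserting a string value through the concept API looks like (an untested sketch assuming a TypeDB 2.x Python client; the address, database name and attribute type label are placeholders, and exact method names vary slightly between client versions, so treat the linked tests as the authoritative reference):

    from typedb.client import TypeDB, SessionType, TransactionType

    raw_value = 'a "tricky" string with a back\\slash'

    with TypeDB.core_client("localhost:1729") as client:
        with client.session("my_database", SessionType.DATA) as session:
            with session.transaction(TransactionType.WRITE) as tx:
                # look up the string attribute type defined in your schema
                text_type = tx.concepts().get_attribute_type("text")
                # put() stores the Python string exactly as provided, so no
                # TypeQL escaping is involved (some client versions require
                # casting the type to its string form before calling put())
                text_type.as_remote(tx).put(raw_value)
                tx.commit()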

If you had some example pieces of data that were causing problems upon attempting insertion this would help greatly.

Thanks for the pointer, this looks like the right approach. I will test it with the most problematic strings and report back.

Hi @james.williams ,

I am the one Paolo has tasked with fixing this problem. I have a tricky demo string; it’s from this STIX resource (Using Granular Markings (oasis-open.github.io)), see line 31 of the JSON tab.

So my input variable name and string look like:

"pattern": "[email-message:from_ref.value MATCHES '.+\\\\banking@g0thamnatl\\\\.com$']"

The aim is to import that string, query it back out, and match the before and after strings using a unit-test assert, along the lines of the sketch below.
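In other words, the check I want to pass is something like this (a sketch; import_then_export is just a placeholder name for my own load-and-retrieve pipeline):

    import unittest

    class TestPatternRoundTrip(unittest.TestCase):
        def test_pattern_round_trip(self):
            # the raw pattern, with two literal backslashes before "banking" and ".com"
            a_obj = "[email-message:from_ref.value MATCHES '.+\\\\banking@g0thamnatl\\\\.com$']"
            # placeholder for my import/export code (insert into TypeDB, then query back)
            b_obj = import_then_export(a_obj)
            self.assertEqual(a_obj, b_obj)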

At the moment, my import/export code produces an error, as shown below.

I have access to the attributes immediately before creating the TypeQL string, and immediately after retrieving them via the gRPC call, so I am happy to change either point to suit your recommendation.

Note that I am using the Python client.

Hey @modeller,

That’s quite the string! Could you post the error you’re getting back?

Thank you for the comprehensiveness of your post.

Hey @james.williams ,

Yep, it’s a doozy, but you can see the error in the image pasted in my original answer.

Basically, look at the two lines above the highlight in the image, where "a" is the input object and "b" is the output object.

As you can see, the returned object has twice as many backslashes as the input.

In fact I probably could have hard-coded a response, but I wanted to get your feedback on how best to handle it in Python.

Ah, I understand what you were demonstrating now. a_obj is your input and b_obj is what you get back out of TypeDB given that input.

We have an upcoming fix for how we handle string escapes, but it is on hold as it will break backwards compatibility; see Backslash escape character is always stripped by flyingsilverfin · Pull Request #224 · vaticle/typeql · GitHub.

Are you using the concept API in Python as a workaround as suggested in Escape hell: handling string values - #4 by james.williams? My understanding is that this will insert strings as they are provided. Let me know if that isn’t working for you.

Um, yeah, at the moment I transform the JSON models into TypeQL, because I thought it was a beautiful idea. This works perfectly, even with TypeQL object definitions of >50 lines, so it would be a hassle to convert to API calls for insert.

I imagine the best way in the interim is to hard-code a response in my gRPC attribute retrieval code, so that if there are multiple backslashes, I divide the number by two, or some such hack. Can you advise on the best approach to handle this during extraction?

You could certainly try! Prefacing this with the caveat that this approach is difficult and error-prone across large datasets: replace each run of backslashes with the ceiling of half its length (e.g. 4 backslashes become 2, 5 become 3, 6 become 3); the ceiling matters for odd-length runs, where the last backslash may be escaping a non-backslash character.
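Something along these lines is what I mean (an untested sketch of that heuristic, not a recommended long-term fix):

    import re

    def collapse_backslash_runs(s: str) -> str:
        # replace each run of n backslashes with ceil(n / 2) backslashes,
        # assuming escaped backslashes come back doubled and a trailing odd
        # one was escaping some other character
        return re.sub(r"\\+", lambda m: "\\" * ((len(m.group(0)) + 1) // 2), s)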

Interested to hear how you get on.

So @james.williams, just to show you what we did:

I took the return value, at the point where I was still making Concept API calls, and applied a replace to every string, like so:

        elif a.is_string():
            # collapse the doubled backslashes that come back from TypeDB:
            # "\\\\" in Python source is two literal backslashes, "\\" is one
            temp_string = a.get_value()
            prop["value"] = temp_string.replace("\\\\", "\\")
            prop['datetime'] = False

This seems to work a treat, although perhaps that is only true of the examples we have tested it on. It is built into our unit tests as well.

Hey @modeller, glad to hear this worked for you!

I’ll update this thread when our fix is deployed into a new version of TypeDB, at which point this should be more ergonomic.