Escape hell: handling string values

Hi there,
I am ingesting a large corpus of data that contains a string-typed field of ASCII text. The text is highly varied and includes all sorts of commas, quotes, double quotes, back/forward slashes and so on.
I get a lot of failures when inserting because it is not even clear how to escape a simple double quote (I use a backslash, but then I get the backslash back in the stored value too).
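For example (a simplified illustration with a made-up note/text schema, not my actual data), this is the kind of thing that trips me up:

    # hypothetical note/text schema, just to show the shape of the problem
    value = 'she said "hello"'
    query = f'insert $n isa note, has text "{value}";'
    # -> fails to parse, because the inner double quotes are not escaped

    query = 'insert $n isa note, has text "she said \\"hello\\"";'
    # -> parses, but when I read the attribute back the backslashes are still there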
I am thinking there could be two ways to solve this:

  • we enable BASE64 encoding for string values, together with a type marker (ASCII/UTF)
  • we enable a sort of raw insertion mode via the data API instead of the query API

I think the first option is simpler to implement (sketched below).
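To be concrete, the idea is that the value travels through the query layer as plain base64, which contains only characters that never need escaping, and decoding on the way out recovers the original bytes. A client-side sketch of the round trip (not an existing TypeDB feature, just what I have in mind):

    import base64

    raw = 'a "tricky" value with a back\\slash, commas, and more'
    encoded = base64.b64encode(raw.encode("ascii")).decode("ascii")
    # 'encoded' contains only [A-Za-z0-9+/=], so it can be embedded in a query
    # string with no escaping concerns; decoding recovers the original bytes
    assert base64.b64decode(encoded).decode("ascii") == raw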
@james.williams

Hi, thank you for reporting this.

We’re looking at this internally and aiming to align with the standard SQL & NoSQL ways of handling strings (i.e. backslashes that escape characters aren’t stored in the database).

Based on what you’ve said in your post, only double quotes and backslashes that aren’t intended to escape characters should be an issue for you right now. Have you had a look at using our concept API? Here’s a link: Concept API | Vaticle.

Hello James,
that is correct; for now my most common use cases involve URL strings (backslashes, ampersands, etc.) and text strings with single/double quotes.
Can you elaborate more on the Concept API?
I don’t see an entity method that would allow me to insert a thing programmatically.

There are some examples of inserting data using the concept API in the Python client tests - see test_get_attributes_by_value and below:
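In the meantime, here is roughly what inserting a string value through the concept API looks like (an untested sketch assuming a TypeDB 2.x Python client; the address, database name and attribute type label are placeholders, and exact method names vary slightly between client versions, so treat the linked tests as the authoritative reference):

    from typedb.client import TypeDB, SessionType, TransactionType

    raw_value = 'a "tricky" string with a back\\slash'

    with TypeDB.core_client("localhost:1729") as client:
        with client.session("my_database", SessionType.DATA) as session:
            with session.transaction(TransactionType.WRITE) as tx:
                # look up the string attribute type defined in your schema
                text_type = tx.concepts().get_attribute_type("text")
                # put() stores the Python string exactly as provided, so no
                # TypeQL escaping is involved (some client versions require
                # casting the type to its string form before calling put())
                text_type.as_remote(tx).put(raw_value)
                tx.commit()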

If you had some example pieces of data that were causing problems upon attempting insertion this would help greatly.

Thanks for the pointer, this looks like the right approach. I will test it with the most problematic strings and report back.

Hi @james.williams ,

I am the one Paolo has tasked with fixing this problem. I have a tricky demo string; it’s from this STIX resource (Using Granular Markings (oasis-open.github.io)), see line 31 of the JSON tab.

So my input variable name and string look like:

"pattern": "[email-message:from_ref.value MATCHES '.+\\\\banking@g0thamnatl\\\\.com$']"

The aim is to import that string, query it back out, and match the before and after strings using a unit-test assert, along the lines of the sketch below.
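In other words, the check I want to pass is something like this (a sketch; import_then_export is just a placeholder name for my own load-and-retrieve pipeline):

    import unittest

    class TestPatternRoundTrip(unittest.TestCase):
        def test_pattern_round_trip(self):
            # the raw pattern, with two literal backslashes before "banking" and ".com"
            a_obj = "[email-message:from_ref.value MATCHES '.+\\\\banking@g0thamnatl\\\\.com$']"
            # placeholder for my import/export code (insert into TypeDB, then query back)
            b_obj = import_then_export(a_obj)
            self.assertEqual(a_obj, b_obj)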

At the moment, my import/export code produces an error, as shown below.

I have access to the attributes immediately before creating the TypeQL string, and immediately after retrieving them via the gRPC call, so I am happy to change either point to suit your recommendation.

Note that I am using the Python client.

Hey @modeller,

That’s quite the string! Could you post the error you’re getting back?

Thank you for the comprehensiveness of your post.

Hey @james.williams ,

Yep, it’s a doozy, but you can see the error in the image pasted in my original answer.

Basically, look at the two lines above the highlight in the image, where "a" is the input object and "b" is the output object.

As you can see, the returned object has twice as many backslashes as the input.

In fact I probably could have hard-coded a response, but I wanted to get your feedback on how best to handle it in Python.

Ah, I understand what you were demonstrating now. a_obj is your input and b_obj is what you get back out of TypeDB given that input.

We have an upcoming fix for how we handle string escapes, but it is on hold as it will break backwards compatibility; see Backslash escape character is always stripped by flyingsilverfin · Pull Request #224 · vaticle/typeql · GitHub.

Are you using the concept API in Python as a workaround as suggested in Escape hell: handling string values - #4 by james.williams? My understanding is that this will insert strings as they are provided. Let me know if that isn’t working for you.

Um, yeah, at the moment I transform the JSON models into TypeQL, because I thought it was a beautiful idea. This works perfectly, even with TypeQL object definitions of >50 lines, so it would be a hassle to convert to API calls for insert.

I imagine the best way in the interim is to hard-code a response in my gRPC attribute retrieval code, so that if there are multiple backslashes, I divide the number by two, or some such hack. Can you advise on the best approach to handle this during extraction?

You could certainly try! Prefacing this with the caveat that this approach is difficult and error-prone across large datasets: replace each run of backslashes with the ceiling of half its length (e.g. 4 backslashes become 2, 5 become 3, 6 become 3); the ceiling matters for odd-length runs, where the last backslash may be escaping a non-backslash character.
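Something along these lines is what I mean (an untested sketch of that heuristic, not a recommended long-term fix):

    import re

    def collapse_backslash_runs(s: str) -> str:
        # replace each run of n backslashes with ceil(n / 2) backslashes,
        # assuming escaped backslashes come back doubled and a trailing odd
        # one was escaping some other character
        return re.sub(r"\\+", lambda m: "\\" * ((len(m.group(0)) + 1) // 2), s)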

Interested to hear how you get on.

So @james.williams, just to show you what we did:

I took the return value, at the point where I was still making Concept API calls, and applied a replace to every string, like so:

        elif a.is_string():
            # collapse the doubled backslashes that come back from TypeDB:
            # "\\\\" in Python source is two literal backslashes, "\\" is one
            temp_string = a.get_value()
            prop["value"] = temp_string.replace("\\\\", "\\")
            prop['datetime'] = False

This seems to work a treat, although perhaps that is only true of the examples we have tested it on. It is built into our unit tests as well.

Hey @modeller, glad to hear this worked for you!

I’ll update this thread when our fix is deployed into a new version of TypeDB, at which point this should be more ergonomic.