An Algebraic Specification/Schema for JSON

JavaScript Object Notation (JSON) is an open standard data format that is used widely across the internet as means of exchanging structured data due to its low overhead. While originally created in the early 2000s, it has only gained standard status in 2013 and then again in 2017 with a new version that focused more on security and interoperability. In this paper the authors present a different specification of the JSON standard that relies on algebraic formal methods and provides certain benefits over a regular natural language specification. This specification can also function as a schema that can attest a JSON data document’s compliance to its blueprint. The absorption of Formal Specification methods by the industry happens at a very slow pace, mostly because there is little incentive to tread into a fairly unknown territory. Notwithstanding this reluctance, the authors encourage the usage of Formal Specification techniques to the specifications of open standards; Formal specifications are more succinct, less ambivalent, consistent to the standard, reusable as they support module inheritance and can be executable. The process of designing new Standards can benefit from Formal Specifications as the resulting specification i) is more tangible; ii) allows a thorough and clear understanding of the standard and also iii) allows property checking and property verification.


Introduction
Standards released by a standard setting organization are usually accompanied by a publicly-available delineating specification document that consists of the requirements that any implementation of the standard has to hold [1]. However, because such a document is written in a natural language there can be some shortcomings: 1. Verbosity: Just like any technical document, a specification document written in any natural language often yields quite lengthy documents. For instance, the specification of Digital Imaging and Communications in Medicine (DICOM), an open standard that describes the operations that can take place in medical imaging (e.g. handling, storing, printing, and transmitting information) consists of 18 different documents ( [2]) and a total of 4900 pages.
2. Vagueness: Using natural languages to express terms that depend on being expressed with precision and unambiguity can be an arduous task that usually results in documents that are difficult to comprehend [3]. Words with different meanings or phrases that hypothesize some common background can make this issue even worse. Clarity is important when trying to formalize a statement.
3. Requirements intermixture: Due to the non modal nature of natural languages, we often see the merging of a standard's disparate requisites [3]. This renders the tracking of the consequences that a change incurs rather cumbersome, since it is better to examine each requirement individually rather than process a fusion of different yet connected requirements. 4. Requirements confusion: Different kinds of a standard's properties can appear mixed up; functional and non functional requirements, system goals and design information are at times used as if they are the same thing [3] .
Those shortcomings can create issues throughout the standard's development and specification stages, issues that are frequently ascertained at the latest stages and are therefore, costly to correct [3]. These drawbacks can be present to any standard's specification document (open or not) but this paper focuses on open standards since their specifications are publicly available. In this paper we: 1. Reason about how a Formal Specification of an open standard can deal with the issues of specifications written in natural languages that are discussed in this section.
formal versions.
3. Address the points raised earlier and argue about how this specification methodology is better than the natural language one. In order to do that, a test case will be presented in which we use the formal specification of JSON as a schema, in order to verify additional properties.
This paper is an extended version of [4], a short conference paper that introduced an algebraic schema for JSON and presented the basics of treating JSON objects in an algebraic way. This paper provides the details for almost all operators needed to handle a JSON object as an algebraic item as well as how they operate together and, in an attempt to respond in detail to comments from the conference review, it also expands more on how a formal specification of an open standard can be beneficial to the standard as it can be better than a specification written in natural language. Both this paper as well as its predecessor continue in a similar path as [5], in which the authors have formalized XML structures and used the result to create a formal version of RSS v2.0 open web standard.

Open Standards
While the definition of a Standard can be quite straightforward (an established norm/requirement for a repeatable technical task), it's the term "open" that makes the definition tricky as that word can have a different meaning depending on the point of view; dictionaries, national IT agencies, World Trade Organization (WTO), etc. all provide different points of view for the term "open". An intersection of all those different definitions deals with their motivation, development and usage. The motivation to create and promote an Open Standard usually has to do with promoting interoperability. When it comes to its development, an Open Standard: 1. Most of the times emerges by a process that can be effortless for anyone to attend.
3. Is not regulated by any particular collective or vendor.

Advantages of Open Standards
Developing and adhering to an open standard does not produce any financial gain, at least not in the short term but in the long run, the interoperability, the sustainable development and the smooth communication channels that come along, all give value to this investment. Additional benefits from following Open Standards are summarized below [8,9]: • Open Standards ensure that a vendor cannot gain full control over a format. Since an open standard's specification is available to the public, another party can always implement a solution that adheres to its requirements if the previous solution stopped working for whatever reason.
• Open Standards provide smooth interoperability between different systems that may even use different technologies. Communication between such systems can be difficult if there is not an "agreement" to use a pre-approved channel that each system can identify and work with. For example, a company that requires that all its desks use office software applications compatible with the Open Document format (an Open Standard) gives each employee the freedom of choosing to use any such office software, ensuring full compatibility with the rest of the company.
• Open Standards can act as a buffer against applications that stop being developed. If such an application was using a proprietary data format then an end user would have a difficult time trying to save the data in a way that some other application could use. However, if that application was using an open standard document format, it would not be difficult for another application to make use of that data. The issue of program data becoming unusable can be even worse for any big organizations like government agencies (e.g. police [10]) or national Electronic Health Records ( [11]).

Problems of Open Standards
On the other hand, Open Standards have some noteworthy drawbacks. Firstly, creating an Open Standard (with all its steps; designing, reviewing, rating, implementing) can be quite costly in resources. For instance, the amount of time required for an Open Standard to go through that process varies, from a few months to many years, depending on the standard-setting organization. Also, interoperability can be achieved even with non open standards. There is an abundance of proprietary standards that come with a Reasonable and non-discriminatory licensing (RAND) that do not hinder in any way interoperability (e.g. GSM, CD, DVD, MPEG, Wi-Fi, etc.) [8], [9].

About Algebraic Specifications of Open Standards
Formal (especially algebraic) specifications can complement (or even substitute) the original natural language specification of the standard as they can deal with the problems discussed in Section 1: 1. Formal specifications are most often than not small in size, mainly due to their modular nature and their basis on mathematics.
While this process can be applied to any kind of standard, Open Standards are better candidates for a formal specification since their original specification, the one written in a natural language, is available for anyone to read and attempt to improve by formalizing (versus trying to reverse engineer a proprietary standard). Considering that this process is not done by the standard-setting organization, any attempts to formalize a standard are only as good as the understanding of the standard by the reader. Depending on how intricate and entangled the original specification is, such a task can be of considerable difficulty. This process may also uncover flaws in the standard; since the formalization process forces someone to ask the right questions, there have been cases in which ambiguities in original natural language specifications have been detected, for example the specification of "time-to-live" element of Really Simple Syndication (RSS) in [5]. Naturally, the time and resources needed for a formal specification are increased (or differently allocated according to Sommerville [3]), unless the formal specification is developed along the standard.
Formal Methods may at first alienate many readers, especially those not sufficiently versed, however getting acquainted is not such a difficult task [12,13] and the benefits may make it worthwhile. Perhaps until Formal Methods become more popular, a middle ground would be to have both natural language and formal specifications for standards.

CafeOBJ
The formal specification of JSON that is presented in Section 3 is written using CafeOBJ, which according to its webpage ( [14]) is an "advanced formal specification language which inherits many advanced features (e.g. flexible mix-fix syntax, powerful and clear typing system with ordered sorts, parametric modules and views for instantiating the parameters, and module expressions, etc) from 'OBJ' (or more exactly 'OBJ3') algebraic specification language".
The OBJ (and OBJ3) family has been used in many applications ( [15]) such as the debugging of algebraic specifications, rapid prototyping, the executable definition of programming languages, the specification of software systems (such as GKS graphics kernel system, Ada configuration manager, Macintosh QuickDraw, etc), specification of hardware, specification of user interface designs and theorem proving. The main axiomatic systems that it uses are order-sorted algebras [16,17] (used to specify abstract data types) and hidden algebras [16,18] (used to specify abstract state machines, enabling object-driven specifications).
Algorithm 1 shows a CafeOBJ module for factorials. A specification in CafeOBJ consists of modules; places where sorts, operators and equations are declared [19]. The keyword mod! initiates a module and is followed by curly brackets that define the start and end of the module's code block. The next line, pr(INT) imports the built-in integer module, allowing sort, operator and equation inheritance. To declare an operator, keyword op is used. Operators are like functions and they need explicit domains and codomains. In Algorithm 1, operator factorial assigns an integer (domain) to another integer (codomain). A variable in CafeOBJ is declared with the keyword var followed by its name and domain. Here, variable N belongs to INT , the sort of Integers that was imported into the module with the pr command. To write the equations describing the factorial we will need a recursive case as well as a base case, as shown in the last two lines of the module, starting with ceq. Once the module declaration is over, command select selects the created module so that we can use its definitions to calculate the factorial of a number. The command red command will try to reduce f act5 to its value using the module's equations and left-to-right rewriting rules. Lastly, inline comments in CafeOBJ use the −− prefix. (5)) . (JSON) is an open standard data format and file interchange format that does not bind itself to any specific language and declares entities as lists of properties in the attribute: value format [20]. JSON started as a non-strict subset of JavaScript [20], as defined in the European Computer Manufacturers Association Script (EC-MAScript) Programming Language Standard, Third Edition [21], but ever since, it can be (and has been) used in many programming languages. JSON was originally designed to be minimal, portable, textual, a subset of JavaScript [21] but not too far away from the broad spectrum of C-family of languages ( [22]).

JavaScript Object Notation
JSON became a standard originally in 2013, under the name ECMA-404 [23]. The second version of the ECMA-404 standard came along in 2017 [24]. Request for Comments (RFC) published JSON as a standard in 2017 [21] and this is the current JSON standard version that is compliant with ECMA-404 [24].
JSON can express structured data in six ways; the four non-structured primitive types (booleans, strings, numbers and the unique "null" type) and two structured types (arrays and objects) [21]. Arrays and objects can be nested into larger, complex JSON structures, [22].

Uses
JSON has become a very popular data format, with a broad range of applications. It is being used as a configuration language, storing application settings in its low-overhead format, although surprisingly, it lacks support for comments, discouraging people from meddling directly with www.jenrs.com Journal of Engineering Research and Sciences, 1(5): 243-250, 2022 such files. JSON is intended as a lightweight alternative to the widespread XML format that has bigger space requirements as XML uses more overhead to store its data. JSON can be used with most of the languages of the C-family (such as C, C++, C#, Java, JavaScript, Perl) but is also supported by many more, such as Python, Filemaker, .Net, Matlab, Lisp, Haskell, Fortran, OCaml, Prolog, Rust, Scala, etc. [22].

Similar works
Formal versions of JSON schemas have been created in the past, e.g. [25] in which JSON is used as a JSON schema. There also is [26], in which the authors display an Extended Backus-Baur Form (EBNF) grammar of JSON. While JSON and XML are different, formalizing XML is quite similar in nature with the ideas presented in this paper; the authors in [27] have created a formal framework for XML schema languages. What distinguishes this paper from the rest is that a valid schema for JSON is created using an algebraic specification language.

JSON entities as algebraic objects
The proposed schema for JSON data (Section 3.3) treats JSON objects as lists (Section 3.1) and is written in CafeOBJ (Section 2.3). In order to create an algebraic specification of a JSON file in CafeOBJ, the JSON file has to be converted into a format that CafeOBJ can read and process. A simple command line converter has been created that parses a JSON file and outputs it into a format that we will call JSON-OBJ. The rules for such a translation can be seen in Section 3.2. The resulting JSON-OBJ file can now be loaded into CafeOBJ and can be reasoned with, as seen in Section 3.4.
While the reason for choosing CafeOBJ as the language for this formal schema is the authors' familiarity with it, we would argue that a formal specification of JSON would work equally well with any Formal Method tool; especially those that put more weight on data structures like Z, VDM, Estelle or any language of the OBJ family (e.g. Eqlog, FOOPS, Kumo, Maude, OBJ2, OBJ3). Selecting a proper Formal Method tool can be quite a daunting task as lots of different tools exist, all specializing in different features, yet often having some overlap with other tools.

Lists
JSON describes, in its essence, a list of structured data. A list is a collection of elements where an element can be a list itself. A list module provides operators that can i) append an element into a list, ii) search if an element is contained in one, iii) remove an element from a list and iv) concatenate two lists. These operations are specified in a recursive manner.
For example, Algorithm 2 shows the equations necessary for removing an element from a list. The operator "\" takes an element and list and removes that element from the list, if the list contains that that. The first two equations provide the base cases and the next two the recursive cases. The recursive cases state that given an element Elem 1 that's merged with a list Lst; if we want to remove an element Elem 2 from that list, then if Elem 1 is the same as Elem 2 then the operation returns Lst, otherwise the removal operation is now applied to Lst.
Algorithm 3 shows the declaration and equations for querying a list for the existence of an element. The operator "//in" returns true if a given element is inside a given list. The base cases describe how we check an element against another element or against an empty list. The recursive case states that an element Elem 1 exists in a list that consists of an element Elem 2 and a sublist Lst, only if Elem 2 is Elem 1 or Elem 1 exists in Lst.

JSON data in CafeOBJ format
The most basic JSON element comes in the form of {"key": "value"} and that would get converted into <"key" [ txt( "value" ) ] >. Symbols "<" and ">" contain the element's name inside quotes and its following value inside brackets. Values can be in a number of different types (string, number, boolean, object, array, null), so the value of an element is held within the right operator for its type; string values are contained inside the txt operator, log for boolean values, int for integers, etc. A more complex JSON element is the object, which is a comma separated collection of key:value elements held inside curly brackets. Its equivalent form in CafeOBJ notation is a series of parenthesized elements, concatenated with the @ operator.
The last JSON element type is the array, an ordered, comma separated collection of key:value elements that is contained within brackets. Since the sequence of the array elements needs to be preserved, the < "key" [ txt( "value" ) ] > format is extended so that the n th element of such an array would be declared as ({ n } <"key" [ txt( "value" ) ] >).
A value in JSON can be another JSON element, as this format supports nested elements. Some sample JSON data can be seen in Algorithm 4 and its conversion to the JSON-OBJ format can be seen in Algorithm 5. Algorithm 5 also shows how each JSON data type gets converted into its appropriate JSON-OBJ format (string -Name, boolean -isAlive, number - age, objectaddress, array -phoneNumbers and nullspouse).

Algebraic Schema for JSON
The operators in the algebraic schema for JSON can be employed to parse and validate a JSON-OBJ (as described in Section 3.2) file against its requirements. The operators used in the schema can be seen in Table 1 and are used to extract the data stored (and the way it is organized) in any JSON file. The output can then be used in order to check the file against its specification.
In this paper we will describe operators jNodeExists and getParent.

getKey
Given an element, it returns its key name getValue Given an element, it returns its value returnJnode Given a list of elements and an element's key, it returns the whole element. If the element is within an array, returns the parent as well jNodeExists Given an element's key name and a list of elements, it searches for the key in the element list and returns a boolean value hasNextElement Given a list of elements and an element's key, it returns a boolean value based on whether the key has a next element getParent Given an element's key name and a list of elements, searches for the key name inside the list and returns its parent, if there is one getArrayNo Given an array and an element's key, it returns its array index value

jNodeExists
Algorithm 6 displays the declaration for the jNodeExists operator along with its equations; The operator takes a key name (KeyName) and a list of elements (ElementList) and returns a true/false value. In the equations that define the operator, Key 1 and Key2 are variables that describe a key name. Cont is a variable that describes content of any sort. Elem 1 and Elem 2 are variables that describe elements and N is a natural number variable that describes the N th element of an array. The base-case of the operator's recursive definition is given in the first two equations: if the list of elements is just a single element then either its key name matches the one given to the operator and the operator returns true, or there is no match and the operator returns false. Two base-case equations are needed considering that keyname can also try to match a single element of an array. • Given a key name Key 1 and an element list that is made out of two elements, the operator will call itself on the first element and if there is no match, it will do the same to the second element.
• Given a key name Key 1 and an element with key name Key 2 that is concatenated with a list of elements, it will compare Key 1 against Key 2 and if they are different it will continue recursively with the list of elements until the expression collapses to one of base case equations. Again, two recursive equations are required due to the two different ways an element can present itself; as part of an array or in any other way.

getParent
Algorithm 7 displays the declaration for the getParent operator along with its equations; The operator takes a key name K and a list of elements (ElementList) and then seeks element K inside ElementList, returning the key name of K's "parent", or a null value if that element is not found or has not parent.
Since this operator relies on jNodeExists (Section 3.3.1), this shows the way multiple operators can work together.
In the equations that define the operator, Key 1 , Key 2 and Key 3 are variables that describe key names. Cont is a variable that describes content of any sort. Elem 1 and Elem 2 are variables that describe elements.
The first two equations describe the base case: If asked to retrieve the parent of an element named Key 1 from a single element Elem1, the operator returns null (the special nil operator). If asked to retrieve the parent of an element named Key 1 from an element called Key 2 that contains a single nested element (Elem), then it will invoke jNodeExists, to search for element Key 1 inside Elem, returning K 2 if found, or nil otherwise.
The next three equations describe what may happen if we are searching for the parent of element with key name Key 1 in the following scenarios: • The element list is a concatenation of a simple element and any other element: getParent operator ignores the simple element as it cannot have a parent and gets applied to the other element.
• The element list is a concatenation of a simple element and one with nested elements: getParent operator ignores again the simple element and gets applied to the element with the nested elements.
• The element list is a concatenation of two elements that both have sub-elements: if the first element contains Key 1 then getParent returns its key Name otherwise it is applied to the second element.

Test case
This section examines a hypothetical scenario in which a company's management software keeps its contacts in JSON format and those contacts need to adhere to the following traits: 1. Entries for Name, address and isAlive are necessary.
2. Address must have entries for streetAddress, city and postalCode.
4. Age of contact should be between 18 and 60 years old.
To validate the JSON stored contacts against those requirements, the contacts data file gets converted into the JSON-OBJ format of Algorithm 5 and then it gets tested against the requirements of the company. For this test case, the contacts in Algorithm 4 will be used.
In order to validate the contacts against those requirements, we will need to create an operator that formalizes those requirements and then tries to validate a given JSON-OBJ file against them. This operator, called properschema returns true only if all requirements are held. Algorithm 8 contains the operator's signature and equation; given an ElementList E, it checks it against all 4 of the requirements. Operator properschema: 1. Verifies that the three entries of the first requirement exist. To achieve that, operator jNodeExists is called for each entry key name and responds with a boolean. is used; it uses the related operators' equations to try to rewrite the term, eventually returning a boolean value. Output Sequence 1 shows the result of the execution. CafeOBJ returns true which means that the JSON data satisfy the four wanted requisites. In case the output was false, we can identify through CafeOBJ's output which of the requirements failed to evaluate to true and then act accordingly.

Conclusions and future work
This paper presented an algebraic specification / schema for the JSON Open Standard written in the CafeOBJ language. The proposed formal specification provides a framework for treating JSON objects in an algebraic way that allows formal requirement specification and JSON document validation against a schema.
Formal specifications are in general small in size and our test case is no exception; at only 26 lines of code, it is quite small in size, even with comments present. The requirements in the test case's formal specification are laid out in a clear and non-ambiguous way, provided one can familiarize themselves with the operators and the syntax of the specification. That aside, we believe that such a specification can be quite easy for someone to understand. The way the requirements are presented makes it easier to separate the requirements from one another allowing to track the changes one of them may bring to the whole specification thus reducing any issues of requirement intermixture or confusion. Finally, since this specification is in algebraic format, it is also executable allowing us to verify sample data against the specification.
Despite those advantages, there have not been many attempts in formalizing standards. One reason for this discrepancy could be the plethora of different formal methods that can make early steps such as choosing a tool quite difficult, and also the questionable applicability of a specific formal method to the standard.
The authors hope to continue work on using formal methods to create specifications of more open standards, especially web data standards (e.g. YAML or TOML) in order to achieve benefits over natural language specifications.