N. Rishe. ``A File Structure for Semantic Databases.'' Information Systems, 16, 4 (1991), pp. 375-385. Copyright © 1991

A File Structure for Semantic Databases

Naphtali Rishe

School of Computer Science
Florida International University -
The State University of Florida at Miami
University Park, Miami, FL 33199

This paper presents a highly efficient file structure for the storage of semantic databases. A low-level access language is presented, such that an arbitrary query can be performed as one or several elementary queries of the language. Most elementary queries, including such nontrivial queries as range queries and others, can be performed in just one single access to the disk.

Keywords: Semantic databases, implementation, Semantic Binary Model, file structure, query languages, transactions, indices, B-tree, efficiency, database access primitives, facts, inverted storage.

1. INTRODUCTION

Since [Abrial-74], many semantic data models have been studied in the Computer Science literature. Although somewhat different in their terminology and their selection of tools used to describe the semantics of the real world, they have several common principles:

The advantages of the semantic models versus the Relational and older models with respect to database design, database maintenance, data integrity, conciseness of languages, and ease of DML programming have been discussed in many works, e.g. in [Rishe-88-DDF]. This paper advocates the potential of an efficient implementation for the semantic models.

Several semantic data models have been implemented as interfaces to database managements systems in other data models, e.g., the relational or the network model [Tsur&Zaniolo-84]. (There are also less typical, direct implementations, e.g. [Lien&al.-81], [Chan&al.-82], [Benneworth&al.-81].) The efficiency of an interface implementation is limited to that of the conventional DBMS, and is normally much worse due to the interface overhead. The direct implementations are commonly believed to have to be less time-efficient than the conventional systems, as a trade-off for the extra services that semantic databases provide. However, this author contends that the semantic models have the potential for a much more efficient implementation than the conventional data models. This is due to two reasons:

In this paper, the author uses the Semantic Binary Model (SBM) [Rishe-87-RM], [Rishe-88-DDF], [Rishe-89-SD]), a descendant of the model proposed in [Abrial-74]. This model does not have as rich an arsenal of tools for semantic description as can be found in some other semantic models, e.g. the IFO model [Abiteboul&Hull-84], SDM [Hammer&McLeod-81] (implementation [Jagannathan&al.-88]), the Functional Model [Shipman-81] (implementation [Chan&al.-82]), SEMBASE [King-84], NIAM ([Nijssen-81], [Verheijen&VanBekkum-82], [Leung&Nijssen-87]), GEM [Tsur&Zaniolo-84], TAXIS [Nixon&al.-87], or the semi-semantic Entity-Relationship Model [Chen-76]. Nevertheless, the SBM has a small set of sufficient simple tools by which all the semantic descriptors of the other models can be constructed. This makes SBM easier to use for the novice, easier to implement, and usable for delineation of the common properties of the semantic models. The results of this paper are practically independent of the choice of a particular semantic model, and therefore they apply to almost all of the other semantic models.

The semantic binary model represents the information of an application's world as a collection of elementary facts of two types: unary facts categorizing objects of the real world and binary facts establishing relationships of various kinds between pairs of objects. The graphical database schema and the integrity constraints determine what sets of facts are meaningful, i.e. can comprise an instantaneous database (the database as may be seen at some instance of time.)

Example 1-1.

Consider a database of which the following is a subschema:

      []  COMPANY - category

      []  PRODUCT - category

      []  company-name - attribute of COMPANY, range: String
          (1:1)

      []  description - attribute of PRODUCT, range: String
          (1:1)

      []  manufactures - relation from COMPANY to PRODUCT
          (m:m)

Figure 1-1. A subschema of a database
The following set  of  facts  can  be  a  part  of  a  logical
instantaneous database:

1.   object1  COMPANY

2.   object1  COMPANY-NAME  `IBM'

3.   object1  MANUFACTURED  object2

4.   object1  MANUFACTURED  object3

5.   object2  PRODUCT

6.   object2  DESCRIPTION  `IBM/SYSTEM-2'

7.   object3  PRODUCT

8.   object3  DESCRIPTION  `MONOCHROMATIC-MONITOR'
The formal semantics of the semantic binary model is defined in [Rishe-87-DS] using the methodology proposed in [Rishe-86-OD]. The syntax and informal semantics of the model and its languages (data definition languages, 4-th generation data manipulation languages, non-procedural languages for queries, updates, specification of constraints, userviews, etc.) are given in [Rishe-88-DDF]. A non- procedural semantic database language of maximal theoretically- possible expressive power is given in [Rishe-86-PS]. (In this language, one can specify every computable query, transaction, constraint, etc.)

The following section proposes an efficient storage structure for the Semantic Binary Model.

2. STORAGE STRUCTURE

2.1. Abstracted Level

Every abstract object in the database is represented by a unique integer identifier. The categories and relations of the schema are also treated as abstract objects and hence have unique identifiers associated with them. Information in the database can then be represented using two kinds of facts, denoted xC and xRy, where x is the identifier associated with an abstract object, C and R are the identifiers associated with a category or a relation respectively, and y is either an identifier corresponding to an abstract object or a concrete object (a number or a text string); xC indicates that the object x belongs to the category C; xRy indicates that the object x is associated with the object y by the relation R. Logically, the instantaneous database is a set of such facts.

2.2. Goals

2.2.1. Efficiency of retrieval requests

At the intermediate level of processing queries and program retrieval requests, the queries are decomposed into atomic retrieval operations of the types listed below. The primary goal of the physical file structure is to allow a very efficient performance for each of the atomic requests. Namely, each atomic retrieval request normally requires only one disk access, provided the output information is small enough to fit into one block. When the output is large, the number of blocks retrieved is close to the minimal number of blocks needed to store the output information.

1. aC Verify the fact aC. (For a given abstract object a and category C, verify whether the object a is in the category C.)

2. aRy Verify the fact aRy.

3. a? For a given abstract object a, find all the categories to which a belongs.

4. ?C For a given category, find its objects.

5. aR? For a given abstract object a and relation R, retrieve all y such that aRy. (The objects y may be abstract or concrete.)

6. ?Ra For a given abstract object a and relation R, retrieve all abstract objects x such that xRa.

7. a?+a??+??a Retrieve all the immediate information about an abstract object. (That is, for a given abstract object a, retrieve all of its direct and inverse relationships, that is, the relations R and objects y such that aRy or yRa, and the categories to which a belongs.)

(Although this request can be decomposed into a series of requests of the previous types, we wish to be able to treat it separately in order to ensure that the whole request will normally be performed in a single disk access. This will also allow a single-access performance of requests which require several, but not all, of the facts about an object, e.g., a query to find the first name, the last name, and the age of a given person.)

8. ?Rv For a given relation (attribute) R and a given concrete object (value) v, find all abstract objects x such that xRv.

9. ?R[v1,v2] For a given relation (attribute) R and a given range of concrete objects [v1 , v2 ], find all objects x and v such that xRv and v1<v<v2. (The comparison "<" is appropriate to the type of v.)

The elementary queries defined above form a lower-level language of retrieval from semantic databases. Any query in any language can be solved by performing several elementary queries and processing their results in the memory.

In the following, an estimate is given for the number of disk accesses per atomic retrieval request. Two cases are considered.

A. One disk access per request of small output.

In the predominant case, the amount of information to be output for a given atomic request is much less than one block. According to the Lemma, all the information to be output comprises one contiguous segment. The segment has as many facts as there are items to be output. Therefore, the segment is much less than one block. (In the physical storage the facts are prefix-compressed, so the physical space for each fact is normally just a few bytes.) Hence, normally the segment fits into one block. We can find the address of this block in the cache-resident B-tree index with the key factstart. Then, in a single access the block is brought to the memory. There is a small probability that the segment appears on the boundary of two blocks.In the latter case we may have to bring two blocks into the memory. On the other hand, the desired block(s) may have already been in cache and, thus, sometimes zero accesses are sufficient.

Thus, the retrieval efficiency for the atomic requests is the optimum, or very close to the optimum. (One cannot retrieve a memory- unavailable datum in less than one disk access.)

B. For large output, the efficiency is also close to the optimum.

When the output is larger than a block, so is the segment. If the output can be squeezed into n blocks, then n would be the theoretical optimum (not obtainable in any practical system) for the number of disk accesses per request. All the facts of the segment can be squeezed into slightly more than n blocks (depending on how much of the prefix can be compressed), say 1.1n blocks. Due the space maintenance policy of the B-tree, each physical data block is 75 percent full on the average. The segment may begin in the middle of one block and end in another; thus, on the average, one additional block has to be fetched (the actual overhead of this type ranges between 0 and 2 block fetches). Thus the total expected number of blocks to be fetched is 1.1n/0.75+1 = 1.47n+1. The number of disk accesses may be even less than that if some of the blocks are in cache.

3. COMPARISON TO PERFORMANCE OF IMPLEMENTATIONS OF THE Relational Model

The system proposed herein is not less efficient, and is normally more efficient, in both time and storage space than the relational model's implementations with multiple dense indices.

Let us consider a simple relational database composed of one relation T with attributes A1, A2,..., An. Let us assume that for each j there are queries of the type

get Ai where Aj = c (Q1)

and that each of those queries is required to be performed in a reasonable time.

For the purpose of physical implementation, the Relational Model's databases can be technically represented (without affecting the user) as certain semantic binary databases. Specifically, the above relational schema can be regarded in the Semantic Binary Model as a category T and relations Ai between the objects of T and values.

To assure reasonable time performance in the Relational Model for each of the above queries, we need a dense index on each of the attributes Ai. There are n index files (or n indices combined in one file in some implementations). The total size of the indices thus exceeds the size of the table T itself. Therefore the space overhead in the Relational Model is greater than 100 percent and, thus, is greater than the space overhead in the proposed semantic implementation. Also, in the semantic implementation there is only one physical file, while there are many physical files in the relational implementations; in some implementations there are as many files as:

number_of_tables x ( 1 + number_of_attributes_per_table )

The management of multiple files is not only a hassle but also contributes to additional space overhead due to allocation of growth areas for each file.

With respect to the time required to solve the simple queries of type Q1, it is the same in the best relational implementations and in the proposed semantic implementation. Namely, the time is:

( 1 + number_of_values_in_the_output) x time_to_retrieve_one_block

(In the relational implementation, there will be one visit to the dense index on Aj, and for every Aj =c found there, there will be one random access to the main table. In the semantic implementation, first the subquery ?Ai c will be solved, and then for every match x found the subquery xAi ? will be evaluated.) If in Query Q1 we desired to print many attributes Aj instead of just one, the same time results would be obtained in both implementations. Notice that in the semantic implementation proposed herein all the immediate information of an object, including all its attributes, is clustered together.

Now let us consider updates. Insertion of a row into the relational table takes replacement of one block in the main table and n blocks in the dense indices. In the semantic implementation there is insertion of the primary facts about the new object ob: obA1c1,..., obAncn (all the primary facts will appear in contiguous storage in one block) and n inverse facts in possibly n different blocks. Thus, here, as well as in the other types of simple updates, the performance of the semantic implementation is not worse than that of the relational implementations supporting efficiency of queries.

The advantages in the semantic implementation's performance become even more significant for more complex queries and updates. Though the detailed analysis of these is beyond the space limit of this paper, I would like to mention that, for example, queries requiring a natural join in the relational implementations would be more efficient in the semantic implementation because there are direct explicit relationships between the categories instead of relationships represented implicitly by foreign keys in the Relational Model. The gap in performance between the faster semantic implementation and the relational implementations is even greater when the relational keys are composed of more than one attribute and when the relationships between the tables are many-to-many, which requires an extra table to represent the many-to-many relationship in the relational implementations. The gap increases with the number of joins in the query. In general, the advantage in the efficiency of the proposed semantic implementation versus the relational implementations normally becomes greater as the complexity of the queries increases.

Of course, there are also major efficiency advantages in the semantic implementation in support of semantic complexities of the real world, which are very awkwardly and inefficiently implemented in the relational implementations. These complexities include intersecting categories, subcategories, categories with no keys, varying-length attributes, missing ("null") values, multiple values, etc.

4. CONCLUSION

We have implemented this data structure in a prototype DBMS at the University of California, Santa Barbara ([Vijaykumar-87], [Jain-87]). Further improved software is under development. Our implementation allows single-processor multi-user parallel access to the database. Optimistic concurrency control is used.

Although the best results are obtained from our DBMS for the Semantic Binary Model, it can also be used efficiently with all other major semantic and conventional database models. This is due to the fact that the Relational, Network, and Hierarchical data models can be implemented via the Semantic Binary Model (as shown in [Rishe-88-DDF]).

Currently, at Florida International University, we are working on a project, financed by the state government, to extend our semantic DBMS implementation into a massively-parallel very-high-throughput database machine [Rishe&al.-89-AM], to be composed of many (thousand[s]) processors, each equipped with a permanent storage device and a large cache memory. Our analysis has shown that the proposed file structure greatly increases the parallelism in the operations of the DBMS, which can be utilized by large-scale parallel machines.

Acknowledgment

The author gratefully acknowledges the advice of Narayanan Vijaykumar, Li Qiang, Nagarajan Prabhakaran, Doron Tal, David Barton, and Scott Graham.

References

[Abiteboul&Hull-84] S. Abiteboul and R. Hull. ``IFO: A Formal Semantic Database Model'', Proceedings of ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 1984.

[Abrial-74] J.R. Abrial, ``Data Semantics'', in J.W. Klimbie and K.L. Koffeman (eds.), Data Base Management, North Holland, 1974.

[Benneworth&al.-81] R.L. Benneworth, C.D. Bishop, C.J.M. Turnbull, W.D. Holman, F.M. Monette. ``The Implementation of GERM, an Entity-relationship Data Base Management System''. Proceedings of the Seventh International Conference on Very Large Data Bases. (Eds. C. Zaniolo & C. Delobel.) IEEE Computer Society Press, 1981. (pp 465-477)

[Bracchi&al.-76] Bracchi,G., Paolini, P., Pelagatti, G. ``Binary Logical Associations in Data Modelings''. In G.M. Nijssen (ed.), Modeling in Data Base Management Systems. IFIP Working Conference on Modeling in DBMS's, 1976.

[Chan&al.-82] Chan,A., Danberg,Sy, Fox,S., Lin,W-T.K., Nori,A., and Ries,D.R. ``Storage and Access Structures to Support a Semantic Data Model'' Proceedings of the Eighth International Conference on Very Large Data Bases. IEEE Computer Society Press, 1982.

[Chen-76] P. Chen. ``The Entity-relationship Model: Toward a unified view of data.'' ACM Trans. Databas Syst. 1, 1, 9-36.

[Hammer&McLeod-81] M. Hammer and D. McLeod. ``Database Description with SDM: A Semantic Database Model'', ACM Transactions on Database Systems, Vol. 6, No. 3, pp. 351-386, 1981.

[Jagannathan&al.-88] D. Jagannathan, R.L. Guck, B.L. Fritchman, J.P. Thompson, D.M. Tolbert. ``SIM: A Database System Based on Semantic Model.'' Proceedings of SIGMOD International Conference on Management of Data. Chicago, June 1-3, 1988. ACM-Press, 1988.

[Jain-87] A. Jain. Design of a Binary Model Based DBMS and Conversion of Binary Model Based Schema to an Equivalent Schema in Other Major Database Models. M.S. Thesis, University of California, Santa Barbara, 1987.

[King-84] R.King. ``SEMBASE: A Semantic DBMS.'' Proceedings of the First Workshop on Expert Database Systems. Univ. of South Carolina, 1984. (pp. 151-171)

[Leung&Nijssen-87] C.M.R. Leung and G.M. Nijssen. From a NIAM Conceptual Schema into the Optimal SQL Relational Database Schema, Aust. Comput. J., Vol. 19, No. 2.

[Lien&al.-81] Y.E. Lien, J.E. Shopiro, S. Tsur ``DSIS - A Database System with Interrelational Semantics''. Proceedings of the Seventh International Conference on Very Large Data Bases. (Eds. C. Zaniolo & C. Delobel.) IEEE Computer Society Press, 1981. (pp 465-477)

[Nijssen-81] G.M. Nijssen ``An architecture for knowledge base systems'', Proc. SPOT-2 conf., Stockholm, 1981.

[Nixon&al.-87] B. Nixon, L. Chung, I. Lauzen, A. Borgida, and M. Stanley. Implementation of a compiler for a semantic data model: Experience with Taxis.'' In Proceedings of ACM SIGMOD Conf. (San Francisco), ACM, 1987.

[Rishe-86-OD] N. Rishe. ``On Denotational Semantics of Data Bases.'' Lecture Notes in Computer Science, vol. 239 (Mathematical Foundations of Programming Semantics, ed. A. Melton), pp 249- 274. Springer-Verlag, 1986.

[Rishe-86-PS] N. Rishe. ``Postconditional Semantics of Data Base Queries.'' Lecture Notes in Computer Science, vol. 239 (Mathematical Foundations of Programming Semantics, ed. A. Melton), pp 275-295. Springer-Verlag, 1986.

[Rishe-87-DS] N. Rishe. ``Database Semantics''. Tech. Rep. TRCS87-2, University of California, Santa Barbara, 1987.

[Rishe-87-RM] N. Rishe. ``On Representation of Medical Knowledge by a Binary Data Model.'' Journal of Mathematical and Computer Modelling, vol. 8, 1987. (pp. 623-626)

[Rishe-88-DDF] N. Rishe. Database Design Fundamentals: A Structured Introduction to Databases and a Structured Database Design Methodology. Prentice-Hall, Englewood Cliffs, NJ, 1988. 436 pages.

[Rishe-89-SD] N. Rishe. ``Semantic Database Management: from microcomputers to massively parallel database machines.'' Keynote Paper, Proceedings of The Sixth Symposium on Microcomputer and Microprocessor Applications, Budapest, October 17-19, 1989, pp 1-12.

[Rishe&al.-89-AM] N. Rishe, D. Tal, and Q. Li. ``Architecture for a Massively Parallel Database Machine'' Microprocessing and Microprogramming (the Euromicro journal), 25 (1989) 33-38.

[Shipman-81] D.W. Shipman. ``The Functional Data Model and the Data Language DAPLEX'', ACM Transactions on Database Systems, v. 6, no. 1, 140-173, 1981.

[Tsur&Zaniolo-84] S. Tsur, C. Zaniolo. ``An implementation of GEM - supporting a semantic data model on a relational backend.'' In Proc. ACM SIGMOD Intl. Conf. on Management of Data, May 1984.

[Verheijen&VanBekkum-82] G.M.A. Verheijen and J. Van Bekkum. ``NIAM - An Information Analysis Method'', in Information Systems Design Methodologies: A Comparative Review, T.W. Olle, et al. (eds.), IFIP 1982, North-Holland.

[Vijaykumar-87] N. Vijaykumar. Toward the Implementation of a DBMS based on the Semantic Binary Model. M.S. Thesis, University of California, Santa Barbara, 1987.