Options for Storing Large JSON Documents in Autonomous Database
Introduction
Many sources provide data in the form of JSON documents that are exchanged via Kafka, messaging middleware, or files. As the source JSON documents may be quite complex, it is common practice to store them in the staging area of the data warehouse before they are transformed and analyzed in the data processing pipeline. Most data warehouses support storing and processing JSON documents alongside relational data.
Using JSON to extend relational schemas is also a great approach for improving the agility and flexibility of data warehouse models. The idea is simple - instead of creating a complex relational model that covers all required attributes, we design a “minimal” relational model and put all the volatile and sparse attributes into a JSON column. Data from the JSON column can easily be analyzed with SQL, and if the structure is extended or changed, the relational model remains the same.
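As a sketch of the idea, consider a hypothetical orders table with a few fixed relational columns and an extra_attributes JSON column (declared with the usual IS JSON check constraint). All the names below are made up for illustration; the volatile attributes can be read directly with Oracle's simple dot notation:
-- orders, extra_attributes, and the JSON attribute names are hypothetical
select o.order_id,
       o.order_total,
       o.extra_attributes.loyaltyTier as loyalty_tier,
       o.extra_attributes.giftMessage as gift_message
from orders o
where o.extra_attributes.loyaltyTier = 'GOLD'
If a new attribute appears in the JSON documents, only the query changes - the table definition stays the same.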
In this post I focus on storing, loading, and querying complex JSON documents in the Autonomous Data Warehouse (ADW), Oracle’s managed cloud database tuned and optimized for data warehouse workloads.
Specifically, I would like to find out whether there is any difference in storage efficiency, load performance, and query performance when using different storage options for complex JSON documents.
Test Configuration
The Oracle Cloud configuration I used for this post is depicted below.
- Oracle Compute instance is used as a client to load and query the JSON data. The instance runs Oracle Linux 7 on an E4-based AMD shape with 1 OCPU, 16 GB of RAM, and a Block Storage based boot volume. The instance is in a public subnet and can be accessed directly from the Internet via an SSH client. The JSON documents are stored on the boot volume of the instance.
- Autonomous Data Warehouse instance stores and queries the JSON data. The instance is configured with 1 OCPU and 1 TB of storage (i.e., the minimal sizing). It runs Oracle Database 19c. Autoscaling is disabled. The ADW instance is provisioned with public access, so it can be reached directly via a SQL*Net client or a web browser.
- Both the Compute and the ADW instances were used only for the JSON testing; no other workloads were running on them.
Test Data
To compare JSON performance, I used artificially created JSON documents from a travel booking application. These are complex documents with 5 levels of nesting and many objects and arrays. One document contains all the information about a booking - contact information, passenger information, payment, destination, travel segments, etc. A single JSON document is stored in a single file.
Document type | Number of JSON Documents | Number of Files | Average document size | Largest document size | Total size |
---|---|---|---|---|---|
Booking record | 1 M | 1 M | 23 kB | 62 kB | 23 GB |
JSON Storage Options
Oracle Autonomous Database 19c supports the following storage formats for JSON columns:
- VARCHAR2(4000) - recommended for small JSON documents not exceeding 4000 characters. These values are stored inline with the rest of the row, and queries benefit from Exadata pushdown processing.
- VARCHAR2(32767) - recommended for JSON documents not exceeding 32767 characters. This option also benefits from Exadata pushdown processing, as values of up to 4000 characters are stored inline.
- CLOB - not recommended for JSON documents, because with the default AL32UTF8 character set, characters in a CLOB take 2 bytes each, even if they could be represented by a single byte.
- BLOB - recommended for JSON documents that may exceed 32767 characters. A BLOB may be stored either uncompressed or compressed with a varying degree of compression (low, medium, or high).
- BLOB with the optimized binary format OSON - a new binary JSON format introduced in Oracle Database 21c and backported to Autonomous Database. It should provide the fastest query and update performance.
All these options require that the table has a check constraint IS JSON verifying that the content is a valid JSON document. Note that the default constraint applies a “lax” (i.e., less strict) validation of JSON; however, it is possible to define stricter validation in the check constraint.
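For example, strict validation could be declared along these lines (a sketch; the table is hypothetical and not one of the test tables defined below):
create table json_strict (
doc_id number not null,
doc_body blob,
constraint json_strict_pk primary key (doc_id),
constraint json_strict_check check (doc_body is json strict)
)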
Additionally, Oracle Autonomous Database supports Simple Oracle Document Access (SODA), which allows applications to work with JSON documents using a NoSQL-style API. With SODA, JSON documents are stored in Collections, with one Collection corresponding to one table. Collections use the BLOB data type with the OSON format to store JSON documents.
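For illustration (SODA is not one of the scenarios tested below), a Collection can be created from PL/SQL with the DBMS_SODA package; the collection name here is just an example:
declare
  col soda_collection_t;
begin
  -- creates the Collection and its backing table (with a BLOB column storing
  -- documents in OSON format), or opens the Collection if it already exists
  col := dbms_soda.create_collection('bookings');
end;
/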
Given the above constraints on storing JSON data and the test documents I am working with, I decided to test the following storage options:
1. JSON in BLOB column
create table json1 (
doc_id number not null,
doc_body blob,
constraint json1_pk primary key (doc_id),
constraint json1_check check (doc_body is json)
)
2. JSON in BLOB column with OSON format
create table json2 (
doc_id number not null,
doc_body blob,
constraint json2_pk primary key (doc_id),
constraint json2_check check (doc_body is json format oson)
)
3. JSON in BLOB column with OSON format and medium compression
create table json3 (
doc_id number not null,
doc_body blob,
constraint json3_pk primary key (doc_id),
constraint json3_check check (doc_body is json format oson)
)
lob(doc_body) store as (compress medium)
JSON Loading Approach
To load the JSON documents into Autonomous Database, I used a simple Python program that uses the cx_Oracle library to communicate with the Oracle Database. The program loads documents from files into the JSON tables one by one, committing the work after every 100 records. It also allows running multiple instances of the program to test performance with multiple parallel threads. Note that cx_Oracle supports both SQL and SODA interfaces.
Here is the code snippet to load JSON documents:
import os

doc_id = 0
for entry in os.scandir(directory):
    if entry.is_file() and entry.name.endswith('.json'):
        with open(entry.path, 'r') as file:
            doc_id = doc_id + 1
            doc_body = file.read()
            cursor.execute('insert into {0} values (:1, :2)'.format(target),
                           (doc_id, doc_body))
            # commit after every 100 records
            if doc_id % 100 == 0:
                connection.commit()
# commit the remaining records
connection.commit()
Sample Analytical Query
To test the performance of accessing JSON documents in the Autonomous Database, I used a simple query that counts the number of documents grouped by two first-level attributes. It queries all 1 million documents in a JSON table.
select /*+NO_RESULT_CACHE*/
to_char(created_date,'YYYY/MM/DD') as created_date, doc_status, count(*) as doc_count
from (
select jt.doc_status, to_date(jt.created_date,'YYYY-MM-DD"T"HH24:MI:SS') as created_date
from {0} j,
json_table(j.doc_body,'$' columns(
created_date path '$.createdDate',
doc_status path '$.docStatus'
) ) jt
)
group by to_char(created_date,'YYYY/MM/DD'), doc_status
order by to_char(created_date,'YYYY/MM/DD'), doc_status
Note that I disabled the Result Cache to get consistent results across multiple executions of the same query.
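Since the query extracts only two scalar fields, json_value could be used instead of json_table; here is a sketch of an equivalent formulation, shown against the json1 table of the 1st scenario:
select /*+NO_RESULT_CACHE*/
to_char(created_date,'YYYY/MM/DD') as created_date, doc_status, count(*) as doc_count
from (
select json_value(j.doc_body,'$.docStatus') as doc_status,
       to_date(json_value(j.doc_body,'$.createdDate'),'YYYY-MM-DD"T"HH24:MI:SS') as created_date
from json1 j
)
group by to_char(created_date,'YYYY/MM/DD'), doc_status
order by to_char(created_date,'YYYY/MM/DD'), doc_status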
Test 1 - Storage Efficiency
The first test compares how much storage is required by different storage options.
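The table and LOB segment sizes can be obtained with a query along these lines against the data dictionary (a sketch; not necessarily how the numbers below were collected):
select l.table_name,
       round(t.bytes/1024/1024)      as table_segment_mb,
       round(s.bytes/1024/1024/1024) as lob_segment_gb
from user_lobs l
join user_segments s on s.segment_name = l.segment_name
join user_segments t on t.segment_name = l.table_name
where l.table_name in ('JSON1','JSON2','JSON3')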
Scenario | Number of JSON documents | Size of source data | Size of TABLE segments | Size of LOB segments | Compression ratio |
---|---|---|---|---|---|
1. JSON in BLOB column | 1 M | 23 GB | 59 MB | 29 GB | 0.79 x |
2. JSON in BLOB column with OSON format | 1 M | 23 GB | 59 MB | 20 GB | 1.15 x |
3. JSON in BLOB column with OSON format and medium compression | 1 M | 23 GB | 78 MB | 12 GB | 1.87 x |
And here is a graphical representation of the JSON storage requirements, with the black line showing the size of the source documents.
As you can see, the 2nd scenario with the new binary JSON format (OSON) is more than 30% more efficient than storing JSON as text in a BLOB column. And by applying medium compression to the BLOB column in the 3rd scenario, we can further reduce the storage to roughly half the size of the source data.
Test 2 - Load Performance
The second test compares the load performance of the different scenarios. For every combination of scenario and level of parallelism, I loaded 1 million JSON documents (23 GB in total) 5 times. The presented numbers are averages across the 5 executions.
Scenario | Threads | Elapsed time | Records per second | Throughput per second | ADW CPU Utilization |
---|---|---|---|---|---|
1. JSON in BLOB column | 1 | 70 min | 238 | 6 MB | 12% |
 | 2 | 38 min | 436 | 10 MB | 24% |
 | 4 | 21 min | 776 | 18 MB | 38% |
 | 8 | 13 min | 1288 | 30 MB | 64% |
 | 16 | 10 min | 1686 | 39 MB | 94% |
2. JSON in BLOB column with OSON format | 1 | 64 min | 259 | 6 MB | 14% |
 | 2 | 33 min | 508 | 12 MB | 37% |
 | 4 | 18 min | 961 | 22 MB | 52% |
 | 8 | 11 min | 1538 | 36 MB | 85% |
 | 16 | 13 min | 1329 | 31 MB | 99% |
3. JSON in BLOB column with OSON format and medium compression | 1 | 43 min | 392 | 9 MB | 25% |
 | 2 | 31 min | 537 | 12 MB | 36% |
 | 4 | 16 min | 1026 | 24 MB | 68% |
 | 8 | 13 min | 1319 | 31 MB | 95% |
 | 16 | 17 min | 991 | 23 MB | 98% |
And here you can see the load performance as a chart, with Records per second on the Y-axis and Threads (Level of Parallelism) on the X-axis.
And on this chart you can see the average ADW CPU Utilization during the load. The Compute CPU Utilization is not shown, as it never exceeded 40% and the load was clearly not constrained by the Compute instance.
The 3rd scenario with JSON data in OSON format and medium compression provides the best load performance, scaling well up to 8 parallel threads on a 1 OCPU ADW instance. With more than 8 threads the performance deteriorates, as the ADW instance becomes overloaded.
The 2nd scenario with JSON data in OSON format (uncompressed) provides slightly worse performance than the 3rd scenario for 1 thread. However, for 2 and 4 threads it scales similarly to the 3rd scenario, and it provides superior performance for 8 threads. With more than 8 threads, the performance also degrades.
The 1st scenario with JSON data in text format delivers worse load performance for 1, 2, and 4 threads than the 2nd and 3rd scenarios with OSON format. Unlike them, it continues to scale up to 16 threads, as it utilizes the ADW instance less than the other scenarios.
Test 3 - Query Performance
The third test compares the performance of the simple analytical query against the JSON data. It measures how long the query takes for the different scenarios and levels of parallelism. In every thread the query was executed 8 times. Note that the query has to read and parse all 1 million JSON documents to produce the required results.
Scenario | Threads | Elapsed time | Avg. query duration | Number of queries | Queries per 10 minutes | ADW CPU Utilization |
---|---|---|---|---|---|---|
1. JSON in BLOB column | 1 | 76 min | 10 min | 8 | 1.05 | 26% |
 | 2 | 74 min | 9 min | 16 | 2.17 | 49% |
 | 4 | 79 min | 10 min | 32 | 4.05 | 89% |
 | 8 | 151 min | 19 min | 64 | 4.24 | 99% |
2. JSON in BLOB column with OSON format | 1 | 63 min | 8 min | 8 | 1.26 | 24% |
 | 2 | 61 min | 8 min | 16 | 2.62 | 42% |
 | 4 | 73 min | 9 min | 32 | 4.40 | 72% |
 | 8 | 108 min | 13 min | 64 | 5.93 | 98% |
3. JSON in BLOB column with OSON format and medium compression | 1 | 66 min | 8 min | 8 | 1.22 | 27% |
 | 2 | 58 min | 7 min | 16 | 2.75 | 51% |
 | 4 | 62 min | 8 min | 32 | 5.19 | 90% |
 | 8 | 132 min | 16 min | 64 | 4.85 | 98% |
And here you can see the query performance as a chart, with Queries per 10 minutes on the Y-axis and Threads (Level of Parallelism) on the X-axis.
The average ADW CPU Utilization during the query testing is shown on the next chart. A single query requires approximately 25% of the ADW OCPU. The Compute CPU Utilization is not shown, as it was negligible (about 1%) - all the query processing was done by the ADW instance.
The 3rd scenario with JSON data in OSON format and medium compression provides the best query performance, scaling well up to 4 parallel threads on a 1 OCPU ADW instance. With more than 4 threads the performance deteriorates, as the ADW instance is fully saturated.
The 2nd scenario with JSON data in OSON format (uncompressed) provides slightly worse performance than the 3rd scenario for 2 and 4 threads. On the other hand, it scales up to 8 threads, unlike the 3rd scenario with medium compression.
The 1st scenario with JSON documents in text format delivers significantly lower query performance than the other scenarios. This is probably caused by the need to parse and process the JSON documents for every query, while the binary OSON format provides an optimized access path to fields within the JSON.
Key Takeaways
- A. Consider the new binary JSON format (OSON) instead of the text format. The OSON format provides better query and load performance, and it is also more storage efficient. Therefore, I do not see any reason not to use the OSON format for all large JSON documents.
- B. Consider compression for JSON documents in BLOB columns. JSON documents in a compressed BLOB column provide load and query performance better than or comparable to an uncompressed BLOB column, while significantly reducing the storage requirements. The only catch seems to be a higher sensitivity on heavily utilized systems, where the performance degrades faster than with uncompressed storage.
- C. A single ADW OCPU can support up to 4-8 sessions loading JSON documents. Depending on the storage format, a single OCPU of ADW can support a maximum of 4 to 8 concurrent sessions inserting large and complex JSON documents. More concurrent sessions will lead to degraded performance, as the ADW instance becomes overutilized.
Note that these takeaways are valid for the scenario presented in this post, i.e., for large, complex JSON documents that must be stored in BLOB columns. I strongly recommend testing the storage scenarios and measuring the load and query throughput on your own data, as the results might differ.
Not Tested Scenarios
I did not test the following scenarios:
- Load and query performance of smaller JSON documents that can fit into VARCHAR2 columns.
- Comparing performance of queries against JSON field with queries against flattened relational tables.
- Performance of more complex queries requiring unnesting and flattening of JSON documents.
- Performance of OLTP queries with where conditions on JSON fields.
- Storage requirements and load performance with indexed JSON columns.
- Load and query performance with ADW instance using 2 and more OCPUs or with Autoscaling enabled.
- Load performance of JSON Collection.
- Load performance of mix of Insert and Update operations.
I hope to return to these scenarios in future posts.