In this article we continue our research on how to implement field level text search by using a datatype specific to Postgres - a HStore column. Next we'll use GIN / GiST indexes which ware described in the opening article of this series and we'll try to implement the described concept using Spring Data JPA, Hibernate and Postgres.
Table of contents
This blog post is part of a series of articles about internationalizing data with support for text searches:
- Indexes available in Postgres and anti-patterns
- Performant internationalization in PostgreSQL + Hibernate / JPA:
- Companion translation table
- HStore column with translations - your currently reading this
Language keys in HStore
HStore is a special datatype in Postgres which allows to store key - value map in a single row. Thanks to this for each field (like for example a products "Name" field) we can create a companion column with a map that holds the following pairs: language key - text in that given language. Example table using this concept:
In the example above we assume that the Name column holds the products name in the organization default language and the NameTranslations column holds the products name translations in all other languages.
To be able to use HStore in postgres we must first enable this extension:
CREATE EXTENSION hstore;
HStore integration with JPA / Hibernate
Java Persistence API tries to be database neutral so we will not find support for HStore there out of the box - as well as for any other Postgres specific datatypes. We can, however easily, extend Hibernates functionality to work with this datatype on multiple levels: JPQL, Criteria API and event DDL generation. The first thing we have to do is to add a UserType for HStore.
Next we must add a function to Hibernate which will allow us to get a value by a given key from a HStore column. We do this by implementing a SQLFunction:
We combine everything together by adding a Postgres dialect.
and we set that dialect in our Spring Boot configuration:
JPA object model
With our Hibernate expanded with HStore support we can now start using it in our persisted object model.
We created a generic embeddable which will be used in many fields in a single entity and therefore in our Spring Boot configuration we should set a database column naming strategy which will handle object nesting.
spring: jpa: hibernate: naming: implicit-strategy: org.hibernate.boot.model.naming.ImplicitNamingStrategyComponentPathImpl physical-strategy: org.hibernate.boot.model.naming.PhysicalNamingStrategyStandardImpl
Without this, the default configuration would try to create many columns with the name "value" in a single table. But because we changed that strategy we will get names like:
details_name_value text; details_name_translation_by_language hstore; details_description_value text; details_description_translation_by_language hstore;
Spring service with text search
Let's test our model in practice and implement a Spring service with a REST API which will allow us to:
- add and fetch translations for a given product,
- find products by name in a given language.
We'll use Spring Data JPA to handle our database queries.
Indexes in Postgres and performance tests
Hibernate can generate SQL tables for us based on our object model but it will not provide required optimizations. We our self’s must provide the GIN indexes which we mentioned at the beginning of this article series.
As we can see from the above example we must create a separate index per language. This approach has upsides and downsides. On one hand we must create many indexes with part of the data in each one. On the other hand, we have greater control which languages get indexed and we can focus only on the most popular ones. Furthermore, text searches will always work in the scope of one language and in consequence will be using only one index which is specialized for the given scope.
As we did in our previous concept, we will populate our database using a Gatling script.
Let's make sure that our indexes are being used:
explain select product0_.id as id1_0_, product0_.details_code as details_2_0_, product0_.details_description_translationByLanguage as details_3_0_, product0_.details_description_value as details_4_0_, product0_.details_name_translationByLanguage as details_5_0_, product0_.details_name_value as details_6_0_, product0_.details_price as details_7_0_ from Product product0_ where lower(product0_.details_name_value) like ('%'||lower('aaa')||'%') or lower(product0_.details_name_translationByLanguage -> 'pl') like ('%'||lower('aaa')||'%')
Bitmap Heap Scan on product product0_ (cost=175.84..5669.84 rows=15680 width=173) Recheck Cond: ((lower((details_name_value)::text) ~~ '%aaa%'::text) OR (lower((details_name_translationbylanguage -> 'pl'::text)) ~~ '%aaa%'::text)) -> BitmapOr (cost=175.84..175.84 rows=16000 width=0) -> Bitmap Index Scan on product_details_name_idx (cost=0.00..84.00 rows=8000 width=0) Index Cond: (lower((details_name_value)::text) ~~ '%aaa%'::text) -> Bitmap Index Scan on product_translated_name_idx (cost=0.00..84.00 rows=8000 width=0) Index Cond: (lower((details_name_translationbylanguage -> 'pl'::text)) ~~ '%aaa%'::text)
The index was used as anticipated. Now lets make a HTTP call to our API.
This requests execution time ranges from 5 to 30 ms - so far so good. If we remove the index, the execution time will increase to about ~150 ms. Let's now do a full blown performance test. We'll write two bigger Gatling scripts:
- Just search - a simulation where only searches are done in parallel
- Full simulation - concurrent writes and searches (1:1 proportions)
We run the scripts above once on a database with indexes and once on a database without them. Once on a database with 100'000 products and once with 500'000 products. Below, we can compare the reported results:
Results for 100'000 products
Just search simulation - WITHOUT indexes
Just search simulation - WITH indexes
Results for 500'000 products
Just search simulation - WITHOUT indexes
Just search simulation - WITH indexes
- this concept based on HStore turned our easier to integrate with JPA and Spring Data then the companion translation table concept
- when we only have 100'000 products in the database then the performance impact of indexes is marginal both for searches and writes in a loaded system
- but then the products count grows to 500'000 then:
- average text search time in a loaded system is 8x shorter when we have indexes
- even when we have many writes (one write for each search) both search and writes on average are about 20% faster on a loaded system with indexes - this is interesting as the write has more work to do when an index needs updated but the resources saved on indexed searched give it plenty of time to do it.
Source code used in this article can be found in our bitbucket repository.
This article finishes our series. If you would like to read our analysis from the beginning then please go to the Data internationalization with text search - indexes in Postgres and anti-patterns article.
This article is a result of our cooperation with Nextbuy - a SaaS company which develops a procurement and online auction platform to connect buyers and suppliers. We provide various consulting and software development services to them and they have kindly allowed us to publish part of the resulting research / design documents. You can checkout their amazing platform at www.nextbuy24.com
Do you need help in your company with some topic we mentioned in our blog articles? If so, then please feel free to contact us. We can help you by providing consulting and audit services or by organizing training workshops for your employees. We can also aid you in software development by outsourcing our developers.