How to Choose Right Schema Definition for Topics in Publisher-Subscriber Model
- Anant Mishra
- Aug 20, 2023
- 3 min read
Updated: Aug 27, 2023
If you have ever been confused while defining the schema for your topics in Kafka (or any publisher-subscriber system), then this article is for you.
One of the main architectural decisions in creating a publisher-subscriber model is to finalize schema definitions for topics.
We often face questions like:
Which fields should be there in the schema?
What should be the structure of the schema (e.g. nested or flat)?
Example Use Case
This article is not bound to any specific technology or language, but I am using Java as the coding language, Kafka as the publisher-subscriber system, and JSON as the message format.
We will take the example of an Employee object and will see which fields and structures we should pick in our schema definition.
import java.util.List;

public class Employee {
    private int id;
    private String name;
    private String designation;
    private List<JobRole> roles;
    // getters and setters
}

public class JobRole {
    private int id;
    private String name;
    private String type;
    // getters and setters
}
There are mainly two types of schema definitions: producer-centric and consumer-centric.
Producer-Centric Schema Definition
As the name suggests, we create this kind of schema by keeping in mind the structure of the object on the producer side. Such schemas are generic and are not specific to any consumer.
When to Choose This Schema
When we follow message-driven communication in our service architecture, and we believe that the object we are creating or updating will be required by other services, then we can publish such objects to a topic on a centralized Kafka cluster. Any service can consume that topic.
If both the producers and consumers are in our control (i.e. there's no third party involved), then we should prefer this model, as it is extensible.
When we do not have any specific requirements from the consumers, or there are many consumers, then instead of maintaining a separate producer for each of them, we should go with a single producer-centric schema. Otherwise, we would encounter a maintainability issue.
What Should the Schema Definition Be?
Producer-centric schema structures should stay close to the object definition on the producer side. While defining such schemas, we should expose all fields that we believe other services will need.
For our employee example, the schema should be:
{
"id": "emp id",
"name": "emp name",
"designation": "emp designation",
"roles": [{
"id": "role id",
"name": "role name",
"type": "role type"
}]
}
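As a rough illustration, the producer-centric payload above could be built on the producer side as follows. This is only a sketch under some assumptions: the class name `ProducerCentricSchema`, the helper methods, and the sample values are hypothetical, records (Java 16+) stand in for the `Employee` and `JobRole` classes, and the JSON is assembled by hand to keep the example free of dependencies. In a real project a JSON library would normally do this work. Note that `id` is written as a number here because the Java field is an `int`.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ProducerCentricSchema {

    // Stand-ins for the article's JobRole and Employee classes.
    record JobRole(int id, String name, String type) {}

    record Employee(int id, String name, String designation, List<JobRole> roles) {}

    // Escapes only backslashes and double quotes; a JSON library handles far more.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Every field of the role is exposed, mirroring the producer-side object.
    static String roleToJson(JobRole r) {
        return "{\"id\":" + r.id()
                + ",\"name\":" + quote(r.name())
                + ",\"type\":" + quote(r.type()) + "}";
    }

    // Every field of the employee is exposed, including the nested roles.
    static String employeeToJson(Employee e) {
        String roles = e.roles().stream()
                .map(ProducerCentricSchema::roleToJson)
                .collect(Collectors.joining(","));
        return "{\"id\":" + e.id()
                + ",\"name\":" + quote(e.name())
                + ",\"designation\":" + quote(e.designation())
                + ",\"roles\":[" + roles + "]}";
    }

    public static void main(String[] args) {
        Employee e = new Employee(1, "Asha", "Engineer",
                List.of(new JobRole(10, "Backend", "TECH")));
        System.out.println(employeeToJson(e));
    }
}
```

The producer has one mapping to maintain, and its shape follows the producer's own object rather than any particular consumer.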
Do not leave out any field that you may need to expose in the future, because adding it later causes compatibility problems. For example, suppose no consumer currently wants 'role type', so you do not expose it, and some time later a consumer does need it. The messages already published would not contain that field, so a consumer that depends on it could not process them. The generic schema would then no longer serve every consumer, which defeats its purpose.
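To make the 'role type' problem concrete, here is a small sketch. It assumes the role payload has already been parsed into a plain `Map` (so that no JSON library is needed), and the class name `MissingFieldDemo` and the sample values are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

public class MissingFieldDemo {

    // Looks up the role type in a parsed role payload; empty if the field was never published.
    static Optional<String> roleType(Map<String, String> role) {
        return Optional.ofNullable(role.get("type"));
    }

    public static void main(String[] args) {
        // A message published before "type" was exposed.
        Map<String, String> oldRole = Map.of("id", "10", "name", "Backend");
        // A message published after "type" was added to the schema.
        Map<String, String> newRole = Map.of("id", "10", "name", "Backend", "type", "TECH");

        System.out.println(roleType(oldRole).isPresent()); // false
        System.out.println(roleType(newRole).get());       // TECH
    }
}
```

A consumer that needs the role type has nothing to work with for the older message, which is why the field is better exposed from the start.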
Advantages
It helps to achieve loose coupling between services.
It reduces the load on the publisher, which has to take care of publishing only one schema.
It's maintainable and extensible.
Disadvantages
The schema can be heavy, as we may expose fields that no consumer ever needs.
It would not work if your consumer is a third party, or a generic framework in your project, which needs the schema to be in a specific format.
Consumer-Centric Schema Definition
This kind of schema definition is totally governed by the consumer.
When to Choose This Schema
When the consumer is a service or framework that cannot consume different kinds of schemas to gather the information it requires.
When the consumer needs to consume data from many producers for the same use case (e.g. user activity information), receiving differently shaped JSON objects and parsing each of them would be painful. To avoid this, the consumer can ask the other services to produce data in a given format on a consumer-centric topic.
What Should the Schema Definition Be?
Schema definitions and structures should be defined by the consumer. For the same object, different consumers can request different fields and structures.
For our employee example, the schema could be any of the below (or any other combination of fields and structures):
// For topic 1
{
"emp_id": "emp id",
"emp_name": "emp name",
"roles": ["role 1 name", "role 2 name"]
}
// For topic 2
{
"empId": "emp id",
"empName": "emp name",
"roles":[{
"name": "role name 1"
},{
"name": "role name 2"
}]
}
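The two payloads above can be produced from the same `Employee` object with one mapping per topic. The following is a minimal sketch, assuming hypothetical names (`ConsumerCentricSchemas`, `forTopic1`, `forTopic2`) and sample values, with records (Java 16+) standing in for the article's classes and hand-built JSON in place of a JSON library.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ConsumerCentricSchemas {

    // Stand-ins for the article's JobRole and Employee classes.
    record JobRole(int id, String name, String type) {}

    record Employee(int id, String name, String designation, List<JobRole> roles) {}

    // Escapes only backslashes and double quotes; a JSON library handles far more.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Topic 1: snake_case keys, roles flattened to a list of names.
    static String forTopic1(Employee e) {
        String roles = e.roles().stream()
                .map(r -> quote(r.name()))
                .collect(Collectors.joining(","));
        return "{\"emp_id\":" + e.id()
                + ",\"emp_name\":" + quote(e.name())
                + ",\"roles\":[" + roles + "]}";
    }

    // Topic 2: camelCase keys, roles kept as objects that carry only the name.
    static String forTopic2(Employee e) {
        String roles = e.roles().stream()
                .map(r -> "{\"name\":" + quote(r.name()) + "}")
                .collect(Collectors.joining(","));
        return "{\"empId\":" + e.id()
                + ",\"empName\":" + quote(e.name())
                + ",\"roles\":[" + roles + "]}";
    }

    public static void main(String[] args) {
        Employee e = new Employee(1, "Asha", "Engineer",
                List.of(new JobRole(10, "Backend", "TECH"), new JobRole(11, "Reviewer", "TECH")));
        System.out.println(forTopic1(e));
        System.out.println(forTopic2(e));
    }
}
```

Each new consumer format means another mapping method like these on the producer side, which is the maintenance cost to weigh against the lighter payloads.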
Advantages
We expose only required fields, so schemas are not heavy.
It reduces the complexity on the consumer side and supports consumers that are not extensible.
If the object on the producer side is too heavy, or too complex to expose and maintain, then choosing a consumer-specific schema is a better option.
Disadvantages
If, for the same object, a producer has to support several different consumer-specific schemas, then it would create maintainability issues on the producer side.
This is not the ideal way to achieve loose coupling in message-driven architecture.
Thanks for reading!