Over time, the volume of data we handle every day has grown exponentially, and there is an ongoing need for integration tools such as Pentaho PDI to process ever larger volumes of data.
Today we talk about terabytes of information instead of the gigabytes of a few years ago or the kilobytes of decades past.
The challenge of processing these large volumes of data requires attention to every detail and the application of best practices whenever possible. There are several tools on the market that can help, whether you need to run syntactic or semantic checks on the data sent by your online payment provider, load data into your corporate database for use by your ERP or data warehouse, or anything in between.
Detailed below are good practices you should apply when processing large volumes of data with the Pentaho Data Integration (PDI) tool.
One of the primary things you need to consider is making sure you have enough resources on your processing server. You want to optimize the use of those resources to perform all your data processing tasks in the best possible way.
Server resources to consider
- Number and type of CPUs
- Amount of available memory
- Type of storage (hard drives, databases, etc.)
- Data transmission capacity of your network
1. Adjust the memory usage parameters of your tool
Allocate enough memory to carry out the work, considering the memory used by other applications and the operating system of your server itself.
Tip: Do not over-allocate memory. For example, do not allocate a gigabyte of memory to process a text file of a few hundred lines. Another good use of memory is to increase the number of records held in the buffer between the steps of your transformation, which can improve run times. You can also keep the records you use for lookups in memory, so they are not constantly re-read from their source.
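To make the lookup idea concrete, here is a minimal sketch in plain Python. It is not PDI code; the function names and the sample customer data are hypothetical, and it only illustrates why loading lookup records into memory once avoids re-reading them for every incoming row.

```python
def build_lookup_cache(lookup_rows):
    """Load the (key, value) lookup rows once into an in-memory dict."""
    return {key: value for key, value in lookup_rows}


def enrich(records, cache, default="UNKNOWN"):
    """Attach the looked-up name to each (customer_id, amount) record.

    Every lookup is a dict access in memory, so the lookup source is
    read only once, when the cache is built.
    """
    for customer_id, amount in records:
        yield customer_id, cache.get(customer_id, default), amount


# Hypothetical sample data standing in for a lookup table and a data stream.
lookup_rows = [(1, "Alice"), (2, "Bob")]
payments = [(1, 10.0), (3, 5.5), (2, 7.25)]

cache = build_lookup_cache(lookup_rows)
enriched = list(enrich(payments, cache))
```

The trade-off is the one described in the tip: the cache costs memory in proportion to the size of the lookup data, so it pays off when that data is small compared with the stream being processed.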
2. Take full advantage of the possibilities of reading data provided by your source system
For example, if you are reading text files, you could store them on different hard drives and read them in parallel to take better advantage of the available hardware. This approach is also valid if your data is on a disk array (RAID) or on specialized external storage such as a SAN or NAS.
The same applies if your data source is a database, in which case reads can be run in parallel, taking care not to saturate the database server.
Pentaho PDI provides mechanisms to parallelize access to data. For example, in the case of text files, you can specify that a file is read in parallel and how many reader processes are started.
If you have a 4 GB file, you can start 4 parallel processes that share the reading work. The first process reads from line 1 up to the line at the 1 GB mark, the second reads from 1 GB to 2 GB, and so on for the rest of the file.
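The arithmetic behind this split can be sketched in plain Python. This is an illustration of the idea, not how PDI implements it internally; `split_ranges` and `read_lines_in_range` are hypothetical helper names, and the demo uses a tiny temporary file in place of the 4 GB source. Each reader takes a byte range and aligns itself to the next line boundary so that no row is split or read twice.

```python
import os
import tempfile


def split_ranges(file_size, copies):
    """Split [0, file_size) into `copies` contiguous byte ranges."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    chunk = file_size // copies
    ranges = []
    for i in range(copies):
        start = i * chunk
        # The last reader also takes any remainder bytes.
        end = file_size if i == copies - 1 else start + chunk
        ranges.append((start, end))
    return ranges


def read_lines_in_range(path, start, end):
    """Return the lines whose first byte falls inside [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume up to the next newline, so the
            # reader begins on a line boundary and never splits a row.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)
    return lines


# Demo on a small temporary file standing in for the 4 GB source.
data = b"".join(b"row-%d\n" % i for i in range(50))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    demo_path = tmp.name

parts = [read_lines_in_range(demo_path, s, e)
         for s, e in split_ranges(len(data), 4)]
os.remove(demo_path)

# Concatenating the four readers' output reproduces the file exactly.
reassembled = b"".join(line for part in parts for line in part)
```

For a 4 GB file and 4 readers, `split_ranges(4 * 1024**3, 4)` yields four ranges of 1 GB each, matching the description above. In this sketch the ranges are read one after another; in a real run each range would be handled by its own reader.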
3. Scale up – execute several copies of the steps of your work that consume more resources
If you find that some steps are creating a “bottleneck” in your process, you can run multiple copies of those steps to lower the total run time at the expense of increased resource consumption. But watch out: do not create more copies than the number of CPUs you have available!
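The rule of thumb about CPUs can be expressed as a small sketch in plain Python. It is an analogy rather than PDI configuration: `capped_copies`, `slow_step`, and `run_step_in_copies` are hypothetical names, and a thread pool stands in for the copies of a step.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_copies(requested, cpus=None):
    """Never run more copies than there are CPUs (and always at least one)."""
    available = cpus or os.cpu_count() or 1
    return max(1, min(requested, available))


def slow_step(row):
    """Stand-in for the expensive step that causes the bottleneck."""
    return row * row


def run_step_in_copies(rows, requested_copies, cpus=None):
    """Process rows with several workers, capped at the CPU count."""
    copies = capped_copies(requested_copies, cpus)
    with ThreadPoolExecutor(max_workers=copies) as pool:
        # pool.map returns results in the same order as the input rows.
        return list(pool.map(slow_step, rows))


results = run_step_in_copies([1, 2, 3, 4], requested_copies=16, cpus=4)
```

Here a request for 16 copies on a 4-CPU machine is reduced to 4 workers. Note that Python threads do not speed up CPU-bound work the way extra step copies can; the pool is used only to show the capping logic.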
4. Scale out – run the job on a cluster of integration servers
If your server resources are not enough to execute your process within the acceptable times for your business, it is possible to improve the processing capacity by increasing the number of servers. To achieve this, it is necessary to configure a Cluster of Pentaho PDI servers.
In this cluster, one server acts as the master server and the others act as slave servers. The master server is responsible for distributing the work to the other servers and consolidating the results that each one returns, thus spreading the workload and improving the total run time.
With this strategy you must always take into account how the work is executed and how the data is partitioned. For example, bear in mind that if you sort the results on the slave servers, they must be merged on the master server so that the final result stays sorted. PDI provides specific steps for this. Of course, you can combine both strategies, scale up and scale out.
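The sorting caveat is easy to see in a small plain-Python sketch. It is not PDI code: the three partitions below are a hypothetical stand-in for rows distributed to three slave servers, and `merge_sorted_partitions` is an illustrative name for what the master has to do.

```python
import heapq


def merge_sorted_partitions(partitions):
    """Merge already-sorted partitions into one sorted list.

    heapq.merge only compares the current head of each partition, so it
    avoids re-sorting everything from scratch on the master.
    """
    return list(heapq.merge(*partitions))


rows = [42, 7, 19, 3, 88, 15, 61, 24, 5]

# Each "slave" receives every third row and sorts its own share.
partitions = [sorted(rows[i::3]) for i in range(3)]

# Simply appending the partitions does not give a sorted result...
concatenated = [value for part in partitions for value in part]

# ...whereas a sorted merge on the "master" does.
merged = merge_sorted_partitions(partitions)
```

Each partition is sorted on its own, but the concatenation starts `3, 42, 61, 7, ...`, which is out of order; only the merged list is sorted end to end.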
One last Pentaho PDI tip
Test the transformations well before moving them to production! Build a test environment as close as possible to the production environment and make a good test plan that covers as many cases as possible.
There is always a risk with any optimization that the system performs worse than you anticipated, in which case you will need to troubleshoot and decide what to do next.
If you run out of time and need some additional expert Pentaho resources or tools, please contact us! We’d be happy to learn more about your project and let you know how we can help.