Building a Simple PHP-ETL
Anyone of you fallowing my github might have seen that I have recently created a new library : PHP-ETL
Recently I had a did a few repetitive projects, all of them consisted of transforming a csv file. I ended up identifying these repetetive tasks.
- Transformation : I identified the transformations very early in the process, and coded the RuleEngine a while ago.
- Grouping : On a few of the occasion I ended up having to group multiple lines of the csv file into one line. This I had todo to create configurable products from variants for exemple.
- Multiple destination : Quite often when reading a single csv file I had to write multiple outpus. For exemple a file containing the list of all products and a file for the attribut options.
- Writing : All these operations always ended with the data being persisted either in file(s) or in databases.
Basically here we can see clearly see the TL part of ETL. But we also have a few more complex operations such as the grouping & multi destinations.
Once all this in place I started to see if I could imagine a way to simply describe each step for all the transformations I worked on. I quickly realized that on all of these projets there was alot of common code and it was easy to split into multiple operations handling individual data.
As I said the first step that I identified and had time to implement was the RuleEngine. It allows to describe simple transformations in yaml configuration files. The yaml is relatively simple, we even have a client(non technical) editing the yaml file by himself to make custom exports.
The next step was to have a common ground to build the next tasks on.
I started looking available open source tools and particularly the ETL's out there; I found none that could do both the Grouping & the Multiple Destination thing (in php). At this point I decided to have a go at it myself.
The main complexity here is Grouping. How do we handle that?
Grouping
A ETL chain, the way I imagine it needs to process each item individually, this way you are sure there are no memory issues. This seems to be the case of the majority of the ETL's out there; so I had to keep this logic.
A second logic that needed to be maintained is : that what get's out from an operation is what goes in the next operation.
The problem with grouping is that you can't do it without reading the whole data. This is problematic for big datasets as we would need to keep it all in memory.
We end up with 3 solutions :
- The grouping operation returns the whole data grouped.
- -) The memory problem inside the grouping moves in different operations.
- -) We are not handling items one by one.
- +) We respect the what goes out from the operation goes in the next operations
- The grouping operation returns an Iterator of the grouped data,
- -) We are not handling items one by one.
- -) We are transporting in iterator in each operation, and possibly reading/writing in database ?
- +) The memory issue can be solved when using a database. It remains only in the grouping operation.
- +) We respect the what goes out from the operation goes in the next operations
- The grouping operation returns an Iterator and the chain processor transforms that back into individual items (each item being a group of the original item)
- -) We don't respect the what goes out from an opeation goes in the next operation.
- +) The memory issue can be solved when using a database. It remains only in the grouping operation.
- +) We can chain multiple operations.
I went for the third option. It allows me to have a generic operations to be chained. If we wanted to use the RuleTransformer after a grouping, we can do it as the item received is an associave array of 2 levels.
There are still a few questions in how to implement.
- What does the Grouping does when it receives an item and isn't ready to send the iterator?
- You might be wondering how does the grouping operation know when to send the iterator?
Bascailly there are 2 approches possible, we must remeber that there might be multiple grouping operations in a chain.
- The Grouping operations returns null when processsing. The processor then calls the end method of the grouping to get the Iterator.
- -) In the processor we need to code some specific code to handle a call to stop, while handling the multiple grouping operations.
- -) A null data can't be sent to fallowing operations.
- The data going throught the chain is used as a signals.
- -) Complexity
- +) The processor will just need to send a "StopItem" to indicate that everything needs to be flushed. A simple loop allows it to be sent until everything is well empty. We use different methods (processData & processStop) in operations to handle different types of Item. If Item is unknow it simply is returned without any change.
- +) We can send a null, data as Groups will send a ChainStopItem to stop a chain if it has no data to send instead of null.
I chose the last option, and it seems to be the most flexible option. Having everything flush is a simple line of code.
So we have solved our grouping issue, we can group data and optimize the memory usage if we wish todo so using databases or other storage methods instead of memory.
Multiple destination
At first I hadn't thought about this. I thought I can create 2 chains or more, and have the same beginning for all of them. The issue here is that we duplicate alot to do that.
Once the grouping and all the ETL was in place, I realized doing the "Split" as I call it was relatively easy. Each item of the main chain is sent into a second chain. And as I use the Items as a signal it automatically flushes any of the groupings there might be in the second chain. So there is a single loop in the start of the process.
Conclusion
I was really suprised not to find any tools that matched my needs with all the open source libraries out there. I can also understand that PHP is not the idea language to do this.
I am quite happy of the end result. I am sure it can be improved quite a lot with some more thinking. There are alot of basic operations (such as aggregations) that could be added in the tool. But for now it should be able to handle relatively easily most of the transformations I need, and hopefully tou need as ell.
You can check it out in github : https://github.com/oliverde8/php-etl and add a start to it :)