Large collections are inevitable. If your Magento store has been running for more than a minute, you’ll have orders. You might have hundreds of orders. Thousands of orders. Hundreds of thousands of orders! When you’re constructing an extension for Magento, in all likelihood you’re not rocking a database with hundreds of thousands of items to test against.
Let’s run with orders as an example, and let’s build something that exports those orders.
You might come up with something like this: a quick and dirty script to export them…
$orders = Mage::getModel('sales/order')->getCollection();

foreach ($orders as $order) {
    echo $order->getIncrementId() . PHP_EOL;
}
Let me explain why it’s quick and dirty…
Say you’ve got twenty orders. This is probably going to be “not too bad”. It’s probably going to take a couple of seconds, and you’ll get what you want.
Let’s break this down a little.
Mage::getModel('sales/order') tells Magento to go and get a sales/order model – Mage_Sales_Model_Order. When you call getCollection(), you're getting, essentially, a collection of order objects. At this point, all Magento does is start building an SQL query. Nothing gets executed. Magento hasn't gone anywhere near the database. You can add attributes, add extra filters, you can even decide, hey, I don't even want to use this any more. No expensive database calls. This is all thanks to Magento's lazy loading. Essentially, Magento will only go to the database when you need access to the data – in our case, when you start the foreach.
So, let's take our example above. It's going to give you something like SELECT main_table.* FROM sales_flat_order AS main_table – which goes and gets everything, and loads it into objects.
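You can see the lazy loading for yourself by building the collection up and dumping the SQL Magento has prepared – nothing touches the database until you actually iterate. A quick sketch (the status filter is just an illustration, not part of the export):

$orders = Mage::getModel('sales/order')->getCollection();

// Still no database access – we're only building up the query.
$orders->addFieldToFilter('status', 'complete');

// Inspect the SQL Magento has prepared so far.
echo $orders->getSelect()->__toString() . PHP_EOL;

// Only now, when we iterate, does Magento actually run the query.
foreach ($orders as $order) {
    echo $order->getIncrementId() . PHP_EOL;
}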
Using the same code, let’s say you’ve got a thousand orders. It’s probably going to take a while longer, but you can wait.
Let's jump up to 100,000 orders. You're probably going to be tapping that foot for a while. While you're waiting, think about your server. It's probably screaming. If the script doesn't time out, it could run out of memory or resources before you get all 100,000 orders exported.
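If you want to see the problem in numbers rather than take my word for it, you can log peak memory after the naive loop finishes – a rough sketch:

$orders = Mage::getModel('sales/order')->getCollection();

foreach ($orders as $order) {
    echo $order->getIncrementId() . PHP_EOL;
}

// Every row has been loaded and turned into an object at once,
// so peak memory grows roughly in line with the number of orders.
echo 'Peak memory: ' . round(memory_get_peak_usage(true) / 1024 / 1024) . 'MB' . PHP_EOL;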
So, let's have a think about how we can do this in a more lightweight, server-friendly (and hopefully quicker) way.
Using an Iterator
PHP provides an array_walk function that will send each item of an array through another function, one at a time. Magento provides a collection iterator – core/resource_iterator (Mage_Core_Model_Resource_Iterator) – which will send each element of a collection through another function. Rather than loading the whole collection into objects up front, it runs the query and hands each row to your callback one at a time, which keeps the PHP side lightweight.
The Mage_Core_Model_Resource_Iterator class has a walk method which allows you to iterate over your collection.
public function walk($query, array $callbacks, array $args = array(), $adapter = null)
$ordersCollection = Mage::getModel('sales/order')->getCollection();

Mage::getSingleton('core/resource_iterator')->walk(
    $ordersCollection->getSelect(),
    array(array($this, 'ordersCallback')), // our function to pass each order into
    array('store_id' => $storeId)          // extra args handed through to the callback
);

public function ordersCallback($args)
{
    // $args['row'] is the raw database row for a single order
    $_order = Mage::getModel('sales/order')->setData($args['row']);

    // process
}
Paging a collection
So, we've got a solution that keeps memory usage down and lets us start processing straight away, but it comes at the expense of pushing every single row through a callback, one at a time. In some respects this is a lot better than the previous solution, but by using techniques from the first and second solutions, we can produce something better.
What we're looking for is a way of dealing with collections in manageable chunks. From the PHP side, this is a question of keeping arrays and objects to a "manageable" size. From the database side, we need to pull out a set number of results at a time – chunks big enough to make processing efficient, but small enough not to overload the database.
In MySQL, what we're looking at is using LIMIT so that we can "page" through results. LIMIT takes two values: the first is the offset to start from (zero-based), and the second is how many rows you want. So you could use LIMIT 0,10 to get the first 10 results, then LIMIT 10,10 to get the next 10 results.
Magento implements this on collections using two methods.
setPageSize – this informs Magento how many results you want per page.
setCurPage – this tells Magento which page to fetch (Magento works out the offset from the page size).
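If you're curious what this turns into, one option is to pass true to load(), which makes Magento echo the query it runs – a quick sketch (page 3 is just an arbitrary example):

$orders = Mage::getModel('sales/order')->getCollection();
$orders->setPageSize(100); // 100 rows per page
$orders->setCurPage(3);    // third page of results

// Passing true prints the SQL Magento executes – you should see
// something like LIMIT 100 OFFSET 200 at the end of the query.
$orders->load(true);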
Example code
$ordersCollection = Mage::getModel('sales/order')->getCollection();
$ordersCollection->setPageSize(100);

$pages = $ordersCollection->getLastPageNumber();
$currentPage = 1;

do {
    $ordersCollection->setCurPage($currentPage);
    $ordersCollection->load();

    foreach ($ordersCollection as $_order) {
        // process
    }

    $currentPage++;

    // clear collection (if not done, the same page will be loaded each loop) - will also free memory
    $ordersCollection->clear();
} while ($currentPage <= $pages);
This seems like a very nice solution and is one that I try to use where possible – as you never know how big a collection is likely to be.
One issue we do have, though, is that of moving goalposts. Because each of our queries hits the database at a different time and we don't grab all the data at once, there is a possibility that new orders will bump our paging out, or that new orders will be added to a page that we never capture.
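One possible way to reduce that risk (this isn't from the script above, and it assumes you're happy to ignore orders placed mid-export) is to pin the collection to the orders that existed when the export started, for example by snapshotting the highest entity_id up front. A rough sketch:

// Snapshot the highest order ID before we start paging, so orders
// created while the export runs can't shift our pages around.
$maxId = Mage::getResourceModel('sales/order_collection')
    ->setOrder('entity_id', 'DESC')
    ->setPageSize(1)
    ->getFirstItem()
    ->getId();

$ordersCollection = Mage::getModel('sales/order')->getCollection();
$ordersCollection->addFieldToFilter('entity_id', array('lteq' => $maxId));
$ordersCollection->setPageSize(100);
// ...then page through exactly as before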
Using PHPStorm
If you’re using PHPStorm, I’d suggest setting up a “live template” with the following code to make it quick and easy for you to deal with large collections.
$orders$->setPageSize(100);
$pages = $orders$->getLastPageNumber();
$currentPage = 1;
$batchNumber = 0;

do {
    $orders$->setCurPage($currentPage);
    $orders$->load();

    foreach ($orders$ as $order$) {
        // process
    }

    $orders$->clear();
    $currentPage++;
} while ($currentPage <= $pages);
Using template variables $orders$
and $order$
, you can quickly update your PHP variable names as soon as PHPStorm expands your code.
Worth looking at
There is a generic batched iterator by Kalen Jordan which can be found here: https://gist.github.com/kalenjordan/5483065 – this introduces the concept of using an iterator whilst batch processing a collection! Really neat to keep in a core module, so you have a reusable helper you can pass any collection to.
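If you don't want to pull in the gist, a minimal sketch of the same idea (my own rough version, not Kalen's code) could look something like this:

/**
 * Walk any DB collection in pages, passing each item to a callback.
 * Rough sketch – tune the page size and add error handling to taste.
 */
function batchWalkCollection(Varien_Data_Collection_Db $collection, $callback, $pageSize = 100)
{
    $collection->setPageSize($pageSize);
    $pages = $collection->getLastPageNumber();

    for ($currentPage = 1; $currentPage <= $pages; $currentPage++) {
        $collection->setCurPage($currentPage);
        $collection->load();

        foreach ($collection as $item) {
            call_user_func($callback, $item);
        }

        // Clear so the next page is actually fetched (and memory is freed).
        $collection->clear();
    }
}

// Usage:
batchWalkCollection(
    Mage::getModel('sales/order')->getCollection(),
    function ($order) {
        echo $order->getIncrementId() . PHP_EOL;
    }
);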
Image Credit: Oak Street Blues - 4-11-15
Comment or tweet @douglasradburn