Importance of AEM Maintenance Procedures for Development and QA Environments

In the rush to get a project started and into production, project teams often deprioritize establishing operational procedures when standing up an initial Adobe Experience Manager environment. This seems reasonable: we assume that the system administrators will figure out these issues when it is convenient for them, and that there will be little to no impact on the project. In practice, this assumption can lead to unexpected issues that manifest acutely, at an inconvenient time for the project. There are three key maintenance procedures that any project should operationalize early to avoid unexpected delays:

  • Backup and Restore
  • Tar File Optimization
  • Data Store Garbage Collection

Backup and Restore

In most organizations, non-production environments get significantly lower levels of backup protection – sometimes none at all. The assumption is that there is nothing in a development or QA environment that can’t be easily recreated. Because CRX stores content locally, this assumption may be inaccurate for AEM non-production environments. In addition, it is very common for AEM development and QA environments to also serve as content entry environments. This means they need to be treated as production environments from a data backup perspective.

The good news is that this is normally a relatively easy problem to solve because these environments have fairly lax SLAs. You can implement a simple process that brings down the whole instance and takes a cold backup. This not only mitigates the risk of losing the only copy of your new website’s content, but it also gets the system administrators familiar with the system and gives them a head start on the issues common to automating backups, so that when the production environments are brought up the backup procedure is already solid.
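
As an illustration, here is a minimal sketch of such a cold backup in Python. The crx-quickstart path, the stop/start scripts, and the backup target are assumptions based on a default standalone install – adjust them for your environment, and in a real script poll for shutdown rather than sleeping.

    #!/usr/bin/env python
    # Cold backup sketch: stop AEM, archive crx-quickstart, restart.
    # AEM_HOME and BACKUP_DIR are assumptions for a default standalone
    # install; adjust for your environment.
    import subprocess
    import tarfile
    import time
    from datetime import datetime
    from pathlib import Path

    AEM_HOME = Path("/opt/aem/crx-quickstart")   # assumed install path
    BACKUP_DIR = Path("/backup/aem")             # assumed backup target

    def main():
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        # Stop the instance so the repository files are quiescent on disk.
        subprocess.check_call([str(AEM_HOME / "bin" / "stop")])
        time.sleep(120)  # crude wait; poll the java process in real use
        try:
            # Archive the whole crx-quickstart (repository, config, logs).
            stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            archive = BACKUP_DIR / ("aem-cold-backup-%s.tar.gz" % stamp)
            with tarfile.open(archive, "w:gz") as tar:
                tar.add(str(AEM_HOME), arcname=AEM_HOME.name)
        finally:
            # Always bring the instance back up, even if archiving failed.
            subprocess.check_call([str(AEM_HOME / "bin" / "start")])

    if __name__ == "__main__":
        main()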

One important note is that you should also test your restore procedures – you want to verify that you can actually recover.

Tar File Optimization

Tar file optimization is one of those things you never really notice until it is too late. For a summary of what tar file optimization is, see https://helpx.adobe.com/crx/kb/TarPMOptimization.html. Because it is common to assume that you won’t be loading full data sets into development or QA, these environments often have less available disk space. Tar file optimization is especially important in environments with frequent content changes, which is exactly the case in development and QA as testing and content entry ramp up. A common scenario is that content updates are frequent enough that the default 3-hour window in which the tar file optimization process runs is not sufficient for the process to complete; over time it falls further and further behind, and your tar files grow out of proportion to the amount of content in your repository.

Eventually this leads to low disk space issues, and because the normal enterprise monitors usually aren’t enabled for development and QA environments, the first indication of the problem is often that your development or QA author instances get shut down by the low disk space monitor in order to prevent repository corruption. This is then followed by an emergency reallocation of disk space, or an emergency procurement of more. In most cases the additional disk space isn’t really required; by allowing the optimization process to run to completion you can recover a large amount of disk space. I have seen cases where the tar file size was reduced from 150 GB to 20 GB following a complete run of the tar file optimization process.

The solution to the problem is relatively simple. First, set up a process to monitor the nightly results of the tar file optimization and verify that the optimization runs to completion every night. If the process goes more than 5 days without completing, consider kicking off the optimization manually and allowing it to run to completion no matter how long it takes (perhaps over a weekend). Alternatively, consider allowing the process to run for longer than 3 hours – say 5 hours every night – to ensure it completes on a regular basis.
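
What to monitor varies by setup, but one simple proxy is the total size of the tar files day over day: if the nightly optimization is keeping up, the total should hold steady or shrink rather than climb. Below is a sketch of that check in Python; the repository path and the data_*.tar file naming reflect a default CRX Tar PM layout and should be verified on your install.

    #!/usr/bin/env python
    # Sketch: track day-over-day growth of the CRX tar files and warn
    # when they grow faster than expected. The repository path and the
    # data_*.tar naming assume a default Tar PM layout.
    import json
    from datetime import date
    from pathlib import Path

    REPO_DIR = Path("/opt/aem/crx-quickstart/repository")  # assumed path
    STATE_FILE = Path("/var/tmp/aem-tar-sizes.json")       # size history
    GROWTH_ALERT_RATIO = 1.10  # warn on >10% growth since the last run

    def total_tar_bytes():
        # Sum every data_*.tar file under the repository (all workspaces).
        return sum(p.stat().st_size for p in REPO_DIR.rglob("data_*.tar"))

    def main():
        history = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
        history[date.today().isoformat()] = total_tar_bytes()
        STATE_FILE.write_text(json.dumps(history, indent=2))
        sizes = [history[d] for d in sorted(history)]
        if len(sizes) >= 2 and sizes[-1] > sizes[-2] * GROWTH_ALERT_RATIO:
            # Wire this up to email or your monitoring system.
            print("WARNING: tar files grew from %d to %d bytes; "
                  "optimization may not be completing." % (sizes[-2], sizes[-1]))

    if __name__ == "__main__":
        main()

Run from cron shortly after the optimization window closes, this gives you the trend line you need to decide whether a manual run or a longer window is warranted.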

Data Store Garbage Collection

Data store garbage collection surfaces many of the same symptoms as tar file optimization. See http://helpx.adobe.com/crx/kb/DataStoreGarbageCollection.html for a description of what data store garbage collection is, and http://blogs.adobe.com/dekesmith/2012/03/13/using-the-data-store-garbage-collection-in-crx-and-cq/ for a more recent discussion of how to use it. As with tar file optimization, often the first indication you have a problem is when you run out of disk space. If you have followed best practices, your data store is probably on a different file system than your tar files, so it may not trigger the low disk space monitor to shut down the repository; instead, you may just start seeing IO exceptions in the log when the repository tries to save a binary file to the data store.

The solution for this issue is a little more complex than for tar file optimization. By default, data store garbage collection doesn’t run on a schedule, so the first thing required is setting up a schedule to run the job. This is complicated by the fact that the job needs to run from start to finish without interruption – it can’t resume from where it left off. The job can be interrupted by a backup job that shuts down the instance, or by a code deployment (deployments that include bundles can cause a cascade of bundle restarts, which will interrupt the job). The issue is further complicated by the amount of time required to run the job – depending on your repository size and configuration it can take hours, up to 24 hours in some cases. This means that you will need to disable backup jobs and code deployments while the process runs.
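
One lightweight way to enforce that mutual exclusion is a lock file that the backup, deployment, and garbage collection jobs all acquire before running. The sketch below illustrates the coordination only; the actual GC trigger is left as a placeholder, since how you start it (JMX, the CRX console) depends on your CRX/CQ version.

    #!/usr/bin/env python
    # Sketch: mutual exclusion between data store GC, backups, and
    # deployments via a shared lock file. The GC trigger itself is a
    # placeholder, since how you start it (JMX, the CRX console)
    # depends on your CRX/CQ version.
    import fcntl
    import sys
    from pathlib import Path

    LOCK_FILE = Path("/var/run/aem-maintenance.lock")  # assumed shared path

    def run_datastore_gc():
        # Placeholder: start GC and block until it finishes.
        raise NotImplementedError("trigger GC for your CRX/CQ version here")

    def main():
        with open(LOCK_FILE, "w") as lock:
            try:
                # Fail fast if a backup or deployment holds the lock.
                fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                sys.exit("Another maintenance job is running; try later.")
            run_datastore_gc()  # may take many hours on a large repository

    if __name__ == "__main__":
        main()

If the backup and deployment scripts acquire the same lock, whichever job starts first wins and the others back off instead of interrupting a long-running collection.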

One strategy to reduce the run time is to separate the deletion of the files from the identification of those files. The data store garbage collection job can be configured to delete the unneeded files, or it can be configured to just touch the files that are in use. You can then write scripts to go through and delete files older than a certain date. This reduces the amount of time that you need to shut off backups and deployments.
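
For example, if you record the time at which a touch-only garbage collection pass started, every in-use file will have a modification time newer than that, and anything older can be deleted. Here is a sketch of that deletion script; the datastore path shown is the default CRX 2 location and is an assumption to verify.

    #!/usr/bin/env python
    # Sketch: delete data store files left untouched by a mark-only GC
    # run. Run this only AFTER a touch-only garbage collection pass has
    # completed, passing the time at which that pass STARTED. The
    # datastore path is the default CRX 2 location (an assumption).
    import sys
    from datetime import datetime
    from pathlib import Path

    DATASTORE = Path("/opt/aem/crx-quickstart/repository/repository/datastore")

    def main(gc_start_iso):
        cutoff = datetime.fromisoformat(gc_start_iso).timestamp()
        freed = 0
        for record in DATASTORE.rglob("*"):
            # The mark phase touched every in-use record, so its mtime is
            # newer than the GC start time; anything older is unreferenced.
            if record.is_file() and record.stat().st_mtime < cutoff:
                freed += record.stat().st_size
                record.unlink()
        print("Freed %d bytes" % freed)

    if __name__ == "__main__":
        main(sys.argv[1])  # e.g. 2014-06-21T02:00:00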

A second complication is that in addition to running the garbage collection job, you may also need to schedule a cleanup of your package manager. One of the most common culprits in the growth of the data store in dev and QA environments is packages. Developers create packages on development or QA instances in order to get data for their local environments or to move data between environments. These packages are normally single-use – they get created and then never used again. This means you may need a process that periodically clears unused packages out of your package manager in order to recover space.
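
A sketch of such a cleanup against the Package Manager HTTP interface follows. The cmd=delete call follows Adobe’s documented curl examples, but the list.jsp endpoint’s JSON field names (path, group, lastModified in epoch milliseconds) are assumptions to check against your version before relying on this.

    #!/usr/bin/env python
    # Sketch: delete stale one-off packages via the Package Manager
    # HTTP interface. The cmd=delete call follows Adobe's documented
    # curl examples; the list.jsp JSON field names (path, group,
    # lastModified in epoch milliseconds) are assumptions to verify.
    import time
    import requests

    HOST = "http://localhost:4502"   # assumed local author instance
    AUTH = ("admin", "admin")        # use a maintenance account in practice
    MAX_AGE_DAYS = 30                # untouched this long => delete
    KEEP_GROUPS = {"adobe", "day"}   # never delete product packages

    def main():
        cutoff_ms = (time.time() - MAX_AGE_DAYS * 86400) * 1000
        listing = requests.get(HOST + "/crx/packmgr/list.jsp", auth=AUTH).json()
        for pkg in listing.get("results", []):
            if pkg.get("group") in KEEP_GROUPS:
                continue
            mtime = pkg.get("lastModified")
            # Skip packages with no timestamp rather than guessing.
            if mtime is not None and mtime < cutoff_ms:
                resp = requests.post(
                    HOST + "/crx/packmgr/service/.json" + pkg["path"],
                    params={"cmd": "delete"}, auth=AUTH)
                print(pkg["path"], resp.status_code)

    if __name__ == "__main__":
        main()

Note that deleting the package only removes the repository reference; the binary itself is reclaimed on the next data store garbage collection run.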

Conclusion

While these may seem like simple issues to solve, they are often missed because the consequences of not dealing with them seem a long way off and your project team has plenty of higher-priority things to deal with. The reality is that the consequences probably aren’t as far off as you think, and the pain you will feel when you run out of disk space or lose 2 weeks of authoring will be far worse than having to divert some development effort early in the project.