Troubleshooting Environment-Specific AEM Issues

One of the most common issues that arises during development and testing on AEM implementations is the question: why does my page look right on author but not when I view it through the web server/Dispatcher? Or why does it work fine when I hit the publish server directly but not the web server?

Most developers with AEM experience have their list of immediate suspects – did you publish everything, is Dispatcher blocking it somehow, etc. Where people run into trouble is when the cause isn’t on their list and they aren’t sure how to identify the source of the problem. This is where a clear understanding of the request process helps: it lets you identify at least which layer is causing the problem, which in turn helps you solve it quickly.

Step 1 – What’s Happening in the Browser

The first step is to figure out whether or not all the files your page requires are getting downloaded. Normally in these situations you open something like Firebug or Charles and look for the 404s (or other error codes – 403s or 500s). You can then usually zero in on why those files are throwing errors by moving on to Step 2 – What’s Happening on the Web Server/Dispatcher.

Every once in a while you will run into a situation where you don’t have any requests throwing errors, so at that point you need to stay focused on what’s happening in the browser. If you don’t have any requests with errors then you will want to look at a few different things:

  1. The JavaScript error console – look to see if you find any security warnings or other errors. You may, for example, be experiencing cross-domain scripting issues that you don’t see when hitting the publish server. You may also see parsing errors – these can sometimes indicate that one of your files has bad content in it even though it isn’t returning an error code.
  2. Compare the list of files downloaded in the environment that works against the list downloaded in the environment that doesn’t work – check not just the count but the file names as well. Sometimes it’s useful to use a proxy tool like Charles for this rather than Firebug if your site leverages plugins like Flash or other tools that might make HTTP requests the browser isn’t aware of.
  3. The worst-case scenario is to start opening individual files and comparing the contents between environments to see if you have errors embedded in files, old versions, etc. You may also need to compare HTTP headers between environments (verifying, for example, that the mime types and other content headers are the same in different environments). See the sketch after this list for one way to automate the comparison.
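
If you find yourself doing that comparison often, it is easy to script. Below is a rough sketch (Python 3, standard library only) that pulls the asset references out of a page in two environments and diffs them – the hostnames and page path are placeholders for your own working and broken environments, and extending it to fetch each asset and compare status codes and headers is straightforward:

    import urllib.request
    from html.parser import HTMLParser

    class AssetParser(HTMLParser):
        """Collects src/href values for scripts, stylesheets and images."""
        def __init__(self):
            super().__init__()
            self.assets = set()
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "script" and "src" in attrs:
                self.assets.add(attrs["src"])
            elif tag == "link" and "href" in attrs:
                self.assets.add(attrs["href"])
            elif tag == "img" and "src" in attrs:
                self.assets.add(attrs["src"])

    def assets_for(host, path):
        parser = AssetParser()
        with urllib.request.urlopen(host + path) as resp:
            parser.feed(resp.read().decode("utf-8", errors="replace"))
        return parser.assets

    # Placeholder hosts: the environment that works vs. the one that doesn't.
    works = assets_for("http://publish.example.com:4503", "/content/site/en.html")
    broken = assets_for("http://www.example.com", "/content/site/en.html")

    print("Referenced only in the working environment:", sorted(works - broken))
    print("Referenced only in the broken environment:", sorted(broken - works))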

Normally as you run through these options you will hit on the problem file, and then you can advance to Step 2.

Step 2 – What’s Happening on the Web Server/Dispatcher

The goal at this step is to determine whether or not the web server/Dispatcher is the source of the problem. The questions to ask will vary a little based on what problem you have (a 404 vs. a wrong mime type, say), but generally it’s one of these:

  1. Is the request getting to the web server at all, or is it getting misdirected by another layer above the web server?
  2. Is the request getting to Dispatcher or is the web server configuration somehow misdirecting the request?
  3. Is Dispatcher serving the item from cache or sending the request back to publish?
  4. Do you get different behavior when the item is served from cache vs. when Dispatcher sends the request back to a publish server?
  5. Is the URL that Dispatcher is trying to resolve the correct URL (if you have rewrite logic, is it getting properly applied, is it being rewritten when it should not be, did it get mangled somewhere along the way)?

In order to answer these questions there are generally a few places to look:

  1. The web server’s access and error logs.
  2. The dispatcher.log – ideally you should turn up the log level on this to get more information, but in production that may be a last resort – sometimes, however, it is the only option.
  3. The publish server’s request.log or access.log – if you can’t turn up the logging at the Dispatcher level these can sometimes give you the information you are looking for, although in production where you have more than one publish server it can be difficult. The sketch after this list shows one way to trace a single request across these logs.
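
Chasing one request through several logs by hand gets tedious, so a small script helps. Here is a minimal sketch that greps a path fragment out of each layer’s log in order – the log locations are assumptions based on a typical Apache + Dispatcher install, so adjust them to match your servers:

    import sys

    # Assumed log locations for a typical Apache + Dispatcher setup.
    LOGS = [
        "/var/log/httpd/access_log",           # did the request reach the web server?
        "/var/log/httpd/error_log",            # did the web server reject it?
        "/opt/dispatcher/logs/dispatcher.log", # did Dispatcher cache it or forward it?
    ]

    def trace(path_fragment):
        for log in LOGS:
            print("==== " + log)
            try:
                with open(log, errors="replace") as f:
                    for line in f:
                        if path_fragment in line:
                            print("  " + line.rstrip())
            except OSError as e:
                print("  could not read log: " + str(e))

    if __name__ == "__main__":
        trace(sys.argv[1])  # e.g. python trace_request.py /etc/designs/site/site.css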

At this point, if you determine that the web server/Dispatcher is the problem, it is normally either an issue with the filter rules in dispatcher.any, stale cache, a rewrite in the web server that is changing the URL when it shouldn’t (or failing to change it when it should), or an issue with the web server not forwarding the request to Dispatcher for handling.

If, however, you determine that the request is getting to the web server/Dispatcher and being sent back to the publish server, then you are on to Step 3.

Step 3 – What’s Happening on the Publish Server

Generally if you get to this point you have a fairly straightforward problem to solve (and generally the issue is that it works on author and not on publish). The questions you are looking at are:

  • If it’s a 404, why – does the node not exist on the publish server, is the URL wrong for some reason, or do you have some sort of security issue?
  • If it’s a 403, why – the common culprit here, beyond plain misconfigured security permissions, is the Apache Sling Referrer Filter (which can block things like POST requests that don’t come from a whitelisted domain); the sketch after this list shows a quick way to test for it.
  • If it’s a 500 – read the error logs. This is really important for all of these questions – if you are stumped, always read the error logs.
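
When you suspect the referrer filter, the fastest test is to replay the failing POST with and without a Referer header that matches an allowed host. A sketch using the third-party requests library – the URL and form data are placeholders:

    import requests

    # Hypothetical POST target that is returning a 403.
    url = "http://publish.example.com:4503/content/site/en/some/form"

    for referer in (None, "http://publish.example.com:4503/"):
        headers = {"Referer": referer} if referer else {}
        r = requests.post(url, data={"test": "1"}, headers=headers)
        print(repr(referer), "->", r.status_code)

    # If the POST succeeds with the matching Referer but fails without one,
    # the referrer filter configuration is the likely culprit.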

Cutting to the Chase

That’s a pretty extensive list, and once you have gone through it a time or two you start to develop some shortcuts to identifying the problem layer, which is good. I usually try to bracket the problem before I start digging (the sketch after this list runs all three checks):

  • Do I get the same results when I hit the web server vs. hitting 4503 directly?
  • Do I get the same results cached vs. uncached (usually by adding a question mark and a dummy query string, since by default Dispatcher only serves requests without query strings from cache)?
  • Do I get the same results when I request the file by itself vs. when it’s downloaded with the page?
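
Those bracketing checks are easy to run as one script. A sketch (Python 3, standard library) – the hostnames and file path are placeholders, and the cache-busting query parameter name is arbitrary:

    import urllib.error
    import urllib.request

    PATH = "/etc/designs/site/clientlibs.css"  # hypothetical problem file

    CHECKS = {
        "web server (cached)":   "http://www.example.com" + PATH,
        "web server (uncached)": "http://www.example.com" + PATH + "?nocache=1",
        "publish direct":        "http://publish.example.com:4503" + PATH,
    }

    for label, url in CHECKS.items():
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
                print("%-22s %s  %s bytes  %s" %
                      (label, resp.status, len(body),
                       resp.headers.get("Content-Type")))
        except urllib.error.HTTPError as e:
            print("%-22s %s" % (label, e.code))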

The answer to one of those questions will usually point me to the right layer and cut out some of the steps. That said, one thing to watch for is when you start spinning your wheels: while walking through the full list methodically may seem like a waste of time, it is usually worth doing once you are no longer making progress.

Importance of AEM Maintenance Procedures for Development and QA Environments

In the rush to get a project started and into production, project teams will deprioritize establishing operational procedures when standing up an initial Adobe Experience Manager environment. This seems to make sense because we assume that the system administrators will figure out these issues when it is convenient for them, and that there will be little to no impact to the project. This assumption can lead to unexpected issues that usually manifest in an acute manner at an inconvenient time for the project. There are really three key maintenance procedures that any project should operationalize early to avoid unexpected project delays:

  • Backup and Restore
  • Tar File Optimization
  • Data Store Garbage Collection

Backup and Restore

In most organizations non-production environments get significantly lower levels of backup protection – sometimes even none. The assumption is that there is nothing in a development or QA environment that can’t be easily recreated. Because CRX stores content locally, this assumption may be inaccurate for AEM non-production environments. In addition, it is very common for AEM development and QA environments to also be leveraged as content entry environments. This means that they need to be treated as production environments from a data backup perspective.

The good news is that this is normally a relatively easy problem to solve because these environments have fairly lax SLAs. You can implement a simple process to bring down the whole instance and do a cold backup. This will not only mitigate the risk of losing the only copy of your new website’s content, but it will also get the system administrators familiar with the system and give them a head start on the issues common to automating backups, so that when the production environments are brought up the backup procedure is already solid.
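
For a development or QA instance, the whole procedure can be as simple as the following sketch (Python 3, standard library). The install path, backup location, and stop/start scripts are assumptions based on a default quickstart install – adjust them for your environment:

    import datetime
    import pathlib
    import subprocess
    import tarfile

    # Assumed locations for a default quickstart install.
    AEM_HOME = pathlib.Path("/opt/aem/author")
    BACKUP_DIR = pathlib.Path("/backup/aem")

    # 1. Stop the instance so the repository files are quiescent.
    subprocess.run([str(AEM_HOME / "crx-quickstart/bin/stop")], check=True)

    # 2. Tar up the whole crx-quickstart directory (repository, config, logs).
    stamp = datetime.date.today().isoformat()
    target = BACKUP_DIR / ("author-cold-backup-%s.tar.gz" % stamp)
    with tarfile.open(target, "w:gz") as tar:
        tar.add(AEM_HOME / "crx-quickstart", arcname="crx-quickstart")

    # 3. Restart the instance.
    subprocess.run([str(AEM_HOME / "crx-quickstart/bin/start")], check=True)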

One important note is that you should also test your restore procedures – you want to verify that you can actually recover.

Tar File Optimization

Tar file optimization is one of those things that you never really notice until it is too late. For a summary of what tar file optimization is, see https://helpx.adobe.com/crx/kb/TarPMOptimization.html. Because it is common to assume that you won’t be loading full data sets into development or QA, these environments will often have less available disk space. Tar file optimization is especially important in environments with frequent content changes, which will be the case in development and QA environments as testing and content entry ramp up.

A common scenario is that you are making frequent enough content updates that the default 3 hour window in which the tar file optimization process runs is not sufficient for the process to actually complete. Over time it gets further and further behind, and your tar files continue to grow out of proportion to the amount of content in your repository. Eventually this leads to low disk space issues, and because the normal enterprise monitors usually aren’t enabled for development and QA environments, the first indication you get of the problem is that your development or QA author instances get shut down by the low disk space monitor in order to prevent repository corruption. This is then followed by an emergency reallocation of disk space, or an emergency procurement of more disk space. In most cases the additional disk space isn’t really required, and by allowing the optimization process to run to completion you can recover a large amount of disk space. I have seen cases where the tar file size was reduced from 150 GB to 20 GB following a complete run of the tar file optimization process.

The solution to the problem is relatively simple. First, you need to set up a process to monitor the nightly results of the tar file optimization and verify that the optimization is running to completion every night (a monitoring sketch follows below). If you find that the process runs for more than 5 days without completing, then you should consider kicking off the optimization process manually and allowing it to run to completion no matter how long it takes (perhaps over a weekend). Or you may want to consider allowing the process to run for longer than 3 hours – say 5 hours every night – to ensure it completes on a regular basis.
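
Even a crude check run nightly from cron will catch the problem long before the disk fills up. For example, a sketch that just totals the tar file sizes under the repository and warns past a threshold – the repository path and the threshold are assumptions to adjust for your instance:

    import pathlib

    # Assumed repository location and alert threshold.
    REPO = pathlib.Path("/opt/aem/author/crx-quickstart/repository")
    THRESHOLD_GB = 50

    total = sum(f.stat().st_size for f in REPO.rglob("*.tar"))
    total_gb = total / (1024 ** 3)
    print("tar files: %.1f GB" % total_gb)
    if total_gb > THRESHOLD_GB:
        print("WARNING: tar files exceed %d GB - verify that tar optimization"
              " is running to completion" % THRESHOLD_GB)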

Data Store Garbage Collection

Data store garbage collection surfaces many of the same symptoms as tar file optimization. See http://helpx.adobe.com/crx/kb/DataStoreGarbageCollection.html for a description of what data store garbage collection is and http://blogs.adobe.com/dekesmith/2012/03/13/using-the-data-store-garbage-collection-in-crx-and-cq/ for a more recent discussion of how to use it. As with tar file optimization, many times the first indication that you have a problem is when you run out of disk space. If you have followed best practices, though, your data store is probably on a different file system than your tar files, so it may not trigger the low disk space monitor to shut down the repository – you may just start seeing IO exceptions in the log when the repository tries to save a binary file to the data store.

The solution for this issue is a little more complex than for the tar file optimization issue. By default the data store garbage collection doesn’t run on a schedule, so the first thing that is required is setting up a schedule to manually run the job. This is complicated by the fact that the job needs to run from start to finish without interruption – it can’t pick up from where it left off if it is interrupted. The job can be interrupted by a backup job shutting down the instance or by a code deployment (since code deployments that include bundles can cause a cascade of bundle restarts, which will interrupt the job). The issue is further complicated by the amount of time required to run the job – depending on your repository size and your configuration it can take hours, up to 24 hours in some cases. This means that you will need to disable backup jobs and code deployments while the process runs.

One strategy to reduce the run time is to separate the deletion of the files from the identification of those files. The data store garbage collection job can be configured to delete the unneeded files, or it can be configured to just touch the files that are in use. You can then write scripts to go through and delete files older than a certain date, as in the sketch below. This reduces the amount of time that you need to shut off backups and deployments.
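
Here is a sketch of the deletion half of that “touch, then delete” strategy: it removes data store files whose modification time is older than a cutoff, i.e. files the garbage collection run did not touch. The data store path and the cutoff are assumptions, and you should obviously only run something like this after verifying that the garbage collection “touch” pass ran to completion:

    import pathlib
    import time

    # Assumed data store location; the cutoff should fall safely before the
    # start of the last completed garbage collection "touch" run.
    DATASTORE = pathlib.Path("/opt/aem/author/crx-quickstart/repository/datastore")
    CUTOFF = time.time() - 24 * 60 * 60

    reclaimed = 0
    for f in DATASTORE.rglob("*"):
        if f.is_file() and f.stat().st_mtime < CUTOFF:
            reclaimed += f.stat().st_size
            f.unlink()  # file was not touched by the GC run, so it is unreferenced
    print("reclaimed %.1f GB" % (reclaimed / 1024 ** 3))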

A second complication is that, in addition to running the garbage collection job, you may also need to schedule a clean up of your package manager. One of the most common culprits in the growth of the data store in dev and QA environments is packages. Developers will create packages on development or QA instances in order to get data for their local environments or to move data between environments. These packages are normally one-time use – they get created and then never used again. This means that you may need to create a process to periodically clean unused packages out of your package manager in order to recover space; a sketch follows below.
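
Removing a stale package can be scripted against the package manager’s HTTP interface (the same endpoint Adobe’s documented curl examples use). A sketch using the third-party requests library – the host, credentials, and package path are placeholders:

    import requests

    HOST = "http://dev-author.example.com:4502"             # placeholder host
    PACKAGE = "/etc/packages/my_packages/content-dump.zip"  # hypothetical package

    # POST .../crx/packmgr/service/.json/<package path>?cmd=delete removes the
    # package; this mirrors Adobe's documented curl examples.
    r = requests.post(
        HOST + "/crx/packmgr/service/.json" + PACKAGE,
        params={"cmd": "delete"},
        auth=("admin", "admin"),  # placeholder credentials
    )
    print(r.status_code, r.text)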

Conclusion

While these may seem like simple issues to solve, they are often missed because the consequences of not dealing with them seem a long way off and your project team has lots of other higher priority things to deal with. The reality is that the consequences probably aren’t as far off as you think, and the pain you will feel when you run out of disk space or lose two weeks of authoring will be far worse than having to divert some development effort early on in the project.