August 30, 2016

Automate Deleting Taskflow Log

We use HFM taskflows.  A lot.  When a taskflow runs it generates log information so you can check on the status of the execution and examine any errors to effect remediation.

But like the EPMA logs, these log entries stay around forever until you delete them.  On the Task Flow Status page you see each task flow status on a line with a check box.  To delete a log entry you click the check box to select the entry and click the Delete button.  Child's play.


Since these are check boxes you would think, as a standard GUI practice, you could check multiple boxes and click the delete button to delete multiple log records.  You would be wrong.  You have to select the log records one at a time to delete them.  Isn't that convenient?

So if you don't constantly do maintenance you wind up with a bunch of useless information in your log viewer.  Searches take longer and if you want to delete entries it takes a lot of time to complete.  What's a mother to do?

Analysis

We use the taskflows to monkey with our HFM applications.  Taskflows are found in the Consolidation Administration tab so you might think the information for taskflows is stored in the HFM database.  But we can also do EPMA-related things like run data synchronizations and redeploy applications.  So where do we find the tables that manage the taskflows and logs?  In the Shared Services database, of course.

It took me some digging to figure that out, but the first clue was finding a table named [ces_wf_instances].  SELECTing from that table shows the field process_name with the names of my taskflows.  Other useful fields include starttime, a datetime field that we will use to find all the old log entries, and workflow_id and taskid, which are used as references to other tables.  workflow_id is the taskid with the string "wf-" as a prefix.  workflow_id is also part of the primary key so it is unique across the tables.
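For example, a quick exploratory query (just a sketch, limited to the fields described above) run against the Shared Services database shows the most recent taskflow executions:
SELECT TOP 20 process_name, starttime, taskid, workflow_id
FROM [ces_wf_instances]
ORDER BY starttime DESC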



The nice thing is that taskid values are always in ascending order.  I don't have any way to prove this, but I suspect that taskid is based at least in part on the datetime of the activity.  This means that when we process the other tables we can define a cutoff date, find the first taskid in [ces_wf_instances] for that date, then delete anything that has a taskid less than that.  The not-so-nice thing is that in all the other tables the taskid field is named task_id.
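If you want to check that assumption in your own environment, a quick self-join (a sketch only) counts any pairs of rows where the taskid order disagrees with the starttime order; a result of zero means the two orderings agree:
SELECT COUNT(*) AS out_of_order_pairs
FROM [ces_wf_instances] a
JOIN [ces_wf_instances] b
  ON a.starttime < b.starttime
 AND a.taskid    > b.taskid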



What other tables will we process?  There are 13 tables with the ces_ prefix, but some of them are empty and others are unrelated to logging.  The [ces_tasks] table lists the tasks within the taskflow, and [ces_messages] contains more detailed messages about each task.  Both tables contain a field named objectvalue, which is an image blob: a large hex string that gets processed to display that task's part of the workflow status.  Both of these tables are referenced by task_id which, like taskid in [ces_wf_instances], is sequential, so we can still use the strategy of deleting records less than the target task_id.
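Before deleting anything it can be reassuring to count what would go away.  Here is a sketch of an optional pre-check that borrows the cutoff logic described in the Solution section below:
DECLARE @PriorDate DateTime = DATEADD(M, -3, GetDate());
DECLARE @LastID    nvarchar(100) =
        (SELECT TOP 1 taskid FROM [ces_wf_instances]
         WHERE starttime >= @PriorDate ORDER BY starttime);

SELECT (SELECT COUNT(*) FROM [ces_messages] WHERE task_id < @LastID) AS old_message_rows,
       (SELECT COUNT(*) FROM [ces_tasks]    WHERE task_id < @LastID) AS old_task_rows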



There are also the [ces_participant] and [ces_participant_events] tables, which have details on the steps in the taskflow.  Like the [ces_tasks] and [ces_messages] tables, these tables have the objectvalue image blob.  The [ces_participant] table is referenced by workflow_id, which matches the field in the [ces_wf_instances] table.  There is also a field named participant_id which contains the workflow_id, the user, and the name of the step in the taskflow.  This participant_id is a foreign key in the [ces_participant_events] table.  Because there is a foreign key constraint between the tables, when we delete records from the [ces_participant] table the records with the matching participant_id in the [ces_participant_events] table also get deleted.
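To see the relationship for yourself, a simple join (a sketch; the workflow_id shown is a placeholder, substitute a real value from [ces_wf_instances]) lists the participant events for a single taskflow execution:
SELECT p.workflow_id, p.participant_id, e.*
FROM [ces_participant]        p
JOIN [ces_participant_events] e ON e.participant_id = p.participant_id
WHERE p.workflow_id = 'wf-...'   -- placeholder, use an actual workflow_id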

Solution

So now that we know all the players, let's draw up a play to get us to the end zone.  We will build a SQL script that defines our cutoff date, finds the first taskid on or after that date, then deletes records from the other tables where the taskid is less than that value.

There are different tactics for defining the cutoff date.  If we are running this process manually, maybe we just want to define the date with a static string:
DECLARE @PriorDate DateTime;
SET     @PriorDate = '2015-01-01';
But if we want to create a scheduled process we can calculate the desired cut off date.  For example, the following code will use the DATEADD() function to set the cut off date as 3 months prior to the current date:
DECLARE @PriorDate DateTime;
SET     @PriorDate = DATEADD(M,-3,GetDate());

Now that we have the date, we need to find the target taskid.  Because the taskid is sequential, we order by starttime and take the TOP 1 taskid where the starttime is greater than or equal to the cutoff date.  Note that we use greater than or equal to (>=) rather than just equal to (=) in case there was no taskflow execution on the cutoff date.  This gives us a taskid where anything less is prior to our cutoff date.  Also note that we make the @LastID variable nvarchar(100), which matches the data type used in the [ces_wf_instances] table.
DECLARE @LastID nvarchar(100);
SET     @LastID = (
  SELECT TOP 1 taskid
  FROM [ces_wf_instances]
  WHERE starttime >= @PriorDate
  ORDER BY starttime
)

For the [ces_messages] and [ces_tasks] tables we can do a simple delete where task_id is less than our target taskid.  But for the [ces_participant] table we need the workflow_id for all of the records prior to our cutoff date.  One way to do this is to use the IN clause and build a list of workflow_id values from the [ces_wf_instances] table where the starttime is less than the cutoff date:
DELETE FROM [ces_participant]
WHERE workflow_id IN (
select workflow_id
from [ces_wf_instances]
where starttime < @PriorDate)

Remember that because of the foreign key constraint deleting records from [ces_participant] will also delete the associated records from the [ces_participant_events] table.  Also note that because we need the [ces_wf_instances] table to do this delete we have to make that the last table from which we delete records.

The entire SQL script is:

-- Cut off date
DECLARE @PriorDate DateTime;
SET     @PriorDate = DATEADD(M,-3,GetDate());
--SET     @PriorDate = '2015-01-01'
PRINT 'Cut off date = '+convert(VarChar(32),@PriorDate)

-- Get the first taskid on or after the cutoff date
DECLARE @LastID nvarchar(100);
SET     @LastID = (
  select top 1 taskid
  from [ces_wf_instances]
  where starttime >= @PriorDate
  order by starttime
  )
PRINT 'Last TaskID = '+@LastID

-- These tables are simple deletes
DELETE from [ces_messages] where task_id < @LastID
DELETE from [ces_tasks]    where task_id < @LastID

-- [ces_participant] table has a foreign key constraint against the [ces_participant_events]
-- table.  Deleting records from [ces_participant] deletes records with the same
-- participant_id from [ces_participant_events].  So to process these two tables we
-- get a list of workflow_ids where starttime is less than @PriorDate
DELETE FROM [ces_participant]
WHERE workflow_id IN (
select workflow_id
from [ces_wf_instances]
where starttime < @PriorDate
   )

-- We could also have deleted based on taskid
DELETE FROM [ces_wf_instances]
WHERE starttime < @PriorDate

This gives us a fairly simple SQL script we can schedule as a SQL agent job or any other means appropriate for the environment.  No manual steps, no muss, no fuss, and we keep the taskflow log and the underlying tables lean and manageable.
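For example, creating a monthly SQL Agent job might look something like the sketch below.  The job name, target database name, and the wrapper stored procedure are placeholders for whatever fits your environment; the job step could just as easily contain the script itself.
USE msdb
GO
EXEC dbo.sp_add_job         @job_name = N'Purge Taskflow Logs'
EXEC dbo.sp_add_jobstep     @job_name = N'Purge Taskflow Logs',
                            @step_name = N'Delete old ces_ rows',
                            @subsystem = N'TSQL',
                            @database_name = N'SharedServices',           -- placeholder database name
                            @command = N'EXEC dbo.usp_PurgeTaskflowLogs'  -- placeholder proc wrapping the script above
EXEC dbo.sp_add_schedule    @schedule_name = N'Monthly 2 AM',
                            @freq_type = 16,             -- monthly
                            @freq_interval = 1,          -- on day 1 of the month
                            @active_start_time = 020000  -- 02:00:00
EXEC dbo.sp_attach_schedule @job_name = N'Purge Taskflow Logs',
                            @schedule_name = N'Monthly 2 AM'
EXEC dbo.sp_add_jobserver   @job_name = N'Purge Taskflow Logs'
GO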

August 3, 2016

The Case of the Missing Applications

Patching Oracle EPM products can be challenging.  Things have gotten better with newer versions but there are still wrinkles that have to be ironed out on occasion.


In this case we were applying the latest Patch Set Updates (PSU) for our Financial Reporting/Reporting and Analysis/BIPlus applications.  One of the wrinkles with PSUs, particularly those for BIPlus, is that you have to redeploy the application with the EPM System Configurator/Config Tool.  One of the other wrinkles with EPM products is that they are frequently bolted on to the rest of the suite and will go by many names when you try to research a problem.

After successfully running OPatch to update the product I ran configtool.bat.  But the list of available products to configure only showed the basic Foundation Services.  None of the other products, including the Reporting and Analysis Framework we had just patched, were available.

Fortunately this was in our QA environment, so there was no chance of causing a production outage.  Also fortunately, and oddly, the services were still visible in the Windows Services console.  And the OPatch for the PSU completed correctly, which confirmed that the required file structure was still in place.

Perhaps the other products needed to be reinstalled?  But when I ran the InstallTool.cmd the only products available to re-install were the same Hyperion Foundation products that we saw in ConfigTool.

Research

I opened an SR with Oracle to help me determine the cause.  We ran the usual diagnostics and checked files related to inventory such as the HYPERION_HOME/inventory/contentsXML/comps.xml and EPM_INSTANCE/config/foundation/11.1.2.0/reg.properties files.  But that didn't show us anything useful.

The Oracle SE hit on the idea of using ProcMon to watch what files were being processed.  This pointed us to the HYPERION_HOME/.oracle.products file as the file referenced during configuration.  After comparing the .oracle.products file to our production server we confirmed that the file in our QA environment got damaged somewhere along the line.

Solution

The .oracle.products file isn't used for normal operations.  It is just referenced and updated during installation and configuration.  So the file could be damaged for months or years and you would never notice until you run the configurator or try to install or uninstall products.

One solution was to back up the existing .oracle.products file, remove all of the <InstallableItem> and <products> nodes from the XML file, then run InstallTool and reinstall the required products.  Because this would overwrite all of the existing .ear and .jar files, all previous patches would also need to be re-applied.

While this is a reasonable solution it involves several manual steps which can be prone to error.  And it can be time consuming to reapply all of the patches.

The other solution was simple and brilliantly elegant.  Just take a clean server and run InstallTool to install the products that are on the damaged server, then copy the .oracle.products file from the temporary server to the damaged server.

I used VirtualBox to quickly spin up a virtual machine to handle this task.  Note that all I needed to do was install the products to generate the .oracle.products file.  There was no need to worry about connecting to an existing WebLogic instance, or accessing databases, or any of the other configuration tasks.  Plus I already had a server image I could deploy in VirtualBox, so I completed the whole process in an hour.

Once the .oracle.products file was copied I was able to successfully complete the redeploy and confirm that the PSU fixed the issue we were facing.

Takeaways:
1. ProcMon is a useful tool for finding missing or damaged files and registry keys.
2. VMs are useful for troubleshooting and problem resolution.  Everybody gives away hypervisors so be sure to keep something handy for when you need it.
3. The .oracle.products file is used by InstallTool and ConfigTool to determine what is installed and configurable on the server.