October 24, 2016

Truncating the Job Console logs

In the last post we talked about how to automate deleting the task flow logs.  Another place where log entries accumulate is the Library Job Console.

The Job Console log is more manageable than the Task Flow log.  In the Job Console tab you can display up to 200 entries per page and you can click the first record, scroll to the bottom of the list, shift+click to select all the entries, then right-click and delete.


This is better but can still be a bunch of work if you haven't cleared the logs in a while.  And wouldn't you rather have an automated solution that can be scheduled so you can work on other things?  Of course you would.

Analysis

The discovery was pretty easy.  In the EPMA database the table name [JM_Job] is a dead giveaway.  The contents show the same information we see in the Job Console tab.


There are three tables with the JM_ prefix that hold the job log data: [JM_Job], [JM_Batch], and [JM_Attachment].  They are connected by the i_job_id key, which has sequential values and is the primary key in the [JM_Job] table.  The [JM_Attachment] primary key combines i_job_id and i_attachment_id, where i_attachment_id is also a sequential value.  The primary key for the [JM_Batch] table combines the foreign keys i_job_id and i_attachment_id with the field i_batch_id.


The [JM_Job] table has the start and completion times for the various jobs.  So we can reference that to find our cut off date, get the i_job_id of the last job before that date, and delete from the three tables anything with an i_job_id less than or equal to it.

Solution

The wrinkle is that, because of the foreign key constraints on i_job_id, the [JM_Job] table has to be the last one truncated.  And because i_attachment_id is a foreign key in [JM_Batch], that table has to be truncated first.

Like with the task flow truncation we can either calculate or specify the cut off date.

DECLARE @CutOff datetime;
SET     @CutOff = DATEADD(M,-4,GetDate());
--DECLARE @CutOff char(10);
--SET     @CutOff = '2016-01-01'

There are several ways to tackle getting the i_job_id for our cut off date.  One approach is to assign the i_job_id to a variable and use that in the DELETE statement:

DECLARE @ID Int;
SET @ID = (SELECT Top 1 i_job_id FROM JM_Job WHERE d_started < @CutOff ORDER by i_job_id DESC)
DELETE FROM JM_Batch WHERE i_job_id <= @ID

But these are fairly small tables so it is easy to use a subselect in the DELETE:

DELETE from JM_Batch
where i_job_id in (
SELECT i_job_id
FROM JM_Job
WHERE d_started < @CutOff)

The last piece is to arrange the DELETEs in the correct order.  The complete T-SQL script is:

use EPMA
go

-- Anything prior to cutoff date will be deleted
--   The DateAdd function can change depending on need
--
DECLARE @CutOff datetime;
SET     @CutOff = DATEADD(M,-4,GetDate());
--DECLARE @CutOff char(10);
--SET     @CutOff = '2015-01-01'
PRINT 'Cut off date = '+convert(VarChar(32),@CutOff)

-- Because of the key constraints the deletes need to happen in this order
--   JM_Batch has to be first because it has keys from JM_Attachment and JM_Job
--   JM_Attachment has key for JM_Job so deletes have to happen before
--     JM_Job but after JM_Batch
--   JM_Job has to be last because that has the date field we query against
--
-- For the first 2 deletes we get a list of i_job_id from JM_Job that are older
--   than the cutoff date, and compare the i_job_id in the table to that list
--
DELETE from JM_Batch
where i_job_id in (
SELECT i_job_id
FROM JM_Job
WHERE d_started < @CutOff)

DELETE from JM_Attachment
where i_job_id in (
SELECT i_job_id
FROM JM_Job
WHERE d_started < @CutOff)

DELETE from JM_Job
where d_started < @CutOff

This gives us a tidy bit of code that we can run on a regular basis to purge stale entries from the Library Job Console log.
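
Once the script is saved to a file it can be run unattended with sqlcmd.  Here is a minimal PowerShell sketch, assuming the script above is saved as D:\Scripts\TruncateJobConsole.sql and the EPMA database lives on a server named SQLSERVER01 (both names are placeholders for your environment):

# Run the Job Console truncation script against the EPMA database.
# -E uses Windows authentication; -b returns a non-zero exit code if the script fails.
& sqlcmd -S "SQLSERVER01" -E -b `
    -i "D:\Scripts\TruncateJobConsole.sql" `
    -o "D:\Scripts\TruncateJobConsole_LastRun.log"

if ($LASTEXITCODE -ne 0) {
    Write-Warning "Job Console truncation failed - check the output log."
}

The script already contains the USE EPMA statement, so no database needs to be specified on the command line.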

This blog post is a bit short, so here's a picture of a kitty.


August 30, 2016

Automate Deleting Taskflow Log

We use HFM taskflows.  A lot.  When a taskflow runs it generates log information so you can check on the status of the execution and examine any errors to effect remediation.

But like the EPMA logs, these log entries stay around forever until you delete them.  On the Task Flow Status page you see each task flow status on a line with a check box.  To delete a log entry you click the check box to select the entry and click the Delete button.  Child's play.


Since these are check boxes you would think, as a standard GUI practice, you could check multiple boxes and click the delete button to delete multiple log records.  You would be wrong.  You have to select the log records one at a time to delete them.  Isn't that convenient?

So if you don't constantly do maintenance you wind up with a bunch of useless information in your log viewer.  Searches take longer and if you want to delete entries it takes a lot of time to complete.  What's a mother to do?

Analysis

We use the taskflows to monkey with our HFM applications.  Taskflows are found in the Consolidation Administration tab, so you might think the information for taskflows is in the HFM database.  But we can also do EPMA-related things like run data synchronizations and redeploy applications.  So where do we find the tables that manage the taskflows and logs?  In the Shared Services database, of course.

It took me some digging to figure that out, but the first clue was finding a table named [ces_wf_instances].  SELECTing from that table shows the field process_name with the names of my taskflows.  Other useful fields include starttime, a datetime field that we will use to find all the old log entries, and workflow_id and taskid, which reference other tables.  workflow_id is the taskid with the string "wf-" as a prefix.  workflow_id is also part of the primary key, so it is unique within the table.



The nice thing is that taskid values are always in ascending order.  I don't have any way to prove this, but I suspect that taskid is based at least in part on the datetime of the activity.  This means that when we process the other tables we can define a cutoff date, find the first taskid in [ces_wf_instances] on or after that date, then delete anything that has a taskid less than that.  The not-so-nice thing is that in all the other tables the taskid field is named task_id.



What other tables will we process?  There are 13 tables with the ces_ prefix, but some of them are empty and others are unrelated to logging.  The [ces_tasks] table lists the tasks within the taskflow, and [ces_messages] contains more detailed messages about each task.  Both tables contain a field named objectvalue, which is an image blob.  This is a large hex string that gets processed to display that task's part of the workflow status.  Both of these tables are keyed by task_id which, like taskid in [ces_wf_instances], is sequential, so we can still use the strategy of deleting records with a task_id less than the target.



There are also the [ces_participant] and [ces_participant_events] tables, which have details on the steps in the taskflow.  Like the [ces_tasks] and [ces_messages] tables, these tables have the objectvalue image blob.  [ces_participant] is keyed by workflow_id, which matches the field in the [ces_wf_instances] table.  There is also a field named participant_id, which contains the workflow_id, the user, and the name of the step in the taskflow.  This participant_id is a foreign key in the [ces_participant_events] table.  Because of the foreign key constraint between the two tables, when we delete records from [ces_participant] the records with the matching participant_id in [ces_participant_events] are deleted as well.

Solution

So now that we know all the players, let's draw up a play to get us to the end zone.  We will build a SQL script that defines our cutoff date, finds the first taskid on or after that date, then deletes records from the other tables where the task_id is less than that target.

There are different tactics to define the cut off date.  If we are running this process manually maybe we just want to define the date with a static string:

DECLARE @PriorDate DateTime;
SET     @PriorDate = '2015-01-01';

But if we want to create a scheduled process we can calculate the desired cut off date.  For example, the following code uses the DATEADD() function to set the cut off date to 3 months prior to the current date:

DECLARE @PriorDate DateTime;
SET     @PriorDate = DATEADD(M,-3,GetDate());

Now that we have the date, we need to find the target taskid.  The taskid values are sequential, so we take the TOP 1 taskid, ordered ascending, where the starttime is greater than or equal to the cut off date.  Note that we use greater than or equal to (>=) rather than just equal to (=) in case there was no taskflow execution on the cut off date.  This gives us a taskid where anything less is prior to our cutoff date.  Also note that we declare the @LastID variable as nvarchar(100), which matches the data type used in the [ces_wf_instances] table.

DECLARE @LastID nvarchar(100);
SET     @LastID = (
  SELECT TOP 1 taskid
  FROM [ces_wf_instances]
  WHERE starttime >= @PriorDate
  ORDER BY taskid
)

For the [ces_messages] and [ces_tasks] tables we can do a simple delete where task_id is less than our target taskid.  But for the [ces_participant] table we need to get the workflow_id for all of the records prior to our cut off date.  One way to do this is to use the IN clause and build a list of workflow_id from the [ces_wf_instances] table where the starttime is less than the cut off date.

DELETE FROM [ces_participant]
WHERE workflow_id IN (
select workflow_id
from [ces_wf_instances]
where starttime < @PriorDate)

Remember that, because of the foreign key constraint, deleting records from [ces_participant] also deletes the associated records from the [ces_participant_events] table.  Also note that because we need the [ces_wf_instances] table to drive this delete, it has to be the last table from which we delete records.

The entire SQL script is:

-- Cut off date
DECLARE @PriorDate DateTime;
SET     @PriorDate = DATEADD(M,-3,GetDate());
--SET     @PriorDate = '2015-01-01'
PRINT 'Cut off date = '+convert(VarChar(32),@PriorDate)

-- Get the first taskid on or after the cutoff date
DECLARE @LastID nvarchar(100);
SET     @LastID = (
  SELECT TOP 1 taskid
  FROM [ces_wf_instances]
  WHERE starttime >= @PriorDate
  ORDER BY taskid
  )
PRINT 'Last TaskID = '+@LastID

-- These tables are simple deletes
DELETE from [ces_messages] where task_id < @LastID
DELETE from [ces_tasks]    where task_id < @LastID

-- [ces_participant] table has a foreign key constraint against the [ces_participant_events]
-- table.  Deleting records from [ces_participant] deletes records with the same
-- participant_id from [ces_participant_events].  So to process these two tables we
-- get a list of workflow_ids where starttime is less than @PriorDate
DELETE FROM [ces_participant]
WHERE workflow_id IN (
select workflow_id
from [ces_wf_instances]
where starttime < @PriorDate
   )

-- We could also have deleted based on taskid
DELETE FROM [ces_wf_instances]
WHERE starttime < @PriorDate

This gives us a fairly simple SQL script we can schedule as a SQL Agent job or run by any other means appropriate for the environment.  No manual steps, no muss, no fuss, and we keep the taskflow log and the underlying tables lean and manageable.
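
If SQL Agent isn't an option, a Windows scheduled task is one of those other means.  Here is a minimal PowerShell sketch using the ScheduledTasks cmdlets (Windows Server 2012 or later); the script path, server name, and the "HSS" Shared Services database name are placeholders to adjust for your environment:

# Run sqlcmd against the Shared Services database with the truncation script saved above.
$action  = New-ScheduledTaskAction -Execute "sqlcmd.exe" `
    -Argument '-S SQLSERVER01 -d HSS -E -b -i "D:\Scripts\TruncateTaskflowLog.sql"'

# Once a week, early Sunday morning, outside of the close calendar.
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Sunday -At 2am

Register-ScheduledTask -TaskName "Truncate Taskflow Log" -Action $action -Trigger $trigger `
    -Description "Purges taskflow log entries older than the cutoff date"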

August 3, 2016

The Case of the Missing Applications

Patching Oracle EPM products can be challenging.  Things have gotten better with newer versions but there are still wrinkles that have to be ironed out on occasion.

In this case we were applying the latest Patch Set Updates (PSUs) for our Financial Reporting/Reporting and Analysis/BIPlus applications.  One of the wrinkles with PSUs, particularly those for BIPlus, is that you have to redeploy the application with the EPM System Configurator/Config Tool.  Another wrinkle with EPM products is that they are frequently bolted on to the rest of the suite and go by many names when you try to research a problem.

After successfully running OPatch to update the product I ran configtool.bat.  But the list of available products to configure showed only the basic Foundation Services.  None of the other products were available, including the Reporting and Analysis Framework we had just patched.

Fortunately this was in our QA environment so there was no chance of causing a production outage.  Also fortunately, and oddly, the services were still visible in the Windows Services console.  And the OPatch for the PSU completed correctly which confirmed that the required file structure was still in place.

Perhaps the other products needed to be reinstalled?  But when I ran the InstallTool.cmd the only products available to re-install were the same Hyperion Foundation products that we saw in ConfigTool.

Research

I opened an SR with Oracle to help me determine the cause.  We ran the usual diagnostics and checked files related to inventory such as the HYPERION_HOME/inventory/contentsXML/comps.xml and EPM_INSTANCE/config/foundation/11.1.2.0/reg.properties files.  But that didn't show us anything useful.

The Oracle SE hit on the idea of using ProcMon to watch what files were being processed.  This pointed us to the HYPERION_HOME/.oracle.products file as the file referenced during configuration.  After comparing the .oracle.products file to the one on our production server we confirmed that the file in our QA environment had been damaged somewhere along the line.

Solution

The .oracle.products file isn't used for normal operations.  It is just referenced and updated during installation and configuration.  So the file could be damaged for months or years and you would never notice until you run the configurator or try to install or uninstall products.

One solution was to back up the existing .oracle.products file, remove all of the <InstallableItem> and <products> nodes from the XML, then run InstallTool and reinstall the required products.  Because this would overwrite all of the existing .ear and .jar files, all previous patches would also need to be re-applied.

While this is a reasonable solution it involves several manual steps which can be prone to error.  And it can be time consuming to reapply all of the patches.

The other solution was simple and brilliantly elegant.  Just take a clean server and run InstallTool to install the products that are on the damaged server, then copy the .oracle.products file from the temporary server to the damaged server.

I used VirtualBox to quickly spin up a virtual machine to handle this task.  Note that all I needed to do was install the products to generate the .oracle.products file.  There was no need to worry about connecting to an existing WebLogic instance, accessing databases, or any of the other configuration tasks.  Plus I already had a server image I could deploy in VirtualBox, so I completed the whole process in an hour.

Once the .oracle.products file was copied I was able to successfully complete the redeploy and confirm that the PSU fixed the issue we were facing.

Takeaways:
1. ProcMon is a useful tool for finding missing or damaged files and registry keys
2. VMs are useful for troubleshooting and problem resolution.  Everybody gives away hypervisors so be sure to keep something handy for when you need it.
3. The .oracle.products file is used by InstallTool and ConfigTool to determine what is installed and configurable on the server.

May 23, 2016

Auditing Financial Reports

You know how it goes.  Somebody starts a project and then forgets about it.  Or there are multiple versions of a report that were made for various purposes.  Or a report was critical at one point, but interest eventually waned and now it sits unused.  Whatever the cause, you wind up with a bunch of reports, books, and batches that are never used.  But you can't tell which ones aren't used if you don't have an audit trail.

There are other uses for an audit trail beyond finding which reports are used.  Maybe you need to know who runs the reports, or how frequently reports are generated, or whether reports are run at certain times so you can plan maintenance.  Or maybe you're just curious.


Oracle Docs 1353965.1 and 1947774.1 show that the FRLogging.log file holds the audit data for Financial Reporting objects.  FRLogging.log contains a bunch of information, including start and stop times for reports and books, the steps taken to produce output, the queries used to pull from the data source, and other details.

The Problem

As noted in document 1947774.1, you configure logging in the MIDDLEWARE_HOME/user_projects/domains/EPM_System/config/fmwconfig/servers/FinancialReporting0/logging.xml file.  By default BI+ limits the log file size to 1 MB.  When the log reaches that limit the FRLogging.log file is renamed by appending a sequential number to the file name and a new log file is created.  By default only 50 MB of log files are retained.

This is good news because BI+ won't create an infinite number of logs that could fill your drive and crash your system.  But this is bad news because you can blow through 50 MB of logging in an afternoon of running multiple report batches during month-end close, thus losing historical data.

Also, the logs contain a lot of information that isn't interesting from an audit standpoint.  We just want the lines that show the name of the report, the time it ran, and the user who ran it.

We have to parse the FRLogging logs to get just the summary information.  But we can't process the FRLogging.log file directly because it is held open by the Financial Reporting services and new records are constantly being appended.

The Solution

Windows has a .NET class called FileSystemWatcher.  This will monitor a file or folder and take action when file system events occur such as creating, deleting, or renaming a file.  We can use this feature to monitor the FRLogging.log file.  When it gets renamed we find the newest numbered file in the folder and process that.  This allows us to process a file that is not in use and prevents us from processing a log file more than once.

There are several lines in FRLogging.log that are useful for audits.  The START: record gets added when the job starts, and the JOB: record gives a completion summary of the job, including the total time the job took to complete.  In between these are one or more REPORT: records.  If you run a single report you get one REPORT: record, but for a book or batch you will get multiple records, one for each report generated.


Each of those lines contains the string "start:" (note the colon in the string).  For the JOB: and REPORT: records this string identifies the start time of the event, while for the START: records the string is the record identifier.  No other records contain this string, so it is a reliable way to parse the log for the audit data.

To put this all together we need a script that will run FileSystemWatcher, wait for the FRLogging.log file to get renamed, find the latest numbered file, do a case-insensitive search for lines with the string "start:", and drop those lines into a file we can retain for as long as we need.  This PowerShell script accomplishes those tasks:

# PoSh script to keep a record of all report usage
#   Use the FileSystemWatcher to monitor the FRLogging.log.  When the file gets renamed
#  we pull the job information from the most recent file and save it to an audit log
#
$Folder = `
"D:\Oracle\Middleware\user_projects\domains\EPMSystem\servers\FinancialReporting0\logs"
$File   = "FRLogging.log" # File to monitor with FileSystemWatcher
$Filter = "FRLogging*.log" # Filter to find the most recently renamed file
$Output = "D:\FRAudit.log" # Running audit trail

# Configure the FileSystemWatcher.  Create the watcher object, then define the
# folder and file to watch
#
$watcher        = New-Object System.IO.FileSystemWatcher
$watcher.Path   = $Folder
$watcher.Filter = $File

# Define the action to take when the file we are watching gets renamed
#
$renamed = Register-ObjectEvent $watcher "Renamed" -Action {

  # Get the FullName of the most recent numbered FRLogging*.log file
  $LogFile = (Get-ChildItem $Folder $Filter | Sort LastWriteTime -desc)[1].FullName

  # Do a case insensitive search for all lines that contain the string "start:"
  $NewRecs = get-content $LogFile | where {$_ -match "start:"}

  # And append those lines to our running audit trail
  $NewRecs | out-file $Output -append
}

If you use this script you will need to change the value for the $Folder variable to match your environment.  The value of the $Output variable is arbitrary but must be the full path to a file for your running audit trail.  The tricky part of the script is the line where we find the most recent log file.  Get-ChildItem returns an array of files matching the filter in the folder.  We sort that array by LastWriteTime in reverse order and take the second element of the array (index [1]) to get the latest numbered file since the first element (array index [0]) is the FRLogging.log file.

But how do we make sure the script stays running to continuously monitor the FRLogging.log file?

An easy solution is to use Windows Task Scheduler to create a task that runs at system startup.  You can also start the task manually when you first create it or if you adjust it later.  Since there is nothing to terminate the script the task keeps running in the background.  We basically get a service that starts automatically at every reboot.
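
As a sketch of that setup, the following PowerShell registers such a startup task with the ScheduledTasks cmdlets (Windows Server 2012 or later).  The script path and the choice of the SYSTEM account are assumptions for your environment; the -NoExit switch is there to keep the PowerShell session, and with it the Register-ObjectEvent subscription, alive after the script file finishes running:

# Launch the watcher script in a hidden PowerShell session at every boot.
# -NoExit keeps the session running so the FileSystemWatcher event stays registered.
$action  = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument '-NoProfile -NoExit -WindowStyle Hidden -ExecutionPolicy Bypass -File "D:\Scripts\Watch-FRLogging.ps1"'

$trigger = New-ScheduledTaskTrigger -AtStartup

# SYSTEM keeps the example simple; a dedicated service account with rights to the log folders also works.
$principal = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount

Register-ScheduledTask -TaskName "FR Audit Log Watcher" -Action $action -Trigger $trigger `
    -Principal $principal -Description "Appends FRLogging report activity to the audit trail"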


Our new audit log is just a raw dump of the useful lines from the FRLogging.log file.  You will still need to parse that data to isolate the information of interest to you.  (I have another PowerShell script to do that and will send it to you if you are interested.)  But the key point is that we can leverage the transient information Oracle gives us and generate a persistent audit trail.  And we can do it for free using built-in tools and with minimal overhead.

April 21, 2016

Permanent temporary files

We update the metadata for our HFM applications at least once per month.  Before updating the shared dimension library and deploying the production application we test all of the changes in our development environment.

During a recent test the deployment in development failed.  Usually a failure happens during the deployment phase due to some misconfiguration in the new dimension members.  In this case the deployment failed during the initial export with the error message:

Detail : An error was encountered while executing the Application Export:  An error was encountered during export: The file exists.

Oracle Doc ID 1639083.1 shows the same error for Planning applications.  Because Planning and HFM both use the EPMA dimension and application libraries you can often extrapolate from the Planning solutions to solve HFM problems.

The article says the <SystemDrive>:\Windows\temp folder is full and the solution is to delete files from that directory.  But my C:\ drive wasn't full and there weren't many files in the \windows\temp directory.

Troubleshooting

According to the Oracle doc we were running out of file names.  But where?

The deployment is performed by EPMA and it fails before HFM starts its processing.  (Years ago an Oracle support engineer told me that during a deployment all the processing up to 20% is EPMA and after that it is HFM.  At any rate we don't get past the export so I know definitively that the problem was with EPMA.)

Sysinternals has a bunch of powerful utilities for Windows systems.  I've been using these utilities since the Windows NT days.  The author of the utilities, Mark Russinovich, was eventually hired as a Technical Fellow at Microsoft and is now the CTO of Azure.  He's one of the smartest guys you can follow, and he's an engaging and articulate speaker if you see him give a presentation.

In this case I had to monitor the processes on the EPMA server to see what caused the failure.  From live.sysinternals.com I downloaded ProcMon and ran it on the EPMA server.  ProcMon monitors everything on the server, so you need to apply filters to isolate the information you want.  A handy technique is to let it capture for a few seconds, stop the capture, right-click any processes you don't care about (explorer.exe, svchost.exe, your anti-virus software, etc.) and select Exclude.  This adds the exclusions to the current filter.  Repeat until you have only the handful of processes of interest, in this case anything related to EPMA.  Then clear the log, start the capture, and redeploy.


Once the deployment failed I stopped the capture and saved the results.  ProcMon captures access to the file system and registry as well as network and process thread events, so I removed everything except file system requests.  I didn't have to scroll down very far to see thousands of events from EPMA_Server.exe with a result of NAME COLLISION.  These are all in the AppData\Local\Temp directory of the user account under which the services run.


The files are named tmpXXXX.tmp, where XXXX appears to be a random set of 4 hex digits.  Some of these files are created during application deployment.  We also do a nightly LCM export of EPMA for backup purposes, which creates more .tmp files.  Some of the .tmp files are XML properties of the applications and some are binary files.  Since this is our dev environment and we have many test applications, do regular deployments of HFM applications, and export EPMA nightly, we ran out of the 65,535 possible names for the .tmp files.

Solution

As the Oracle doc implies, we just delete the .tmp files.  They are, after all, just temporary files in a temp directory.  Why Oracle doesn't delete them automatically or overwrite existing files instead of throwing an error is a topic for philosophers and software engineers to resolve.

Another location where files accumulate is \Oracle\Middleware\user_projects\epmsystem\tmp.  That folder holds temp files for shared dimension library updates, the diagnostic framework, and WebLogic, among others.  Beneath this \tmp folder is a folder named EPMA that holds temp XML files with all the properties of all the dimensions in redeployed applications, as well as temp ADS files created during LCM exports from EPMA.  Also beneath the \tmp folder is a folder named oracle\tmp which has subfolders holding log files generated during HFM data syncs.

For us the issue came to light because, in addition to the nightly LCM exports of EPMA, we also have reporting applications that are updated nightly.  We use HFM taskflows to update the metadata, redeploy the applications, and run data syncs to load the data.  Because the .xml files generated during the redeploys can be large and we went through many rounds of testing, the data drive eventually filled to capacity, which killed all activity on our EPMA server.

Our solution is regular maintenance to delete temp files in both the AppData\Local\Temp and the epmsystem\tmp folders on servers running the Hyperion EPMA Server service.  This task is easy enough to complete manually that I have not created an automated process for it.
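
That said, if you ever do want to automate the cleanup, a minimal PowerShell sketch might look like the following.  It assumes the Hyperion services run under an account named svc_hyperion, that anything untouched for 30 days is safe to remove, and PowerShell 3.0 or later for the -File switch; the drive letters and paths follow the examples in this post and should be adjusted to your install, and as always test in a non-production environment first.

# Temp locations that accumulate files on an EPMA server (the folders discussed above).
$tempFolders = @(
    "C:\Users\svc_hyperion\AppData\Local\Temp",          # profile of the account running the services
    "D:\Oracle\Middleware\user_projects\epmsystem\tmp"   # EPMA, LCM, and data sync temp files
)

$cutoff = (Get-Date).AddDays(-30)

foreach ($folder in $tempFolders) {
    # Remove anything that hasn't been written to since the cutoff date
    Get-ChildItem $folder -Recurse -File -ErrorAction SilentlyContinue |
        Where-Object { $_.LastWriteTime -lt $cutoff } |
        Remove-Item -Force -ErrorAction SilentlyContinue
}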

But it is curious that Oracle doesn't automatically delete, or provide facilities for automatically deleting, these temporary files.  If you know of anything built into the products to address these temporary files please share that information in the comments below.