ATLAS Offline Software
Testing the Code

Notes from Feb 2010

The easiest way to run the monitoring nowadays is probably just to run the transform used for production at Tier0, Reco_trf. You have to run this anyway before you are allowed to submit code to the Tier0 cache. The advantage is that it almost always works out of the box in any recent release, and you are running it exactly as it will be run at Tier0. The disadvantage is that by default you are running everything, so it takes longer. You can turn things off to speed it up, but often the same jobOptions won't work when you move from one release to the next, so I don't bother trying any more.

Example use of Reco_trf:

Reco_trf.py inputBSFile=rawDataFilename outputESDFile=myESD.pool.root \
outputAODFile=myAOD.pool.root HIST=myHIST.root \
autoConfiguration=everything maxEvents=200

This runs the complete Tier0 processing chain. The important parts for us are RAWtoESD, ESDtoAOD and Histogram Merging. The latest version of TrigT1CaloMonitoring runs some monitoring in the RAWtoESD step and some in the ESDtoAOD step, and the results are then merged together. Older versions ran everything in the RAWtoESD step, but we have been asked to move as much as possible to the ESDtoAOD step. I've never tried running on the Grid as it's not necessary for testing; I usually use our batch farm here at Birmingham. I recommend using the latest release possible; I'm using AtlasProduction-15.6.4.1, which I believe is the current Tier0 release. The latest tags of TrigT1CaloMonitoring and TrigT1Monitoring will work with this.


Update Feb 2013

The monitoring has been moved back to the RAWtoESD step to avoid reading a large database folder in both steps. Note, however, that the jobOptions are still called in both steps, so they still need to cater for both (see the sketch below).
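
A minimal sketch of one way a jobOptions file can distinguish the two steps, assuming the standard RecExConfig flags are available (the structure is illustrative, not the actual TrigT1CaloMonitoring jobOptions):

from RecExConfig.RecFlags import rec

if rec.readRDO():
    # RAWtoESD step: bytestream/RDO input, schedule the monitoring here
    pass
else:
    # ESDtoAOD step: since Feb 2013 nothing is scheduled here, but the
    # file must still be safe to include in this step
    pass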

Latest suggested test job:

Reco_trf.py inputBSFile=rawDataFilename --ignoreerrors=True conditionsTag='COMCOND-ES1PA-006-05' \
autoConfiguration='everything' maxEvents=200 outputESDFile=myESD.pool.root \
--omitvalidation=ALL --test outputHISTFile=myHIST.root

You can find out the current version and job being run on Tier0 by looking at the DQ web pages for Tier0 monitoring. Clicking on the tag next to the run number gives various information, including the ATLAS release used. To get the actual job parameters use GetCommand.py:

GetCommand.py AMI=x250

where x250 is the first part of the tag on the DQ page. You may need to do:

voms-proxy-init -voms atlas

first to access AMI.

Before requesting a tag for Tier0 you should test with the latest cache or nightly and run these three jobs:

Reco_trf.py AMI=q120
Reco_trf.py AMI=q121
Reco_trf.py AMI=q122

If you are running these jobs in an environment that can't access AMI, use GetCommand.py to get the job parameters you need. Check the outputs carefully, particularly for the RAWtoESD step.
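
For example, a quick scan of a step's log for problems (the log file name here is hypothetical; substitute whatever name your transform version actually writes):

grep -iE 'ERROR|FATAL' log.RAWtoESD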


Testing Online-specific Code

The tools which contain online-specific code have a property OnlineTest which, if set to true, makes the tool run as if it were online even when running offline. (Exception: PPrStabilityMon.)
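
For example, in old-style jobOptions the property can be switched on for a public tool via ToolSvc (a sketch; CPMSimBSMon is just an example instance name for one of the online-specific tools):

from AthenaCommon.AppMgr import ToolSvc
# assumes the tool has already been configured with this instance name
ToolSvc.CPMSimBSMon.OnlineTest = True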


Monitoring CPU Time

For Tier0 monitoring it is important to keep CPU and memory usage as low as possible. To help with this, an alternative set of jobOptions is provided which runs every L1Calo monitoring tool in a separate manager, so that the CPU usage of each tool is given at the end of the Reco_trf.py job log. See TrigT1CaloMonitoring_forRecExCommission_cpu.py (and TrigT1Monitoring_forRecExCommission_cpu.py for TrigT1Monitoring). A sketch of the pattern follows.
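
The idea is roughly one AthenaMonManager per tool, so that each manager's execute time reported in the log corresponds to a single tool. A minimal sketch, assuming old-style configuration and an already-configured tool instance myTool (illustrative name):

from AthenaCommon.AlgSequence import AlgSequence
from AthenaMonitoring.AthenaMonitoringConf import AthenaMonManager

topSequence = AlgSequence()
# wrap a single tool in its own manager; the manager's execute time
# then measures just this tool
topSequence += AthenaMonManager("L1CaloMonManager2")
topSequence.L1CaloMonManager2.AthenaMonTools += [ myTool ]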

The following table shows the CPU usage of each tool as a percentage of the total L1Calo CPU time. The express stream runs all tools, so it gives times for all of them. The overall column estimates the contribution of each tool across all streams (ES1 and BLK), taking into account the numbers of events and which streams each tool runs in. Run 215643 and release 17.7.0.2, together with TrigT1CaloByteStream-00-08-17, TrigT1CaloMonitoring-00-14-06, TrigT1CaloMonitoringTools-00-02-01, TrigT1Monitoring-00-05-00 and TrigT1CaloCalibTools-00-05-14, were used for this.

Manager               Tool(s)                          % CPU     % CPU
                                                       express   overall
L1CaloMonManager0A1   Bytestream Unpacking PPM (1)        6.7      12.0
L1CaloMonManager0A2   Bytestream Unpacking CPM (1)        1.2       2.2
L1CaloMonManager0A3   Bytestream Unpacking JEM (1)        1.2       2.2
L1CaloMonManager0A4   Bytestream Unpacking ROD (1)        0.2       0.4
L1CaloMonManager0B    L1CaloMonitoringCaloTool (2)       31.0      18.5
L1CaloMonManager1A    PPrStabilityMon /FineTime           2.6       0.1
L1CaloMonManager1B    PPrStabilityMon /Pedestal           5.2       0.3
L1CaloMonManager1C    PPrStabilityMon /EtCorrelation      0.8       0.0
L1CaloMonManager2     PPrMon                              1.6       2.8
L1CaloMonManager3     PPMSimBSMon                         3.5       6.3
L1CaloMonManager4     PPrSpareMon                         0.3       0.5
L1CaloMonManager5     JEMMon                              0.5       0.9
L1CaloMonManager6     CMMMon                              0.2       0.3
L1CaloMonManager7     JEPSimBSMon                        14.4      25.8
L1CaloMonManager8     TrigT1CaloCpmMonTool                0.6       1.1
L1CaloMonManager9     CPMSimBSMon                         4.0       7.2
L1CaloMonManagerA     TrigT1CaloRodMonTool                0.2       0.3
L1CaloMonManagerB     TrigT1CaloGlobalMonTool             0.1       0.3
L1CaloMonManagerC     EmEfficienciesMonTool               5.4       5.3
L1CaloMonManagerD     JetEfficienciesMonTool              3.6       3.0
L1MonManager0A (3)    CalorimeterL1CaloMon               15.1       9.0
L1MonManager0B (3)    L1CaloHVScalesMon (4)               1.0       0.6
L1MonManager0C (3)    L1CaloPMTScoresMon (4)              0.1       0.1
L1MonManager1 (3)     L1CaloCTPMon                        0.4       0.7
L1MonManager2 (3)     L1CaloLevel2Mon                     0.1       0.2

(1) Needs to run before any other algorithms that may read our data, e.g. RoIBResultToAOD.
(2) This tool forms CaloCell Et sums and quality per TriggerTower for use by other tools.
(3) TrigT1Monitoring.
(4) Runs on the first event of each job only.

To get the CPU times from the job log, do:

grep 'L1' job.log | grep 'MonManager' | grep 'execute' | sort > cpu.log

The numbers in the table were generated with this program:

#include <iostream>
#include <iomanip>

int main()
{
  const int ntools = 25;
  // relative cpu times for each tool in the express stream (from job log)
  float timesE[] = {1.77,0.324,0.318,0.065,8.16,0.673,1.36,0.222,0.420,
                    0.923,0.076,0.128,0.041,3.8,0.161,1.06,0.048,0.039,
                    1.41,0.943,3.98,0.253,0.026,0.105,0.03};
  // flag which tools run in each stream (as in the jobOptions)
  int express[] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
  int jetet[]   = {1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1};
  int egamma[]  = {1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,1};
  int muons[]   = {1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1};
  int other[]   = {1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1};
  // relative number of events per stream
  // (from plots L1Calo/Overview/l1calo_1d_NumberOfEvents)
  float events[] = {0.192, 2.11, 1.53, 1.52, 1.53};
  float timesO[ntools];
  float totalE = 0;
  float totalO = 0;
  for (int i = 0; i < ntools; ++i) {
    totalE += timesE[i];
    // weight each tool's express-stream time by the streams it runs in
    // and the relative event counts to estimate its overall contribution
    timesO[i] = express[i]*timesE[i]*events[0] + jetet[i]*timesE[i]*events[1] +
                egamma[i]*timesE[i]*events[2] + muons[i]*timesE[i]*events[3] +
                other[i]*timesE[i]*events[4];
    totalO += timesO[i];
  }
  float percE, percO;
  std::cout << "Express Overall" << std::endl;
  for (int i = 0; i < ntools; ++i) {
    percE = 100*timesE[i]/totalE;
    percO = 100*timesO[i]/totalO;
    std::cout << std::setiosflags(std::ios::fixed | std::ios::showpoint)
              << std::setprecision(1)
              << std::setw(6) << percE
              << std::setw(9) << percO << std::endl;
  }
  return 0;
}

Times are for one input file (683 events).
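
The program is plain standard C++, so it can be built and run with any recent compiler, for example (the file name is arbitrary):

g++ -o cpupercent cpupercent.cxx
./cpupercent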
