ATLAS Offline Software
yodamerge_tmp.py
# Copyright (C) 2002-2022 CERN for the benefit of the ATLAS collaboration

"""\
%(prog)s [-o outfile] <yodafile1>[:<scale1>] <yodafile2>[:<scale2>] ...
 e.g. %(prog)s run1.yoda run2.yoda run3.yoda (unweighted merging of three runs)
 or %(prog)s run1.yoda:2.0 run2.yoda:3.142 (weighted merging of two runs)

Merge analysis objects from multiple YODA files, combining the statistics of
objects whose names are found in multiple files. May be used either to merge
disjoint collections of data objects, or to combine multiple statistically
independent runs of the same data objects into one high-statistics run. Optional
scaling parameters may be given to rescale the weights of the objects on a
per-file basis before merging.

By default the output is written to stdout since we can't guess what would be
a good automatic filename choice! Use the -o option to provide an output filename.


IMPORTANT!
 This script is not meant to handle all run merging situations or data objects:
 there are limitations to what can be inferred from data objects alone. If you
 need to do something more complex than the common cases handled by this script,
 please write your own script / program to load and process the data objects.


SCATTERS (E.G. HISTOGRAM RATIOS) CAN'T BE MERGED

 Note that 'scatter' data objects, as opposed to histograms, cannot be merged
 by this tool since they do not preserve sufficient statistical
 information. The canonical example of this is a ratio plot: there are
 infinitely many combinations of numerator and denominator which could give the
 same ratio, and the result does not indicate anything about which of those
 infinite inputs is right (or the effects of correlations in the division).

 If you need to merge Scatter2D objects, you can write your own Python script
 or C++ program using the YODA interface, and apply whatever case-specific
 treatment is appropriate. By default the first such copy encountered will be
 returned as the 'merged' output, with no actual merging having been done.

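 As a minimal sketch (not part of this script): assuming two statistically
 equal-size runs whose Scatter2D values may be naively averaged, and using the
 same YODA Python calls as this script (yoda.read, point(i).val/setVal,
 yoda.writeYODA), with hypothetical file names:

   import yoda
   aos1 = yoda.read("run1.yoda")
   aos2 = yoda.read("run2.yoda")
   for path, s1 in aos1.items():
       s2 = aos2[path]
       for i in range(len(s1.points())):
           # naive mean of the y values (axis index 2) of the two runs
           s1.point(i).setVal(2, 0.5 * (s1.point(i).val(2) + s2.point(i).val(2)))
   yoda.writeYODA(aos1, "merged.yoda")
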
NORMALIZED, UNNORMALIZED, OR A MIX?

 An important detail in histogram merging is whether a statistical treatment
 for normalized or unnormalized histograms should be used: in the former case
 the normalization scaling must be undone *before* the histograms are added
 together, and then re-applied afterwards. This script examines the ScaledBy
 attribute of each histogram to determine whether it has been normalized. We
 assume that if ScaledBy exists (i.e. h.scaleW has been called) then the
 histogram is normalized, and we normalize the resulting merged histogram to
 the weighted average of the input norms; if there is no ScaledBy, we assume
 that the histogram is not normalized.

 This is not an infallible approach, but we believe it is more robust than
 heuristics to determine whether norms are sufficiently close to be considered
 equal. In complicated situations you will again be better off writing your own
 script or program to do the merging: the merging machinery of this script is
 available directly in the yoda Python module.
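
 For example, the treatment applied below to two normalized histograms reduces
 to the following sketch (same YODA calls as this script; h1 and h2 are
 hypothetical Histo1Ds that each carry a ScaledBy annotation):

   s1 = float(h1.annotation("ScaledBy"))
   s2 = float(h2.annotation("ScaledBy"))
   h1.scaleW(1.0 / s1)   # undo each normalization, recovering the raw sumW
   h2.scaleW(1.0 / s2)
   hout = h1.clone()
   hout += h2            # combine the raw statistics
   hout.scaleW(1.0 / (1.0/s1 + 1.0/s2))  # renormalize to the weighted average norm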

MORE NOTES

 If all the input histograms with a particular path are found to have the same
 normalization, and they have ScaledBy attributes indicating that a histogram
 weight scaling has been applied in producing the input histograms, each
 histogram in that group will first be unscaled by its appropriate factor, then
 merged, and then re-normalized to the target value. Otherwise the weights from
 each histogram copy will be directly added together with no attempt to guess an
 appropriate normalization. The normalization guesses (and they are guesses --
 see below) are made *before* application of the per-file scaling arguments.

 IMPORTANT: note from the above that this script can't work out what to do
 re. scaling and normalization of output histograms from the input data files
 alone. It may be possible (although unlikely) that input histograms have the
 same normalization but are meant to be added directly. It may also be the case
 (and much more likely) that histograms which should be normalized to a common
 value will not trigger the appropriate treatment due to e.g. statistical
 fluctuations in each run's calculation of a cross-section used in the
 normalization. And anything more complex than a global scaling (e.g. calculation
 of a ratio or asymmetry) cannot be handled at all with a post-hoc scaling
 treatment. The --assume-normalized command line option, which used to force all
 histograms to be treated as normalized in the input, is deprecated and now has
 no effect.

 Please use this script as a template if you need to do something more specific.

 NOTE: there are many possible desired behaviours when merging runs, depending on
 the factors above as well as whether the files being merged are of homogeneous
 type, heterogeneous type, or a combination of both. It is tempting, therefore,
 to add a large number of optional command-line parameters to this script, to
 handle these cases. Experience from Rivet 1.x suggests that this is a bad idea:
 if a problem is of programmatic complexity then a command-line interface which
 attempts to solve it in general is doomed to both failure and unusability. Hence
 we will NOT add extra arguments for applying different merging weights or
 strategies based on analysis object path regexes, auto-identifying 'types' of
 run, etc., etc.: if you need to merge data files in such complex ways, please
 use this script as a template around which to write logic that satisfies your
 particular requirements.
"""

import yoda
import argparse
import sys
import math

parser = argparse.ArgumentParser(usage=__doc__)
parser.add_argument("INFILES", nargs="+", help="datafile1 datafile2 [...]")
parser.add_argument("-o", "--output", default="-", dest="OUTPUT_FILE", metavar="PATH",
                    help="write output to specified path")
parser.add_argument("--s1d-mode", "--s1dmode", default="assume_mean", dest="S1D_MODE", metavar="MODE",
                    help="choose strategy for combining Scatter1D objects: one of 'first', 'combine', 'assume_mean', 'add'")
parser.add_argument("--s2d-mode", "--s2dmode", default="assume_mean", dest="S2D_MODE", metavar="MODE",
                    help="choose strategy for combining Scatter2D objects: one of 'first', 'combine', 'assume_mean', 'add'")
parser.add_argument("--s3d-mode", "--s3dmode", default="assume_mean", dest="S3D_MODE", metavar="MODE",
                    help="choose strategy for combining Scatter3D objects: one of 'first', 'combine', 'assume_mean', 'add'")
parser.add_argument("--type-mismatch-mode", default="scatter", dest="TYPE_MISMATCH_MODE", metavar="MODE",
                    help="choose strategy for combining objects whose types mismatch: one of 'first', 'scatter'")
parser.add_argument("--add", "--stack", action="store_true", default=False, dest="STACK",
                    help="force simple stacking (also forces all scatter modes to 'add')")
parser.add_argument("--no-veto-empty", action="store_false", default=True, dest="VETO_EMPTY",
                    help="disable the removal of empty (sumW=0) data objects _before_ applying merge heuristics. You probably want the default!")
parser.add_argument("--assume-normalized", action="store_true", default=False, dest="ASSUME_NORMALIZED",
                    help="DEPRECATED, AND DOES NOTHING. This option _used_ to bypass the detection heuristic for unnormalized histograms")
parser.add_argument("-m", "--match", action="append", dest="PATHPATTERNS", default=[],
                    help="only write out histograms whose $path/$name string matches any of these regexes")
parser.add_argument("-M", "--unmatch", action="append", dest="PATHUNPATTERNS", default=[],
                    help="exclude histograms whose $path/$name string matches any of these regexes")
args = parser.parse_args()
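
# Example invocations, using only the options defined above (the file names
# are hypothetical):
#   yodamerge_tmp.py -o merged.yoda run1.yoda run2.yoda
#   yodamerge_tmp.py --s2d-mode=combine -m "/MC_JETS/.*" run1.yoda:2.0 run2.yoda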

# Include scatters in "add" mode
if args.STACK:
    args.S1D_MODE = "add"
    args.S2D_MODE = "add"
    args.S3D_MODE = "add"

# Put the incoming objects into a dict from each path to a list of histos and scalings
analysisobjects_in = {}
for fa in args.INFILES:
    filename, scale = fa, 1.0
    if ":" in fa:
        try:
            filename, scale = fa.rsplit(":", 1)
            scale = float(scale)
        except Exception:
            sys.stderr.write("Error processing arg '%s' with file:scale format\n" % fa)
            # fall back to treating the whole argument as a filename with unit weight
            filename, scale = fa, 1.0
    aos = yoda.read(filename, patterns=args.PATHPATTERNS, unpatterns=args.PATHUNPATTERNS)
    for aopath, ao in aos.items():
        ao.setAnnotation("yodamerge_scale", scale)
        analysisobjects_in.setdefault(aopath, []).append(ao)
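
# At this point analysisobjects_in maps each AO path to its list of per-file
# copies, e.g. "/MC_JETS/jet_pT" -> [Histo1D, Histo1D, ...] (the path shown is
# hypothetical), each copy carrying its per-file weight in the
# "yodamerge_scale" annotation.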

analysisobjects_out = {}
for p, aos in analysisobjects_in.items():

    # Identify the canonical aotype being handled from the type of the first entry in aos
    aotype = type(aos[0])

    # Check that types match, and just output the first one if they don't
    if not all(type(ao) is aotype for ao in aos):
        msg = "WARNING: cannot merge mismatched analysis object types for path %s: " % p
        scatter_fail = False
        if args.TYPE_MISMATCH_MODE == "scatter":
            saos = []
            for ao in aos:
                sao = ao.mkScatter()
                sao.setAnnotation("yodamerge_scale", ao.annotation("yodamerge_scale"))
                saos.append(sao)
            saotype = type(saos[0])
            msg += "converting to %ss" % saotype.__name__
            if all(type(sao) is saotype for sao in saos):
                sys.stderr.write(msg + "\n")
                aos = saos
                aotype = saotype
            else:
                msg += "... failed, "
                scatter_fail = True
        if args.TYPE_MISMATCH_MODE == "first" or scatter_fail:
            sys.stderr.write(msg + "returning first object\n")
            analysisobjects_out[p] = aos[0]
            continue

    # Remove empty fillable data objects, to avoid gotchas where e.g. histos are normalized and hence
    # ScaledBy should be set... but isn't because the emptiness blocked rescaling to finite area
    if args.VETO_EMPTY:
        # TODO: Add a Fillable interface/ABC and use that for the type matching
        if aotype in (yoda.Counter, yoda.Histo1D, yoda.Histo2D, yoda.Profile1D, yoda.Profile2D):
            aos_nonzero = [ao for ao in aos if ao.sumW() != 0]  # < possible that this doesn't mean no fills :-/
            # Just output the first histo if they are all empty
            if not aos_nonzero:
                analysisobjects_out[p] = aos[0]
                continue
            # Reset aos to only contain non-empty ones
            aos = aos_nonzero

    # Counter, Histo and Profile (i.e. Fillable) merging
    # TODO: Add a Fillable interface/ABC and use that for the type matching
    if aotype in (yoda.Counter, yoda.Histo1D, yoda.Histo2D, yoda.Profile1D, yoda.Profile2D):

        # Identify a target rescaling factor from the 1/scalefactor-weighted norms of each run
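        # (Derivation, for the record: each normalized input satisfies
        # final = raw * ScaledBy, so its raw sumW is norm_i / ScaledBy_i. After
        # unscaling and applying the user weights w_i, the summed raw sumW is
        # sum_i w_i*norm_i/ScaledBy_i, and multiplying by
        # rescale = 1 / sum_i (w_i/ScaledBy_i) yields a weighted average of the
        # input norms -- equal to the common norm when they all agree.)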
        rescale = None
        if len(aos) > 1 and args.STACK:
            pass  # we're in dumb stacking mode
        elif all("ScaledBy" in ao.annotations() for ao in aos):
            try:
                rescale = 1.0 / sum(float(ao.annotation("yodamerge_scale")) / float(ao.annotation("ScaledBy")) for ao in aos)
            except ZeroDivisionError:
                sys.stderr.write("WARNING: Abandoning normalized merge of path %s because ScaledBy attributes are zero.\n" % p)
        elif all("ScaledBy" not in ao.annotations() for ao in aos):
            pass
        else:
            sys.stderr.write("WARNING: Abandoning normalized merge of path %s because some but not all inputs have ScaledBy attributes\n" % p)

        # Now that the normalization-identifying heuristic is done, apply user scalings and undo the normalization scaling if appropriate
        for ao in aos:
            if rescale:
                ao.scaleW(1.0 / float(ao.annotation("ScaledBy")))
            ao.scaleW(float(ao.annotation("yodamerge_scale")))

        # Make a copy of the (scaled & unnormalized) first object as the basis for the output
        # and merge for histograms (including weights, normalization, and user scaling)
        ao_out = aos[0].clone()
        ao_out.rmAnnotation("yodamerge_scale")
        for ao in aos[1:]:
            ao_out += ao
        if rescale:
            ao_out.scaleW(rescale)

    # Merge for Scatters, assuming equal run sizes, and applying user scaling
    else:

        # Make a copy of the first object as the basis for merging (suitable for all Scatter types)
        ao_out = aos[0].clone()
        ao_out.rmAnnotation("yodamerge_scale")
        # If there's only one object, there's no need to do any combining
        if len(aos) == 1:
            pass

        elif aotype in (yoda.Scatter1D, yoda.Scatter2D, yoda.Scatter3D):

            # Retrieve dimensionality of the Scatter*D object
            dim = ao_out.dim()
            SND_MODE = getattr(args, "S%dD_MODE" % dim)
            axis = ['', 'x', 'y', 'z']

            # Use asymptotic mean+stderr convergence statistics
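            # (For n equal-size runs with unit weights, scalesum == n below, so
            # "assume_mean" gives val = sum(val_i)/n with error
            # sqrt(sum(err_i^2))/n, i.e. the error on a mean of asymptotically
            # Gaussian estimates; "add" skips the division and simply stacks
            # the values, with errors combined in quadrature.)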
            if SND_MODE in ("assume_mean", "add"):

                msg = "WARNING: Scatter%dD %s merge assumes asymptotic statistics and equal run sizes" % (dim, p)
                if any(float(ao.annotation("yodamerge_scale")) != 1.0 for ao in aos):
                    msg += " (+ user scaling)"
                sys.stderr.write(msg + "\n")
                npoints = len(ao_out.points())
                for i in range(npoints):
                    val_i = scalesum = 0.0
                    ep_i = {}  # will hold the values of the multiple error sources
                    em_i = {}  # will hold the values of the multiple error sources
                    for ao in aos:
                        scale = float(ao.annotation("yodamerge_scale"))
                        variations = ao.variations()
                        scalesum += scale
                        val_i += scale * ao.point(i).val(dim)
                        for var in variations:
                            if var not in ep_i.keys():
                                ep_i[var] = 0.
                                em_i[var] = 0.
                            ep_i[var] += (scale * ao.point(i).errs(dim, var)[0]) ** 2
                            em_i[var] += (scale * ao.point(i).errs(dim, var)[1]) ** 2
                    for var in ep_i.keys():
                        ep_i[var] = math.sqrt(ep_i[var])
                        em_i[var] = math.sqrt(em_i[var])
                    if SND_MODE == "assume_mean":
                        val_i /= scalesum
                        for var in ep_i.keys():
                            ep_i[var] /= scalesum
                            em_i[var] /= scalesum
                    ao_out.point(i).setVal(dim, val_i)
                    for var in ep_i.keys():
                        # setattr(ao_out.point(i), 'set%sErrs' % axis[dim].upper(), ((ep_i[var], em_i[var]), var))
                        ao_out.point(i).setErrs(dim, (ep_i[var], em_i[var]), var)

            # Add more points to the output scatter
            elif SND_MODE == "combine":
                for ao in aos[1:]:
                    ao_out.combineWith(ao)

            # Just return the first AO unmodified & unmerged
            elif SND_MODE == "first":
                pass

            else:
                raise Exception("Unknown Scatter%dD merging mode: %s" % (dim, SND_MODE))

        # Other data types (just warn, and write out the first object)
        else:
            sys.stderr.write("WARNING: Analysis object %s of type %s cannot be merged\n" % (p, str(aotype)))

    # Put the output AO into the output dict
    analysisobjects_out[p] = ao_out

# Write output
yoda.writeYODA(analysisobjects_out, args.OUTPUT_FILE)