Search for Heavy Resonances in the H-tagged Dijet Mass Spectrum in pp Collisions at 8 TeV

Purpose of This Page

This is Angelo's log book of work in the Exotica Analysis "Search for Heavy Resonances in the H-tagged Dijet Mass Spectrum in pp Collisions at 8 TeV". This log was started on Oct 28th, 2013. It is meant to be a summary of what has been written in CMS Analysis Notes 13-152 and 213/347, as well as a report of Angelo's related activities.

Introduction

Searches for signals of new physics at the LHC aim to observe high-mass particles, generally heavier than 1 TeV. The decay products of such resonances are therefore Lorentz-boosted particles, characterized by very close trajectories. Many searches for massive particles have been developed in CMS, for example the analysis of the decay channel pp → X → (H → bb)(H → bb), in which X can be, for instance, a Radion or a Graviton, and decays into two Higgs bosons.

Instead of producing two jets, the hadronization of the bb pair produces a single jet, called a "fat" jet, as a result of the Lorentz boost. This topology requires the development of new data analysis techniques that make it possible to identify the number of b quarks embedded in those "fat" jets, as well as to obtain the invariant mass of the jet. Applying these techniques would make it possible to reconstruct the scenario of H bosons decaying into four jets originating from b quarks. In addition, jet substructure techniques are also employed.

b quarks are present in many channels of interest to physics. Since, in the Standard Model, the probability of the decay H → bb is about 57% for a Higgs boson with a mass of 125 GeV/c², the recent discovery of this boson depended basically on an efficient identification of those quarks. Because of the relatively long lifetime of hadrons containing b quarks, jets resulting from the hadronization of these quarks have a different topology from jets produced by light quarks. Those hadrons can travel a few millimeters from the proton-proton collision region before decaying. With the signature provided by the decay products, it is possible to reconstruct secondary vertices whose characteristics allow b quarks to be identified efficiently.

As a member of the CMS ExoDiBoson Group, Angelo Santos has been taking part in this search for massive particles predicted by theories beyond the Standard Model, carrying out studies that use b-quark identification and jet substructure techniques.

Proposals of Work

Maxime's e-mail on 10/10/2013:
- Tijs has added the sub-jet b-tagging and found a solution to estimate it. Your work would be to propagate this estimation to the limits, do some closure tests and add the ttbar contamination that Thiago has estimated.
- We think that if you dedicate your time till Xmas it would be essentially what we need. In addition you would learn things that would be definitely useful for you for tautaubb analysis.
Maurizio's summary of meeting on 10/14/2013:
- The reduced ntuples + Tijs' code can be used (no need to reprocess data or patuples). Angelo will work on it. We count on Thiago's help to follow him. Maurizio and Maxime will also help Angelo (when problems occur). Angelo is reading the report by Tijs, then he will start the analysis. We discussed these tasks for Angelo (+Thiago):
  - redo what tijs did
  - reoptimize n-subjettiness cut
  - implement two different background estimates (Tijs' and Caterina's)
- We also have
  - implement scale factor from jet flavor -> Badder+Angelo
  - implement limit code -> Andreas+Angelo
- The documentation will be rewritten (AN+PAS) to reflect these facts (Maurizio)
  - we will define a loose Higgs tag: jet mass cut + tau21 + 1 loose btag
  - we will define a tight Higgs tag: jet mass cut + tau21 + 2 loose btag
- These definitions will be shared by VH, HH, and potentially other analyses looking at boosted H->bb. We need to discuss the ZH destiny too. There will be a separate email exchange between Petar, Maurizio and Thiago on that. We basically have two options:
  - Thiago keep working on that, JHU joins once VV is in CWR
  - Thiago focuses on HH hadronic events with 2 btag (eg H(bb)H(bb) with 2 loose Higgses, or H(bb)H(gg/cc) or H(bb)H(WW*/ZZ*)) and JHU takes care of ZH once the VV paper is in CWR.

Analysis Note Summary

Abstract

This is a search for massive resonances decaying into a pair of Higgs bosons, each reconstructed in hadronic final states. The search is optimized for large resonance masses, for which the Higgs decay products merge into one massive jet. The QCD background is suppressed using jet substructure techniques. The data sample corresponds to an integrated luminosity of 19.6/fb of proton-proton collisions collected by the CMS experiment at the LHC in 2012 at a center-of-mass energy of 8 TeV.

Introduction

This analysis searches for new particles predicted by physics scenarios beyond the Standard Model. These particles could be either a spin-0 Radion or an excited state of the Graviton (spin 2). The X particle (Radion or Graviton) decays to two Higgs bosons, each of which decays to a b quark and a b antiquark: the X → HH → 4b channel. For these heavy X particles (decaying nearly at rest in the lab frame), the Higgs bosons appear back-to-back and highly boosted because of their very high momentum. The final state therefore has only two merged fat jets (a dijet state) instead of 4 separate jets, because the b-bbar pair from each Higgs boson is merged into a single jet.

The branching fraction of X → HH will be around 25%. H → b-bbar is the preferred decay since b is the most massive quark with a mass below one half of the Higgs mass. A collision happens only between constituent quarks or gluons, each of which carries only a fraction of the total energy of the proton. Since this analysis uses data at sqrt(s) = 8 TeV, a reasonable effective energy is 3 TeV. The energy spectrum of this analysis therefore ranges up to 3 TeV.

Event Selection

First studies were performed using only the QCD background, with events generated by MadGraph5 interfaced with Pythia6 for showering and hadronization. These events account only for QCD interactions, without electroweak bosons or top quarks. Since they do not lead to a very precise background estimation, the background is estimated with a data-driven technique.

The event selection is as follows.

Jets are reconstructed using the Cambridge/Aachen algorithm with a cone size (distance parameter) of 0.8.
The trigger is a logical OR of three paths (FAT750 OR HT650 OR HT750), found to be 99% efficient for dijet masses > 890 GeV.
Each event must have at least two jets with pt > 30 GeV and |η| < 2.4 for an accurate reconstruction, and with a pruned mass close to the Higgs mass.
An optimization of the Higgs mass window found the best mass range to be between 110 and 135 GeV.
Δη = |η(jet1) - η(jet2)| < 1.3 is imposed on all events because the number of QCD background events grows with increasing Δη, which is not the case for the modeled heavy resonances.
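The cuts listed above can be sketched as a single predicate. This is only an illustration: the `Jet` struct and its field names are hypothetical (they are not the ntuple branch names), the two leading jets are assumed to be the first two entries, and using the 890 GeV trigger plateau as an explicit dijet-mass cut is an assumption.

```cpp
#include <cmath>
#include <vector>

// Hypothetical jet record; field names are illustrative only.
struct Jet {
    double pt;          // transverse momentum [GeV]
    double eta;         // pseudorapidity
    double prunedMass;  // pruned jet mass [GeV]
};

// True if one jet passes the kinematic cuts and the Higgs mass window (110-135 GeV).
bool passesJetCuts(const Jet& j) {
    return j.pt > 30.0 && std::fabs(j.eta) < 2.4 &&
           j.prunedMass > 110.0 && j.prunedMass < 135.0;
}

// True if the event passes the dijet selection. Assumes jets[0] and jets[1]
// are the two leading jets and that dijetMass was computed from them.
bool passesEventSelection(const std::vector<Jet>& jets, double dijetMass) {
    if (jets.size() < 2) return false;
    const Jet& j1 = jets[0];
    const Jet& j2 = jets[1];
    if (!passesJetCuts(j1) || !passesJetCuts(j2)) return false;
    if (std::fabs(j1.eta - j2.eta) >= 1.3) return false;  // |Δη| < 1.3
    return dijetMass > 890.0;  // assumed cut at the trigger plateau
}
```

Keeping the per-jet cuts in their own function makes it easy to relax the mass window on one jet only, as the sideband definitions later in this page require.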

N-subjettiness

A possible discriminator between background and signal events is the N-subjettiness ratio τ21 = τ2/τ1. The smaller τ21 is, the closer the jet is to a dipole (rather than monopole) structure, as is explained here.
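For reference, the commonly used definition of N-subjettiness is sketched below. Whether this analysis uses exactly this form (the choice of candidate subjet axes, and the normalization radius R_0, taken here to be the jet size 0.8) is an assumption.

```latex
\tau_N = \frac{1}{d_0} \sum_{k} p_{T,k}\,
         \min\left\{ \Delta R_{1,k},\, \Delta R_{2,k},\, \ldots,\, \Delta R_{N,k} \right\},
\qquad
d_0 = \sum_{k} p_{T,k}\, R_0 ,
\qquad
\tau_{21} = \frac{\tau_2}{\tau_1}
```

Here k runs over the jet constituents and ΔR_{J,k} is the distance between constituent k and candidate subjet axis J. A jet whose constituents all lie close to two axes has small τ2 relative to τ1, hence small τ21.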

B-tagging

B-tagging is a method used to identify jets originating from b-(anti)quarks, and is based on the lifetime of hadrons containing b-quarks. These hadrons frequently live long enough to travel a measurable distance in the detector, producing a secondary vertex of charged tracks within the jet. This vertex is reconstructed using the adaptive vertex fitter in a cone of ΔR = 0.3 around the primary vertex. The secondary vertex is rejected if it is either too similar to the primary vertex or too far from it. The Combined Secondary Vertex (CSV) algorithm combines secondary-vertex and lifetime information to construct a probability discriminator (between 0 and 1) that distinguishes b-quark jets from other jets, with two "working points":

loose: CSV discriminant > 0.244;
medium: CSV discriminant > 0.679.

Two baseline b-tagging approaches have been investigated:

Fat Jet CSV: ignore the substructure and try to run the CSV algorithm on the jet as a whole;
Sub-jet CSV: try to tag the two subjets.

Background Estimation

The background is estimated using a data-driven technique called the ABCD method. A sideband is defined by applying a different jet mass window to the second jet (the first jet remains in the 110 - 135 GeV window). The background for n b-tags is then estimated from the spectrum with n - 1 b-tags. With the mass and b-tag windows defined as in the table below, the background in signal region D can be estimated as D = (C/A)·B, that is, the ratio B/A of n to n - 1 b-tags measured in the mass sideband is applied to C.

Mass window	n - 1 b-tags	n b-tags
70 - 110 GeV	A	B
110 - 135 GeV	C	D
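Reading the table with rows as mass windows and columns as b-tag multiplicities, the sideband ratio B/A is transferred to the signal mass window, so D ≈ B·C/A when the two variables are uncorrelated. A minimal sketch of this estimate follows; the `Estimate` struct and the function name are hypothetical, and the uncertainty assumes independent Poisson counts in A, B and C.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical result type for the sketch.
struct Estimate {
    double value;  // predicted background yield in region D
    double error;  // statistical uncertainty propagated from A, B, C
};

// ABCD prediction D = B * C / A.
// A: mass sideband, n-1 b-tags   B: mass sideband, n b-tags
// C: signal window, n-1 b-tags   D: signal window, n b-tags (predicted)
Estimate abcdEstimate(double nA, double nB, double nC) {
    if (nA <= 0.0 || nB < 0.0 || nC < 0.0) {
        throw std::invalid_argument("A must be > 0 and B, C must be >= 0");
    }
    const double d = nB * nC / nA;
    // Relative Poisson errors (1/sqrt(N)) add in quadrature for a product/ratio.
    double rel2 = 1.0 / nA;
    if (nB > 0.0) rel2 += 1.0 / nB;
    if (nC > 0.0) rel2 += 1.0 / nC;
    return {d, d * std::sqrt(rel2)};
}
```

For example, A = 100, B = 20 and C = 50 give a prediction of 10 events with a statistical uncertainty of about 2.8. The sketch ignores any correlation between jet mass and b-tagging and any contamination from backgrounds with real b quarks.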

Two assumptions were considered to use this method:

pruned jet mass and b-tags are assumed to be uncorrelated: this does not hold, because an anti-correlation between them is visible;
no contamination from backgrounds with real b-quarks (like t-tbar and VV) is assumed: however, t-tbar is non-negligible in regions A, C and D, while VV is non-negligible in region A.

Subjet b-Tagging in Boosted Higgs

Double subjet b-tagging with CSV shows better discrimination than fat-jet b-tagging. Only at very high pt, fat-jet b-tagging and subjet b-tagging are equally good. Therefore, different b-tagging cuts are chosen to be implemented on the four subjets, rather than on fat-jets:

from 1 to 4 subjets passing a loose b-tagging;
from 1 to 4 subjets passing a medium b-tagging.

This table summarizes all categories.

Categories
≥ 1 loose
≥ 2 loose
≥ 3 loose
exactly 3 loose
4 loose
≥ 1 medium
≥ 2 medium
≥ 3 medium
exactly 3 medium
4 medium
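The categories above amount to counting how many of the four subjets exceed a CSV working point. A minimal sketch, assuming the four subjet CSV values are already available in an array (the function and constant names are hypothetical):

```cpp
#include <array>

// CSV working points quoted in the B-tagging section.
constexpr double kCsvLoose = 0.244;
constexpr double kCsvMedium = 0.679;

// Number of subjets (two per fat jet, four per event) above a CSV threshold.
int countTaggedSubjets(const std::array<double, 4>& csv, double threshold) {
    int n = 0;
    for (double value : csv) {
        if (value > threshold) ++n;
    }
    return n;
}

// "≥ n" style category, e.g. atLeast(nLoose, 3) for "≥ 3 loose".
bool atLeast(int nTagged, int n) { return nTagged >= n; }

// "exactly n" style category, e.g. exactly(nLoose, 3) for "exactly 3 loose".
bool exactly(int nTagged, int n) { return nTagged == n; }
```

An event with subjet CSV values 0.9, 0.3, 0.7 and 0.1 has three loose and two medium tags, so it enters "≥ 3 loose" and "exactly 3 loose" but only "≥ 2 medium".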

Studies with the Punzi significance showed that b-tagging strongly reduces the background while retaining most signal events as the requirement increases from 1 to 4 b-tags. For loose b-tags, the significance increases with the number of b-tags applied. For medium b-tags, requiring 4 tags reduces the signal so much that the selection becomes less efficient.
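The Punzi figure of merit is ε / (a/2 + √B), where ε is the signal efficiency, B the expected background and a the number of standard deviations targeted. A minimal sketch is given below; the value of a used in this analysis is not stated here, so it is left as a parameter, and the function name is hypothetical.

```cpp
#include <cmath>

// Punzi figure of merit: efficiency / (a/2 + sqrt(B)).
// signalEfficiency: fraction of signal events passing the selection
// nBackground:      expected number of background events after the selection
// nSigma:           target significance "a" in standard deviations (assumption)
double punziSignificance(double signalEfficiency, double nBackground, double nSigma) {
    return signalEfficiency / (nSigma / 2.0 + std::sqrt(nBackground));
}
```

With a = 2, a selection with ε = 0.5 and B = 100 scores about 0.045, while a tighter one with ε = 0.4 and B = 4 scores about 0.133. This is how tightening the b-tag requirement can pay off even though some signal is lost.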

N-subjettiness results

To test the effectiveness of N-subjettiness, a cut τ21 < 0.5 on both jets is applied in all b-tagging categories. Independently of the b-tagging cut, the N-subjettiness cut reduces the signal by a factor of 1.3 and QCD by roughly a factor of 3, so the two variables are considered uncorrelated.
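As a rough cross-check of what these factors mean, the simple figure of merit S/√B can be evaluated before and after the cut. This assumes a background large enough for S/√B to be meaningful, which may not hold in the high-mass tail.

```latex
\frac{S'}{\sqrt{B'}} = \frac{S/1.3}{\sqrt{B/3}}
                     = \frac{\sqrt{3}}{1.3}\,\frac{S}{\sqrt{B}}
                     \approx 1.33\,\frac{S}{\sqrt{B}}
```

In the background-dominated regime the cut would therefore help by about 33%. Where the background is already close to zero, only the signal loss (a factor 1/1.3 ≈ 0.77) remains.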

Cross Section Limit Comparison

The categories used for limit setting (via CLs techniques 1 and 2) are:

Exactly 3 loose subjet b-tags, exactly 4 loose subjet b-tags, and 3 or 4 subjet b-tags;
All the above categories with and without the N-subjettiness cut (both jets with τ21 < 0.5).

The analysis was found to be more sensitive without the N-subjettiness cut at higher resonance masses. In principle, subjet b-tagging increases sensitivity by a factor of 10.

Background Estimation Improvements

In order to put exact limits on the production cross section of a heavy resonance, several feasibility studies have been done to get more accurate background estimations.

N-Subjettiness and b-Tagging as Sideband

Using the fact that N-subjettiness and b-tagging are uncorrelated, a sideband is constructed with those variables. The signal region is then defined by:

Both jets with 100 < mass < 135 GeV;
Both jets with τ₂₁ < 0.5;
Two different categories: 3 and 4 subjet b-tags.

And for background estimation:

Jet1 is randomly chosen, has τ₂₁ < 0.5, and has 1 or 2 subjet b-tags (for 3 or 4 b-tags in total, respectively);
Jet2 is chosen as in the table below.

	Signal	B-tag sideband
Signal	2 subjet b-tags	0 subjet b-tags
Signal	τ₂₁ < 0.5	τ₂₁ < 0.5
τ₂₁ sideband 1	0.5 < τ₂₁ < 0.75	0.5 < τ₂₁ < 0.75
τ₂₁ sideband 2	τ₂₁ > 0.75	τ₂₁ > 0.75

page 13 (3.4.1 N-Subjettiness and b-Tagging as Sideband): What is the meaning of "Jet 1 randomly chosen"?

At least 3 background events need to remain in the sideband for the ABCD method to work. Since too much signal is lost (< 20%) and since the sensitivity at higher resonance masses is better without N-subjettiness, this is not considered a useful method.

Different Jets in Sideband

One way to improve the initial background estimation (with the ABCD method) is to vary the parameters on different jets, rather than using only one jet in the sideband. This estimation works well enough if the correlation between those parameters is sufficiently small.

In this sense, the signal region is defined as:

Both jets with 110 < mass < 135 GeV;
Two different categories: 3 and 4 subjet b-tags, both loose and medium.

Background regions are defined as:

Jet 1
- randomly chosen, and has 1 or 2 subjet b-tags (for 3 b-tags or 4 b-tags in total)
- with three mass windows
  - 70 < mass < 110 GeV (mass-sideband 1)
  - 135 < mass < 150 GeV (mass-sideband 2)
  - 110 < mass < 135 GeV (signal region)
Jet 2
- with 110 < mass < 135 GeV
- and three b-tagging categories
  - 0 subjet b-tags
  - 1 subjet b-tag
  - 2 subjet b-tags

See the summary in the table below.

	b-tags	Closure	Signal
Signal	110 < mass_jet1 < 135 GeV	110 < mass_jet1 < 135 GeV	110 < mass_jet1 < 135 GeV
Signal	0 subjet b-tags on jet 2	1 subjet b-tag on jet 2	2 subjet b-tags on jet 2
Low mass sideband	70 < mass_jet1 < 110 GeV	70 < mass_jet1 < 110 GeV	70 < mass_jet1 < 110 GeV
Low mass sideband	0 subjet b-tags on jet 2	1 subjet b-tag on jet 2	2 subjet b-tags on jet 2
High mass sideband	135 < mass_jet1 < 150 GeV	135 < mass_jet1 < 150 GeV	135 < mass_jet1 < 150 GeV
High mass sideband	0 subjet b-tags on jet 2	1 subjet b-tag on jet 2	2 subjet b-tags on jet 2
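The grid in the table can be sketched as two small lookups: the mass of jet 1 selects the row and the number of subjet b-tags on jet 2 selects the column. The function names and returned labels are hypothetical, and treating each mass window as half-open (lower edge included, upper edge excluded) is an assumption, since the boundaries are not specified.

```cpp
#include <string>

// Row of the grid, from the pruned mass of jet 1 [GeV].
// Windows are treated as half-open [low, high); this is an assumption.
std::string massRegion(double massJet1) {
    if (massJet1 >= 70.0 && massJet1 < 110.0) return "low mass sideband";
    if (massJet1 >= 110.0 && massJet1 < 135.0) return "signal";
    if (massJet1 >= 135.0 && massJet1 < 150.0) return "high mass sideband";
    return "outside";
}

// Column of the grid, from the number of subjet b-tags on jet 2.
std::string btagColumn(int nSubjetBtagsJet2) {
    switch (nSubjetBtagsJet2) {
        case 0: return "b-tags";   // 0 subjet b-tags: b-tag sideband
        case 1: return "closure";  // 1 subjet b-tag: closure test
        case 2: return "signal";   // 2 subjet b-tags: signal column
        default: return "invalid";
    }
}
```

An event belongs to the signal region only when both lookups return "signal"; every other combination inside the grid is a control region.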

First checks were done using 0+1 subjet b-tags (0 subjet b-tags on the second jet, 1 subjet b-tag on first jet) to estimate 1+1 subjet b-tags. Estimations of 1+1 subjet b-tags give very reasonable agreement with the actual values. Estimation of 2+1 subjet b-tags high band is more difficult to compare because of lack of statistics.

Summary of Current Results (Oct 4th, 2013)

It has been found that subjet b-tagging, in a boosted dijet topology, works very effectively as a background discriminator. Almost all QCD background has been removed, leaving most of the signal intact. Categories with the highest efficiencies are 3 or 4 loose b-tags, and 3 or 4 medium b-tags. N-subjettiness applied to both jets (τ₂₁ < 0.5) reduces signal events by a factor of 1.3 and QCD background events by a factor of 3, independently of the applied b-tagging cuts.

The first background estimation, in which an ABCD method was attempted using a jet mass and a b-tagging sideband with the primary jet still "signal-tagged", failed because of the correlation between those variables. The second background estimation, using uncorrelated N-subjettiness and b-tagging, failed due to lack of statistics in the sideband region. A third method, in which jet mass and b-tagging were varied on different jets, seems much more successful so far.

This analysis has not yet been able to quantify this with actual numbers and uncertainties, but this will be done later, and the same method will be used in further analysis and in limit setting on the production cross section.

Input ROOT Files

The Ntuples (from Tijs) are located in /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES. They are:

Data: /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_Data.root
QCD (Mass = 0.5 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_QCD500.root
QCD (Mass = 1 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_QCD1000.root
ttbar: /afs/cern.ch/user/t/tomei/work/workWithAngelo/CMSSW_5_3_9/src/Ntuples/TNMc1/higgs_tagged_dijet_analysis_with_btag_TTJets.root
Wbb: /afs/cern.ch/user/t/tomei/work/workWithAngelo/CMSSW_5_3_9/src/Ntuples/TNMc1/higgs_tagged_dijet_analysis_with_btag_Wbb.root
Signal (Mass = 1 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_HHPy61000.root
Signal (Mass = 1.5 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_HHPy61500.root
Signal (Mass = 2 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_HHPy62000.root
Signal (Mass = 2.5 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_HHPy62500.root
Signal (Mass = 3 TeV): /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_HHPy63000.root
All signal: /store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_Allsignal.root

To copy any of the root files above to ACCESS, just type:

srmcp -2 srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/<root_file_name>   file:///<root_file_name>

But this is not recommended, since some of the files are bigger than 1 GB. Instead, use the Physical File Name (PFN), like this:

TFile *file = TFile::Open("root://eoscms//eos/cms/store/cmst3/user/mgouzevi/HH4B/TIJS_TREES/dijetWtag_Moriond_Data.root");

How to give someone access to files in your lxplus workspace:

fs setacl -dir /afs/cern.ch/user/t/tomei/work/workWithAngelo/CMSSW_5_3_9/src/Ntuples/TNMc1/ -acl <user_name> read

Parameters Correlation (Nov 21st, 2013)

Dijet masses are fitted through the formula

where P₀ takes care of normalization, while P₁ and P₂ take care of the distribution shape. There may be a correlation among these parameters. Here is the way to check such a correlation through a correlation matrix:

TH1F *h = new TH1F("h","h",100,-5,5);
h->FillRandom("gaus", 50000);
TFitResultPtr result = h->Fit("gaus", "S");
         NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
         1  Constant     2.00141e+03   1.09275e+01   4.14460e-02  -3.06373e-05
         2  Mean        -1.42906e-03   4.45701e-03   2.06212e-05  -4.26207e-03
         3  Sigma        9.94981e-01   3.11521e-03   3.94839e-06  -3.03397e-01

TFitResult* r = result.Get();
r->GetErrors()[0]
         1.09274960858893486e+01

r->GetErrors()[1]
         4.45701208205878576e-03

r->GetErrors()[2]
         3.11520927801539615e-03

r->PrintCovMatrix(cout)
         Covariance Matrix:

                              Constant        Mean       Sigma
         Constant             119.41 -0.00017258   -0.019493
         Mean            -0.00017258  1.9865e-05  8.4435e-08
         Sigma             -0.019493  8.4435e-08  9.7045e-06

         Correlation Matrix:

                              Constant        Mean       Sigma
         Constant                  1  -0.0035434    -0.57263
         Mean             -0.0035434           1   0.0060812
         Sigma              -0.57263   0.0060812           1
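The correlation matrix printed above follows from the covariance matrix through ρ_ij = C_ij / √(C_ii·C_jj). A minimal stand-alone sketch (plain C++, no ROOT; the function name and the `Matrix` alias are hypothetical) that reproduces the numbers from the covariance values shown:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Convert a square covariance matrix into a correlation matrix:
// corr[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j]).
// Assumes all diagonal elements (variances) are positive.
Matrix correlationFromCovariance(const Matrix& cov) {
    const std::size_t n = cov.size();
    Matrix corr(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            corr[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
        }
    }
    return corr;
}
```

Applied to the covariance matrix above, the (Constant, Sigma) element is -0.019493 / √(119.41 × 9.7045e-06) ≈ -0.573, in agreement with the printed correlation matrix. The same check can be applied to P₀, P₁ and P₂ once the dijet fit result is available.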

Relevant Documentation and Twiki Pages

HH → 4b: main twiki of this analysis;
Setup of the code: made by Tijs;
CMS AN 13-152: AN of this analysis;
CMS AN 213/347: summary of Tijs' work (when he was Andreas' summer student);
Tijs' talk: talk of Tijs about the side band estimation;
CMS PAS EXO-12-024 on W/Z-tag: Search for heavy resonances in the W/Z-tagged dijet mass spectrum in pp collisions at 8 TeV;
CMS PAS BTV-13-001 on b-tagging: Performance of b tagging at sqrt(s)=8 TeV in multijet, ttbar and boosted topology events;