Introduction to Searches
Performing a search for physics beyond the standard model (a.k.a. New Physics) at the LHC is no easy task. Here we give a brief outline of how it works.
Choose a prospective New Physics signal
For instance, if you're interested in Extra Dimensions (ED), you would choose one of the different flavours of ED developed by theorists - Warped ED, Universal ED, etc. Then you have to choose a particular channel / signature - for instance, the Warped ED (Randall-Sundrum, RS) graviton decaying into a pair of Z bosons, leading to a jets + dilepton signature. For the rest of this example, let's focus on exactly this channel: RS to ZZ to 2lep+jets.
Even if you plan on doing a model-independent analysis later, it is useful to have a model as a benchmark. This gives you at least an approximate idea of what kind of New Physics signal you are going to search for. In this step you will also be able to pinpoint the main standard model backgrounds that you will face. In the example, it is clear that standard model Z+jets production is going to be a very relevant background. Other processes which would contribute to the background are fully leptonic ttbar + jets and standard model multi-vector-boson production.
Simulation of signal samples
Since you want to know what your signal would look like in CMS, you need to simulate it. For the more common BSM models, the basic generators like Pythia and Herwig can do the generation for you. For more specialized models, you may need to rely on other software, like JHUGenerator, CompHEP and others. For preliminary studies, generator-level events are good enough; for complete studies, simulation of the CMS detector with Geant and full reconstruction of the events is required.
This is also the moment to familiarize yourself with the characteristics of your signal, plotting some basic variables such as the pt, eta, phi and mass of the different objects in your events. In our example, a quick investigation would reveal that our signal is characterized by a same-flavour, opposite-sign dilepton system with 1) very high pt, 2) very low delta R between the two leptons and 3) invariant mass around the mass of the Z boson. Additionally, our signal usually presents a single, high-pt jet with invariant mass around the mass of the Z boson as well.
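The variables mentioned above can be computed from the (pt, eta, phi, mass) of each object. The following is a minimal sketch in plain Python, not the CMS software itself; the function names are made up for this illustration.

```python
import math

def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) into Cartesian (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz

def invariant_mass(p1, p2):
    """Invariant mass of the system formed by two four-vectors."""
    energy, px, py, pz = (a + b for a, b in zip(p1, p2))
    m2 = energy * energy - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi]."""
    return math.remainder(phi1 - phi2, 2.0 * math.pi)

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the (eta, phi) plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))

# Two massless leptons, back to back in the transverse plane
lep1 = four_vector(45.6, 0.0, 0.0, 0.0)
lep2 = four_vector(45.6, 0.0, math.pi, 0.0)
m_ll = invariant_mass(lep1, lep2)
```

Filling histograms of such quantities for signal events is usually the quickest way to spot features like the small delta R between the two leptons.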
Design your triggers
The CMS detector is equipped with a trigger system, which analyzes the data in real time and selects the events with a high probability of coming from interesting physics processes, so that they are saved to permanent media.
Events not selected by the trigger are irremediably lost! You should make sure that a fair fraction of your simulated signal events is accepted by some trigger. If that is not the case, you should design a new trigger for your search, or you will have no data to analyze!
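A quick way to check this is to count how many simulated signal events fire at least one of the triggers you intend to use. The sketch below assumes each event is represented simply as the set of trigger names it fired; the names TRIG_A, TRIG_B and TRIG_C are placeholders, not real CMS trigger paths.

```python
def trigger_acceptance(events, trigger_names):
    """Fraction of events that fired at least one of the listed triggers.

    events: list of sets, each holding the names of the triggers fired.
    """
    if not events:
        raise ValueError("no events given")
    wanted = set(trigger_names)
    n_pass = sum(1 for fired in events if wanted & fired)
    return n_pass / len(events)

# Toy signal sample with placeholder trigger names
signal_events = [{"TRIG_A"}, {"TRIG_B", "TRIG_C"}, set(), {"TRIG_C"}]
acceptance = trigger_acceptance(signal_events, ["TRIG_A", "TRIG_B"])
```

If the acceptance turns out to be small, that is the hint that a dedicated trigger may be needed.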
Data collection
The data collected by CMS is grouped in large sets named Primary Datasets (PDs). Each PD contains events which passed a set of closely related triggers. For our example, in 2012 we would take events from the DoubleMu and DoublePhotonHighPt PDs (the latter, despite its name, contains the events which passed the DoubleEle33 triggers). Within those PDs, you would then select the events which passed specifically the trigger of your choice. You should also veto events where not all parts of the detector were operating properly (with the golden JSON file) or where some characteristic of the event makes it look a lot like detector noise (many standard event filters do this).
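The golden JSON veto can be sketched as follows, assuming the file maps each run number to a list of inclusive [first, last] luminosity-section ranges. The run number and ranges below are made up, and in practice the standard CMS tools apply this mask for you.

```python
import json

def load_good_lumis(json_text):
    """Parse a certification JSON into {run: [(first_lumi, last_lumi), ...]}."""
    raw = json.loads(json_text)
    return {int(run): [tuple(r) for r in ranges] for run, ranges in raw.items()}

def is_certified(good_lumis, run, lumi):
    """True if (run, lumi) falls inside one of the certified ranges."""
    return any(first <= lumi <= last for first, last in good_lumis.get(run, []))

# Made-up content, for illustration only
example_json = '{"190000": [[1, 50], [60, 80]]}'
good_lumis = load_good_lumis(example_json)
```

An event is then kept only if is_certified(good_lumis, run, lumi) is true for its run and luminosity section.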
Simulation of standard model backgrounds
This step is usually done centrally by specialized teams in CMS, since the backgrounds are the same for many analyses. You should pick the simulated samples which contain the processes that are most likely to be a background to your search. You should also keep the characteristics of your signal in mind. In our case, since the signal presents high-pt Z bosons, it is better to choose a specialized Z boson sample simulated with only high-pt Zs; in this way, there will be more simulated events to model the background in the region of interest.
Choice of discriminating variables
A comparison of our simulated background to the simulated signal should make clear which variables discriminate best between the two. In our example, the jet invariant mass turns out to be a good discriminator - since the accompanying jet in Z+jets events does not come from a Z boson, its mass is usually very low compared to the mass of the jet in the signal samples. The ranges of the variables which are optimal for signal selection are what is called the signal region. For our example, the signal region is: leptonic Z with mass in the (70,110) GeV range, hadronic Z (jet) with mass in the (70,110) GeV range.
Comparison of SM background and data in control regions
It is always important to validate your simulated background in some way. Usually, this is done by comparing the simulation with data in a control region - one where your signal is not expected to appear. Control regions are typically taken to be ranges of the discriminating variables close to the ones where most of the signal is expected to appear. Agreement between the simulation and the data in those control regions is fundamental; without it, you cannot really say that any decision you take by looking at the simulation is right. In our case, the control region is: leptonic Z with mass in the (70,110) GeV range, hadronic Z (jet) with mass in the (40,70) GeV range.
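The region definitions of our example can be written down as a small classifier. This is only a sketch: the function names are made up, and treating each range as including its lower edge (so that the two regions never overlap) is a choice made here, not something prescribed by the analysis.

```python
Z_WINDOW = (70.0, 110.0)   # GeV, leptonic Z and signal-region jet mass
SIDEBAND = (40.0, 70.0)    # GeV, control-region jet mass

def in_range(value, window):
    """True if low <= value < high."""
    low, high = window
    return low <= value < high

def classify_region(m_ll, m_jet):
    """Return 'signal', 'control' or None for a dilepton + jet event."""
    if not in_range(m_ll, Z_WINDOW):
        return None               # leptonic Z requirement is common to both
    if in_range(m_jet, Z_WINDOW):
        return "signal"
    if in_range(m_jet, SIDEBAND):
        return "control"
    return None
```

For instance, classify_region(91.0, 55.0) places an event in the control region, while a jet mass of 85 GeV would put it in the signal region.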
Additionally, if you have multiple control regions, you can try to find relations between them to constrain the behaviour of your background. For instance, maybe you can demonstrate that the background in your signal region is always expected to be proportional to the same background in a control region! If so, you just have to measure that constant of proportionality, multiply it by the real data in the control region (which is expected to be almost purely background, otherwise it would not be a control region!), and you have an estimate of the same background in the signal region. This is what is called a data-driven method.
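As a toy illustration of that arithmetic, the sketch below takes the constant of proportionality as an already measured input and propagates only the Poisson uncertainty of the control-region data; the uncertainty on the constant itself is ignored here, which a real analysis would not do. The function name and the numbers are made up.

```python
import math

def estimate_background(transfer_factor, n_data_control):
    """Background estimate in the signal region and its statistical error.

    transfer_factor: measured ratio (signal region / control region)
    n_data_control:  number of data events observed in the control region
    """
    estimate = transfer_factor * n_data_control
    # Poisson error on the control-region count, scaled by the same factor
    stat_error = transfer_factor * math.sqrt(n_data_control)
    return estimate, stat_error

# Made-up numbers: factor 0.25, 400 data events in the control region
bkg, bkg_err = estimate_background(0.25, 400)
```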
Estimation of SM background in signal region
In one way or another, you have to estimate the SM background in the signal region. Purely data-driven methods are the most reliable, but they may not always be possible. The next best thing is a properly validated SM simulation. The less you rely on simulation, the better; ratios of simulated distributions, for instance, are more reliable than the distributions themselves.
In our case, we don't have a purely data-driven method. However, we are going to take the ratio of the dilepton-jet mass distributions in the signal region and in the control region, for the simulated SM background. We expect the simulation to describe this ratio reasonably well, even if the individual distributions are not perfectly modelled. We multiply this ratio by the same distribution for the data in the control region, and we use that product as the estimate of the SM background in the signal region.
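Bin by bin, this procedure can be sketched as below, with histograms represented as plain lists of bin contents that share the same binning. The function names and numbers are made up, and setting the ratio to zero where the simulated control region is empty is a simplification of this sketch; in practice one would rather merge bins or fit the ratio with a smooth function.

```python
def transfer_ratio(mc_signal_region, mc_control_region):
    """Per-bin ratio of simulated background: signal region / control region."""
    ratio = []
    for n_sr, n_cr in zip(mc_signal_region, mc_control_region):
        # Empty control-region bin: no prediction possible, set ratio to 0
        ratio.append(n_sr / n_cr if n_cr > 0 else 0.0)
    return ratio

def predict_background(mc_signal_region, mc_control_region, data_control_region):
    """Per-bin background estimate in the signal region."""
    ratio = transfer_ratio(mc_signal_region, mc_control_region)
    return [r * n for r, n in zip(ratio, data_control_region)]

# Made-up M(llj) histograms with three bins
mc_sr = [10.0, 4.0, 0.0]
mc_cr = [20.0, 16.0, 0.0]
data_cr = [22, 12, 1]
prediction = predict_background(mc_sr, mc_cr, data_cr)
```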
Optimisation of signal region
Usually, your first guess of the definition of the signal and control regions is going to be suboptimal. Any analysis should go through an optimisation step, where the variable ranges which define the regions are scanned in order to achieve an optimal signal / background separation. This optimisation is usually guided by the computation and maximisation of a figure of merit; many choices are available for that, from the simple signal-to-background ratio (S/B) to more sophisticated ones. The Punzi significance, for instance, is usually written as Eff / (a/2 + sqrt(B)), where Eff is the signal selection efficiency, B is the expected background yield and a is the target significance in units of sigma; for a = 2 this becomes Eff / (1.0 + sqrt(B)). An alternative way is to optimise for the best expected limit.
Notice that this step should always be done before looking at the data in the signal region. This is the main principle of a so-called blind analysis; the analysis is done this way to prevent researcher bias, like tuning cuts in order to get better data/simulation agreement.
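A scan of this kind can be sketched as follows, here for a single lower cut on one variable. The figure of merit used is the Punzi form Eff / (a/2 + sqrt(B)) as commonly quoted in the literature, with a = 2 as an assumed default; the function names and the toy samples are made up for this illustration.

```python
import math

def punzi(eff, bkg, a=2.0):
    """Punzi figure of merit, eff / (a/2 + sqrt(B)); a = 2 is an assumption."""
    return eff / (a / 2.0 + math.sqrt(bkg))

def scan_lower_cut(signal_values, background, cuts, fom=punzi):
    """Try each lower cut and return (best_cut, best_fom).

    signal_values: variable values of unweighted simulated signal events
    background:    list of (value, weight) for simulated background events
    """
    best_cut, best_fom = None, -1.0
    for cut in cuts:
        eff = sum(1 for v in signal_values if v >= cut) / len(signal_values)
        bkg = sum(w for v, w in background if v >= cut)
        value = fom(eff, bkg)
        if value > best_fom:
            best_cut, best_fom = cut, value
    return best_cut, best_fom

# Toy jet-mass values (GeV): signal near the Z mass, background mostly low
signal = [80.0, 85.0, 90.0, 95.0]
background = [(10.0, 5.0), (20.0, 5.0), (50.0, 3.0), (85.0, 1.0)]
best_cut, best_fom = scan_lower_cut(signal, background, [0.0, 40.0, 70.0])
```

Only simulated samples enter the scan, so it can be run without touching the data in the signal region.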
Comparison of SM background and data in signal region
After the analysis is fully optimised, and all the experimental techniques have been deemed correct both by thorough scrutiny and by comparison of the data to the simulation in the control regions, one can look at the data in the signal region. Since we are doing a search for a new resonance in the example, we are going to look at the M(llj) distribution and see if the observed data agrees with the SM expectation, or if there is a localised excess (bump) somewhere. In the former case, we have found no evidence of New Physics, and should proceed to set limits on the cross-section of a physical process which would have produced such a bump.
The latter case is more interesting - after all, we may be on the verge of discovering a new particle! The first thing one should do is characterise the significance of the excess - in other words, how improbable it is to see such a bump, given that we know our background with a given uncertainty. If the excess is not significant enough, it is probably the result of a statistical fluctuation, and again limits should be set - the excess just means that your limits will not be as good. Notice that "significant enough" is rather subjective, so we have some objective criteria: we call a three-sigma global significance "evidence" for a new particle, while a five-sigma global significance would call for the announcement of a "discovery".
In our example, we search for a bump in M(llj) at a given mass, and set a limit for that mass. Repeating this for a range of masses allows us to make a curve of limit vs mass. There are automated tools in CMS to make this kind of curve.
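To get a feeling for what those thresholds mean, a significance in sigmas can be converted into a one-sided Gaussian tail probability (p-value) and back. The sketch below only does this conversion with the Python standard library; it does not compute the significance of an excess, nor the correction that turns a local significance into a global one.

```python
import math
from statistics import NormalDist

def p_value_from_sigma(z):
    """One-sided Gaussian tail probability for a significance of z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def sigma_from_p_value(p):
    """Inverse conversion: significance in sigma for a one-sided p-value."""
    return -NormalDist().inv_cdf(p)

p_evidence = p_value_from_sigma(3.0)    # about 1.35e-3
p_discovery = p_value_from_sigma(5.0)   # about 2.87e-7
```

In other words, a five-sigma excess corresponds to a background fluctuation expected roughly three times in ten million trials.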
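To illustrate what such a curve contains, here is a heavily simplified sketch: a counting experiment in a window around each mass hypothesis, with a classical Poisson upper limit on the signal yield and a perfectly known background. This is an assumption made for illustration only and is not the statistical procedure used by the CMS tools; all names and numbers below are made up.

```python
import math

def poisson_cdf(n, mu):
    """P(k <= n) for a Poisson distribution with mean mu."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def upper_limit_events(n_obs, bkg, cl=0.95):
    """Signal yield s for which P(k <= n_obs | s + bkg) equals 1 - cl."""
    alpha = 1.0 - cl
    if poisson_cdf(n_obs, bkg) <= alpha:
        return 0.0                      # background alone is already excluded
    low, high = 0.0, 10.0
    while poisson_cdf(n_obs, bkg + high) > alpha:
        high *= 2.0                     # enlarge the bracket until it contains s
    for _ in range(100):                # plain bisection
        mid = 0.5 * (low + high)
        if poisson_cdf(n_obs, bkg + mid) > alpha:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

def limit_curve(points, luminosity):
    """Cross-section limit per mass point: s_up / (efficiency * luminosity)."""
    return [(p["mass"],
             upper_limit_events(p["n_obs"], p["bkg"]) / (p["eff"] * luminosity))
            for p in points]

# Made-up inputs: mass in GeV, luminosity in fb^-1, so limits come out in fb
points = [{"mass": 1000, "n_obs": 0, "bkg": 0.0, "eff": 0.5},
          {"mass": 1500, "n_obs": 3, "bkg": 0.0, "eff": 0.5}]
curve = limit_curve(points, luminosity=20.0)
```

Plotting the second element of each pair against the first gives the limit vs mass curve; comparing it with the predicted cross-section of the benchmark model tells you which masses are excluded.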
-- Main.trtomei - 2014-06-24