The original (English) of my interview in the Lettre IN2P3 Informatique

Balázs, you are one of the two computer scientists who carry out their research at the IN2P3. Can you tell us about the path that led you to join our community?

I got my engineering degree in electrical engineering, specializing in computer science, in Hungary in 1994. I was, from the very beginning of my studies, attracted to machine learning, which is, to cite Andrew Ng, the science of getting computers to act without being explicitly programmed. That was the time of the first successes of neural networks on practical problems, for example, handwritten character recognition. The connections to human intelligence and the brain intrigued me. There was very little theory behind these methods, and given that I was surrounded by smart mathematicians working on statistics and signal processing, I started to do research mainly in learning theory. During my PhD and postdoc in Canada my engineering vein slowly took over and I started to work more on the algorithmic and methodological side of machine learning, but my math background helped me a lot in formalizing problems and solutions all throughout my career.

I was hired by the Université de Montréal as an assistant professor in 2001, where I was a “classical” machine learning researcher, publishing papers at our two major yearly conferences (ICML and NIPS), and working either on intrinsically motivated problems (improving and analyzing methods) or on problems motivated by practical applications in software engineering, music processing, and image processing. During this period I was regularly (although not very intensively) looking for opportunities to come back to Europe, mainly for family reasons. In 2006 a data mining CR1 position opened at LAL, and I applied and got hired. I was diving into deep water, since I had had no formal training in physics since high school, but I liked the challenge. I have to admit that it was a risky career move, but I knew that the flexibility and freedom of a CNRS position made it possible to go back to a classical machine learning lab in case the experiment did not work out.

I think that at that time only Guy Wormser, the director of LAL, had a clear vision that machine learning and data mining research had a place in high-energy physics. It is amazing to see that this vision, a data scientist embedded in a science lab where data is produced and analyzed, is becoming mainstream.

Can you briefly describe your and your team's research interests? Since arriving at the IN2P3, what has been your interaction with the physicists of our discipline? How would you describe the impact of your own research on computational methods in experimental particle physics? Conversely, how have you profited from this interaction?

When I arrived at LAL, I joined the Pierre Auger Experiment and the Auger group at LAL. There was nothing conscious about this choice, but retrospectively I was probably intimidated to approach one of the larger particle physics experiments. It turned out to be a wise decision: Auger was a small group with smart and open-minded physicists, particularly Marcel Urban, who was patient enough to introduce me to the basics of particle physics and of experimental physics in general, mainly during lunches in the canteen. I very much enjoyed the intellectual challenge of learning a new discipline, so the learning phase was quite short. I can proudly say that in Auger I can pass as a physicist without a problem. It also turns out that an experiment in its final phase mainly needs data scientists who know how to extract knowledge from data, so my expertise in the field came in quite handy in the many cases where the data analysis was not straightforward.

The mission of my AppStat group became clear quite fast: to bring state-of-the-art analysis techniques to physics, and to motivate basic research in machine learning and statistics by real problems in physics. I rapidly built a team, thanks to two ANR projects, MetaModel and Siminole. During these eight years I graduated three computer science PhD students, co-supervised two physics PhD students, and supervised four postdocs, an engineer, and a visiting researcher. I have been collaborating widely with machine learning and statistics groups in Saclay, mainly with Michèle Sebag and Cécile Germain from LRI and Olivier Cappé and Gersende Fort from Telecom. About half of my publications are in machine learning with no physics motivation; this was important to keep in touch with my community. Interestingly, physics also motivated me to widen my horizons on the methodological side, since some of the problems could not have been solved by techniques that I had known before. Another effect of working with physicists is that I learned from them humility and rigor in doing computer experiments. In machine learning, creativity is everything: we invent methods that have to significantly improve on existing techniques to be published. Improvement is usually evaluated on benchmark data sets, but these experiments are often far from rigorous, something that I can now clearly see thanks to my experience in physics.

How do you see the future of cross-fertilization between computational statistics and high energy physics?

In the last two or three years I have gradually started to work with particle physics groups. With the LHCb experiment we are working on budgeted learning to design triggers. I have two projects in a theme that I call “learning to discover”. With the ATLAS group we are working on multivariate methods to optimize the discovery significance. We have recently launched the HiggsML open data challenge, which attracted 700 teams of data scientists in a month. It is interesting to see how hungry the machine learning community is for data coming from scientific projects. The most futuristic project is with the Calice (ILC) team at LAL and LLR: we are working on adapting deep representation learning techniques to imaging calorimeter data, basically teaching particle physics to computers by showing them events. These methods have revolutionized speech recognition and computer vision in the last five years, and have the potential of achieving the 50-year dream of artificial intelligence.
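To make "optimizing the discovery significance" concrete: the HiggsML challenge scored classifiers with the approximate median significance (AMS), a function of the expected signal count s and background count b passing a selection cut. A minimal sketch (the formula is the one used in the challenge; the counts below are toy numbers, purely illustrative):

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance, as used in the HiggsML challenge.

    s     -- expected signal events passing the selection
    b     -- expected background events passing the selection
    b_reg -- regularization term (set to 10 in the challenge)
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# A multivariate classifier outputs a score per event; sweeping a
# threshold on that score and keeping the cut that maximizes AMS is
# the simplest way to tune a selection for discovery significance.
print(ams(100.0, 1000.0))  # toy counts, roughly a 3-sigma selection
```

Note the asymmetry with plain accuracy: the objective rewards keeping signal while suppressing background, so the optimal threshold usually differs from the one minimizing classification error.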

You proposed, and were then asked, to set up a Center for Data Science under the Foundation for Scientific Cooperation Paris-Saclay Campus (FCS), arguably the biggest scientific assemblage ever in France. What is a "Center for Data Science"? How do you plan to meet this challenge?

As I said earlier, what was visionary eight years ago has become mainstream today. Centers for Data Science (or similarly named initiatives) are popping up everywhere in the world (NYU, Berkeley, UWashington, Amsterdam, Edinburgh, just to mention the main ones). The idea is very much a generalization of AppStat. Today, the data science community is scattered across different disciplines. We are essentially doing the same research in statistics (mathematics), machine learning, data mining, data visualization (computer science), and signal processing (electrical engineering). The proof: we meet regularly at the same scientific conferences. The first goal of the CDS is to form a community of data scientists at the new Université Paris-Saclay. The second major challenge is that data are today ubiquitous, and disciplines which were rather data-poor in the past are inundated by data today. Our goal is to create an “agora” where scientists who have data (and analysis problems) can meet scientists who know (and do research on) data analysis methodology. A third and important goal is to organize the building and maintenance of software tools that can be used for data analysis across disciplines. The physics community has long experience in this domain, paradoxically much more than the computer science community, and we are looking forward to learning from and generalizing this experience in building tools for the wider scientific community.

The CDS, for now, is a two-year project. Our two main tools are financing 10-15 interdisciplinary projects and 3-6 theses, and organizing thematic days and informal brainstorming sessions. We are in the process of evaluating the proposals to our first call. We will launch a second call for PhD projects next fall, and, if we still have money left, we will have a second project call at the beginning of 2015. We are planning to organize 8-10 thematic days in the next two years, around both methodological and scientific themes. Besides particle physics, we see strong data science communities forming around neuroscience, environmental and Earth sciences, economics, biology and chemistry, and astrophysics and cosmology.

Besides these short-term goals, we will also rapidly start to design a long-term strategy. We have strong bottom-up motivation and, at the same time, pressure from the FCS to carry out this work. It’s an exciting time: in a sense we have to invent the future of data science for data in science. The task is challenging: we have to find a way to initiate and then organize temporary interdisciplinary projects and to encourage people to invest in tool-building. Saclay is an ideal place for this in France: we have the critical mass (more than 250 researchers are associated with the CDS), the top-down support of the FCS and other actors (labs, schools, universities, and Labexes), and a “softened” institutional landscape which is being changed by the formation of UPSay anyway.

A major challenge that seems to go largely unnoticed by the decision makers in our host institutions is the unprecedented brain drain of data scientists into private IT research. Most of the major IT companies (Google, Microsoft, Facebook, Amazon, Baidu, Criteo, etc.) are rapidly setting up full-blown research labs, offering salaries and engineering infrastructure with which the public sector cannot compete, and, at the same time, bringing us serious and exciting scientific problems in social sciences, engineering, and artificial intelligence research. This is happening at a time when students are also discovering data science and showing up in our classes en masse. Without a concerted strategy on the part of the national institutes and higher education, we are facing a serious shortage in public data science research and education in the near future. The CDS cannot solve all these problems, but we can be an important part of the solution.