Understanding the learning process: machine learning and computational chemistry for hydrogenation

News - 02 September 2024 - Communication ChemE

Machine learning is mentioned everywhere, but can it be applied to modelling homogeneous catalysis? Researchers from TU Delft, together with Janssen Pharmaceutica, have published an extensive study in Chemical Science, accompanied by one of the largest datasets on rhodium-catalyzed hydrogenation, to try to answer this question.

For more than half a century, rhodium-based catalysts have been used to produce chiral molecules via the asymmetric hydrogenation of prochiral olefins. The importance of this transformation was recognized with the Nobel Prize awarded to Knowles and Noyori for their contributions to the field. Nowadays, asymmetric hydrogenation catalysts are widely used in the pharmaceutical industry, numerous chiral ligands are available to tackle a wide range of prochiral substrates, and the reaction mechanism has been extensively studied. One would therefore expect that finding the best catalyst for the asymmetric hydrogenation of a new substrate is a trivial task. Unfortunately, this is not the case, and a tedious and costly experimental screening is still needed. Adarsh Kalikadien and Evgeny Pidko from TU Delft, together with experts in high-throughput experimentation, data science and computational chemistry from Janssen Pharmaceutica in Belgium, decided to investigate whether a well-trained machine could do the job. To their surprise, the machine was not able to learn as much as they expected.

The idea was to set up a simple model reaction with a well-known rhodium catalyst. Based on the experimental data generated by Janssen’s high-throughput experimentation team, a computational dataset was built to which multiple machine learning models were applied. “We digitalized the 192 catalyst structures and represented them with features of various levels of complexity for the machine learning models,” says Kalikadien, a PhD student in Pidko’s group. “The interesting thing was that all the simpler representations, including the random one, performed similarly to the expensive variant, which intrigued us. It turned out to be an early indication that the machine was not really learning anything useful.”
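The random-baseline check described here can be sketched in a few lines. This is an illustrative sketch, not the authors' published code: the features, targets, dataset and model choice are all hypothetical placeholders. The point is the experimental design: if a model scores no better on informative descriptors than on pure noise, it is likely not learning real structure-performance relationships.

```python
# Illustrative sketch (assumed setup, not the published pipeline):
# compare catalyst representations of increasing cost against a
# random-feature baseline under identical cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_catalysts = 192  # dataset size mentioned in the study

# Placeholder targets (e.g. a reaction outcome such as enantiomeric
# excess); real values would come from high-throughput experiments.
y = rng.uniform(0, 100, n_catalysts)

# Placeholder feature matrices standing in for representations of
# different cost; the "random_baseline" is deliberately pure noise.
representations = {
    "cheap_2d_descriptors": rng.normal(size=(n_catalysts, 20)),
    "expensive_dft_features": rng.normal(size=(n_catalysts, 50)),
    "random_baseline": rng.normal(size=(n_catalysts, 20)),
}

for name, X in representations.items():
    scores = cross_val_score(RandomForestRegressor(random_state=0),
                             X, y, cv=5, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.2f}")
```

With synthetic data all three scores will be comparably poor, which is exactly the warning sign the researchers describe: the expensive representation buys no advantage over noise.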

"One of our conclusions was, when tested more extensively, that for an out-of-domain modeling approach, it doesn't matter what representation you put in”. Nevertheless, although the team was not able to build an accurate model, their study was worth publishing. The publication process went relatively smoothly. 

“Although the first journal we contacted rejected our submission as too specialized, the high-impact journal Chemical Science saw the value of this work. Not many researchers are interested in just seeing the R² value of a model with no possibility to use it; they are probably interested in an in-depth analysis like ours. So we were able to submit our data, code and even interactive figures there for everyone to use.” At the moment there is a strong incentive to publish negative results, since models trained mainly on positive results tend to become heavily biased; such publications help the community to assess the true added value of machine learning. “We made everything open source,” says Kalikadien. “Not only is all the data accessible, but we also offer the code, including packages and instructions, so that anyone who is interested can do the same type of research.” In this way, they have published one of the largest datasets on rhodium-catalyzed hydrogenation.

What's next? "Our representation of the catalyst wasn't as meaningful for the machine learning models as we had hoped, so we are now looking for a representation that may be less simplified but still as simple as possible," says Kalikadien. "Creating a digital representation of your catalyst should not cost way more money than running the actual experiment, so we are trying to incorporate more information from the reaction mechanism into the model without making it too extensive. A more dynamic and hopefully more informative version of the representation."

Adarsh Kalikadien, Cecile Valsecchi, Robbert van Putten, Tor Maes, Mikko Muuronen, Natalia Dyubankova, Laurent Lefort and Evgeny A. Pidko