TU Delft scientists put ChatGPT to the test
Researchers at Delft University of Technology and RWTH Aachen University have put ChatGPT’s knowledge of science and engineering to the test. By having 198 Delft scientists evaluate GPT-3.5’s answers to questions covering the university’s natural science and engineering disciplines, they found out how well the large language model can answer university-level questions. The study shows that, on average, ChatGPT’s answers to exam-like questions are mostly correct across faculties at both Bachelor’s and Master’s level. Even at PhD level, most of the chatbot’s answers were either partly or mostly scientifically correct.
ChatGPT, a chatbot based on a large language model by tech company OpenAI, has gained enormous popularity since its release in November 2022 because of its ability to generate convincing human-like text. The increasing use of the programme has stirred up discussions on whether and how we can regulate AI tools like ChatGPT in science, teaching and exams. This is why a team of Delft scientists decided to put ChatGPT to the proverbial test with questions at Bachelor’s, Master’s, and PhD level.
Mostly correct
PhD Candidate Lukas Schulze Balhorn and fellow researchers sent out a survey to hundreds of Delft scientists across the natural science and engineering faculties of the university. “We asked them to formulate three questions about their own discipline at Bachelor, Master and PhD level, and evaluate ChatGPT’s answers”, says Schulze Balhorn. “Our results show that the answers from ChatGPT are on average perceived as ‘mostly correct’ across faculties. The programme performed best at the simpler Bachelor level questions, where it answered more than half of the questions mostly up to completely correct.”
The team didn’t expect ChatGPT to answer the Master and PhD level questions that well, or so consistently across a broad range of domains – from chemistry, to aerospace engineering, to computer science. “I think we all expected it to produce more nonsense”, says Jana Weber, Assistant Professor in AI for Bioscience. “The fact that it’s so consistent must mean that the model has added a lot of scientific journal papers and textbooks to its training data. In that sense, ChatGPT could be more helpful to students than we expected, and at the same time more noteworthy for the potential of cheating at for instance take-home exams.”
Extremely impressive
The scientific correctness of the answers to the PhD level questions was particularly surprising, says Artur Schweidtmann, Assistant Professor in AI & Machine Learning for Chemical Engineering: “At PhD level, we’re talking about open research questions in specific scientific domains. This is stuff that I would have great difficulties answering. That is extremely impressive.”
Awareness of impact
On skills beyond the scientific content, such as critical attitude and awareness of how the answer may impact society, the chatbot didn’t score as well. “One example where ChatGPT did show awareness was its response to a question on forensic science and chemistry, how to synthesise MDMA, commonly used in drugs such as XTC. In this case, ChatGPT refused to answer, saying it is not appropriate to provide information about synthesis of illegal drugs. But in most cases, there was no sign of such awareness”, Schweidtmann says. “The language model underlying ChatGPT really isn’t aware of anything, the programme just has built-in safeguards where it doesn’t give you the answer it’s actually writing. But in cases where the filter doesn’t kick in, the ethical awareness is definitely not fantastic”, Stefan Buijsman, Assistant Professor on Ethics & Values in Technology, adds.
Another limitation is that the reasoning behind the answers is missing. Buijsman: “You still need the underlying thought process of what to do with this outcome that you're getting out of ChatGPT. It's not as if ChatGPT can suddenly do the work of the scientists or the engineers that we're aiming to educate. It's about knowing which questions to ask, and knowing which answers to trust. The scientific correctness is impressive, but at the same time might be missing important details. ChatGPT performed the worst in critical attitude and reasoning, yet these are critical skills for our students to have.”
To Schweidtmann, the pace at which AI models like ChatGPT are developing and improving is striking. “That’s why I conclude that we need to learn how to use it, and also teach our students how to do that, and make them aware about the advantages and disadvantages of these models.”
This news release was written entirely by humans
A few examples from the study:
Bachelor level question – 3mE
How big is the divergence of the velocity field in an incompressible fluid flow?
Answer by GPT-3.5:
In an incompressible fluid flow, the divergence of the velocity field is always equal to zero. This is because an incompressible fluid is defined as a fluid in which the density remains constant, and as a result, the volume of the fluid elements cannot change. Therefore, there can be no net flow of fluid into or out of a given volume, and the divergence of the velocity field must be zero.
Scientific correctness: completely correct (5/5)
Critical attitude: N/A
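The zero-divergence statement in the Bachelor-level answer above can be checked numerically. The following is a minimal sketch, not part of the study: the velocity field, the function names and the step size are all illustrative assumptions. It builds a two-dimensional field from a stream function, which is incompressible by construction, and compares it with a uniformly expanding field whose divergence is 2.

```python
import math


def incompressible_field(x, y):
    """Hypothetical 2D velocity field from the stream function psi = sin(x) * cos(y).

    u = d(psi)/dy and v = -d(psi)/dx, so the divergence vanishes analytically.
    """
    u = -math.sin(x) * math.sin(y)
    v = -math.cos(x) * math.cos(y)
    return u, v


def expanding_field(x, y):
    """Counterexample (assumed for illustration): uniform expansion, divergence = 2."""
    return x, y


def divergence(field, x, y, h=1e-5):
    """Approximate du/dx + dv/dy at (x, y) with central finite differences."""
    du_dx = (field(x + h, y)[0] - field(x - h, y)[0]) / (2 * h)
    dv_dy = (field(x, y + h)[1] - field(x, y - h)[1]) / (2 * h)
    return du_dx + dv_dy


if __name__ == "__main__":
    print("incompressible:", divergence(incompressible_field, 0.3, 1.2))
    print("expanding:     ", divergence(expanding_field, 0.3, 1.2))
```

The first value should be zero up to rounding error, while the second should be close to 2, meaning fluid volume is being created at every point.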
Master level question – AS
Which polymer model would you use to describe the physical properties of DNA?
Answer by GPT-3.5:
The worm-like chain (WLC) model is commonly used to describe the physical properties of DNA, such as its elasticity and persistence length. The WLC model treats DNA as an elastic, semi-flexible polymer and describes the behavior of the polymer using a combination of elasticity theory and statistical mechanics. It has been shown to accurately describe a wide range of experimental measurements of DNA properties, including its bending and stretching behavior.
Scientific correctness: completely correct (5/5)
Critical attitude: N/A, no (societal) impact
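To make the stretching behaviour mentioned in the Master-level answer above more concrete, here is a minimal sketch of the Marko-Siggia interpolation formula, a widely used approximation for the force-extension relation of a worm-like chain. It is not taken from the study. The persistence length of 50 nm (a commonly quoted value for double-stranded DNA) and the thermal energy of 4.11 pN·nm (roughly room temperature) are assumptions, as are the function and constant names.

```python
K_B_T_PN_NM = 4.11  # assumed thermal energy k_B * T near 298 K, in pN * nm


def wlc_force(extension_fraction, persistence_length_nm=50.0, kbt=K_B_T_PN_NM):
    """Force in pN needed to stretch a worm-like chain to a fraction of its contour length.

    Uses the Marko-Siggia interpolation formula:
        F = (k_B T / L_p) * [ 1 / (4 * (1 - x)^2) - 1/4 + x ],  with x = extension / contour length.
    """
    if not 0.0 <= extension_fraction < 1.0:
        raise ValueError("extension_fraction must be in [0, 1)")
    x = extension_fraction
    return (kbt / persistence_length_nm) * (0.25 / (1.0 - x) ** 2 - 0.25 + x)


if __name__ == "__main__":
    for x in (0.1, 0.5, 0.9, 0.99):
        print(f"x = {x:4.2f}  ->  F = {wlc_force(x):8.3f} pN")
```

The force stays well below 1 pN for moderate extensions and rises steeply as the extension approaches the contour length, which is the characteristic behaviour of a semi-flexible polymer.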
PhD level question – EEMCS
Is there a general algorithm to predict protein folding?
Answer by GPT-3.5:
There is no general algorithm that can accurately predict protein folding for all proteins. However, various computational methods have been developed to predict the structure of proteins, including homology modeling, threading, and ab initio prediction methods. These methods use various types of information, such as the amino acid sequence, evolutionary relationships, and physical principles, to predict the three-dimensional structure of a protein. However, these methods are not perfect and the accuracy of predictions can vary depending on the specific protein being studied. Additionally, experimental methods such as X-ray crystallography and NMR spectroscopy are often used to determine the actual structure of a protein.
Scientific correctness: completely correct (5/5)
Critical attitude: it gave a good answer, however without criticism (1/5)
What does ChatGPT know about natural science and engineering?
https://arxiv.org/abs/2309.10048