Overview
This report describes the article review process, which combines two main tasks. The first is to identify the criteria by which the model will evaluate articles, as well as the criteria for evaluating the performance of the model itself. The second is to test the results the model produces.
Several models have been tested so far (GPT-4o, Mixtral, and OpenBioLlama), and preliminary criteria for reviewing articles have been identified.
Review process and issues
The review process is divided into seven main stages, each of which the model handles with varying degrees of success. Below is the list of these stages and criteria, with a rationale for whether or not each step can be analyzed.
- Introduction and problem statement
  - Introduction and problem statement
  - Relevance and novelty
This point is generally feasible, with only minor limitations. The models were trained on a wide range of material from Wikipedia, articles, and other sources, which allows them to define the problem statement effectively.
Relevance, however, depends on the particular model: GPT-4o was trained on data up to October 2023, while Mixtral's knowledge extends only to about mid-2022. The accuracy and relevance of a model's judgments therefore depend heavily on how recent the model and its updates are.
- Research methods
  - Research methods
  - Research design
  - Description of methods
  - Ethical considerations
In general, this task is feasible because, as noted earlier, the models were trained on a wide range of materials.
Ethical considerations, however, are difficult to predict in advance: they are not present in all articles, and it is unknown whether such sections were present in the models' training data.
- Results and interpretation
  - Results
  - Analyses and interpretation
  - Statistical analysis
  - Interpretation of data
The models can also cope with this task, with the exception of statistics and possibly data interpretation.
All of the tested models may struggle with such problems and produce incorrect answers. In addition, the models do not yet understand graphs and charts well, which further complicates data interpretation. OpenBioLlama and Mixtral, for example, cannot interpret graphs at all, and they also fail at calculations more complex than simple addition.
For data interpretation, OpenBioLlama faces an additional problem: its small context window (8k tokens) limits how much text can be fed as input. Even a small article, together with the prompt and the expected response, can exceed 9k tokens.
This creates the following difficulty: the entire article cannot fit into the model, and when the article is broken into parts and fed in separately, the overall meaning of the data interpretation is lost, which can lead to an incorrect answer.
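The overflow can be checked before calling a model. The sketch below uses the rough heuristic of about four characters per token (actual tokenizers vary by model) and an assumed 8k context limit; the names and limits here are illustrative, not taken from any specific API.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the common ~4-characters-per-token heuristic."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, article: str, context_limit: int = 8_000,
                   reserve_for_response: int = 1_000) -> bool:
    """Check whether prompt + article still leave room for the model's answer."""
    used = estimate_tokens(prompt) + estimate_tokens(article)
    return used + reserve_for_response <= context_limit

# A 40,000-character article (~10k tokens) will not fit in an 8k window.
article = "x" * 40_000
print(fits_in_window("Review this article:", article))  # False
```

Such a pre-check makes it possible to decide in advance whether an article must be split into chunks at all.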
There are several solutions to this problem:
- The chat history can be preserved and passed to the model each time a new piece of the article is loaded. In this case, however, the history grows with each new chunk, and by the end there is no difference from feeding the whole article as input.
- A brief summary of the chat history can be made and fed to the model instead. In this case, however, important information may be lost: there is no guarantee that summarizing the history keeps only the right data, or all of it. Moreover, this option does not fully solve the token overflow problem; at best, only the smallest articles can be fed as input, with the risk that some important information is lost.
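The second option can be sketched as a rolling-summary loop. The `summarize` and `review_chunk` callables below are hypothetical stand-ins for model calls, since the actual API depends on which model is used; the point is only the control flow, where the model sees a compressed summary rather than the full history.

```python
from typing import Callable, List

def review_in_chunks(chunks: List[str],
                     summarize: Callable[[str], str],
                     review_chunk: Callable[[str, str], str]) -> str:
    """Feed an article chunk by chunk, carrying a running summary instead of
    the full chat history so the context window is not exceeded."""
    summary = ""
    review = ""
    for chunk in chunks:
        # The model sees only the compressed summary plus the current chunk.
        review = review_chunk(summary, chunk)
        # Compress everything seen so far for the next iteration.
        summary = summarize(summary + "\n" + chunk)
    return review
```

The trade-off described above is visible in the code: whatever `summarize` discards is unavailable to every later chunk, which is exactly where important information can be lost.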
- Discussion
  - Comparison with previous studies
There are two options here:
- Feed previous studies as input, though this consumes part of the token budget available for the article itself.
- Rely on the completeness of the model's own training.
- Conclusion and overall evaluation
  - Conclusions
  - Overall evaluation
  - Recommendations for improvement
Most models should be able to cope with this task given a sufficient number of tokens; but if the article is fed in chunks, then, as already noted, some important information may be lost.
- Presentation and formatting of the article
  - Presentation of the article
  - Structure and logic of presentation
  - Language and style
  - Illustrative material
  - Literature
Regarding the illustrative material: as noted earlier, virtually no current model handles it effectively, and Mixtral and OpenBioLlama are no exception. For the other items, no problems are expected. However, if the article is split into parts to fit the model's input, the model will only produce results for the individual parts, without considering their relationship to the rest of the article.
As for the literature, it is preferable that the model can load the article in full, so that it can assess whether the references are relevant to the content. It is also worth noting that the newer the model, the more up-to-date its answers.
- References and citation
  - Completeness and relevance of the list of references (needs more testing)
  - Citation
To evaluate citations, the article must be loaded in full, so that the model can understand the context in which each citation is used and interpret it correctly.
Conclusion
Based on the above, two main challenges that models face in text processing can be identified.
The first is the limit on the number of tokens the model can process at once: the larger the token window, the better the model can understand the context and coherence of the text.
When this is addressed by breaking the text into smaller parts, another problem arises: loss of narrative coherence. To mitigate this, the Mixtral model can be used, which has a 32k token window and can therefore handle longer texts than OpenBioLlama.