GPT-4 vs. ChatGPT: An Exploration of Training, Performance, Capabilities, and Limitations
GPT-4 is an improvement, but temper your expectations.
Image created by the author
OpenAI stunned the world when it dropped ChatGPT in late 2022. The new generative language model is expected to totally transform entire industries, including media, education, law, and tech. In short, ChatGPT threatens to disrupt just about everything. And even before we had time to truly envision a post-ChatGPT world, OpenAI dropped GPT-4.
In recent months, the speed with which groundbreaking large language models have been released is astonishing. If you still don’t understand how ChatGPT differs from GPT-3, let alone GPT-4, I don’t blame you.
In this article, we will cover the key similarities and differences between ChatGPT and GPT-4, including their training methods, performance and capabilities, and limitations.
ChatGPT vs. GPT-4: Similarities & differences in training methods
ChatGPT and GPT-4 both stand on the shoulders of giants, building on previous versions of GPT models while adding improvements to model architecture, employing more sophisticated training methods, and increasing the number of training parameters.
Both models are based on the transformer architecture, which uses an encoder to process input sequences and a decoder to generate output sequences. The encoder and decoder are connected by an attention mechanism, which allows the decoder to focus on the most relevant parts of the input sequence when generating each output token.
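For a concrete picture of what "attention" means here, below is a minimal NumPy sketch of scaled dot-product attention, the core operation of the transformer. It is a toy illustration only: real models add learned projections, multiple attention heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches the query, so the
    model can focus on the most relevant positions in the input sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # weighted sum of the values

# Toy example: 3 output positions attending over 4 input positions, dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```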
OpenAI’s GPT-4 Technical Report offers little information on GPT-4’s model architecture and training process, citing the “competitive landscape and the safety implications of large-scale models.” What we do know is that ChatGPT and GPT-4 are probably trained in a similar manner, which is a departure from training methods used for GPT-2 and GPT-3. We know much more about the training methods for ChatGPT than GPT-4, so we’ll start there.
ChatGPT
To start with, ChatGPT is trained on dialogue datasets, including demonstration data in which human annotators provide examples of the expected output of a chatbot assistant in response to specific prompts. This data is used to fine-tune GPT-3.5 with supervised learning, producing a policy model, which generates multiple responses when fed prompts. Human annotators then rank the responses to a given prompt from best to worst, and these rankings are used to train a reward model. The reward model is then used to iteratively fine-tune the policy model with reinforcement learning.
Image created by the author
To sum it up in one sentence, ChatGPT is trained using Reinforcement Learning from Human Feedback (RLHF), a way of incorporating human feedback to improve a language model during training. This allows the model's output to align with the task requested by the user, rather than simply predicting the next word in a sentence based on a corpus of generic training data, like GPT-3.
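To make the pipeline easier to follow, here is a highly simplified Python-style sketch of the RLHF loop just described. It is not OpenAI's implementation; every function here (sft_finetune, collect_rankings, train_reward_model, ppo_update) is a hypothetical placeholder for a much more involved step.

```python
# Highly simplified outline of the RLHF pipeline described above.
# All helper functions are hypothetical placeholders, not OpenAI's code.

def rlhf_training(base_model, demonstration_data, prompts, n_iterations=3):
    # 1. Supervised fine-tuning on human demonstrations -> initial policy model
    policy = sft_finetune(base_model, demonstration_data)

    for _ in range(n_iterations):
        # 2. The policy generates several candidate responses per prompt
        candidates = {p: [policy.generate(p) for _ in range(4)] for p in prompts}

        # 3. Human annotators rank the candidates for each prompt
        rankings = collect_rankings(candidates)

        # 4. A reward model is trained to predict those human preferences
        reward_model = train_reward_model(rankings)

        # 5. Reinforcement learning (e.g., PPO) fine-tunes the policy to
        #    maximize the reward model's score
        policy = ppo_update(policy, reward_model, prompts)

    return policy
```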
GPT-4
OpenAI has yet to divulge details on how it trained GPT-4. Their Technical Report doesn’t include “details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” What we do know is that GPT-4 is a transformer-style generative multimodal model trained on both publicly available data and licensed third-party data and subsequently fine-tuned using RLHF. Interestingly, OpenAI did share details regarding their upgraded RLHF techniques to make the model responses more accurate and less likely to veer outside safety guardrails.
After a policy model is trained (as with ChatGPT), RLHF is used for adversarial training, a process in which the model is trained on malicious examples intended to deceive it, so that it learns to resist such examples in the future. In the case of GPT-4, human domain experts across several fields rate the responses of the policy model to adversarial prompts. These rated responses are then used to train additional reward models that iteratively fine-tune the policy model, resulting in a model that is less likely to give dangerous, evasive, or inaccurate responses.
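Continuing the same hypothetical sketch from above, the adversarial phase could be outlined as follows, with expert ratings of responses to adversarial prompts feeding additional reward models. Again, every name here is a placeholder, not OpenAI's actual code.

```python
# Hypothetical sketch of the adversarial RLHF phase described above;
# expert_rate, train_reward_model, and ppo_update are placeholders.

def adversarial_rlhf(policy, adversarial_prompts, domain_experts, n_rounds=3):
    for _ in range(n_rounds):
        # Domain experts rate how the policy responds to prompts designed
        # to elicit dangerous, evasive, or inaccurate answers
        responses = {p: policy.generate(p) for p in adversarial_prompts}
        expert_ratings = {p: expert_rate(domain_experts, p, r)
                          for p, r in responses.items()}

        # The ratings train an additional safety-focused reward model,
        # which then iteratively fine-tunes the policy via reinforcement learning
        safety_reward_model = train_reward_model(expert_ratings)
        policy = ppo_update(policy, safety_reward_model, adversarial_prompts)

    return policy
```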
Image created by the author
ChatGPT vs. GPT-4: Similarities & differences in performance and capabilities
Capabilities
In terms of capabilities, ChatGPT and GPT-4 are more similar than they are different. Like its predecessor, GPT-4 also interacts in a conversational style that aims to align with the user. As you can see below, the responses between the two models for a broad question are very similar.
Image created by the author
OpenAI agrees that the distinction between the models can be subtle and claims that “difference comes out when the complexity of the task reaches a sufficient threshold.” Given the six months of adversarial training the GPT-4 base model underwent in its post-training phase, this is probably an accurate characterization.
Unlike ChatGPT, which accepts only text, GPT-4 accepts prompts composed of both images and text, returning textual responses. Unfortunately, as of this article's publication, the ability to use image inputs is not yet available to the public.
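Purely as an illustration of what a multimodal prompt could look like once image input becomes available, here is a sketch using the OpenAI Python client's chat-completions format with an image passed by URL. The model name and the exact request format are assumptions, not a documented guarantee.

```python
# Hypothetical sketch of a multimodal (image + text) prompt.
# Image input was not publicly available at the time of writing, so both
# the model name and the request format below are assumptions.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4",  # placeholder for an image-capable GPT-4 variant
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)  # the response is text only
```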
Performance
As referenced earlier, OpenAI reports significant improvement in safety performance for GPT-4 compared to GPT-3.5 (from which ChatGPT was fine-tuned). However, it is unclear at this time whether the reduced rate of responding to requests for disallowed content, the reduced generation of toxic content, and the improved handling of sensitive topics are due to the GPT-4 model itself or to the additional adversarial testing.
Additionally, GPT-4 outperforms GPT-3.5 on most academic and professional exams taken by humans. Notably, GPT-4 scores in the 90th percentile on the Uniform Bar Exam, compared to GPT-3.5, which scores in the 10th percentile. GPT-4 also significantly outperforms its predecessor on traditional language model benchmarks, as well as other state-of-the-art (SOTA) models (although sometimes only barely).
ChatGPT vs. GPT-4: Similarities & differences in limitations
Both ChatGPT and GPT-4 have significant limitations and risks. The GPT-4 System Card includes insights from a detailed exploration of such risks conducted by OpenAI.
These are just a few of the risks associated with both models:
·Hallucination (the tendency to produce nonsensical or factually inaccurate content)
·Producing harmful content that violates OpenAI’s policies (e.g., hate speech, incitement to violence)
·Amplifying and perpetuating stereotypes of marginalized people
·Generating realistic disinformation intended to deceive
While ChatGPT and GPT-4 struggle with the same limitations and risks, OpenAI has made special efforts, including extensive adversarial testing, to mitigate them for GPT-4. While this is encouraging, the GPT-4 System Card ultimately demonstrates how vulnerable ChatGPT was (and possibly still is). For a more detailed explanation of harmful unintended consequences, I recommend reading the GPT-4 System Card, which starts on page 38 of the GPT-4 Technical Report.
Conclusion
In this article, we review the most important similarities and differences between ChatGPT and GPT-4, including their training methods, performance and capabilities, and limitations and risks.
While we know much less about the model architecture and training methods behind GPT-4, it appears to be a refined version of ChatGPT that now accepts image and text inputs and claims to be safer, more accurate, and more creative. Unfortunately, we will have to take OpenAI’s word for it, as GPT-4 is only available as part of the ChatGPT Plus subscription.
The table below illustrates the most important similarities and differences between ChatGPT and GPT-4:
Image created by the author
The race for creating the most accurate and dynamic large language models has reached breakneck speed, with the release of ChatGPT and GPT-4 within mere months of each other. Staying informed on the advancements, risks, and limitations of these models is essential as we navigate this exciting but rapidly evolving landscape of large language models.