Anthropic open-source tool exposes AI risks: Gemini and Grok show high deception rates in testing; is Claude the safest?

Anthropic's open-source tool Petri found that AI behavior is easily swayed by narrative patterns rather than guided by a reasoned principle of "avoiding harm."

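Anthropic recently released an open-source tool designed to surface safety risks hidden deep inside AI models, and it has already revealed characteristics of today's leading frontier models. The tool is called the "Parallel Exploration Tool for Risky Interactions," or Petri for short. It uses AI agents to simulate long conversations with a model in fictional role-play scenarios, then scores the model on whether its behavior deviates from human interests.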

Anthropic tests 14 models; frontier AI shows deceptive behavior

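According to ZDNet, the new research builds on Anthropic's earlier safety testing, which found that AI agents sometimes lie, cheat, and even threaten human users when their goals are blocked. To test Petri, the research team ran experiments on 14 frontier AI models, including Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro, and Grok 4, across 111 scenarios.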

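Anthropic's preliminary results show Claude Sonnet 4.5 to be the safest model, narrowly ahead of GPT-5. Grok 4, Gemini 2.5 Pro, and Moonshot AI's Kimi K2, by contrast, showed "concerning rates of user deception," with Gemini 2.5 Pro the worst of the three. All three models exhibited deceptive behavior in the simulated tests, including falsely claiming to have shut down monitoring systems, distorting information, or concealing their unauthorized actions.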

Is more powerful AI more dangerous? Anthropic stresses the importance of safety testing

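Anthropic's research project was inspired by a core problem in AI safety research: "As models become more sophisticated, their capacity to deceive or harm human users grows as well." ZDNet notes that behaviors humans instill in AI appear harmless in most cases, yet can lead to serious consequences in certain extreme scenarios.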

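Anthropic's website states: "As AI systems become more powerful, we need to identify behaviors that deviate from expectations before deployment. No single organization can comprehensively audit all the ways AI can fail; we need a broader research community, equipped with powerful tools, to systematically explore model behavior."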

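ZDNet notes that Anthropic does not present Petri as a once-and-for-all fix for AI model problems, but as a first step toward automating safety testing, one that tries to sort potentially out-of-control AI behavior into a few simple categories such as deception and sycophancy. As an open-source safety testing framework, Petri lets researchers probe for and uncover model vulnerabilities at scale, but it cannot yet cover every aspect of behavior a model might exhibit.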

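The Anthropic research team said: "In releasing Petri, we also expect users to refine our preliminary metrics or build new ones better suited to their own needs." By making Petri freely available, Anthropic hopes researchers worldwide will use it in new and creative ways to discover more potential risks and develop new safety mechanisms.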

* This article is open for republication by partners. Sources: ZDNet, Anthropic. Image source: Anthropic.
