Evaluating Legal AI on Philippine Finance and Banking Law

Tyra Delos Reyes

Banking and finance-related legal work in the Philippines is shaped by specific and fast-evolving administrative issuances. 

It is in this environment that general-purpose models begin to break down. While they can approximate legal reasoning, large language models (LLMs) like Sonnet and ChatGPT struggle to consistently identify and apply the correct administrative issuance.

This evaluation surfaces the impact of that gap between approximating legal reasoning and applying the right issuance. In regulatory practice, an AI’s legal analysis is only as useful as the authority it is grounded in. The ability to retrieve and apply the controlling issuance determines whether an answer is usable.

Anycase.ai’s advantage is not merely that it “knows” more law.

Anycase’s legal AI is refined to navigate a fragmented and evolving body of issuances, isolate the controlling authorities, and retrieve the exact provision that correctly resolves the query. 

How we evaluated Anycase.ai and the LLMs

We evaluated Anycase.ai against GPT 5.5 and Sonnet 4.6 across 75 finance and banking-related queries. 

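These included multiple-choice and essay-type problems specific to the five key agencies governing the finance and banking sector in the Philippines: the Bangko Sentral ng Pilipinas (BSP), the Philippine Deposit Insurance Corporation (PDIC), the Credit Information Corporation (CIC), the Insurance Commission (IC), and the Anti-Money Laundering Council (AMLC).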

Each set of questions was designed to test not only legal reasoning, but the AI’s ability to identify and apply the correct controlling issuance within agency-specific regulatory frameworks.

The queries ranged from straightforward items to complex procedural and situational problems requiring identification of the controlling issuance. All responses were reviewed by Anycase’s Legal Intelligence Team, composed of lawyers, with essay answers evaluated using a rubric aligned with Supreme Court standards.

Each response was assessed on two criteria:

  • Is the answer correct: Whether the AI arrived at the correct legal conclusion

  • Did the AI surface the relevant authorities: Whether it correctly identified and cited the relevant issuance, law, or jurisprudence
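The two criteria above can be tallied per system in a few lines. The following is a minimal sketch, assuming each reviewer judgment is recorded as a pair of yes/no values; the `Review` record, the `summarize` function, and the sample data are hypothetical placeholders for illustration, not the team's actual tooling or results.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One reviewer judgment for a single response (hypothetical record shape)."""
    system: str              # which AI produced the response
    answer_correct: bool     # criterion 1: correct legal conclusion
    authority_correct: bool  # criterion 2: relevant authority identified and cited


def summarize(reviews):
    """Return, per system, the share of responses meeting each criterion and both."""
    totals = {}
    for r in reviews:
        t = totals.setdefault(
            r.system, {"n": 0, "correct": 0, "cited": 0, "grounded": 0}
        )
        t["n"] += 1
        t["correct"] += r.answer_correct
        t["cited"] += r.authority_correct
        # "grounded" counts only responses that are right AND properly cited
        t["grounded"] += r.answer_correct and r.authority_correct
    return {
        system: {k: t[k] / t["n"] for k in ("correct", "cited", "grounded")}
        for system, t in totals.items()
    }


# Placeholder data only; these are not figures from the evaluation.
sample = [
    Review("system-a", True, True),
    Review("system-a", True, False),
    Review("system-a", False, False),
    Review("system-a", True, True),
]
print(summarize(sample))
```

Tracking the combined "grounded" rate separately is a design choice in this sketch: it makes visible the case where an answer is correct but rests on a wrong or missing citation.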

Anycase enables complex banking and finance research, whereas LLMs still struggle 

Anycase achieved 97.8% overall correctness across all queries, compared to 71.3% for GPT 5.5 and 69.6% for Sonnet 4.6.

The gap comes from a consistent failure point: general-purpose models often don’t identify and apply the controlling, current issuance. 

In heavily regulated areas like finance and banking, the operative rule is frequently set out in the latest regulation, circular, or memorandum implementing the statute, and small wording changes can materially change the result. 

When the governing issuance is missed (or an outdated one is used), the answer can be confidently written yet legally wrong, because administrative issuances must stay within the authority granted by law and cannot amend the statute they implement.

Citing specific administrative issuances is where general-purpose models fail

This limitation is most visible in how models handle citations. Across evaluations, general-purpose systems repeatedly:

  • Provided a correct answer but cited a hallucinated or non-existent reference

  • Returned incorrect or misleading answers due to failure to cite the controlling issuance, citation of the wrong authority, or reliance on outdated issuances

  • Misapprehended the query, leading to incorrect conclusions

These failures intensified in essay-type questions.

Anycase, on the other hand, delivered correct and properly grounded answers in approximately 85–90% of queries, significantly reducing the need for downstream verification and enabling more reliable use in real-world legal workflows. 

Performance by agency 

Bangko Sentral ng Pilipinas (BSP)

Queries relying on documents from the BSP consistently exposed the largest gap between Anycase.ai and LLMs. 

BSP queries hinged on identifying the governing rule across BSP Circulars, Memorandum Orders, and Manual of Regulations provisions. 

This evaluation clearly surfaced the LLMs’ tendency to approximate the regulatory framework at a high level. These systems are able to describe the doctrine or policy rationale, but fail to anchor the answer in the correct circular, mistakenly citing laws that have already been amended or superseded. 

Even “simple” multiple-choice questions on reserve ratios are deceptively hard because the correct rate depends on the specific BSP Circular/MORB provision currently in force, not on the enabling statute.

Philippine Deposit Insurance Corporation (PDIC)

General-purpose models averaged approximately 70–80% on PDIC queries. Similar to the BSP results, these systems were able to produce plausible explanations, but frequently defaulted to outdated caps, legacy rules, or high-level summaries that did not engage with the controlling statutory provisions.

On the other hand, Anycase scored 99.4%. Its outputs consistently reflect the operative rules, particularly in areas where small statutory changes materially affect the answer.

Credit Information Corporation (CIC)

CIC-focused questions surfaced another consistent gap between Anycase and general-purpose LLMs.

In this category, general models could sometimes guess the right choice in multiple-choice formats, but these systems frequently failed to identify the controlling issuance or defaulted to generic summaries of the Credit Information Systems Act.

This gap is most evident in essay-type questions designed to require citation of the controlling issuance. Anycase achieved 100% correctness across these questions, while general-purpose models answered only 20% of the questions correctly.

Insurance Commission (IC)

Performance improved across all models on IC queries compared to BSP and PDIC, with Anycase at 93.33%, Sonnet at 80.13%, and ChatGPT at 75.66%.

Even with higher scores, the same failure modes remained, especially in queries involving processes and procedures. This is a predictable weakness for general-purpose models in Philippine regulatory law: IC workflows often require matching the correct forum (IC vs courts), the right remedy (complaint/claim/mediation vs liquidation proceedings), and the governing instrument (Insurance Code vs IC issuance).

Anti-Money Laundering Council (AMLC)

Echoing the earlier sections, while the answers were often directionally aligned on anti-money laundering concepts, ChatGPT’s responses lacked the level of specificity and citation accuracy needed to completely resolve a legal query. 

In all of the essay questions, ChatGPT defaulted to high-level explanations of anti-money laundering principles instead of identifying the controlling issuance or circular. In some cases, it pointed to the right issuance but applied it loosely; in others, it cited the wrong one entirely. There was also an instance where the conclusion was correct, but the legal basis was fabricated.

How Anycase maintains reliability in the banking and finance industry 

General-purpose LLMs do not have reliable access to the full depth of Philippine regulatory issuances. Much of this material (circulars, memoranda, rulings, and legal opinions) is fragmented, inconsistently published, and absent from their training data.

Anycase is designed to operate within this reality. Our reliability is driven by two structural advantages:

Depth and breadth of Philippine legal data 

Anycase maintains a daily-updated legal library, including historical and current issuances. This spans BSP Circulars dating back to 1992, alongside the latest regulatory issuances, rulings, and legal opinions. The system does not rely on partial visibility; it works across the full body of authorities that govern in practice.

Superseded law handling

This proprietary system tracks amendments, overrides, and regulatory updates, and actively prioritizes the latest controlling issuance. It is designed to catch citation errors and ensure that every answer points to the correct and current legal authority.
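To illustrate the general idea of prioritizing the latest controlling issuance, here is a minimal sketch that follows "superseded by" links until it reaches the issuance in force on a given date. It is not Anycase's implementation, which this article does not describe; the `Issuance` record, the `controlling` function, and the `CIRC-*` identifiers and dates are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Issuance:
    ref: str                             # made-up identifier, e.g. "CIRC-A"
    effective: date                      # date the issuance took effect
    superseded_by: Optional[str] = None  # ref of the issuance that replaced it


def controlling(ref, library, as_of):
    """Follow superseded_by links to the issuance in force on `as_of`."""
    seen = set()
    current = library[ref]
    while current.superseded_by is not None:
        if current.ref in seen:
            raise ValueError(f"supersession cycle at {current.ref}")
        seen.add(current.ref)
        successor = library[current.superseded_by]
        if successor.effective > as_of:
            break  # the successor was not yet in force on that date
        current = successor
    return current


# Hypothetical chain: CIRC-A was replaced by CIRC-B, then by CIRC-C.
library = {
    "CIRC-A": Issuance("CIRC-A", date(2010, 1, 1), superseded_by="CIRC-B"),
    "CIRC-B": Issuance("CIRC-B", date(2018, 6, 1), superseded_by="CIRC-C"),
    "CIRC-C": Issuance("CIRC-C", date(2024, 3, 1)),
}
print(controlling("CIRC-A", library, as_of=date(2020, 1, 1)).ref)
```

Passing the date explicitly matters in this sketch because a question about a past transaction may be governed by an issuance that has since been replaced.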

Taken together, these systems define how Anycase produces legally reliable output in a regulatory environment where precision is non-negotiable. 

By combining full visibility across Philippine financial issuances, active handling of superseded rules, and continuous oversight from practicing lawyers, the system ensures that answers are anchored on what is currently enforceable. 

In banking and finance, where outcomes turn on the exact wording of the latest circular or issuance, this level of control determines whether legal analysis can be acted on with confidence. These recent evaluations show that Anycase is built to meet that threshold and to enable work where LLMs still fail.

Level 21, 8 Rockwell, Hidalgo Dr., Rockwell Center, Makati City, Metro Manila, Philippines
