Part 1

Question 1

The data used to predict crimes consists of historical crime data and other statistical data.

Crime data
- Police incident reports
- Arrest records

The data is collected by the police themselves.

The algorithmically generated “crime hotspots” tend to be places where minority groups live and work.

Question 2

Due to the nature of the data and how it’s collected, these algorithms are biased and help to reinforce existing biases.

Checks and incidents recorded in algorithmically flagged areas feed back into the crime data and distort it, since they are not grounded in reasonable suspicion. They can also influence further predictive analyses, leading to a feedback loop where the same areas and profiles are repeatedly targeted. This phenomenon is known in criminology as the Lüchow-Dannenberg syndrome: an increase in police presence in a location leads to an increase in statistically recorded offenses and crimes.
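The feedback loop can be illustrated with a toy simulation. This is only a sketch under stated assumptions: two districts with identical true offence levels, a hypothetical allocation rule that sends a fixed share of patrols to whichever district has the most recorded crime, and recorded offences that grow in proportion to patrol presence. None of these numbers come from the sources above.

```python
def simulate_feedback_loop(recorded, true_offences, periods, hotspot_share=0.8):
    """Toy model: patrols follow recorded crime, recorded crime follows patrols.

    recorded      -- initial recorded offences per district (hypothetical)
    true_offences -- actual offences per period per district (hypothetical)
    hotspot_share -- assumed share of patrols sent to the top "hotspot"
    """
    recorded = list(recorded)
    n = len(recorded)
    for _ in range(periods):
        # The district with the most recorded crime is flagged as the hotspot.
        hotspot = max(range(n), key=lambda i: recorded[i])
        rest_share = (1 - hotspot_share) / (n - 1)
        for i in range(n):
            share = hotspot_share if i == hotspot else rest_share
            # Only offences that patrols are present to observe get recorded.
            recorded[i] += share * true_offences[i]
    return recorded


# Both districts have the same true offence level; district 0 merely starts
# with slightly more recorded incidents.
result = simulate_feedback_loop(recorded=[12, 10], true_offences=[100, 100], periods=10)
share_district_0 = result[0] / sum(result)
print(result, round(share_district_0, 2))
```

Although the true offence levels are equal, the small initial imbalance decides where patrols go, and the recorded data then appears to confirm that choice.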

Profiling systems can widely cast the net of suspicion, linking innocent people to suspects simply through chance encounters or shared neighborhoods.
Real-world applications have shown massive failure rates. For instance, a manual check of an automated passenger screening system in Germany yielded an accuracy rate of just 0.3% (277 correct matches out of 94,000). Similarly, a 2023 analysis of Geolitica (formerly PredPol) in Plainfield, New Jersey, showed that less than 0.5% of its 23,631 crime predictions accurately matched reported crimes.
There is a lack of objective evidence demonstrating that these systems actually reduce crime rates.
This method may not be as adaptive.
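The two failure rates quoted above can be checked with simple arithmetic. The helper name below is an illustrative assumption; the figures are the ones given in the text.

```python
def hit_rate_percent(hits, total):
    """Share of flagged items that turned out to be correct, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


# German passenger screening: 277 correct matches out of 94,000.
screening_rate = hit_rate_percent(277, 94_000)

# Plainfield: "less than 0.5%" of 23,631 predictions is an upper bound
# on the number of predictions that matched a reported crime.
plainfield_upper_bound = 0.005 * 23_631

print(f"{screening_rate:.2f}% of screening matches were correct")
print(f"fewer than {plainfield_upper_bound:.0f} Plainfield predictions matched")
```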

Question 3

Yes.

Feedback loop, see question 2.
The Lüchow-Dannenberg syndrome: increased police presence in a predicted hotspot inevitably leads to more suspicionless stops, identity checks, and recorded minor offenses in that specific location.
This structural loop leads to high rates of harassment for residents. According to a 2023 EU report, 33% of respondents of African descent in Germany experienced arbitrary police checks within a five-year period. Concurrently, crimes in other areas may go unnoticed due to resources being artificially concentrated in the algorithm’s chosen hotspots.

Question 4

European Union (AI Act): Prohibits systems based solely on profiling natural persons. Permits it only if it supports human assessment. Categorizes many law enforcement AI uses as “high-risk.”
European Union (LED): Article 11 forbids making decisions exclusively based on automated processing; human assessment is a mandatory component.
Germany (Federal level): The Federal Constitutional Court ruled the automated use of Palantir platforms unconstitutional in some states as it violated the right to informational self-determination.
United States: Allowed and actively used in various jurisdictions (e.g., COMPAS algorithm for recidivism, Geolitica for hotspots), though heavily criticized for inherent racial bias.
European Union (GDPR): Article 23 allows the rules in the GDPR to be restricted under certain special circumstances.

Part 2

Automatic extraction of information from legal documents
Predict outcomes of court cases.
- An implementation of this idea may include optimizing the inclusion and exclusion of evidence to maximize the chance of a favorable outcome.
Used LLMs to extract 8 ‘key aspects’:
1. facts of the case
2. claims made
3. references to legal statutes
4. references to precedents
5. general case outcomes
6. corresponding outcome labels
7. detailed order and remedies
8. reasons for the decision
The accuracy and suitability for downstream use of the extracted data were manually labeled by a legal expert (supervised by a senior expert).
- Inter-annotator agreement?
GPT-4 performed very well at extracting information from UKET case transcripts
- 100% accuracy in extraction of references to statutes and precedents
- The accuracy score of 91% in extracting general outcome labels was the lowest score across all extracted types of information.
- How was accuracy measured?
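On the inter-annotator question: the notes do not say whether or how agreement was measured. One common statistic for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an assumption about how such a check could look, with made-up labels; it is not the paper's evaluation procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on whether ten extractions are accurate.
annotator_1 = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]
annotator_2 = ["ok", "ok", "ok", "bad", "ok", "ok", "ok", "ok", "ok", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this made-up example the annotators agree on 8 of 10 items, yet kappa is much lower than 0.8 because most of that agreement is expected by chance when one label dominates.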

Incredibly, GPT-based models achieved comparable results to the transformer models which were specially designed and fine-tuned on our curated data.

GPT-based models are transformer models. The comparison could perhaps have been framed as ‘fine-tuned versus non-fine-tuned’, or as ‘pre-trained on a domain-specific dataset versus pre-trained on a general dataset’.

Overall, our results show that there is a gap between machine and human performance, with human expert predictions outperforming all baseline models across evaluation metrics, highlighting the superiority of human judgement in this domain.

While LLM based approaches do well at data extraction, they perform poorly at classification of dispute outcomes.

What is very interesting is that both models and expert annotators demonstrate a high recall and a relatively low precision when predicting “claimant wins” and in contrast a high precision and a relatively low recall when predicting the “claimant loses” label.

What’s the label distribution in the train/test sets? What’s the majority label?
From the paper: BERT and GPT-4 score the same SOTA accuracy of 0.623.
- T5 scores highest recall, accuracy and F-score.
- (bi-directional) Encoder-decoder models perform well.
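The label-distribution question matters because accuracy is only meaningful relative to a majority-class baseline, and because the precision/recall asymmetry described above shows up per label. The sketch below uses an invented test set of 100 cases (the real distribution is not given in these notes) to show how both can be computed.

```python
from collections import Counter


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, from parallel label lists."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical data: 40 "wins" and 60 "loses"; the model over-predicts "wins".
y_true = ["wins"] * 40 + ["loses"] * 60
y_pred = ["wins"] * 36 + ["loses"] * 4 + ["wins"] * 30 + ["loses"] * 30

wins_precision, wins_recall = precision_recall(y_true, y_pred, "wins")
loses_precision, loses_recall = precision_recall(y_true, y_pred, "loses")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline = majority_baseline_accuracy(y_true)
print(wins_precision, wins_recall, loses_precision, loses_recall, accuracy, baseline)
```

With these invented numbers the model shows high recall and lower precision for "wins", the reverse for "loses", and an accuracy only slightly above the majority baseline, which is why the reported 0.623 is hard to interpret without the label distribution.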

There are significant limitations on what we do. For example, we have extracted information from past decisions from a particular set of data divided into neat paragraphs, but in real life, lawyers and parties have different types of information such as emails, contracts, memories of events and sometimes thousands of files.

Perhaps there are preprocessing steps that can be automated.

For example, there is the possibility that judges have presented the facts in a way that contains a hint of what they decide, which could influence a model’s prediction about the outcome.

For every source of data, the (temporal) availability should be taken into account.
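One way to take temporal availability into account is to filter inputs by the date on which they became available, relative to the date the prediction is made. This is a minimal sketch; the `Source` type, the field names, and the example dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    name: str
    available_from: date  # first date on which this material existed


def usable_sources(sources, prediction_date):
    """Keep only material that already existed when the prediction is made."""
    return [s for s in sources if s.available_from <= prediction_date]


# Hypothetical case file: the judgment text exists only after the decision,
# so it must not feed a prediction made before the hearing.
case_file = [
    Source("client emails", date(2022, 11, 2)),
    Source("claim form", date(2023, 1, 10)),
    Source("judgment text", date(2024, 3, 15)),
]
usable = usable_sources(case_file, prediction_date=date(2023, 2, 1))
print([s.name for s in usable])
```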

Question 1

Benefits of CasePredict:

Speed: As noted in the article, LLMs can extract information from thousands of cases (like the 52,000 in the UKET dataset) at a pace unattainable by humans. This allows for a more comprehensive “bird’s-eye view” of judicial trends.
Cost: Automating data extraction reduces the billable hours required for manual legal research.
AI can identify statistical correlations—such as a specific court’s tendency to interpret clauses narrowly—that might be too subtle for a human lawyer to detect across a massive volume of disparate judgments.

Question 2

Information Leakage and Hindsight Bias: Court judgments are written after the decision is made, so the judge may frame the facts to justify the outcome. If CasePredict trains on these “framed” facts, it isn’t predicting the future; it’s just recognizing the judge’s already-formed conclusion.
Data Stagnation/Feedback loop: If every law firm uses AI that says “this claim usually fails,” new and valid legal arguments may never be brought to court. This prevents the law from evolving through new precedents.
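Skewed predictions: The article reports that models show high recall but relatively low precision for “claimant wins”, and high precision but relatively low recall for “claimant loses”. If CasePredict predicts a 22% success rate, that figure might reflect a skew inherent in the training data rather than the merits of the case.
Messy inputs: Real-world legal data is messy. Lawyers and parties work with emails, contracts, memories of events and sometimes thousands of files, not judgments divided into neat paragraphs.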

Question 3

A lawyer must understand the tools they use. Relying on a 22% probability without understanding the model’s limitations (like the 0.623 accuracy reported in the paper) could be seen as a failure of technical competence.
A lawyer’s primary duty is to provide their own professional opinion. If a lawyer treats the AI’s output as an absolute “truth” rather than one piece of evidence, they have effectively delegated their professional judgment to an algorithm. Over-reliance on a statistical “no” may lead a lawyer to skip the creative legal thinking required to distinguish a client’s case from past failed precedents.

Question 4

Yes, lawyers should disclose the use of AI systems.

Clients have the right to know how the “probability of success” was calculated. If they knew it was based on a model with a 9% error rate in outcome labeling (as seen in the GPT-4 study), they might choose to proceed despite the low score.
Disclosure allows the lawyer to explain the limits of the tool—clarifying that CasePredict offers a statistical correlation, not a guaranteed legal reality.
Cost cuts that legal firms make should translate to better consumer pricing. A consumer should know what they’re getting for their money.
The relationship is built on trust. If a client discovers later that their multi-million-dollar decision was based on a “black box” algorithm they weren’t told about, it undermines the firm’s credibility.

Question 5

Human-in-the-Loop: As the article states, “human expert predictions outperform all baseline models.” AI output should always be reviewed and “sanity-checked” by a senior legal expert.
Systems must be tested against “raw” evidence (emails/contracts) rather than just final judgments to minimize the “judge’s framing” bias. Prevent data leakage.
Regular Audits for Bias: Law firms should periodically check if the AI is biased against certain types of claimants or jurisdictions, ensuring it doesn’t just reinforce historical judicial prejudices.
Probability Ranges, Not Scores: Instead of a single number (22%), the AI could provide
- A confidence interval
- A list of reasons why the case could still win, to encourage balanced legal strategy.
Strict Data Privacy: Ensuring that the client’s sensitive case details are not used to train the global model, which could lead to confidential “leakage” to competitors.
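
The suggestion above to report a range instead of a single score can be sketched with a Wilson score interval. This assumes, purely for illustration, that the 22% figure is an observed win frequency among a known number of comparable past cases; how CasePredict actually derives its score is not stated.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical: 22 wins among 100 comparable past cases.
low, high = wilson_interval(22, 100)
print(f"estimated success rate: 22% (range {low:.0%} to {high:.0%})")

# The same 22% backed by 1,000 cases gives a much narrower range.
low_large, high_large = wilson_interval(220, 1000)
```

Showing the range makes it visible that the same headline score can rest on very different amounts of evidence.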