Teaching Blar to Give Better Code Reviews: How We Built Our Text-Based Recommender

Apr 30, 2025


The code, idea, and concepts were all developed by Juan Vargas, one of our engineers at Blar.
Check out his:

I’m just the greedy CEO who wrote the article and stole the credit 🤑

The context

Blar is an AI agent that helps developers review their pull requests. Our long-term goal is to make Blar as smart and capable as a senior engineer within your organisation. But to become a senior, it has to learn not just from its wins but, most importantly, from its mistakes.

One of the best ways to learn is through feedback — positive and negative interactions. That’s how we humans learn¹: by having a reward system that makes us feel good when we do something right, and not so good when we mess up. So, how can we build something similar for our Agent?

Recommender Systems

Recommender systems are everywhere — all around us. From video suggestions on YouTube and Netflix to product recommendations while shopping online, there’s always a recommender system working behind the scenes. These systems learn from your interactions and adapt to give you the best possible experience.

Recommender systems might feel like magic, but they’re really science… data science. They are built by observing what you like and what you ignore, then using that data to predict what you might want next.

Let’s dive into the basics of how recommender systems work:
At their core, most recommender systems rely on two key approaches:

  • Collaborative filtering: This method learns from the behavior of users. If users similar to you liked something, chances are you might like it too. It’s like when your friends recommend a movie because you all have similar taste.

  • Content-based filtering: Here, the system looks at the properties of the items themselves. If you liked a sci-fi movie, it might suggest another sci-fi movie — because it knows the genre, director, or actors match your previous favorites.


At the end of the day, a recommender system is just trying to answer one question: What’s the next best thing I can show you that you’ll love? (or activate your neurons)
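To make the collaborative filtering idea concrete, here's a toy sketch (the ratings matrix is completely made up): find the user whose tastes look most like yours and recommend what they liked but you haven't seen.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix (rows = users, columns = movies); 0 means "not seen"
ratings = np.array([
    [5, 4, 0, 1],  # you
    [5, 5, 4, 0],  # user A
    [1, 0, 2, 5],  # user B
])

# Compare "you" (first row) against the other users
similarity = cosine_similarity(ratings[:1], ratings[1:])[0]
most_similar = 1 + int(np.argmax(similarity))

# Recommend items the most similar user rated highly that you haven't seen
unseen = ratings[0] == 0
picks = np.where(unseen & (ratings[most_similar] >= 4))[0]
print(f"Most similar user: {most_similar}, recommended item index(es): {picks}")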

What does any of this have to do with Pull Request comments?

At first glance, it might feel like pull request comments and Netflix recommendations live in two completely different universes. But under the hood, it’s the same core idea: learning from interactions to make better suggestions.

Every time Blar leaves a comment on a pull request — whether it’s spotting a bug, suggesting a refactor, or flagging a code smell — we ask for simple feedback: just a thumbs up (👍) or a thumbs down (👎).

This tiny signal tells us a lot. A thumbs up means we were helpful (think dopamine). A thumbs down means we missed the mark and the user didn’t like our comment (think of it like a small pain signal).

But here’s the challenge: pull request comments come in all shapes and forms. How do we know what type of comment the user didn’t like? Was it a suggestion about a bug? About null handling? About user validation?

This was the key question that motivated us: are there specific types of comments that users consistently prefer — or dislike?

After talking to some of our users, a pattern started to emerge. They often said things like: “We don’t like when you flag null values,” or “Input validation issues aren’t really helpful for us.”

Understanding these preferences was the first step toward making Blar smarter — and more aligned with how your team thinks about code quality.

How to classify “types” of comments

Saying that our users don’t like “null values” comments is one thing — but how do we actually find which comments are about “null values”?

One simple alternative would be to just search for keywords like “null” or “value” inside the comment. But a much better way is to use embeddings (we go into more detail in this article).

Embeddings help us capture semantic similarities between texts. That means two comments that are roughly about the same topic should end up close together, i.e. they’ll have a small cosine distance.

To generate embeddings, we just use a pre-trained model; in this case, OpenAI’s text-embedding-3-small.

from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

# Get list of input texts (assumes 'issues' DataFrame exists)
input_texts = issues['text'].dropna().tolist()

# Generate embeddings
all_embeddings = [
    client.embeddings.create(input=[text.replace("\n", " ")], model="text-embedding-3-small").data[0].embedding
    for text in input_texts
]

# Convert to numpy and compute similarity scores (first vs rest)
all_embeddings = np.array(all_embeddings)
scores = cosine_similarity(all_embeddings[:1], all_embeddings[1:])[0] * 100
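One request per comment works, but it gets slow for large sets of comments. The embeddings endpoint also accepts a list of inputs, so a batched version (the batch size of 100 below is an arbitrary choice; keep an eye on per-request token limits) could look like this:

# Batch the texts instead of sending one request per comment
batch_size = 100
all_embeddings = []
for i in range(0, len(input_texts), batch_size):
    batch = [t.replace("\n", " ") for t in input_texts[i:i + batch_size]]
    response = client.embeddings.create(input=batch, model="text-embedding-3-small")
    # The response preserves input order, so we can extend directly
    all_embeddings.extend(item.embedding for item in response.data)

all_embeddings = np.array(all_embeddings)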

Awesome. Now we have a way to understand if two comments are talking about the same thing.
But that leads us to a second question: what’s a suitable distance to say these two comments are of the same type?

For this, we turned to clustering.

Clustering

Clustering is the idea of grouping similar things together without knowing how many groups there are or what they should look like. In our case, we wanted to group similar comments and discover the “natural topics” that Blar was commenting on.

There are many different methods of clustering, but for this example, we’ll use k-means, a classic and simple approach.

from sklearn.cluster import KMeans

# Fit k-means on the embeddings and assign each comment to a cluster
kmeans = KMeans(n_clusters=11, random_state=42)
kmeans.fit(all_embeddings)
cluster_labels = kmeans.predict(all_embeddings)

And just like that, we generated our clusters (the magic of sklearn, everyone ✨).
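To put human-readable names on the clusters, it helps to eyeball a few comments from each one. A quick sketch, reusing input_texts and the cluster labels from above:

import pandas as pd

# Pair each comment with its cluster label so we can skim what each cluster is about
clustered = pd.DataFrame({"text": input_texts, "cluster": cluster_labels})

for cluster_id, group in clustered.groupby("cluster"):
    print(f"--- Cluster {cluster_id} ({len(group)} comments) ---")
    for text in group["text"].head(3):
        print(f"  {text[:80]}")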

To visualise these clusters, we can use techniques like PCA (Principal Component Analysis), a dimensionality reduction method that helps us squash high-dimensional data (in our case, 1,536-dimensional embeddings) down into something we can actually plot (like 2D or 3D).
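A minimal sketch of that visualisation, projecting the embeddings to 2D and colouring each point by its cluster:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the high-dimensional embeddings down to 2D for plotting
pca_2d = PCA(n_components=2, random_state=42)
points_2d = pca_2d.fit_transform(all_embeddings)

plt.figure(figsize=(8, 6))
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=cluster_labels, cmap="tab20", s=10)
plt.title("Comment clusters (PCA projection)")
plt.show()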

We’re getting there.
But if you look closely at the code, you’ll notice we chose n_clusters=11².
Why 11 and not 10 or 12?

We used a technique called the “elbow method” to figure this out.
The idea is simple: you plot the number of clusters against the “error” (sum of squared distances), and pick the point where adding more clusters stops giving you a big improvement, where the curve bends like an elbow.

Here’s a rough version of the code we used:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Reduce dimensions using PCA
pca = PCA(n_components=45, random_state=42)  # Adjust the number of components as needed
reduced_embeddings = pca.fit_transform(all_embeddings)

# Calculate the sum of squared distances for different cluster numbers
sse = []
cluster_range = range(1, 30)
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(reduced_embeddings)
    sse.append(kmeans.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(cluster_range, sse, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Distances (SSE)')
plt.title('Elbow Method for Optimal Cluster Number')
plt.show()

Recommendation

Now that we have our clusters, we can finally generate real recommendations from them.

For this, we used collaborative filtering: based on what users from your company like (👍) or dislike (👎), we prioritize the types of issues that land in each cluster.

When we analyze feedback across clusters, we find that some, like Cluster 5, drive a lot of engagement, positive or negative. Others, like Cluster 7, tend to receive mostly dislikes or no interaction at all.

To decide which clusters to recommend from, we apply a simple rule: for each cluster, we compute a smoothed like ratio using the formula:

(likes + 20) / (total + 10)

This smoothing prevents a small number of votes from skewing the outcome too much. If the ratio is greater than 0.6, we recommend content from that cluster. If it falls below, we still recommend it 20% of the time to continue exploring potentially overlooked clusters.
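For example, a cluster with 5 likes out of 50 total interactions scores (5 + 20) / (50 + 10) ≈ 0.42, so it only resurfaces through the 20% exploration, while a cluster with 40 likes out of 50 scores (40 + 20) / (50 + 10) = 1.0 and keeps getting recommended.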

This approach helps us prioritize high-quality clusters while still learning from the under-engaged ones.

A simple implementation looks like this:

def generate_recommendation(self, text):
    """Generates a Recommendation object based on the input text.

    Args:
        text (str): The text that will be used to generate the recommendation.

    Returns:
        Recommendation: A dataclass containing the cluster and recommendation status.
    """
    processed_text = self.__preprocess_text(text)
    embeddings = self.__embed_text(processed_text)
    cluster = self.kmeans.predict(embeddings)[0]

    return Recommendation(cluster=cluster, recommended=self.__should_recommend(cluster))

And the __should_recommend function applies a simple proportion-based rule:

def __should_recommend(self, cluster):
    liked_disliked = get_disliked_and_total_issues(company=self.company, recommendation_cluster=cluster)

    if liked_disliked.total == 0:
        return True

    # We randomly recommend 20% of the time so that we can explore new clusters
    if random() < 0.2:
        return True

    # If the probability of liking the recommendation is greater than 0.6, we recommend it
    # We add 20 to the thumbs up and 10 to the total to smooth the calculation
    return (liked_disliked.thumbs_up + 20) / (liked_disliked.total + 10) > 0.6

That’s it! You now have your own recommender system based on text.
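For context, the Recommendation dataclass and the feedback lookup used above can be as simple as the sketch below (field names beyond the ones in the snippets are assumptions; in production the lookup is backed by our feedback store):

from dataclasses import dataclass


@dataclass
class Recommendation:
    """What we return for each incoming comment."""
    cluster: int
    recommended: bool


@dataclass
class LikedDisliked:
    """Aggregated feedback for one company and one cluster."""
    thumbs_up: int
    total: int


def get_disliked_and_total_issues(company, recommendation_cluster):
    # Stand-in: in production this aggregates 👍/👎 feedback from the database
    return LikedDisliked(thumbs_up=0, total=0)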

Warnings ⚠️

When dealing with embeddings — and data in general — you have to make sure you’re capturing the right information.

To explain what I mean:
Our first approach to embedding the issues (without any sanitization) gave us this distribution across companies:

As you can see, the distribution was far from homogeneous.
For example, Cluster 10 was clearly dominated by a single company (light blue), instead of being about a real topic like “null values”.

Why did this happen?
When Blar suggests an improvement, it often includes:

  • Code snippets

  • Variable names

  • Business logic

These elements are highly specific to each company.
So when we embedded the issue, we weren’t capturing the topic — we were accidentally capturing company-specific fingerprints.
That’s not what we wanted.

🛠️ How we fixed it

To solve this, we sanitized the data:

  • We replaced any code snippets with a <code> token.

  • We replaced variables with a <variable> token.

This way, the embeddings would focus only on the real semantic meaning of the comments, not the irrelevant, company-specific details.
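As a rough illustration, a sanitizer along these lines (the regex patterns below are simplified assumptions, not our exact production rules) could look like this:

import re

def sanitize_comment(text):
    """Strip company-specific details from a comment before embedding it."""
    # Fenced code blocks become a generic <code> token
    text = re.sub(r"```.*?```", "<code>", text, flags=re.DOTALL)
    # Backtick-quoted identifiers become a generic <variable> token
    text = re.sub(r"`[^`]+`", "<variable>", text)
    return text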

After sanitizing, we re-ran the clustering, and now the distributions looked like this:

Much better — now the clusters are far more balanced across companies, meaning we’re finally capturing true topics instead of leaking company identity into the embeddings.
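A quick way to catch this kind of leakage is to cross-tabulate cluster labels against companies (this assumes the issues DataFrame has a company column aligned with the comments we embedded):

import pandas as pd

# One row per embedded comment, with its cluster label and the company it came from
clustered_issues = issues.dropna(subset=["text"]).copy()
clustered_issues["cluster"] = cluster_labels

# Rows = clusters, columns = companies, values = each company's share of the cluster
leakage_check = pd.crosstab(
    clustered_issues["cluster"], clustered_issues["company"], normalize="index"
)
print(leakage_check.round(2))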

  1. Extremely oversimplified and I’m not in any way whatsoever qualified to give a more profound answer.

  2. You’ll also notice a random_state=42, but that’s because 42 is the answer to life, the universe and everything.

We help teams regain control of their codebase, ship faster, and stay ahead of technical challenges. If technical debt is slowing you down, let’s talk.

Thanks for reading!