Semantic Search Engine for ResearchMath-14k Dataset

Introduction: Why Research-Level Mathematics Needs Smarter NLP Tools

Mathematics is one of the most structured yet semantically rich domains in human knowledge. With thousands of open problems, conjectures, and theorems spanning fields like algebraic geometry, number theory, and combinatorics, navigating this landscape manually is nearly impossible. That’s where natural language processing (NLP) and modern machine learning pipelines come in.

The ResearchMath-14k dataset provides a curated collection of over 14,000 research-level mathematics problems, complete with metadata about their open or closed status. In this blog post, we walk through a complete NLP pipeline that extracts meaning from this dataset — from keyword extraction all the way to a semantic search engine and a trained classifier.

What Is the ResearchMath-14k Dataset?

The ResearchMath-14k dataset is a benchmark collection of research-level mathematical problems sourced from various academic repositories. Each entry typically includes:

A problem statement written in natural language (and sometimes LaTeX)
A mathematical field or subfield label
An open status indicating whether the problem remains unsolved or has been resolved

This rich metadata makes the dataset ideal for training classifiers, building search engines, and exploring the structural landscape of mathematical research.

Step 1: Keyword Extraction with TF-IDF

A natural first step in an NLP pipeline is understanding what makes each document distinctive. For mathematical text, this is especially important because domain-specific terminology, such as “Riemann hypothesis,” “cohomology,” or “Diophantine equations,” carries enormous weight.

We apply Term Frequency-Inverse Document Frequency (TF-IDF) to extract field-specific keywords from each problem statement. TF-IDF rewards terms that appear frequently in a specific document but rarely across the entire corpus, making it ideal for surfacing mathematically meaningful vocabulary.

By grouping TF-IDF scores by mathematical field, we can identify which terms are most diagnostic for areas like topology versus number theory, helping us build a cleaner representation of each problem’s content.
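
As a rough illustration of the per-field grouping, here is a minimal, dependency-free sketch. The function names (`tokenize`, `tfidf_scores`, `top_terms_by_field`) and the plain `tf * log(N / df)` weighting are assumptions made for this example, not the exact setup used on the dataset. In practice a library implementation such as scikit-learn's `TfidfVectorizer` (which applies a smoothed variant of the IDF term by default) would likely be the more convenient choice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic runs; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(documents):
    """Return one {term: score} dict per document, with score = tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty statements
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms_by_field(documents, fields, k=5):
    """Average each term's score over the documents of a field; keep the top k."""
    sums = defaultdict(lambda: defaultdict(float))
    field_sizes = Counter(fields)
    for field, doc_scores in zip(fields, tfidf_scores(documents)):
        for term, score in doc_scores.items():
            sums[field][term] += score
    top = {}
    for field, term_sums in sums.items():
        # Sort by descending average score, breaking ties alphabetically.
        ranked = sorted(term_sums, key=lambda t: (-term_sums[t] / field_sizes[field], t))
        top[field] = ranked[:k]
    return top
```

Calling `top_terms_by_field(statements, field_labels)` with parallel lists of problem statements and field labels returns a dictionary mapping each field to its highest-scoring terms. A term that occurs in every document gets a score of zero, which is the property that pushes generic words out of the rankings.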

Step 2: Generating Sentence Embeddings

While TF-IDF captures lexical importance, it misses semantic relationships. Two problems about “prime gaps” and “gaps between consecutive primes” share few exact tokens, so they look quite different under TF-IDF even though they are semantically nearly identical.

To address this, we generate sentence embeddings using pre-trained transformer models (such as Sentence-BERT or similar). These embeddings map each problem statement into a dense vector space where semantic similarity is reflected by geometric proximity. This is the backbone of our semantic search engine.
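
A minimal sketch of this step might look like the following. It assumes the `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint, neither of which is named in this post, so treat both as placeholders for whatever model is actually used. The `encoder` parameter is a hypothetical convenience so that any object with a compatible `encode` method can be swapped in.

```python
def embed_problems(statements, encoder=None, model_name="all-MiniLM-L6-v2"):
    """Embed problem statements as unit-length vectors (lists of floats).

    `encoder` is any object exposing `encode(texts, normalize_embeddings=True)`.
    When omitted, a Sentence-Transformers model is loaded, which requires the
    `sentence-transformers` package and a one-time model download.
    """
    if encoder is None:
        # Imported lazily so the function can be used with a custom encoder
        # even when sentence-transformers is not installed.
        from sentence_transformers import SentenceTransformer
        encoder = SentenceTransformer(model_name)
    # normalize_embeddings=True scales each vector to unit length, so a plain
    # dot product between two embeddings equals their cosine similarity.
    vectors = encoder.encode(list(statements), normalize_embeddings=True)
    return [[float(x) for x in vector] for vector in vectors]
```

Embedding the whole dataset once and caching the result is usually worthwhile, since every later step (visualization, clustering, search, classification, duplicate detection) reuses the same vectors.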

Step 3: Visualizing the Problem Landscape with UMAP

With embeddings in hand, we use UMAP (Uniform Manifold Approximation and Projection) to reduce the high-dimensional embedding space to two dimensions for visualization. UMAP is generally faster than t-SNE on large datasets and is often reported to retain more of the global structure, which makes it a good fit for exploring a collection of this size. As with any 2D projection, distances in the plot should be read qualitatively.
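
The projection itself can be sketched in a few lines, assuming the `umap-learn` package. The cosine metric and the fixed `random_state` are choices made for this example rather than settings taken from the original pipeline, and the `reducer` parameter is a hypothetical hook for substituting another object with a `fit_transform` method.

```python
def project_to_2d(embeddings, reducer=None, random_state=42):
    """Reduce embeddings to (x, y) pairs for plotting.

    `reducer` is any object with a `fit_transform` method. When omitted,
    a UMAP model from the umap-learn package is created.
    """
    if reducer is None:
        import umap  # provided by the umap-learn package
        reducer = umap.UMAP(
            n_components=2,      # two dimensions for a scatter plot
            metric="cosine",     # matches how the embeddings are compared later
            random_state=random_state,  # fixed seed for a repeatable layout
        )
    coords = reducer.fit_transform(embeddings)
    return [(float(x), float(y)) for x, y in coords]
```

The returned coordinates can be passed to any plotting library, with points colored by field label to make subfield clusters visible.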

The resulting UMAP plot reveals fascinating structure:

Tight clusters of problems from the same mathematical subfield
Bridge regions where interdisciplinary problems sit between two fields
Isolated outliers that represent highly unique or novel problem formulations

This visualization alone provides valuable insight into how mathematical knowledge is organized and where conceptual overlaps exist.

Step 4: Clustering with K-Means

To formalize the structure seen in the UMAP visualization, we apply K-Means clustering to the raw sentence embeddings (not the UMAP projections, to avoid dimensionality reduction artifacts). We tune the number of clusters using silhouette scores and elbow analysis.

Each resulting cluster can be inspected to understand which types of problems group together. Interestingly, K-Means often discovers thematic groupings that cut across traditional field labels — for example, grouping problems about extremal graph theory with certain combinatorial optimization problems, regardless of how they were originally tagged.
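
To make the tuning loop concrete, here is a small pure-Python sketch of K-Means (Lloyd's algorithm) together with a silhouette score. It is written for readability on toy data and is not the implementation used on the full dataset, where a library version (for example scikit-learn's `KMeans` and `silhouette_score`) would be far faster. The function names and defaults are assumptions for this example.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treat as uninformative
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items() if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates):
    """Pick the candidate k whose clustering has the highest silhouette score."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))
```

Here `points` is a list of embedding vectors. Running `best_k(points, range(2, 30))` would scan a range of cluster counts, and the elbow analysis mentioned above can serve as a second opinion on the same range.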

Step 5: Building the Semantic Search Engine

The semantic search engine is built on top of the sentence embeddings using cosine similarity as the distance metric. The workflow is straightforward:

A user submits a natural language query (e.g., “unsolved problems about prime distribution”)
The query is embedded using the same transformer model
Cosine similarity is computed between the query embedding and all problem embeddings in the dataset
The top-k most similar problems are returned, ranked by similarity score

This approach is well suited to mathematical text, where the same concept can be expressed in dozens of different ways and keyword-based search misses many of them. The semantic engine handles paraphrases and synonyms, and it can also handle cross-lingual queries when the underlying embedding model is multilingual.
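
The retrieval workflow can be sketched as below. The sketch assumes the query and the problems have already been embedded with the same model, and the names `cosine_similarity` and `search` are chosen for this example. It uses plain Python lists for clarity; at the scale of this dataset, a single matrix-vector product over a NumPy array of normalized embeddings would do the same job much faster.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, problem_vectors, problems, top_k=5):
    """Return the top_k (score, problem) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vector, vector), problem)
        for vector, problem in zip(problem_vectors, problems)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Returning the scores alongside the problems makes it easy to apply a minimum-similarity cutoff, so that weak matches are not shown when nothing in the dataset is close to the query.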

Step 6: Training an Open-Status Classifier

One of the most practically useful components of this pipeline is the open-status classifier — a model that predicts whether a given mathematical problem is currently open (unsolved) or closed (resolved).

We train this classifier using the sentence embeddings as input features and the binary open/closed label as the target. Several classifiers were tested:

Logistic Regression — strong baseline with interpretable coefficients
Random Forest — captures non-linear patterns in the embedding space
Gradient Boosting (XGBoost) — best overall performance on validation data

The classifier achieves strong accuracy on validation data, which suggests that the language used to describe open problems differs systematically from that used for closed ones. Open problems tend to use more speculative language (“it is conjectured,” “remains unknown”), while closed problems use more definitive framing (“it was proved,” “the solution follows from”).
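
For the simplest of the three models, the logistic regression baseline, training on embedding features can be sketched with batch gradient descent in pure Python. This is an illustrative stand-in, not the tuned model behind the results above. The learning rate, epoch count, and function names are assumptions, and labels are encoded as 1 for open and 0 for closed.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Gradient of the log-loss w.r.t. the logit is (prediction - label).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_open(x, weights, bias, threshold=0.5):
    """True if the predicted probability of 'open' reaches the threshold."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold
```

Any accuracy figure should be measured on problems held out from training. The learned weights are the interpretable coefficients mentioned for this baseline, although individual embedding dimensions rarely map to a human-readable concept.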

Step 7: Surfacing Near-Duplicate Problems

A final and highly practical use case is near-duplicate detection. Mathematical literature often contains the same problem stated differently across different papers or textbooks. Using the cosine similarity scores from our embedding space, we can flag pairs of problems that are highly similar — a score above 0.92, for instance — as potential duplicates.
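
A brute-force version of this check is sketched below, using the 0.92 threshold from the text as the default. The function names are assumptions for this example. Because it compares every pair, the cost grows quadratically with the number of problems (on the order of 10^8 comparisons for this dataset), so a vectorized or approximate nearest-neighbor approach would be preferable at full scale.

```python
import math
from itertools import combinations


def normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def near_duplicates(vectors, threshold=0.92):
    """Return (i, j, score) for every pair whose cosine similarity exceeds threshold."""
    unit = [normalize(v) for v in vectors]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):
        # For unit vectors the dot product is the cosine similarity.
        score = sum(x * y for x, y in zip(unit[i], unit[j]))
        if score > threshold:
            pairs.append((i, j, score))
    # Most similar pairs first, so reviewers see the likeliest duplicates early.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

The flagged pairs are candidates only. A high score can also indicate two closely related but distinct problems, so a manual review step is still needed before merging or removing entries.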

This is invaluable for dataset curators, journal editors, and researchers who want to avoid redundant work or identify when a “new” problem has already been studied under a different name.

Key Takeaways and Applications

This complete NLP pipeline over the ResearchMath-14k dataset demonstrates that modern language models and classical machine learning techniques can work together powerfully in specialized scientific domains. Here’s a summary of what we achieved:

TF-IDF keyword extraction to identify field-specific vocabulary
Sentence embeddings for rich semantic representations
UMAP visualization to explore the structure of mathematical research
K-Means clustering to discover thematic groupings
Semantic search for intuitive, meaning-based retrieval
Open-status classification to predict whether problems are solved
Near-duplicate detection to surface redundant or related problems

Conclusion

Applying natural language processing (NLP) to research-level mathematics is an exciting and rapidly expanding area. The pipeline described here, from TF-IDF keywords through embeddings, clustering, semantic search, open-status classification, and near-duplicate detection, shows how standard techniques can make a large collection of problems such as ResearchMath-14k easier to explore, search, and curate.