A Hybrid Framework for Programming Language-Agnostic Mining of Natural Language-Programming Language Pairs Using Classical Machine Learning, Rule-Based Methods, AST Parsing, and Deep Embeddings

Authors

  • M. Habila
  • Y.M. Malgwi
  • K.B. Ishoala
  • M.K. Tahir

Abstract

The growing intersection of software engineering and machine learning has underscored the importance of mining aligned natural language-programming language (NL-PL) pairs for tasks such as code search, summarization, and recommendation. Existing approaches, however, are often constrained by language specificity, reliance on resource-intensive deep learning models, or limited quality control in dataset construction. This study proposes a hybrid framework for programming language-agnostic NL-PL mining that integrates classical machine learning (TF-IDF with Naïve Bayes), rule-based language identification, abstract syntax tree (AST) parsing, deep embeddings (CodeBERT), and expert human annotation. The framework was evaluated on curated datasets spanning five programming languages (Python, Java, JavaScript, C++, and PHP), achieving an accuracy of 89% and an F1-score of 0.87 with the Naïve Bayes classifier, and an F1-score of 0.89 with CodeBERT embeddings. The results indicate that lightweight methods, when combined with symbolic and structural analysis, can perform competitively with large-scale transformer-based models while remaining computationally efficient and interpretable. Comparative analysis with state-of-the-art works such as SLQA and CodeSearchNet highlights the novelty of this framework in emphasizing practicality, interpretability, and human-in-the-loop quality assurance. These contributions position the work as both a research advancement and a pathway to real-world applications in developer tools, including intelligent code search engines and IDE assistants. Future work will explore scaling the framework to larger datasets, integrating hybrid transformer-symbolic architectures, and extending it to low-resource programming languages.
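
To make the symbolic stages named in the abstract concrete, the following is a minimal sketch of rule-based language identification followed by an AST validity check. It is an illustration under stated assumptions, not the authors' implementation: the regular-expression rules, the function names (identify_language, is_valid_python, keep_pair), and the filtering policy are all hypothetical, and only Python snippets are parsed here because Python's standard ast module handles Python source only. The TF-IDF, Naïve Bayes, and CodeBERT stages are not shown.

```python
import ast
import re

# Hypothetical keyword rules for the five languages listed in the abstract.
# The framework's actual rule set is not given, so these are assumptions.
LANGUAGE_RULES = {
    "python": [r"^\s*def \w+\(.*\):", r"^\s*import \w+", r"\bself\b"],
    "java": [r"\bpublic\s+(static\s+)?\w+", r"\bSystem\.out\.println\b",
             r"\bclass\s+\w+\s*\{"],
    "javascript": [r"\bfunction\s+\w*\(", r"\bconsole\.log\b",
                   r"\b(const|let|var)\s+\w+\s*="],
    "cpp": [r"#include\s*<\w+>", r"\bstd::\w+", r"\bcout\s*<<"],
    "php": [r"<\?php", r"\$\w+\s*=", r"\becho\b"],
}


def identify_language(snippet):
    """Return the language whose rules match most often, or None if none match."""
    scores = {
        lang: sum(1 for pattern in patterns
                  if re.search(pattern, snippet, re.MULTILINE))
        for lang, patterns in LANGUAGE_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def is_valid_python(snippet):
    """Structural check: does the snippet parse into a Python AST?"""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        return False


def keep_pair(description, snippet):
    """Assumed filtering policy: keep a pair if it has a non-empty description,
    a recognizable language, and (for Python only) a parseable snippet."""
    lang = identify_language(snippet)
    if lang is None or not description.strip():
        return False, lang
    if lang == "python" and not is_valid_python(snippet):
        return False, lang
    return True, lang
```

Under these assumptions, calling keep_pair with a description and a well-formed Python function keeps the pair, while a truncated function body is identified as Python but rejected by the AST check. Other languages would need their own parsers for an equivalent structural filter.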

Published

2026-05-13