Job offers classifier using neural networks and oversampling methods
Germán Ortiz, Gemma Bel Enguix, Helena Gómez-Adorno, and 2 more authors
In Recent Developments and the New Directions of Research, Foundations, and Applications: Selected Papers of the 8th World Conference on Soft Computing, February 03–05, 2022, Baku, Azerbaijan, Vol. I, 2023
Both policy and research benefit from a better understanding of individuals’ jobs. However, as large-scale administrative records are increasingly employed to represent labour market activity, new automatic methods to classify jobs will become necessary. We developed an automatic job offers classifier using a dataset collected from Bumeran, the largest job bank in Mexico. We applied machine learning algorithms such as Support Vector Machines, Naive Bayes, Logistic Regression, and Random Forest, as well as deep learning Long Short-Term Memory (LSTM) networks. Using these algorithms, we trained multi-class models to classify job offers into one of 23 classes (not uniformly distributed): Sales, Administration, Call Center, Technology, Trades, Human Resources, Logistics, Marketing, Health, Gastronomy, Financing, Secretary, Production, Engineering, Education, Design, Legal, Construction, Insurance, Communication, Management, Foreign Trade, and Mining. We used the SMOTE, Geometric-SMOTE, and ADASYN synthetic oversampling algorithms to handle imbalanced classes. The proposed convolutional neural network architecture achieved the best results when combined with the Geometric-SMOTE algorithm.
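As a rough illustration of the synthetic oversampling idea mentioned in the abstract, the sketch below implements the basic SMOTE interpolation step in plain Python. It is not the authors' pipeline and does not cover Geometric-SMOTE or ADASYN; the function name `smote`, its parameters, and the toy points are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


# Toy minority class with two numeric features (invented values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class grows without duplicating existing rows. In a text classification setting, the inputs would be numeric document vectors rather than the raw job offer text.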
2022
Overview of PAR-MEX at Iberlef 2022: Paraphrase Detection in Spanish Shared Task
Gemma Bel-Enguix, Gerardo Sierra, Helena Gómez-Adorno, and 3 more authors
Paraphrase detection is an important unresolved task in natural language processing, especially for Spanish. To address this issue and contribute to the creation of high-performance automated paraphrase detection systems, we propose a shared task called PAR-MEX. For this task, we created a corpus in Spanish with topics in the domain of Mexican gastronomy. The participants then submitted their classification results on our corpus. In this paper, we explain the steps followed to create the corpus, summarize the results obtained by the participants, and propose some conclusions regarding the paraphrase detection task in Spanish.
Sentence-CROBI: A Simple Cross-Bi-Encoder-Based Neural Network Architecture for Paraphrase Identification
Jesus-German Ortiz-Barajas, Gemma Bel-Enguix, and Helena Gómez-Adorno
Since the rise of Transformer networks and large language models, cross-encoders have become the dominant architecture for various Natural Language Processing tasks. When dealing with sentence pairs, they can exploit the relationships between those pairs. On the other hand, bi-encoders can obtain a vector given a single sentence and are used in tasks such as textual similarity or information retrieval due to their low computational cost; however, their performance is inferior to that of cross-encoders. In this paper, we present Sentence-CROBI, an architecture that combines cross-encoders and bi-encoders to obtain a global representation of sentence pairs. We evaluated the proposed architecture in the paraphrase identification task using the Microsoft Research Paraphrase Corpus, the Quora Question Pairs dataset, and the PAWS-Wiki dataset. Our model obtains competitive results compared with the state-of-the-art by using model ensembles and a simple model configuration. These results demonstrate that a simple architecture that combines sentence pair and single-sentence representations without using complex pre-training or fine-tuning algorithms is a viable alternative for sentence pair tasks.
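One simple way to picture a "global representation" of a sentence pair is to concatenate a pair-level vector with single-sentence vectors. The sketch below does this with toy lists of floats. The abstract does not describe how Sentence-CROBI actually merges the two representations, so the concatenation scheme (including the `|u - v|` term, borrowed from common bi-encoder practice), the function name `pair_features`, and all values are assumptions for illustration only.

```python
def pair_features(cross_vec, u, v):
    """Concatenate a cross-encoder vector for the pair with bi-encoder
    features for each sentence: u, v, and their element-wise |u - v|."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(cross_vec) + list(u) + list(v) + abs_diff


# Toy stand-ins for encoder outputs (invented values, not real embeddings).
cross_vec = [0.5, -0.1]   # joint encoding of the sentence pair
u = [1.0, 0.0, 2.0]       # encoding of sentence A alone
v = [0.5, 0.5, 2.0]       # encoding of sentence B alone

features = pair_features(cross_vec, u, v)
```

The resulting vector could then be passed to a small classification layer that predicts whether the two sentences are paraphrases.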
2020
Enhancing Job Searches in Mexico City with Language Technologies
Gerardo Sierra Martínez, Gemma Bel-Enguix, Helena Gómez-Adorno, and 7 more authors
In Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov), 2020
In this paper, we present enhancements to the Demanded Skills Diagnosis (DiCoDe: Diagnostico de Competencias Demandadas), a system developed by Mexico City’s Ministry of Labor and Employment Promotion (STyFE: Secretaria de Trabajo y Fomento del Empleo de la Ciudad de Mexico) that seeks to reduce information asymmetries between job seekers and employers. The project uses web scraping techniques to retrieve job vacancies posted on private job portals on a daily basis, with the purpose of informing training and individual case management policies as well as labor market monitoring. To this end, a collaboration between STyFE and the Language Engineering Group (GIL: Grupo de Ingenieria Linguistica) was established to enhance DiCoDe by applying NLP models and semantic analysis. Through this collaboration, the macro-structure of DiCoDe’s job vacancy system and its geographic referencing at the city hall (municipality) level were improved. More specifically, dictionaries were created to identify demanded competencies, skills and abilities (CSA), and algorithms were developed to classify vacancies dynamically and to identify search terms in free text, in order to improve the results and processing time of queries.
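The dictionary-based identification of competencies, skills and abilities can be pictured as a phrase lookup over normalized vacancy text. The following is a minimal sketch under that assumption, not the DiCoDe implementation; the dictionary entries, labels, and function names are invented for illustration.

```python
import re
import unicodedata

# Hypothetical dictionary entries, invented for illustration only.
CSA_DICTIONARY = {
    "excel": "spreadsheet software",
    "atencion al cliente": "customer service",
    "ingles": "English language",
}


def normalize(text):
    """Lowercase and strip accents so 'Atención' matches 'atencion'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def find_csa(vacancy_text):
    """Return the CSA labels whose dictionary term occurs as a whole phrase."""
    text = normalize(vacancy_text)
    found = set()
    for term, label in CSA_DICTIONARY.items():
        # Word boundaries keep 'excel' from matching inside 'excelente'.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(label)
    return found


labels = find_csa("Se requiere Inglés avanzado y atención al cliente.")
```

Normalizing accents and case before matching lets a single dictionary entry cover the spelling variants that appear in free-text postings.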
2019
Detection of Aggressive Tweets in Mexican Spanish Using Multiple Features with Parameter Optimization
Germán Ortiz, Helena Gómez-Adorno, Jorge Reyes-Magaña, and 2 more authors
This paper explains our approach to Aggressiveness Identification in the MEX-A3T shared task, whose aim is the detection of aggressive tweets. The task proposes a binary classification for every tweet: aggressive or non-aggressive. We approached the problem using linguistically motivated features and several types of n-grams (words, characters, function words, punctuation symbols, among others). We trained a Support Vector Machine using a combinatorial framework that optimizes the results of the classifier. Our best run achieved an F1-score of 0.4549, which is the 5th best among the twenty-six runs.
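To make the n-gram feature types concrete, the sketch below counts character and word n-grams with the Python standard library. It only illustrates the kind of features named in the abstract; the function names, the whitespace tokenization, and the example strings are assumptions, and the classifier and parameter optimization are not shown.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


char_features = char_ngrams("hola hola", n=3)
word_features = word_ngrams("No me gusta", n=2)
```

Counts such as these would typically be turned into a sparse numeric vector per tweet before being passed to a classifier like a Support Vector Machine.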