Browsing by Subject "Information retrieval"
Item DEMIR at CLEF ehealth: The effects of selective query expansion to information retrieval (CEUR-WS, 2014) Ozturkmenoglu O.; Alpkocak A.; Kilinc D.
This paper presents the details of the participation of the DEMIR (Dokuz Eylül University Multimedia Information Retrieval) research team in the Share/CLEF eHealth 2014. This year, we participated in task 3a: monolingual user-centered health information retrieval. In this task, we focused on applying query expansion techniques selectively to some queries to improve the performance of information retrieval. Thus, we first extracted some statistical features from queries, such as the length of the query and the sum and intersection of the document frequencies of each query term. We developed a system to predict whether a query should be expanded or not. Then, we trained our system with the previous year's data. Finally, we applied a query expansion method only to the queries selected by the system. The results show that the approach we proposed slightly improves our baseline retrieval performance in terms of P@10.

Item Multi-level reranking approach for bug localization (Blackwell Publishing Ltd, 2016) Kılınç D.; Yücalar F.; Borandağ E.; Aslan E.
Bug fixing has a key role in software quality evaluation. Bug fixing starts with the bug localization step, in which developers use textual bug information to find the location of the source code that contains the bug. Bug localization is a tedious and time-consuming process. Information retrieval requires understanding the programme's goal, coding structure, programming logic and the relevant attributes of the bug. Information retrieval (IR) based bug localization is a retrieval task, where bug reports and source files represent the queries and documents, respectively. In this paper, we propose BugCatcher, a newly developed bug localization method based on a multi-level re-ranking IR technique. We evaluate BugCatcher on three open source projects with approximately 3400 bugs.
Our experiments show that the multi-level reranking approach to bug localization is promising. Retrieval performance and accuracy of BugCatcher are better than those of current bug localization tools, and BugCatcher has the best Top N, Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) values for all datasets. © 2016 Wiley Publishing Ltd

Item A tool for producing structured interoperable data from product features on the web (Elsevier Ltd, 2016) Özacar T.
This paper introduces a tool that produces structured interoperable data from product features, i.e., attribute name-value pairs, on the web. The tool extracts the product features using a web site-specific template created by the user. The value of the extracted data is maximized by using GoodRelations, which is the standard vocabulary for modeling product types and their features. The final output of the tool is GoodRelations snippets, which contain product features encoded in RDFa or Microdata. These snippets can be embedded into existing static and dynamic web pages in a way accessible to major search engines like Google and Yahoo, mobile applications, and browser extensions. This increases the visibility of your products and services in the latest generation of search engines, recommender systems, and other novel applications. © 2015 Elsevier Ltd.

Item An improved ant algorithm with LDA-based representation for text document clustering (SAGE Publications Ltd, 2017) Onan A.; Bulut H.; Korukoglu S.
Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value.
To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to that of conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents. © Chartered Institute of Library and Information Professionals.

Item A K-medoids based clustering scheme with an application to document clustering (Institute of Electrical and Electronics Engineers Inc., 2017) Onan A.
Clustering is an important unsupervised data analysis technique, which divides data objects into clusters based on similarity. Clustering has been studied and applied in many different fields, including pattern recognition, data mining, decision science and statistics. Clustering algorithms can be mainly classified as hierarchical and partitional clustering approaches. Partitioning around medoids (PAM) is a partitional clustering algorithm that is less sensitive to outliers but greatly affected by poor initialization of medoids. In this paper, we augment the randomized seeding technique to overcome the problem of poor initialization of medoids in the PAM algorithm. The proposed approach (PAM++) is compared with other partitional clustering algorithms, such as K-means and K-means++, on text document clustering benchmarks and evaluated in terms of F-measure. The experimental results indicate that randomized seeding can improve the performance of the PAM algorithm on text document clustering.
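The abstract above does not spell out how the seeding is done, so the following is only a minimal sketch of one plausible reading: K-means++-style distance-squared sampling applied to medoid indices. The function name `seed_medoids`, the Manhattan distance, and the toy points are illustrative assumptions, not the paper's method or data.

```python
import random

def seed_medoids(points, k, dist, rng):
    """Pick k distinct initial medoid indices with K-means++-style D^2 sampling.

    Assumption: this mirrors K-means++ seeding; it is not taken from the paper.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    medoids = [rng.randrange(len(points))]  # first medoid: uniform at random
    # Distance from every point to its nearest medoid chosen so far.
    nearest = [dist(p, points[medoids[0]]) for p in points]
    while len(medoids) < k:
        weights = [d * d for d in nearest]
        if sum(weights) == 0:
            # Every point coincides with a medoid: fall back to a uniform pick.
            candidates = [i for i in range(len(points)) if i not in medoids]
            nxt = rng.choice(candidates)
        else:
            # Far-away points are proportionally more likely to be chosen.
            nxt = rng.choices(range(len(points)), weights=weights, k=1)[0]
        medoids.append(nxt)
        nearest = [min(d, dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return medoids

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
medoids = seed_medoids(points, k=2, dist=manhattan, rng=random.Random(0))
```

The returned indices would then serve as the starting medoids for the usual PAM swap phase, which this sketch leaves out.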
© 2017 IEEE.

Item Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling (Hindawi Limited, 2018) Onan A.
Text mining is an important research direction, which involves several fields, such as information retrieval, information extraction, and text categorization. In this paper, we propose an efficient multiple classifier approach to text categorization based on swarm-optimized topic modelling. Latent Dirichlet allocation (LDA) can overcome the high dimensionality problem of the vector space model, but identifying appropriate parameter values is critical to the performance of LDA. The swarm-optimized approach estimates the parameters of LDA, including the number of topics and all the other parameters involved in LDA. The hybrid ensemble pruning approach based on combined diversity measures and clustering aims to obtain a multiple classifier system with high predictive performance and better diversity. In this scheme, four different diversity measures (namely, disagreement measure, Q-statistics, the correlation coefficient, and the double fault measure) among classifiers of the ensemble are combined. Based on the combined diversity matrix, a swarm intelligence based clustering algorithm is employed to partition the classifiers into a number of disjoint groups, and one classifier (with the highest predictive performance) from each cluster is selected to build the final multiple classifier system. Experiments have been conducted on five biomedical text benchmarks. In the swarm-optimized LDA, different metaheuristic algorithms (such as genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm) are considered. In the ensemble pruning, five metaheuristic clustering algorithms are evaluated. The experimental results on biomedical text benchmarks indicate that swarm-optimized LDA yields better predictive performance compared to the conventional LDA.
In addition, the proposed multiple classifier system outperforms the conventional classification algorithms, ensemble learning, and ensemble pruning methods. © 2018 Aytuǧ Onan.

Item Emotion Analysis from Turkish Tweets Using Deep Neural Networks (Institute of Electrical and Electronics Engineers Inc., 2019) Tocoglu M.A.; Ozturkmenoglu O.; Alpkocak A.
Text data analysis of social media is becoming more and more important since it includes the most recent information on what people think about. Likewise, emotion is one of the most valuable parts of human communication, and emotion analysis is a type of information extraction process that identifies the emotional states of a given text. In this study, we investigated the performance of deep neural networks on emotion analysis from Turkish tweets. For this, we examined three different deep learning architectures: artificial neural network (ANN), convolutional neural network (CNN) and recurrent neural network (RNN) with long short-term memory (LSTM). Besides, we curated a dataset of Turkish tweets and annotated each tweet automatically for six emotion categories using a lexicon-based approach. For the evaluation, we conducted a set of experiments for each architecture. First, the results showed that the lexicon-based automatic annotation of tweets is valid. Second, ANN produced the worst result, as expected, and CNN achieved the highest score of 0.74 in terms of accuracy. Experiments also showed that our proposed approach for emotion analysis of tweets in Turkish performs better than the state of the art on this topic.
© 2013 IEEE.

Item Weighted word embeddings and clustering-based identification of question topics in MOOC discussion forum posts (John Wiley and Sons Inc, 2021) Onan A.; Toçoğlu M.A.
Massive open online courses (MOOCs) are recent and widely studied distance learning approaches aimed at providing learning material to learners from geographically dispersed locations without age, gender, or race-related constraints. MOOCs are generally enriched by discussion forums to provide interactions among students, professors, and teaching assistants. MOOC discussion forum posts provide feedback regarding the students' learning processes, social interactions, and concerns. The purpose of our research is to present a document-clustering model on MOOC discussion forum posts based on weighted word embeddings and clustering to identify question topics on discussion posts. In this study, four word-embedding schemes (namely, word2vec, fastText, global vectors, and Doc2vec), four weighting functions (i.e., term frequency-inverse document frequency [IDF], IDF, smoothed IDF, and subsampling function), and four clustering algorithms (i.e., K-means, K-means++, self-organizing maps, and divisive analysis clustering algorithm) for document clustering and topic modeling on MOOC discussion forum posts have been evaluated. Twenty different feature representations have been obtained from word-embedding schemes and weighting functions. The feature representation schemes have been evaluated in conjunction with four clustering methods. For the evaluation task, the empirical results for latent Dirichlet allocation have also been included. The empirical results in terms of adjusted Rand index, normalized mutual information, and adjusted mutual information indicate that weighted word-embedding schemes combined with clustering algorithms outperform the conventional schemes. © 2020 Wiley Periodicals, Inc.
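
As a rough illustration of the weighted word-embedding idea in the last abstract, the sketch below builds a document vector as an IDF-weighted mean of word vectors. It assumes the plain IDF form log(N / df) and uses hand-made two-dimensional vectors; the function names, the toy posts, and the embeddings are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Plain IDF, log(N / df), for every term in a tokenised corpus."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def weighted_doc_vector(doc, embeddings, weights):
    """Weighted mean of the word vectors of tokens that have a positive weight."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in doc:
        w = weights.get(token, 0.0)
        if token in embeddings and w > 0.0:
            total = [t + w * x for t, x in zip(total, embeddings[token])]
            weight_sum += w
    # Documents with no weighted tokens map to the zero vector.
    return [t / weight_sum for t in total] if weight_sum else total

# Hypothetical toy forum posts and 2-D "embeddings" (assumptions for illustration).
docs = [["deadline", "quiz"], ["quiz", "video"], ["quiz", "deadline", "extension"]]
embeddings = {
    "deadline": [1.0, 0.0],
    "quiz": [0.0, 1.0],
    "video": [1.0, 1.0],
    "extension": [2.0, 0.0],
}
weights = idf_weights(docs)
vectors = [weighted_doc_vector(d, embeddings, weights) for d in docs]
```

Because "quiz" occurs in every toy post, its IDF is zero and it drops out of each document vector. The resulting vectors could then be passed to any of the clustering algorithms the abstract lists.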