The large vocabulary of the Persian language, with its many nouns and verbs, and the availability of information in every field in papers, books and on the internet create the need for a system that compares texts and evaluates their similarity. This paper presents a system for comparing Persian (Farsi) texts and determining their degree of similarity. The system uses the TF-IDF method to weight sentences. In addition, nouns are reduced to their roots, and synonyms and members of the same word family receive identical scores. The implementation results indicate that the proposed system achieves the desired efficiency when comparing short texts.
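The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their exact scoring formula, so the sketch assumes term-level TF-IDF weights with a smoothed IDF and cosine similarity, whitespace tokenization, and a small hypothetical lookup table (`CANONICAL`) standing in for a real Persian stemmer and synonym lexicon. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical table mapping surface forms to a shared root or concept.
# Assumption: a real system would use a Persian stemmer and a synonym
# lexicon here; these four entries exist only for illustration.
CANONICAL = {
    "کتابها": "کتاب",  # plural "books" -> root "book"
    "منزل": "خانه",    # synonym of "house"
    "قشنگ": "زیبا",    # synonym of "beautiful"
}


def tokenize(text):
    """Split on whitespace and map each token to its canonical form."""
    return [CANONICAL.get(tok, tok) for tok in text.split()]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumption): terms found in every document
    # still keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(text_a, text_b, background=()):
    """Similarity of two texts; `background` optionally supplies extra
    documents so the IDF statistics are less degenerate."""
    vectors = tfidf_vectors([text_a, text_b, *background])
    return cosine(vectors[0], vectors[1])


if __name__ == "__main__":
    print(similarity("خانه زیبا است", "منزل قشنگ است"))
    print(similarity("کتاب خوب", "هوا سرد"))
```

Because synonyms and word-family members are collapsed before weighting, the first pair is treated as the same token sequence and scores close to 1, while the second pair shares no terms and scores 0. Normalizing before counting is one simple way to give such words identical scores; the paper may realize this step differently.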
Published in: International Journal of Intelligent Information Systems (Volume 3, Issue 6-1)
This article belongs to the Special Issue: Research and Practices in Information Systems and Technologies in Developing Countries
DOI: 10.11648/j.ijiis.s.2014030601.21
Page(s): 61-66
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2014. Published by Science Publishing Group
Keywords: Text Similarity, TF-IDF, Semantic Similarity, Stemming
APA Style
Elham Mahdipour, Rahele Shojaeian Razavi, Zahra Gheibi. (2014). Software Development for Identifying Persian Text Similarity. International Journal of Intelligent Information Systems, 3(6-1), 61-66. https://doi.org/10.11648/j.ijiis.s.2014030601.21
TY - JOUR
T1 - Software Development for Identifying Persian Text Similarity
AU - Elham Mahdipour
AU - Rahele Shojaeian Razavi
AU - Zahra Gheibi
Y1 - 2014/10/29
PY - 2014
N1 - https://doi.org/10.11648/j.ijiis.s.2014030601.21
DO - 10.11648/j.ijiis.s.2014030601.21
T2 - International Journal of Intelligent Information Systems
JF - International Journal of Intelligent Information Systems
JO - International Journal of Intelligent Information Systems
SP - 61
EP - 66
PB - Science Publishing Group
SN - 2328-7683
UR - https://doi.org/10.11648/j.ijiis.s.2014030601.21
VL - 3
IS - 6-1
ER -