IMPLEMENTATION OF TEXT CLASSIFICATION ON USER REVIEWS IN DANA APPLICATION USING SUPPORT VECTOR MACHINE (SVM) AND GAUSSIAN NAÏVE BAYES (GNB)
Gunadarma University
Indonesia
Abstract
Conventional methods and tools cannot efficiently process the large volumes and diverse categories of information collectively referred to as big data. Text mining is a commonly used technique for analyzing such data. This study evaluates the effectiveness of Support Vector Machine (SVM) and Gaussian Naïve Bayes (GNB) classifiers in classifying user reviews of the DANA application obtained from the Google Play Store. The investigation has four phases: data collection, data preparation, data modelling, and evaluation. The study used a dataset of 15,451 user reviews, divided into three subsets of different sizes, each evaluated under several training-to-testing ratios. The evaluation reports accuracy, precision, recall, F1-score, and the ROC curve. The results show that SVM and GNB achieved accuracy rates of at least 75%. SVM achieved average accuracies of 84%, 88%, and 91%, while GNB achieved average accuracies of 71%, 81%, and 85%. Based on these results, sentiment analysis is more effective with SVM than with GNB.
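The evaluation measures named in the abstract follow standard definitions, which the short sketch below illustrates for a binary sentiment label. It is not the study's code: the function name `binary_metrics` and the toy label lists are hypothetical and chosen only for illustration, and the ROC curve is left out because it requires classifier scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (assumed for illustration): 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.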
Keywords