Preparatory Document Structuring Technique
DOI: https://doi.org/10.61841/as7xdv73
Keywords: Text mining, Document structuring, Information extraction
Abstract
The need for mining structured data has increased in the past few years. This structured data is used as input for data mining tasks. Text mining is the part of data mining in which the data takes the form of unstructured text. Text mining can handle unstructured or semi-structured data sets such as emails, HTML files, and full-text documents. Unstructured data usually refers to information that does not reside in a traditional row-column database; it is the opposite of structured data. To extract information from text, preprocessing steps are needed. This paper discusses the theoretical basis of preprocessing documents for text mining. Brief descriptions of some representative approaches, such as NLP tasks and information extraction, are provided as well.
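As a rough illustration of the kind of preprocessing the abstract refers to, the sketch below chains three common steps: tokenization, stop-word removal, and suffix stripping. It is not taken from the paper. The stop-word list, the suffix rules, and the function names (`tokenize`, `strip_suffix`, `preprocess`) are assumptions made for this example, and the suffix stripper is a crude stand-in for a real stemming algorithm.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real systems use larger ones.
STOP_WORDS = {"the", "is", "a", "an", "of", "in", "and", "to", "for", "as"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(token):
    """Remove one common English suffix if at least 3 letters remain.

    This is a naive placeholder, not a real stemmer such as Porter's.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then strip suffixes from what is left."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Text mining handles unstructured emails and HTML files."))
```

The resulting token list is the sort of structured input that later steps, such as feature selection or information extraction, could consume. A production pipeline would likely replace each stage with a tested library component.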
License
Copyright (c) 2020 AUTHOR
This work is licensed under a Creative Commons Attribution 4.0 International License.