Preparatory Document Structuring Technique

Yan Puspitarani; Ulil Surtia Zulpratita

doi:10.61841/as7xdv73

Authors

Yan Puspitarani, Engineering Department, Widyatama University, Bandung, Indonesia
Ulil Surtia Zulpratita, Engineering Department, Widyatama University, Bandung, Indonesia

Keywords:

Text mining, Document structuring, Information extraction

Abstract

The need for mining structured data has increased in the past few years, as structured data serves as input for data mining tasks. Text mining is the part of data mining in which the data takes the form of unstructured text. It can handle unstructured or semi-structured data sets such as emails, HTML files, and full-text documents. Unstructured data, in contrast to structured data, usually refers to information that does not reside in a traditional row-column database. Extracting information from text requires preprocessing steps. This paper discusses the theoretical basis of preprocessing documents for text mining and briefly describes some representative approaches, such as NLP tasks and information extraction.
