Title of course: Data mining in geosciences
Code: TTGME7043_EN
ECTS Credit points: 2
Type of teaching, contact hours
- lecture: 1 hour/week
- practice: 1 hour/week
- laboratory: -
Evaluation: exam
Workload (estimated), divided into contact hours:
- lecture: 14 hours
- practice: 14 hours
- laboratory: -
- home assignment: -
- preparation for the exam: 16 hours
Total: 30 hours
Year, semester: 2nd year, 3rd semester
Its prerequisite(s): -
Further courses built on it: -
Topics of course
Data mining is one of today's most important data analysis techniques. It goes beyond basic statistics by applying statistical models to nominal and scale variables, with the goal of automatically extracting useful information from large databases. The methods fall into two main groups. The first is classification, in which students assign binary or multiclass data to classes, similar to image classification. The second is estimation of scale-type variables using a variety of algorithms. The course covers Big Data theory and practice together with data mining methods.
Topics covered in the theoretical part of the course:
- theoretical background of multivariate data analysis, Big Data theory
- model-fitting parameters, number of elements, conditions of statistical models
- ANOVA, 2-factor ANOVA
- multivariate linear analysis, GLM
- robust regression methods: MA, RMA
- robust regression methods: lasso, ridge, elastic net
- dimension reduction with ordination methods: PCA
- dimension reduction with ordination methods: CA, MCA
- application of Partial Least Squares in regression
- cluster analysis (hierarchical procedures)
- clustering (k-means clustering, optimal cluster number, connectivity, Dunn index, silhouette width)
- Random Forest as a regression and as a classification algorithm
- variable importance
Topics covered in the practical part of the course:
- preparing the correct data matrix in Excel for multivariate analysis
- introduction to the R software environment (language, commands, working directory, data import, data frame, vector, array, matrix)
- computing basic statistics in R (familiarization with scripting)
- using linear models in R (lm function): hypothesis testing and regression
- 2-factor ANOVA in R
- running and interpreting GLM models in R
- packages in R (lmodel2), applying robust regression: MA, RMA, SMA
- packages in R (glmnet), applying robust regression: lasso, ridge, elastic net
- running and interpreting PCA in R
- Random Forest regression
- Random Forest classification
Literature
- Islam S. 2018. Hands-on: geospatial analysis with R and QGIS. Packt Publishing, Birmingham, 347 p.
- Cuesta H. 2013. Practical Data Analysis. Packt Publishing, Birmingham, 360 p.
- Barna I. - Székelyi M. 2004. Survival kit for SPSS. Typotex Publisher, 453 p.
- Kabacoff R.I. 2011. R in Action: Data Analysis and Graphics with R. Manning Publications.
Requirements:
- for a signature
Attendance is compulsory because the lectures and practices build closely on each other.
- for a grade
The course ends in a written examination. The minimum score required to pass the test is 50%. The grade is determined from the test score according to the following table:
Score | Grade
0-49 | fail (1)
50-64 | pass (2)
65-74 | satisfactory (3)
75-85 | good (4)
86-100 | excellent (5)
If the test score is below 50, students can retake the test in accordance with the EDUCATION AND EXAMINATION RULES AND REGULATIONS.
Person responsible for course: Prof. Dr. Szilárd Szabó, PhD, DSc, Full Professor