- Organization
- Universidad Federal de Rio Grande del Sur (UFRGS)
- Type
- Academic Sector
- Years
- 2023
- Countries
- Brazil
The project addresses a pressing issue in network security research and development: the lack of explainability in innovative ML-based solutions for identifying, preventing, and responding to attacks on Internet infrastructure, especially given the rise of Internet-of-Things (IoT) devices and zero-day malware. While this issue is not unique to Latin America, the technical knowledge of researchers in the region is well suited to tackling it.
The objectives of this project are to define a specific data quality metric for each of the following aspects of data quality: accuracy, timeliness, uniqueness, validity, consistency, and completeness. Using these metrics for data quality in networking, the project will develop a collaborative online platform to evaluate and rank publicly available datasets commonly used for ML in networking. This platform is expected to guide researchers in choosing the dataset that best suits the needs of their research. In addition, the ranking will serve as a reference point for the development of new datasets in the future.
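To make two of the quality aspects above concrete, the following is a minimal sketch of how completeness and uniqueness could be scored on a small set of flow records. It is an illustration only, not the project's metric definitions: the function names `completeness` and `uniqueness`, the field names, and the sample records are all hypothetical, and both scores are assumed to be simple ratios in the range 0 to 1.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical record type: one flow, mapping field name to value (None = missing).
Record = Mapping[str, Optional[object]]


def completeness(records: Sequence[Record], fields: Sequence[str]) -> float:
    """Fraction of expected field values that are present (not None)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # assumption: an empty dataset has nothing missing
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total


def uniqueness(records: Sequence[Record], fields: Sequence[str]) -> float:
    """Fraction of records that are distinct over the given fields."""
    if not records:
        return 1.0  # assumption: an empty dataset has no duplicates
    distinct = {tuple(r.get(f) for f in fields) for r in records}
    return len(distinct) / len(records)


# Illustrative field names and sample data, invented for this sketch.
FIELDS = ["src_ip", "dst_ip", "dst_port", "label"]
flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "dst_port": 443, "label": "benign"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "dst_port": 443, "label": "benign"},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "dst_port": 22, "label": None},
    {"src_ip": "10.0.0.4", "dst_ip": "10.0.0.9", "dst_port": 80, "label": "attack"},
]

print(f"completeness = {completeness(flows, FIELDS):.4f}")
print(f"uniqueness   = {uniqueness(flows, FIELDS):.4f}")
```

In this toy dataset one of the sixteen expected values is missing and one record is an exact duplicate, so the two scores differ. A real metric would likely need to account for domain specifics such as legitimately repeated flows or optional fields.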
The expected results are a classification and categorization of current approaches to measuring data quality in ML in general, and in networking and network security in particular, together with recommendations to developers and researchers on increasing the quality of datasets used in networking, based on the observations produced through the data quality metrics. These recommendations will be grounded in that classification and in the evaluations performed, so that they are supported by scientific evidence. The project deliverables will be an open-source Python package that allows any developer or researcher to apply the developed data quality metrics to any particular traffic capture; an open-source collaborative platform for ranking network-related datasets; and open-source software that facilitates the local evaluation of private datasets.