ACS Applied Computer Science

Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib

Machine learning (ML) in big data frameworks plays a critical role in real-time analytics, decision-making, and predictive modeling. Among the most prominent ML libraries for large-scale data processing are Flink-ML, the machine learning extension of Apache Flink, and MLlib, the machine learning library of Apache Spark. This paper provides a comparative analysis of the two frameworks, evaluating their performance, scalability, streaming capabilities, iterative computation efficiency, and ease of integration with external deep learning frameworks. Flink-ML is designed for real-time, event-driven ML applications and provides native support for streaming-based model training and inference. Spark MLlib, in contrast, is optimized for batch processing and micro-batch streaming, making it better suited to traditional machine learning workflows. Experimental results show that training time is nearly identical for both frameworks, with Spark MLlib requiring 4006.4 seconds and Flink-ML 4003.2 seconds, demonstrating comparable efficiency in batch training and streaming-based model updates. Flink-ML achieves slightly higher accuracy (74.9%) than Spark MLlib (74.7%), suggesting that continuous learning in Flink-ML may contribute to better generalization. Inference throughput is slightly higher for Spark MLlib (8.4 images/sec) than for Flink-ML (8.2 images/sec), indicating that Spark's batch execution offers a small advantage in processing efficiency. Both frameworks consume the same amount of memory (30.2%), confirming that TensorFlow's deep learning operations, rather than architectural differences between Spark and Flink, dominate resource consumption. These results highlight the trade-offs between Flink-ML and Spark MLlib and can guide data scientists and engineers in selecting the appropriate framework for their ML workflow requirements and scalability considerations.
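
To make the batch-oriented workflow that MLlib targets concrete, the following minimal PySpark sketch times a simple batch training and inference run. It is illustrative only: the dataset path, feature columns, and logistic-regression model are assumptions made for the example, not the TensorFlow-based image-classification pipeline benchmarked in the paper.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
import time

spark = SparkSession.builder.appName("mllib-batch-sketch").getOrCreate()

# Hypothetical training data: a CSV with numeric columns f0..f3 and a binary "label".
df = spark.read.csv("data/train.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Assemble the feature columns into a vector and fit a batch logistic regression model.
assembler = VectorAssembler(inputCols=["f0", "f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
pipeline = Pipeline(stages=[assembler, lr])

start = time.time()
model = pipeline.fit(train)            # batch training (the phase timed in the study)
train_time = time.time() - start

predictions = model.transform(test)    # batch inference
print(f"Training time: {train_time:.1f} s, rows scored: {predictions.count()}")

spark.stop()

Flink-ML would express a comparable pipeline over an unbounded stream, updating the model continuously rather than refitting on a static batch, which is the behavioural difference the benchmark quantifies.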

  • APA 7th style
Mezati, M., & Aouria, I. (2025). Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib. Applied Computer Science, 21(2), 18–27. https://doi.org/10.35784/acs_7297
  • Chicago style
Mezati, Messaoud, and Ines Aouria. "Machine Learning in Big Data: A Performance Benchmarking Study of Flink-ML and Spark MLlib." Applied Computer Science 21, no. 2 (2025): 18–27. https://doi.org/10.35784/acs_7297.
  • IEEE style
M. Mezati and I. Aouria, "Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib," Applied Computer Science, vol. 21, no. 2, pp. 18–27, 2025, doi: 10.35784/acs_7297.
  • Vancouver style
Mezati M, Aouria I. Machine learning in big data: A performance benchmarking study of Flink-ML and Spark MLlib. Applied Computer Science. 2025;21(2):18–27. https://doi.org/10.35784/acs_7297