Machine learning (ML) in big data frameworks plays a critical role in real-time analytics, decision-making, and predictive modeling. Among the most prominent ML libraries for large-scale data processing are Flink-ML, the machine learning extension of Apache Flink, and MLlib, the machine learning library of Apache Spark. This paper provides a comparative analysis of the two frameworks, evaluating their performance, scalability, streaming capabilities, iterative computation efficiency, and ease of integration with external deep learning frameworks. Flink-ML is designed for real-time, event-driven ML applications and natively supports streaming-based model training and inference. Spark MLlib, in contrast, is optimized for batch processing and micro-batch streaming, making it better suited to traditional machine learning workflows. Experimental results show that training time is nearly identical for the two frameworks, with Spark MLlib requiring 4006.4 seconds and Flink-ML 4003.2 seconds, demonstrating comparable efficiency in batch training and streaming-based model updates. Flink-ML achieves slightly higher accuracy (74.9%) than Spark MLlib (74.7%), suggesting that continuous learning in Flink-ML may contribute to better generalization. Inference throughput is slightly higher for Spark MLlib (8.4 images/sec) than for Flink-ML (8.2 images/sec), indicating that Spark's batch execution offers a small advantage in processing efficiency. Both frameworks consume the same amount of memory (30.2%), confirming that TensorFlow's deep learning operations, rather than architectural differences between Spark and Flink, dominate resource consumption. These results highlight the trade-offs between Flink-ML and Spark MLlib and can guide data scientists and engineers in selecting a framework based on specific ML workflow requirements and scalability considerations.
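The core distinction the abstract draws, batch training in Spark MLlib versus streaming (continuous) model updates in Flink-ML, can be illustrated with a minimal, framework-free sketch. The code below is not Flink-ML or Spark MLlib API; the function names and the toy one-parameter regression are purely illustrative assumptions, contrasting a full-dataset gradient step with per-record online updates.

```python
# Conceptual sketch only -- not Flink-ML or Spark MLlib code. It contrasts
# batch training (the model sees the whole dataset on every update, as in
# Spark MLlib's batch mode) with streaming training (each arriving record
# updates the model immediately, as in Flink-ML's continuous learning).

def batch_train(data, lr=0.1, epochs=200):
    """Fit y = w*x by full-batch gradient descent on a fixed dataset."""
    w = 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradient of mean squared error over the entire batch.
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= lr * grad
    return w

def streaming_train(stream, lr=0.1):
    """Fit y = w*x by per-record (online) SGD in a single pass."""
    w = 0.0
    for x, y in stream:
        # Each event triggers an immediate model update.
        w -= lr * 2 * (w * x - y) * x
    return w

# Noise-free toy data with true relationship y = 3x.
data = [(x / 10, 3.0 * (x / 10)) for x in range(1, 11)]
w_batch = batch_train(data)
w_stream = streaming_train(data * 20)  # replay to mimic a long event stream
```

Both estimators converge to the same weight here; the difference is operational, not statistical: the batch learner needs the full dataset materialized before each update, while the streaming learner keeps only the current model state, which is what makes event-driven training and inference natural in Flink-ML.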