Scalable Advanced Massive Online Analysis (SAMOA) is a framework for distributed machine learning on data streams that provides an interface for plugging into different stream processing platforms. Web companies such as Yahoo need to extract useful information from big data streams, that is, to perform large-scale data analysis tasks in real time.
A concrete example of big data stream mining is spam detection on Tumblr, which aims to improve the user experience. Technically, Tumblr can use machine learning to classify whether a comment is spam. In terms of data volume, Tumblr users generate about three terabytes of new data each day.
Unfortunately, real-time analysis of big data is not an easy task to perform. Web companies need systems that can efficiently analyze newly arriving data within an acceptable processing time and with limited memory.
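To make the constraint concrete, here is a minimal sketch of a learner that sees each comment once, updates a few counters, and then discards the comment. It is an illustration only, not SAMOA code: the class `OnlineNaiveBayes`, the function `prequential_accuracy` and the labels `"spam"` and `"ham"` are all assumptions made for this example. Memory grows with the vocabulary rather than with the number of comments; a production system would probably also cap the vocabulary.

```python
import math
from collections import defaultdict


class OnlineNaiveBayes:
    """Multinomial naive Bayes updated one example at a time (illustrative)."""

    def __init__(self):
        self.class_counts = defaultdict(int)  # label -> examples seen
        self.word_counts = defaultdict(lambda: defaultdict(int))  # label -> word -> count
        self.total_words = defaultdict(int)  # label -> total words seen
        self.vocabulary = set()

    def learn(self, text, label):
        """Update the counters from one labelled comment; the text is not stored."""
        self.class_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.total_words[label] += 1
            self.vocabulary.add(word)

    def predict(self, text):
        """Return the most likely label, or None if nothing has been learned yet."""
        if not self.class_counts:
            return None
        n_examples = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary) or 1
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_examples)
            for word in text.lower().split():
                seen = self.word_counts[label].get(word, 0)
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((seen + 1) / (self.total_words[label] + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def prequential_accuracy(stream, model):
    """Test-then-train evaluation: predict each example before learning from it."""
    correct = total = 0
    for text, label in stream:
        prediction = model.predict(text)
        if prediction is not None:
            total += 1
            correct += prediction == label
        model.learn(text, label)
    return correct / total if total else 0.0
```

Calling `prequential_accuracy(stream, OnlineNaiveBayes())` with any iterable of `(text, label)` pairs processes the stream in a single pass, which is the access pattern that limited memory and bounded processing time require.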
Big Data V's
Big data is a recent term for data sets so large that they exceed the capabilities of traditional storage and processing systems. Volume, velocity and variety, also called the three V's, are commonly used to characterize big data. Looking at each of the three V's independently reveals the difficulties of big data analysis.
The volume of data implies the need to scale storage and to perform distributed querying in order to process it. Solutions to the volume problem use either data warehousing techniques or parallel processing architectures such as Apache Hadoop.
Velocity is concerned with the speed at which data is generated and flows into a system. Every day, sensors, devices and applications generate unbounded amounts of information that can be used in many ways for prediction and analysis.
Velocity concerns not only the speed of data production but also the speed at which analysis results can be returned from the generated data.
Real-time feedback is very important when dealing with fast-evolving information such as stock markets, social networks, sensor networks, mobile information and others. Frameworks that aim to process these unbounded data streams have appeared, such as Apache S4 and Twitter Storm.
Another problem with big data is the variety of data representations. Data may come in many different formats depending on the source, so handling all of these formats can be difficult. Distributed key-value stores, commonly referred to as NoSQL databases, are very useful for dealing with variety because they store data in an unstructured form.
This flexibility is an advantage when dealing with big data. Traditional relational databases, by contrast, require restructuring and reformatting schemas when new data formats appear.
Scalable Advanced Massive Online Analysis
Scalable Advanced Massive Online Analysis (SAMOA) provides a programming abstraction for distributed streaming algorithms that allows new ML algorithms to be developed without dealing with the complexity of the underlying stream processing engines (SPEs). Furthermore, SAMOA provides extension points for integrating new SPEs into the system.
These features allow SAMOA users to develop distributed streaming ML algorithms once and run them on multiple SPEs. This section discusses the high-level architecture and main design objectives of SAMOA.
There are three types of SAMOA users:
- Platform users, who need to use ML but do not want to implement the algorithms.
- ML developers, who develop new ML algorithms on top of SAMOA and use the algorithms already developed in SAMOA.
- Platform developers, who extend SAMOA to integrate more SPEs.
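
The division of labor between ML developers and platform developers can be illustrated with a small sketch. It does not use SAMOA's actual API: `StreamAlgorithm`, `EngineAdapter`, `MajorityClassLearner`, `SequentialEngine` and `BatchedEngine` are hypothetical names invented for this example. The point is only that an algorithm written once against an abstract interface can run unchanged on more than one engine.

```python
from abc import ABC, abstractmethod
from collections import Counter


class StreamAlgorithm(ABC):
    """Hypothetical interface an ML developer would implement."""

    @abstractmethod
    def process(self, event):
        """Consume one event and return an output for it."""


class EngineAdapter(ABC):
    """Hypothetical extension point a platform developer would implement."""

    @abstractmethod
    def run(self, source, algorithm):
        """Feed every event from source to the algorithm and collect the outputs."""


class MajorityClassLearner(StreamAlgorithm):
    """Toy learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.label_counts = Counter()

    def predict(self):
        if not self.label_counts:
            return None
        return self.label_counts.most_common(1)[0][0]

    def process(self, event):
        _, label = event
        prediction = self.predict()  # predict before seeing the label
        self.label_counts[label] += 1  # then learn from it
        return prediction


class SequentialEngine(EngineAdapter):
    """Delivers events one at a time, like a local test mode."""

    def run(self, source, algorithm):
        return [algorithm.process(event) for event in source]


class BatchedEngine(EngineAdapter):
    """Stand-in for a different engine: delivers the same events in small batches."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size

    def run(self, source, algorithm):
        outputs, batch = [], []
        for event in source:
            batch.append(event)
            if len(batch) == self.batch_size:
                outputs.extend(algorithm.process(e) for e in batch)
                batch = []
        outputs.extend(algorithm.process(e) for e in batch)  # flush the remainder
        return outputs
```

In this sketch a platform user would only choose an engine and an algorithm, an ML developer would write classes like `MajorityClassLearner`, and a platform developer would add new `EngineAdapter` subclasses. Neither engine requires any change to the learner.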