Statistics is one of the most valuable disciplines to have matured over the last century. With statistical methods such as the chi-square test and the Durbin-Watson (DW) test, we can decide whether a hypothesis should be rejected or retained, and the whole analysis is based on data. However, information technology is developing so fast that data is growing explosively, not merely in the volume and velocity of what is collected but also in its complexity. More observations bring greater statistical power, while higher complexity (more attributes tested at once) tends to bring a higher false discovery rate as well. Traditional statistical methods and packages face challenges as a result, across every stage of data processing, including data capture, storage, analysis, search, sharing, transfer, visualization, querying, security and so on. To overcome this dilemma, a new technology that goes beyond the traditional approach has emerged: big data.
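As a small illustration of the data-driven decision described above, the following sketch runs a Pearson chi-square test of independence on a 2x2 table using only the Python standard library. The counts, the function name `chi_square_2x2` and the 0.05 significance level are assumptions made for the example, and no continuity correction is applied.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom; for 1 degree of freedom the
    # upper-tail probability of the chi-square distribution is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: rows are two groups, columns are "yes" / "no" outcomes.
observed = [[30, 70], [50, 50]]
statistic, p_value = chi_square_2x2(observed)

# Assumed significance level of 0.05 for the decision.
reject_null = p_value < 0.05
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}, reject: {reject_null}")
```

Running many such tests on the same data set is exactly where the false discovery rate becomes a concern: at a 0.05 level, roughly one in twenty true null hypotheses would be rejected by chance alone.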
Big data was originally associated with three 'v' concepts: volume, variety, and velocity. Two further 'v' attributes are veracity (i.e. how much noise is in the data) and value. With these five attributes, big data goes far beyond statistical analysis; its use currently involves many kinds of analytics, for example predictive analytics and user behavior analytics. These analytics are especially suited to shaping commercial strategy, because data sets are now growing rapidly as information gathering becomes cheaper and more widespread.
The basic ecosystem of big data includes the following components:
● Techniques for data analytics, of which machine learning is a good example
● Big data technologies such as business intelligence, cloud computing and databases
● Data visualization, for example in the form of charts and graphs
Multi-dimensional big data can also be represented as data cubes or, in algebraic form, as tensors. Array database systems have been developed to provide storage and high-level query support for this data type. Other technologies applied to big data include efficient tensor-based computation such as multilinear subspace learning, massively parallel processing (MPP) databases, and data mining. Several of these are also building blocks of machine learning.
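To make the data cube idea concrete, here is a minimal sketch that stores a three-dimensional cube as nested Python lists and performs two typical operations: a slice along one axis and a roll-up (sum) over another. The axis names, the figures and the helper functions are all hypothetical; a real array database would expose such operations through its own query language.

```python
# Hypothetical sales cube with axes (region, product, quarter).
regions = ["north", "south"]
products = ["widget", "gadget", "gizmo"]
quarters = ["Q1", "Q2"]

cube = [
    [[10, 12], [5, 7], [0, 3]],  # north: one [Q1, Q2] pair per product
    [[8, 9], [6, 4], [2, 2]],    # south
]

def shape(tensor):
    """Return the size of each axis of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

def slice_region(cube, region):
    """Fix the region axis, leaving a 2-D product x quarter table."""
    return cube[regions.index(region)]

def roll_up_quarters(cube):
    """Sum over the quarter axis, leaving a 2-D region x product table."""
    return [[sum(cell) for cell in region] for region in cube]

cube_shape = shape(cube)
north_table = slice_region(cube, "north")
totals = roll_up_quarters(cube)

print(cube_shape)   # (2, 3, 2)
print(north_table)  # [[10, 12], [5, 7], [0, 3]]
print(totals)       # [[22, 12, 3], [17, 10, 4]]
```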
Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task effectively without explicit instructions, relying on patterns and inference instead. In other words, the computer learns to make predictions or decisions from a mathematical model built on sample data, known as "training data". When the training data comes with known outputs (labels), this is called supervised learning, because the learning process is supervised by those examples. Supervised learning is used for regression and classification, which correspond to prediction and decision respectively.
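The two supervised tasks mentioned above can be sketched in a few lines of standard-library Python. The example below fits a straight line by ordinary least squares (regression) and labels a new point by its single nearest neighbour (classification). These are two simple techniques chosen for illustration, not the only or necessarily the best choices; the data and function names are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def nearest_neighbour_label(train_points, train_labels, query):
    """Return the label of the training point closest to the query."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best = min(
        range(len(train_points)),
        key=lambda i: squared_distance(train_points[i], query),
    )
    return train_labels[best]

# Regression: hypothetical training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # prediction for an unseen input x = 5

# Classification: hypothetical labelled points in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
labels = ["small", "small", "large", "large"]
decision = nearest_neighbour_label(points, labels, (7.0, 9.0))

print(slope, intercept, predicted, decision)
```

In both cases the labels in the training data (the `ys` values and the `labels` list) are what makes the learning supervised.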
The part of machine learning where the training data carries no labels is called 'unsupervised learning'. This is where machine learning and data mining overlap significantly, which is why some regard data mining as a field of study within machine learning. Unsupervised learning, like data mining, focuses on exploratory data analysis to uncover rules and knowledge that were previously unknown. Unlike classification in supervised learning, clustering uses unlabelled data to generate categories through specific algorithms. Unsupervised learning is also widely used for dimensionality reduction, which benefits big data visualization.
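As one example of clustering unlabelled data, the sketch below implements a basic k-means loop in standard-library Python. k-means is only one of many clustering algorithms and is picked here for brevity; the points and the choice of initial centroids are assumptions, and practical implementations usually add random restarts and better initialization.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def k_means(points, initial_centroids, max_iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)

        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # Coordinate-wise mean of the cluster members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break  # assignments are stable, so the loop has converged
        centroids = new_centroids

    assignments = [closest(point, centroids) for point in points]
    return centroids, assignments

# Hypothetical unlabelled points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = k_means(points, [points[0], points[3]])
print(centroids, assignments)
```

No labels are supplied anywhere: the two categories emerge from the distances between the points alone.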
Machine learning includes other paradigms as well, e.g. semi-supervised learning and reinforcement learning. Together, these learning methods support a wide variety of applications, such as email filtering, detection of network intruders, and computer vision. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Although carrying out machine learning on big data remains difficult, many approaches and technologies have been developed. With both business opportunities and technical challenges ahead, a new age is coming.