A blog post entitled "Lessons learned developing a practical large scale machine learning system" was recently published on the Google Research blog. The author appears to be a member of Google's machine learning team, and the post summarizes some lessons the team learned while developing Seti, a scalable large-scale machine learning system. The three lessons listed may seem simple and obvious, but in practice we are often subject to various “temptations” and drift to the opposite extreme. A typical scenario: engineers who have accumulated project experience in front-line work have an intuitive feel for these principles, but have never seen them summarized well or stated by an authoritative voice. As a result they can be led astray by attractive ideas that exist only on paper, and a lot of hard work is wasted. The simplest truths are the ones people violate most often, so I want to write them down: on the one hand to remind myself to resist such temptations and avoid these mistakes, and on the other to give friends working on similar applications an authoritative summary that corroborates their own intuitions.
What follows is mostly a free translation that stays faithful to the original post, interspersed with some of my own views.
The machine learning system mentioned here actually refers to a classification system (or pattern recognition system). In most cases the two terms can be treated as equivalent, although a reply to the post pointed out that the scope of machine learning is broader than that of pattern recognition. Which one is broader does not matter here; it is enough to know that the machine learning system discussed below, Seti, is mainly a classifier. It has the following features:
For a prediction or classification problem, if you do not have enough data, you have to focus on making full use of statistical knowledge to build a sophisticated classifier from a small number of samples. If, on the contrary, your data volume explodes, you have to pay attention to how your system can cope with such a large sample size and still mine useful information from it. The scale of the problems Seti solves is roughly as described in the following table:
          Training set size   Unique features
Mean      100 Billion         1 Billion
Median    1 Billion           10 Million
In general, a good machine learning system is expected to emphasize accuracy, but with a large-scale system a one-sided emphasis on accuracy easily leads to mistakes. Here are some lessons we learned from the development of Seti. Some of them were only summed up afterwards; we did not realize them at the time. (Note: the author seems to be saying that some factors cannot be ignored and matter as much as accuracy, or even more.)
1. Keep the system simple, even if it means a loss of accuracy. (Keep it simple, even at the expense of a little accuracy)
Temptation: It is very important that your classifier has high accuracy in different fields of application, so we should focus on the accuracy of the algorithm.
However, in practice several other aspects of an algorithm are equally important:
Ease of use: If other people or other teams use the system, they will want it to be simple to configure and use. They may not be machine learning experts, so they do not want to waste time getting the system started and keeping it running.
System reliability: When deploying a machine learning system in a production environment, everyone cares most about reliability. The system must be stable, and nobody should have to keep checking whether it has crashed. Early versions of Seti had better accuracy, but their complexity, the pressure they put on the network and the GFS file system, and their need for constant attention made many people reluctant to deploy them.
(In many cases the two points above can be treated as equivalent: a system's ease of use is roughly the same thing as its stability and reliability.)
Seti is usually applied where it greatly improves on the existing system (see Lesson 3), so people care less about the subtle differences in accuracy that come from the particular algorithm Seti uses. Moreover, these small differences in accuracy can often be made up by other means, such as better data filtering, adding more appropriate features, or tuning parameters. If the system is stable, scalable, and easy to use, these additional steps are also easier to carry out, and these system characteristics often determine whether a team accepts or abandons it.
For the academic community, designing a less accurate but more stable and easier-to-use algorithm is not an attractive goal, but in our experience it has great value in practice.
2. Start with some current specific applications. (Start with a few specific applications in mind)
Temptation: Build a system that is not tied to any particular application, so that it covers not only current applications but also all kinds of future classification tasks.
However, we decided to focus on a handful of initial applications. This decision is based on several reasons:
3. Know when to say "no". (Know when to say “no”)
Temptation: We have a hammer, so everything we see looks like a nail. Any problem can be solved with a machine learning system.
We discovered long ago that although a machine learning system brings significant benefits, it also brings complexity, opacity, and unpredictability to the system as a whole. In some cases a simple technique is enough to solve the problem at hand. In the long run, the effort spent integrating, maintaining, and diagnosing a live machine learning system may be better spent on other ways of improving system performance.
Seti is appropriate when it offers a clear improvement in prediction quality over the current system, and we often recommend against applying it where the improvement is not obvious.
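To make the "say no" rule concrete, here is a minimal sketch of how such a decision could be checked on held-out labeled data. Everything in it is an assumption made for illustration: the function names, the sample data, and in particular the `min_gain` threshold, since the original post gives no number for what counts as a clear improvement.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def worth_deploying(baseline_preds, candidate_preds, labels, min_gain=0.2):
    """Return True only if the candidate beats the baseline by at least min_gain.

    min_gain is a hypothetical threshold chosen for this sketch.
    """
    gain = accuracy(candidate_preds, labels) - accuracy(baseline_preds, labels)
    return gain >= min_gain


# Made-up held-out labels and predictions, for illustration only.
labels    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
baseline  = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # existing simple system: 7 of 10 correct
small_win = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]  # learned model: 8 of 10 correct
big_win   = list(labels)                    # learned model: 10 of 10 correct

print(worth_deploying(baseline, small_win, labels))
print(worth_deploying(baseline, big_win, labels))
```

With this threshold the one-point gain of `small_win` is rejected and only `big_win` is accepted, which is the spirit of the lesson: a marginal gain does not justify the added complexity.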
Supplement 1: When I saw the scale of the data Seti is applied to, my first reaction was to wonder how such a large amount of labeled data could be obtained, since training a classifier requires a labeled data set. The post also talks about classifier accuracy, and measuring accuracy requires labeled data as well. My guess is that in some cases the labels come from clicks that Google users contribute without intending to, and in others a semi-supervised learning method is used: training starts from a small, manually labeled data set and is then extended to cover the full data set.
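The semi-supervised idea guessed at above can be sketched as self-training: fit a model on the small labeled seed, adopt its confident predictions on unlabeled points as new labels, and repeat. This is only an illustration of the general technique, not a description of how Seti works. The one-dimensional nearest-centroid classifier, the `min_margin` confidence threshold, and the sample data are all assumptions chosen to keep the example short.

```python
def centroids(points, labels):
    """Mean of the points in each class (labels are 0 or 1).

    Assumes both classes are present in the labeled data.
    """
    means = {}
    for cls in (0, 1):
        members = [x for x, y in zip(points, labels) if y == cls]
        means[cls] = sum(members) / len(members)
    return means


def predict(means, x):
    """Return (label, margin); the margin is the gap between the two distances."""
    d0, d1 = abs(x - means[0]), abs(x - means[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)


def self_train(seed_points, seed_labels, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the labeled set by repeatedly adopting confident predictions."""
    points, labels = list(seed_points), list(seed_labels)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        means = centroids(points, labels)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict(means, x)
            (confident if margin >= min_margin else remaining).append((x, label))
        if not confident:
            break  # nothing new can be labeled with enough confidence
        for x, label in confident:
            points.append(x)
            labels.append(label)
        pool = [x for x, _ in remaining]
    # Return the final class means and the points that stayed unlabeled.
    return centroids(points, labels), pool


# Two manually labeled seeds and five unlabeled points (made-up data).
means, leftover = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 8.0, 9.0, 5.2])
print(means, leftover)
```

Points near the class boundary, such as 5.2 here, never reach the confidence threshold and stay unlabeled, which is one reason this approach still needs some trustworthy labeled data for measuring accuracy.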
Supplement 2: A major point of the post is that commercial systems and academia pursue different goals. The academic community tends to explore how to obtain statistically significant results even when data is scarce, and accuracy is always the most important criterion. Industry, where data is plentiful, needs to focus on how to filter valuable information out of noisy data. At the same time, the system must be scalable, so accuracy is no longer the only important factor.