Deep Learning Achievements Over the Past Year

Great developments in text, voice, and computer vision technologies

The key outcome: closing the gap with human translation accuracy by 55–85% (as estimated by people on a 6-point scale). It is difficult to reproduce good results with this model without the huge dataset that Google has.

For training, they collected a dataset of human negotiations and trained a supervised recurrent network on it. Then they took the agent and trained it with reinforcement learning to negotiate with itself, with one constraint: its language had to stay similar to human language.
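A rough sketch of that recipe with a toy model (the names, sizes, reward, and the exact form of the language constraint are placeholders, not the paper's implementation): fine-tune the dialogue policy with REINFORCE on the negotiation outcome while penalizing divergence from the supervised model, which keeps the generated language close to human language.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MAX_LEN = 100, 64, 20

class TinyDialogueModel(nn.Module):
    """Toy recurrent language model standing in for the negotiation network."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def step(self, token, h):
        h = self.rnn(self.embed(token), h)
        return self.out(h), h

supervised = TinyDialogueModel()            # assumed pretrained on human dialogues
policy = TinyDialogueModel()                # starts from the supervised weights, then fine-tuned
policy.load_state_dict(supervised.state_dict())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def negotiation_reward(tokens):
    return torch.rand(())                   # placeholder: how good the final deal was

# One REINFORCE update from a self-play rollout.
h, h_ref = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)
token = torch.zeros(1, dtype=torch.long)
log_probs, drift, tokens = [], [], []
for _ in range(MAX_LEN):
    logits, h = policy.step(token, h)
    with torch.no_grad():
        ref_logits, h_ref = supervised.step(token, h_ref)
    dist = torch.distributions.Categorical(logits=logits)
    token = dist.sample()
    tokens.append(token)
    log_probs.append(dist.log_prob(token))
    # Penalty for drifting away from human-like language (the "similarity" constraint).
    drift.append(F.kl_div(F.log_softmax(logits, -1),
                          F.softmax(ref_logits, -1), reduction="batchmean"))

reward = negotiation_reward(tokens) - 0.1 * torch.stack(drift).mean().detach()
loss = -(reward * torch.stack(log_probs).sum())
opt.zero_grad(); loss.backward(); opt.step()
```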

The bot learned one of the real negotiation strategies: feigning interest in certain aspects of the deal, only to give them up later and gain on its real goals. This was the first attempt to create such an interactive bot, and it was quite successful.

Certainly, the news that the bot had allegedly invented its own language was blown out of proportion. During training (in negotiations with the same agent), they disabled the constraint that the text stay similar to human language, and the algorithm modified the language of interaction. Nothing unusual.

The network was trained end-to-end: text in, audio out. The researchers got an excellent result: the gap to human speech was reduced by 50%.

The main disadvantage of the network is low performance: because of the autoregression, samples are generated sequentially, and it takes about 1–2 minutes to create one second of audio.
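To see why, here is a sketch of such a sampling loop with a toy recurrent model standing in for the real architecture: every one of the thousands of samples in a second of audio depends on the previously generated one, so the loop cannot be parallelized.

```python
import torch
import torch.nn as nn

# Toy causal model standing in for the real autoregressive audio network.
model = nn.GRU(input_size=1, hidden_size=32, batch_first=True)
to_sample = nn.Linear(32, 1)

sample_rate = 16000          # one second of audio = 16,000 sequential steps
audio = [torch.zeros(1, 1, 1)]
h = None
with torch.no_grad():
    for _ in range(sample_rate):
        out, h = model(audio[-1], h)     # each step depends on the previous sample,
        audio.append(to_sample(out))     # so generation is strictly sequential
```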

If you remove the network's dependence on the input text and leave only the dependence on the previously generated phonemes, the network will produce phonemes that sound like human language, but they will be meaningless.

Lip reading is another deep learning achievement and victory over humans.

There are 100,000 sentences with audio and video in the dataset. Model: LSTM on audio, and CNN + LSTM on video. These two state vectors are fed to the final LSTM, which generates the result (characters).
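A rough sketch of that fusion in code, with a toy model, made-up feature sizes, and a single decoding step instead of a full character decoder:

```python
import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Sketch of the audio/video fusion described above, not the paper's exact model."""
    def __init__(self, n_chars=40, hidden=256):
        super().__init__()
        self.audio_lstm = nn.LSTM(input_size=80, hidden_size=hidden, batch_first=True)
        # Per-frame CNN over mouth crops, followed by an LSTM over time.
        self.video_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.video_lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        # Final LSTM consumes the two state vectors and emits character logits.
        self.fusion_lstm = nn.LSTM(input_size=2 * hidden, hidden_size=hidden, batch_first=True)
        self.char_head = nn.Linear(hidden, n_chars)

    def forward(self, audio, video):
        # audio: (B, T_a, 80) spectrogram frames; video: (B, T_v, 3, H, W) mouth crops.
        _, (h_audio, _) = self.audio_lstm(audio)
        b, t, c, h, w = video.shape
        frame_feats = self.video_cnn(video.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_video, _) = self.video_lstm(frame_feats)
        fused = torch.cat([h_audio[-1], h_video[-1]], dim=-1).unsqueeze(1)
        out, _ = self.fusion_lstm(fused)
        return self.char_head(out)   # one step of character logits; a real decoder unrolls further
```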

Different types of input data were used during training: audio, video, and audio + video. In other words, it is an “omnichannel” model.

They couldn’t get by with the network alone, as it produced too many artifacts. Therefore, the authors of the article added several crutches (or tricks, if you like) to improve the texture and the timings.

To recognize each sign, the network uses up to four photos of it. The features are extracted with a CNN, weighted with spatial attention (pixel coordinates are taken into account), and the result is fed to the LSTM.
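A sketch of that pipeline; the exact attention form, layer sizes, and the way coordinates are injected are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnSignReader(nn.Module):
    """Sketch: CNN features + spatial attention (with coordinates) + LSTM over up to 4 photos."""
    def __init__(self, hidden=128, n_classes=1000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.attn = nn.Linear(64 + 2 + hidden, 1)   # features + (x, y) coordinates + LSTM state
        self.lstm = nn.LSTMCell(64, hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, photos):
        # photos: (B, N, 3, H, W), up to N=4 shots of the same sign.
        b, n = photos.shape[:2]
        h = photos.new_zeros(b, self.lstm.hidden_size)
        c = photos.new_zeros(b, self.lstm.hidden_size)
        for i in range(n):
            feats = self.cnn(photos[:, i])                        # (B, 64, H', W')
            bb, ch, hh, ww = feats.shape
            ys, xs = torch.meshgrid(torch.linspace(0, 1, hh),
                                    torch.linspace(0, 1, ww), indexing="ij")
            coords = torch.stack([xs, ys]).to(feats).expand(bb, 2, hh, ww)
            flat = torch.cat([feats, coords], dim=1).flatten(2).transpose(1, 2)  # (B, H'*W', 66)
            state = h.unsqueeze(1).expand(-1, flat.size(1), -1)
            weights = F.softmax(self.attn(torch.cat([flat, state], -1)), dim=1)  # where to look
            context = (weights * feats.flatten(2).transpose(1, 2)).sum(1)        # (B, 64)
            h, c = self.lstm(context, (h, c))
        return self.head(h)
```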

The same approach is applied to the task of recognizing store names on signboards (there can be a lot of “noise” data, and the network itself must “focus” in the right places). This algorithm was applied to 80 billion photos.

There is a type of task called visual reasoning, where a neural network is asked to answer a question about a photo. For example: “Is there a rubber thing in the picture that is the same size as the yellow metal cylinder?” The question is truly nontrivial, and until recently, the problem was solved with an accuracy of only 68.5%.

The network architecture is very interesting.

This is an extremely useful application of neural networks that can make life easier when developing software. The authors claim they reached 77% accuracy. However, this is still under research, and there is no talk of real-world usage yet.

The code and dataset are not open source yet, but the authors promise to release them.

Researchers trained a sequence-to-sequence Variational Autoencoder (VAE) that uses an RNN for encoding and decoding.

In the end, as befits an autoencoder, the model produces a latent vector that characterizes the original sketch.

Since the decoder can reconstruct a drawing from this vector, you can modify the vector and get new sketches.

You can even perform vector arithmetic to create a catpig.
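A minimal sketch of such a sequence-to-sequence VAE, with a GRU standing in for the encoder and decoder RNNs and a simplified stroke format; once trained, you encode a sketch, modify or combine latent vectors, and decode the result:

```python
import torch
import torch.nn as nn

class SketchVAE(nn.Module):
    """Toy sequence-to-sequence VAE: RNN encoder -> latent vector z -> RNN decoder."""
    def __init__(self, stroke_dim=5, hidden=256, z_dim=128):
        super().__init__()
        self.encoder = nn.GRU(stroke_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.z_to_h = nn.Linear(z_dim, hidden)
        self.decoder = nn.GRU(stroke_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, stroke_dim)

    def encode(self, strokes):
        _, h = self.encoder(strokes)                   # strokes: (B, T, 5)
        return self.to_mu(h[-1]), self.to_logvar(h[-1])

    def decode(self, z, strokes):
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)   # the latent vector seeds the decoder
        out, _ = self.decoder(strokes, h0)
        return self.out(out)

    def forward(self, strokes):
        mu, logvar = self.encode(strokes)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparameterization trick
        recon = self.decode(z, strokes)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl
```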

One of the hottest topics in Deep Learning is Generative Adversarial Networks (GANs). Most often, this idea is used to work with images, so I will explain the concept using them.

The idea lies in the competition between two networks: the generator and the discriminator. The first creates a picture, and the second tries to determine whether that picture is real or generated.

Schematically it looks like this:

During training, the generator produces an image from a random vector (noise) and feeds it to the discriminator, which says whether it is fake or real. The discriminator is also given real images from the dataset.

It is difficult to train such a construction, as it is hard to find the equilibrium point between the two networks. Most often the discriminator wins and training stagnates. However, the advantage of the system is that it lets us solve problems where it is difficult to define a loss function (for example, improving the quality of a photo): we hand that job to the discriminator.
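A minimal sketch of one training iteration, with toy fully connected networks (architectures and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

z_dim, img_dim = 64, 28 * 28
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.empty(32, img_dim).uniform_(-1, 1)   # stand-in for a batch from the dataset

# Discriminator step: real images should score 1, generated images 0.
fake = G(torch.randn(32, z_dim)).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator call generated images real.
fake = G(torch.randn(32, z_dim))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```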

A classic example of GAN training results is pictures of bedrooms or people.

Previously, we considered autoencoding (Sketch-RNN), which encodes the original data into a latent representation. The same thing happens with the generator.

The same arithmetic works over the latent space: “a man in glasses” minus “a man” plus a “woman” is equal to “a woman with glasses.”

If you add a controlled parameter to the latent vector during training, you can change that parameter at generation time and so control the desired attribute of the picture. This approach is called a conditional GAN.
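One common way to do this (a sketch, not any specific paper's recipe) is to concatenate the condition, here a one-hot class label, to the noise vector and show the same condition to the discriminator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, n_classes, img_dim = 64, 10, 28 * 28

# Generator and discriminator both see the condition (a one-hot class label).
G = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

labels = torch.randint(0, n_classes, (32,))
cond = F.one_hot(labels, n_classes).float()

z = torch.randn(32, z_dim)
fake = G(torch.cat([z, cond], dim=1))        # the condition controls what gets generated
score = D(torch.cat([fake, cond], dim=1))    # the discriminator judges the (image, condition) pair

# At generation time, changing `cond` while keeping `z` fixed changes only the controlled attribute.
```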

A trained algorithm went through Google Street View panoramas in search of the best composition and produced some pictures of professional and semi-professional quality (according to photographers’ ratings).

An impressive example of GANs is generating images using text.

Here is another example of conditional GANs working well. In this case, the condition is an entire picture. UNet, popular in image segmentation, was used as the generator architecture, and a new PatchGAN classifier was used as the discriminator to combat blurry images (the picture is cut into N patches, and the fake/real prediction is made for each of them separately).
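The PatchGAN part can be sketched as a fully convolutional discriminator whose output is a grid of per-patch real/fake scores rather than a single number (channel counts and image sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Fully convolutional discriminator: each output cell judges one receptive-field patch.
patch_discriminator = nn.Sequential(
    nn.Conv2d(3 + 3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # input: condition + candidate image
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, 4, stride=1, padding=1),                         # (B, 1, N, N) patch map
)

condition = torch.rand(1, 3, 256, 256)   # e.g. the input sketch or map
candidate = torch.rand(1, 3, 256, 256)   # real target or generator output
patch_scores = patch_discriminator(torch.cat([condition, candidate], dim=1))
print(patch_scores.shape)                # each cell is a separate real/fake prediction
```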

In order to apply Pix2Pix, you need a dataset with corresponding pairs of pictures from different domains. In the case of maps, for example, it is not a problem to assemble such a dataset. However, if you want to do something more complicated like “transfiguring” objects or style transfer, then pairs of objects cannot be found in principle.

The idea is to train two generator-discriminator pairs to transfer an image from one domain to the other and back, while requiring cycle consistency: after sequential application of the two generators, we should get an image similar to the original in terms of L1 loss. This cyclic loss ensures that the generator doesn't simply map pictures of one domain to pictures of the other domain that are completely unrelated to the original image.
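The cycle-consistency term itself is simple; here is a sketch with placeholder generators (single convolutions standing in for the real image-to-image networks):

```python
import torch
import torch.nn as nn

# Placeholder generators; in practice these are full image-to-image networks.
G_ab = nn.Conv2d(3, 3, 3, padding=1)   # domain A -> domain B
G_ba = nn.Conv2d(3, 3, 3, padding=1)   # domain B -> domain A
l1 = nn.L1Loss()

real_a = torch.rand(1, 3, 128, 128)
real_b = torch.rand(1, 3, 128, 128)

# After going to the other domain and back, the image should match the original (L1).
cycle_loss = l1(G_ba(G_ab(real_a)), real_a) + l1(G_ab(G_ba(real_b)), real_b)
# The full objective adds the usual adversarial losses for G_ab and G_ba.
```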

This approach allows you to learn the mapping of horses -> zebras.

Such transformations are unstable and often produce unsuccessful results.

Machine learning is now coming to medicine. In addition to interpreting ultrasound and MRI scans and making diagnoses, it can be used to find new drugs to fight cancer.

The topic of adversarial attacks is being actively explored. What are adversarial attacks? Standard networks trained, for example, on ImageNet are completely unstable when special noise is added to the classified picture. In the example below, the picture with noise looks practically unchanged to the human eye, but the model goes crazy and predicts a completely different class.
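The article doesn't single out a particular attack; the simplest illustration is the fast gradient sign method, sketched here with a placeholder classifier: nudge each pixel slightly in the direction that increases the model's loss.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # placeholder classifier
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 32, 32, requires_grad=True)
true_label = torch.tensor([3])

loss = loss_fn(model(image), true_label)
loss.backward()

# Tiny perturbation in the direction that increases the loss; looks unchanged to a human.
epsilon = 0.01
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
```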

Why should we even investigate these attacks? First, if we want to protect our products, we can add noise to a captcha to prevent spammers from recognizing it automatically. Second, algorithms are more and more involved in our lives, from face recognition systems to self-driving cars, and attackers can exploit the shortcomings of these algorithms.

Here is an example of how special glasses can deceive a face recognition system and let you “pass yourself off as another person.” So, we need to take possible attacks into account when training models.

Similar manipulations of road signs also prevent them from being recognized correctly.

Reinforcement learning (RL) is also one of the most interesting and actively developing approaches in machine learning.

The essence of the approach is for the agent to learn successful behavior through experience, in an environment that gives it rewards, just as people learn throughout their lives.
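Schematically, that agent-environment loop looks like this (a generic sketch with a made-up toy environment, not any specific algorithm from the article):

```python
import random

class ToyEnvironment:
    """Stand-in environment: the agent should learn that action 1 pays off."""
    def reset(self):
        self.steps = 0
        return 0                                  # initial observation

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else 0.0      # the environment gives a reward
        done = self.steps >= 10
        return 0, reward, done                    # observation, reward, end of episode

env = ToyEnvironment()
q = [0.0, 0.0]                                    # value estimate for each action

for episode in range(100):
    obs, done = env.reset(), False
    while not done:
        # Mostly pick the best-known action, sometimes explore.
        action = random.randrange(2) if random.random() < 0.1 else q.index(max(q))
        obs, reward, done = env.step(action)
        q[action] += 0.1 * (reward - q[action])   # learn from experience
```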

RL is actively used in games, robots, and system management (traffic, for example).

Much attention is paid to accelerating learning, because accumulating the agent's experience of interacting with the environment requires many hours of training on modern GPUs.


4.2. Learning robots
At OpenAI, they have been actively studying the training of agents by humans in a virtual environment, which is safer for experiments than real life.

If only it were so easy with people. :)

As always, the human must be careful and think about what he is teaching the machine. For example, the evaluator decided that the algorithm really wanted to pick up the object, when in fact it was only simulating that action.

Researchers managed to teach agents (emulated bodies) to perform complex actions by constructing a complex environment with obstacles and giving a simple reward for progress in movement.

Based on information from thousands of sensors in the data center, Google developers trained an ensemble of neural networks to predict PUE (Power Usage Effectiveness) and to manage the data center more efficiently. This is an impressive and significant example of the practical application of ML.
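In spirit, this is regression over sensor readings with an ensemble of networks; a toy sketch with made-up sensors and a synthetic PUE target (nothing here reflects Google's actual setup):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
sensors = rng.random((5000, 20))                 # e.g. temperatures, pump speeds, loads
# Synthetic target: in reality PUE comes from measured facility vs. IT power.
pue = 1.1 + 0.3 * sensors[:, :5].mean(axis=1) + 0.01 * rng.standard_normal(5000)

# Small ensemble of neural networks; the averaged prediction guides cooling decisions.
ensemble = [MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=i).fit(sensors, pue)
            for i in range(5)]
predicted_pue = np.mean([m.predict(sensors[:10]) for m in ensemble], axis=0)
```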

Researchers have trained a model that performs eight tasks from different domains (text, speech, and images): for example, translation between languages, text parsing, and image and speech recognition.


In their post, Facebook staff described how their engineers were able to train the ResNet-50 model on ImageNet in just one hour. Truth be told, this required a cluster of 256 GPUs (Tesla P100).

As a result, they achieved 90% scaling efficiency going from 8 to 256 GPUs. Now researchers at Facebook can experiment even faster, unlike mere mortals without such a cluster.

As for more recent events, self-driving cars have been allowed to travel across all US states.

Currently, there is heavy investment in ML, just as there once was in Big Data.

China invested $150 billion in AI to become the world leader in the industry.
