
Language Learning with BERT – TensorFlow and Deep Learning Singapore

What we're trying to do is, probably next month, we'll have something which is definitely for beginners, like how to make your first CNN model, all of that kind of thing. So next month will definitely be more of a beginner thing. Hopefully there'll be tips and tricks which are useful to everyone.

This talk is pitched at a somewhat beginner level, but it's also kind of a revelation in terms of what's happened in natural language processing. So, about me: I have a background in machine learning, startups and finance. I came here in 2013-2014. I was just having fun reading papers and doing robots and drones and stuff, and since 2015 I've been serious about natural language processing.

I am a Google Developer Expert for machine learning. Sam and I organize this meetup together. I've been writing some papers, including this year; we've now got four papers into NeurIPS, so that's quite good for a small company. We also do a developer course, which is something I'll talk about a little bit. And Red Dragon, which I think Sam will talk more about, is a Google partner: what we do is we quite like to develop prototypes for people, but we're also very interested in conversational computing, natural voices and knowledge bases. So that's kind of what we do when we're not so busy with this.

So that's the "Who am I?" part. What I'm going to talk about is traditional deep natural language processing. Now, "traditional" here only goes back about four or five years, because this stuff didn't exist before 2013; these techniques were developed then. So when I say traditional, I mean this is what most of the courses online will teach you, the traditional style. Okay, but it's all changed this year, and in particular this summer.

So this is why in-person courses are quite good. I'll talk about the traditional thing, I'll talk about what the innovations have been this year, and then the new hotness, which is BERT, coming from Google. There's probably no time for actual code, but I'll give you some hints. Okay, so: traditional deep natural language processing. The key elements here will be embeddings (super useful), bidirectional LSTM layers, and then we need to talk about initialization and training.

So a traditional model will look something like this. Now, how many people know what I mean when I say LSTM? Oh, quite a lot, okay. And how many people know exactly what this diagram does, or at least have a good idea of what it does? Yeah, pretty much the same, three or four? Okay. So basically, this is a fairly standard model, in that you put in words at the bottom.

These are words going in at the bottom, which are converted into an embedding. This embedding will then feed into an LSTM RNN going this way, and one going backwards as well. These two things will then be added up, and then this will be fed into some kind of classifier; here it's a CRF and then a classifier. So this is kind of how people have been doing this for a number of years now, and it's been quite effective.
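As a concrete illustration, here is a minimal Keras sketch of that traditional architecture, assuming a per-token classification setup with a made-up vocabulary size and label count, and with a plain softmax layer standing in for the CRF mentioned above:

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # hypothetical vocabulary size
NUM_LABELS = 10      # hypothetical number of output tags

model = tf.keras.Sequential([
    # Words (as integer ids) go in at the bottom and become embeddings.
    tf.keras.layers.Embedding(VOCAB_SIZE, 300, mask_zero=True),
    # Forward and backward LSTMs, summed together as described in the talk.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True), merge_mode="sum"),
    # A per-token classifier stands in for the CRF layer on the slide.
    tf.keras.layers.Dense(NUM_LABELS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```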

So clearly this has been beating the traditional natural language processing, TF-IDF kind of methods, which were totally antiquated at that point. But this approach is itself becoming less and less useful now. So: word embeddings. Who here has seen word embeddings? Quite a lot, okay. The idea of a word embedding, just as a recap, is that for words in text, if a word occurs close to other words in a big corpus, words in a neighborhood should be similar to each other in some way.

So this is kind of a big theme: this principle should apply to all text. What you do to make a word embedding is assign a vector, an initially random vector, to every word. You can think of this as being like an Excel spreadsheet: down in the A column you'd have every word, so you start with "the", "and", "of", a little further down you'd have "city", and further down

you'd have, you know, "space station"; you're going down the list of words as they get rarer and rarer, and in English you'll have a couple of hundred thousand words. Across each row, though, you'll have something like 300 different numbers, which will be the vector representation of that word. Now, in order to figure out what the numbers should be, you take a window and you slide it across your text, and within that window you say: well, the numbers within this window should kind of average out to indicate each other.

They should kind of predict the words within the window, and words outside the window will actually be pushed away. So there's a way of sampling this so that gradually, as you read this huge corpus, these vectors all get pushed and shuffled around. What happens is, if you give it a decent-sized corpus, which could be Wikipedia for starters, this vector space will self-organize, and in the end you'll get something like this.

This is a representation from a thing called TensorBoard, which is part of TensorFlow, and basically it can display the cloud of all words in the English language. You can also say, well, I'm interested in words which are near, say, the word "important", and because it's been self-organized, it will know that "significance", "in particular" and "essential" are all pretty close.
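As a rough sketch of how you might train such embeddings and query neighbours yourself, here is a gensim word2vec example (gensim is not mentioned in the talk, the corpus below is a tiny stand-in, and the parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# A stand-in corpus: in practice this would be something like tokenized Wikipedia.
sentences = [
    ["the", "treaty", "was", "an", "important", "step"],
    ["the", "agreement", "was", "an", "essential", "step"],
    # ... many millions more sentences ...
]

# Skip-gram with a 5-word window; the vectors start random and self-organize.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

# Ask for the nearest neighbours of a word in the learned vector space.
print(model.wv.most_similar("important", topn=5))
```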

So this is a way, essentially for free, without any linguistic knowledge apart from giving it a copy of Wikipedia, of self-organizing a whole map of the English language. What you can then do with those word embeddings, that numeric representation, is put it into an LSTM, which is essentially a network that is the same at every time step but passes forward a state each time.

So as it reads the sentence, gradually the state gets more and more feature-rich. Now, one issue with that is that the end of the sentence becomes very feature-rich, while at the beginning you hardly know anything about what's going to happen, which is why you run these in both directions, so you get features from both ends. One of the problems with this LSTM unit, though, is that if you're doing a big computation, you don't know what the answer is at this end until you've calculated the one at the beginning. So this forces a kind of speed limit on how fast you can train these things, which is embarrassing if you're using a TPU, which doesn't want to do this kind of thing; unrolling it forces the sequential calculation.

In terms of initialization, most people will just say: well, I've got all of this really good stuff in the embedding, and that's the only free lunch I've got.

Apart from that, my weights in the network are just random; I'm going to use the data for my training task to learn the task. Typically this will need quite a lot of training examples, because (suppose I'm interested in movie sentiment) if I just give the network a collection of words, it may have a good embedding for each of those words, but it doesn't know about the syntax of a sentence.

It doesn't know how things connect together. Basically, your examples of sentiment also have to teach it the semantics of the English language from the beginning, so it probably needs a lot of training samples. Okay, so that was the old style as of 2013. Let's talk about the innovations. There are several things which have come together, and I'm going to talk through each one of them: there's a thing called byte pair encoding, or SentencePiece, which has been released open source; and there are Transformers.

So, who here knows what a Transformer is? Okay, that's a more select crowd. There are also language modeling tasks, which have come to the fore this year mainly, and then there's the concept of fine-tuning, which is more like an April thing this year. Right, so: what byte pair encoding is (and this has existed for quite a while, but it's a technique which is useful) is the following. When I mentioned word embeddings before, what people do is they have a whole list of every word in the English language that they can think of, or every word in their vocabulary, and train it from scratch.

On the other hand, if I came along with a new word not in the vocabulary, I would just have to assign it an "unknown" label; I'd have no idea what its embedding would be. So if I knew the word "book" and the word "booking", but I wanted to know the word "rebooking", I would just have no clue. Whereas if I could split words up into little units of English (and English does this quite often), "rebooking" should be very related to "booking", and the "re-" kind of does something to the whole "booking".

So, by doing this kind of technique, I can now make a vocabulary which is effectively infinitely large. What I do here is start out with a character encoding: I'd say okay, 1 for "a", 2 for "b", 3 for "c", so I've got 26 characters, and I'd also have an end-of-word character. And suppose here's my initial vocabulary, where I have "low" repeated five times, "lowest" twice, "newer" six times, and "wider".

I want to represent this more efficiently. What I would notice, if I did the counting, is that "r" plus the end-of-word character occurs nine times, because of "newer" and "wider". So I then say: well, "r" plus end-of-word is a new symbol, so at the end of my a, b, c, d up to z, I'd have a new symbol for "r" plus end-of-word. I've just done a merge, and I've now replaced all the "r"s at the ends of words with this new symbol. And I could do this again and again: "er" would be repeated nine times, so "e" plus that new symbol gets replaced nine times, and now I've got an "er" sound.

So this is kind of useful for English, and you could imagine also that in a big corpus, "the" would be collapsed into a single symbol very quickly. So if I just do four steps of merging, basically this would allow me to take an out-of-vocabulary word like "lower" and it'll just be composed of "low" and "er", which is kind of what you'd want. And what people have found is that this works really well for the English language. So this enables you to build an effectively infinite dictionary just out of the components which arise naturally, and you can learn it without any linguistic knowledge; now you all understand how it works, and you can also read the paper and download the code from GitHub. This SentencePiece thing, which implements this technique and a couple of others, is a good way of creating these vocabularies.
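Here is a toy sketch of those merge steps, following the byte pair encoding procedure just described; the word counts mirror the example in the talk (the count for "wider" is assumed), and this is illustrative rather than the actual SentencePiece implementation:

```python
from collections import Counter

# Toy vocabulary: words as tuples of symbols, with "</w>" as the end-of-word marker.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "s", "t", "</w>"): 2,
    ("n", "e", "w", "e", "r", "</w>"): 6,
    ("w", "i", "d", "e", "r", "</w>"): 3,   # count assumed for illustration
}

def most_frequent_pair(vocab):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, count in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += count
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, count in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])   # e.g. "e" + "r" -> "er"
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = count
    return merged

for step in range(4):                  # "four steps of merging", as in the talk
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(pair, vocab)
    print(step + 1, pair)
```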

Sorry, all of these slides will be on the meetup link; underneath the meetup discussion, all the slides are going to be there. So you're welcome to take pictures, but you know. Okay, so SentencePiece is released. Now, the next thing is a Transformer. For a single sentence of pieces, we would do something with attention (drawn in the box here), then various norms and feed-forwards, and at the end we'd have some kind of projection. So this is a purely feed-forward kind of network, and in these papers you typically stack twelve sets of these, quite a large pile of these blocks.

These are all based on CNN-style operations rather than RNNs. So when you get to a proper model, basically you'll be taking in the words at the bottom, with some embedding. You then have this thing, which is attention: instead of an RNN, where you're rolling forwards the whole time, the attention is looking simultaneously at all words in the sentence to find out which is the most relevant word. So if I'm talking about going to the bank to make a check deposit, the sense of the word "deposit" depends on the word "bank", and the sense of "bank" depends on the word "check".

It's not really a linear relationship: basically, all of these words will kind of vote for being picked, and this multi-headed attention flows up through the model, and it's extremely powerful. There's a paper called "Attention Is All You Need" where they produce excellent results just using this attention mechanism, without any RNNs or anything like that.
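Here is a minimal sketch of the core mechanism from "Attention Is All You Need": scaled dot-product attention, where every position scores every other position and the results are combined as a weighted sum. A real Transformer block adds multiple heads, layer norms and feed-forward layers on top of this.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """q, k, v: [batch, seq_len, depth]; returns the attended values."""
    depth = tf.cast(tf.shape(k)[-1], tf.float32)
    # Every position scores every other position ("all the words vote").
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(depth)
    weights = tf.nn.softmax(scores, axis=-1)
    # Combine the value vectors according to those attention weights.
    return tf.matmul(weights, v)
```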

That is a pure feed-forward, CNN-style method. Now, the other trick that people have cottoned onto is: like embeddings, we're going to train on a huge corpus of text, but this time we're interested in getting the context a bit more solid. So what we do is train the entire network; instead of just training one vector at a time, we'll train the entire model over the entire text, and the kind of thing we'll do is get the model to predict

just one more word. So I'll give it a sentence, you know, "I went to the bank to deposit a ...", and I expect it to tell me "check", and if it doesn't, I say: no, you're wrong, the answer is "check". Then I ask what the next word is, which is probably a full stop; or the next words might be "otherwise I might run out of money", or so on. Basically, there's a whole set of things you might say after this, and a whole lot of words you would not say.

So, by doing this, you can just use a corpus of text to roll out an understanding of the English language in a fully unsupervised way. Instead of just learning these vectors, you've trained the entire model as well as the embeddings at the beginning. So that "predict the next word" was one language task. There's another one called a cloze task: basically, you take a sentence and you start deleting words and say, fill in the blanks.
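As a toy illustration of how cloze-style training examples can be generated (the "[MASK]" token and the 15% masking rate mirror what BERT uses, but this little sketch is just for intuition, not any real pipeline):

```python
import random

def make_cloze_example(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Blank out a fraction of tokens; the model must fill them back in."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)   # hide the word from the model
            targets.append(tok)         # ...but keep it as the training target
        else:
            inputs.append(tok)
            targets.append(None)        # no prediction needed at this position
    return inputs, targets

print(make_cloze_example("i went to the bank to deposit a check".split()))
```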

Another thing you might do is take two sentences, switch between them, and say: tell me where the switching point is. So if I started with a sentence about one thing and switched to another thing, you could say: okay, if you can identify where that happens, then you understand language better. All of these kinds of tricks are basically forcing it to understand language more and more, and you can do it almost for free.

Another thing you could use is novels: you could take a sentence which comes from one part of a novel and follow it with a sentence from a different novel. Now, if you can detect which one is which (like, is this a sensible sequence of sentences or not), you can actually get sense across large passages of text. So these things can be trained almost for free. In terms of linguistic knowledge, it's kind of obvious that this is a thing, but people weren't doing this a while ago; a year ago, people weren't playing this game.

So, another thing which came along (and this is partly due to fast.ai pushing these models) is that you can take a model which is pre-trained on a large corpus, which will be a language model, a "predict the next word" kind of model. Then, say I want to do movie sentiment: I would take a whole bunch of movie reviews, even if I didn't know whether they were good or bad, just as many reviews as I could find anywhere, and then train the language model to understand movie-review language more.

So I'm not telling it good or bad yet; I just want it to understand what people say about movies. Then, at the very last moment, I say: okay, I've got a hundred movie review examples, learn to say good or bad in response to each one. Now, this approach can work much better than the previous approach of training from scratch, where you might need 10,000 movie reviews because you've got to learn a lot of the movie-review language at the same time.

Here, these models have got a very good understanding, a very good measure, of the English language; now you're just fine-tuning the last little bit. So this is a super effective technique, and this is why these things are getting very good.

So, in terms of recent progress: in February 2018 we got a thing called ELMo; ULMFiT came in in March, sorry, May 2018; OpenAI came out with their Transformer model in June 2018; and now the new hotness is BERT. BERT is clearly the successor to ELMo, though I don't think they mention that. This was done in October, so it's fairly new. So what do you do with BERT? Basically, it's one of these models full of Transformers: you do the SentencePiece thing at the beginning, you have this whole Transformer stack, and then basically we need to train it to do lots of different tasks.

So one of the tasks could be this one, which is SQuAD. The SQuAD thing is: I have a question and I need a response. So the question might be, you know, "When was the beginning of World War Two?", and I would give it the Wikipedia article about World War Two, and I would know the extent of where the answer lies in terms of words. So the answer for SQuAD is: tell me the start word and the end word. And so, in order to train this task,

basically, you put in the question, then a separator, then the passage, and then force it to tell you which token is the start and which is the end. So all of these common language tasks can be forced into the same model; rather than building special models for every different task, you can say "I want this to perform in this way" and use the standard model. (This slide is way too small to read here.)
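To make that input format concrete, here is a small sketch of how a question and a passage get packed into a single sequence, BERT-style, with the model then asked to point at the answer's start and end tokens (the token strings below are just illustrative):

```python
def pack_squad_example(question_tokens, passage_tokens):
    """Concatenate question and passage with separators, BERT-style."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment ids distinguish the question (0) from the passage (1).
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids

tokens, segments = pack_squad_example(
    "when did world war two begin".split(),
    "world war two began in 1939 ...".split())
# The model is trained to output two indices into `tokens`:
# the position of the answer's start word and of its end word.
```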

But basically, and I would encourage people who are interested to read the BERT paper because it's actually quite well written and lays things out nicely, you've got the previous state of the art at the top, then the BiLSTM-with-ELMo entry here, and this is the OpenAI one. Basically, these scores are very much higher than they used to be; people were slowly climbing up.

So if we look back in history, the old approaches would have been scoring, say, 50; when you got to the deep learning stuff it jumped to 60; people were clawing their way up and now we're at 70. That's the kind of impression to take away. So this BERT performance has suddenly been a sea change this year: these states of the art have been beaten by wide margins as we go through the year.

So, what's nice is that BERT has got working code on GitHub, and it's Apache 2 licensed, so anyone can use it wherever you are. They have scripts to reproduce all the results in the paper. They've released pre-trained models, both regular-sized and large-sized. Large is really for the TPU people; for regular you need something like a twelve-gig GPU to use this thing, so this is not a light model. And they've released, in both regular and large, an English-language model, a multi-language model which does 102 languages simultaneously, and a Chinese model.

So they've been hard at work with their TPU pods and they've pre-trained all this stuff. We know that producing one of these models costs a very large amount of money. On the other hand, fine-tuning one of these can take place in under an hour, or even five minutes, depending on the size of your task. So the fine-tuning stage can be very fast compared to building this thing initially on Google-sized data. The other nice thing is they've got a ready-to-run Colab.

So I won't do it now, but you could click on this and get going in a minute: you get the whole thing, with how to run this on a Cloud TPU, or on a machine which has got GPUs already. This is all for free; Colab has got TPUs for free, so you can just run this and have a play with it. So that's kind of interesting, when it works.

Okay, so let me recap. If you've got a problem, the old way of doing this would be: you build your model, you take a GloVe embedding, which is one of the freely available embeddings on the Internet, you train it up, and you need lots of data. The new method is: you take a pre-trained BERT kind of model, you fine-tune it on all the unlabeled data you could ever find (maybe it's emails, so you just dump in your Gmail or whatever), and then you put in the few emails that you care about. You need far fewer of these than you would with a regular model: you don't need as much data, and you should expect better results.
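To make the "old way" concrete, here is a rough sketch of loading freely available GloVe vectors into a Keras embedding layer before training the rest of the model from scratch. The file name and the `word_index` mapping are assumptions for this sketch (a Keras `Tokenizer`, for example, would give you the latter):

```python
import numpy as np
import tensorflow as tf

def glove_embedding_layer(glove_path, word_index, dim=300):
    """Build a Keras Embedding layer initialised from pre-trained GloVe vectors.

    `glove_path` points at e.g. glove.6B.300d.txt (downloaded separately), and
    `word_index` maps words to integer ids; both are assumptions of this sketch.
    """
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]          # known words get their GloVe vector

    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0], output_dim=dim,
        embeddings_initializer=tf.keras.initializers.Constant(matrix))
```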

So, as a wrap-up: BERT is the latest innovation in this NLP trend, which has only really been taking place this year. It's beaten all the state-of-the-art performance benchmarks. It's fully released with all the results just out in the open, as code should be, and people are describing this as the ImageNet moment for NLP, in as much as, for ImageNet, people trained these thousand-class image models until they were better than humans.

But what people didn't realize or pick up on so quickly was the transfer learning of these models: once you have a superb ImageNet model, you can then use it to recognize images on mammograms or all sorts of different things. You've built a very good vision model even though you've trained it mainly on dogs; you've got an excellent model which can be applied to lots of tasks, and it's all released for free. And so this is a very similar kind of moment.

Okay, so that's my talk; I'll take questions in a little bit. I'll do a couple of adverts first, though, because I'm that guy. You've already found the deep learning meetup group, because you're here, and we have the next one in December, probably here, with talks for people just starting out. Hopefully, if you're a beginner, you found something interesting in what I said, but next month in particular we're going to focus on having something good for beginners, as well as something from the bleeding edge.

This talk was fairly bleeding edge as well. Hopefully we can get some lightning talks too: if you want to give a five-minute talk about anything at all related to this kind of stuff, just have a word with me at the end. Also, we now have over three thousand members in this group, making it one of the largest TensorFlow groups on Earth, which is all thanks to you. So, Sam and I run deep learning workshops.

We call these JumpStart workshops; there are other courses beyond that for more advanced stuff. Basically, what we're trying to do here is, over two days (and there's online content as well), get people to actually do stuff on their laptops and also do some kind of project of their own devising, or one of our sample projects if you want. We feel that the hands-on thing is super important, because just clicking through some pre-made modules

is not really so helpful. If you're a Singapore citizen or PR, this course is heavily subsidized, apparently up to a hundred percent; find out if you're one of those lucky people by going to the SGInnovate site, and if you've got to the right place on SGInnovate, it will look like this. Okay. There's also a little about the deep learning developer course.

This is a full course. When we ran one of these last year, a number of people even swapped jobs afterwards for a better job they achieved after doing the course, which is an awesome return on capital. The dates for this are as yet unknown, but there's definitely more to come after the JumpStart. What we want people to do is do the JumpStart before the full thing,

just so that they know how to run these models and we don't have to explain everything from the beginning again every time. And one final thing: Sam and I have a little company called Red Dragon, and we are on an active intern hunt. We are looking for people who want to do deep learning stuff as much as possible, though we recognize that people have academic obligations.

We have had someone working with us over the last couple of months, and he's now appearing at NeurIPS, which is kind of an interesting thing for a Singaporean student. When we originally announced the intern hunt we said this was a possibility, and now it's an actuality, so we're very grateful to him. But please approach us if you'd like to be such an intern. Okay, up next is that guy. That guy. Okay.

By Jimmy Dagger

