Machine Learning Engineering with Chris Albon

Khaulat Abdulhakeem
She is the Founder & CEO of DiverseK and host of the DiverseK podcast. DiverseK helps guide and connect diverse tech talents to companies globally.
Host
Chris Albon
He is the Director of Machine Learning at the Wikimedia Foundation.
Guest

[00:00:00] Khaulat: Just to kick it off, let's start with your journey, basically. Let's talk about your journey into machine learning and where it all started.

[00:00:09] Chris Albon: So I started out getting a PhD in political science, where I learned a lot of stats. And then after my PhD, I went to work for organizations like FrontlineSMS, which was a Kenyan-US nonprofit that built open source software, and Ushahidi, which was another Kenyan organization that built open source software.

[00:00:33] Chris Albon: While working at those places, I realized how powerful working on ML could be. Like the stuff you could do: put a model into production and have it do the work of a hundred people, or label something that you wouldn't have funding to hire human labelers for, or identify complex things in images, that kind of stuff.

[00:00:53] Chris Albon: And from then I just kept on going, right. I just kept on learning more stuff, reading more books and more articles, and also getting more and more jobs in the [00:01:00] space. And, you know, years later I'm here at the Wikimedia Foundation, but that was really where it started. I was working at a small organization, with a background where I happened to know about machine learning at the time.

[00:01:12] Chris Albon: Because it wasn't that popular then. And I was sort of teaching myself all the way, because I don't have a computer science degree. I don't have an undergrad degree or a PhD in computer science. So a lot of it was just self-taught along the way.

[00:01:26] Khaulat: Interesting. A lot of machine learning people that I've spoken to, myself included, are self-taught. We all started by teaching ourselves and, you know, trying to further our knowledge. Yeah. So were you living in Kenya at that time, when you worked for the two Kenyan organizations?

[00:01:44] Chris Albon: No, I didn't live in Kenya for either of them, or later when I worked for BRCK, which was a Kenyan startup. My family's from Zimbabwe, but I live in the US, so I have always just stayed in the US. Well, I mean, I live in the US and I have a US accent, but [00:02:00] my family's originally from Zimbabwe.

[00:02:02] Chris Albon: And so that was sort of the connection with Kenya: I knew that there was a Kenyan organization that was looking for people to hire, and I just jumped on that. So I never moved out to Kenya for any of those jobs, although I've been to Kenya, I don't know how many times at this point, a lot of times, to go work with startups.

[00:02:20] Khaulat: Yeah. Interesting. So how would you describe your job as a machine learning engineer, then? Like, when you got started?

Chris Albon: So I mean, the big part of machine learning engineering is the fact that when machine learning got popular, a lot of the work around machine learning came from the academic space.

[00:02:41] Chris Albon: And so it was like, here's a new model. Here's a research paper or a Jupyter notebook that has this model in it, and it does something really, really interesting. And the key part of machine learning engineering is: okay, how do we then put that into production? Like, how do we make that model [00:03:00] automatically retrain every single night?

[00:03:02] Chris Albon: Or how do we handle that model and make sure that it's up and working when we have 2,000 models, right? That's the space where machine learning engineering comes in. It's the application of the classic software engineering principles that would be common to any kind of software engineer to the unique problem of machine learning, where you have a situation where models can simply not work, right? You could say, I'm gonna predict X using this stuff, and it just turns out that doesn't work. And so that's really the focus of machine learning engineers, trying to make sure that we get that.

[00:03:36] Chris Albon: Get those models into production, get 'em running live, get them to a place where we can feel really good about them, and know when they break, right? I mean, that's of course a big thing that can happen, where a model can start returning different results. There can be model drift because, for example, at the Wikimedia Foundation, we have some models in production that we've had in production for five or six years, and behavior on the site changes over time. And so the models are [00:04:00] outdated in a way. And so with that kind of model drift, how do you detect it and then implement a solution for it?

[00:04:06] Chris Albon: That's really the focus of machine learning engineering, as opposed to something like machine learning research.
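Detecting the drift Chris describes usually means comparing the data a model sees now with the data it was trained on. Below is a minimal sketch, not Wikimedia's actual tooling, using the Population Stability Index (PSI), one common drift statistic. The function names, the quantile binning, and the 0.25 alert threshold (a widely quoted rule of thumb) are illustrative assumptions to be tuned per feature.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(baseline: Sequence[float], recent: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between training-time and recent feature values.

    Bin edges are quantiles of the baseline sample, so each bin holds roughly
    1/bins of the baseline; PSI sums (p_recent - p_base) * ln(p_recent / p_base).
    """
    ordered = sorted(baseline)
    n = len(ordered)
    # bins - 1 interior cut points taken at baseline quantiles
    edges = [ordered[n * i // bins] for i in range(1, bins)]

    def shares(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            counts[bisect_right(edges, x)] += 1  # bin index in 0..bins-1
        # Floor empty bins at a tiny share so the logarithm stays finite
        return [max(c / len(sample), 1e-6) for c in counts]

    base, rec = shares(baseline), shares(recent)
    return sum((r - b) * math.log(r / b) for b, r in zip(base, rec))


def drifted(baseline: Sequence[float], recent: Sequence[float], threshold: float = 0.25) -> bool:
    """Flag drift when PSI exceeds a rule-of-thumb threshold (an assumption to tune)."""
    return psi(baseline, recent) > threshold


# Toy data: one feature's values at training time vs. after behavior shifted
training_values = list(range(100))
shifted_values = list(range(50, 150))
print(round(psi(training_values, training_values), 4))  # identical data -> 0.0
print(drifted(training_values, shifted_values))          # True
```

In practice a check like this would run per feature on a schedule, and a drift flag would trigger investigation or retraining rather than an automatic fix.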

[00:04:10] Khaulat: Yeah. The point you touched on is quite interesting, and I want us to go a bit technical now. How would you describe that process where there's an existing machine learning model currently in production, but maybe due to changes over time, the new data coming in differs a bit from the data we actually trained on in the past?

[00:04:30] Khaulat: So, explain the technicalities behind retraining the whole model using the new data sets, and what happens with the old model and all of that.

[00:04:39] Chris Albon: Sure. So in machine learning engineering, there's lots of ways to do this, but typically what you want is something along the lines of model versioning.

[00:04:50] Chris Albon: And so every single time that you train a model, you assign a version number to that model. Then every retraining [00:05:00] period (you could retrain on any period, but let's say every night), you get new traffic data from the website. So we run a website. I mean, technically I do work for a place that runs a website.

[00:05:12] Chris Albon: So, fair enough. Say you're running a website. You use the last 60 days' worth of data to create your model, but every single day, you get a new day's worth of information and you drop the oldest day. So the data set is slightly different every single day. And in that situation, what you typically wanna do is retrain the model every single night.
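The 60-day rolling window and per-training version numbers can be sketched in a few lines. This is a minimal in-memory illustration: `RollingWindow` and `ModelRegistry` are hypothetical names, and the "model" is just the mean of the window standing in for a real training step. A real deployment would use persistent storage and a model registry service.

```python
from collections import deque
from datetime import date, timedelta
from typing import Any, Deque, Dict, List, Tuple


class RollingWindow:
    """Holds the most recent `max_days` days of records; adding a day evicts the oldest."""

    def __init__(self, max_days: int = 60) -> None:
        self._days: Deque[Tuple[date, List[float]]] = deque(maxlen=max_days)

    def add_day(self, day: date, records: List[float]) -> None:
        self._days.append((day, records))  # deque(maxlen=...) drops the oldest entry

    def training_data(self) -> List[float]:
        return [r for _, records in self._days for r in records]


class ModelRegistry:
    """Hypothetical in-memory registry: every trained model gets the next version number."""

    def __init__(self) -> None:
        self._models: Dict[int, Any] = {}

    def register(self, model: Any) -> int:
        version = len(self._models) + 1
        self._models[version] = model
        return version

    def latest(self) -> Tuple[int, Any]:
        version = max(self._models)
        return version, self._models[version]


window = RollingWindow(max_days=60)
registry = ModelRegistry()
start = date(2024, 1, 1)
for i in range(61):  # 61 nightly runs, so the first day falls out of the window
    window.add_day(start + timedelta(days=i), [float(i)])
    data = window.training_data()
    # Stand-in for real training: the "model" is just the mean of the window
    registry.register(sum(data) / len(data))

print(registry.latest())  # (61, 30.5)
```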

[00:05:30] Chris Albon: And then, you compare it to the performance of past models, specifically, the last one you retrained. And so there's various ways to do this. There's what's called shadow modeling, which is where you have the new model, version two of the model and version one of the model.

[00:05:46] Chris Albon: And when a prediction request comes in, you actually send it to both. And then you serve back the new model's prediction to whoever asked for it, another service or something like that, but you store the results of both on your side. And so you can [00:06:00] see which of the two models works better.

[00:06:02] Khaulat: Which one is most accurate and all.
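Shadow modeling as described here can be sketched as a router that runs every request through two versions, answers with one, and logs both for later comparison. In the conversation the new model's answer goes back to the caller; many shadow setups instead answer with the established version and only log the candidate, so this sketch leaves that choice to the `served` parameter. `ShadowRouter`, `accuracies`, and the toy threshold models are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Model = Callable[[Any], Any]


@dataclass
class ShadowRouter:
    """Run every request through two model versions, answer with one, log both."""

    served: Model  # version whose prediction goes back to the caller
    shadow: Model  # version whose prediction is only stored
    log: List[Tuple[Any, Any, Any]] = field(default_factory=list)

    def predict(self, request: Any) -> Any:
        answer = self.served(request)
        self.log.append((request, answer, self.shadow(request)))
        return answer


def accuracies(log: List[Tuple[Any, Any, Any]], labels: Dict[Any, Any]) -> Tuple[float, float]:
    """Accuracy of the served and shadow versions once the true labels are known."""
    served_hits = sum(ans == labels[req] for req, ans, _ in log)
    shadow_hits = sum(sh == labels[req] for req, _, sh in log)
    return served_hits / len(log), shadow_hits / len(log)


# Toy versions: v1 and v2 of a classifier that flags "large" inputs
v1: Model = lambda x: x > 5
v2: Model = lambda x: x > 3
router = ShadowRouter(served=v1, shadow=v2)
for request in range(10):
    router.predict(request)

labels = {x: x > 3 for x in range(10)}  # ground truth arrives later
print(accuracies(router.log, labels))  # (0.8, 1.0)
```

In a production service the shadow call would usually run asynchronously, so a slow or failing candidate cannot delay or break the response the caller receives.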

[00:06:04] Chris Albon: Yeah. And then you can set it up so that if the model we trained tonight is actually better than the one we trained last night, then we'll go into full production with the new one. Or you could do something where you serve a percentage of traffic to the new model.

[00:06:16] Chris Albon: So 5% goes to the new model, 95% goes to the old model, then you see which one does better, and whichever one does better becomes the new model you put out. And there's that kind of stuff of constantly working on that, and trying to automate it so that you're not sitting there with a long to-do list every single night.
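The 5%/95% split can be sketched as deterministic bucketing: hashing a user ID (rather than flipping a coin per request) keeps each user on the same version. The function names, the SHA-256 bucketing, and the success-rate comparison are assumptions for illustration, not a description of Wikimedia's setup.

```python
import hashlib
from typing import Dict, Tuple


def assign_version(user_id: str, new_fraction: float = 0.05) -> str:
    """Deterministically route about `new_fraction` of users to the new model.

    Hashing the user ID keeps each user on the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return "new" if bucket < new_fraction * 2**64 else "old"


def pick_winner(results: Dict[str, Tuple[int, int]]) -> str:
    """Return the version with the higher success rate.

    results maps version -> (successes, requests). A real rollout would also
    check that the difference is statistically meaningful before switching.
    """
    return max(results, key=lambda v: results[v][0] / results[v][1])


users = [f"user-{i}" for i in range(20_000)]
share_new = sum(assign_version(u) == "new" for u in users) / len(users)
print(f"{share_new:.1%} of users on the new model")  # roughly 5%
print(pick_winner({"old": (900, 1000), "new": (48, 50)}))  # new
```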

[00:06:34] Chris Albon: Like, I need to retrain this specific model. But the idea behind it is to accept the fact that things change over time: user behavior changes, the needs of the business change, technology changes. And so you are retraining those models and trying to figure out if there's a best one. In an ideal world, the whole retraining process would be automated. [00:07:00] So then you would just do that whole thing every single night, like you'd have a batch job that would run every single night. And then the whole process, like the model selection (do I use version two or version one?), all that would be run for you at midnight or something like that.

[00:07:12] Chris Albon: I think there's lots of steps along the way. 
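The automated nightly job can be sketched as: train a candidate version, score it and the production version on the same held-out data, and keep whichever is better. Scoring both on one shared holdout also sidesteps the "model one's data or model two's data?" question raised below. The threshold "model", the function names, and the strict-improvement rule are hypothetical stand-ins; a scheduler such as cron would trigger the job at midnight, which is not shown.

```python
from typing import Dict, List, Tuple

Example = Tuple[float, int]  # (feature value, label)


def accuracy(threshold: float, data: List[Example]) -> float:
    """Share of examples where 'feature >= threshold' matches the label."""
    return sum((x >= threshold) == bool(y) for x, y in data) / len(data)


def train(data: List[Example]) -> float:
    """Toy training step: pick the threshold with the best training accuracy."""
    return max(sorted({x for x, _ in data}), key=lambda t: accuracy(t, data))


def nightly_job(registry: Dict[int, float], production: int,
                train_set: List[Example], holdout: List[Example]) -> int:
    """Retrain, score old and new versions on the same holdout, return the version to serve."""
    candidate = max(registry) + 1
    registry[candidate] = train(train_set)
    old_score = accuracy(registry[production], holdout)
    new_score = accuracy(registry[candidate], holdout)
    # Promote only on a strict improvement; ties keep the current model
    return candidate if new_score > old_score else production


registry = {1: 10.0}  # version 1 is in production with a stale threshold
train_set = [(float(x), int(x >= 5)) for x in range(10)]
holdout = [(x + 0.5, int(x + 0.5 >= 5)) for x in range(10)]
serving = nightly_job(registry, production=1, train_set=train_set, holdout=holdout)
print(serving, registry[serving])  # 2 5.0
```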

[00:07:14] Khaulat: Yeah. Another thing I wanted to mention: there will be cases where the new data is filled with outliers and isn't like what would normally come in. And in that case, the old model would be better. So do you just get rid of the new data? Or, in the first place, how do you even recognize that the new data is an outlier?

[00:07:36] Chris Albon: Yeah. So that model evaluation part, I think, is the most under-respected part. Because making a new model is of course what's interesting and cool, but actually understanding if a model is good or bad, or, you know, better or worse is probably a better way of putting it, is really hard. For example, if you have model two and model one in this [00:08:00] scenario, do you compare both on model two's data or model one's data?

[00:08:04] Chris Albon: Or say you have a specific holiday coming up. One day isn't a holiday, the next day is a holiday, and user behavior changes because it's a holiday. Well, if you don't account for that kind of stuff, you could have a situation where only on the day after the holiday have you trained the model for the holiday.

[00:08:21] Chris Albon: So you're always one day behind where you should be, where in fact maybe it'd be better to go back, find last year's data for the holiday, and train on it for the holiday. So you handle that specific case of a day that's different than other days. But it really comes down to model evaluation, which is more of an art than a science, although there's lots of science in there. And it's super important, and I think sometimes we don't pay attention to it as much as we should.
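The holiday idea can be sketched as a rule for choosing training days: on an ordinary day, use the trailing window; on a known holiday, use last year's occurrence of the same holiday plus a few days around it. The hand-maintained calendar format, the `pad` of three days, and the function name are all hypothetical choices, not something described in the conversation.

```python
from datetime import date, timedelta
from typing import Dict, List

# Hypothetical calendar format: holiday name -> {year: date observed that year}
Calendar = Dict[str, Dict[int, date]]


def training_days(target: date, holidays: Calendar, window: int = 60, pad: int = 3) -> List[date]:
    """Which days' data to train on for a model that will serve `target`."""
    for by_year in holidays.values():
        last_year = by_year.get(target.year - 1)
        if by_year.get(target.year) == target and last_year is not None:
            # Holiday: use last year's occurrence plus `pad` days on either side
            return [last_year + timedelta(days=d) for d in range(-pad, pad + 1)]
    # Ordinary day: the trailing window of `window` days ending yesterday
    return [target - timedelta(days=d) for d in range(window, 0, -1)]


holidays: Calendar = {"thanksgiving": {2022: date(2022, 11, 24), 2023: date(2023, 11, 23)}}
holiday_days = training_days(date(2023, 11, 23), holidays)
normal_days = training_days(date(2023, 11, 20), holidays)
print(holiday_days[0], holiday_days[-1])  # 2022-11-21 2022-11-27
print(len(normal_days), normal_days[-1])  # 60 2023-11-19
```

Keying the calendar by year, rather than reusing the same calendar date, matters for holidays like Thanksgiving that move from year to year.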

[00:08:46] Khaulat: Yeah. That's very true. Okay. Let's get out of the technical part and talk about your current role as a machine learning director.

[00:08:57] Chris Albon: Sure. So I [00:09:00] am the director of machine learning at the Wikimedia Foundation, which means I am responsible for all the machine learning that is hosted by the Wikimedia Foundation. And a lot of the work is around supporting the models that we have in production as a foundation. But the interesting part about this role is that we are open and transparent as a foundation. Most people who work in ML are sort of deep in their organization's tree, and the things they do are intellectual property that they can't show people, that kind of stuff. At the foundation, we do everything in public. All of our code is public.

[00:09:41] Chris Albon: All of our work tickets are public. Our internal team chat is public. I have like live streamed myself working multiple times and it makes it a really interesting and unique place to work.

[00:09:52] Khaulat: Yeah.

[00:09:52] Chris Albon: Because people can come in and say, Hey I can see everything you've done in the last week.

[00:09:57] Chris Albon: And, you know, it's terrible. And you [00:10:00] get used to the fact that you're just out there, and it becomes sort of normal, but it is definitely not a normal work environment in that sense. But it is something that I think the team takes really seriously.

[00:10:11] Chris Albon: Because, as the Wikimedia Foundation, we're not funded by an investor, right? There's no VC who funds Wikipedia and the Wikimedia Foundation. It is just regular people who give a small amount of money, and if enough people give a small amount of money,

[00:10:26] Chris Albon: we have enough money to keep the site going. In that environment, you are really responsible, and you really feel that you're trying to maximize the value that you're giving the people who are donating. And so you don't wanna spend money where you don't need to.

[00:10:42] Khaulat: Yeah. 

[00:10:42] Chris Albon: And that's been a really interesting part of the foundation: having literally every single engineer care a lot about the cost of stuff, which I had never experienced before. Because normally engineers are like, oh no, I mean, it's not my money, why don't we just buy 200 servers?

[00:10:56] Chris Albon: And at the Wikimedia Foundation, it's very much like, okay, do we [00:11:00] really need 200 servers running? Do we need this? Can we turn one off? Can we, you know, save electricity? What can we do? And because of that, the foundation is very, very accountable for every bit of money that we spend; people take it super seriously.

[00:11:15] Khaulat: Yeah. That's interesting. And what would you describe as the most fun part of your job day to day?

[00:11:22] Chris Albon: The most fun part? I think the part that I enjoy the most is when we get to deploy a model that makes Wikipedia better. We work on things outside of Wikipedia that the Wikimedia Foundation runs, like Wikimedia Commons, that kind of stuff.

[00:11:38] Chris Albon: But obviously Wikipedia is a huge part of the stuff that we work on. And when you get to add a new feature to Wikipedia, I mean, that's just cool, right? That's just a really, really fun moment where...

[00:11:51] Khaulat: And you can see it immediately because it's public.

[00:11:53] Chris Albon: Yeah. You can see it, and it's a site that I love and everyone loves, and so to get to play a small part of [00:12:00] that is amazing. I mean, that's just a really great experience, and it's fun to do. There's obviously other fun parts of the job, but I think that's the one where you can load Wikipedia and be like, hey, that's there now. See, I did that.

[00:12:12] Khaulat: Yeah. Interesting. Let's talk about some tools that you use.

[00:12:15] Khaulat: First, as a director now, what are some tools that you use every day to make your work better?

[00:12:21] Chris Albon: I use a lot of alerts, because as a director I have some staff under me. And a big part of being an engineering leader is decision making, right?

[00:12:34] Chris Albon: So I'm looking at the work that's done by my staff, but I also need to be able to make decisions with limited information. Say: should we use this Kubernetes-based approach, or this other approach, or should we build our own? If no one on the team feels qualified to make the decision,

[00:12:51] Chris Albon: that's now my decision, and I will make that one. And so the best way I can do that is not by writing every line of code, [00:13:00] which is a habit that I've learned to break, but instead by being aware of what everyone is working on and reading all the comments that they've made in the code.

[00:13:07] Chris Albon: Reading through the code, reading through any kind of changes, reading through discussions that happen in the ticketing system, just to make sure that I know what is happening, so that when I need to make a decision, which is pretty common, I am in a position to make that decision. It seems like a basic thing, just having a lot of notifications on for how things work.

[00:13:27] Chris Albon: But every single morning I get up and spend an hour looking at what was done the day before, to make sure that I'm in a really, really, really good spot. Because, I mean, it sounds simple, but being able to make decisions is a skill that you get with experience, and you get that experience by just being around the technical part.

[00:13:45] Chris Albon: And I think that would be the thing for new managers coming up. I would say, if you're gonna be in machine learning specifically, you have to stay pretty technical, because you can't cede responsibility to other people, though you can definitely delegate things.

[00:13:59] Chris Albon: But at the end [00:14:00] of the day, you will have to make decisions.

[00:14:02] Khaulat: Yeah. And you need to understand what's really happening. 

[00:14:05] Chris Albon: Yeah, exactly.

[00:14:07] Khaulat: Would you say the alerts and notifications are sometimes counterproductive? Like, can they cause distractions?

[00:14:13] Chris Albon: I think a lot of it comes down to trying to figure out which rabbit holes I'm gonna run down.

[00:14:23] Chris Albon: Because we all get notifications for stuff. And it's like, if this is true, Wikipedia is down, but I suspect it's not true, so I'm not gonna go take a look at it. I think the big proxy for me for whether something deserves my attention is whether people are talking about it.

[00:14:41] Chris Albon: So if you wake up in the morning and you have 20 comments on a work ticket, that is a ticket that you should read, along with every single line of code that's related to it. And I don't even need to say anything in the ticket.

[00:14:55] Chris Albon: I just need to make sure that I understand what is happening in that particular ticket. [00:15:00] And then I can go from there: I can have someone work on it, I can work on it, we can decide what to do. But it is important not to cede responsibility for how the system works, and for the individual decisions, to someone else.

[00:15:13] Chris Albon: Because at the end of the day, I'm the one who's making the decisions around it. Even if I'm not pushing the individual lines of code, which I wish I could, but I tried that before, and you burn out very quickly if you're pretending you're the only developer.

[00:15:24] Khaulat: Yeah. Thank you for that very valuable information.

[00:15:30] Khaulat: So, are there any other tools beyond the alerts and notifications that are as important?

[00:15:35] Chris Albon: Yeah. I mean, I am a huge VS Code user. That's my go-to for editing code on a personal level, because I try to keep up with how machine learning grows.

[00:15:48] Chris Albon: And I do that by just working on it on my own time. I use what I think is called Paperspace. I used to have, if you could scroll the camera down, you'd see a very old computer right there. That's what I was [00:16:00] working on for my own deep learning, training my own models and that kind of stuff.

[00:16:04] Chris Albon: I've moved over to hosted solutions for it. But I mean, I use VS Code, I use Google Calendar, if you really care about my calendar app. But I'll say that the thing that is important when you jump around to lots of different jobs is that you understand the patterns of behavior that are common to all these different tools.

[00:16:24] Chris Albon: So at my last job, when we started out, we only used Vim. That was just how you did everything, and that's okay, right? I'll code in Vim. I'll code in Emacs. I'll code in VS Code remotely, I'll code locally. I'll do whatever I need to do. Which means that I tend not to invest in customizing anything super specifically.

[00:16:44] Chris Albon: I tend to keep it pretty default. Not because I'm lazy, as someone once said, but because I believe that if I go take my next job, they're gonna be like, okay, we only do everything in Vim. And I'm gonna be like, okay, cool. I can just take my company laptop and open it up.

[00:16:59] Chris Albon: And now I'm in [00:17:00] Vim, and then I'll just start working on stuff. I think that is really, really important. But it does mean that most of my stuff doesn't look as pretty, because I haven't spent a lot of time customizing it to my unique purposes.

[00:17:12] Khaulat: Yeah. Cool. Okay. So there's something that just popped into my head now.

[00:17:18] Khaulat: That's about learning, because you definitely keep yourself up to date. How do you go about that? How do you go about learning new things and making sure that you're still in the know?

[00:17:29] Chris Albon: Yep. I have learned over time that machine learning as a field grows really, really quickly.

[00:17:37] Chris Albon: Every single year is a lot different than the year before, and every five years it feels like a totally new field. And so learning constantly, getting better, learning the new thing and trying stuff out is incredibly important. And I've learned over time that the only way that I can actually [00:18:00] learn something is by making something with it.

[00:18:03] Chris Albon: And it doesn't need to be a big thing. So I've written a book; that was making a thing, right? The book had 300 tutorials, and each tutorial was a thing that I made. I make these machine learning flashcards; every single flashcard is me making something. I've done a podcast; every single episode was making something.

[00:18:17] Chris Albon: I have a site with a lot of tutorials on it. Every single tutorial on the site was me making a thing. And that's the way that I learn. I take a book or, you know, a tutorial online or something like that, open it up, take that information, and process it through my brain, and then I try to explain it to someone else or I try to do something with it. Through doing that, that's how I learn.

[00:18:40] Chris Albon: I can't just read a book without, you know, trying it. And definitely, because my background isn't in math, it's more in coding, I tend to wanna run code for stuff, right? So if you say, hey, we have this new system, and just trust us, it works, or something like that.

[00:18:56] Chris Albon: I need to try to do it. So take a simple example, [00:19:00] random forest: when I first learned random forest, the thing that made it stick in my head is that I made one and, you know, made the trees smaller or made the trees bigger, or changed the data set, or tried a different technique, or whatever.

[00:19:09] Chris Albon: Yeah. But just playing with it and actually using it myself. That's the sort of making something with it, rather than just opening the machine learning book on page one, reading to page 300, and saying that I know it, because I never will. By the time I get to page 300, I will have forgotten page one.

[00:19:26] Chris Albon: But if I go through slowly and make stuff with it, you know, tutorials, flashcards, whatever, that stuff will stick in my brain a lot more, a lot, a lot more.
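In the spirit of "make one and change it", here is one way to play with a random forest. It assumes scikit-learn is installed and uses a synthetic dataset from `make_classification`; the parameter grid (tree depth and number of trees) is an arbitrary choice to vary, and the printed accuracies are whatever the run produces rather than expected values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, fixed seed for repeatability
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for max_depth in (1, 3, None):      # smaller vs. bigger trees (None = grow fully)
    for n_estimators in (5, 100):   # few vs. many trees in the forest
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        )
        model.fit(X_train, y_train)
        # .score returns mean accuracy on the held-out split
        results[(max_depth, n_estimators)] = model.score(X_test, y_test)

for (depth, trees), acc in results.items():
    print(f"max_depth={depth}, n_estimators={trees}: test accuracy {acc:.3f}")
```

Swapping in a different dataset or adding parameters such as `max_features` is the kind of follow-up experiment this loop makes easy.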

[00:19:35] Khaulat: Yeah, thanks. That was very helpful. This has been a very interesting conversation, and we've been talking for about twenty-five minutes.

[00:19:42] Khaulat: I enjoyed every bit of it. I'd like to wrap this up with the one piece of advice that you would give to someone who wants to get started in machine learning engineering today.

[00:19:52] Chris Albon: Yeah, I would say, wow. One piece of advice. Wow. Okay.

[00:19:58] Khaulat: Oh, this can be multiple. [00:20:00]

[00:20:00] Chris Albon: I mean, I would say that one is there are so many learning resources out there for you.

[00:20:06] Chris Albon: There are so many places that you can learn, for some amount of money or for no money, totally free. There's tons of options there for people to learn. And the interesting thing about machine learning is that, because it's such a new field, a lot of the requirements you might have to meet to be a doctor or a lawyer, where you have to go to a certain school for a certain number of years, that kind of stuff doesn't exist for machine learning.

[00:20:34] Khaulat: Yeah, there are no barriers. 

[00:20:34] Chris Albon: There are no barriers. The bar that you need to pass is the technical interview at the job, right? So you could apply anywhere, and they give you a task that involves machine learning, and whether or not they hire you depends on how well you do.

[00:20:47] Chris Albon: And for that kind of stuff, you can realize that the difference between being so bad that you aren't hired, good enough that you're just barely hired, or really, really great, so that you're [00:21:00] the best candidate and everyone wants to hire you, is just hours of your time spent working by yourself in machine learning: writing notes to yourself, reading stuff, just being around things.

[00:21:13] Chris Albon: That is really useful. It's just hours of time. I have a bunch of books behind me, and I've spent a lot of time just sitting there, reading, taking notes on them, working on them. And that's what I do almost every single day, just to keep going, and you just get better every single time, and you push a little bit more.

[00:21:29] Chris Albon: The second piece of advice I would give is that social media, and Twitter, is actually really useful in a weird way. Even if you don't actually send any messages, you see what people are talking about. There'll be an article that sparks a really big discussion about MLOps or something like that, which is how you manage lots of models.

[00:21:48] Chris Albon: And just seeing that people are talking about the article, you sort of go and read the article and learn something about where the discussion is. I find it helps show you the landscape of what the field [00:22:00] looks like, where you can see the differences between people who are more focused on research and people who are focused on industry and running things in production.

[00:22:09] Chris Albon: You can see what the latest advances are, or techniques that people don't use anymore, or organizational things like how you structure a team, that kind of thing. Social media, even if you never post at all, even if you just exist there, is actually really useful to have. It's like a slow conference, right?

[00:22:27] Chris Albon: It's like a slow machine learning conference over time. You can just pop in, see what the conversations are about, and say, okay, cool, I got some stuff from that, that's great. And then you can move on with your life. But it continues to add value. I used to have a really big RSS reader with like 200 blogs on it, all machine learning blogs.

[00:22:46] Chris Albon: And now I just let social media do the filtering for me. Interesting stuff will just be the things that people end up talking about, so that stuff floats to the surface, appears in my feed, and then I read it. I think it's a good [00:23:00] way to do it, rather than reading every single Reddit post or something like that.

[00:23:04] Khaulat: Thanks, Chris, this was a very, very interesting conversation.