Listen: Computer Scientist Ameet Soni says "Dr. Google Will See You Now"
This spring, Assistant Professor of Computer Science Ameet Soni gave a talk, "Dr. Google Will See You Now: Advancing Biology and Medicine with Artificial Intelligence."
The current revolution in AI has been spurred by two ongoing trends: parallel architectures that make computation faster and "big data" - the ability to collect and share large data sets easily and cheaply. In his lecture, Soni discusses these trends in the context of his research, which develops algorithms to solve problems in biology and medicine. These problems are characterized by complex and noisy relationships that overwhelm common approaches in the field. The projects he discusses include solutions for protein structure prediction, diagnosing diseases such as Alzheimer's and Parkinson's, and understanding gene regulation.
Soni, who joined Swarthmore's faculty in 2011, received his Ph.D. in computer science from the University of Wisconsin. His general research interests are in the areas of machine learning and computational biology and medicine.
Rich Wicentowski: So, good afternoon. My name is Rich Wicentowski and I'd like to introduce my colleague, Ameet Soni. Ameet got his Bachelor's degree from the University of Michigan in 2004 and apparently, now I'm obligated to say, "Go Blue." He got his MS and PhD at the University of Wisconsin in 2011 and joined Swarthmore College immediately after as a visiting assistant professor. The following year he was hired on the tenure track as an Assistant Professor in Computer Science. Since he's been here he's taught bioinformatics, database systems, data structures, introduction to computer science, and numerous directed readings.
Ameet's primary research area is in machine learning and its application to biomedical problems. For his PhD thesis, Ameet worked on developing machine learning approaches for complex problems, including predicting three-dimensional structure of proteins, which he'll talk about today. Since joining Swarthmore, Ameet and his eight research students have used machine learning to make progress on a number of biomedical problems.
Currently, Ameet's lab is using deep learning to make predictions about how transcription factors impact gene regulation. So, transcription factors are proteins that control the rate at which specific DNA sequences are transcribed into RNA, which effectively turns a gene on or off. Ameet's lab has also used deep learning to determine if an MRI shows whether or not a patient has Alzheimer's Disease.
On campus, Ameet has become involved in many ways, including giving a guest lecture to the Board of Managers entitled The Future of Liberal Arts, giving a guest lecture in cognitive science, mentoring junior faculty, participating in the Aydelotte Foundation Series on Pedagogy, and being the NFC representative in the Bathtub Debates, where he came in second to Krista.
Of course, I would be remiss if I didn't mention that everyone in the Computer Science Department thinks that Ameet is a perfect colleague. Thanks for coming and enjoy the talk. (applause)
Ameet Soni: Thank you all for coming today. I'm excited to talk to you about my research. So, the primary theme of my research is applying machine learning techniques to problems in biology and medicine. As I talk to you about what those different fields mean and some of the problems they work on, hopefully I'll give you an idea about how I view a problem and the lens through which I view problem solving.
So, I use artificial intelligence in the main slide because I think we're much more familiar with that term. Specifically, within that field of artificial intelligence, I work on problems in machine learning. So, artificial intelligence we kind of think about as robots and trying to mimic humans. And, you know, we have this popular science view of what it is. Machine learning is a much different ... or it's a sub-problem in the sense that we're not necessarily interested in maybe the human interface side of things, but more about how do we learn from data? So how do we take data and recognize patterns in order to solve problems in the real world, without having to tell the machine every possible scenario it may see and how to solve it?
So an example would be giving recommendations. So, I'll use this toy problem of a movie recommendation. If you're Netflix, you have somebody's viewing history. You want to be really good at recommending movies to them because if you're not, people are going to stop paying for your service because they think there's no more movies that would be interesting to them, but if you're really good at it they're going to be happy and more likely to continue. So a typical machine learning problem may be given a viewer history, can you predict whether they're going to like a movie that just came out, Moonlight? And more specifically, if you have that information as well as other users that may have similar viewing histories to the new user that's coming in.
In an ideal world, I could just say, "Hey, Siri, solve this problem for me." We just want an AI that we don't have to give much information. We can just tell it what our problem is and it can solve it. But in reality, the truth behind artificial intelligence is that most of the intelligence is still human intelligence. Right? So when we want to solve a problem, in the vast majority of successes at this point, it's that the human has to select a model, a mathematical construct, or a way to frame the problem for the computer, and also has to curate the data and collect the data. So if the human is not very good at any of those aspects, the machine's not going to be able to learn anything about the data, or it's not going to learn anything very useful from the data.
So, it's still incredibly important to build up skills and an understanding about how these models work and also an understanding of how data affects these algorithms.
So here's a toy problem. Again, this is not like an earth-shattering problem here or anything like that, but kind of building off the Netflix problem. So, let's say I have a viewer's history, so we call this supervised learning. We give the machine some solved examples, so people that did like Moonlight and people that didn't like Moonlight, and we want to be able to generalize from that. So in the future, can you tell me if a new user is likely to like the movie or not? So let's say I have viewing histories, I have Transformers and Hidden Figures, and I think there may be a correlation between what you think about these movies and whether you like Moonlight. So in blue here we have the people that like Moonlight. So these are again what we train on, known examples, and in red we have people that maybe didn't like Moonlight. So we want to be able to develop a model that could tell us about future examples.
So I choose a mathematical model. I'm going to choose a very simple one. I'm just going to have a ... so sorry. We have a new person that comes in. I know whether they liked Hidden Figures and I know whether they liked Transformers, and I want to use that information to tell me something about whether I should suggest Moonlight to them. So my mathematical model is very simple here. It's just a simple linear function. I'm going to say if you're below this line, you will like Moonlight and if you're above this line, you will not like Moonlight. Right? So, this is the first step that I came in ... the second step. The first step was collecting the data. The second was what model do I want to use to be able to learn from the data?
So it's, again, a simple model, but I can make this decision very easily now. I have a model and I have an example. It came below the line just like all of the other people that liked Moonlight did, so I'm going to say that this person should be recommended to watch Moonlight. So, that's a simple version of what machine learning is, trying to use data in models to be able to answer questions. So just to kind of hit home, the model here is this line. So when I use that word, I'm talking about some mathematical solution that helps us be able to make predictions in the future.
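The toy recommender described above can be sketched in a few lines. This is a minimal illustration, not Netflix's actual system: the ratings data is made up, and the "model" is just a line learned with the classic perceptron rule.

```python
import numpy as np

# Hypothetical training data: each row is one viewer's ratings of
# Transformers and Hidden Figures (0-10); labels say whether they
# liked Moonlight (1) or not (0).
X = np.array([[2, 9], [1, 8], [3, 7],   # liked Moonlight
              [9, 2], [8, 1], [7, 3]])  # did not
y = np.array([1, 1, 1, 0, 0, 0])

# The "model" is just a line: w . x + b. Train with the perceptron rule.
w, b = np.zeros(2), 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        pred = 1 if w @ xi + b > 0 else 0
        w += (yi - pred) * xi   # nudge the line toward mistakes
        b += (yi - pred)

def recommend(ratings):
    """Predict whether a new viewer will like Moonlight."""
    return int(w @ np.array(ratings) + b > 0)

print(recommend([2, 8]))  # similar to the 'liked' group -> 1
```

A new viewer who rated the two movies like the "liked Moonlight" group lands on the same side of the learned line, so the recommendation is yes.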
Any questions about that? All right, cool. So, obviously, that's not my entire talk. It's more complicated than that. So, yes, Carlos?
Carlos: What was the computer doing in the last slide you had?
Ameet Soni: Yeah, so I kind of hand waved at that. There's some optimization phase that goes in and that's where the computer's doing its magic, but we have different ways of approaching that problem. So what we could have said was, form a line that maybe is in between the data points as best as possible or gets as many things correct as possible. So you know these three are supposed to like Moonlight, so if you come up with the model, those three should be on one side and these other three should be on the other side.
So we can have different objective functions, but accuracy is usually one. So we want to minimize error on things we know correctness about. We may have other things we may prefer. For anybody that's familiar with Ockham's razor, we do prefer simplicity in our models as well. I could fit a really funky polynomial equation here too, but I would prefer something that's a simpler explanation as well.
That's incredibly important for biology because I have to explain my models to a biologist or to a medical person, so they want my models to not be these really complex mathematical equations that don't really give any insight into the problem.
So big data is this marketing term that gets thrown around a lot. I prefer "the data deluge" because I think at this point we're getting more problems than solutions from it. But if we want to talk about it in specific detail, there are the four V's that are often used to describe what big data is. And this is- it's kind of been a continuum; it's not like we just started big data last year and before that there was no big data. But these are the general properties that we have for big data. I think everybody's probably familiar with the first one. That's what we see in the press. Big data means you have lots of it, so we need bigger hard drives and faster computers to be able to handle it.
I'm more interested in the final two, the last two on here. So velocity is: we have Fitbits and we have smartphones. We don't turn those off. They're always collecting data, so we need to come up with models that can update as new data comes in, which is somewhat of a new problem. Veracity is kind of a consequence of the big data boom.
So we collect data primarily because it's cheaper and easier to collect, but the ability to maintain quality has gone down significantly. For one, sensors are cheaper, but that doesn't mean that they're better. It just means they can collect lots of information. The second thing is, before, you know, we would collect a dataset of 100 examples. And we could look at all 100 of those examples to see if there were any problems in there. So we could clean up the data fairly easily.
If you're Google and you have millions upon millions of images that are coming in, you're really not going to hire humans to sit there and sift through those images to make sure those are actually images of dogs or cats. So we have a lot more noise, and we have to deal with the fact that our data may not be what it says it is. Variety is the idea that our models used to simplify the problem by considering only one type of data. So if I wanted to look at predicting cancer, we could look at blood samples and just say let's collect a bunch of blood samples from patients and see if that tells us anything about cancer. But with big data we've been able to collect lots of different types of data, so we might have blood samples, we might have genetic data, we might have brain images. We have all these different types of modalities or different types of data that we need a mathematical framework to combine. And our old models don't necessarily extrapolate very easily to this new setting.
So those last two, the variety of data and the uncertainty in the data, are two that I really want to focus in on. It gives some real examples. So let's say I take my example from before, but it turns out that Warren Beatty was one of my users and mislabeled whether they like Moonlight or not. And so I have noise in my data. He really did like "Moonlight." He wanted to say Moonlight, but he said the wrong thing. And so what happens here is it changes my model, right? So if I fit my model to the data, assuming that the data's good, I'm going to get a very different model than I had before. So we want to come up with robust solutions that can not necessarily believe everything it sees in the data, take it with a grain of salt.
In terms of the variety, I want to focus in on what we call dependencies or structure. So in statistics today, they have this idea of independence. We assume that the data that we collect was collected independently of one another. In our Netflix situation, what we assume there is that two users don't really influence each other in terms of whether they like a movie or not. So they are just kind of drawn from this distribution of people that like movies and don't like movies. So let's say I had two people that I wanted to predict for in my sample now. I could draw that line and use that line to predict whether they like the movie or not, but we have other information now. We know that humans form social networks, and maybe we have a lot of literature that says people that are friends with each other have similar movie preferences. And if we're dealing with the medical scenario, maybe I know that one person's the parent of another patient. Knowing that they're genetically related should probably influence whether I think they both have the same disease or not; right?
So here, I'm assuming initially that they're independent of each other, but now that I know that those two people are friends, I want their likes of movies to be similar to each other. I want to take that into account. If one of them likes Moonlight, it should influence whether the other person likes Moonlight or not. And so we want to come up with models that add this structure in so when I ask will person A like Moonlight, I'm taking into account not only their viewing history but also their social networks or other kinds of structured information. So real data is interdependent. There's lots of influences on it.
So to kind of summarize, I'm interested in machine learning, and I'm interested in adapting algorithms to situations where there's a lot of noise or uncertainty in the data and there are dependencies in the data. If I get to heterogeneity, hopefully I'll have a couple slides at the end about Parkinson's disease work that we recently came out with, that I'd like to get to.
I'm very interested in applying these approaches to problems in biology and medicine. Part of it is that I want these algorithms to have impact. A lot of machine learning papers use the same old tired data sets, like handwriting recognition or digit recognition. And they're useful. They're useful benchmarks, but you know, if I were to tell my dad what I do and I told him I classify digits, he really wouldn't understand why we are doing that.
The second issue, though, is that a lot of times the models are built in [inaudible 00:12:24] to the specific problem. So they're on a simplified version of a scenario that they're thinking of. Biology and medical problems have a lot more complexity and a lot more resource requirements than some of these other datasets out there. And so I think they're really good test beds for computer science.
So there's two approaches that I'll talk about today. The main theme, the theme that kind of will connect the two main portions, the first two main portions, are probabilistic graphical models. Don't worry. I'll get into what those are, but those are a general category of approaches that I use. The second thing is deep learning, which I've spent the last year working on and hope to talk to you about it a little bit as well.
Any questions about the setup here? All right, great.
So what is a graphical model? Or why am I interested in a graphical model? It's a mathematical representation of the data. If you're on an airplane and you open up the magazine to the back page and it shows you all the airport hubs and the flights, that's a graph. It's showing you cities, which are these circles here. They're some type of variable, and then there are edges, the flights that connect things together. I use a particular variation on that, a probabilistic model where the variables are types of information I could have. So maybe I want to predict whether someone's going to get into a graduate program, and the type of information I have is what their undergraduate college was, what their GPA was, and what their major was. All those things would influence whether they are going to get admitted into a graduate program or not.
The second thing that's incorporated with these models is that they have some kind of probability measure. So it's not a yes or no answer for whether you're going to get admitted or not. It's a guess. There's a 75% chance you're going to get in if your GPA is a 4.0, maybe. All right. So we want to have this level of uncertainty in the model. The edges represent some kind of connection or dependencies between things. So I could ask a question about admissions, but I could also ask a question about what is the relationship between a major and a GPA? Then we want to know which departments are, you know, having grade inflation the most, or something like that. Right? So we could ask other questions in this graph, based on these individual local parts.
I'll get into the computational reasons why I use this in a little bit with a concrete example, but the general idea is that representing things on a graph allows us to break a very big problem into smaller, manageable pieces that otherwise would not be computationally feasible to approach.
So this is a high-level overview of some of the research projects I've been working on recently. So I'll go into detail about a few of these, but I wanted to group them by these two approaches I talked about: Graphical models and deep learning. So these first four at some level use some type of probabilistic graphical model in them. Whether it's analyzing protein structures, analyzing texts, being able to predict Parkinson's disease or being able to do brain image analysis. Part of the reason I really like them is they're very flexible, and you can see that there's very different problems here.
These last two also I use deep learning approaches for. So I'm going to focus in on the protein structure and the brain image analysis. Those are probably two of the more mature areas of my work in terms of having a full story. Some of the other stuff is still ongoing, but I think they also give an idea of how I approach general problems ... problems generally.
All right. Any questions? All right. Great.
So my thesis work, as Rich mentioned, was on protein structure prediction: wanting to analyze images and be able to produce some kind of result out of them. So why are we interested in this? I think in terms of popular science, we think about genes and DNA as being incredibly important. So we often talk about what DNA says about an individual or what the purpose of a gene is. But we always have to remember that genes and DNA are just blueprints; they're instruction manuals. They don't actually do anything themselves for the most part. So, sorry, biologists, if I'm kind of avoiding a lot of nuance here. You can bother me about it after.
The DNA itself doesn't tell us the full story. It's kind of the starting point. Genes for the most part codify a product known as a protein. These are the actual work horses in your cellular environment. So if I have a gene that's involved in, let's say, transporting some nutrient across a cell, it's a protein that's doing that. The protein grabs onto the nutrient and moves it where it needs to go.
So if we want to understand what the true function of a gene is, we need to understand the structure, because the structure is what's actually interacting with the environment to be able to move things around. If we want to be able to design drugs, there's a disease and we want to be able to attack a protein, and understanding what the protein looks like allows us to design drugs that can disable that protein. Or if we want to understand a disease, we want to know what changed. All right? So prions, the particular protein that leads to Mad Cow disease, are an example there: the disease transforms what the protein looks like.
So I'm going to try to get through as much biology as possible. I'm not a biologist but I want you to understand the complexities of the problem. So just as we have DNA's A's, T's, C's, and G's as a sequence of characters, it translates into an amino acid sequence or a protein sequence. So a protein starts off as this one-dimensional strand of characters that tag on one after the other. You can kind of think about it as building blocks, like Legos, and there's 20 different types of Lego pieces that you could piece together, but you can piece them together in any different way and you can have multiples of the same pieces.
So it forms this long string of connected molecules, and this starts to form a three-dimensional structure by folding in on itself. That process by which it goes from this simple, one-dimensional strand to a three-dimensional structure is an open problem, unsolved problem, and incredibly complex. Right? We can kind of understand it from very small proteins, but for large proteins it's very difficult to understand. There's a lot of research going into it. And so we want to be able to determine these structures.
Why is this interesting? Let me come to some more data to back this up.
Charlie: How do we get these images?
Ameet Soni: Oh, I'll talk about the images in a second.
These are solved structures and these are just different ways to represent it. So they're very complicated, and biochemists and biologists may want to look at it in different ways. So these are two ways to abstract all the atoms in the model. This one's representing the main atoms in the model. It's still a lot of information. This one tries to represent common patterns that you see in the protein. So like these pink coils here are these things called alpha helices that form the spiral, and these yellow regions are kind of like pancakes. They form these sheets that layer on top of one another. And so these are just simplified representations of the full atomic model.
So I'll talk about how we can determine what these models actually are. We see this very large growth in sequences. So in red here is the number of sequences; it's become very easy to ... or become a lot cheaper to sequence a protein. And so we see this somewhat exponential growth, and this was actually a few years ago, so I think we're past 25 million at this point in terms of the number of protein sequences that we know.
In blue, you see the number of protein structures. It's not an error, it's all the way down there. Right? That little tiny blip down there is the number of known protein structures. So the gap between the number of sequences and the number of structures is growing very quickly, and the number of structures is not scaling. We're at about tens of thousands of structures that are known versus tens of millions of sequences that are known. Okay. So we're learning a lot about sequences, which can be helpful, but it's not what we really want to know.
Almost all these structures that are known are produced via a method called x-ray crystallography. So chemists, biochemists will take an image of a protein by shooting x-rays at it, and they get this picture of it, and a human will go through and fit individual pieces of the protein into the model to be able to structure it. Talking to biochemists, they've kind of hit a lot of road blocks, in particular when you get to more complex organisms. So if you're dealing with bacteria, the proteins may be simpler and easier to structure, but when you get to more complex organisms and more complex functionality, the proteins get bigger and are harder to take images of. So they wanted to come up with computational methods that they could use as a tool to help along the way.
So the task for my thesis was, if I know the protein sequence, and I know all the building blocks that go into a particular protein, and I have an image of the protein, can I produce a model that fits all of the pieces into that image? I want to do this in a way that's chemically feasible. It can't violate any laws of chemistry or physics to do it. You have three-dimensional images, and these are actually really good images, just to kind of give you an example. But in blue here is where you see a lot of density, where you see a lot of matter, and we want to be able to fit the protein to kind of snake in there. This is, again, a one-dimensional string that kind of just weaves its way throughout this model to form a three-dimensional structure.
So, why is this a hard problem? As I mentioned, the image quality determines how easy it is to solve this problem. So if we have a very high-resolution image or with a low number here, so one angstrom, we get a lot of definition, it's easy to see things. So this is a type of amino acid that has this characteristic double ring structure. And we can see that double ring in the image up here. But as the image resolution gets poorer and poorer, we start losing almost all of the important definition.
So there were a lot of approaches that came up that dealt with these easy images. They could solve these really quickly, maybe in 30 minutes to a couple hours. But there was a dearth of methods that worked on these harder structures. This is not an issue of getting a better camera; this is an issue of how the protein wants to sit for a picture. So it's kind of like taking a picture of a toddler versus taking a picture of a statue. Toddlers don't sit still, so you get these big smears all over the place when you try to take a picture of them. The same thing happens with a protein: it doesn't want to sit still, and it has a hard time forming the crystal structures that make it easy to get high-quality images.
So we have these important proteins that we want to understand and we can't get better resolution out of them. So how do we solve the problem in that framework? Our approach is called ACMI, Automatic Crystallographic Map Interpretation. There'll be a quiz about that afterwards. What is our approach? So I want to first talk about why this is interesting from a computational perspective as well. It's interesting biologically: we want to know what the proteins look like and it's hard to do for a human. So in those poor-quality images, they usually just don't bother doing it because there's lower hanging fruit, some other proteins they can solve, or it takes a team of biochemists anywhere from six months to a year to solve the problem in a couple of extreme circumstances, which I'll talk about.
This is a good test bed computationally because these are very poor-quality images. So if you have methods for being able to analyze images, it's good to see how they work on the hardest problems. In terms of being able to think about all possible ways to structure a protein, there are more possible structures than we could possibly count if we had an entire lifetime of compute cycles to do so. So let's think about that.
I'll give you some tangible numbers. These are three-dimensional images. They're about 100 x 100 x 100, so 100 in the X, 100 in the Y, 100 in the Z. But if we were to multiply that out, that's about a million possible locations each atom could be in this image that I'm looking at. The chunks I'm going to use, the breakdown of the protein, are amino acids, so there's about a thousand ... anywhere from hundreds to thousands of amino acids in a protein. So if I multiply that out, it's the number 1 followed by zero, zero, zero, zero, six thousand zeros later, zero. That's a very big number. Right? So if we think about the number of atoms in the universe, that's 1 followed by 80 zeros. And this is orders of magnitude more difficult.
I'm not going to be able to put this on an exam and ask a student to solve it. Not until I get tenure, at least. So we can't solve these problems very easily. The joke I always said in grad school is this is like a Where's Waldo problem that's on steroids. Right? It's just a really hard image analysis problem.
Any questions so far?
Speaker 6: [inaudible 00:24:17]
Ameet Soni: That's a great question. The question was how do you analyze whether you got the correct structure or not?
So in terms of the humans, when they come up with a structure, there are several techniques that they could use to do this. They could talk about what is the match. If you have a model, you can estimate what density you expect, and you can say how much does that overlap with what our actual image is. Another thing you can do is you can withhold part of the image from the biochemist, like 10% of the image, have them solve the structure, and then bring back that 10% that they couldn't see and see how well they fit back. So if they came up with a bad model, they're not going to do well on the parts they didn't see. But if they have a good model, they'd be able to match within the parts that were unseen. In terms of how we analyze their model, I'll get to that in just a minute, in terms of the computational approaches.
Other questions? All right, great.
ACMI is a multiple-phased algorithm. I'm not going to talk about all the different parts; it was a team effort. I want to focus in on the parts that I made the most contribution towards and that involve my theme of graphical models. So to get there, the first thing I want to talk about is the first step that we took, which was, well, we have a very hard problem here. There are 10 to the 6,000th possibilities. I need to break this down into smaller chunks.
So what we're going to do is we're going to take this huge strand, this protein that we know is in the image, and we're going to chop it up into small, tiny pieces. So I don't know what the full solution is, but for these small pieces, there's only 20 possible building blocks that we could use. There are other structures that have been solved that had some of those pieces in there. So what we're going to do is we're going to look in all these other solved structures to see if there's anything that has a good match for this very small piece. I'm going to use those small pieces and scan our image to see if any of them match our image very well. So in the Where's Waldo analogy, you could think about it as: somebody solved a Where's Waldo book and cut out all the Waldos and gave them to you, and you have a new book, and so you're going to take each of those Waldos and scan them on the page to see, do they match any part of the page.
Lyla: So do you have enough resolution to find these fragments?
Ameet Soni: No. Well, yes and no. Right? So, obviously if we had no information, it would not be useful, but I'll talk about why it's still a hard problem even though we have a little bit of signal.
The question was: Do we have enough resolution to be able to find these fragments?
We have enough just to get some decent information, but not enough to solve the problem completely. So our end result is that we get ... if you're not really interested in the probability aspect, just think about it this way: it gives us a bunch of possible answers. So we went from a million down to just a few possible matches that we found in this image. So it's a good filtering step, but, getting into Lyla's question, why isn't it the end point? These images are very noisy or very poor resolution. It will kind of match blobs to each other even if they're not the exact same thing, just because there's not enough information there.
So we still get about a hundred potential solutions for each amino acid. We went from a million down to 100. Combined across all the amino acids in the protein, that's still one followed by a lot of zeros. Not 6,000 anymore, but still too many to handle in our situation.
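As an illustration only, the scan-and-filter step above might be sketched like this in one dimension (a hypothetical simplification: real density maps are three-dimensional and the scoring is probabilistic, not a plain sum of squared differences):

```python
# Sketch of the fragment-search idea in one dimension (hypothetical
# simplification of the 3D density-map scan described in the talk).

def match_fragment(density, fragment, top_k=3):
    """Slide a small fragment along a density profile and score each
    position by sum of squared differences (lower = better match)."""
    scores = []
    for start in range(len(density) - len(fragment) + 1):
        window = density[start:start + len(fragment)]
        sse = sum((w - f) ** 2 for w, f in zip(window, fragment))
        scores.append((sse, start))
    # Keep only the few best-scoring positions, like the filtering step
    # that cuts ~10^6 candidate locations down to ~100 per amino acid.
    return [start for _, start in sorted(scores)[:top_k]]

density = [0.1, 0.2, 1.0, 0.9, 0.2, 0.1, 1.1, 0.8, 0.1]
fragment = [1.0, 0.9]
print(match_fragment(density, fragment))  # best starting positions first
```

The point of the sketch is the filtering: we never pick a single answer here, we just keep a short list of candidate locations for the later global stage.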
The other problem that we had is that I broke these all up into small chunks and treated them independently, which is great when you have a lot of computers in a lab not doing anything, because each of them can solve one individual portion. But it's not taking into account the dependencies in a structure when you go from one dimension into three dimensions, because atoms interact with each other. And so I've completely ignored all those possible interactions in my model. So we don't have the big picture. We just have a lot of trees but no view of the forest.
So I want to be able to approach this problem, and this is the second part of our algorithm. This is where the graphical model comes in. Again, a graphical model is a mathematical construct, the way that we're going to view this problem, model this problem. So at a high level, what's going on here is that these circles, we call them nodes, represent some variable. The first circle in here represents the first piece of the protein, some small portion of the protein that I'll call alanine. That's the specific name of the building block, but it could vary by whatever protein you're looking at.
It has an idea of where alanine is without really knowing anything about the rest of the protein. I've just taken a very small portion of the protein and I've asked where does it think it is. And that's what we did in the previous Where's Waldo kind of phase. So I have these very noisy estimates there. A node is always asking: Where am I? Where am I supposed to be in this big three-dimensional box?
Speaker 6: You may have already said this but, so you already know all the amino acids that are present in the given structure. You're not also kind of using the library of amino acids.
Ameet Soni: Right. Yeah. So we know the amino acids that make up a particular protein. Yeah, that's exactly correct. Yeah. It's kind of like if I told you I took a picture of a lecture hall with the lights off and I told you the students in the class, but now I want you to find out where are all the students in this picture, but it's a very poor quality picture, and I want to find out where the students are. So we know what the parts are; we just don't know where they are located.
The node is always incorporating information about where it thinks it's located, and edges are our way to figure out what the big picture is. The edges enforce some kind of dependency between items. So atoms interact; they can't be in the same place at the same time. That's physically impossible. But we also know that there are bonds between some of these, and they have to be within a certain distance of each other in order to be physically feasible. So our edges are going to be able to incorporate that aspect of the protein structure. Another way to say this is that the nodes have all the local information, and the edges tell us what global information we need to incorporate into the model.
All right. I keep saying this, but the problem is that this is still a hard problem. Right? So we have this graph. It's a very big graph so I only showed you a small portion of it, but it's really thousands of circles, thousands of nodes, and millions of connections between those nodes. So trying to be able to figure out the exact best answer in there is still a very difficult problem. There's still more guesses than I have computational time to figure out. It's not an issue of Intel needs to come out with a faster processor; it's the nature of the problem itself.
We decided to use an algorithm called loopy belief propagation. Loopy belief propagation, if we want to think about it at an intuitive level, is a game of telephone. We have all these individual pieces to our problem and we can't communicate them all together at the same time; everybody's shouting at the same time. Doesn't compute. So what we're going to do is send around pairwise pieces of information. Two parts of the protein are going to communicate with each other about where they're located and how that should influence the other one. So we're going to pass around what we call messages. As an FYI, Judea Pearl won the Turing Award for this algorithm a few years back, which is like the Nobel Prize of computer science. It's an incredibly important algorithm in the field of artificial intelligence and machine learning.
As always, it's easier to look at an example rather than a bunch of words. So here's an image that will maybe help give you an idea of what message passing is doing. I've isolated two very small parts of the protein that happen to be next to each other. You can think about them as parts 31 and 32.
In this example, one's lysine and one is leucine. Each of those nodes has an idea of where it thinks it might be located. Lysine thinks it's in one of these four peaks, one of these four locations. These are two-dimensional pictures; I can't draw four-dimensional images, so we projected them down to two dimensions. And leucine thinks it's in one of those four locations. So we know there's only one correct answer, but we're not sure which of those four it's going to be.
What lysine's going to do is it's going to say, well, based on what I believe my current status is, I'm going to send you a message; we're bonded together. We have a chemical bond to each other, so we have to be within a certain radius of each other. So I'm going to send you a message saying, well, I'm here, so you need to be in kind of this halo region near me. Right? And I have these four possible areas, so you should be in one of those four regions.
Leucine is going to take that information and combine it with where it thinks it's located. So it's going to combine those two pieces together. So only things that overlap are going to be maintained, and things that disagree with each other are going to go down. These are probabilities, so we never go down to zero, rarely ever go down to zero. We're going to just down-weigh things.
So it may be hard to see, but those four peaks used to be the same height, and now one is much taller than the others. The reason is that it agreed with the message more than the other three peaks did. This is an iterative algorithm. We pass around messages many times over, so at some point we're going to repeat it, but we're going to send a message backwards. So leucine's going to say, well, now it's my turn to tell you what I think. And so it's going to calculate a message for lysine. Leucine still has four possibilities, but one of them is much taller, so it's going to say, you need to be within this certain radius of where I'm located.
I'm ignoring angles. There's angular issues here, too. Again, I can't draw in six dimensions very easily.
So we're going to have that message sent, and lysine's going to update its probabilities. And what we start seeing is that this very noisy first round, where we had hundreds of possibilities, starts getting filtered down to just things that are globally consistent. Because if you have two possible guesses that don't match up with each other, they're going to go down. But if you have two things that are consistent according to our global picture, those will get up-weighted.
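The lysine/leucine exchange can be sketched on a toy discrete line of positions. This is an assumption-laden simplification: ACMI's beliefs live over continuous 3D space, and real messages encode bond distances and angles, not a one-dimensional radius.

```python
# Toy one-dimensional belief-propagation message (hypothetical
# simplification of the continuous 3D messages described in the talk).

def send_message(sender_belief, max_dist, eps=0.01):
    """Sender says: wherever I am, you must be within max_dist of me
    (but not in my exact spot). eps keeps probabilities from hitting
    zero; we down-weigh disagreements rather than rule them out."""
    n = len(sender_belief)
    return [eps + sum(sender_belief[i] for i in range(n)
                      if i != j and abs(i - j) <= max_dist)
            for j in range(n)]

def combine(receiver_belief, message):
    """Receiver multiplies its own belief by the incoming message and
    renormalizes; peaks that agree with the message get up-weighted."""
    combined = [b * m for b, m in zip(receiver_belief, message)]
    total = sum(combined)
    return [c / total for c in combined]

lysine  = [0.5, 0.0, 0.5, 0.0, 0.0]  # lysine thinks: position 0 or 2
leucine = [0.0, 0.5, 0.0, 0.0, 0.5]  # leucine thinks: position 1 or 4
updated = combine(leucine, send_message(lysine, max_dist=1))
print(updated)  # position 1 (near both lysine peaks) is now much taller
```

After the update, leucine's peak at position 1 dominates because it's consistent with lysine's belief, while the peak at position 4 is down-weighted but stays nonzero, matching the "rarely ever go down to zero" behavior described above.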
Charlie: Can you get different answers depending on which message is first?
Ameet Soni: Yeah, that's great. Literally five minutes before I gave this talk, I dropped five slides that talked about that. I'd be happy to talk to you about that offline or after the talk; that'd be fine. But yes.
How we pass these messages is incredibly important. The analogy is, if you're playing a big game of telephone in a class, you don't want to pick the kid that always lies first, because that person is going to send the wrong information to everybody else, and now we just get bad information being propagated. And so one of my papers was about intelligently picking how we schedule things. It's probably my favorite paper. It got me a plaque for a best paper award. So that's why it's my favorite.
All right. So how do we evaluate this? This comes to the question that we got from the back of the room. What we want to do is we want to be able to solve a new structure and say we did a good job. The problem is, if we solve a structure and nobody else was able to solve it before, you have to take our word for it that we were right. Right? Or we have to come up with some way of validating our results.
So what we do is we talk to the biochemistry colleagues that we've collaborated with, and we ask them, "What are the 10 hardest examples that you could come up with?" And they gave them to us; we got 100%. So we went back and we said, "Can you try a little bit harder?" So they fished out 10 more with some other groups that they collaborated with, and they gave us the 10 hardest maps, the 10 hardest solutions that a human ever came up with.
In all cases, it took from several months up to a year for a human to solve. And even then, they're not 100% sure they have the right answer, but they have some answer that we can use. So we took these 10 proteins and we hid the correct answer, so we only have the images, the very poor images, and we have the sequences. And we wanted to grade our algorithm. We wanted to give it an examination and see how well it did against these keys.
So we had the human solution to compare against. We wanted to also see how we were doing against the state of the art, the most popular methods that crystallographers use. I'm not going to go into details about all of those algorithms. The major theme is that in this local/global dilemma, they pretty much prioritized local information first, and if it matches the global information, thumbs up, but they're not going to go back and say, "Well, let's try that again." They're called greedy algorithms. They take the best step first, and then they just go downhill from there. And if it goes to the right place, that's great, and if it doesn't, oh well.
There's definitely a trade-off of speed versus accuracy. As an example, the reason my thesis was hard to finish was that one of my proteins took a week to run the software on. So I'd come up with an idea, I'd program it, and then I had to come back a week later to see how it did. So I always had to have like five different branches going at the same time. Which [inaudible 00:36:18] existed back then, but it didn't, or it wasn't very accurate. It wasn't very well in use.
I don't know what's going on. Okay, there we go.
So here's the result on the 10 images. We averaged the accuracy of our models against what the humans produced. Don't worry about the blue versus red: red just means that you found something in the right place, and blue means you not only put something in the right place, but it was the correct amino acid. So if you flipped the structure around, you could still get some credit here. You can see that ARP/wARP, which is the most popular one, only gets about 20% even on the most generous measure, while our approach was able to get 80% accuracy on the protein structures. So we were able to reconstruct 80%.
And in fact, when we talked to the biochemists: their part of the grant was developing this cool cave that we could go into, a virtual cave where you could visualize a protein in 3D. It made me dizzy every time, but we would go in there and we would look at our solutions. And for one of them in particular, they said they actually thought our method was correct and the human was incorrect. We were grading against a human, but that doesn't necessarily mean that we were wrong on the remaining 20%.
Any questions about that? Yes, Lisa.
Lisa: In the loopy propagation algorithm, it's just random? Are you guaranteed to try every connection once?
Ameet Soni: We did a round robin, so if you had a thousand pieces, you just had a for loop that said do piece one, piece two, piece three, piece four, get to the end, and then go backwards. So we just went back and forth. It was a very fair algorithm.
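The round-robin schedule just described might look like this, with node indices standing in for protein pieces (an assumption; the real graph also has long-range edges beyond the simple chain shown here):

```python
# Sketch of the round-robin message schedule: sweep forward along the
# chain, then backward, and repeat for as many sweeps as needed.

def round_robin_schedule(n_nodes, n_sweeps=1):
    """Return (sender, receiver) pairs: 0->n-1, then n-1->0, repeated."""
    order = []
    for _ in range(n_sweeps):
        order += [(i, i + 1) for i in range(n_nodes - 1)]          # forward
        order += [(i + 1, i) for i in range(n_nodes - 2, -1, -1)]  # backward
    return order

print(round_robin_schedule(4))
# forward: (0,1),(1,2),(2,3); backward: (3,2),(2,1),(1,0)
```

The "fair" property is that every edge carries a message in both directions exactly once per sweep, regardless of which nodes have reliable beliefs; the scheduling paper mentioned above replaces this fixed order with an informed priority.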
Ameet Soni: Which gets to my conclusion slide, which I'll talk about in another second as well.
So my conclusion: this was my first interaction with graphical models, and it's why I fell in love with them. They handle uncertainty and noise very well. They also model dependencies in structure and data very well. The problem is we need to develop more efficient algorithms for being able to do analysis on them.
Some other work: one of these papers was about the message passing algorithm. I talked to biochemists about how we schedule messages, and I asked them how they solve a structure. That paper was about asking an expert how they would solve a problem and mathematically incorporating that into the algorithm. What we did was we talked to a biochemist, and he told us that they look at the center of the molecule. In particular, they look at parts that they know are going to be stable; there are these things called alpha helices. And so what I did was say, well, tell me in this sequence what you think is going to be an alpha helix. They marked things down for me, and I put that into the algorithm and prioritized those messages, and it gave a huge boost in accuracy to the algorithm.
That was a general machine learning paper about how we need to devise algorithms that not only look at data in terms of efficiency but also think about tailoring them to particular problems, and how to get humans in the loop that way.
This is work again I did at Wisconsin. Very thankful for the biochemistry group there and my collaborators in computer science as well. I was able to work one summer here with Stella Cho and Emily Dolson, and we had some pretty cool results there as well.
The last thing I did as I was finishing up my thesis: I sat down with a postdoc who was trying to finish up his postdoc. He couldn't solve a structure. We worked with each other back and forth, and we were able to solve this structure and deposit it in the PDB. So it's kind of cool; my name is in the PDB with a structure that pops up. He was only able to get 10%; other computational methods could only get 5%. And we only got 50% with our method, but with that 50%, he was able to bootstrap the rest of the solution.
All right. Any questions? All right.
So, I'm going to move on to the second part of my talk, which is brain image analysis. We want to look at brain images and be able to diagnose certain conditions. All right.
Our initial motivation was to attack the problem of Alzheimer's disease. Alzheimer's disease is a neurodegenerative disorder that unfortunately a lot of us are familiar with, and the problem is increasing as the population ages and lives longer. There's no known cure currently, and really, we can't even confirm a diagnosis of Alzheimer's until you do a post-mortem analysis. The reason is that Alzheimer's looks like a lot of other degenerative or brain-diminishing problems, including aging. And so in these early phases, it's hard to distinguish cognitive decline due to aging from Alzheimer's disease.
And usually a clinician will diagnose this really late in the game, when it's clear that dementia has set in. So we really want to help doctors do this diagnosis much earlier so we can think about early intervention measures, and maybe we can also discover biomarkers, things that can help us better understand the disease, why it happens and how it happens. All right.
So we have a data set that we wanted to look at in particular. There's many different avenues in which you could come at this problem. We look at three-dimensional images again, so these are MRI images of brains. They're very high resolution.
A voxel, I am going to use that word, is just a pixel but in three dimensions. So if you're not sure what a voxel is, just substitute pixel in there. We have about a million voxels per image. And the other great thing is that there's a long-term project that's been going on where they've been tracking several hundred patients over about five years, taking brain images, having them see doctors, and making notes in the database about how their Alzheimer's is progressing. So we have a lot of rich information that we have access to that we can start analyzing.
So quickly, my first summer I was working with Chris Magnano and some collaborators at Wake Forest. We wanted to understand how neuroscientists did image processing. And so we looked, and there's about 30 preprocessing steps that they do, and we wanted to understand all of them in detail, just to understand, like, why they do these things.
One of the first precursor steps is this: the image is very noisy, and there's a lot of detail in there, and it's hard to do analysis for humans, let alone some of the statistical packages that are out there. So what they do is convert this raw image into a tissue map. For each voxel, underlying it is some type of brain tissue: it's gray matter, it's white matter, it could be your skull. There are other smaller things, like cerebrospinal fluid, in there. And so what they do is convert the image into this simplified representation called a tissue segmentation.
The most popular approach is to do what's called template registration. They took a bunch of healthy humans, took the average brain amongst those humans, had a doctor go through and look at where the tissue types were in those brains, and said this is the gold standard for what a brain looks like. All right? So there's one solution for all brains about what the segmentation should look like. Now your problem becomes fairly easy. When you have a new patient come in, you take their brain and you make it fit this atlas, this gold standard. So if it's smaller or if it's slightly warped, you just re-warp your image so it fits the original image and then, presto, you have your gray matter and white matter segmentation.
Here's an example of a brain that's slightly different but was re-warped to look like this normal brain. The idea there is that the distortion is due to some image processing problems. Somebody moved in the MRI machine, or a fly flew in, or something like that. So there's some issue with the imaging technology.
The problem is that all the brains that were used to develop this average brain were normal, healthy adults. So if you're trying to understand children, or if you're trying to understand neurodegenerative disorders that potentially cause morphological changes to the brain, like degeneration in brain matter, you're now losing information, because you're warping that brain to look like a normal brain.
So we actually looked at the data and saw this. Hopefully you can see this. This is the original image, and this is what a human, a doctor sitting down, figured out was the gray matter and white matter. Again, this is a two-dimensional image; it's kind of a slice from ear to ear if you're looking at the brain. And we can see that there's a little blob here that's kind of missing or unbalanced between the two sides of the brain.
If we look at the two most popular approaches to solving this problem, they did different things. VBM just cut off the other side and said, well, if it's missing from one side, we'll just make it missing from the other side. And SBM filled it in; it's hard to see from the back, but it filled it in with some background noise and said, okay, it was missing, but there's probably something there, so we'll just fill in some random noise.
So to me it looks like a bad guess about what it thinks the data should look like. We really had an issue with this because we were trying to understand Alzheimer's; we don't want the preprocessing to destroy our original information, our original data. So our solution was a graphical model again. We wanted to understand the tissue type, but we also wanted to acknowledge the fact that tissue types are correlated. If we look back at this image, the white dots are surrounded by other white dots, and the gray dots are surrounded by other gray dots. They're not independent of each other. White matter exists next to white matter.
So we decided to use a graphical model that modeled this for us. The top of the model here is our original data, our original image, and at the bottom is what we want to predict. What is the tissue type at every spot in the brain. And we use these edges to tie them together, so they tie the predictions together. Kind of like we tied the Netflix users together by whether they would like Moonlight, we want to tie these tissue predictions together.
What we did was use a model called a conditional random field. We wanted to predict whether a portion was white matter, gray matter, or other, and we said, if you're trying to predict a particular voxel, its class, its category, is going to be dependent on its neighbors to the side, above, and below. So we form this 3 x 3 x 3 cube to help smooth these predictions.
Any time you do machine learning, you have two phases. You have a learning phase, where you want to understand your model, and then you have a prediction phase. Our learning phase was: we want to know how strong these dependencies on the edges are. We assume that there are edges and connections between them, but we don't want to assume that all of those edges have the same weight, and we don't even want to assume that they all matter; for some of them, maybe the values don't matter at all, and we can just drop them.
So we take solved brains, so humans supervise the process. They said here are some known solutions. Learn a model that explains this data, and then in the second part, we're going to give you a new patient, and we want you to tell us what the tissue segmentation is for that new patient.
That was our model, and we wanted to be able to experiment and validate this just like we did with the proteins. And so how does our model compare to other approaches out there, and how does our model do if we want to generalize it to different types of populations?
Unfortunately, we don't have a ton of available data. In fact, we don't have any data specific for Alzheimer's patients. We thought we did, but it turns out all of them were just using these other statistical packages. So we didn't really have a human ground truth for the data that we wanted. So we found this proxy. It was a repository that was set up, and there are 38 high-resolution brains in there. Nobody would call this big data, but we had some solutions that we could use for our model. So we thought we'd start out here.
What we did was take some portion of those brains, train our models on them, and then take the ones that we didn't consider for the first round and evaluate how well we predicted those. These are our results. Higher is better. It's a particular index that neuroscientists use called the Dice index, but it's essentially how much your solution overlaps with the correct solution.
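The Dice index mentioned here is simple to state: twice the overlap divided by the total size of the two regions. A minimal sketch for binary masks stored as sets of voxel coordinates:

```python
# Dice index: 2|A ∩ B| / (|A| + |B|); 1.0 is perfect overlap, 0.0 none.

def dice(predicted, truth):
    """Overlap score between a predicted mask and the ground truth."""
    if not predicted and not truth:
        return 1.0  # two empty masks agree trivially
    return 2 * len(predicted & truth) / (len(predicted) + len(truth))

pred = {(0, 0), (0, 1), (1, 0)}
true = {(0, 0), (0, 1), (1, 1)}
print(dice(pred, true))  # 2*2 / (3+3) ≈ 0.667
```

In practice this would be computed per tissue class (white matter, gray matter) and then compared across methods, which is what the bar chart described next shows.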
So in blue is our algorithm. So we get about 80% of this overlap in white matter or gray matter. What's kind of interesting is that the other algorithms highly prefer white matter as a default, and we kind of see that in the results as well, but we do better across both matter types, in particular gray matter, for being able to predict things.
I'll kind of return back to the original image that I showed where the two competing methods filled in some random noise, and you can see our model didn't do that. Right? It kept this deformity in the image, and it was able to model the original image without doing any kind of deformation. And so it was able to handle the noise, but whether it was noise or signal, it didn't want to make an assumption about it, and it was able to model these dependencies. So you don't see random splotches throughout this image. You see continuous, contiguous white matter and gray matter regions.
There were many other results in the paper, but that was the main result that I wanted to talk about here. So any questions about that? Yes, Vince.
Vince: In Alzheimer's, it seems that these finger-like structures are a piece that might be diagnostic.
Ameet Soni: Yeah.
Vince: It looks like your method is kind of filling in those finger-like structures. Is that something that you're thinking about?
Ameet Soni: Yeah. That is something we're thinking about. So late in Alzheimer's we do see white matter degeneration. That's a very good indication there. We do have somewhat of a bias of over smoothing, and we have follow-up directions. Unfortunately, Chris ... well, fortunately for Chris, he graduated. Unfortunately for me, he graduated. So we weren't able to follow up on those directions.
One thing that we did was that we used the distance from the current voxel to the center of the brain, and I think that may have been overly smoothing things. But it did help incorporate the fact that white matter is closer to the center, and gray matter is towards the outside. So I think we wanted some more fine-grain details there as well.
So this was the first known application of conditional random fields to brain images. They're primarily used, actually, for literature analysis, for analyzing natural language. Chris did a lot of great work in terms of being able to adapt them to this image analysis problem set.
Yeah, that was actually the summer our firstborn child was born, and I told Chris, "I've put together this boot camp. I am going to train you on graphical models, and then it's kind of up to you." And so I wasn't expecting much, and then I came back and I looked at his results, and he's like, "I think these look good." And I was like, "Yeah, those are ... those are good." (laughs) So that was awesome. That was a great outcome for both of us.
And we had our collaborators. Sriraam's at IU now. He was at Wake Forest at the time. All right. So a few minutes here to talk about some more recent work just to kind of bring this back around to the original slide.
So, the original task was Alzheimer's disease. That result had nothing to do with Alzheimer's disease. We did have a result in the paper showing that we could generalize across two different populations. So we're using that as a proxy to say that we can generalize well to Alzheimer's, but we couldn't verify that exactly. But we wanted to return back to this original problem.
A second problem that we identified in the literature is that a lot of the papers out there only distinguish between patients that have Alzheimer's and patients that are normal. The problem is Alzheimer's is a spectrum. So they're taking the most extreme examples at both ends and getting 95% accuracy, but it's a very trivial problem. What we want is to be able to predict somewhere in the middle as well, the entire spectrum of Alzheimer's.
The other thing is, a lot of these algorithms simplify the problem, not because they're lazy but for computational reasons: you need to take these million voxels and group them into bigger chunks. And we think a lot of information is being lost in that process. We had shown with the first part that that was one way they were losing information, but they also do other types of segmentations that simplify the problem.
So our goal was to develop an algorithm that considers the entire spectrum, and we also wanted to be able to do a longitudinal analysis. We had this data where we have patients at time point zero, but also what happens five years later. So if somebody was healthy at time point zero, we want to know if we can predict whether they're going to develop Alzheimer's in five years or not.
And the last thing is that we didn't want to simplify the data. We wanted to be able to learn from the raw data. The theme of the first two parts is that I don't care about computational complexity. Well, I do, but I want to solve it rather than simplify the problem to begin with.
So the big new hope that's been coming in AI over the last few years is deep learning. I had a great opportunity over sabbatical to teach a seminar with Lisa and Rich. It's great that, you know, we have big class sizes. So my first seminar was while I was on sabbatical. But we were able to hand-pick a bunch of really bright senior students, and I really wanted to learn this new area, and it was a really fruitful seminar where we worked on and understood the details of this.
Because I'm running short on time, I'll just kind of hand wave at it. But the general idea is that they've really set the benchmark for almost all image analysis techniques, and the key ideas are that it's able to learn from raw data and that it's able to learn from lots of data.
Neural networks are biologically inspired; the claim is that they mimic human learning. Specifically with convolutional neural networks, the way that we can use the raw data is to think about how a human looks at an image. When we recognize something, and I'm abstracting away here, so the cognitive scientists can plug their ears, we don't look at something and immediately say it's a dog. We have different layers of our brain that process the image. Maybe we do some low-level analysis, maybe contrasts or edge detection in these images, and we start to construct higher-level concepts, like shapes. And then those shapes come together to form objects and so on.
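The low-level "edge detection" idea can be sketched with a single hand-set one-dimensional filter. This is a toy assumption: a real convolutional network learns thousands of such filters, in two or three dimensions, from data.

```python
# Toy version of a low-level "edge detection" feature (hypothetical
# hand-set filter; real networks learn their filters from data).

def convolve1d(signal, kernel):
    """Slide the kernel along the signal and take dot products."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A [-1, 1] filter responds strongly where intensity jumps: an edge.
row = [0, 0, 0, 1, 1, 1, 0, 0]   # one bright band in an image row
print(convolve1d(row, [-1, 1]))  # spikes at the band's two edges
```

Stacking many learned filters like this, layer after layer, is what lets the later layers build up shapes, parts, and whole objects, as the talk describes next.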
And so we see that neural networks are able to do that. It may be hard to see, but a neural network is many layers together, and that's why it's called deep: it's many layers one after the other. These initial layers are just learning kind of blobs, but they look like features or edges in the problem.
What we start seeing with the later layers is that each of these parts of the network is learning something different. Some of them are learning noses, some are learning eyes, some are learning smiles, and so on. And then we go down to the final few layers, and we can see that they've learned different types of faces. So each layer is learning at a higher level of abstraction.
The reason this is great, and the question I'm really interested in, is: can we take the first few layers and apply them to any problem? So maybe this was trained on faces, or maybe it was trained on dog images, but can we use that to solve problems like brain imaging or other related problems?
Razi Shaban did some cool experiments in our seminar about generalization. That's an image of campus. He took a neural network, trained it only on one type of thing, and wanted to see how it interprets an image. So what he did is he trained it on Gauguin images. This was the image that he trained the neural network on, and he wanted to see how his neural network interpreted the original image. It was kind of a reverse-engineering thing, so it restylized the image.
I think this was a Jasper Johns, and it gets this cool effect. There are all these deep dream images out there that are really cool, but there's underlying science going on there about what these networks are doing.
If I had these results a week ago, I would have probably had more slides about it, but as of a week ago, we had results. We've been working on it a while. We were hitting a roadblock; we were about on par with current approaches. But I got an e-mail last night from a former student, Aly, who's been plugging away at this problem. He's a great student. Like, I got really lucky that he got interested in this project. And he told me he got up to 88% accuracy, which is about 10% better than anything else out there. I shot an e-mail back to him, and I've been hitting refresh on my browser all day hoping he'll write back to me, but he hasn't yet.
That's really exciting. We know we still have a few roadblocks. All the approaches out there assume that you're working with two-dimensional images, so we hit resource limitations when we wanted to deal with three-dimensional brain images. Yeah.
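The resource limitation shows up in back-of-the-envelope arithmetic: going from a 2-D slice to a 3-D volume multiplies a convolutional layer's activation memory by the side length. The numbers below are illustrative only, not from the actual experiments:

```python
# Activation memory for one conv layer (float32), 2-D slice vs 3-D volume.
# All numbers are hypothetical, chosen just to show the scaling.
bytes_per_float = 4
maps = 64        # number of feature maps in the layer
side = 128       # side length of the image / volume

mem_2d = maps * side * side * bytes_per_float       # H x W
mem_3d = maps * side ** 3 * bytes_per_float         # H x W x D

print(f"2-D slice: {mem_2d / 2**20:.0f} MiB, "
      f"3-D volume: {mem_3d / 2**20:.0f} MiB "
      f"({mem_3d // mem_2d}x more)")
```

The extra depth dimension contributes a full factor of `side`, which is why methods tuned for 2-D photographs run out of memory on volumetric brain scans.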
Some great students worked on this: Andrea Mateo worked with me over the summer. Chris was the first one who approached me wanting to learn about deep learning. In hindsight, we did a lot of things wrong because we didn't really understand what the literature was saying, but that pushed things forward and got the ball rolling.
Aly, again, has graduated, but he's taking a [inaudible 00:55:57] for medical school and was working at a lab at Harvard. The reason he sends me e-mails at [inaudible 00:56:02] is that he does a nine-to-five job and then does this for fun. And that's great, right? So he's been really plugging away, and I'm really interested to see where we can take it.
The cool thing about that network is that it was trained on images from Google image search. It's just trained on, like, pictures of humans and dogs. And if you take that away, the solution is much worse. So it's really cool that this neural network is able to bootstrap from the millions of images that we see in the world every day to be able to understand brains better.
I'm low on time. There are other projects applying deep learning that I'm going to work on this summer, including transcription factor prediction. Raehoon and James have been working with me, and I'll have a couple of students working on it with me this summer.
My sabbatical research was thinking about graphical models in a very different way, lifting them to what's called first-order logic. The main idea there is that it's much more expressive but computationally much more expensive. We had several papers come out on text analysis, and we just had a paper accepted at an AI medical journal on being able to diagnose Parkinson's disease.
The main idea there was that we wanted to move toward a heterogeneous model. Before, models would take one piece of data, say, only genetic information, learn a model, and predict. But these new models need to think about combining multiple modalities: genetics, brain imaging, doctors' notes, blood work. We have to deal with the fact that some patients never went in to get their blood work done, or you have somebody like me who sees, like, four different doctors, versus somebody like my wife who never has to go to the doctor because she's always healthy. So we have different types of information for different types of people.
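One simple way to handle the "some patients never got blood work" problem is to score each modality separately and combine only the modalities a patient actually has. This is just a toy sketch of that idea with made-up numbers, not the group's actual model; the stored values stand in for risk scores that separately trained per-modality models would produce.

```python
# Toy patient records with three modalities; None marks a modality the
# patient never had collected (e.g. no blood work on file).
patients = [
    {"genetic": 0.8, "imaging": 0.6, "blood": None},
    {"genetic": None, "imaging": 0.2, "blood": 0.3},
    {"genetic": 0.7, "imaging": None, "blood": None},
]

def combined_risk(record):
    # Average only over the modalities that were actually observed,
    # so missing data shrinks the evidence rather than corrupting it.
    scores = [v for v in record.values() if v is not None]
    return sum(scores) / len(scores)

for p in patients:
    print(f"risk = {combined_risk(p):.2f}")
```

Real heterogeneous models combine modalities in far richer ways, but the masking idea, never letting an absent modality contribute a fake value, is the common thread.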
This was work I did with a group at Indiana University. We had some great results out of it that I'd be happy to talk with you about offline.
So my high-level summary is that the trends in big data necessitate new approaches. Probabilistic graphical models are one avenue for handling noise and structure. Deep learning is the exciting new thing that's been holding promise due to parallelism and being able to handle lots of data with little feature engineering. I'm curious about what the future is, and I think it may be a combination of all of these things; I don't think it's just one. So there's a lot of really cool work going on in combining deep learning with graphical models that I hope to get involved with this summer.
So, thank you. It was a wonderful sabbatical; I was able to do lots of cool things. I was able to officiate a wedding between my best friend and his partner, which was a great experience. I lost a bathtub debate because the vote was rigged. It was one of those nonsensical scenarios they always come up with, like, what if the world is coming to an end. And the fake scenario was that Donald Trump won the election. (laughs) So that hit a little close to home. I was able to go to Barcelona for a little babymoon with my wife, and we also welcomed our second child. So it was a great sabbatical. But technically that was after the sabbatical, so I guess ... thank you for paternity leave [crosstalk 00:58:53]. Any questions? (applause)