
Shape of Alzheimer's Disease Research

As a researcher in a basic science lab, part of my job involved nitty-gritty details, like deciding whether the broth keeping my cell cultures alive should contain Vitamin A. The slightly bigger picture in my day-to-day was designing and running experiments to find out which players in a specific signaling pathway might be involved, in one cell type, in a genetic form of Alzheimer's Disease (AD). Small slices of small slices from a bigger pie. You could say that I developed a researcher-specific variant of FOMO.

AD is a growing concern, both in how many people are affected and in the cost of caring for them, so it may not be surprising that a lot of research is published about AD. Only a small number of articles were relevant to each project I worked on (take a look at the reference count of any article), and I wanted a way to look at all the articles returned when I entered "alzheimer's" as my search term in PubMed, a search engine for biomedical and life science articles. One way to do so is to cluster them into groups and do topic modeling. Here are the tools I used.

Clustering is an unsupervised learning method: there are no labels indicating which article belongs to which group, so there is also no natural way to divide the articles into training and test sets.

I first ran a search for "alzheimer's" on the PubMed website as a sanity check, to get a target number to look out for when I queried for articles using Biopython. Learning Biopython was interesting in that I ran into incomplete documentation. I hadn't had problems with the scikit-learn docs (yet), so finding example code that sent me into an infinite loop was a surprise! In case you are curious, the code I am referencing is in section 9.15.2 of the Biopython tutorial: "Searching for and downloading abstracts using the history." The code I provide is (hackily) fixed.
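To make the idea concrete, here is a minimal sketch of that search-and-fetch pattern with `Bio.Entrez` and the history server. The email address, batch size, and function names are my assumptions for illustration, not the post's actual code; the network calls are kept under `__main__` so the paging helper can be used on its own.

```python
def batch_starts(count, batch_size):
    """Start indices for paging through `count` results in fixed-size batches."""
    return list(range(0, count, batch_size))

def fetch_abstracts(term, email, batch_size=500):
    # Imported here so the pure helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term=term, usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    count = int(search["Count"])
    texts = []
    for start in batch_starts(count, batch_size):
        # The history server (WebEnv/QueryKey) lets efetch page through
        # the same result set instead of re-running the search each time.
        fetch = Entrez.efetch(
            db="pubmed", rettype="medline", retmode="text",
            retstart=start, retmax=batch_size,
            webenv=search["WebEnv"], query_key=search["QueryKey"],
        )
        texts.append(fetch.read())
        fetch.close()
    return texts

if __name__ == "__main__":
    fetch_abstracts("alzheimer's", email="you@example.com")
```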

I stored the abstracts in MongoDB on AWS; abstracts, unlike full articles, are usually not locked up behind a paywall. A NoSQL database made sense because I was really just storing the contents of whatever Biopython returned.
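A sketch of what that storage step might look like with `pymongo`, assuming the records were parsed with `Bio.Medline` (where "PMID", "TI", and "AB" are the standard Medline field codes for ID, title, and abstract). The connection URI and helper names are placeholders, not the post's actual setup.

```python
def to_doc(record):
    """Map a Medline record dict to a MongoDB document; None if no abstract."""
    if "AB" not in record:
        return None
    return {
        "_id": record["PMID"],          # PMID is already unique, so reuse it
        "title": record.get("TI", ""),
        "abstract": record["AB"],
    }

def store(records, uri="mongodb://localhost:27017"):
    # Imported here so to_doc stays usable without pymongo installed.
    from pymongo import MongoClient

    coll = MongoClient(uri)["pubmed"]["abstracts"]
    docs = [d for d in (to_doc(r) for r in records) if d is not None]
    if docs:
        coll.insert_many(docs)
    return len(docs)
```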

Biopython returned 130,672 entries (the same number as on PubMed, whew), but only 115,509 of them contained an abstract. I clustered with KMeans, which requires specifying the number of clusters to group the abstracts into. I looked at the inertia for up to 20 clusters, hoping to find a glimmer of an elbow, but no luck. I have never rested such hope on an unglamorous body part, but an elbow tells you that increasing the number of clusters past that point won't reduce the error much.
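The elbow check itself is a short loop: fit KMeans for a range of k and record the inertia (within-cluster sum of squares). A minimal sketch, with synthetic blobs standing in for the TF-IDF matrix of abstracts:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for the abstract vectors.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia always shrinks as k grows; an "elbow" is a sharp bend in the
# curve, after which extra clusters buy little reduction in error.
```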

But Susan, you might say, there are other clustering methods that don't require you to pick a number of clusters up front. Well, I did try Mean Shift on a subset of abstracts.
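For the curious, a sketch of what Mean Shift on a small subset might look like. Mean Shift estimates the number of clusters itself, but it needs dense input, so in practice a sparse TF-IDF matrix would be dimension-reduced first; synthetic blobs stand in here, and the bandwidth quantile is an assumption.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Dense toy stand-in for a reduced subset of abstract vectors.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

# Bandwidth controls the kernel size; estimate it from the data.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)
ms = MeanShift(bandwidth=bandwidth).fit(X)

n_found = len(np.unique(ms.labels_))  # Mean Shift picks this on its own
```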

So I went back to KMeans, picked an arbitrary 10 clusters, and generated a list of keywords for each cluster. You can think of these as the words closest to the center of the cluster (the centroid). Three clusters, together accounting for almost 40% of the data, include some combination of "patients", "clinic", and "caregiver". Phrased differently, at least 40% of AD research didn't exist in my old world, which was starting to quantify just how focused I had been.

I also wanted to topic model the dataset, so I set up a pipeline to tokenize each abstract with a TF-IDF vectorizer, do LSA with Truncated SVD (the dimension reduction step), and normalize the resulting matrix. Once the data was fit, I could extract the topic words. Note that these differ from the cluster keywords: each abstract belongs to exactly one cluster, but can contain multiple topics (belong to multiple topic groups).

Perhaps not surprisingly, "amyloid" and "abeta" (short for amyloid-beta) show up in multiple topic groups. I underlined "beta" as well; I wasn't sure at that point whether it referred to "beta-amyloid" (another common variation) or "beta-secretase" (punctuation was removed at the tokenizing step). For a long while the prevailing hypothesis was that amyloid deposits drive disease progression: under a microscope, AD brain tissue contains big chunks of misfolded protein that are definitely not present in a healthy person's brain tissue. On the flip side, tangles, made up of abnormal tau protein, sit inside cells, are not visually obvious, and didn't gain much attention until later.

And that concludes our brief intro to amyloid (plaques) and tau (tangles).

At the beginning of this post, I mentioned that I wanted to "look" at all the articles. One way to do so is to revisit our dimension-reduced matrix and perform t-distributed stochastic neighbor embedding (t-SNE), a way to visualize high-dimensional data in 2- (or 3-)D space. It doesn't work well on a large dataset (displaying 100,000 points on a single 2D plot would be a meaningless mess), so I used a subset.
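A minimal t-SNE sketch on a small subset, with blobs standing in for the LSA matrix. One constraint worth knowing: perplexity must be smaller than the number of samples. Sizes and parameters here are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Toy stand-in for a subset of the dimension-reduced abstract vectors.
X, _ = make_blobs(n_samples=100, n_features=10, centers=5, random_state=0)

# Embed into 2-D for plotting; perplexity < n_samples is required.
emb = TSNE(n_components=2, perplexity=15, init="pca",
           random_state=0).fit_transform(X)
# emb[:, 0] and emb[:, 1] are the x/y coordinates for a scatter plot.
```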

Look at that colorful cloud! There's a lot of information packed in here: an already-reduced matrix of 300 dimensions squeezed down further to 2. Distances between clusters don't mean much (running the function again on the same data will produce a different plot). Also, I didn't experiment with the perplexity or other parameters, so there's room for experimentation and improvement. I found this interactive article and am excited to learn from it.

I ran out of time to implement a simple recommender system based on the clustering results, but that would be a clear extension of the project. It would be an alternative to PubMed, similar to SemanticScholar from the Allen Institute for Artificial Intelligence, or Google Scholar. I found out about SemanticScholar later, which was initially a bummer (they thought of everything!), but that feeling didn't last long; great minds think alike, right?
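One way such a recommender could work, sketched under assumptions (toy corpus, cosine distance in LSA space): embed the abstracts, then return each abstract's nearest neighbors as recommendations.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "amyloid plaques in patients",
    "amyloid beta deposits in brain tissue",
    "caregiver burden in the clinic",
    "tau tangles inside neurons",
]

# TF-IDF then LSA, as in the topic-modeling pipeline.
X = TruncatedSVD(n_components=2, random_state=0).fit_transform(
    TfidfVectorizer(stop_words="english").fit_transform(corpus))

# Cosine similarity in the reduced space drives the recommendations.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[:1])  # neighbors of the first abstract
# idx[0][0] is the query abstract itself (distance ~0);
# idx[0][1] is the top recommendation.
```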

If you want to see the code, head over to the repo on GitHub here.
