Posts tagged with dissemination

Visualizations

Visual.ly recently posted a list of the top visualizations of 2011. The #1 entry is a map of the world in which tweets are plotted using their GeoIP info and color-coded by language. It looks cool, but there is no message. I could have guessed how this map would look. Germans tweet in German, Brazilians tweet in Portuguese, and so on. Big urban centers have a lot of tweets. Developing economic areas and sparsely populated regions don’t. I don’t see anything interesting. And so many of the colors are hard to distinguish that, if there is any new information, it’s hard to pick out. They should have added more text labels on the map to identify languages near where they are found. Really? This is #1 on the list?

The field of data visualization is not well defined. The comeliness of a visualization is important, but if nothing is revealed by the visualization, if no story is told, then it fails. There seem to be quite a number of people who enjoy studying data visualization but are almost purely interested in the aesthetics. This makes it difficult to get really good advice. A recent book review in Science by Robert Kosara seems to be wrestling with this issue. Two books are covered: Yau’s Visualize This and Lima’s Visual Complexity. The former is a practical guide to creating effective visualizations; the latter is a collection of breathtakingly beautiful works of art with no apparent message. Kosara notes that Lima “never attempts to explain what viewers can learn from any of the examples.”

Tufte has already said much of what needs to be said about data visualization. For example: “Have a message.” “No chart junk.” “Maximize the data-ink ratio.” Tufte himself said of Lima’s book, “One useful question to ask of each image is: What did I learn from this, in addition to seeing an elegant architecture?”

Yau does an excellent job following Tufte’s principles. He has a website that is worth checking out, flowingdata.com. And don’t forget Tufte’s own website. Junk Charts is good too. It just posted a holiday Venn diagram.

Tycho Brahe

Like the Mayan astronomers over 600 years before him, Tycho Brahe was a data factory. A data factory in the same vein as the Human Genome Project. Or as the Allen Institute for Brain Science is today.

In most formulations of the scientific method, the hypothesis is generated somewhere in the middle. What comes first is careful observation. What comes last are the hypothesis-testing experiments and controls. Often these individual steps are handled by different scientists and groups. For example, the Human Genome Project’s primary goal was one of observation, not hypothesis testing. Perhaps the same is true for the Mayans who observed the movement of Venus across the sky.

Tycho Brahe did what the Mayans did, 600 years later and in even more detail. He had the best primary data of his time for the positions of celestial objects in the sky. It was this high-quality data that enabled Kepler to work out elliptical orbits for the planets.

A new website, Neurotycho.org, has set out on a similar mission. There, you can download data from primate experiments and reanalyze it. The setup suggests that they’ll accept data from other people at some point, but so far, it’s a one-lab show. That one lab is Naotaka Fujii’s RIKEN lab.

There are similar efforts elsewhere in neuroscience. What’s unique about Neurotycho is that they seem to be reaching out to a very general audience. They also have a wiki with more details.

For the next two weeks, you can get a personal 1-year subscription to any of the Nature publications for a price equal to that journal’s impact factor. (link) (via xcorr)

Wolfram is pushing a new document format called Computable Document Format (CDF). It looks like PDF + embedded Java apps.

On the one hand, there’s the xkcd viewpoint. This is basically just another step in the evolution of Mathematica’s native file format. And right now, Mathematica 8 is the only way to author a CDF file. Alternatively, one can simply author a webpage with embedded Java apps, or maybe even HTML5. Then everyone can use it, anyone can modify it, and most platforms will play it. On the other hand, web pages don’t print out as nicely as PDFs (CDFs should print out as nicely), and it can be a bit messy to download and view a webpage with embedded apps offline.

So maybe there’s a future for CDF. Although the same basic results can be obtained using HTML5 or Java, Mathematica makes it very easy to create some types of interactive infographics. Arguably, it’s exactly what Elsevier’s Executable Paper Challenge is looking for. However, it’s a closed format. Although Wolfram says the specification is public, the restrictions are perhaps enough to prevent wider adoption. People who don’t own Mathematica get only a player (~500 MB download).

It’s hard to get people to install another player on their computer. Flash had success because of streaming video. Shockwave got people to install it for games. I think it will be a challenge to get readers of the NY Times to download, en masse, another player for their browser just so that they can see an infographic.

See that 23% bar for the RealOne player? And that’s for streaming media, a very broad market. What does Wolfram expect for such a narrow application? No matter how optimistic they are, why would they want to do this? They have to support this software for a bunch of different platforms, including mobile devices. This includes direct customer support and keeping up with changes to the platforms, security, etc. All for free. If Mathematica offered an “Export to HTML5” option, they’d sell more software, because then authors would know that everyone can see what they produce without having to download another player.

So perhaps CDF is a bit like Wolfram Alpha: utterly useless outside of a very narrow field of applications, in which it performs utterly beautifully.

But that’s where the problem lies. I downloaded the player and tried out several of the CDFs. They were ugly and inefficient.

Note the lack of antialiasing. The animation wasn’t very smooth either. I’ve seen much better results with Java and HTML5. I think Processing is probably a better path towards this sort of thing. And that is free and outputs Java.

You can hide source code inline, and that’s nice. But it’s not a particularly innovative feature. Basically there’s a place to click on the right margin that expands a block of source code.

To cap it all off, the typesetting doesn’t seem to be as rich as that of PDF. It really looks like a webpage. The equations look good, as we would expect, but they don’t always select and highlight in intuitive ways. So if you want to cut and paste, it can be frustrating.

Overall, it’s hard to get too excited about CDF. It’s a way to get people to see a Mathematica document when they don’t have the software. But it only works when the intended audience is more likely to download the player than the author is likely to write it up in more standard web technology.

The subject of prolificacy came up in lab the other day. A study from the 80s (pdf) plotted the number of papers from a lab versus the number of people in the lab. This was repeated for several large research institutions. Across all of the data, the average was 1 paper/person/year.

A graph from that paper is shown above. They included brief reports and unrefereed contributions to books, but did not include abstracts. Note that the spread is quite large. Among the labs with 20-30 people, output ranges from 10 to 60+ papers/year. Similarly, for labs with 10 or fewer people, output ranges from 0 to 28 papers/year. Perhaps part of the variability can be accounted for by variations from one discipline to another. Laboratories in the National Cancer Institute can include biochemistry, physiology, and cell biology.

How about # of publications vs. lab funding?
According to analysis by Jeremy Berg of NIGMS, it basically plateaus, or there is no relationship, depending on how you measure. (link, follow up)

Or death rate versus NIH funding?
Given a 10-year lag, actually pretty correlated. (source)
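To make "correlated given a lag" concrete, here is a minimal sketch of a lagged Pearson correlation. The two series are made-up toy numbers, not the NIH or mortality data from the linked source, and the function names and the lag of 2 are my own choices for illustration. The sign of the toy result is arbitrary; it says nothing about the real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lagged_correlation(xs, ys, lag):
    """Correlate xs[t] with ys[t + lag], i.e. ys responds `lag` steps later."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(xs) - 2:
        raise ValueError("lag must leave at least two overlapping points")
    return pearson(xs[: len(xs) - lag], ys[lag:])


# Toy data (hypothetical): `outcome` is built to follow `funding` two steps later.
funding = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
outcome = [0, 0] + [10 - f for f in funding[:-2]]

print(lagged_correlation(funding, outcome, 0))  # no lag: weak relationship
print(lagged_correlation(funding, outcome, 2))  # lag of 2: close to -1 by construction
```

The point of the sketch is only that the same two series can look unrelated at lag 0 and strongly related once one of them is shifted.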

(hat tip to AM)

This is a long post. If you’re in a rush, then just read these first two paragraphs.

One of the early posts on this blog was about structured illumination. Specifically, I spoke about Mats Gustafsson’s version, which yields superresolution imaging, in the wide-field mode. Just recently, JM commented on that post and asked if there was any kind of guide on how to get this set up and running. Besides the usual sources (methods sections, co-authors, etc.), I’m not aware of any such guide. However, I have corresponded a few times with Mats over the years and he was always overwhelmingly helpful.

He passed away earlier this year and there have been a few articles written about his landmark work, his thoroughness, and his kindness (Nature Methods, HHMI). In this post, I want to share some excerpts from his emails to me. They’re not personal (we were just acquaintances); they’re technical. In addition to being useful to people who may be putting together their own patterned illumination rig, I think they also give a small insight into how kind a person Mats was. He took the time to write these detailed responses to just some postdoc that he met at a small conference.

A paper on data organization just came out in Nature Methods (Millard et al. 2011, commentary by Swedlow et al.). They believe, as do I, that using XML schema to organize data is a good way to simplify automated analysis. For the primary container, they use HDF5. Fun fact for all the MATLAB users in the audience (from Tim at Imperial College):

The .mat file format is simply an HDF5 file with a pointless header prepended.

They call the XML metadata+HDF5 data combos “SDCubes”, for Semantically typed Data hyperCubes. Why cubes? That suggests that they are the same size along all axes, which they probably aren’t. If you don’t like hyperrectangle, you can use orthotope. One point that is lost in the figure I put above is the idea that the axes do not have to be continuous. There can be gaps and jumps. There can also be piles of data that all share one point on an axis, if that suits the data.
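As a rough illustration of the idea (not the schema from the paper), the sketch below pairs a small XML description of the axes with a data block. The element and attribute names (`cube`, `axis`, `values`) and the measurements are hypothetical, and a plain nested list stands in for the HDF5 dataset so the example needs only the standard library. Note the gap between 10 and 60 on the time axis, and the two entries that both sit at 60.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout, NOT the actual SDCubes schema.
METADATA = """
<cube dataset="/measurements">
  <axis name="time_min" values="0 5 10 60 60" />
  <axis name="channel" values="GFP RFP" />
</cube>
"""


def load_axes(xml_text):
    """Return (dataset path, {axis name: list of coordinate labels})."""
    root = ET.fromstring(xml_text)
    axes = {a.get("name"): a.get("values").split() for a in root.findall("axis")}
    return root.get("dataset"), axes


def indices_at(axes, axis_name, label):
    """All positions along one axis whose coordinate equals `label`."""
    return [i for i, value in enumerate(axes[axis_name]) if value == label]


dataset_path, axes = load_axes(METADATA)

# Stand-in for the HDF5 dataset at `dataset_path`: 5 time points x 2 channels.
data = [[1.0, 0.2], [1.1, 0.3], [1.3, 0.3], [2.0, 0.9], [2.1, 0.8]]

# Both measurements taken at t = 60 min are found through the metadata alone.
for i in indices_at(axes, "time_min", "60"):
    print(axes["time_min"][i], dict(zip(axes["channel"], data[i])))
```

Because the coordinates live in the metadata rather than in the array shape, an analysis script can ask for "everything at 60 minutes" without knowing how the array happens to be laid out.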

I like this approach because it is very general and simple. It consists of two file formats that are already being used by many researchers. In a way, the authors didn’t “create” anything. Hopefully this paper will give the strategy some added credibility and help to standardize it. Then people can concentrate on developing tools for working with data in this system, rather than developing new formats all the time.

OpenOptogenetics

When Karl Deisseroth started publishing his work on Channelrhodopsin-2, he set up a website to share the resources, including plasmid information, protocols for expression systems, and hardware details. His site, optogenetics.org, is an excellent source. However, it is focused on Deisseroth lab information.

For a more broadly focused resource, Josh Siegle (Matt Wilson lab, MIT) and others have consolidated a great deal of information in wiki format at OpenOptogenetics.org. The wiki format is ideal for this sort of information since it is changing all the time, and the relevant personnel change over time as well.

There’s already a good amount of information on the site, but there are several opportunities to contribute and fill in the gaps as well. I encourage you to pitch in.

This is just a quick post to point people over to Matt Might’s excellent post full of tips on how to give a good scientific talk. I second his endorsements of Keynote and the book “Even a Geek Can Speak”.
(link)

Pubmed limbo

In my experience, PubMed works beautifully the vast majority of the time. It does an excellent job parsing search terms, and they’re always adding new features (e.g., you can search using full names now, not just first initials). PubMed works so well that I’m actually surprised when it fails to find the paper I’m looking for. But it does happen.

LSTOTT has a great post about articles lost in PubMed limbo. It’s a real phenomenon. They also identify an article that is in the database but does not get returned by standard searches that should match, which in my opinion happens more often.

Database jocks call this latter problem an indication that recall < 1, where recall is the proportion of relevant documents that are actually returned. (The former problem is just a mistake in QA; that is, someone forgot to include the article.) LSTOTT thinks PubMed’s recall may be declining (they call it “leaky”). What do you think?
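For anyone who wants the definition spelled out, here is a minimal sketch of recall for a single query. The article IDs are invented for illustration, and the function name is my own.

```python
def recall(returned, relevant):
    """Fraction of the relevant documents that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(relevant & set(returned)) / len(relevant)


# Hypothetical article IDs, for illustration only.
relevant_ids = {"A", "B", "C", "D"}
returned_ids = {"A", "B", "D", "X"}  # "C" is in the database but never comes back

print(recall(returned_ids, relevant_ids))  # 0.75
```

Note that the irrelevant hit "X" does not lower recall; that kind of error is measured by precision instead.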

Google Scholar does an excellent job of finding well-cited articles, including the ones PubMed misses. This is because there is no one point of failure that can prevent indexing: if the article is cited, then Google will index it. But this strength is also its weakness: relevancy is often sacrificed in favor of citation counts. Furthermore, the output is ordered by citation counts, which is not typically a useful parameter when I’m searching for a paper. PubMed’s reverse chronological ordering is better. But really, both systems should make it easier to re-sort the results.