Image Processing: Mass spec (pdf) to density text file

Question

Speaking from outside of the image processing field, I think I have a simple task but I have no idea where to start.

The challenge is that some labs like to publish their mass-spec data in PDF form. While sufficient to validate their claims, a plot in a PDF is essentially useless for quantitative analysis. I would like to read the mass-spec density plot from the PDF and convert it to the following format:

3947>> Voyager Spec #1[BP = 536.8, 10241]" 
TYPE MASSSPEC
499.985486  760.097
500.007777  754.159
500.030068  774.162
500.052359  805.103
500.074651  821.98
500.096944  847.921
500.119237  864.798
...

Column 1 is the m/z (x-axis) and column 2 is the (relative) abundance (y-axis).

Is this possible? Do tools exist that may perform this task? How long would it take to implement such a tool?

The [tag:pdf] refers to the "Portable Document Format" (look at the text displayed when you hover over the tag). Are you sure you mean that? — mkl, Jan 24 '18 at 05:25
thanks for clarifying. yes, I am talking about "portable document format" not some specialized mass-spec datatype — user3030872, Jan 24 '18 at 20:38
Relevant post: [Recognize PDF table using R](https://stackoverflow.com/q/44141160/680068), see [R package tabulizer](https://github.com/ropensci/tabulizer). — zx8754, Jan 25 '18 at 10:06
I see this kind of question a lot. Basically, you have a report but you want the data that was used to produce the report. Converting a report back into its constituent data is hard. Asking the producer of the report to give you the data is much simpler. — An RMagick User, Jan 31 '18 at 00:23

score 1 · Answer 1 · answered Jan 24 '18 at 01:46

My first thought would be to use a program like GIMP to edit out the axes and labels, cropping the image down to exactly the graph area so that the pixel in the bottom-left corner represents the graph origin. Then you can use an image processing library (many exist; I like RMagick in Ruby) to load the result as a black-and-white image and get the pixel data as an array of arrays. You will probably want to rotate or transpose it so that each inner array represents one column of the image, that is, the Y-axis data at a specific point on the X-axis. That way, you just have to count the black pixels in each array to get the Y value in pixels, which you can then scale to the axis range.

I'm sure there are ways to programmatically detect the graph boundaries and filter out the text, but that adds a lot of complexity.
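The column-by-column idea can be sketched in a few lines of Python. This is only a rough illustration, not a tested tool: it assumes the image has already been cropped to the plot area and loaded as rows of grey values (0 = black, 255 = white), and that you read the axis limits off the original figure by hand. The function names `trace_to_points` and `format_points` and the `threshold` parameter are made up for this example. It also uses a variation on pixel counting: it takes the topmost dark pixel in each column, since counting dark pixels only matches the trace height when the area under the curve is filled in.

```python
def trace_to_points(pixels, x_min, x_max, y_min, y_max, threshold=128):
    """Convert a cropped grayscale plot into (x, y) data points.

    pixels: list of rows (top row first), each a list of 0-255 grey values.
    Assumption: the image is cropped so that its left/right edges correspond
    to x_min/x_max and its bottom/top edges to y_min/y_max.
    """
    height = len(pixels)
    width = len(pixels[0])
    # Data units per pixel step along each axis (guard against 1-pixel images).
    x_step = (x_max - x_min) / (width - 1) if width > 1 else 0.0
    y_step = (y_max - y_min) / (height - 1) if height > 1 else 0.0

    points = []
    for col in range(width):
        # Scan from the top of the column down to the first dark pixel.
        top = None
        for row in range(height):
            if pixels[row][col] < threshold:
                top = row
                break
        if top is None:
            continue  # nothing drawn in this column
        x = x_min + x_step * col
        # Row 0 is the top of the image, so flip it to measure from the bottom.
        y = y_min + y_step * (height - 1 - top)
        points.append((x, y))
    return points


def format_points(points):
    """Render points as two whitespace-separated columns: m/z, abundance."""
    return ["{:.6f}  {:.3f}".format(x, y) for x, y in points]


if __name__ == "__main__":
    # Tiny 3x3 stand-in for a real plot: 0 is a dark pixel, 255 is background.
    demo = [
        [255, 0, 255],
        [0, 0, 255],
        [0, 0, 0],
    ]
    for line in format_points(trace_to_points(demo, 500.0, 502.0, 0.0, 100.0)):
        print(line)
```

To feed it a real image you would still need a library to decode the file. With Pillow, for instance, `Image.open(path).convert("L")` gives a grayscale image whose `size` and `getdata()` can be reshaped into the row lists used here. The resolution of the result is limited by the pixel grid, so a high-resolution rendering of the PDF page helps.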

Hope that helps
