
I have XML data for many scientific publications and I am trying to parse through the data in KNIME to extract the fields that I need. Here is one example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC4400176

To extract the names of the authors, I am using the following XPath Query: /pmc-articleset/article/front/article-meta/contrib-group/contrib[@contrib-type="author"]

However, this returns: BorisovaSvetlana A., KimHak Joong, PuXiaotao, LiuHung-wen*

I would like for the last and first names to be separated by some delimiter, comma/space, and for different author names to be separated by a semi-colon. Is this possible? Or is there a better way to extract the information compared to what I am currently doing that would allow me to achieve my ideal output:

Borisova, Svetlana A.; Kim, Hak Joong; Pu, Xiaotao; Liu, Hung-wen*

[edit]

Current KNIME workflow:

enter image description here

Sample current output:

enter image description here

I've tried having all of the author names for all of the publications outputting into a collection cell. (If I have all of the names outputting into multiple columns, this ends up creating hundreds of columns containing missing values. I've even tried to achieve my ideal output using multiple string manipulations, but it is still not as perfect, due to some author names having multiple names, hyphenated names, or names containing special characters.) The collection cell combines all of the author names with a comma delimiter between each author's name, but combines surnames and given-names. I can also do the same aforementioned string manipulations on these, but still run into the same issues as mentioned.

If I separate author names into multiple rows, this creates multiple rows for every article, from which I'm not sure how to get to my end goal for each article.

enter image description here

End goal:

enter image description here

Any ideas on how to solve this problem with the authors would be much appreciated!

Zille
  • Can you post the full XSL transform that you have so far? I suspect you have only a single XPath, but if you defined a template to match each 'contrib contrib-type="author"' element, you could then format the specific 'surname' and 'given-names' element values. – emeraldjava May 02 '19 at 20:56
  • @emeraldjava I did have a single xpath, but since there are multiple authors for multiple publications, if I extract each element value, i.e. surname and given-names, I end up with a column of all different author last names and another column of all different author first names, all separated by a comma delimiter. – Zille May 03 '19 at 19:25

1 Answer


You should ideally do this in multiple steps. I’d do it as follows:

  1. Extract all contrib elements and return the resulting “Nodes” as rows (not as strings) using the XPath node
  2. Extract surname, given-names, and xref using another XPath node
  3. Join them together, e.g. using the String Manipulation node
  4. Combine everything into a single string, e.g. using the Column Combiner or the GroupBy node
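
For readers who want to see the same four steps outside KNIME, here is a minimal Python sketch using only the standard library. It is not the workflow from this answer; the inline `SAMPLE_XML` is a hypothetical, heavily simplified stand-in for the PMC record, and the helper name `format_authors` is made up for illustration. It assumes each author sits in a `contrib` element with a `name/surname`, a `name/given-names`, and an optional `xref` holding the `*` marker.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the PMC XML (assumption, not the real record).
SAMPLE_XML = """
<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <contrib-group>
          <contrib contrib-type="author">
            <name><surname>Borisova</surname><given-names>Svetlana A.</given-names></name>
          </contrib>
          <contrib contrib-type="author">
            <name><surname>Kim</surname><given-names>Hak Joong</given-names></name>
          </contrib>
          <contrib contrib-type="author">
            <name><surname>Liu</surname><given-names>Hung-wen</given-names></name>
            <xref ref-type="corresp">*</xref>
          </contrib>
        </contrib-group>
      </article-meta>
    </front>
  </article>
</pmc-articleset>
""".strip()


def format_authors(article):
    """Return 'Surname, Given names' entries joined by '; ' for one article."""
    parts = []
    # Step 1: select every author contrib element as a node, not as a string.
    path = "./front/article-meta/contrib-group/contrib[@contrib-type='author']"
    for contrib in article.findall(path):
        # Step 2: pull the individual fields out of each node.
        surname = contrib.findtext("name/surname", default="").strip()
        given = contrib.findtext("name/given-names", default="").strip()
        marker = "".join((x.text or "").strip() for x in contrib.findall("xref"))
        # Step 3: join surname and given names with a comma, then append the marker.
        name = ", ".join(p for p in (surname, given) if p)
        parts.append(name + marker)
    # Step 4: combine all authors of the article into a single string.
    return "; ".join(parts)


root = ET.fromstring(SAMPLE_XML)
authors = format_authors(root.find("article"))
print(authors)
```

Working on nodes rather than on the concatenated string is what keeps multi-part and hyphenated names intact, since the surname and given names are never guessed from text.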

[edit] You can find a fully working example workflow on my public NodePit space:

https://nodepit.com/workflow/com.nodepit.space%2Fqqilihq%2Fpublic%2FStack_Overflow%2FStack_Overflow_how-to-separate-xpath-results-by-a-delimiter_55959662.knwf


[regarding your edit] As far as I understand, your challenge now is that your table contains more than one publication, and the GroupBy node would combine them all into one row. To avoid that, you can use the “Looping” nodes: surround the logic I’ve described above with a Chunk Loop Start node and a Loop End node. This lets you process each publication “in isolation”.
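
The per-publication idea can be sketched in Python as well, purely as an illustration of what the loop achieves and not as a description of the KNIME nodes themselves. The XML below is a hypothetical two-article sample with placeholder titles, and the `rows` structure is an assumption; the point is only that authors are combined inside each article, so one row per publication comes out and other fields such as the title stay attached.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-article sample with placeholder titles (assumption).
SAMPLE_XML = """<pmc-articleset>
  <article><front><article-meta>
    <title-group><article-title>First paper</article-title></title-group>
    <contrib-group>
      <contrib contrib-type="author">
        <name><surname>Borisova</surname><given-names>Svetlana A.</given-names></name>
      </contrib>
      <contrib contrib-type="author">
        <name><surname>Kim</surname><given-names>Hak Joong</given-names></name>
      </contrib>
    </contrib-group>
  </article-meta></front></article>
  <article><front><article-meta>
    <title-group><article-title>Second paper</article-title></title-group>
    <contrib-group>
      <contrib contrib-type="author">
        <name><surname>Pu</surname><given-names>Xiaotao</given-names></name>
      </contrib>
    </contrib-group>
  </article-meta></front></article>
</pmc-articleset>"""

rows = []
# One loop iteration per publication, so authors never mix across articles.
for article in ET.fromstring(SAMPLE_XML).iter("article"):
    meta = article.find("front/article-meta")
    title = meta.findtext("title-group/article-title", default="")
    authors = "; ".join(
        f"{c.findtext('name/surname', default='')}, "
        f"{c.findtext('name/given-names', default='')}"
        for c in meta.findall("contrib-group/contrib[@contrib-type='author']")
    )
    # Keep the other per-publication fields next to the combined author string.
    rows.append({"title": title, "authors": authors})

for row in rows:
    print(row)
```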

qqilihq
  • Thanks so much! I've been able to follow this up until step 4. I have thousands of publications, as well as other data that I am extracting using XPath as well, so not sure if I can combine everything into a single string. Right now, my workflow reads a file containing PMCIDs, uses an API to call out to PubMed Central and retrieve XML data for all of the PMCIDs, and then uses XPath to extract things like DOI, Title, Author, Abstract, etc. Some of the other extracted data I've had to do some minor string manipulations on, but extracting the authors in a clean way is giving me the most trouble. – Zille May 03 '19 at 18:59
  • My ideal output would be an Excel file containing one row per publication and one column per required information, e.g. DOI, Authors, Abstract, etc. Would love to hear any other thoughts you may have on this! I have tried extracting each author into multiple columns, using string manipulations to combine the columns, removing the extra "?"s due to missing values, further string manipulations for clean-up, but this is not ideal at all and some of the string manipulations were not as useful when it came to authors having multiple names, hyphenated names, or names with special characters. – Zille May 03 '19 at 19:06
  • @Zille Glad to help, however I’m not sure where exactly you’re still having issues. Feel free to edit the original post, or create a new one with your issue, and I’ll have a look! – qqilihq May 03 '19 at 19:26
  • Edited! Thank you! – Zille May 03 '19 at 20:00
  • @Zille I’ve added some feedback to your addition. Hope it helps! – qqilihq May 03 '19 at 21:47
  • Thanks! I am not that familiar with the looping nodes, but I will have a go at it! However, to my understanding, the group by/column combine nodes combine everything in a row, correct? I would only want the authors to be combined, not all of my other columns as well. I guess I may have to just do this in several steps then, where I get and combine all of the authors first and then add in the rest of my data. – Zille May 06 '19 at 13:13
  • I have since added chunk loop start and loop end nodes, which combined each publication's author last name and first names into separate rows. I have followed this by a GroupBy node to combine all of the authors per publication into one row. Now just a matter of adding all of my other required data back. Thanks so much for all your feedback! – Zille May 06 '19 at 15:42
  • I guess my challenge now is that I can no longer see all of the other XML data that I had parsed in my first XPath node, title, date, abstract, etc. for each publication. Is there a way to retain all of that information while still using the workflow you have described above for author name manipulation? – Zille May 06 '19 at 15:57
  • Yes, this shouldn’t be an issue with the method I described above (i.e. using the “Loop” nodes). I assume the columns currently get lost in the GroupBy node? In that case, either (a) move them behind the GroupBy, or (b) configure the GroupBy node so that these columns are kept. – qqilihq May 07 '19 at 08:17
  • Yes, I think I have reached my desired outcome now with all of your help. Thank you so much!! – Zille May 07 '19 at 20:10