I am using OpenDataKit's ODK Collect to collect survey data in the field. Currently I use ODK Aggregate to accept data submissions on Google Cloud, and then download the data as CSV files. This whole process is somewhat frustrating because every step is prone to errors. I would like instead to read the data from the tablets directly into R and compile data frames for each level of the data.
The data is saved as individual instances in XML format, and right now we have about 2000 of them. When I read an individual instance into R with the XML package, the data looks like this:
<A_note/>
<A_group1>
<A_note1/>
<A_note2/>
<A01>2</A01>
</A_group1>
<A_group1.5>
<A02>901</A02>
<A02a/>
</A_group1.5>
<A_group2>
<A03>9</A03>
<A03a/>
<HH_key>9010</HH_key>
<A04a/>
<A06/>
<A07/>
</A_group2>
<A_group3>
<A04>9</A04>
<A04a_note/>
<A06_note/>
<A07_note/>
<A04a_int>840256790</A04a_int>
<A05>2</A05>
<A06a>Baixo Umbeluze, perto do rio Umbeluze.</A06a>
<A07a>-26.057376459502194 32.33107993182396 15.271170877998825 4.0</A07a>
We can see that there are a lot of tags which don't hold any information (for example A_note1 and A_note2), as well as groups which are unnecessary because the level above them is unique (A_group1 and A_group2).
What I would like to be able to do is:
1. flatten the data by removing unnecessary groups
2. treat each instance as a different row of data and stack the information from my instances together.
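To make the two steps concrete, here is a minimal sketch of one possible approach, not a tested solution for real ODK forms. It assumes the xml2 package rather than the XML package mentioned above, and it assumes that no question name is repeated within an instance (so repeat groups are not handled). The function names `flatten_instance` and `stack_instances`, the `<data>` root element, and the second demo instance are made up for illustration.

```r
library(xml2)

# Step 1: flatten one instance into a one-row data frame.
# The XPath "//*[not(*)]" selects every element with no child elements,
# so group wrappers such as A_group1 are dropped and only questions remain.
flatten_instance <- function(x) {
  doc <- read_xml(x)                      # x is a file path or an XML string
  leaves <- xml_find_all(doc, "//*[not(*)]")
  values <- xml_text(leaves)
  values[values == ""] <- NA_character_   # empty tags become NA
  row <- matrix(values, nrow = 1,
                dimnames = list(NULL, xml_name(leaves)))
  as.data.frame(row, stringsAsFactors = FALSE)
}

# Step 2: stack the one-row data frames, filling columns that are
# missing from an instance with NA so that rbind() lines them up.
stack_instances <- function(rows) {
  all_cols <- unique(unlist(lapply(rows, names)))
  filled <- lapply(rows, function(r) {
    for (col in setdiff(all_cols, names(r))) {
      r[[col]] <- NA_character_
    }
    r[all_cols]
  })
  do.call(rbind, filled)
}

# Demo with two small inline instances (values are illustrative only).
a <- "<data><A_group1><A_note1/><A01>2</A01></A_group1><A_group2><HH_key>9010</HH_key></A_group2></data>"
b <- "<data><A_group1><A_note1/><A01>1</A01></A_group1><A_group3><A05>2</A05></A_group3></data>"

survey <- stack_instances(lapply(list(a, b), flatten_instance))

# Optionally drop columns that are empty in every instance (e.g. notes).
survey_clean <- survey[, colSums(!is.na(survey)) > 0, drop = FALSE]
print(survey_clean)
```

For files copied off the tablets, the same functions could be applied to the paths returned by something like `list.files("instances", pattern = "\\.xml$", recursive = TRUE, full.names = TRUE)`, where the `instances` folder name is a placeholder. All columns come back as character, so numeric questions would still need converting afterwards.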
I know this is probably too much to ask in a single post, but I wanted to put it out there in case someone has already done the hard work of figuring out how to make this work.
Thanks, Francis