I have some data that I'm parsing from XML to a pandas DataFrame. The XML data roughly looks like this:
<tracks>
<track name="trackname1" variants="1,2,3,4,5">
<variant var="1,2,3">
<leg time="21:23" route_id="5" stop_id="103" serial="1"/>
<leg time="21:26" route_id="5" stop_id="17" serial="2"/>
<leg time="21:30" route_id="5" stop_id="38" serial="3"/>
<leg time="20:57" route_id="8" stop_id="101" serial="1"/>
<leg time="21:01" route_id="8" stop_id="59" serial="2"/>
...
</variant>
<variant var="4,5">
... more leg elements
</variant>
</track>
<track name="trackname2" variants="1,2,3,4,5,6,7">
<variant var="1">
... more leg elements
</variant>
<variant var="2,3,4,5,7">
... more leg elements
</variant>
</track>
</tracks>
I'm importing this into pandas because I need to be able to join this data with other DataFrames and I need to be able to query for stuff like: "get all legs of variant 1 for route_id 5".
I'm trying to figure out how I would do this in a pandas DataFrame. Should I make a DataFrame that would look something like this:
track_name variants time route_id stop_id serial
"trackname1" "1,2,3" "21:23" "5" "103" "1"
"trackname1" "1,2,3" "21:26" "5" "17" "2"
...
"trackname1" "4,5" "21:20" "5" "103" "1"
...
"trackname2" "1" "20:59" "3" "45" "1"
... you get the point
If this is the way to go, how would I (efficiently) extract for example "all rows for variant 3 on route_id 5"? Note that this should give me all the rows that have 3 in the variant column list, not just the rows that only have "3" in the variants column.
Is there a different way of constructing the DataFrame that would make this easier? Should I be using something other than pandas?