I'm new to R (and to Stack Overflow, so the bullets below just represent new lines), and I've been assigned a project in which I need to clean MEDLINE data into a tidy data frame. The raw .txt file looks like this:
PMID- 28152974
OWN - NLM
IS - 1471-230X (Electronic)
IS - 1471-230X (Linking)
PMID- 28098115
OWN - NLM
IP - 1
VI - 28
etc.
Each new observation starts with PMID. Not every variable appears in every observation, and when the same tag is repeated within one observation (e.g. IS), its values need to be merged into a single cell. The final data frame should look like:
PMID OWN IS VI
28152974 NLM 1471-230X (Electronic) 1471-230X (Linking) N/A
28098115 NLM N/A 28
etc.
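To make the target concrete, here is a minimal base-R sketch of the transformation described above. It is only an illustration under some assumptions: the `raw` vector is a hypothetical stand-in for `readLines()` on the real file, every line is assumed to carry its own tag (wrapped continuation lines, if the file has any, would need to be joined first), and repeated tags are merged with a single space. The variable names (`raw`, `tag`, `value`, `rec`, `wide`, `df`) are made up for the example.

```r
# Hypothetical sample standing in for readLines("path/to/medline.txt")
raw <- c(
  "PMID- 28152974",
  "OWN - NLM",
  "IS  - 1471-230X (Electronic)",
  "IS  - 1471-230X (Linking)",
  "PMID- 28098115",
  "OWN - NLM",
  "IP  - 1",
  "VI  - 28"
)

# Split each line at the FIRST "-": tag on the left, value on the right
tag   <- trimws(sub("-.*$", "", raw))
value <- sub("^[^-]*-[[:space:]]*", "", raw)

# Every PMID line starts a new record, so a running count gives a record id
rec <- cumsum(tag == "PMID")

# Keep the columns in order of first appearance rather than alphabetical
tag <- factor(tag, levels = unique(tag))

# One cell per (record, tag): repeated tags are pasted together,
# tags missing from a record come out as NA
wide <- tapply(value, list(rec, tag), paste, collapse = " ")

df <- as.data.frame(wide, stringsAsFactors = FALSE)
print(df)
```

In this sketch missing cells are real `NA` values rather than the string "N/A", which tends to be easier to work with later; all columns come out as character, so numeric fields such as VI would still need converting.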
So far I've reshaped my data in three ways. The first keeps the layout of the raw data file but splits it into two columns and drops the "-", e.g.:
PMID 28152974
OWN NLM
IS 1471-230X (Electronic)
IS 1471-230X (Linking)
PMID 28098115
OWN NLM
IP 1
VI 28
etc.
The second puts all of the observations in a single row, with thousands of columns (one per tag occurrence), e.g.:
PMID OWN IS IS PMID OWN
28152974 NLM 1471-230X (Electronic) 1471-230X (Linking) 28098115 NLM
etc.
The third is similar to the second, but instead of thousands of columns it has only the distinct column names taken from the first PMID record, e.g.:
PMID OWN IS
28152974 28098115 NLM NLM 1471-230X (Electronic) 1471-230X (Linking)
etc.
Please help. I don't know how to split my data into one row per PMID, and I don't know which of these three forms I should start from.