I have a DataFrame with a column containing paper references, and I want to find every reference that is repeated anywhere in that column.
Here are some rows from the DataFrame:
In [1]:
df4.iloc[0:2]
Out[2]:
**cit2ref** **reference** **_id**
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression,
http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD).
Retrieved December 7, 2016, from American Psychological Association,
http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
Some more rows:
**cit2ref** **reference** **_id**
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
0 NaN American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx Y17-1020
0 NaN Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html Y17-1020
0 Burton (2012) Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism Y17-1020
0 NaN Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf Y17-1020
0 Choudhury, 2014 De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data. Y17-1020
0 NaN De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media. Y17-1020
0 Gotlib and Joormann (2010) Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98. Y17-1020
0 NaN Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley. Y17-1020
0 NaN Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312. Y17-1020
0 NaN Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web. Y17-102
Here '0' is the index of the first paper, which has many references. There are 40k papers in total, with roughly 20 references each.
I am looking for any reference that is used again in another paper (each paper has a different index), along with the indices of the papers it appears in and how many times it is repeated.
I tried a regular expression and pandas counting and sorting methods such as
value_counts(sort=True).sort_index()
and
sort_values()
but neither gave me the repeated references together with their paper indices.
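To make the goal concrete, here is a minimal sketch of the kind of result I am after. It uses a toy frame with placeholder references (the column names `reference` and `_id` match mine; the data, the `paper` and `key` columns, and the helper name `find_repeated_references` are made up for illustration). It assumes that two references count as the same if they match after collapsing whitespace and lowercasing, since my real data has small spacing differences between copies; that assumption may well be too strict or too loose for the full dataset.

```python
import pandas as pd

# Toy data: the index is the paper number, as in my real DataFrame.
# The reference strings are placeholders, not real citations.
df = pd.DataFrame(
    {
        "reference": [
            "Author A. (2010). Title A.",
            "Author B. (2012). Title B.",
            "Author A.  (2010). Title A.",  # same as the first, extra space
            "Author C. (2015). Title C.",
            "Author B. (2012). Title B.",
            "Author A. (2010). Title A.",
        ],
        "_id": ["P1", "P1", "P2", "P2", "P3", "P3"],
    },
    index=[0, 0, 1, 1, 2, 2],
)


def find_repeated_references(frame):
    """Return references that occur in more than one paper (hypothetical helper)."""
    # Turn the paper index into a regular column so it can be aggregated.
    out = frame.rename_axis("paper").reset_index()
    # Assumed normalization: collapse runs of whitespace, trim, lowercase.
    out["key"] = (
        out["reference"]
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
        .str.lower()
    )
    summary = out.groupby("key").agg(
        n_papers=("paper", "nunique"),
        papers=("paper", lambda s: sorted(s.unique().tolist())),
    )
    # Keep only references shared by at least two different papers.
    shared = summary[summary["n_papers"] > 1]
    return shared.sort_values("n_papers", ascending=False)


repeated = find_repeated_references(df)
print(repeated)
```

For the toy frame this keeps the two references that appear in more than one paper, each with the list of paper indices, and drops the one that appears only once.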
Here is a screenshot of the DataFrame with two papers, indexed '0' and '1':