Finding specific text in pandas dataframe column

Question

I have a dataframe with column containing references of papers and I want to look for any reference repeated in the whole column for all references.

Here are the some rows from dataframe:

In [1]:

df4.iloc[0:2]

Out[2]:

 **cit2ref**    **reference**                                                                                                    **_id**
0   NaN     All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression,
            http://www.allaboutdepression.com/dia_03.html                                                                   Y17-1020
0   NaN     American Psychological Association. (2016). Center for epidemiological studies depression (CESD). 
            Retrieved December 7, 2016, from American Psychological Association, 
            http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx  Y17-1020

Some more rows:

 **cit2ref** **reference**                                                                                                                                 **_id**

0   NaN     All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html   Y17-1020
0   NaN     American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx   Y17-1020
0   NaN     American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx  Y17-1020
0   NaN     Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html   Y17-1020
0   Burton (2012)   Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism  Y17-1020
0   NaN     Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf    Y17-1020
0   Choudhury, 2014     De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data.  Y17-1020
0   NaN     De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media.     Y17-1020
0   Gotlib and Joormann (2010)  Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98.  Y17-1020
0   NaN     Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley.   Y17-1020
0   NaN     Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312.   Y17-1020
0   NaN     Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web.   Y17-102

Here '0' is the index for 1st paper which has many references and there are 40k papers with approx ~20 references for each.

Looking for any reference which is being used again in other paper(here different index for each paper) with it's index and how many times repeated.

Tried with a regular expression and sorting methods of pandas like

value_counts(sort=True).sort_index()

and

sort_values()

but that doesn't help.

Here is the screenshot of the dataframe with 2 papers as indexed '0' and '1'

could you explain what you mean by reference? is American Psychological Association. (2016). a reference? Beattie, G.S. (2005, November). ? examples of what you want to achieve would be helpful. — sammywemmy, Jan 28 '20 at 04:41
@sammywemmy the 'reference' column values (i.e. the whole text until '_id' column value) are the references of research paper. look the whole row by scrolling horizontally. — loving_guy, Jan 28 '20 at 04:46
@Chris Added the image of more index dataframe but don't know how to write expected output in code/dataframe but highlighted what I am expecting from my problem. And ```cit2ref``` has many ```NaN``` values as it is of the same reference paper where values are not known, can't remove them as it helps in aligning the references with the actual paper. — loving_guy, Jan 28 '20 at 05:08
You can reply to this comment when you've edited the question and I'll take another look. You can read [mcve] or [this link](https://stackoverflow.com/q/20109391/2336654) might also be useful. These are meant to guide you to compose a better question. — piRSquared, Jan 28 '20 at 16:07

score 0 · Answer 1 · answered Jan 28 '20 at 06:12

IIUC, use pandas.DataFrame.index.groupby.

With pseudo dataframe, df: (note that I've added last three rows for demonstration):

print(df)
   cit2ref                                          reference       _id
0      NaN  All about depression: Diagnosis. (2013). Retri...  Y17-1020
0      NaN  American Psychological Association. (2016). Ce...  Y17-1020
0      NaN  American Psychological Association. (2016). Pa...  Y17-1020
0      NaN  Beattie, G.S. (2005, November). Social Causes ...  Y17-1020
0      NaN  Burton   (2012)   Burton, N. (2012, June 5). D...  Y17-1020
0      NaN  Clark, P., Niblett, T. (1988, October 25). The...  Y17-1020
0      NaN  Choudhury, 2014     De Choudhury, M., Counts, ...  Y17-1020
0      NaN  De Choudhury, M., Gamon, M., Couns, S., %27 Ho...  Y17-1020
0      NaN  Gotlib and Joormann (2010)  Gotlib IH, Kasch K...  Y17-1020
0      NaN  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...  Y17-1020
0      NaN  Gotlib IH, Joormann J. Cognition and depressio...  Y17-1020
0      NaN  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...   Y17-102
1      NaN  All about depression: Diagnosis. (2013). Retri...  Y17-1020
1      NaN  American Psychological Association. (2016). Ce...  Y17-1020
1      NaN                StackOverflow. Not to be grouped-by   Y17-102

Then groupby:

df.index.groupby(df['reference'])
# or
d = {k: list(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)
# this looks prettier

                                                index       0
0   All about depression: Diagnosis. (2013). Retri...  [0, 1]
1   American Psychological Association. (2016). Ce...  [0, 1]
2   American Psychological Association. (2016). Pa...     [0]
3   Beattie, G.S. (2005, November). Social Causes ...     [0]
4   Burton   (2012)   Burton, N. (2012, June 5). D...     [0]
5   Choudhury, 2014     De Choudhury, M., Counts, ...     [0]
6   Clark, P., Niblett, T. (1988, October 25). The...     [0]
7   De Choudhury, M., Gamon, M., Couns, S., %27 Ho...     [0]
8   Gotlib IH, Joormann J. Cognition and depressio...     [0]
9   Gotlib and Joormann (2010)  Gotlib IH, Kasch K...     [0]
10  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...     [0]
11  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...     [0]
12                StackOverflow. Not to be grouped-by     [1]

You can see which paper appeared in which indices. If you want count, you can use len instead of list:

d = {k: len(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)

Output:

                                                index  0
0   All about depression: Diagnosis. (2013). Retri...  2
1   American Psychological Association. (2016). Ce...  2
2   American Psychological Association. (2016). Pa...  1
3   Beattie, G.S. (2005, November). Social Causes ...  1
4   Burton   (2012)   Burton, N. (2012, June 5). D...  1
5   Choudhury, 2014     De Choudhury, M., Counts, ...  1
6   Clark, P., Niblett, T. (1988, October 25). The...  1
7   De Choudhury, M., Gamon, M., Couns, S., %27 Ho...  1
8   Gotlib IH, Joormann J. Cognition and depressio...  1
9   Gotlib and Joormann (2010)  Gotlib IH, Kasch K...  1
10  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...  1
11  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...  1
12                StackOverflow. Not to be grouped-by  1

Thanks. Does this look for each reference in whole 'reference column' and check for duplicate value and if found gives count and index where present? — loving_guy, Jan 28 '20 at 06:30
Yes. The first part is for the duplicated index and finding length duplicates is equivalent to the count. This however will include the not duplicated items (see stackoverflow not to be groupedby). — Chris, Jan 28 '20 at 06:32
Dict comprehension simply converts the values from array to list for prettier repr. Making it to dataframe also has a same effect but its much easier to manage the result than doing so on dict. — Chris, Jan 28 '20 at 06:45
I am getting this which I don't know what is being resulted in index column and why there are 521 new columns with NaN values. https://imgur.com/pAwBcTv — loving_guy, Jan 28 '20 at 07:04

Finding specific text in pandas dataframe column

1 Answers1