In reference to a previous question, I am hoping to iterate over a list of dictionaries and turn the output into a new data frame. For now, I have a CSV with two columns: one with a keyword and another with a URL (see below).

| Keyword  | URL                     | 
| -------- | --------------          |
| word1    | www.example.com/topic-1 |
| word2    | www.example.com/topic-2 |
| word3    | www.example.com/topic-3 |
| word4    | www.example.com/topic-4 |

I have turned this CSV into a list of dictionaries and am attempting to iterate over that list to count how often each keyword appears on the page at its URL.
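For reference, here is a minimal sketch of what that step might look like. It is not the notebook's actual code. The names `load_rows`, `add_counts`, `CSV_TEXT`, and `FAKE_PAGES` are hypothetical, and the page text comes from an in-memory dictionary instead of real HTTP requests so the sketch runs offline.

```python
import csv
import io

# Hypothetical stand-in for the CSV file; in practice this would be an open file.
CSV_TEXT = """Keyword,URL
word1,www.example.com/topic-1
word2,www.example.com/topic-2
"""

# Fake page contents so the sketch runs without network access (assumption).
FAKE_PAGES = {
    "www.example.com/topic-1": "Word1 and word1 again",
    "www.example.com/topic-2": "nothing relevant here",
}


def load_rows(csv_file):
    """Read the CSV into a list of dictionaries, one per row."""
    return list(csv.DictReader(csv_file))


def add_counts(rows, fetch_text):
    """Return new row dictionaries with a 'Count' key added.

    fetch_text is any callable that maps a URL to the page text.
    """
    result = []
    for row in rows:
        text = fetch_text(row["URL"])
        # Case-insensitive substring count of the keyword in the page text.
        count = text.lower().count(row["Keyword"].lower())
        result.append({**row, "Count": count})
    return result


rows = load_rows(io.StringIO(CSV_TEXT))
counted = add_counts(rows, FAKE_PAGES.get)
print(counted)
```

Passing the resulting list to `pd.DataFrame(counted)` would give a data frame with `Keyword`, `URL`, and `Count` columns.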

My code can be seen in this colab notebook.

My hope is to have a final output that looks like this:

| Keyword | URL                        | Count |
|:----    |:------:                    | -----:|
| word1   | www.example.com/topic-1    | 1003  |
| word2   | www.example.com/topic-2    | 405   |
| word3   | www.example.com/topic-3    | 123   |
| word4   | www.example.com/topic-4    | 554   |

The 'Count' column would hold the frequency of each keyword on its URL, e.g. the frequency of 'word1' on 'www.example.com/topic-1'.

Any help is appreciated!

Alex Fuss
  • *"I have turned this CSV into a list of dictionaries"* How? What does the list of dictionaries look like? – Stef Jan 04 '22 at 15:49
  • Would this question help? [code for counting word frequency in website using Python doesn't output the right frequency](https://stackoverflow.com/questions/67291580/code-for-counting-word-frequency-in-website-using-python-doesnt-output-the-righ) – Stef Jan 04 '22 at 15:54
  • Or perhaps this one, which explains how to create a new column using `DataFrame.apply`: [pandas create new column based on values from other columns / apply a function of multiple columns, row-wise](https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o). See also [the documentation for DataFrame.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html). – Stef Jan 04 '22 at 15:58

1 Answer

You can use `DataFrame.apply` to create a new column from a function of the other columns:

```python
import pandas as pd
import requests

df = pd.DataFrame({'Keyword': ['code', 'apply', 'midnight'],
                   'URL': ['https://stackoverflow.com/questions/70581444/word-frequency-by-iterating-over-a-list-of-dictionaries-python/',
                           'https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html',
                           'https://stackoverflow.com/questions/62694219/minimum-number-of-platforms-required-for-a-railway-station'
                          ]})

print(df)
#     Keyword                                                URL
# 0      code  https://stackoverflow.com/questions/70581444/w...
# 1     apply  https://pandas.pydata.org/docs/reference/api/p...
# 2  midnight  https://stackoverflow.com/questions/62694219/m...

def get_count(row):
    r = requests.get(row['URL'], allow_redirects=False)
    count = r.text.lower().count(row['Keyword'].lower())
    return count

df['Count'] = df.apply(get_count, axis=1)

print(df)
#     Keyword                                                URL  Count
# 0      code  https://stackoverflow.com/questions/70581444/w...     32
# 1     apply  https://pandas.pydata.org/docs/reference/api/p...     32
# 2  midnight  https://stackoverflow.com/questions/62694219/m...     18
```

Stef
  • Thanks! This is certainly pointing me in the right direction. However, the 'Count' column doesn't add up. For example, when I press Ctrl+F and search for 'code' on the Stack Overflow URL you used in the example, I get a total of 6, not 32. Is there a reason this could be off? – Alex Fuss Jan 04 '22 at 17:02
  • @AlexFuss See this question: [code for counting word frequency in website using Python doesn't output the right frequency](https://stackoverflow.com/questions/67291580/code-for-counting-word-frequency-in-website-using-python-doesnt-output-the-righ) – Stef Jan 04 '22 at 17:46
  • @AlexFuss: `apply()` is the right method. As for your example, the result depends on how, where, and what is counted. If you only want to count the human-readable text, I would recommend `soup = BeautifulSoup(r.text, 'lxml')` followed by `count = soup.body.get_text(strip=True).lower().count(row['Keyword'].lower())`. Otherwise you also count the title, meta tags, and so on. – HedgeHog Jan 04 '22 at 17:46
  • @HedgeHog worked perfectly! Really appreciate the help – Alex Fuss Jan 04 '22 at 18:24
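
The gap between the raw-HTML count and the Ctrl+F count discussed in these comments can be reproduced offline. The sketch below is a rough illustration only: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, and it skips `<script>`, `<style>`, and `<title>` content rather than restricting itself to `<body>` as the suggestion above does. `VisibleTextParser`, `count_visible`, and `SAMPLE_HTML` are hypothetical names and data.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of script, style, and title."""

    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values never arrive here, only text between tags.
        if self._skip_depth == 0:
            self.parts.append(data)


def count_visible(html, keyword):
    """Count case-insensitive occurrences of keyword in the visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return text.lower().count(keyword.lower())


# Made-up page: the keyword appears in the title, a meta tag, a script,
# a class attribute, and twice in the visible body text.
SAMPLE_HTML = """<html><head><title>code sample</title>
<meta name="description" content="code code">
<script>var code = 1;</script></head>
<body><p class="code">Some code here.</p><pre>more CODE</pre></body></html>"""

raw_count = SAMPLE_HTML.lower().count("code")       # counts markup too: 7
visible_count = count_visible(SAMPLE_HTML, "code")  # visible text only: 2
print(raw_count, visible_count)
```

Counting on `r.text` directly matches the keyword inside tags, attributes, and scripts as well, which is why it can come out far higher than what a browser search reports.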