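# Imports inferred from the names used below
import collections
import re
from collections import Counter
from string import punctuation
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup

l = []  # flat list of [word, count] pairs accumulated across all URLs
for url in urls:
    uClient = ureq(url)
    page_html = uClient.read()
    uClient.close()
    soup = BeautifulSoup(page_html, "html.parser")
    text = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
    # Count words after stripping non-alphanumerics and lowercasing
    c = Counter((re.sub(r"[^a-zA-Z0-9 ]", "", x)).strip(punctuation).lower() for y in text for x in y.split())
    for key in sorted(c.keys()):
        l.append([key, c[key]])
# Group the counts by word; a word missing from a page adds no entry here
d = collections.defaultdict(list)
for k, v in l:
    d[k].append(v)
print(d.items())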
The output I'm getting is:
([('', [3, 9, 4, 1]), ('1', [1, 2, 2]), ('1960', [1]), ('1974', [1]), ('1996', [1]), ('1997', [1]), ('1998', [1]), ('2001', [2]), ('2002', [1]), ...
I want a default value of 0 when a key is missing from one of the lists. For example, if the key 'g' appears 1 time in the first list, 0 times in the second, 3 times in the third and 6 times in the fourth, the result should be: 'g': [1, 0, 3, 6]
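A minimal sketch of the behaviour described above, assuming one Counter is kept per document instead of a single flat list. The function name counts_per_term and the sample counts are hypothetical and only there for illustration:

```python
from collections import Counter

def counts_per_term(counters):
    """Map each term to a list of its counts, one entry per document (0 if absent)."""
    # Union of every term seen in any document, sorted for stable ordering.
    vocabulary = sorted(set().union(*counters))
    # A Counter returns 0 for a missing key, so absent terms need no special case.
    return {term: [c[term] for c in counters] for term in vocabulary}

# Hypothetical per-document counts standing in for the scraped pages.
docs = [
    Counter({"g": 1, "data": 2}),
    Counter({"data": 5}),
    Counter({"g": 3}),
    Counter({"g": 6, "science": 1}),
]
result = counts_per_term(docs)
print(result["g"])  # [1, 0, 3, 6]
```

Keeping the counters separate is what preserves the position of each document; appending every count to one list loses the information about which page a count came from.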
Edit:
These commented-out lines from my complete code show the attempts that didn't work out:
#m = list(map(dict, map(zip, list_1, list_2)))
#matrix = pd.DataFrame.from_dict(d, orient='index')
matrix = pd.DataFrame({ key:pd.Series(value) for key, value in d.items() })
I have a text file named 'urls.txt' that contains URLs:
https://en.wikipedia.org/wiki/Data_science
https://datajobs.com/what-is-data-science
I need a document-term matrix of all the unique alphanumeric terms. Take the words 'data' and 'science' as an example:
Each row should be [document number, count of 'data', count of 'science']
It should look like:
   data  science
1    96       65
2   105       22
3     0       16
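The table above could be produced along these lines. This is only a sketch using the standard library: document_term_matrix and format_matrix are hypothetical helper names, and the counts are made up to match the example rather than taken from the real pages:

```python
from collections import Counter

def document_term_matrix(counters, terms):
    """Return one row per document: [document number, count of each term]."""
    return [[i] + [c[t] for t in terms] for i, c in enumerate(counters, start=1)]

def format_matrix(rows, terms):
    """Render the rows as a right-aligned text table with a header line."""
    header = [""] + list(terms)
    table = [header] + [[str(v) for v in row] for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(r[j]) for r in table) for j in range(len(header))]
    return "\n".join(
        "  ".join(cell.rjust(w) for cell, w in zip(r, widths)) for r in table
    )

# Hypothetical counts chosen to mirror the example table.
docs = [
    Counter({"data": 96, "science": 65}),
    Counter({"data": 105, "science": 22}),
    Counter({"science": 16}),  # 'data' is absent here, so its count is 0
]
terms = ["data", "science"]
matrix = document_term_matrix(docs, terms)
print(format_matrix(matrix, terms))
```

If pandas is already in use, passing the list of Counters to pd.DataFrame and calling fillna(0) should give a comparable table, since missing keys become NaN before being filled.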
I'm very close but can't get it quite right. I tried list to DataFrame, dict to DataFrame, and building the DataFrame directly, but nothing worked. I searched everywhere and couldn't find anything similar.