    for url in urls:
        uClient = ureq(url)
        page_html = uClient.read()
        uClient.close()
        soup = BeautifulSoup(page_html, "html.parser")
        # join the text of every <p> tag on the page
        text = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
        # strip non-alphanumerics, lowercase, and count word frequencies
        c = Counter(re.sub(r"[^a-zA-Z0-9 ]", "", x).strip(punctuation).lower()
                    for y in text for x in y.split())
        for key in sorted(c.keys()):
            l.append([key, c[key]])

    d = collections.defaultdict(list)
    for k, v in l:
        d[k].append(v)

    print(d.items())

The output I'm getting is:

([('', [3, 9, 4, 1]), ('1', [1, 2, 2]), ('1960', [1]), ('1974', [1]), ('1996', [1]), ('1997', [1]), ('1998', [1]), ('2001', [2]), ('2002', [1]), ...

I want a default value of 0 if the key isn't found in a list. For example, if the key 'g' appears 1 time in the first list, 0 times in the second, 3 in the third, and 6 in the fourth, it should return: 'g': [1, 0, 3, 6]
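
A minimal sketch of that behaviour, assuming one `Counter` per page is kept in a list (the names `counters` and `vocab` are made up here for illustration):

    from collections import Counter

    # Hypothetical per-page word counts, one Counter per URL.
    counters = [Counter({'g': 1}), Counter(), Counter({'g': 3}), Counter({'g': 6})]

    # The vocabulary is the union of all keys across pages.
    vocab = sorted(set().union(*counters))

    # Counter returns 0 for a missing key, so every word gets exactly
    # one count per page, zero-filled where absent.
    result = {word: [c[word] for c in counters] for word in vocab}
    print(result)  # {'g': [1, 0, 3, 6]}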

Edit:

These commented lines from my complete code show the attempts that didn't work out:

        #m = list(map(dict, map(zip, list_1, list_2)))    
        #matrix = pd.DataFrame.from_dict(d, orient='index')
        matrix = pd.DataFrame({ key:pd.Series(value) for key, value in d.items() })
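
That dict-of-Series construction is actually close: pandas aligns the Series and pads the shorter ones with NaN, so if position i in every list already corresponded to document i, the gaps could be turned into zeros afterwards. A sketch with made-up counts, not the real scraped data:

    import pandas as pd

    d = {'data': [96, 105], 'science': [65, 22, 16]}  # illustrative counts only
    matrix = pd.DataFrame({key: pd.Series(value) for key, value in d.items()})
    matrix = matrix.fillna(0).astype(int)  # NaN -> 0 where a term is missing
    print(matrix)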

I have a text file named 'urls.txt' that contains URLs:

https://en.wikipedia.org/wiki/Data_science
https://datajobs.com/what-is-data-science

I need a document-term matrix of all the unique alphanumeric words. Say, for the terms 'data' and 'science':
one row should be [document number, count of 'data', count of 'science'].
It should appear like this:

   data  science
1    96       65
2   105       22
3     0       16

I'm very close but can't get it right. I've tried list to DataFrame, dict to DataFrame, and building the DataFrame directly, but nothing worked. I've searched everywhere and couldn't find anything similar.
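
For what it's worth, scikit-learn's `CountVectorizer` builds exactly this kind of document-term matrix; a sketch, assuming one string of page text per document (on older scikit-learn versions the last call is `get_feature_names()`):

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    docs = ["data science and more data", "the science of data"]  # one string per page
    vec = CountVectorizer()
    dtm = vec.fit_transform(docs)  # sparse documents-by-terms count matrix
    df = pd.DataFrame(dtm.toarray(), columns=vec.get_feature_names_out())
    print(df)  # zero-filled counts, one row per document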

  • Init your `list` object with `l = [0,0,0,0]` and instead of appending use `l[i] = c[key]`. – stovfl Sep 29 '18 at 16:54
  • Could you please elaborate further? How would I vary it? – beta Sep 29 '18 at 17:13
  • [Edit] your Question and provide [mcve]. Remove the `url` part and loop with **3** sentences of text. – stovfl Sep 29 '18 at 17:58
  • Edited the code. Have given an example now. Thanks for the suggestion – beta Sep 29 '18 at 20:46
  • @beta Posting your complete code is the opposite of a minimal example. – user3738870 Sep 29 '18 at 20:48
  • The initial output is from dict.items(), which should then be converted into the final form. – beta Sep 29 '18 at 20:50
  • @user3738870 It's just that if I get dict.items() correct, I can easily convert that into the desired form. I posted the entire code for clarity. – beta Sep 29 '18 at 20:52
  • **Don't** comment to explain your Question, [edit] your Question instead. You need something like [how-to-split-a-string-into-a-list](https://stackoverflow.com/questions/743806/how-to-split-a-string-into-a-list) and [how-to-count-word-frequencies-within-a-file-in-python](https://stackoverflow.com/questions/12117576/how-to-count-word-frequencies-within-a-file-in-python) – stovfl Sep 30 '18 at 08:08
  • You want to do [from-strings-to-vectors](https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors) – stovfl Sep 30 '18 at 08:50

1 Answer


I'm answering my own question since I figured out a way to do it, and I'm posting it here in case someone else needs help:

from bs4 import BeautifulSoup
from collections import Counter, OrderedDict
from string import punctuation
from urllib.request import urlopen as ureq
import re
import pandas as pd

Q1 = open("Q1.txt", "w")  # output file for the CSV dumps

def web_parsing(filename):
    with open(filename, "r") as df:
        urls = df.readlines()
        a = []  # vocabulary: every word seen in any page so far
        d = []  # flattened [word, count, word, count, ...] pairs
        e = []  # one {word: count} dict per page
        for url in urls:
            uClient = ureq(url)
            page_html = uClient.read()
            uClient.close()
            soup = BeautifulSoup(page_html, "html.parser")
            # join the text of every <p> tag on the page
            text = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
            # strip non-alphanumerics, lowercase, and count word frequencies
            c = Counter(re.sub(r"[^a-zA-Z0-9 ]", "", x).strip(punctuation).lower()
                        for y in text for x in y.split())
            # grow the vocabulary with words not seen before
            for key in c.keys():
                if key not in a:
                    a.append(key)
            a = list(filter(None, a))  # drop empty strings
            # split the stop-word file on commas and strip whitespace, so the
            # membership test below compares strings rather than lists
            with open('stop_words.txt', 'r') as stopfile:
                stopwords = [w.strip() for line in stopfile for w in line.split(',')]
            a = [item for item in a if item not in stopwords]
            # this page's count for every vocabulary word
            # (Counter returns 0 for words the page doesn't contain)
            l = sorted([word, c[word]] for word in a)
            flat_list = [item for sublist in l for item in sublist]
            d.extend(flat_list)
            # rebuild {word: count} for the current page; later pairs
            # overwrite earlier ones, so b holds this page's counts
            b = {d[i]: d[i + 1] for i in range(0, len(d), 2)}
            e.append(b)
        j = len(urls)  # number of documents
        # build {word: [count per document]}, filling 0 where a word
        # is absent from a page
        result = {}
        for key in a:
            for i in range(0, j):
                if key in e[i]:
                    result.setdefault(key, []).append(e[i][key])
                else:
                    result.setdefault(key, []).append(0)
        od = OrderedDict(sorted(result.items()))
        df1 = pd.DataFrame(od)
        # a smaller matrix restricted to a few terms of interest
        df2 = df1.loc[:, ['data', 'companies', 'business', 'action', 'mining', 'science']]
        df1.to_csv(Q1, header=True)
        df2.to_csv(Q1, header=True)
        print(len(a))
        return df1
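
The same matrix can be reached more directly by keeping one `Counter` per URL and letting `Counter`'s default of 0 do the zero-filling; a condensed sketch (the token lists are illustrative, and stop-word removal is omitted):

    from collections import Counter
    import pandas as pd

    # One token list per fetched page; the real lists would come from
    # the BeautifulSoup parsing above.
    pages = [["data", "science", "data"], ["science", "mining"]]
    counters = [Counter(tokens) for tokens in pages]
    vocab = sorted(set().union(*counters))

    # Counter returns 0 for missing words, so each column is zero-filled.
    dtm = pd.DataFrame([[c[w] for w in vocab] for c in counters],
                       columns=vocab, index=range(1, len(pages) + 1))
    print(dtm)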