
I am trying to build a retriever.

I used wget to download the website and then extracted all of the text.

I want to build a dict for each file, like

{'Activity':'index2.html','and':'index2.html','within':'index2.html',...}
{'Rutgers':'index.html','Central':'index.html','Service':'index.html',...}

but the output I get is

{'Activity':'i','and':'n','within':'d',...}
{'Rutgers':'i','Central':'n','Service':'d',...}

The filename is being split into single characters.

import os
import string
from os import listdir
from os.path import isfile, join

from bs4 import BeautifulSoup as bs

mypath = "/Users/Tsu-AngChou/MasterProject/Practice/try_test/"
files = listdir(mypath)
# Translation table that strips all punctuation characters.
translator = str.maketrans("", "", string.punctuation)
storage = []
for f in files:
    fullpath = join(mypath, f)
    if f == '.DS_Store':
        os.remove(fullpath)
    elif isfile(fullpath):
        print(f)
        # Open by full path so the script works from any working directory.
        with open(fullpath, 'r', encoding='utf-8') as response:
            html_cont = response.read()
        soup = bs(html_cont, 'html.parser')
        regular_string = soup.get_text()

        new_string = regular_string.translate(translator).split()
        new_list = [item[:14] for item in new_string]
        # This is the line that produces the unexpected output shown above.
        a = dict(zip(new_list, f))
        print(a)
Steve
  • Could you show some example file names and which part of the file you want? https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python – RetroCoder Nov 30 '17 at 22:11
  • My filenames are index2.html and index.html – Steve Nov 30 '17 at 22:24

2 Answers


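zip steps through the elements of each sequence in parallel, and a string is a sequence of characters, so f is consumed one character at a time. What you need instead is a simple pair for each word, with the whole of f as the second element. Try something like this: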

sent = "Activity and within".split()
f = "index.html"
a = dict((word, f) for word in sent)
print(a)

Output:

{'Activity': 'index.html', 'and': 'index.html', 'within': 'index.html'}
Prune

You could use dict.fromkeys:

a = dict.fromkeys(new_list, f)

This uses the items of new_list as the keys and gives every key the same value f.
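
As a quick check of the idea, here is a minimal sketch. The word list and file name are hypothetical sample data standing in for new_list and f from the question, not values taken from the real files.

```python
# Hypothetical sample data standing in for new_list and f from the question.
new_list = ["Rutgers", "Central", "Service", "Central"]
f = "index.html"

# Every key is mapped to the same value; f is never iterated.
a = dict.fromkeys(new_list, f)
print(a)

# Duplicate words collapse into a single key, as in any dict.
print(len(a))
```

Note that every key refers to the same value object. That is harmless for an immutable string like f, but it would matter if the shared value were a list that is modified later.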

Paul Panzer
  • Wow, may I ask what happened? I did not split my filename, so why was it split when I used it? – Steve Nov 30 '17 at 22:21
  • @Steve `zip` expects iterables as its arguments. Try `zip([1,2],3)`. Your code only ran without raising an exception because strings are iterable: they yield their characters one by one. Try: `for i in "hello": print(i)`. – Paul Panzer Nov 30 '17 at 22:32
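
The behaviour described in the comment above can be reproduced with a short sketch. The words and file name below are hypothetical sample data. The last part uses itertools.repeat, which is not mentioned in the answers; it is shown only as one possible way to keep zip while still pairing each word with the whole string.

```python
from itertools import repeat

words = ["Activity", "and", "within"]   # hypothetical sample words
filename = "index2.html"                # hypothetical file name

# A str is iterable, so zip pairs each word with one character of it.
per_char = dict(zip(words, filename))
print(per_char)

# zip stops at the shorter input, so with many words only
# len(filename) of them make it into the dict.
many_words = [f"w{i}" for i in range(50)]
truncated = dict(zip(many_words, filename))
print(len(truncated), len(filename))

# repeat(filename) yields the whole string every time.
per_word = dict(zip(words, repeat(filename)))
print(per_word)
```

The second case is easy to miss: besides splitting the filename, the original zip call silently drops every word past the length of the filename.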