
I have a dictionary with 2 URLs and their text. I need to get rid of all repeated spaces, special characters and newlines:

{'https://firsturl.com': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n\n ', '\n', '\n', '\n ', '\n ', 'Home | Sam ModelInc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n\n\n \n \n', '\n', '\n', '\n', '\n', '\n ', '\n ', 'Skip to main content'],'https://secondurl.com#main-content': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n\n ', '\n', '\n', '\n ', '\n ', 'Home | Going to start inc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n\n\n \n \n', '\n', '\n', '\n', '\n', '\n ', '\n ', 'Skip to main content', '\n ', '\n \n', '\n\n ', '\n\n ', '\n \n \n \n \n ', '\n\n ', '\n ', '\n\n \n ', '\n ', '\n\n \n ', '\n ', 'Brands', '\n', 'About Us', '\n', 'Syndication', '\n', 'Direct Response']}

Expected output: {'https://firsturl.com': ['home sam modelinc skip to main content'], 'https://secondurl.com#main-content': ['home going to start inc skip to main content brands about us syndication direct response']}

Help would be much appreciated

  • Welcome to SO! Before asking a question, please read [how to ask a question on SO](https://stackoverflow.com/help/how-to-ask). It is recommended to show your efforts. –  May 28 '20 at 16:49
  • for key, value in my_dict.items(): my_dict[key] = [x.strip() for x in value] – Bercey Efund May 28 '20 at 16:53
  • hint: google `python3 strip`. Bigger hint: loop over each item in each list, remove the newlines and check its length; if it is zero, discard it. – Cwissy May 28 '20 at 16:54

1 Answer


So let's try to walk through this instead of just throwing some code at you.

The first element we want to get rid of is the newline. So we could start with something like:

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list

If you run that, you'll see that every element containing a newline is now filtered out.
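One caveat: the `"\n" not in e` test drops any element that contains a newline anywhere, so a string such as `"Home\n"` would be lost along with its text. A minimal sketch of an alternative, assuming you would rather keep such text, strips each element and discards only the ones that end up empty. The sample values here are made up for illustration:

```python
ex_dict = {"a": ["\n\n", "\n ", "Home | Sam ModelInc\n", "Skip to main content"]}

for key, values in ex_dict.items():
    # Strip surrounding whitespace, then keep only the non-empty strings.
    ex_dict[key] = [v.strip() for v in values if v.strip()]

print(ex_dict)
```

For the data in the question both approaches keep the same elements, because every element there is either pure whitespace or pure text.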

Now we have the following cases:

Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response

According to your expected output, you want to lowercase all words and remove non-alphabet characters.

I did a little research on how to do that.

In code, that looks like:

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # Example: regex.sub("", "Home | Sam ModelInc") returns 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list

So now our final new_list looks something like ['Home  Sam ModelInc', 'Skip to main content']. Note the double space left behind where the | was removed; the last step takes care of it.
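It may help to check exactly what this pattern strips. The sketch below runs the same regex on one string from the question and on one made-up string (`"Top 10 cafés"`, an assumption for illustration). Digits and accented letters are removed too, since only ASCII letters and spaces survive:

```python
import re

regex = re.compile('[^a-zA-Z ]')

# The second sample is hypothetical; \u00e9 is the accented letter in "cafés".
samples = ["Home | Sam ModelInc", "Top 10 caf\u00e9s"]
cleaned = [regex.sub("", s) for s in samples]

print(cleaned)  # ['Home  Sam ModelInc', 'Top  cafs']
```

If your pages contain numbers or non-English text that you want to keep, the character class would need to be widened accordingly.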

Next we want to lowercase everything.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # Example: regex.sub("", "Home | Sam ModelInc") returns 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list

And lastly, we want to combine everything with only one space between each word.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # Example: regex.sub("", "Home | Sam ModelInc") returns 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]

    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
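
To try the whole thing on data shaped like yours, here is a sketch that wraps the same steps in a function. The name `clean_pages` is hypothetical, and the input is a shortened copy of the dictionary from the question (most of the whitespace-only entries are left out):

```python
import re

NON_ALPHA = re.compile('[^a-zA-Z ]')

def clean_pages(pages):
    """Return a new dict mapping each URL to a one-element list of cleaned text."""
    result = {}
    for url, chunks in pages.items():
        kept = [c for c in chunks if "\n" not in c]          # drop chunks containing a newline
        kept = [NON_ALPHA.sub("", c).lower() for c in kept]  # remove symbols, then lowercase
        result[url] = [" ".join(" ".join(kept).split())]     # join with single spaces
    return result

pages = {
    'https://firsturl.com': ['\n\n', '\n ', 'Home | Sam ModelInc', '\n \n\n\n', 'Skip to main content'],
    'https://secondurl.com#main-content': [
        '\n\n', 'Home | Going to start inc', '\n', 'Skip to main content', '\n ',
        'Brands', '\n', 'About Us', '\n', 'Syndication', '\n', 'Direct Response',
    ],
}

print(clean_pages(pages))
```

Unlike the loop above, this builds a new dictionary and leaves the input untouched, which is a design choice rather than a requirement.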
notacorn