1

I'm trying to scrape information from a website. I put the info into a list, but when I print the list, it looks something like this:

list =  ['text  \n\n  more text (1)  \n\n  even more text  \n\n']

As you can see, nothing is separated. I want the list to look something like this:

list = ['text','more text (1)', 'even more text']

I tried doing list = [i.split('\n\n') for i in list], but that didn't work. The result was:

list = [['text  ', '  more text (1)  ', '  even more text  ', '']]

How can I fix this?

Thank you in advance for taking the time to read my question and help in any way you can. I appreciate it.

shorttriptomars

6 Answers

1

You're almost there. If you do the following, you should get the result you want:

the_list = ['text  \n\n  more text (1)  \n\n  even more text  \n\n']
final_list = list(filter(None, [i.strip() for i in the_list[0].split('\n\n')]))

My previous answer failed for two reasons. First, the_list is a list of length 1, so the string has to be taken out with the_list[0] before it can be split. Second, I had put the split in the wrong place.

I've also added the filter to "squeeze" an empty result at the end in case you want to remove those.
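To illustrate what the filter step does, here is a minimal sketch. The names pieces, kept and same are only for illustration. filter(None, ...) keeps only truthy items, so the empty string left behind by the trailing separator is dropped.

```python
the_list = ['text  \n\n  more text (1)  \n\n  even more text  \n\n']

# Splitting on the separator leaves an empty string after the final '\n\n'
pieces = [p.strip() for p in the_list[0].split('\n\n')]
print(pieces)  # ['text', 'more text (1)', 'even more text', '']

# filter(None, ...) keeps only truthy items, so '' is removed
kept = list(filter(None, pieces))
print(kept)  # ['text', 'more text (1)', 'even more text']

# An equivalent list comprehension, if that reads more clearly
same = [p for p in pieces if p]
```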

PrinsEdje80
  • Thank you for taking the time to answer my question! I got this error: `AttributeError: 'list' object has no attribute 'strip'`. I tried `final_list = [i.strip().split('\n\n') for i in list]` instead. The problem is that when I convert this to a dataframe, it still has everything in one row and doesn't split the text into rows. – shorttriptomars Feb 24 '22 at 18:09
  • Oops. I've done something wrong. I'll fix it. – PrinsEdje80 Feb 24 '22 at 19:58
1

Here is a way to do it. I first split each string of your list on "\n\n" and then remove any leading or trailing whitespace using the strip method.

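# Sample input from the question
liste = ['text  \n\n  more text (1)  \n\n  even more text  \n\n']

info = []
for i in liste:
    # Drop a trailing separator so split() does not yield an empty string
    if i[-2:] == "\n\n":
        i = i[:-2]
    untrimmed = i.split("\n\n")
    # Remove leading and trailing whitespace from each piece
    trimmed = [j.strip() for j in untrimmed]
    info.append(trimmed)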

The if statement removes a trailing "\n\n" from the input, so that the split does not produce an empty string at the end.
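Note that the loop builds one inner list per input string, so info is a list of lists. If a single flat list is wanted, as in the question, one possible sketch is to flatten it with itertools.chain.from_iterable. The second input string below is made up for illustration, and empty pieces are filtered out here instead of checking the suffix.

```python
from itertools import chain

# Hypothetical input with two scraped strings (the second one is invented)
liste = ['text  \n\n  more text (1)  \n\n', 'another  \n\n  block']

info = []
for i in liste:
    parts = [j.strip() for j in i.split("\n\n")]
    # Keep only non-empty pieces; this also handles a trailing "\n\n"
    info.append([p for p in parts if p])

# Flatten the list of lists into one flat list
flat = list(chain.from_iterable(info))
print(flat)  # ['text', 'more text (1)', 'another', 'block']
```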

Maxime Lavaud
1

I tried this code and it worked for me; maybe it helps you.

lst = "text  \n\n  more text (1)  \n\n  even more text"

# Split on the blank-line separator, then strip the leftover spaces
x = [part.strip() for part in lst.split("\n\n")]

print("list=", x)
Sathi Aiswarya
1

Try this code maybe:

import re
list =  ['text  \n\n  more text (1)  \n\n  even more text  \n\n']
list[0] = list[0].replace('  \n\n  ', '#').replace('  \n\n', '#')
list = re.split('#',list[0])

if list[len(list) - 1] == '':
  list.pop(len(list) - 1)

print(list)

Output:

['text', 'more text (1)', 'even more text']

First we replace every instance of '  \n\n  ' and '  \n\n' with '#'. Two replacements are needed because, although the elements are separated by '  \n\n  ', the string ends with '  \n\n' and no spaces after it, so that final instance has to be matched separately.

Afterwards, we split the string on every instance of '#', and pop the final element if it is an empty string caused by the trailing '  \n\n'.
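As a possible variation, the placeholder can be avoided altogether, which also sidesteps the case where the scraped text itself contains a '#'. This is only a sketch, assuming the separator is always two newlines with optional whitespace around them; the names raw and parts are illustrative.

```python
import re

raw = 'text  \n\n  more text (1)  \n\n  even more text  \n\n'

# strip() removes the trailing separator, then the pattern splits on
# "\n\n" together with any whitespace on either side of it
parts = re.split(r'\s*\n\n\s*', raw.strip())

print(parts)  # ['text', 'more text (1)', 'even more text']
```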

I hope this helped! Please let me know if you need any further clarification or details :)

Aniketh Malyala
1
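list1 = ['text  \n\n  more text (1)  \n\n  even more text  \n\n']
# Join the scraped strings into a single string
joined = "".join(list1)
# Split directly on the blank-line separator (going through ',' would
# break on text that already contains commas) and strip each piece
words = [x.strip() for x in joined.split('\n\n')]
# Drop the empty strings left by the trailing separator
words = [w for w in words if w]
print(words)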
1

Please, try this:

list =  ['text  \n\n  more text (1)  \n\n  even more text  \n\n']
aux = list[0].split('\n\n')
list_final = [e.strip() for e in aux]
list_final.remove('')
RubyLearning