My question is simply: if neither `split()` nor `splitlines()` works to split a string into multiple lines, does that mean that nothing is delimiting the string?

My example is pretty in depth but in short: I have parsed specific data out of an HTML table with BeautifulSoup, but when I go to print the data it is all one messy string instead of a neat table format. I tried converting it to a Pandas DataFrame but still no success. I then tried using the above commands to neaten up the output but those also failed. This all leads me to believe it must in fact be one continuous string with no delimiters (even though obviously in the table they are separate entries).

I would love help with this problem. I am not sure if I'm using the commands wrong, or if my data really is this difficult to work with. Thank you.

My data (and how I expect it should be printed):

desired output

My relevant code:

rows = table.findAll("tr")[1:2]
data = {
    'ID' : [],
    'Available Quota' : [],
    'Live Weight Pounds' : [],
    'Price' : [],
    'Date Posted' : []
}

for row in rows:
    cols = row.findAll("td")
    data['ID'].append(cols[0].get_text())
    data['Available Quota'].append(cols[1].get_text())
    data['Live Weight Pounds'].append(cols[2].get_text())
    data['Price'].append(cols[3].get_text())
    data['Date Posted'].append(cols[4].get_text())

fishData = pd.DataFrame(data)
#print(fishData)
str1 = ''.join(data['Available Quota'])
#print(type(str1))
#str1.split("\n")
str1.splitlines()
print(str1)

What gets printed:

GOM CODGOM HADDDABSGOM YT
Tomalak
theprowler
  • You are using `split/splitlines` wrong, but in any case that's just one line that you printed so it's not going to help. Why did you join the data in the first place? It looks like you already had it in a proper structure. – Alex Hall Dec 01 '16 at 16:34
  • I joined the data because it was a list and I needed to make it into a string so that I could split it. I couldn't split a list. But so I am using them wrong? Would you mind telling me how to use them correctly? – theprowler Dec 01 '16 at 16:39
  • Strings are immutable, no method will change them into something else, you can only get a new value based on it. So you might say `lines = str1.splitlines()`. But again, it's not clear why you'd join and then split! You have a list of values, what do you expect to accomplish by joining and splitting? – Alex Hall Dec 01 '16 at 16:41
  • I might be wrong so feel free to correct me if I am but I thought the `join()` command turns a list into a string, which is what I need. And then I thought the `split()` command would take the string `GOM CODGOM HADDDABSGOM YT` and split it into how it is supposed to look: `GOM COD` `GOM HADD` `DABS` `GOM YT` – theprowler Dec 01 '16 at 16:48
  • `split` will turn a string into a list and then you're back where you started. I don't know what you're trying to show me at the end, python is not going to output something with StackOverflow formatting. – Alex Hall Dec 01 '16 at 17:01
  • Ohhh maybe I got myself mixed up I'm sorry, I was wrong you were right, I've been trying a variety of methods and just comment them out when they don't work. I think the `join` command was one method, and the `split` was another. But can you see the image of the data table I posted above in my original question? That's how I want it printed, simply in rows and columns, that way I can then export it to my database. – theprowler Dec 01 '16 at 17:19
  • If you want it in your database then don't try printing it at all. In fact you don't even have to put anything into data. Put each row in your database as you get it. – Alex Hall Dec 01 '16 at 17:20
  • But the thing is, if it exports like `GOM CODGOM HADDDABSGOM YT` that won't work. That is supposed to be four different fish species (obviously abbreviated) so I'd need each of those to go into its own data cell, along with its corresponding weight, price, ID, in the same row. I have succeeded in parsing and exporting data from txt files, PDFs, and excel sheets it's just these darn HTML tables that are messing me up. – theprowler Dec 01 '16 at 18:05
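The point made in the comments about immutability can be sketched in a few lines. This is only an illustration using the string printed in the question: `splitlines()` returns a new list rather than changing `str1`, and because the joined string contains no newline there is nothing for it to split on.

```python
# The joined string from the question (the cell text as printed there).
str1 = "GOM CODGOM HADDDABSGOM YT"

# splitlines() returns a new list; str1 itself is never modified.
str1.splitlines()          # result discarded, so this line has no visible effect
lines = str1.splitlines()  # keep the result by assigning it to a name

print(lines)         # a single element, because the string contains no newline
print(str1.split())  # splitting on whitespace cuts in the wrong places
```

Neither call can recover the four species names, since the boundaries between them were already lost before the string was built.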

2 Answers

My guess is that there's some formatting happening inside the table cells that you're throwing away. Supposing that the four lines visible in your table cell are separated by <br> tags, BeautifulSoup will discard that information when you call get_text:

>>> from bs4 import BeautifulSoup
>>> s = 'First line <br />Second line <br />Third line'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.get_text()
u'First line Second line Third line'

As noted over here, you can swap out <br> tags for newlines, which might make your life easier:

>>> for br in soup.find_all("br"):
...     br.replace_with("\n")
>>> soup.get_text()
u'First line \nSecond line \nThird line'

The strings and stripped_strings generators might also be useful here; they return chunks of text which were originally separated by tags:

>>> soup = BeautifulSoup(s, 'html.parser')
>>> list(soup.stripped_strings)
[u'First line', u'Second line', u'Third line']

So, what happens if you do:

data['Available Quota'].extend(cols[1].stripped_strings)

Hopefully, you should have the list you're looking for in data['Available Quota']:

>>> data['Available Quota']
['GOM COD', 'GOM HADD', 'DABS', 'GOM YT']
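Putting the pieces together, here is a minimal sketch of how one table row with `<br>`-separated cells might be expanded into one record per species. The HTML below is hypothetical (the real page is not shown in the question), and the weight values and the `records` structure are invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one data row whose cells hold several <br>-separated values.
html = """
<table>
  <tr><th>ID</th><th>Available Quota</th><th>Live Weight Pounds</th></tr>
  <tr>
    <td>1</td>
    <td>GOM COD<br>GOM HADD<br>DABS<br>GOM YT</td>
    <td>100<br>200<br>300<br>400</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find_all("tr")[1]   # skip the header row
cols = row.find_all("td")

# stripped_strings yields each chunk of text that was separated by a tag.
species = list(cols[1].stripped_strings)
weights = list(cols[2].stripped_strings)

# Pair the values up so each species gets its own record.
records = [
    {"ID": cols[0].get_text(strip=True), "Available Quota": s, "Live Weight Pounds": w}
    for s, w in zip(species, weights)
]

for record in records:
    print(record)
```

A list of dictionaries like this could then be passed to `pd.DataFrame(records)` or written to a database one row at a time, assuming the other columns on the real page are laid out the same way.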
wildwilhelm
  • At a glance, your fix looks like it'll work perfectly. But when I first tried it I ran into the error `AttributeError: 'ResultSet' object has no attribute 'find_all'` pointing to the `for br in cols.find_all('br'):` line. So I looked up what that error was and ended up editing in `for cols in row:` just before that line........and that eliminated the error, but then I got `KeyError: 1` which indicates the path doesn't exist? – theprowler Dec 01 '16 at 19:14
  • Oh I didn't see you edited your answer. Sorry. I'll try that method right now. – theprowler Dec 01 '16 at 19:15
  • Omg it worked. I have literally been going at this one problem all week with zero success and you solved it so quickly, thank you. If I could just ask one more thing: I entered the code exactly as above `data['Available Quota'].extend(cols[1].stripped_strings) print(data['Available Quota'])` but it prints: `['GOM CODGOM HADDDABSGOM YT', 'GOM COD', 'GOM HADD', 'DABS', 'GOM YT']` I only want the second half of that, where they're each separated, do you know why it prints the first half as well? – theprowler Dec 01 '16 at 19:26
  • No way you've still got the `append` in, is there? To be sure, you could make it `data['Available Quota'] = list(cols[1].stripped_strings)`. Does that help? – wildwilhelm Dec 01 '16 at 19:39
  • Ohhhhhh ok I didn't know that that line you suggested I add in was to replace my `.append` line. Ok I did that. It works perfectly thanks so much man you're a lifesaver. So I don't need the line `for row in rows:` at all do I? When I put the `data['Available Quota']` line in the `for` loop it prints it twice, when it's outside the `for` loop it prints it once (correctly)... – theprowler Dec 01 '16 at 19:59

If you simply replace:

str1 = ''.join(data['Available Quota'])

with

str1 = '\n'.join(data['Available Quota'])

and then comment out the:

str1.splitlines()

Then your print statement will print out the following:

GOM
CODGOM
HADDDABSGOM
YT

A simple example of what I see as the output from doing the '\n'.join()

In [41]: b
Out[41]: ['a', 'b', 'c']

In [42]: print('\n'.join(b))
a
b
c
Yevgeniy Loboda
  • I tried that and it failed again :( it once again printed out: `GOM CODGOM HADDDABSGOM YT` – theprowler Dec 01 '16 at 16:45
  • Are you trying to get – Yevgeniy Loboda Dec 01 '16 at 16:49
  • GOM CODGOM HADDDABSGOM YT to print out? Also, what do you see if you just print(data['Available Quota']) – Yevgeniy Loboda Dec 01 '16 at 16:50
  • My goal is to print it exactly like a DataFrame would print it, in rows and columns like in the picture of the data table in my question, with each quota of fish being on a new line. When I print what you requested I get: `['GOM CODGOM HADDDABSGOM YT']` What I would like is: `GOM COD` `GOM HADD` `DABS` `GOM YT` all on their own line – theprowler Dec 01 '16 at 16:53
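The last comment suggests why `'\n'.join()` made no difference here: the list holds a single element, so there is nothing to put a newline between. A small sketch of the contrast, where the second list is the desired result quoted in the comment rather than anything the original code produced:

```python
# What the question's code actually collected: one element per table cell.
single = ["GOM CODGOM HADDDABSGOM YT"]

# What the asker wants: one element per species (taken from the comment above).
several = ["GOM COD", "GOM HADD", "DABS", "GOM YT"]

joined_single = "\n".join(single)    # no separator is inserted for a 1-element list
joined_several = "\n".join(several)  # newline placed between each pair of elements

print(joined_single)
print("---")
print(joined_several)
```

So the join only helps once the cell text has been split into separate list entries upstream.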