Pandas with different length arrays

Question

This is the code I have. Because of the content of the raw data being parsed, the 'user list' and the 'tweet list' end up with different lengths. When I write the lists as columns in a data frame, I get ValueError: arrays must all be same length. I understand why this happens, but I have been looking for a way to work around it by putting 0 or NaN in the right places of the shorter list. Any ideas?

import pandas
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('#raw.html'))
chunk = soup.find_all('div', class_='content')

userlist = []
tweetlist = []

for tweet in chunk:
    username = tweet.find_all(class_='username js-action-profile-name')
    for user in username:
        user2 = user.get_text()
        userlist.append(user2)

for text in chunk:
    tweets = text.find_all(class_='js-tweet-text tweet-text')
    # collect the tweets of every block, not only those of the last one
    for tweet in tweets:
        tweet2 = tweet.get_text().encode('utf-8')
        tweetlist.append('|'+tweet2)

print len(tweetlist)
print len(userlist)

#MAKE A DATAFRAME WITH THIS
data = {'tweet' : tweetlist, 'user' : userlist}
frame = pandas.DataFrame(data)
print frame

# Export dataframe to csv
frame.to_csv('#parsed.csv', index=False)

Does this answer your question? [Creating dataframe from a dictionary where entries have different lengths](https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths) — Trenton McKinney, Sep 10 '20 at 00:24
The question should be closed as a duplicate, since the main point is to create a dataframe from a `dict` containing arrays of uneven length: `data = {'tweet' : tweetlist, 'user' : userlist}` and `frame = pandas.DataFrame(data)`. The duplicate answers this question and has an accepted answer. – Trenton McKinney, Sep 10 '20 at 00:26

score 13 · Answer 1 · answered Mar 01 '15 at 20:27

I'm not sure that this is exactly what you want, but anyway:

d = dict(tweets=tweetlist, users=userlist)
pandas.DataFrame({k : pandas.Series(v) for k, v in d.iteritems()})
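
For Python 3, the same idea can be sketched as follows. The two lists are made-up stand-ins for `tweetlist` and `userlist`, not data from the question. Wrapping each list in a `Series` gives it an integer index, and pandas aligns the columns on that index, so the shorter one is padded with NaN at the end.

```python
import pandas as pd

# Hypothetical stand-ins for the question's lists; lengths differ on purpose.
tweetlist = ["|first tweet", "|second tweet"]
userlist = ["@alice", "@bob", "@carol"]

d = dict(tweets=tweetlist, users=userlist)

# Each Series gets the index 0..n-1; the DataFrame constructor aligns on it,
# so the missing positions of the shorter column are filled with NaN.
frame = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(frame)
```

Note that the padding always lands in the last rows of the shorter column, whatever the position of the missing items in the raw data.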

Dmitriy Kuznetsov

This bypasses the error, but it puts all the NaNs at the bottom of the tweet list, which breaks the matching between the columns. I'm looking for a way to get the NaNs on the rows where they belong. Maybe some way of getting the `for text in chunk:` loop to emit NaN if it finds no text? – Simon Lindgren Mar 01 '15 at 20:53
What are you parsing? Raw HTML from twitter.com after login? – Dmitriy Kuznetsov Mar 01 '15 at 21:00
Why do you use two separate for loops? I didn't test this code properly but it should work: https://gist.github.com/anonymous/d290798359625804af5f – Dmitriy Kuznetsov Mar 01 '15 at 21:12
Yes! Thank you so much. This worked exactly like I wanted it! – Simon Lindgren Mar 01 '15 at 22:38
Use `.items()` instead of `.iteritems()` for Python3 – Bryce Guinta Oct 14 '15 at 18:38
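
The gist linked in the comments is not reproduced here. A minimal sketch of the single-loop idea might look like the following, assuming each `content` block holds at most one username and at most one tweet text. The HTML snippet and the simplified class names (`username`, `tweet-text`) are invented for illustration and do not match Twitter's real markup. Building one row per block keeps user and tweet on the same line, and a missing element becomes a missing value in that row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the second block has a user but no tweet text.
html = """
<div class="content"><span class="username">@alice</span><p class="tweet-text">hello</p></div>
<div class="content"><span class="username">@bob</span></div>
<div class="content"><span class="username">@carol</span><p class="tweet-text">bye</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("div", class_="content"):
    # find() returns None when nothing matches, unlike find_all(),
    # which returns an empty list and so silently skips the block.
    user = block.find(class_="username")
    text = block.find(class_="tweet-text")
    rows.append({
        "user": user.get_text() if user is not None else None,
        "tweet": text.get_text() if text is not None else None,
    })

# One dict per block means both columns always have the same length.
frame = pd.DataFrame(rows, columns=["user", "tweet"])
print(frame)
```

If a block can contain several tweets or users, this sketch would keep only the first of each, so the row-building step would need to be adapted.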

score 3 · Answer 2 · answered Jun 12 '17 at 07:41

Try this:

frame = pandas.DataFrame.from_dict(d, orient='index')

After that, you should transpose your frame with:

frame = frame.transpose()

Then you can export to csv:

frame.to_csv('#parsed.csv', index=False)
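
Put together, the steps above can be sketched like this. The dictionary `d` is filled with made-up values, since the answer does not define it. With `orient='index'` each list becomes a row, and rows of unequal length are padded with missing values. Transposing then turns the rows back into columns.

```python
import pandas as pd

# Hypothetical data; the lists are deliberately of unequal length.
d = {
    "tweets": ["|first tweet", "|second tweet"],
    "users": ["@alice", "@bob", "@carol"],
}

# Each key becomes a row; the shorter row is padded with a missing value.
wide = pd.DataFrame.from_dict(d, orient="index")

# Swap rows and columns so that "tweets" and "users" are columns again.
frame = wide.transpose()
print(frame)
```

As with the `Series` approach, the padding ends up in the last rows of the shorter column, so this does not by itself restore the pairing between a user and that user's tweet.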

Ekrem Gurdal

score 0 · Answer 3 · answered Jan 03 '22 at 03:19

You can solve this by building the data frame like this, where `Sl` is your dictionary of lists and `pd` is pandas:

dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in Sl.items() })

Aida firoozi
