
I have a large dataset, around 100k rows, and I am trying to drop a row from a DataFrame if the list stored in that row contains a value from another DataFrame. Here is a small example.

import pandas as pd

has = [['@a'], ['@b'], ['#c, #d, #e, #f'], ['@g']]
use = [1,2,3,5]
z = ['#d','@a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

              tweet  user
0              [@a]     1
1              [@b]     2
2  [#c, #d, #e, #f]     3
3              [@g]     5
    z
0  #d
1  @a

The desired outcome would be

              tweet  user
0              [@b]     2
1              [@g]     5

Things I've tried:

#this seems to work for dropping @a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)

# this works for my small-scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]

# the error is "unterminated character set at position 1343770"
# I went to check what was at that position and it returned this
basket.iloc[1343770]

user_id                                 17060480
tweet      [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
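
A possible reading of this error, offered as a sketch rather than a confirmed diagnosis: str.contains treats the joined string as a regular expression, so the reported position is an offset into the pattern built from df2['z'], not a row number in the data. An unescaped "[" in one of the search values would produce exactly this message. Escaping each value with re.escape before joining avoids that. The value '#x[y' below is hypothetical and only there to trigger the problem.

```python
import re
import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3], 'tweet': [['@a'], ['#x[y'], ['@g']]})
df2 = pd.DataFrame({'z': ['#x[y', '@a']})  # '#x[y' is a made-up value containing "["

# Escape every search value so regex metacharacters are matched literally
pattern = '|'.join(re.escape(s) for s in df2['z'].astype(str))

joined = df['tweet'].apply(', '.join)
kept = df[~joined.str.contains(pattern, regex=True)]
```

Note that this is still substring matching, so a search value of '#d' would also match a tag such as '#dog'.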

Any help would be greatly appreciated.

qrs
  • Have you tried [unstacking](https://stackoverflow.com/questions/42012152/unstack-a-pandas-column-containing-lists-into-multiple-rows), merging and then reducing back? – Nico Albers Mar 10 '18 at 12:46

2 Answers


Is ['#c, #d, #e, #f'] a single string, or a list like ['#c', '#d', '#e', '#f']? The code below assumes the latter:

has = [['@a'], ['@b'], ['#c', '#d', '#e', '#f'], ['@g']]
use = [1,2,3,5]
z = ['#d','@a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

A simple solution would be:

screen = set(df2.z.tolist())
to_delete = []  # collect the row labels first so drop() runs only once
for idx, row in df.iterrows():
    # a non-empty intersection means the row contains a screened value
    if set(row.tweet).intersection(screen):
        to_delete.append(idx)
df.drop(to_delete, inplace=True)
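
The same set-based idea can also be written as a boolean mask, which avoids iterrows and the explicit list of labels. This is a minimal sketch using the sample data above, not a benchmarked replacement; the names keep and result are introduced here for illustration.

```python
import pandas as pd

has = [['@a'], ['@b'], ['#c', '#d', '#e', '#f'], ['@g']]
use = [1, 2, 3, 5]
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': ['#d', '@a']})

screen = set(df2['z'])
# True for rows whose tag list shares nothing with the screened values
keep = df['tweet'].apply(screen.isdisjoint)
result = df[keep].reset_index(drop=True)
```

Here result holds the rows for users 2 and 5, and df itself is left untouched.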

Speed comparison (for 10,000 rows):

import time

st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258

st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
Steven G

For me, your code works if I make several adjustments.

First, range(df.tweet.size) only lines up with the row labels if the index is the default 0..n-1. It is more robust to iterate over df.tweet.index.

Second, df.drop(a) returns a new DataFrame and leaves df unchanged, so the drop is never applied; pass inplace=True (or reassign the result).

Third, #d sits inside a single string: '#c, #d, #e, #f' is one string, not a list of tags, so the membership test never finds '#d'. You have to turn it into a list for this to work.

So if you change that, the following code works fine:

has = [['@a'], ['@b'], ['#c', '#d', '#e', '#f'], ['@g']]
use = [1,2,3,5]
z = ['#d','@a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # so if we already dropped it we no longer look whether we should drop this line

This gives the desired result. Be aware that it may be slow on large data, since the nested Python loops are not vectorized.
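
For comparison, here is a sketch of a loop-free variant. It assumes pandas 0.25 or newer, where Series.explode is available (that release postdates this answer), and it assumes the tweets are already proper lists of tags. The names exploded, hit_labels and result are introduced here for illustration.

```python
import pandas as pd

has = [['@a'], ['@b'], ['#c', '#d', '#e', '#f'], ['@g']]
use = [1, 2, 3, 5]
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': ['#d', '@a']})

# One row per tag; the original row label is repeated for each tag
exploded = df['tweet'].explode()
# Labels of the original rows that contain at least one screened tag
hit_labels = exploded.index[exploded.isin(df2['z'])].unique()
result = df.drop(index=hit_labels)
```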

EDIT:

You can turn the comma-separated string into a list with the following:

from itertools import chain
# split each string on commas, flatten, and strip the space left after each comma
df.tweet = df.tweet.apply(lambda l: [s.strip() for s in chain.from_iterable(lelem.split(",") for lelem in l)])

This applies a function to each row (assuming each row holds a list with one or more strings): it splits each string on commas, flattens the resulting lists into one, and strips the whitespace left after each comma so that ' #d' becomes '#d'.
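
To see why whitespace matters when splitting on a bare comma, here is a small sketch on the one problematic row from the question. The names raw, naive and clean are introduced only for this illustration.

```python
raw = ['#c, #d, #e, #f']  # one string, as in the question's data

# Splitting on "," alone keeps the space that follows each comma
naive = [part for s in raw for part in s.split(',')]

# Stripping each piece gives tags that compare equal to '#d'
clean = [part.strip() for s in raw for part in s.split(',')]

print(naive)          # ['#c', ' #d', ' #e', ' #f']
print('#d' in naive)  # False
print('#d' in clean)  # True
```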

EDIT2:

Yes, this is not very performant, but it does what was asked. Keep that in mind and, once it works, try to improve the code (fewer for-loop iterations, and tricks such as collecting the indices first and dropping them all at once).

Nico Albers