I am trying to split the strings in the column `tweet_text` into lists of words, but only for rows where the column `lang` is `'en'`.

Here is how to do it on a string:

s = 'I am always sad'
s_split = s.split(" ")

This returns:

['I', 'am', 'always', 'sad']

My current code, which does not work:

df['tweet_text'] = df.apply(lambda x: x['tweet_text'].split(" ")  if x['lang'] is 'en' else x['tweet_text'], axis = 1)

Dictionary of data:

{'lang': {1404: 'en',
  1943: 'en',
  2169: 'en',
  2502: 'de',
  3981: 'nl',
  4226: 'en',
  7223: 'en',
  8557: 'de',
  11339: 'pt',
  11854: 'en'},
 'tweet_text': {1404: 'I am always sad when a colleague loses his job and Frank is not just a colleague he is an impoant person in my',
  1943: 'It remains goalless at FNB Stadium between Kaizer Chiefs and Baroka at halftimeRead more',
  2169: 'Which one gets your vote 05',
  2502: 'Was sagt ihr zu den ersten Minuten',
  3981: 'En we gaan door speelronde begint vandaagTegen wie speelt jouw favoriete club',
  4226: 'Quote tweet or replyYour favourite Mesut Ozil moment as a Gunner was',
  7223: 'How to follow the game live The opponent Current form Did you know The squad Koeman said It must b',
  8557: 'BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN',
  11339: '9o golo para',
  11854: 'have loads of boss stuff available on their store products available including the m'}}
yudhiesh
  • [The difference between "==" and "is"?](https://stackoverflow.com/questions/132988/is-there-a-difference-between-and-is) is a generic Python issue, not specifically pandas. Never use `is` for comparing strings, ints or floats. (Only for objects, and only to test actual identicalness, not just equality.) – smci Feb 11 '21 at 07:37
  • @smci got it was just a careless mistake – yudhiesh Feb 11 '21 at 07:40
  • yudhiesh. No problem. Just it's important to close duplicate askings of old chestnuts, as duplicates. (This serves many purposes, it redirects people searching for/reasking the same thing in future to them, boosts their votecounts and prominence in SO and Google search, etc.) – smci Feb 11 '21 at 09:04

2 Answers

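Use `==` instead of `is`. You can also call `split()` with no argument: it splits on any run of whitespace, whereas `split(" ")` splits on every single space, so the two give the same result only when words are separated by exactly one space: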

df['tweet_text'] = df.apply(lambda x: x['tweet_text'].split()  if x['lang'] == 'en' else x['tweet_text'], axis = 1)
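
To illustrate why `is` fails here and where the two `split` calls differ, here is a minimal plain-Python sketch. The variable names and the sample strings are made up for illustration; the result of the `is` check is a CPython implementation detail, so it is printed rather than relied on.

```python
# Build "en" at runtime, roughly how values loaded into a DataFrame arrive
lang = "".join(["e", "n"])
expected = "en"

print(lang == expected)  # equality: compares the characters
print(lang is expected)  # identity: asks whether both names point to the same object

# A sample string with a double space between "am" and "always"
text = "I am  always sad"
print(text.split(" "))  # ['I', 'am', '', 'always', 'sad']
print(text.split())     # ['I', 'am', 'always', 'sad']
```

Two strings with the same characters are always equal under `==`, but they are not guaranteed to be the same object, which is why the `is` test in the question can be false for every row.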

Alternatively, use `Series.str.split` on only the `en` rows:

m = df['lang'] == 'en'
df.loc[m, 'tweet_text'] = df.loc[m, 'tweet_text'].str.split()
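
A self-contained sketch of the masked version, using three rows taken from the question's data. Passing `dtype=object` is an assumption added here so that the column can hold both strings and lists regardless of the pandas version's default string dtype.

```python
import pandas as pd

# Three rows from the question's data; object dtype allows mixed str/list values
df = pd.DataFrame(
    {
        "lang": ["en", "de", "en"],
        "tweet_text": [
            "I am always sad",
            "Was sagt ihr zu den ersten Minuten",
            "Which one gets your vote 05",
        ],
    },
    index=[1404, 2502, 2169],
    dtype=object,
)

# Boolean mask selecting the English rows
m = df["lang"] == "en"

# Split only the selected rows; the other rows are left untouched
df.loc[m, "tweet_text"] = df.loc[m, "tweet_text"].str.split()

print(df)
```

The English rows now hold lists of words, while the German row keeps its original string.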
jezrael

You can also do it this way:

mask = df["lang"] == "en", "tweet_text"
df.loc[mask] = df.loc[mask].str.split()
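
Note that `mask` here is a tuple of a boolean row selector and a column label, so `df.loc[mask]` means the same as `df.loc[rows, "tweet_text"]`. A minimal sketch on a two-row frame made up for illustration (the explicit parentheses and `dtype=object` are additions, not part of the answer above):

```python
import pandas as pd

# Two rows from the question's data; object dtype allows mixed str/list values
df = pd.DataFrame(
    {"lang": ["en", "pt"], "tweet_text": ["Which one gets your vote 05", "9o golo para"]},
    index=[2169, 11339],
    dtype=object,
)

# The trailing ", 'tweet_text'" makes this a (row mask, column label) tuple
mask = (df["lang"] == "en", "tweet_text")
print(type(mask))

# Indexing .loc with the tuple is the same as df.loc[row_mask, "tweet_text"]
df.loc[mask] = df.loc[mask].str.split()
print(df)
```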
Pablo C