How to split the strings in a particular column of a dataframe based on the value of another column?

Question

I am trying to split the strings in the column tweet_text, but only in rows where the column lang is 'en'.

Here is how to do it on a string:

s = 'I am always sad'
s_split = s.split(" ")

This returns:

['I', 'am', 'always', 'sad']

My current code, which does not work:

df['tweet_text'] = df.apply(lambda x: x['tweet_text'].split(" ")  if x['lang'] is 'en' else x['tweet_text'], axis = 1)

Dictionary of data:

{'lang': {1404: 'en',
  1943: 'en',
  2169: 'en',
  2502: 'de',
  3981: 'nl',
  4226: 'en',
  7223: 'en',
  8557: 'de',
  11339: 'pt',
  11854: 'en'},
 'tweet_text': {1404: 'I am always sad when a colleague loses his job and Frank is not just a colleague he is an impoant person in my',
  1943: 'It remains goalless at FNB Stadium between Kaizer Chiefs and Baroka at halftimeRead more',
  2169: 'Which one gets your vote 05',
  2502: 'Was sagt ihr zu den ersten Minuten',
  3981: 'En we gaan door speelronde begint vandaagTegen wie speelt jouw favoriete club',
  4226: 'Quote tweet or replyYour favourite Mesut Ozil moment as a Gunner was',
  7223: 'How to follow the game live The opponent Current form Did you know The squad Koeman said It must b',
  8557: 'BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN',
  11339: '9o golo para',
  11854: 'have loads of boss stuff available on their store products available including the m'}}

[The difference between "==" and "is"?](https://stackoverflow.com/questions/132988/is-there-a-difference-between-and-is) is a generic Python issue, not specifically pandas. Never use `is` for comparing strings, ints or floats. (Only for objects, and only to test actual identicalness, not just equality.) — smci, Feb 11 '21 at 07:37
yudhiesh. No problem. Just it's important to close duplicate askings of old chestnuts, as duplicates. (This serves many purposes, it redirects people searching for/reasking the same thing in future to them, boosts their votecounts and prominence in SO and Google search, etc.) — smci, Feb 11 '21 at 09:04

jezrael · Accepted Answer · 2021-02-11T07:26:00.743

2

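Use == instead of is. Also, split() can replace split(" ") here: it splits on any run of whitespace rather than on each single space, so the result is the same when words are separated by single spaces: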

df['tweet_text'] = df.apply(lambda x: x['tweet_text'].split()  if x['lang'] == 'en' else x['tweet_text'], axis = 1)
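The two points above can be checked in plain Python. This is a minimal sketch with made-up values (the variable names and the sample sentence with a double space are assumptions for illustration, not part of the question's data). Whether `is` returns True for two equal strings is an implementation detail, so only the `==` result should be relied on.

```python
# Hypothetical values for illustration only.
literal = "en"
built = "".join(["e", "n"])  # same text, but assembled at runtime

same_value = built == literal    # compares contents: True
same_object = built is literal   # compares identity: not guaranteed for equal strings

text = "I am  always sad"        # note the double space
by_space = text.split(" ")       # splits on every single space, keeps empty strings
by_whitespace = text.split()     # splits on runs of whitespace, drops empty strings

print(same_value, same_object)
print(by_space)       # ['I', 'am', '', 'always', 'sad']
print(by_whitespace)  # ['I', 'am', 'always', 'sad']
```

For text with single spaces between words, both split calls give the same list; they differ only when there are repeated spaces, tabs, or newlines.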

Alternatively, use Series.str.split on only the en rows, selected with a boolean mask:

m = df['lang'] == 'en'
df.loc[m, 'tweet_text'] = df.loc[m, 'tweet_text'].str.split()
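As a self-contained check of the mask approach, here is a sketch on three rows taken from the question's data. The explicit cast to object dtype is an assumption added so the column can hold both lists and strings regardless of the pandas version's default string dtype; it is not part of the original answer.

```python
import pandas as pd

# Three rows copied from the question's data.
df = pd.DataFrame(
    {
        "lang": ["en", "de", "pt"],
        "tweet_text": [
            "Which one gets your vote 05",
            "Was sagt ihr zu den ersten Minuten",
            "9o golo para",
        ],
    },
    index=[2169, 2502, 11339],
)

# Assumption: force object dtype so the column can store lists next to strings.
df["tweet_text"] = df["tweet_text"].astype(object)

# Boolean mask: True only for English rows.
m = df["lang"] == "en"

# Split only the masked rows; the other rows keep their original strings.
df.loc[m, "tweet_text"] = df.loc[m, "tweet_text"].str.split()

print(df)
```

After this runs, the en row holds a list of words while the de and pt rows are unchanged strings.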

edited Feb 11 '21 at 07:26

answered Feb 11 '21 at 07:24

jezrael


score 0 · Answer 2 · answered Feb 11 '21 at 07:30

0

You can also do it this way:

# Row mask and column label bundled into one tuple, used as a single .loc indexer
idx = (df["lang"] == "en", "tweet_text")
df.loc[idx] = df.loc[idx].str.split()

answered Feb 11 '21 at 07:30

Pablo C

