Fake Data:
How do I write the following columns in Python?
UpperCaseWords: The number of uppercase words in each row
%of Upper Case Words: Percentage of the words in each row that are entirely uppercase
Use .str.split() to split each sentence into a list of words, then use Python's str.isupper() on each word to count the words that are completely uppercase:
import pandas as pd

# example data
df = pd.DataFrame(
    {'A': ['how are you DOING Or', 'WE want NOW ansWers']}
)
# split each string; the default separator is whitespace,
# so every row becomes a list of words
df['split_words'] = df['A'].str.split()
# iterate over each list of words and count how many are uppercase
df['count_upper_case_words'] = df['split_words'].apply(
    lambda list_: sum(1 for word in list_ if word.isupper())
)
# count total number of words
df['count_total_words'] = df['split_words'].str.len()
# calculate percentage of uppercase words
df['perc_uper_case'] = df['count_upper_case_words'] / df['count_total_words'] * 100.
Resulting dataframe:
                      A                 split_words  count_upper_case_words  count_total_words  perc_uper_case
0  how are you DOING Or  [how, are, you, DOING, Or]                       1                  5            20.0
1   WE want NOW ansWers    [WE, want, NOW, ansWers]                       2                  4            50.0
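One edge case to keep in mind with this approach: a row containing an empty string splits into an empty list, so the percentage becomes 0 / 0, which pandas reports as NaN. A minimal sketch of one way to handle that, assuming you want such rows to show 0 percent (the extra empty row and the column name perc_upper_case are made up for illustration):

```python
import pandas as pd

# Same example data as above, plus a hypothetical empty row
df = pd.DataFrame({'A': ['how are you DOING Or', 'WE want NOW ansWers', '']})

words = df['A'].str.split()                                  # '' becomes []
n_upper = words.apply(lambda ws: sum(w.isupper() for w in ws))
n_total = words.str.len()                                    # 0 for the empty row

# 0 / 0 evaluates to NaN here; fillna(0.0) reports empty rows as 0 percent
# (an assumption: you may prefer to keep NaN to mark "no words at all")
df['perc_upper_case'] = (n_upper / n_total * 100).fillna(0.0)
print(df)
```

Whether 0 or NaN is the right value for an empty row depends on how the column is used downstream, so treat the fillna step as optional.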
You can use .str.count() to count the occurrences of uppercase words and of all words separately. From there you can use division to calculate the percentage of uppercase words:
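import pandas as pd
# example data, matching the rows in the output below
df = pd.DataFrame({"A": ["My name is JACOB", "Football and BASKETBALL and SOCCER",
                         "North Dakota", "South Dakota"]})
df["n_uppercase_words"] = df["A"].str.count(r"\b[A-Z]+\b")
df["n_words"] = df["A"].str.count(r"\b\w+\b")
df["percent_uppercase_words"] = df["n_uppercase_words"] / df["n_words"] * 100
print(df)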
                                    A  n_uppercase_words  n_words  percent_uppercase_words
0                    My name is JACOB                  1        4                     25.0
1  Football and BASKETBALL and SOCCER                  2        5                     40.0
2                        North Dakota                  0        2                      0.0
3                        South Dakota                  0        2                      0.0
Regular Expressions:
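\b[A-Z]+\b: matches a run of one or more consecutive uppercase letters that has a word boundary on either side.
\b[A-Za-z]+\b: same as above, but also includes lowercase letters.
With these two patterns, numbers and "words" with numbers in them (or any other character that is not a letter a-z) are ignored. Note that the code above counts total words with \b\w+\b instead; \w also matches digits and underscores, so a token such as "2" still counts toward n_words.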
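To see how the patterns differ in practice, here is a small sketch using Python's re module on a made-up sentence (the sample text is an assumption, not data from the question). It also runs str.isupper() on whitespace-split tokens for comparison, since that method treats a token like "5PM," as uppercase while the letter-only regex does not.

```python
import re

text = "WE need 2 ANSWERS by 5PM, ok"  # hypothetical sample sentence

# Only runs of uppercase letters bounded on both sides
upper_only = re.findall(r"\b[A-Z]+\b", text)
# Runs of letters of either case; tokens containing digits are skipped
letters_only = re.findall(r"\b[A-Za-z]+\b", text)
# \w also matches digits and underscores, so "2" and "5PM" are included
word_chars = re.findall(r"\b\w+\b", text)
# str.isupper() only requires that all cased characters are uppercase
isupper_tokens = [w for w in text.split() if w.isupper()]

print(upper_only)      # ['WE', 'ANSWERS']
print(letters_only)    # ['WE', 'need', 'ANSWERS', 'by', 'ok']
print(word_chars)      # ['WE', 'need', '2', 'ANSWERS', 'by', '5PM', 'ok']
print(isupper_tokens)  # ['WE', 'ANSWERS', '5PM,']
```

So the choice of pattern changes both the numerator and the denominator of the percentage; pick the pair that matches what you consider a "word".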
Here is a version that is very easy to understand:
import pandas as pd

data = {'A': ['ABC', 'abc BCD']}
df = pd.DataFrame(data)  # feed data to create DataFrame

def count(row):
    # split the given sentence and check whether each word is uppercase
    return sum(word.isupper() for word in row.split())

def percentage(row):
    # divide the uppercase-word count by the total number of words
    return (int(row['uppercase']) / len(row['A'].split())) * 100.

df['uppercase'] = df['A'].apply(lambda row: count(row))
df['percentage'] = df.apply(lambda row: percentage(row), axis=1)
df  # final data frame
OUTPUT:
         A  uppercase  percentage
0      ABC          1       100.0
1  abc BCD          1        50.0
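If you want the exact column names from the question, the same logic can be wrapped in a small helper. This is only a sketch: the function name add_uppercase_columns and its text_col parameter are made up for illustration, and it uses the split-and-isupper definition of an uppercase word.

```python
import pandas as pd

def add_uppercase_columns(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Return a copy of df with the two columns named in the question."""
    words = df[text_col].str.split()
    n_upper = words.apply(lambda ws: sum(w.isupper() for w in ws))
    n_total = words.str.len()
    # assign() returns a new DataFrame and leaves the input untouched;
    # the dict unpacking allows column names that are not valid identifiers
    return df.assign(**{
        "UpperCaseWords": n_upper,
        "%of Upper Case Words": n_upper / n_total * 100,
    })

df = pd.DataFrame({"A": ["ABC", "abc BCD"]})
result = add_uppercase_columns(df, "A")
print(result)
```

Returning a new frame instead of modifying df in place makes the helper easy to chain with other pandas calls.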