I'm doing a textual analysis and have collected my data into a CSV file with three columns. I want to combine all the text from the second column into a single string to perform some word analysis (word cloud, frequency, etc.). I've imported the CSV file using pandas; in the code below, `data` is a DataFrame object.

# Extract words from comment column in data
words = " "
for msg in data["comment"]:
    msg = str(msg).lower()
    words = words + msg + " "
print("Length of words is:", len(words))

The resulting string then gets fed to WordCloud:

wordcloud = WordCloud(width=3000, height=2000, random_state=1, collocations=False,
                      stopwords=stopwordsTerrier.union(stopwordsExtra)).generate(words)

CSV File

rating, comment, ID
5, It’s just soooo delicious but silly price and postage price, XXX1
5, Love this salad dressing... One my kids will estv😊, XXX2
...

The code works fine for smaller files (under 240 KB or so), but I'm now working with a 50 MB file (179,697 rows) and the script has slowed down so much I'm not sure it will even finish. I'm confident this loop is the bottleneck because I'm running the script in a Jupyter notebook and it's the only code in the cell I'm executing.

My question is: Is there a more efficient way of doing this?

asked by Letshin
  • What type of object is `data`? What type of object is `data['comment']`? Are you sure the *bottleneck* is the loop in your example? A 50 MB file is how many rows? – wwii Oct 16 '20 at 14:08
  • I'll edit the question to include that info – Letshin Oct 16 '20 at 14:16
  • Done. `data` is a pandas DataFrame, and `data.dtypes` reports `data['comment']` as an object. I think it is the bottleneck, as I am executing just those lines in a Jupyter cell, and 50 MB is around 179k rows. – Letshin Oct 16 '20 at 14:23
  • `words = " ".join(data["comment"].str.lower())`? But do you really want them in one giant string? It sounds more like you are looking to tokenize a corpus of documents. – Dan Oct 16 '20 at 14:31
  • [https://pandas.pydata.org/docs/user_guide/text.html#text-string-methods](https://pandas.pydata.org/docs/user_guide/text.html#text-string-methods) – wwii Oct 16 '20 at 14:55

3 Answers


Pandas solution (2.5 times faster than standard library)

A pandas Series can be concatenated into a single string with pandas.Series.str.cat:

data = pd.read_csv(file_path)
words = data["comment"].str.cat(sep=' ').lower()
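With the comments in one string, the result drops straight into the WordCloud call from the question. A minimal sketch, leaving out the custom stopword sets:

from wordcloud import WordCloud

# Same parameters as in the question, minus the stopwords argument
wordcloud = WordCloud(width=3000, height=2000, random_state=1,
                      collocations=False).generate(words)
wordcloud.to_file("wordcloud.png")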

Python standard library solution (slower)

import csv

comment_list = []
with open(file_path, newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        comment_list.append(row["comment"])
words = " ".join(comment_list).lower()

Performance testing

Read CSV using standard library vs. pandas.read_csv

Using pandas.read_csv() is at least 2.5 times faster than the Python standard library package csv.

Create a test CSV file: test_data.csv

import random

reviews = [
    "Love this salad dressing... One my kids will estv😊",
    "It’s just soooo delicious but silly price and postage price",
    "The sitcome was entertaining but still a waste of time",
    "If only I had ten stomaches to enjoy everything the buffet had to offer"
]

with open("test_data.csv", "w") as file:
    file.write("random_number,comment,index\n")
    for i in range(10000):
        file.write(f"{random.randint(0, 9)},{random.choice(reviews)},{i}\n")

Read CSV file 100 times

import csv
import pandas as pd
import timeit

def read_csv_stnd(file_path: str) -> str:
    comment_list = []
    with open(file_path, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            comment_list.append(row["comment"])
    return " ".join(comment_list).lower()

def read_csv_pandas(file_path: str) -> str:
    data = pd.read_csv(file_path)
    return data["comment"].str.cat(sep=' ').lower()

data_file = "test_data.csv"
print(f"Time to run read_csv_stnd 100 times: {timeit.timeit(lambda: read_csv_stnd(data_file), number=100)}")
print(f"Time to run read_csv_pandas 100 times: {timeit.timeit(lambda: read_csv_pandas(data_file), number=100)}")

Results of reading CSV file:

Time to run read_csv_stnd 100 times: 2.349453884999093
Time to run read_csv_pandas 100 times: 0.9676197949993366

Standard library lower() vs. pandas.Series.str.lower

Calling the built-in lower() once on the joined string is roughly 5 times faster than using pandas.Series.str.lower, which has to lower-case each element of the Series individually.

pandas.Series.str.lower

>>> import pandas as pd
>>> import timeit
>>> 
>>> s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
>>> timeit.timeit(lambda: s.str.lower().str.cat(sep=' '), number=10000)
1.9734079910012952

lower()

>>> timeit.timeit(lambda: s.str.cat(sep=' ').lower(), number=10000)
0.3571630870010267
– Christopher Peisert

Instead of creating a new string on every iteration, you can append each comment to a list and then join the list into a single string:

words = [str(msg).lower() for msg in data["comment"]]  # str() guards against non-string entries such as NaN
words = " ".join(words)

I've tested it with 100,000 entries and it seems to be roughly 15 times faster than the method you are currently using. Of course, you can add a space at the beginning of the string or make other modifications to match your exact requirements.
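A quick way to reproduce the comparison; the comments list below is a hypothetical stand-in for data["comment"]:

import timeit

comments = ["It's just soooo delicious but silly price"] * 100_000  # hypothetical sample data

def concat_in_loop():
    words = " "
    for msg in comments:
        words = words + str(msg).lower() + " "
    return words

def join_list():
    return " ".join([str(msg).lower() for msg in comments])

print("loop concatenation:", timeit.timeit(concat_in_loop, number=1))
print("list + join:", timeit.timeit(join_list, number=1))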

– S. Ferard

The most obvious improvement is to concatenate the strings with str.join, which is the Pythonic way:

words = " ".join((str(msg).lower() for msg in data["comment"]))

Your current approach creates a brand-new string on every concatenation, because strings are immutable in Python.

You can find more info here or here
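A tiny sketch of that immutability: each concatenation allocates a new string and leaves the old object untouched, which is what makes repeated concatenation in a loop so expensive.

s = "ab"
t = s       # t and s refer to the same string object
s += "c"    # allocates a new string; the old one is not modified
print(t)    # "ab"  -- the original object is unchanged
print(s)    # "abc"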

– rok