
I have a large data source that I am trying to enrich by creating some calculated columns.

The data source is close to 4 million rows and I am pulling the data in chunks of 100,000.

import time

for y, field in enumerate(fields):
    operation_start = time.time()
    print(f"Operation {y + 1}")
    chunk[field] = chunk.apply(operations[y], axis=1)
    print(f"Operation completed in {round(time.time() - operation_start, 2)} seconds")

The for loop runs over 9 functions I have defined, each one returning a value for a new column.

Is there a more efficient way to capture all the new fields at once, applying everything in one pass instead of one by one? One important remark: a couple of the functions need values created by the other functions.

I have tried pandarallel but it makes the process even slower.

EDIT:

I managed to make a single function and use pandarallel following the info here: Apply multiple functions to multiple groupby columns. However, it does not append the new columns.

def Operations(row):
    # imported inside the function so it is available in parallel workers
    import pandas as pd

    d = {}
    d["Operation A"] = Operation_A(row)
    d["Operation B"] = Operation_B(row, d["Operation A"])
    d["Operation C"] = Operation_C(row)
    d["Operation D"] = Operation_D(row)
    d["Operation E"] = Operation_E(row)
    d["Operation F"] = Operation_F(row)
    d["Operation G"] = Operation_G(row, d["Operation F"], d["Operation D"], d["Operation B"])
    d["Operation H"] = Operation_H(row)
    d["Operation I"] = Operation_I(row, d["Operation H"])

    return pd.Series(d)

chunk.parallel_apply(Operations, axis=1)

All the operations do string comparisons and return strings. I cannot provide an example because the functions behind all these operations add up to more than 400 lines of code :S
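
For reference, parallel_apply returns a new DataFrame built from the returned Series; it does not modify chunk in place, so the result has to be attached explicitly. A minimal sketch of what that could look like, assuming the row-wise call itself works:

from pandarallel import pandarallel

pandarallel.initialize()

# parallel_apply with axis=1 calls Operations once per row and collects
# the returned Series into a new DataFrame; join attaches its columns
chunk = chunk.join(chunk.parallel_apply(Operations, axis=1))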

Nwish1986
  • you need to provide a fully reproducible example (sample data is missing) – mozway Aug 20 '22 at 17:12
  • Generally, try to stay away from breaking your dataframe into rows. Most of the time it's possible to convert functions into more pandas-style code (which generally performs functions against a whole column rather than row by row). – rayad Aug 21 '22 at 12:37
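
To illustrate the comment above: if, hypothetically, one operation compared a column against a literal and another combined it with a set-membership test (the real functions are not shown in the question), a column-wise rewrite could look like this; the column names "a", "b" and "status" are made up:

import numpy as np

# whole-column comparisons instead of chunk.apply(..., axis=1)
chunk["Operation A"] = chunk["b"].eq("abc")
chunk["Operation B"] = chunk["a"].isin({"1", "2"}) & chunk["Operation A"]

# string-returning logic can often be expressed with np.select:
# take the value for the first matching condition, else the default
chunk["Operation C"] = np.select(
    [chunk["status"].eq("new"), chunk["status"].eq("done")],
    ["fresh", "closed"],
    default="unknown",
)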

1 Answer


You can process the file as a stream without building a DataFrame. There's a Table helper in the convtools library (table docs | github). Keep in mind that it reads the file as-is and doesn't infer types; it works the same as if you read the file with csv.reader yourself.

from convtools.contrib.tables import Table
from convtools import conversion as c

def Operation_A(x):
    # simplified stand-in for the real string-comparison logic
    return x == "abc"

a_values = {"1", "2"}

def Operation_B(x, y):
    return x in a_values and y

table = Table.from_csv("tmp/in.csv", header=True)

# or, if starting from a pd.DataFrame -- unnecessary RAM usage, because
# Table works with a stream of data:
# import pandas as pd
# df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
# table = Table.from_rows(df.itertuples(index=False), header=list(df.columns))


table.update(
    **{
        "Operation A": c.call_func(Operation_A, c.col("b")),
        "Operation B": c.call_func(
            Operation_B, c.col("a"), c.col("Operation A")
        ),
    }
).into_csv("tmp/out.csv")
# .into_iter_rows(dict)
# .into_iter_rows(tuple, include_header=True)
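
If the result should be consumed in Python instead of being written straight to a file, the commented alternatives above return lazy iterators. A small sketch, reusing the hypothetical file and columns from above:

rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .update(**{"Operation A": c.call_func(Operation_A, c.col("b"))})
    .into_iter_rows(dict)
)

# nothing is materialized up front; rows stream through one by one
for row in rows:
    ...  # process each enriched row dict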

I'd suggest using convtools conversions though:

(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        **{
            "Operation A": c.col("b") == "abc",
            "Operation B": c.and_(
                c.col("a").in_(c.naive(a_values)), c.col("Operation A")
            ),
        }
    )
    .into_csv("tmp/out.csv")
)
westandskif