Iterate over a big Pandas Dataframe

Question

I have been trying to get the rows of a Dataframe that fullfill certain condition but it takes a really long time to complete. What I have is

latitud codigoSerial    nombre                        longitud                 variableConsulta fecha

6.379038    3   Girardota - S.O.S Aburrá Norte      -75.450913  pm25    2018-08-28

And it goes like that but with 183981 rows and I have to get all the ones that have the word "Medellin" in nombre

What I did was to use to check which codigoSerial were used by Medellin and put all of them on a list, then I started to iterate over the dataframe checking if its codigoSerial were on the list and if it were the row was append on a different dataframe, is there any other more efficient option?

cod = [25,44,79,80,83,84,85,86]

final2 = pd.DataFrame(columns = ["latitud", "codigoSerial", "nombre", "nombreCorto", "longitud", "variableConsulta", "fecha", "calidad", "valor"])
indices = []

for i in range(len(antioquia["codigoSerial"])):
  if antioquia["codigoSerial"].iloc[i] in cod:
    indices.append(i)

for i in indices:
  final2 = final2.append(antioquia.iloc[i])

Have you tried `df.loc[df['nombre'].str.contains("Medellin", case=False)]`? — Dan, Dec 16 '22 at 01:24
yes, it seems that it might work but it only saves the ones that have the last codigoSerial and if i try to iterate over each one to later join the dataframes the time it takes remains the same — Tomas Rodriguez, Dec 16 '22 at 01:33
Both iterating over a dataframe and appending to a dataframe are inherently slow. I've linked some questions that cover more efficient options, although I'm not a Pandas expert, so LMK if anything's wrong or unclear or whatever. BTW, welcome to Stack Overflow! Check out the [tour], and [ask] if you want tips. — wjandrea, Dec 16 '22 at 01:35
@Tomas *"it only saves the ones that have the last codigoSerial"* -- Is it possible that the other ones have an accent over the i (Medellín) and the `codigoSerial` just happens to be correlated? — wjandrea, Dec 16 '22 at 01:38
Thank you very much, yes it is Medellín but it is written like that all over the Dataframe, the codigoSerial is different for every place even for the ones that have the word Medellín. There are 21 different codigoSerial but each one is repeated on a lot of rows because each one has different information — Tomas Rodriguez, Dec 16 '22 at 02:17
I don't think I understand the question. You said you "have to get all the ones that have the word 'Medellin' in nombre." My above one-liner does that. As to your list, I do not understand what you are trying to do. See https://xyproblem.info . Tell us (1) the actual problem you are trying to solve, (2) what you tried to solve it, and then (3) where you are stuck. You have told us #2 and #3, but not #1. For example, if you only want a list of the `codigoSerial` values where the `nombre` contains "Medellin," `df.loc[df['nombre'].str.contains("Medellin", case=False)]["codigoSerial"].unique()`. — Dan, Dec 16 '22 at 21:54
And then to filter those to only those in a list: `new_df = df.loc[df['nombre'].str.contains("Medellin", case=False)]` Then `new_df[(new_df["codigoSerial"] in cod)]` where `cod` is your list of serial numbers. This can all be done in a oneliner even with `df.loc[(df['nombre'].str.contains("Medellin", case=False)) & (df["codigoSerial"] in cod)]` No iteration and gets what you need I think. But again, not sure I understand your problem. — Dan, Dec 17 '22 at 00:47

Iterate over a big Pandas Dataframe

0 Answers0