I have a path in Databricks, '/dbfs/mnt/sgi/report'. It contains parquet files and folders, and every folder contains parquet files as well. I need to create an Excel sheet where the first column is the path of each parquet file and the second column is the number of rows in that file. To get there, I tried to collect the paths of all the parquet files with this script:

import os
import pandas as pd

l=[]
m=[]
l_parquet=[]
m_parquet=[]
for dirpath, dirnames, filenames in os.walk("/dbfs/mnt/sgi/report/"):
    for filename in [f for f in filenames if f.endswith(".parquet")]:
        path = os.path.join(dirpath, filename)
        df_parquet=pd.read_parquet(path)
        l_parquet.append(df_parquet.shape[0])
        m_parquet.append(df_parquet.shape[1])
weatther={'path':path, 'count_of_rows':l_parquet, 'count_of_col':m_parquet}
df=pd.DataFrame(weatther)
df['path'].unique()

But the problem is that the resulting DataFrame only stores the path of the last file in the last folder, as you can see:

array(['/dbfs/mnt/sgi/report/trigger_demo/SAMPLE_DATA_2.parquet'],
      dtype=object)

Can you please tell me how I can get the counts for all paths in one DataFrame?

  • Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Jun 14 '23 at 14:02

1 Answer

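I found the answer. In my original script `path` was a single string that was overwritten on every loop iteration, so the DataFrame repeated the last file's path on every row. Appending each path to a list fixes it: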

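import os
import pandas as pd

l=[]          # path of every parquet file found
l_parquet=[]  # number of rows per file
m_parquet=[]  # number of columns per file
# os.walk visits the root folder and every sub-folder below it
for dirpath, dirnames, filenames in os.walk("/dbfs/mnt/sgi/report/"):
    for filename in [f for f in filenames if f.endswith(".parquet")]:
        path = os.path.join(dirpath, filename)
        l.append(path)  # keep every path, not just the last one
        df_parquet=pd.read_parquet(path)
        l_parquet.append(df_parquet.shape[0])
        m_parquet.append(df_parquet.shape[1])
# all three lists have the same length, so each file gets one row
weatther={'path':l, 'count_of_rows':l_parquet, 'count_of_col':m_parquet}
df=pd.DataFrame(weatther)
df.to_csv('/dbfs/FileStore/QA_count_2.csv')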
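
For anyone who wants to check the path-collection step without Databricks or real parquet files, here is a minimal sketch of the same idea split into two small functions. The names `find_parquet_files`, `build_report` and `shape_of` are made up for this illustration and are not part of pandas or Databricks. `shape_of` is assumed to be any callable that returns `(rows, columns)` for a path.

```python
import os


def find_parquet_files(root):
    """Return the path of every .parquet file under root, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".parquet"):
                # Append inside the loop so no path is lost.
                paths.append(os.path.join(dirpath, filename))
    return sorted(paths)


def build_report(paths, shape_of):
    """Build one dict per file; shape_of(path) must return (n_rows, n_cols)."""
    rows = []
    for path in paths:
        n_rows, n_cols = shape_of(path)
        rows.append(
            {"path": path, "count_of_rows": n_rows, "count_of_col": n_cols}
        )
    return rows
```

With pandas, `shape_of` could be `lambda p: pd.read_parquet(p).shape`, and `pd.DataFrame(rows)` then gives the same three columns as the script above. Since the question asked for an Excel sheet, `df.to_excel(...)` is an option instead of `df.to_csv(...)`, assuming an Excel writer engine such as openpyxl is installed on the cluster.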