
I have two files:

f1.csv - it contains 800 rows with unique id values

id
1
2
3
4
5

f2.xlsx (Sheet1) - it contains 20 columns and many rows (about 200 MB).

typeID  col2   col3 ...
1
1
1
2
2
2
2
2
3
4
10
10
...

I want to reduce the volume of f2.xlsx so that I can open the data file in a Jupyter Notebook (Python) and analyze it with pandas. In particular, I want to keep only those rows whose typeID value matches an id in f1.csv. Is there any way to use terminal commands to do this filtering and then save the filtered file in CSV format?
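One way to sketch the filtering step, assuming f2.xlsx has first been exported to a plain CSV (say f2.csv; the linked answer in the comments covers xlsx-to-CSV conversion), is to stream the big file row by row with Python's stdlib csv module, so memory use stays bounded regardless of file size. File names and column names here follow the question; adjust as needed:

```python
import csv

def filter_by_ids(ids_path, data_path, out_path, id_col="id", key_col="typeID"):
    """Keep only the rows of data_path whose key_col value appears
    in the id_col column of ids_path."""
    # The id file is small (800 rows), so load its ids into a set.
    with open(ids_path, newline="") as f:
        wanted = {row[id_col] for row in csv.DictReader(f)}

    # Stream the big file one row at a time; only matching rows are written.
    with open(data_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[key_col] in wanted:
                writer.writerow(row)

# Example usage (paths assumed):
# filter_by_ids("f1.csv", "f2.csv", "f2_filtered.csv")
```

Note that the comparison is done on the raw string values, so it works for numeric and non-numeric ids alike, as long as both files format them the same way (e.g. "1" vs "1.0" would not match).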

Tatik
  • https://stackoverflow.com/a/38805230/1745001 is the starting point as it will show you how to generate a CSV from f2.xlsx. If you still need help after that then ask a new question using the 2 CSVs as input rather than 1 CSV and 1 XLSX. – Ed Morton Jun 12 '19 at 20:25
  • Also, take a look at https://pythonspot.com/read-excel-with-pandas/ – Diego Torres Milano Jun 12 '19 at 20:25
  • @DiegoTorresMilano: Thanks for your link. I know how to read data into a pandas DataFrame, but the data volume does not fit into the memory of my machine, so pd.read_csv takes too long. My idea was to filter only those rows that I need using a bash script, and then proceed with pandas. – Tatik Jun 12 '19 at 21:46
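Regarding the memory concern raised in the last comment: once f2 is in CSV form, pandas itself can do the filtering without loading the whole file, by reading it in chunks and keeping only matching rows. A sketch, with file and column names assumed from the question:

```python
import pandas as pd

def filter_csv_in_chunks(ids_path, data_path, out_path, chunksize=100_000):
    """Stream data_path through pandas in fixed-size chunks, keeping
    rows whose typeID appears in the id column of ids_path."""
    ids = set(pd.read_csv(ids_path)["id"])

    # Each chunk is a small DataFrame; only its matching rows survive.
    matches = [
        chunk[chunk["typeID"].isin(ids)]
        for chunk in pd.read_csv(data_path, chunksize=chunksize)
    ]
    filtered = pd.concat(matches, ignore_index=True)
    filtered.to_csv(out_path, index=False)
    return filtered

# Example usage (paths assumed):
# filter_csv_in_chunks("f1.csv", "f2.csv", "f2_filtered.csv")
```

Peak memory is then bounded by one chunk plus the accumulated matches (at most 800 distinct typeID values here), rather than by the full 200 MB file.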

0 Answers