
I have an RSS feed I want to grab data from, manipulate, and then save to a CSV file. The RSS feed's refresh rate is a big window, 1 minute to several hours, and it only holds 100 items at a time. So to capture everything, I'm looking to have my script run every minute. The problem with this is that if the script runs before the feed updates, I will be grabbing past data, which leads to adding duplicate data to the CSV.

I tried looking at the examples mentioned here, but it kept erroring out.

Data Flow: RSS Feed --> Python Script --> CSV file

Sample data and code below:

Sample Data from CSV:

gandcrab,acad5fc7ebe8c6979d98cb8537e3a247,18bb2c3b82649314dfd45a379058869804954276,bf0ac94c6ae6f1ecfcccc049ae2373bfc659b2efb2e48e824e2e78fb43b6ebef,54,C

Sample Data from list:

zeus,186e84c5fd7da7331a62f1f13b1f4608,3c34aee767859fd75eb0c8c701716cbfd5655437,05c8e4f01ec8d4e6f4595db93bbcc0f85386c9f1b82b5833d983c9092640573a,49,C

Code for comparing:

import csv
from pathlib import Path

trends_f = Path('trendsv3.csv')  # path to the existing CSV

if trends_f.is_file():
    with open('trendsv3.csv', 'r+', newline='') as csv_file:
        h_reader = csv.reader(csv_file)
        next(h_reader)  # skip reading header of csv
        # should i load the csv into a list then compare it with diff() against the other list?
        # or is there an easier, faster, more efficient way?
Sudo Rm -F
  • Which among the 7 fields is used as a unique key? Is it the first field? Or is it all fields? – blhsing Feb 22 '19 at 18:35
  • Hi, do you mind reading [how to ask](/help/how-to-ask) and [mcve](/help/mcve)? Then please format your question. If you could use `pandas` it would be pretty easy. – rpanai Feb 22 '19 at 18:36
  • @blhsing I guess any one of the hash fields could be used. The problem is there are some cases where a hash may get reported as two different names, which I'm okay with having saved. I just don't want the same exact thing saved twice. – Sudo Rm -F Feb 22 '19 at 18:38
  • So you mean that a list needs to have all fields to be the same as an existing row in the CSV to be considered a duplicate? – blhsing Feb 22 '19 at 18:39
  • @blhsing Looking over the source where this data is coming from, it looks like there is an ID for each entry. I'm assuming I can run a compare on the ID field then? – Sudo Rm -F Feb 22 '19 at 18:49

2 Answers


I would recommend downloading everything into a CSV, and then deduplicating in batches (e.g. nightly) to generate a new "clean" CSV for whatever you're working on.

To dedup, load the data with the pandas library and then use the drop_duplicates function on the DataFrame.

http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
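A minimal sketch of that batch dedup, assuming the trendsv3.csv layout from the question (the output file name trendsv3_clean.csv is made up):

import pandas as pd

# Load the raw CSV accumulated by the minute-by-minute script.
df = pd.read_csv('trendsv3.csv')

# Drop rows that are exact duplicates across all columns,
# keeping the first occurrence of each.
clean = df.drop_duplicates()

# If the feed's ID (the last field) is the unique key, you could
# dedup on that column alone instead:
# clean = df.drop_duplicates(subset=df.columns[-1])

# Write the deduplicated data out to a new "clean" CSV.
clean.to_csv('trendsv3_clean.csv', index=False)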

ScottieB

Adding the ID from the feed seemed to make things the easiest to check against. Thanks to @blhsing for mentioning that. I ended up reading the IDs from the CSV into a list and checking the new data's IDs against that. There may be a faster, more efficient way, but this works for me.

Code to check csv before saving to it:

import csv
from pathlib import Path

trends_f = Path('trendsv3.csv')
csv_list = []  # IDs already present in the CSV

if trends_f.is_file():
    # Collect the IDs (7th field) already saved in the CSV.
    with open('trendsv3.csv', 'r', newline='') as csv_file:
        h_reader = csv.reader(csv_file, delimiter=',')
        next(h_reader, None)  # skip the header row
        for row in h_reader:
            csv_list.append(row[6])
    # Append only the new entries whose ID is not already present.
    # data_list holds the new entries parsed from the feed earlier in the script.
    with open('trendsv3.csv', 'a', newline='') as csv_file:
        h_writer = csv.writer(csv_file)
        for entry in data_list:
            if entry[6].strip() not in csv_list:
                print(entry[6], ' is not in the list, saving ', entry[6], ' to the list')
                h_writer.writerow(entry)
            else:
                print(entry[6], ' is in the list')
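On the "faster, more efficient way": membership tests against a list are O(n) per lookup, so for larger files a set is the usual drop-in replacement. A sketch of the same approach with only the container swapped (the seen_ids name is made up; data_list is assumed to hold the new feed entries as above):

# Collect seen IDs into a set for O(1) membership tests.
seen_ids = set()
with open('trendsv3.csv', 'r', newline='') as csv_file:
    h_reader = csv.reader(csv_file, delimiter=',')
    next(h_reader, None)  # skip the header row
    for row in h_reader:
        seen_ids.add(row[6])

with open('trendsv3.csv', 'a', newline='') as csv_file:
    h_writer = csv.writer(csv_file)
    for entry in data_list:
        if entry[6].strip() not in seen_ids:
            h_writer.writerow(entry)
            # Track the new ID so duplicates within data_list are also skipped.
            seen_ids.add(entry[6].strip())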
Sudo Rm -F