
I have a Ruby script that scrapes the source of a webpage and writes it to a file. I then have to read that file and pull out only specific data from the junk text. I can do this in Python and get the array I need. I would like my Ruby code to sleep for a minute and then do the same thing itself, so I don't have to call the Python script and read its results back into the Ruby script to continue. I'm sure there is a way to do this in Ruby; I'm just starting out with Ruby and not very savvy yet.

#!/usr/bin/env python

import re

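# "with" closes the file even if reading fails
with open("C:/Users/Steve/Desktop/scripts/testout.txt", encoding="utf8") as f:
    da = f.read()
# Raw string so \d reaches the regex engine unchanged; each unescaped "." matches any character
matches = re.findall(r'href="/normal/PropertyDetails.rb.PID=(\d+)', da)
# Keep only the first occurrence of each PID, preserving order
res = []
for i in matches:
    if i not in res:
        res.append(i)
print(res)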

If I have to stick with Python, I will have to sleep, call the Python script, have it write to another text file, and then have Ruby open that file, read it, and continue. I'm just trying to make this as efficient as I can.

After working with the suggestion below, I was able to locate the right resources to guide me further. Here is the simplified Ruby code:

data = driver.page_source
res = data.scan(/.PID=(\d+)/) 
print(res.uniq)

Now that I have that, I can continue with the Ruby code and keep learning along the way. I think I have the other areas covered. Thanks for the assist.
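One detail of the snippet above is worth spelling out: when the pattern contains a capture group, `String#scan` returns one array per match, so `res.uniq` prints a nested array. Below is a minimal sketch of getting a flat list of unique PIDs. `SAMPLE_SOURCE` and `extract_pids` are hypothetical names, and the sample markup is an assumption standing in for `driver.page_source`; the real page may look different.

```ruby
# Hypothetical stand-in for driver.page_source; the real markup may differ.
SAMPLE_SOURCE = <<~HTML
  <a href="/normal/PropertyDetails.rb?PID=1234567&amp;x=1">First</a>
  <a href="/normal/PropertyDetails.rb?PID=7654321&amp;x=1">Second</a>
  <a href="/normal/PropertyDetails.rb?PID=1234567&amp;x=2">First again</a>
HTML

# Returns the unique PID strings in order of first appearance.
def extract_pids(source)
  # With a capture group, scan yields one array per match: [["1234567"], ...]
  nested = source.scan(/.PID=(\d+)/)
  # flatten turns the one-element arrays into plain strings; uniq drops repeats
  nested.flatten.uniq
end

p SAMPLE_SOURCE.scan(/.PID=(\d+)/) # nested: one single-element array per match
p extract_pids(SAMPLE_SOURCE)      # flat array of unique PID strings
```

`uniq` keeps the first occurrence of each value, which mirrors the `if i not in res` loop in the Python script.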

  • Note that [you can't parse \[X\]HTML with regex](https://stackoverflow.com/a/1732454/477037) (or at least not reliably). Take a look at [Nokogiri](https://nokogiri.org/), a Ruby library for parsing XML and HTML. – Stefan Sep 01 '21 at 13:30

1 Answer

Something like this?

data = File.read("C:/Users/Steve/Desktop/scripts/testout.txt", encoding: "utf-8")
data.scan(/href="\/normal\/PropertyDetails\.rb\.PID=\d+/)
# returns an array of the matched strings

Please look at `String#scan` and `IO.read` (`File` is a subclass of `IO`).
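Two differences from the Python script are worth noting. Without a capture group, `scan` returns the whole matched text (`href="...PID=123`) rather than just the digits. Also, the pattern above escapes the dots (`\.`), so they match only a literal `.`, whereas the unescaped `.` in the Python pattern matches any character. Below is a rough sketch of a closer port. `PID_PATTERN` and `pids_from_file` are hypothetical names, and the sample lines written to the temporary file are an assumption about what the saved page contains.

```ruby
require "tempfile"

# Mirrors the Python pattern: the "." before PID is left unescaped so it
# matches any single character, and the parentheses capture only the digits.
PID_PATTERN = /href="\/normal\/PropertyDetails\.rb.PID=(\d+)/

# Reads a saved page and returns the captured PID strings, duplicates included.
def pids_from_file(path)
  File.read(path, encoding: "utf-8").scan(PID_PATTERN).flatten
end

# Hypothetical sample file so the sketch runs without the real testout.txt.
Tempfile.create("testout") do |file|
  file.write(%(<a href="/normal/PropertyDetails.rb?PID=42&amp;a=1">x</a>\n))
  file.write(%(<a href="/normal/PropertyDetails.rb?PID=99&amp;a=1">y</a>\n))
  file.flush
  p pids_from_file(file.path)
end
```

To use it on the real file, pass the path from the question to `pids_from_file` and call `.uniq` on the result if duplicates should be dropped.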

mechnicov
  • This is helpful. I'm looking at how to apply the regex to the raw page source before it is written to a file, which saves a step: just scan the string and return an array of the numbers after `PID=`, as in `PID=123456567&amp`. The number has an unknown length, but `&amp` follows it. I'm playing with the direction you pointed me in. – Steven Greer Sep 01 '21 at 13:33
  • OK, so I went with `data.scan(/.PID=(\d+)/) {|d| print d }`. There are duplicates in the junk file, so now I just need to remove the duplicates from the array. – Steven Greer Sep 01 '21 at 13:38
  • You rock! The string scan gave me even better ideas: I don't have to write to and read from a text file. I can take the source code, scan it then and there, and get the data into the array. Now I can open each PID link, extract the source from those pages, and string-scan to match the data I want to extract. – Steven Greer Sep 01 '21 at 14:15
  • @StevenGreer to remove duplicates from an array, just use the `uniq` method: `array.uniq` – mechnicov Sep 01 '21 at 14:55