
I need to get the cases published today and the day before. Also, when exporting to a CSV file, it only writes the first column, not the remaining ones.

The URL: https://e-mehkeme.gov.az/Public/Cases. The dates are stored in the HTML as `<td style="width:95px;text-align:center">28.10.2019</td>`.

import requests, re
from bs4 import BeautifulSoup as bs
import csv

request_headers = {
    'authority': 'e-mehkeme.gov.az',
    'method': 'POST',
    'path': '/Public/Cases',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en,en-GB;q=0.9',
    'cache-control': 'max-age=0',
    'content-length': '66',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'https://e-mehkeme.gov.az',
    'referer': 'https://e-mehkeme.gov.az/Public/Cases',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.142 Safari/537.36',
    }

voens = {'3100608381',
         }

form_data = {
    'CourtId': '',
    'CaseNo': '',
    'DocFin': '',
    'DocSeries': '',
    'DocNumber': '',
    'VOEN': voens,
    'button': 'Search',
}

url = 'https://e-mehkeme.gov.az/Public/Cases?courtid='

response = requests.post(url, data=form_data, headers=request_headers)
s = bs(response.content, 'lxml')

# PRINT THE CONTENTS OF EACH SEARCH!
for voen in voens:
    form_data['VOEN'] = voen
    r = requests.post('https://e-mehkeme.gov.az/Public/Cases', data=form_data)
    soup = bs(r.text, 'lxml')
    ids = [i['value'] for i in soup.select('.casedetail')]
    for i in ids:
        r = requests.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={i}')
        soup = bs(r.content, 'lxml')
        output = [re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]')]
        print(output)
    with open('courtSearch.csv', 'w', newline='', encoding='utf-8') as myfile:
        writer = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        writer.writerow(output)

DESIRED OUTPUT:

[screenshot of the desired CSV output]

1 Answer


The following uses a slightly different URL construct so you can use a GET request and easily gather all pages of results per voen. During each request I gather the string dates and caseIds (the latter are required for the follow-up requests). I then use a mask (the days of interest, e.g. today and yesterday, converted to strings in the same format as on the website) to filter for only the ids within the desired date range. Finally, I loop over that filtered list and issue requests for the pop-up window info.
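For instance, with `number_of_past_days_plus_today = 2`, the mask is just the string dates for today and yesterday in the site's dd.mm.yyyy format. A minimal illustration of that step in isolation:

from datetime import datetime, timedelta

# Build 'dd.mm.yyyy' strings for today and yesterday, matching the
# format of the dates shown in the results table:
mask = [datetime.strftime(datetime.now() - timedelta(day_no), '%d.%m.%Y') for day_no in range(2)]
print(mask)  # e.g. ['29.10.2019', '28.10.2019'] when run on 29 Oct 2019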

Within the code you can also see commented-out sections. One of them shows you the results retrieved from each page:

#print(pd.read_html(str(soup.select_one('#Cases')))[0]) ##view table

I am splitting on the header phrases (assuming these are consistent across cases) so that each row's concatenated string can be separated into the appropriate output columns.
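For example, given a hypothetical concatenated detail string (the case number and party names below are made up), the split produces one item per column, plus a leading empty string that `row[1:]` drops:

import re

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
line = 'Ətraflı məlumat: 2(001)-123/2019 Cavabdeh: NÜMUNƏ MMC İddiaçı: BAŞQA MMC İşin mahiyyəti Ödənilməsi tələbi'

# Split on any of the header phrases:
row = re.split('|'.join(headers), line)
print(row[1:])  # ['2(001)-123/2019 ', 'NÜMUNƏ MMC ', 'BAŞQA MMC ', 'Ödənilməsi tələbi']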

This possibly requires bs4 4.7.1+ (for the CSS selectors used below).
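If you are unsure which version you have installed, a quick check:

import bs4
print(bs4.__version__)  # should be 4.7.1 or newer per the note above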

import requests,re, csv
from bs4 import BeautifulSoup as bs
from datetime import datetime, timedelta
import pandas as pd

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
voens = ['2002283071','1303450301', '1700393071']
number_of_past_days_plus_today = 2
mask = [datetime.strftime(datetime.now() - timedelta(day_no), '%d.%m.%Y') for day_no in range(0, number_of_past_days_plus_today)]
ids = []
table_dates = []

with requests.Session() as s:
    for voen in voens:
        #print(voen)  ##view voen
        page = 1
        while True:
            r = s.get(f'https://e-mehkeme.gov.az/Public/Cases?page={page}&voen={voen}') #to get all pages of results
            soup = bs(r.text, 'lxml')
            ids.extend([i['value'] for i in soup.select('.casedetail')])
            #print(pd.read_html(str(soup.select_one('#Cases')))[0]) ##view table
            table_dates.extend([i.text.strip() for i in soup.select('#Cases  td:nth-child(2):not([colspan])')])

            if soup.select_one('[rel=next]') is None:
                break
            page+=1

    pairs = list(zip(table_dates,ids))
    filtered = [i for i in pairs if i[0] in mask]
    #print(100*'-') ##spacing
    #print(filtered)  ##view final filtered list of ids
    results = []
    for j in filtered:
        r = s.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={j[1]}')
        soup = bs(r.content, 'lxml')     
        line = ' '.join([re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]')])
        row = re.split('|'.join(headers),line)
        results.append(row[1:])

with open("results.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(headers)
    for row in results:
        w.writerow(row)

I searched for splitting on multiple delimiters and used the idea given by @Jonathan here, so I upvoted that answer to credit the user.

  • thanks for the tremendous work, but the csv file contains only one result while the earlier version returned four. Why doesn't the csv include all of the results? – Vali Valizada Oct 29 '19 at 10:11
  • I think there is only 1 row which satisfies the condition. If you uncomment `#print(pd.read_html(str(soup.select_one('#Cases')))[0])` you can check the dates in the columns, and only 1 row comes back (29.10.2019) that is in your range of today-yesterday. – QHarr Oct 29 '19 at 11:29
  • `Columns: [İşin nömrəsi, Daxil olma tarixi, İşin növü, Məhkəmənin adı, Hakim, İşin baxılma vəziyyəti] Index: []` – when I uncomment the line, it returns this over and over. I will test it manually to see if it is working properly. Nevertheless, thanks a lot. – Vali Valizada Oct 29 '19 at 11:43
  • can I adjust the days by changing the value of the variable `number_of_past_days_plus_today = 2`? – Vali Valizada Oct 29 '19 at 11:50
  • the code is running too slow. Is there any way to speed it up? – Vali Valizada Oct 29 '19 at 18:32
  • yes, you can adjust the variable. As for speeding up - you could look at threads/async. – QHarr Oct 29 '19 at 19:02
  • can we add the date of each line to the csv file? – Vali Valizada May 21 '20 at 04:46
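The last two comments suggest two small extensions. Below is a minimal sketch of my own (not from the accepted answer) that issues the detail requests concurrently via a thread pool and prepends each row's table date to the CSV; it assumes the `filtered` list of (date, caseId) pairs built by the answer's code above is in scope:

import requests, re, csv
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup as bs

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']

def fetch_detail(pair):
    # Fetch one case detail pop-up and return [date, *columns].
    date, case_id = pair
    r = requests.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={case_id}')
    soup = bs(r.content, 'lxml')
    line = ' '.join(re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]'))
    return [date] + re.split('|'.join(headers), line)[1:]

# `filtered` comes from the answer above; the pool runs the detail
# requests in parallel instead of one at a time:
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fetch_detail, filtered))

with open('results.csv', 'w', encoding='utf-8-sig', newline='') as csv_file:
    w = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Tarix'] + headers)  # extra date column first
    w.writerows(results)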