
I need to get the cases published today and the day before. Also, when exporting to a CSV file, it only writes the first column, not the remaining ones.

The URL: https://e-mehkeme.gov.az/Public/Cases. The dates are stored in the HTML as `<td style="width:95px;text-align:center">28.10.2019</td>`.

import requests, re
from bs4 import BeautifulSoup as bs
import csv

request_headers = {
    'authority': 'e-mehkeme.gov.az',
    'method': 'POST',
    'path': '/Public/Cases',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en,en-GB;q=0.9',
    'cache-control': 'max-age=0',
    'content-length': '66',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'https://e-mehkeme.gov.az',
    'referer': 'https://e-mehkeme.gov.az/Public/Cases',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.142 Safari/537.36',
    }

voens = {'3100608381',
         }

form_data = {
    'CourtId': '',
    'CaseNo': '',
    'DocFin': '',
    'DocSeries': '',
    'DocNumber': '',
    'VOEN': voens,
    'button': 'Search',
}

url = 'https://e-mehkeme.gov.az/Public/Cases?courtid='

response = requests.post(url, data=form_data, headers=request_headers)
s = bs(response.content, 'lxml')

# PRINT THE CONTENTS OF EACH SEARCH!
for voen in voens:
    form_data['VOEN'] = voen
    r = requests.post('https://e-mehkeme.gov.az/Public/Cases', data=form_data)
    soup = bs(r.text, 'lxml')
    ids = [i['value'] for i in soup.select('.casedetail')]
    for i in ids:
        r = requests.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={i}')
        soup = bs(r.content, 'lxml')
        output = [re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]')]
        print(output)
    with open('courtSearch.csv', 'w', newline='', encoding='utf-8') as myfile:
        writer = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        writer.writerow(output)

DESIRED OUTPUT:

[screenshot of the desired CSV output]

1 Answer


The following uses a slightly different URL construct so you can use a GET request and easily gather all pages of results per voen. During each request I gather the string dates and caseIds (the latter are required for the follow-up requests). I then use a mask (the days of interest, e.g. today and yesterday, converted to strings in the same format as on the website) to filter for only the ids within the desired date range. Finally, I loop over that filtered list and issue requests for the pop-up window info.
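For instance, with `number_of_past_days_plus_today = 2`, the mask is just the string dates for today and yesterday in the site's dd.mm.yyyy format. A minimal illustration of that step in isolation:

from datetime import datetime, timedelta

# Build 'dd.mm.yyyy' strings for today and yesterday, matching the
# format of the dates shown in the results table:
mask = [datetime.strftime(datetime.now() - timedelta(day_no), '%d.%m.%Y') for day_no in range(2)]
print(mask)  # e.g. ['29.10.2019', '28.10.2019'] when run on 29 Oct 2019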

Within the code you can also see commented-out sections. One of them shows you the results retrieved from each page:

#print(pd.read_html(str(soup.select_one('#Cases')))[0]) ##view table

I am splitting on the header phrases (assuming these are consistent across cases) so that each row's concatenated string can be separated into the appropriate output columns.
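For example, given a hypothetical concatenated detail string (the case number and party names below are made up), the split produces one item per column, plus a leading empty string that `row[1:]` drops:

import re

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
line = 'Ətraflı məlumat: 2(001)-123/2019 Cavabdeh: NÜMUNƏ MMC İddiaçı: BAŞQA MMC İşin mahiyyəti Ödənilməsi tələbi'

# Split on any of the header phrases:
row = re.split('|'.join(headers), line)
print(row[1:])  # ['2(001)-123/2019 ', 'NÜMUNƏ MMC ', 'BAŞQA MMC ', 'Ödənilməsi tələbi']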

This possibly requires bs4 4.7.1+ (for the CSS selectors used below).
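If you are unsure which version you have installed, a quick check:

import bs4
print(bs4.__version__)  # should be 4.7.1 or newer per the note above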

import requests,re, csv
from bs4 import BeautifulSoup as bs
from datetime import datetime, timedelta
import pandas as pd

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']
voens = ['2002283071','1303450301', '1700393071']
number_of_past_days_plus_today = 2
mask = [datetime.strftime(datetime.now() - timedelta(day_no), '%d.%m.%Y') for day_no in range(0, number_of_past_days_plus_today)]
ids = []
table_dates = []

with requests.Session() as s:
    for voen in voens:
        #print(voen)  ##view voen
        page = 1
        while True:
            r = s.get(f'https://e-mehkeme.gov.az/Public/Cases?page={page}&voen={voen}') #to get all pages of results
            soup = bs(r.text, 'lxml')
            ids.extend([i['value'] for i in soup.select('.casedetail')])
            #print(pd.read_html(str(soup.select_one('#Cases')))[0]) ##view table
            table_dates.extend([i.text.strip() for i in soup.select('#Cases  td:nth-child(2):not([colspan])')])

            if soup.select_one('[rel=next]') is None:
                break
            page+=1

    pairs = list(zip(table_dates,ids))
    filtered = [i for i in pairs if i[0] in mask]
    #print(100*'-') ##spacing
    #print(filtered)  ##view final filtered list of ids
    results = []
    for j in filtered:
        r = s.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={j[1]}')
        soup = bs(r.content, 'lxml')     
        line = ' '.join([re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]')])
        row = re.split('|'.join(headers),line)
        results.append(row[1:])

with open("results.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(headers)
    for row in results:
        w.writerow(row)

I searched for splitting on multiple delimiters and used the idea given by @Jonathan here, so I upvoted that answer to credit the user.

  • thanks for the tremendous work, but the csv file contains only one result while the earlier version returned four. Why doesn't the csv include all of the results? – Vali Valizada Oct 29 '19 at 10:11
  • I think there is only 1 row which satisfies the condition. If you uncomment `#print(pd.read_html(str(soup.select_one('#Cases')))[0])` you can check the dates in the columns, and only 1 row comes back (29.10.2019) that is in your range of today-yesterday. – QHarr Oct 29 '19 at 11:29
  • `Columns: [İşin nömrəsi, Daxil olma tarixi, İşin növü, Məhkəmənin adı, Hakim, İşin baxılma vəziyyəti] Index: []` – when I uncomment the line, it returns this over and over. I will test it manually to see if it is working properly. Nevertheless, thanks a lot. – Vali Valizada Oct 29 '19 at 11:43
  • can I adjust the days by changing the value of the variable `number_of_past_days_plus_today = 2`? – Vali Valizada Oct 29 '19 at 11:50
  • the code is running too slow. Is there any way to speed it up? – Vali Valizada Oct 29 '19 at 18:32
  • yes, you can adjust the variable. As for speeding up - you could look at threads/async. – QHarr Oct 29 '19 at 19:02
  • can we add the date of each line to the csv file? – Vali Valizada May 21 '20 at 04:46
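The last two comments suggest two small extensions. Below is a minimal sketch of my own (not from the accepted answer) that issues the detail requests concurrently via a thread pool and prepends each row's table date to the CSV; it assumes the `filtered` list of (date, caseId) pairs built by the answer's code above is in scope:

import requests, re, csv
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup as bs

headers = ['Ətraflı məlumat: ', 'Cavabdeh: ', 'İddiaçı: ', 'İşin mahiyyəti ']

def fetch_detail(pair):
    # Fetch one case detail pop-up and return [date, *columns].
    date, case_id = pair
    r = requests.get(f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={case_id}')
    soup = bs(r.content, 'lxml')
    line = ' '.join(re.sub(r'\s+', ' ', i.text.strip()) for i in soup.select('[colspan="4"]'))
    return [date] + re.split('|'.join(headers), line)[1:]

# `filtered` comes from the answer above; the pool runs the detail
# requests in parallel instead of one at a time:
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fetch_detail, filtered))

with open('results.csv', 'w', encoding='utf-8-sig', newline='') as csv_file:
    w = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Tarix'] + headers)  # extra date column first
    w.writerows(results)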