1

I have a log (text) file with this syntax

1/21/18, 22:48 - ~text~
1/21/18, 22:48 - ~text~
1/23/18, 22:48 - ~text~
~text~
~text~
1/24/18, 22:48 - ~text~

And I would like to get an array of all dates, for example ["1/21/18","1/21/18","1/23/18","1/24/18"]

My final goal is to build a histogram of frequencies per date, so I can see how many events each day had (just to track the evolution of events over time). So if you have a tip to make that easier, it's welcome!

I've tried regex, following question 4709652, but it's not working as expected. Another issue is that the text file is big (hundreds of megabytes), which slows everything down.

What's the optimal way to do this?

Thanks!

Akiru
    pandas.read_csv() - [how-to-read-a-6-gb-csv-file-with-pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) - you are nowhere near a "big" file - still: pandas makes reading CSV files in chunks a possibility - it will possibly convert your date column to a date type, or you can do that yourself if it imports as a string. It can even do histograms: [pandas histogram on column](https://stackoverflow.com/questions/42496508/histogram-on-pandas-column) - I just know about pandas, I haven't really worked with it, but it seems like something you could leverage to get your results – Patrick Artner Jul 08 '18 at 19:24

4 Answers

2

As suggested by @Patrick, pandas would be an easier and more efficient way to do it.

import pandas as pd

p = pd.read_csv(<name of the file>, names=["date", "random"])
p['date'] = pd.to_datetime(p['date'], errors='coerce')  # converts the first column to datetime, putting NaT in place of text
p = p.dropna()  # drop rows containing NaT
print(p['date'])

Output:

0   2018-01-21
1   2018-01-21
2   2018-01-23
5   2018-01-24

You can even pass the date column to a histogram function if it ignores NaT without dropping them.
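To carry this through to the per-day counts the question asks for, `value_counts()` on the parsed column does the job. A minimal, self-contained sketch; the inline sample string is an assumption based on the format shown in the question, and `format="%m/%d/%y"` assumes month-first dates:

```python
import io

import pandas as pd

# Sample data mirroring the question's log format (an assumption).
log = (
    "1/21/18, 22:48 - ~text~\n"
    "1/21/18, 22:48 - ~text~\n"
    "1/23/18, 22:48 - ~text~\n"
    "~text~\n"
    "~text~\n"
    "1/24/18, 22:48 - ~text~\n"
)

p = pd.read_csv(io.StringIO(log), names=["date", "rest"])
# Non-date rows become NaT; value_counts() ignores them by default.
p["date"] = pd.to_datetime(p["date"], format="%m/%d/%y", errors="coerce")
counts = p["date"].value_counts().sort_index()  # events per day, in date order
print(counts)
```

From there, `counts.plot(kind="bar")` would draw the histogram directly.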

gaganso
  • This idea is really cool, the problem is that the ~text~ may have commas... Is there any way to tell pd.read_csv to use only the first comma that it reads? – Akiru Jul 08 '18 at 20:50
  • I've also received this error. Your example works well, but I need to tune it a bit to avoid these parse errors: File "pandas\_libs\parsers.pyx", line 1524, in pandas._libs.parsers._string_box_utf8 (pandas\_libs\parsers.c:23041) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 67: invalid continuation byte – Akiru Jul 08 '18 at 20:55
  • 2
    I knew it would be a breeze using pandas ;) +1 – Patrick Artner Jul 09 '18 at 05:43
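Regarding the comma question in the comments above, one workaround is to skip `read_csv` for the splitting step and cut each line on the first comma only, then keep only date-shaped heads. A plain-Python sketch (`io.StringIO` stands in for the real file; when opening the actual file, passing `errors="replace"` to `open()` would also sidestep the `UnicodeDecodeError`):

```python
import io
import re

def extract_dates(fp):
    """Collect the date token from each line, splitting on the first comma only."""
    dates = []
    for line in fp:
        head = line.split(",", 1)[0]            # everything before the first comma
        if re.fullmatch(r"\d+/\d+/\d+", head):  # keep only date-shaped tokens
            dates.append(head)
    return dates

# Sample lines (an assumption), including a ~text~ that contains commas.
sample = "1/21/18, 22:48 - ~text, with commas~\n~text~\n1/24/18, 22:48 - ~text~\n"
print(extract_dates(io.StringIO(sample)))  # ['1/21/18', '1/24/18']
```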
1

You can read the file line by line and apply a regex to each line, for example:

import re

dates_list = []  # avoid shadowing the built-in `list`
with open('logs.txt', 'r') as fp:
    for line in fp:
        dates_list.extend(re.findall(r'\d+/\d+/\d+', line))  # map() is lazy in Python 3, so extend explicitly

print(dates_list)

Output:

['1/21/18', '1/21/18', '1/23/18', '1/24/18']
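To get from this list to the per-date frequencies the question is after, `collections.Counter` is a simple option; a sketch using the output above:

```python
from collections import Counter

# The dates list produced by the loop above.
dates_list = ['1/21/18', '1/21/18', '1/23/18', '1/24/18']

freq = Counter(dates_list)  # freq['1/21/18'] == 2
for date, count in sorted(freq.items()):
    print(date, count)
```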
mitch
1

Assuming that the whole text file has the same format, this should work.

def process():
    dates = []
    with open('test.txt') as file:  # closes the file automatically
        for line in file:
            if line[0] != '~':  # skip continuation lines with no date
                dates.append(line.split(',')[0])
    return dates

print(process())

This is the output:

['1/21/18', '1/21/18', '1/23/18', '1/24/18']
Josewails
1

You can use re.findall with the MULTILINE flag to do this:

import re
text = '1/21/18, 22:48 - ~text~\n1/21/18, 22:48 - ~text~\n1/23/18, 22:48 - ~text~\n~text~\n~text~\n1/24/18, 22:48 - ~text~'
re.findall(r'^([\d/]+),', text, re.MULTILINE)
# ['1/21/18', '1/21/18', '1/23/18', '1/24/18']
Sunitha