Python: how to find a date in text - Fixing shortcomings and finding solutions together with Examum.ru

Instead of using regex, it is generally better to parse the string as a datetime.datetime object:

In [140]: datetime.datetime.strptime("11/12/98","%m/%d/%y")
Out[140]: datetime.datetime(1998, 11, 12, 0, 0)

In [141]: datetime.datetime.strptime("11/12/98","%d/%m/%y")
Out[141]: datetime.datetime(1998, 12, 11, 0, 0)

You could then access the day, month, and year (and hour, minutes, and seconds) as attributes of the datetime.datetime object:

In [142]: date = datetime.datetime.strptime("11/12/98","%m/%d/%y")

In [143]: date.year
Out[143]: 1998

In [144]: date.month
Out[144]: 11

In [145]: date.day
Out[145]: 12

To test if a sequence of digits separated by forward-slashes represents a valid date, you could use a try..except block. Invalid dates will raise a ValueError:

In [159]: try:
   .....:     datetime.datetime.strptime("99/99/99","%m/%d/%y")
   .....: except ValueError as err:
   .....:     print(err)
   .....:     
   .....:     
time data '99/99/99' does not match format '%m/%d/%y'
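The try..except check can be wrapped in a small helper. The sketch below is one possible way to do it; the function name is_valid_date and its default format are assumptions made for illustration, not part of the original answer.

```python
import datetime


def is_valid_date(text, fmt="%m/%d/%y"):
    """Return True if text parses as a date in the given strptime format."""
    try:
        datetime.datetime.strptime(text, fmt)
    except ValueError:
        # strptime raises ValueError for malformed or out-of-range dates
        return False
    return True


print(is_valid_date("11/12/98"))  # True
print(is_valid_date("99/99/99"))  # False
```

Returning a boolean keeps the calling code free of exception handling when only a yes/no answer is needed.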

If you need to search a longer string for a date,
you could use regex to search for digits separated by forward-slashes:

In [146]: import re
In [152]: match = re.search(r'(\d+/\d+/\d+)','The date is 11/12/98')

In [153]: match.group(1)
Out[153]: '11/12/98'

Of course, invalid dates will also match:

In [154]: match = re.search(r'(\d+/\d+/\d+)','The date is 99/99/99')

In [155]: match.group(1)
Out[155]: '99/99/99'

To check that match.group(1) returns a valid date string, you could then parse it using datetime.datetime.strptime as shown above.
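Putting the two steps together, a minimal sketch might search for every slash-separated candidate and keep only those that strptime accepts. The helper name find_valid_dates and the default format "%m/%d/%y" are assumptions for illustration.

```python
import datetime
import re


def find_valid_dates(text, fmt="%m/%d/%y"):
    """Return datetime objects for every slash-separated candidate in text
    that strptime accepts in the given format."""
    found = []
    for match in re.finditer(r"\d+/\d+/\d+", text):
        try:
            found.append(datetime.datetime.strptime(match.group(), fmt))
        except ValueError:
            # Looks like a date but is not a valid one, e.g. 99/99/99
            continue
    return found


print(find_valid_dates("Valid: 11/12/98, invalid: 99/99/99"))
# [datetime.datetime(1998, 11, 12, 0, 0)]
```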

Source

Given a string, the task is to write a Python program to extract a date from it.

Input : test_str = "gfg at 2021-01-04"
Output : 2021-01-04
Explanation : Date format string found.

Input : test_str = "2021-01-04 for gfg"
Output : 2021-01-04
Explanation : Date format string found.

Method #1 : Using re.search() + strptime() methods

In this method, re.search() finds the substring that matches the date pattern, and strptime() parses that substring according to the given format.

Python3

import re

from datetime import datetime

test_str = "gfg at 2021-01-04"

print("The original string is : " + str(test_str))

match_str = re.search(r'\d{4}-\d{2}-\d{2}', test_str)

res = datetime.strptime(match_str.group(), '%Y-%m-%d').date()

print("Computed date : " + str(res))

Output

The original string is : gfg at 2021-01-04
Computed date : 2021-01-04

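Method #2: Using the python-dateutil module

This is another way to solve this problem. python-dateutil is a third-party library (not part of the standard library; it can be installed with pip install python-dateutil). Its parser.parse() method, called with fuzzy=True, can detect a date and time in a string.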

Python3

from dateutil import parser

test_str = "gfg at 2021-01-04"

print("The original string is : " + str(test_str))

res = parser.parse(test_str, fuzzy=True)

print("Computed date : " + str(res)[:10])

Output:

The original string is : gfg at 2021-01-04
Computed date : 2021-01-04

Method #3: Using string manipulation

Approach

We can use string manipulation to search for the date format string in the input string.

Algorithm

1. Split the input string into words.
2. Iterate through the words and check if each word matches the date format string.
3. If a match is found, return the date format string.

Python3

test_str = "gfg at 2021-01-04"

words = test_str.split()

for word in words:

    if len(word) == 10 and word[4] == "-" and word[7] == "-":

        print(word)

        break

Time complexity: O(n)
Auxiliary Space: O(1)
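The check in Method #3 only looks at the length and the two hyphen positions, so a word such as "abcd-ef-gh" would also pass, and a date followed by a period would be missed. A slightly stricter sketch, still using only string methods, could look like this; the function name find_iso_date_token is hypothetical.

```python
import string


def find_iso_date_token(text):
    """Return the first whitespace-separated token shaped like YYYY-MM-DD,
    or None if there is none."""
    for word in text.split():
        # Drop surrounding punctuation such as a trailing period or comma
        token = word.strip(string.punctuation)
        parts = token.split("-")
        if (len(parts) == 3
                and [len(p) for p in parts] == [4, 2, 2]
                and all(p.isdigit() for p in parts)):
            return token
    return None


print(find_iso_date_token("gfg at 2021-01-04."))  # 2021-01-04
print(find_iso_date_token("abcd-ef-gh here"))     # None
```

This still does not confirm that the digits form a real calendar date; it only tightens the shape check.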

Method #4: Using split and join

APPROACH:

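This approach first splits the string into a list of words, then takes the last word, which is assumed to be the date, and finally splits the date on '-' and joins it again with '-'. The split-and-join step returns the date string unchanged, so the method only works when the date is the last word of the input.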

ALGORITHM:

1. Split the input string on whitespace, which gives a list of three elements: "gfg", "at", and the date string "2021-01-04".
2. Get the last element of the list (i.e., the date string) using indexing.
3. Split the date string on the "-" character, which gives a list of three elements: the year, month, and day.
4. Join the elements of the list with the "-" character using the join() method to get the final date string.

Python3

string = 'gfg at 2021-01-04'

date = "-".join(string.split()[-1].split("-"))

print("Computed date:", date)

Output

Computed date: 2021-01-04

Time complexity: O(n), where n is the length of the string.
Space complexity: O(n).

Method #5: Using a regular expression

APPROACH:

The program extracts the date from a given string using a regular expression.

ALGORITHM:

1. Import the re module.
2. Define the input string.
3. Use the re.findall() method with a regular expression pattern to extract the date from the string.
4. Print the extracted date.

Python3

import re

string = 'gfg at 2021-01-04'

date = re.findall(r'\d{4}-\d{2}-\d{2}', string)[0]

print("Computed date:", date)

Output

Computed date: 2021-01-04

Time Complexity: The time complexity of the program depends on the size of the input string and the efficiency of the regular expression pattern. In the worst case, the time complexity is O(n), where n is the length of the input string.

Space Complexity: The space complexity of the program is O(1), as it only stores the extracted date in a variable.
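Indexing the result of re.findall() with [0] raises IndexError when the string contains no date, and the bare pattern can also match inside a longer run of digits. The following sketch shows one way to handle both cases; the name first_iso_date and the use of lookarounds are illustrative choices, not part of the original method.

```python
import re

# Lookarounds keep the pattern from matching inside a longer run of digits.
ISO_DATE = re.compile(r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)")


def first_iso_date(text):
    """Return the first YYYY-MM-DD substring in text, or None if absent."""
    match = ISO_DATE.search(text)
    return match.group() if match else None


print(first_iso_date("gfg at 2021-01-04"))     # 2021-01-04
print(first_iso_date("no date in this text"))  # None
print(first_iso_date("id 120210-01-0456"))     # None
```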

Last Updated: 23 Apr, 2023

Source

You can run a date parser on all subtexts of your text and pick the first date. Of course, such a solution would either catch things that are not dates or would not catch things that are, or most likely both.

Let me provide an example that uses dateutil.parser to catch anything that looks like a date:

import dateutil.parser
from itertools import chain
import re

# Add more strings that confuse the parser in the list
UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                          dateutil.parser.parserinfo.PERTAIN,
                          ['a']))

def _get_date(tokens):
    # Try the longest prefix first, then progressively shorter ones
    for end in range(len(tokens), 0, -1):
        region = tokens[:end]
        if all(token.isspace() or token in UNINTERESTING
               for token in region):
            continue
        text = ''.join(region)
        try:
            date = dateutil.parser.parse(text)
            return end, date
        except (ValueError, OverflowError):
            # Not parseable as a date; try a shorter prefix
            pass

def find_dates(text, max_tokens=50, allow_overlapping=False):
    # Split into word and separator tokens, dropping empty strings
    tokens = list(filter(None, re.split(r'(\S+|\W+)', text)))
    skip_dates_ending_before = 0
    for start in range(len(tokens)):
        region = tokens[start:start + max_tokens]
        result = _get_date(region)
        if result is not None:
            end, date = result
            if allow_overlapping or end > skip_dates_ending_before:
                skip_dates_ending_before = end
                yield date


test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a 
child during the Daleks' abduction and invasion of Earth in 2009. 
On 1st July 2058, Bowie Base One became the first Human colony on Mars. It 
was commanded by Captain Adelaide Brooke, and initially seemed to prove that 
it was possible for Humans to live long term on Mars."""

print("With no overlapping:")
for date in find_dates(test, allow_overlapping=False):
    print(date)


print("With overlapping:")
for date in find_dates(test, allow_overlapping=True):
    print(date)

The result from the code is, quite unsurprisingly, rubbish whether you allow overlapping or not. If overlapping is allowed, you get a lot of dates that are nowhere to be seen, and if it is not allowed, you miss the important date in the text.

With no overlapping:
1999-05-12 00:00:00
2009-07-01 20:58:00
With overlapping:
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-03 00:00:00
1999-05-03 00:00:00
1999-07-03 00:00:00
1999-07-03 00:00:00
2009-07-01 20:58:00
2009-07-01 20:58:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00

Essentially, if overlapping is allowed:

"12 May 1999" is parsed to 1999-05-12 00:00:00
"May 1999" is parsed to 1999-05-03 00:00:00 (because today is the 3rd day of the month)

If, however, overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00 and no attempt is made to parse the date after the period.
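For comparison, a narrower approach gives up on catching everything and targets a single written format. The sketch below is an illustration, not part of the original answer: it assumes English month names and the "D Month YYYY" layout used in the sample text, and the names MONTHS, DATE_RE and find_day_month_year are hypothetical.

```python
import datetime
import re

# English month names mapped to month numbers (avoids locale-dependent %B).
MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Day with an optional ordinal suffix, a month name, and a four-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b")


def find_day_month_year(text):
    """Return datetime.date objects for 'D Month YYYY' phrases in text."""
    dates = []
    for day, month, year in DATE_RE.findall(text):
        try:
            dates.append(datetime.date(int(year), MONTHS[month], int(day)))
        except ValueError:
            continue  # impossible date, e.g. 31 February
    return dates


sample = ("Adelaide was born in Finchley, North London on 12 May 1999. "
          "On 1st July 2058, Bowie Base One became the first Human colony on Mars.")
print(find_day_month_year(sample))
# [datetime.date(1999, 5, 12), datetime.date(2058, 7, 1)]
```

The trade-off is the opposite of the brute-force parser: nothing spurious is reported, but a bare year such as "in 2009" or any other layout is ignored.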

Source

We must first cover some regular expression fundamentals, since we will use them. There are various ways to declare patterns in regular expressions, which can make them look complex, but they are fairly simple. A regular expression is a pattern that matches strings adhering to that pattern.

Extracting dates from a given text is a common task when learning to code. Whether you are automating a Python script that extracts specific numeric values from a CSV file, a data scientist who needs to separate complex dates from given patterns, or a Python enthusiast who wants to learn more about strings and numeric data types, you should find this article helpful.

From here on, familiarity with regular expression fundamentals is assumed.

Example 1

Only the basic notations will be used to create a regex pattern for dates. We aim to match dates made up of a day, a month, and a year, with the day and month having two digits each and the year having four digits. Now let's build the pattern piece by piece.

\d matches a digit, as you would have guessed. To match exactly two digits, we supply the number 2 in braces: \d{2} matches exactly two consecutive digits. The patterns for the day, the month, and the year are \d{2}, \d{2}, and \d{4}, respectively. These three must be joined together with a '/' or '-', which the character class [/-] matches.

The final regex pattern is \d{2}[/-]\d{2}[/-]\d{4}.

Now that the difficult portion is finished, the remaining task is easy.

Input 1

import re


f = open("doc.txt", "r")


content = f.read()


pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"


dates = re.findall(pattern, content)

It should be noted that our regex pattern will also extract dates that aren't legitimate, such as 40/32/2019. To filter those out, the final code must be modified to appear as follows:

Input 2

import re


f = open("doc.txt", "r")


content = f.read()


pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"


dates = re.findall(pattern, content)

for date in dates:
   if "-" in date:
      day, month, year = map(int, date.split("-"))
   else:
      day, month, year = map(int, date.split("/"))
   if 1 <= day <= 31 and 1 <= month <= 12:
      print(date)
f.close()

Input Text

For example, suppose the content of the text file is as follows:

My name is XXX. I was born on 07/12/2001 in YYY city.
I graduated from ZZZ college on 07-28-2019.

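Output

07/12/2001

The second date, 07-28-2019, is not printed because the code reads it as day 07 and month 28, which fails the month check.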
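The range check 1 <= day <= 31 and 1 <= month <= 12 still accepts impossible dates such as 31/02/2019. One way to tighten it, sketched here under the assumption that dates are written day-first, is to let datetime.date do the validation. The names PATTERN and valid_dates are hypothetical.

```python
import datetime
import re

# Assumes day-first order (dd/mm/yyyy or dd-mm-yyyy), as in Example 1.
PATTERN = re.compile(r"\d{2}[/-]\d{2}[/-]\d{4}")


def valid_dates(content):
    """Return the matched strings that are real calendar dates."""
    result = []
    for candidate in PATTERN.findall(content):
        day, month, year = map(int, re.split(r"[/-]", candidate))
        try:
            datetime.date(year, month, day)
        except ValueError:
            continue  # impossible date such as 31/02/2019 or 40/32/2019
        result.append(candidate)
    return result


print(valid_dates("ok 07/12/2001, bad 31/02/2019, bad 40/32/2019"))
# ['07/12/2001']
```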

Example 2

import datetime
import re
s = "Jason's birthday is on 2002-07-28"
match = re.search(r'\d{4}-\d{2}-\d{2}', s)
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
print(date)

Output

2002-07-28

Conclusion

In the examples above, we covered several Python techniques for extracting a date from a given text. The re module is our personal favorite, though. You may counter that alternative approaches, such as the split() functions, run faster and give simpler, more readable code. However, split()-based approaches break down when there is no space between the value and the surrounding characters, as in "25.50k". Furthermore, execution speed is rarely a meaningful metric when it comes to log parsing. That is why, out of all the options on this list, regex is our personal preference.

Source

Last updated on
Feb 10, 2022

In this post you will see how to extract dates with Python regular expressions:

Regex Matching Date 10/10/2015
Regex Matching Date 10-10-15
Regex Matching Date 1 NOV 2010
Regular expression Matching Date 10 March 2015

The list of the matched formats:

10/10/2015
10-10-15
1 NOV 2010
10 March 2015

Regex Matching Date 10/10/2015

Regex date format: dd/mm/yyyy

[\d]{1,2} - match one or two digits
[\d]{4} - match exactly 4 digits
the separator is /

import re

# Matching dates in dd/mm/yyyy format

str = """COBOL is a compiled English-like computer programming language designed for business use. 122. On 10/10/2015 is a big date unlike 1/11/2010 """

all = re.findall(r"[\d]{1,2}/[\d]{1,2}/[\d]{4}", str)

for s in all:
    print(s)

result

10/10/2015
1/11/2010

Regex Matching Date 10-10-15

Regex date format: dd-mm-yy

[\d]{1,2} - match one or two digits
the separator is -

import re

# Matching dates in dd-mm-yy format

str = """COBOL is a compiled English-like computer programming language designed for business use. 122. On 10-10-15 is a big date unlike 1-11-10 """

all = re.findall(r"[\d]{1,2}-[\d]{1,2}-[\d]{2}", str)

for s in all:
    print(s)

result

10-10-15
1-11-10

Regex Matching Date 1 NOV 2010

Regular expression date format: dd MMM yyyy

[ADFJMNOS]\w* - match a capital letter from ADFJMNOS followed by any further word characters

import re

# Matching dates in dd MMM yyyy format

str = """COBOL is a compiled English-like computer programming language designed for business use. 122. On 10 OCT 2015 is a big date unlike 1 NOV 2010 """

all = re.findall(r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}", str)

for s in all:
    print(s)

result

10 OCT 2015
1 NOV 2010

A more precise extraction for this example would be:

r"([\d]{1,2}\s(JAN|NOV|OCT|DEC)\s[\d]{4})"

result (if you change the OCT date):

('1 NOV 2010', 'NOV')

To get only the full date strings, make the month group non-capturing with ?: as in the code below:

all = re.findall(r"([\d]{1,2}\s(?:JAN|NOV|OCT|DEC)\s[\d]{4})", str)

result:
1 NOV 2010

Regular expression Matching Date 10 March 2015

Extracting a date that has a full month name with a Python regex:

[ADFJMNOS]\w* - match a capital letter from ADFJMNOS followed by any further word characters

import re

# Matching dates with a full month name

str = """COBOL is a compiled English-like computer programming language designed for business use. 122. On 10 October 2015 is a big date unlike 1 November 2010 """

all = re.findall(r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}", str)

for s in all:
    print(s)

result

10 October 2015
1 November 2010

You can list the words that you want by using this regular expression:

all = re.findall(r"([\d]{1,2}\s(January|February|March|April|May|June|July|August|September|October|November|December)\s[\d]{4})", str)

result:

('10 October 2015', 'October')
('1 November 2010', 'November')

If you want the result to contain only the dates, add ?: to the middle group to make it non-capturing:

all = re.findall(r"(\d{1,2} (?:January|February|March|April|May|June|July|August|September|October|November|December) \d{4})", str)

result:

10 October 2015
1 November 2010
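The four formats covered in this post can also be matched in a single pass. The sketch below combines them into one alternation; the names MONTH and DATE_RE, the abbreviated-or-full month pattern, and the use of re.IGNORECASE are illustrative choices rather than part of the original examples.

```python
import re

# Month abbreviation with an optional full-name tail, e.g. Nov or November.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
         r"Dec(?:ember)?)")

# One alternation covering 10/10/2015, 10-10-15, 1 NOV 2010 and 10 March 2015.
# Every group is non-capturing, so findall() returns whole matches.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{4}"
    r"|\d{1,2}-\d{1,2}-\d{2}"
    r"|\d{1,2}\s" + MONTH + r"\s\d{4})\b",
    re.IGNORECASE)

text = "On 10/10/2015, 10-10-15, 1 NOV 2010 and 10 March 2015 things happened."
print(DATE_RE.findall(text))
# ['10/10/2015', '10-10-15', '1 NOV 2010', '10 March 2015']
```

As with the separate patterns, this only checks the shape of the text; values such as 99/99/2015 would still match.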

Source