rule

Bubs@lemmy.zip · 1 day ago

As someone who doesn’t know any regex syntax, is there any simple explanation for what the expression on the board does?

NaibofTabr@infosec.pub · edit-2 14 hours ago

Basic concept: the purpose of regex is to search input text for matching patterns of characters.

Assuming this is correct (including the spaces):

/ ^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$ / g

Then:

/ The first forward slash is the delimiter which tells the code that this is the start of the regex (start interpreting the expression after this).

^ The caret marks the beginning of the text string being searched for a match or the beginning of a line of text, meaning that any matches found by the following regex must begin at the beginning of the input text, or at the beginning of a new line of text, not somewhere in the middle of it.

(\d{3}) This is the first group for matching actual text characters. The \d matches any single digit (0-9). The {3} attached to it means that there must be exactly 3 digits adjacent to each other, no more, no less.

_-_ (underscore indicating that there is a space in the original expression) This must match a [space][dash][space] as literal characters.

(\d{2}) As before, this matches two adjacent digits. This is the second matching group.

_-_ Same as above, [space][dash][space].

(\d{5}) Same as the two patterns before, this matches five adjacent digits. This is the third group.

_\2 The [space] here matters, indicating that there must be a space character between the previously matched group of five digits and the following match group \2, which says to match the same text as the most recently matched 2nd group. In this case the second group would be (\d{2}), so this must match the same two digits as were matched by (\d{2}) in the same order.

_\1 Similar to the above, this must match a [space] and then the same text that the first group most recently matched. In this case that would be the three digits matched by (\d{3}).

$ This is the same as the ^, only it matches the end of the input text or the end of a line of text. This means that there cannot be any more characters in the input text after the last characters that match the specified pattern.

/ g The / is again a delimiter, indicating the end of the regex. The g means “global”, which instructs the code to search the entire input text for all possible matches and return all of them at the end of the search (default regex behavior is to search until the first match, then stop and return that result).

So example matches would look like this:
111 - 22 - 33333 22 111
012 - 01 - 01234 01 012
987 - 98 - 98765 98 987

But this would not match:
11 - 222 - 33333 222 11 (incorrect numbers of digits in the first and second groups)
012 - 01 - 01234 10 012 (the second group of 2 digits does not match the first group of 2 digits)
987-98-9876598987 (spaces are missing)
111 22 33333 22 111 (dashes are missing)
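
For anyone who wants to try the examples above, here is a minimal Python sketch. It assumes Python's re module treats \d, capture groups and backreferences the same way as the flavor on the board for this pattern; the names PATTERN, should_match and should_not_match are made up for illustration.

```python
import re

# The pattern from the board, without the / delimiters and the g flag.
# The spaces inside the pattern are literal and must appear in the input.
PATTERN = re.compile(r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$")

should_match = [
    "111 - 22 - 33333 22 111",
    "012 - 01 - 01234 01 012",
    "987 - 98 - 98765 98 987",
]
should_not_match = [
    "11 - 222 - 33333 222 11",   # wrong digit counts in groups 1 and 2
    "012 - 01 - 01234 10 012",   # \2 must repeat "01", not "10"
    "987-98-9876598987",         # spaces are missing
    "111 22 33333 22 111",       # dashes are missing
]

for text in should_match + should_not_match:
    verdict = "match" if PATTERN.search(text) else "no match"
    print(f"{text!r}: {verdict}")
```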

Speculation:
The matched string looks like a serial number or part number or something like that, so probably the use case for this regex is to search through a file containing a long list of such numbers all separated on new lines of text, to find specific ones (for some reason). Maybe numbers that match this pattern are invalid, or maybe only numbers that match this pattern are valid and everything else that might be in the file needs to be removed.

Based on this I think the end is actually wrong and should be / gm (m for multi-line) to allow for searching (and returning) multiple lines of input text. Otherwise, this should be part of code which splits the lines of the input text file into individual strings and then feeds them through the regex one at a time - but if that’s the case then using the g (global) flag doesn’t really make sense.
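
A rough Python sketch of the difference the m flag makes, assuming re.MULTILINE is a fair stand-in for m and re.finditer for g. The three-line input is invented for the example.

```python
import re

pattern = r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$"

# Three lines of made-up input; only the first and third fit the pattern.
text = (
    "111 - 22 - 33333 22 111\n"
    "not a serial number\n"
    "987 - 98 - 98765 98 987"
)

# Without MULTILINE, ^ and $ only anchor to the whole string,
# so a three-line input cannot match at all.
without_m = [m.group(0) for m in re.finditer(pattern, text)]

# With MULTILINE (the m flag), ^ and $ also anchor at line breaks.
# finditer plays the role of the g flag: it yields every match.
with_m = [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]

print(without_m)
print(with_m)
```

Here the first list comes back empty and the second holds the two matching lines, which is the behavior the gm suggestion is after.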

With thanks to https://regex101.com/

NigelFrobisher@aussie.zone · 18 hours ago

Never knew the repeat group bit. Can’t really think of a practical use case for it though…

a_jeering_serpent@sopuli.xyz · 13 hours ago

(?:\d{3}-){2}(?:\d{4}) would match a ten-digit US-format phone number, though I’d recommend writing the three-digit group out twice literally instead of using a repeat, for maintainability reasons. Regex needs no assistance being terse and obtuse; humans need time to understand regex patterns, even ones they wrote not long ago. Make that part easier on your collaborators, and treat your past and future selves like remote asynchronous collaborators, always.
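
To illustrate the compact versus spelled-out forms, here is a small Python sketch. The sample numbers are made up, and fullmatch is used as an assumption that the whole string should be the phone number.

```python
import re

# Compact form from the comment: a non-capturing group repeated twice.
compact = re.compile(r"(?:\d{3}-){2}(?:\d{4})")

# Spelled-out form: the same pattern with the repeat written literally.
literal = re.compile(r"\d{3}-\d{3}-\d{4}")

samples = ["555-867-5309", "555-8675-309", "5558675309"]

for s in samples:
    # fullmatch requires the whole string to fit the pattern.
    print(s, bool(compact.fullmatch(s)), bool(literal.fullmatch(s)))
```

Both patterns accept and reject the same samples; the second is simply easier to read at a glance.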

ChaoticNeutralCzech@feddit.org · edit-2 13 hours ago

I use it to parse HTML. Find a tag and the matching closing one (inside first)

Edit: last time was a different-schema XML (not HTML) but whatever

SkybreakerEngineer@lemmy.world · 14 hours ago

You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer’s consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Have you tried using an XML parser instead?

Techno-rat@lemmy.blahaj.zone · 12 hours ago

Running a thousand regex-based grep processes on random HTML files to summon the scarlet king 100% speedrun

ChaoticNeutralCzech@feddit.org · edit-2 13 hours ago

By “inside first” I mean this Regex:

rb"<(sampletag\d+)>([^<]*?)</\1>"
#  ^^^^^^^^^^^^^^^^ capture tag opening
#                  ^^^^^^^^ capture content, make sure no children
#                          ^^^^^ detect tag closing

(part of a Python script; b because I’m parsing a mmapped binary file with NUL bytes that would ruin strings, r so the backslashes reach the regex engine as written)

Yes, it only works for bottom-level XML tags, I’d need to remove each level with a Regex replace and re-run it to detect parent nodes. Presumably, the middle part could be improved to also detect tags as long as they don’t contain tags of the same type inside. Fortunately, the specific schema and the limited data I needed (strings) allowed me to just go over bottom-level elements.

I’d use an XML library but it’s not a valid XML file, it’s part of a raw image of a damaged drive with XML files. Very cursed. It worked in a pinch but you shouldn’t ever parse XML/HTML with Regex if you can avoid it with libraries like BeautifulSoup. By the way, some have used Regex to parse HTML, see Chad Scraper meme.
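
A minimal sketch of the bottom-level approach described above, under stated assumptions: the blob is an invented stand-in for a chunk of the damaged image, and LEAF and leaves are hypothetical names. It is not the original script.

```python
import re

# Raw bytes pattern: group 1 is the tag name, group 2 the text content.
# [^<]*? refuses any "<", so elements with child tags are skipped.
# \1 requires the closing tag to repeat the name captured by group 1.
LEAF = re.compile(rb"<(sampletag\d+)>([^<]*?)</\1>")

# Made-up stand-in for a chunk of a raw disk image, NUL bytes included.
blob = (
    b"\x00\x00<sampletag1><sampletag2>hello</sampletag2>"
    b"<sampletag3>world</sampletag3></sampletag1>\x00garbage"
    b"<sampletag4>cut off</sampletag9>"
)

leaves = [(m.group(1), m.group(2)) for m in LEAF.finditer(blob)]
print(leaves)
```

Only sampletag2 and sampletag3 are reported: sampletag1 has children, and the last element's closing tag does not match its opening tag.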

glibg10b@lemmy.zip · 17 hours ago

I use it in Vim. Sometimes you want to rename a variable that’s present multiple times in the same line

exu@feditown.com · 16 hours ago

Why not just match for the variable and use /g?

prenatal_confusion@feddit.org · 12 hours ago

Why would you go the easy way when there is a complicated but infinitely cooler way?!

glibg10b@lemmy.zip · 11 hours ago

I assumed that’s what they meant by “group bit”. I guess maybe they were talking about capture groups

NigelFrobisher@aussie.zone · 16 hours ago

True, regex is nothing if not an everything tool!

ChaoticNeutralCzech@feddit.org · 16 hours ago

What you call delimiter is part of sed, Ruby or Perl syntax, right? In Python, Regex strings are usually delimited r" " (r for raw: don’t process special characters)

NaibofTabr@infosec.pub · 6 hours ago

In this context probably JavaScript, but yes, the delimiter is more an artifact of the programming language that the regex is used in than part of the regex itself.

owsei@programming.dev · 12 hours ago

Usually regex by itself is shown with the / delimiter. Using it in code is language-specific. In Rust you normally use r#"<regex>"#, in JavaScript you can just write /<regex>/, but that’s just the language’s syntax, not regex.
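
For comparison, a short Python sketch of the same idea: there are no / delimiters, and the flags are passed as arguments rather than appended as letters. The shortened pattern and the input text are invented for the example.

```python
import re

# JavaScript would write /^(\d{3})-(\d{2})$/gm; Python has no regex literal.
# The pattern is an ordinary (raw) string and flags are passed separately.
pattern = re.compile(r"^(\d{3})-(\d{2})$", re.MULTILINE)

text = "123-45\nnope\n678-90"

# findall stands in for the g flag: it returns every match, not just the first.
# With two capture groups, each result is a tuple of the group contents.
print(pattern.findall(text))
```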

expr@piefed.social · 22 hours ago

Traditionally, the global flag is used to mean global within a line, meaning all matches in a line.

NaibofTabr@infosec.pub · 21 hours ago

Right, but this expression has an explicit ^ and $, so if there’s anything else in the input line besides a single instance of the pattern, it won’t match. This makes the g kind of pointless: there can’t possibly be multiple instances of the pattern in the same line and still return a valid match.

Randelung@lemmy.world · 22 hours ago

TIL groups can be used to look for repeating strings.

Bubs@lemmy.zip · 1 day ago

Solid explanation

NaibofTabr@infosec.pub · 1 day ago

Updated because I clicked the reply button before it was actually done.

Bubs@lemmy.zip · 1 day ago

Oh lordy, that’s just a tad bit longer XD

espurr@sopuli.xyz · 1 day ago

Bubs@lemmy.zip · 1 day ago

Fair enough lol

truthfultemporarily@feddit.org · edit-2 1 day ago

Curly braces give the number of repetitions of whatever comes right before them. Round braces are capture groups - their content can be used in search and replace. So with the text abbba and the regex a(b{3})a, capture group 1 would be bbb.

A “\” followed by a character usually has a special meaning. “\d” is any digit. “\s” is any whitespace character.

I don’t think the thing on the board is pure regex but a search and replace command, and I think it’s wrong. It matches “272-43-17382” or similar digits. The 1 and 2 are usually capture groups in awk, but on the board there is a “$” behind them, which usually means end of string, which doesn’t make sense. I think it should be:

“s/^(\d{3})-(\d{2})-(\d{5})$/\2 \1/g”

“Globally replace a string formatted like 273-34-27472 with 34 273”
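
A rough Python equivalent of that substitution, as a sketch only: re.sub is assumed to be a fair stand-in for the s/// command, and the second input is made up to show the non-matching case.

```python
import re

# Python spelling of the sed-style s/^(\d{3})-(\d{2})-(\d{5})$/\2 \1/
# In the replacement string, \2 and \1 insert the captured groups.
pattern = re.compile(r"^(\d{3})-(\d{2})-(\d{5})$")

result = pattern.sub(r"\2 \1", "273-34-27472")
print(result)  # 34 273

# Input that does not fit the pattern is returned unchanged.
unchanged = pattern.sub(r"\2 \1", "273-34-2747")
print(unchanged)
```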

Dæmon S.@catodon.rocks · 1 day ago

There’s something else: the backslash followed by a positive natural number means a reference to the nth capture group, so:

"truthfultemporarily".match(/(t)(r)(u)\1hf\3(l)\1empo\2a\2i\4y/)

…as esoteric as it may sound, will match your Lemmy username, because the \1 will correctly match the first capture group which is t, \2 will match the second capture group which is r, and so on and so forth… Oh, and it works beyond .replace contexts, during .match as well.

Source: I just learned through this very meme and, from now on, I’ll likely use this feature whenever I have to use RegExp because I love coding cryptic one-liners just for the sake of it.

Screenshot of DevTools illustrating the working of the aforementioned snippet, with its output correctly matching the string.
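
The same trick should carry over to Python's re module, which also accepts \1 to \4 inside the pattern. A minimal sketch, with pattern and m as illustrative names:

```python
import re

# Same pattern as the JavaScript snippet; groups 1-4 capture t, r, u and l.
pattern = re.compile(r"(t)(r)(u)\1hf\3(l)\1empo\2a\2i\4y")

m = pattern.search("truthfultemporarily")
print(m.group(0))  # the full matched text
print(m.groups())  # contents of capture groups 1 to 4
```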

Bubs@lemmy.zip · 1 day ago

Heheh

I love bringing out all the nerds to talk on random niche topics :3

Zoop@beehaw.org · 14 hours ago

Right!? I love what you’ve caused here. Especially since I was wondering the same thing myself. I’m really enjoying these neat, informative replies! I <3 NERDS!

Arthur Besse@lemmy.ml · edit-2 1 day ago

from the /g at the end i agree it looks like it could be a malformed attempt at an awk/perl/etc substitution operation, and your rewrite of it as an s/// does work, but the parts between the ^ and $ would also be a valid regexp in Perl-compatible regexp and some other dialects if not for the spaces at the start and end. And, the /g is also a flag (“Match globally, i.e., find all occurrences.”) for the m/// matching operator in Perl.

The \1 and \2 are backreferences to the capture groups, which can be used not only in the replacement part but also in the pattern itself.

You can see this working by running this command:

echo '123 - 45 - 67890 45 123'|perl -ne 'print if m/^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$/g'

…which will echo the string because it matches the pattern. (if you edit the input string to change, for instance, the last digit, it will no longer match and will output nothing.)

There is no input that can match the pattern as it is in the comic with the space before the ^ and after the $ however.
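
A quick way to convince yourself of this, sketched in Python rather than Perl (assuming the two agree on ^, $ and literal spaces here, with no x/verbose flag in play). The candidate strings are invented.

```python
import re

# The board's pattern taken literally, with a space before ^ and after $.
as_drawn = re.compile(r" ^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$ ")
# The same pattern with the two outer spaces removed.
trimmed = re.compile(r"^(\d{3}) - (\d{2}) - (\d{5}) \2 \1$")

candidates = [
    "123 - 45 - 67890 45 123",
    " 123 - 45 - 67890 45 123 ",
    "\n 123 - 45 - 67890 45 123 \n",
]

for text in candidates:
    # A space must be consumed before ^ can be tested, but ^ only
    # succeeds where nothing (or, in multi-line mode, a newline) precedes it.
    print(repr(text), bool(as_drawn.search(text)), bool(trimmed.search(text)))
```

None of the candidates match the pattern as drawn, with or without padding; only the trimmed pattern matches, and only the unpadded string.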

Interestingly, backreferences are also supported by POSIX Basic Regular Expressions (BRE), but are not supported by POSIX Extended Regular Expressions (ERE). (Also, the former requires you to escape parentheses and curly braces for them to become metacharacters, while the latter requires you to escape them if they’re literals, as Perl etc. do. And neither of the POSIX flavors supports \d as a shortcut for [0-9].)

Voroxpete@sh.itjust.works · 1 day ago

from the /g at the end (and the spaces on the edges) i agree it looks like a malformed attempt at an awk/perl/etc substitution

The /g at the end is the global operator. It means, roughly, match across the entire input string.

This is completely valid regex, not a malformed attempt at anything. It’s just that the delimiters and operators are often omitted from regex in practical use so you may not be used to seeing them.

Arthur Besse@lemmy.ml · 1 day ago

yeah, i edited my comment while you were replying to note that /g is a valid flag for m/// as well. it is a valid perl matching operation precisely as-is but it can’t match anything due to the spaces it has before the ^ and after the $.

NaibofTabr@infosec.pub · 1 day ago

This definitely seems like a possible use case, but personally I think practical application of sed would be a bit advanced for a “Regex 101” course.

Bubs@lemmy.zip · 1 day ago

Yeah, I think I’m catching the general idea.