Let’s get rid of special characters like ï»¿, ÿþ, þÿ aka Byte-Order-Mark characters
When reading Azure Storage Blobs, you regularly get the BOM bytes back with the content. As a result, you cannot, for example, format the response or convert JSON to an object.
A colleague of mine wrote a function for this in PowerShell.
In my first update of June I talked about a crappy solution to strip Byte Order Marks from a string.
This was only a partial fix: later in the week I found out that it worked in PowerShell 7, but not in PowerShell 5.1.
The funny thing is that when I reloaded the function via Ctrl + A and F8, it did work in PS5.
I asked why this happens on Stack Overflow, but unfortunately didn’t get an answer. Do you know why? I’d love to hear it in the Stack Overflow post.
EDIT: I think that I found the culprit: $OutputEncoding
But hey, as usual, I do not want to waste too much of your time telling you my personal story, so below I’ll explain to you how to remove the Byte Order Marks (BOM) from strings in PowerShell.
This includes the UTF-8 and UTF-16 Byte Order Marks, or written out as special characters:
- ï»¿
- ÿþ
- þÿ
And a special thank you to my colleague and mentor Maurice Lok-Hin for helping me with this issue and giving me all the credit!
If you want to know more about the Byte Order Mark (BOM), the Wikipedia page on it has all the info you need.
Removing the Byte Order Mark from a ByteArray, and what is a BOM?
In this section, I’m just going to explain how to use the function.
In the section below, I’ll take a closer look at exactly what it does.
If you want to know more about BOM itself, it’s best to check out the Wikipedia article.
Let’s get the function Remove-ByteOrderMark from GitHub
We wrote a function in PowerShell to remove the BOM from a ByteArray: Remove-ByteOrderMark. You can find this function in my GitHub repository.
My example uses a simple .txt file, but the input can just as well be a ByteArray from anything else.
I created a file in Notepad that says ‘Hello World!‘ and saved it with the encoding UTF-8 with BOM.

You can of course also use this on a ByteArray that you receive via Invoke-WebRequest from a OneDrive, a SharePoint site, or an Azure Storage Blob.
Do you get the special characters ï»¿, ÿþ, þÿ back via Invoke-RestMethod?
Then you should switch to Invoke-WebRequest and work with the ByteArray instead.
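As a small sketch of what that could look like (the URL is just a placeholder, and RawContentStream is one way to get at the raw bytes; your own download code may differ):
# Placeholder URL for a OneDrive / SharePoint / Azure Storage Blob file
$BlobUrl = 'https://example.blob.core.windows.net/container/StringWithBom.txt'

# Invoke-WebRequest keeps the raw response, so we can read the bytes ourselves
$Response = Invoke-WebRequest -Uri $BlobUrl -UseBasicParsing

# RawContentStream is a MemoryStream; ToArray() gives us the ByteArray including any BOM
$Bytes = $Response.RawContentStream.ToArray()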
If you already have the ByteArray, you can skip the first lines below, because they only load the .txt file into memory as a ByteArray.
$FullName = "C:\Users\BasWijdenes\OneDrive\Desktop\StringWithBom.txt"
$Bytes = [System.IO.File]::ReadAllBytes($FullName)
$Bytes
239
187
191
72
...
Let’s get started with Remove-ByteOrderMark
Okay, make sure you open the function in PowerShell and run it once, so it is loaded and we can call it.
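For example, assuming you saved the function as Remove-ByteOrderMark.ps1 (the file name is just my assumption), you can dot-source it so it is available in your session:
# Dot-source the script so the Remove-ByteOrderMark function is loaded in the current session
. .\Remove-ByteOrderMark.ps1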
The function has only two parameters: InputBytes and OutputAsString.
InputBytes takes the ByteArray we retrieved above, and by running the command below we get it back without the BOM (compare the output above with the output below).
Remove-ByteOrderMark -InputBytes $bytes
72
101
108
108
...
OutputAsString returns the bytes as a string:
Remove-ByteOrderMark -InputBytes $bytes -OutputAsString
Hello World!
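And to come back to the Azure Storage Blob example from the intro: once the BOM is gone, converting the content to an object works again. A small sketch, assuming the blob content is JSON:
# Strip the BOM first, then the string converts to an object without issues
$Json = Remove-ByteOrderMark -InputBytes $Bytes -OutputAsString
$Object = $Json | ConvertFrom-Json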
Let me explain a bit more about how this works
I am not going to explain to you what Byte-Order-Mark is and means. You can read this on Wikipedia.
I am going to explain below how the function works.
There is a table on the Wikipedia page that we are going to use.
I copied the encodings that the script converts into the table below.
Next to each encoding type you will see its hexadecimal representation, its decimal representation, and its bytes as CP1252 characters.
Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
---|---|---|---|
UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
In the function I create a hashtable with the hexadecimal values as keys and the encoding names as values, where the keys are cast to the byte type [byte].
# Hashtable of possible first BOM bytes: key = first byte of the BOM, value = the encoding it indicates
$BomStart = @{
    [byte]0xEF = 'UTF8'    # UTF-8 BOM is EF BB BF
    [byte]0xFE = 'UTF16BE' # UTF-16 Big Endian BOM is FE FF
    [byte]0xFF = 'UTF16LE' # UTF-16 Little Endian BOM is FF FE
}
By doing this I can quickly look up in the hashtable whether the first byte in the ByteArray matches one of the BOM start bytes from the table.
$PossibleBOMType = $BomStart[$InputBytes[0]]
Because the keys are typed as [byte], PowerShell knows that the hexadecimal 0xEF and the decimal 239 are the same byte value.
So I don’t have to do any converting myself, because PowerShell does this for me.
And as you can see in the ByteArray from my .txt file, the first value is 239 (0xEF), so we have found a possible BOM type.
$Bytes
239
187
191
72
...
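You can check this for yourself: the hexadecimal and the decimal notation are the same byte value, so the hashtable lookup (if you create the $BomStart hashtable from above in your own session) immediately finds a match on the first byte:
# 0xEF and 239 are the same value, so the comparison and the lookup both succeed
[byte]0xEF -eq 239
True

$BomStart[$Bytes[0]]
UTF8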
And to make sure that it really is UTF-8, the function also tests the second and third bytes against the hexadecimals in the table. If these also match, it strips the first 3 bytes.
# Second and third byte of a UTF-8 BOM are 0xBB and 0xBF
if (($InputBytes[1] -eq 0xBB) -and ($InputBytes[2] -eq 0xBF)) {
    # Strip the first 3 bytes (the BOM) and decode the rest as UTF-8
    $OutputBytes = $InputBytes[3..$InputBytes.Length]
    $OutputString = [System.Text.Encoding]::UTF8.GetString($OutputBytes)
}
It then outputs the result again, either as bytes or as a string.
The same check happens for the UTF-16 BE/LE variants, and if no BOM bytes are found at all, the function defaults to UTF-8 without a BOM.
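As a rough sketch of what those branches look like (simplified by me, so the actual function may differ in the details), the UTF-16 LE case and the no-BOM fallback come down to this:
# UTF-16 LE starts with 0xFF 0xFE -> strip the first 2 bytes and decode as UTF-16 LE
# (the BE case works the same with 0xFE 0xFF and [System.Text.Encoding]::BigEndianUnicode)
if ($PossibleBOMType -eq 'UTF16LE' -and $InputBytes[1] -eq 0xFE) {
    $OutputBytes = $InputBytes[2..($InputBytes.Length - 1)]
    $OutputString = [System.Text.Encoding]::Unicode.GetString($OutputBytes)
}
# No BOM found -> keep all bytes and treat them as UTF-8 without BOM
else {
    $OutputBytes = $InputBytes
    $OutputString = [System.Text.Encoding]::UTF8.GetString($OutputBytes)
}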
I think I found the reason why this is happening!
There is a built-in variable called $OutputEncoding.
In PS5.1 it shows:
$OutputEncoding
IsSingleByte : True
BodyName : us-ascii
EncodingName : US-ASCII
HeaderName : us-ascii
WebName : us-ascii
WindowsCodePage : 1252
IsBrowserDisplay : False
IsBrowserSave : False
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 20127
And in PS7.1 it shows:
$OutputEncoding
Preamble :
BodyName : utf-8
EncodingName : Unicode (UTF-8)
HeaderName : utf-8
WebName : utf-8
WindowsCodePage : 1200
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
IsSingleByte : False
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 65001
Unfortunately, I don’t have time to research this properly yet, but this looks very promising.
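If you want to experiment with this yourself, a minimal sketch (assuming the encoding difference really is the cause, which I haven’t verified yet) is to give PowerShell 5.1 the same UTF-8 output encoding that PowerShell 7 uses by default:
# Assumption: align PS5.1 with the PS7 default by switching $OutputEncoding to UTF-8
$OutputEncoding = [System.Text.Encoding]::UTF8
$OutputEncoding.EncodingName
Unicode (UTF-8)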